Ollama Testing Framework Documentation
Version: 1.0 | Last Updated: 2025-02-23
Overview
The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:
- Code correctness (test cases)
- Performance metrics (inference time, tokens/sec)
- Consistency across multiple runs
Goals
- Validate model responses for correctness and functionality
- Measure and compare performance across different models
- Provide detailed insights into model behavior and reliability
- Enable easy comparison through a leaderboard system
Core Components
1. Test Suite
The test suite consists of multiple coding challenges, each with:
- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs
Current Test Cases:
a) Fibonacci Sequence
- Tests edge cases (negative, zero)
- Tests standard cases (n=1 to n=10)
- Validates performance for larger inputs
b) Binary Search
- Tests empty list case
- Tests element not found
- Tests finding elements at different positions
c) Palindrome Check
- Tests empty string
- Tests single character
- Tests various palindrome and non-palindrome cases
d) Anagram Check
- Tests empty strings
- Tests case sensitivity
- Tests strings with spaces and special characters
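To make this structure concrete, a single suite entry might be defined roughly as below. This is a hedged sketch: the field names (`name`, `prompt`, `function_name`, `cases`) are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical shape of one test suite entry (field names are assumptions).
FIBONACCI_TEST = {
    "name": "Fibonacci Sequence",
    "prompt": (
        "Write a Python function fibonacci(n) that returns the n-th "
        "Fibonacci number. Return None for negative inputs."
    ),
    "function_name": "fibonacci",   # used for function name verification
    "cases": [
        ((-1,), None),              # edge case: negative input
        ((0,), 0),                  # edge case: zero
        ((1,), 1),
        ((10,), 55),                # standard case
        ((30,), 832040),            # larger input for the performance check
    ],
}
```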
2. Inference Pipeline
Request Flow:
- Format prompt with problem description
- Send to Ollama API with timing start
- Receive response and stop timing
- Extract code from response
- Validate code syntax
- Run test cases
- Calculate performance metrics
Performance Metrics:
- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (eval_count)
- Tokens per Second: Processing speed (tokens/duration)
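A minimal sketch of this request flow and the derived metrics is shown below. It assumes the `requests` library and a simple prompt template; `response` and `eval_count` are fields of Ollama's non-streaming `/api/generate` reply, while everything else is illustrative.

```python
import time
import requests

def run_inference(server_url: str, model: str, problem: str) -> dict:
    """Send one coding prompt to Ollama and return the reply plus timing metrics."""
    prompt = f"Solve the following problem in Python:\n\n{problem}"  # assumed template
    start = time.time()
    resp = requests.post(
        f"{server_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    duration = time.time() - start                     # total duration in seconds
    data = resp.json()
    tokens = data.get("eval_count", 0)                 # tokens in the response
    return {
        "response": data.get("response", ""),
        "total_duration_s": duration,
        "total_tokens": tokens,
        "tokens_per_second": tokens / duration if duration > 0 else 0.0,
    }
```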
3. Validation System
Code Validation:
- Syntax check (is_valid_python)
- Function name verification
- Test case execution
- Together API integration for failure analysis
Test Results:
- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures
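The syntax check and test execution could look roughly like the sketch below; `is_valid_python` is implemented here with `ast.parse`, while the exec-based runner and the case format are assumptions. The Together API call that analyses failures is omitted.

```python
import ast

def is_valid_python(code: str) -> bool:
    """Syntax check: the extracted code must parse without a SyntaxError."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_test_cases(code: str, function_name: str, cases: list) -> list:
    """Execute model-generated code and compare outputs against expected values."""
    namespace: dict = {}
    exec(code, namespace)                               # run the generated code
    if function_name not in namespace:                  # function name verification
        return [(args, expected, False, "function not defined")
                for args, expected in cases]
    results = []
    for args, expected in cases:
        try:
            got = namespace[function_name](*args)
            results.append((args, expected, got == expected, got))
        except Exception as exc:                        # runtime errors count as failures
            results.append((args, expected, False, repr(exc)))
    return results
```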
4. Benchmarking System
Benchmark Process:
- Run multiple iterations (default: 4 runs)
- Use last 3 runs for final metrics
- Calculate averages across runs
- Store detailed results in JSON
Metrics Tracked:
- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second
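Putting the pieces together, the benchmark loop might look like the sketch below. It reuses the hypothetical helpers from the earlier sketches, assumes an `extract_code` helper, and treats the first of the four runs as a discarded warm-up, which is one plausible reading of "use last 3 runs for final metrics".

```python
def benchmark_model(server_url: str, model: str, tests: list, runs: int = 4) -> dict:
    """Run the full test suite `runs` times and average the last three runs."""
    all_runs = []
    for _ in range(runs):
        metrics = {"passed": 0, "total": 0, "durations": [], "tps": []}
        for test in tests:
            result = run_inference(server_url, model, test["prompt"])
            code = extract_code(result["response"])          # assumed helper
            if is_valid_python(code):
                outcomes = run_test_cases(code, test["function_name"], test["cases"])
                metrics["passed"] += sum(1 for args, expected, ok, got in outcomes if ok)
            metrics["total"] += len(test["cases"])           # failed validation = 0 passed
            metrics["durations"].append(result["total_duration_s"])
            metrics["tps"].append(result["tokens_per_second"])
        all_runs.append(metrics)

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    last = all_runs[-3:]                                     # final metrics: last 3 runs
    return {
        "success_rate": 100.0 * avg([r["passed"] / r["total"] for r in last if r["total"]]),
        "avg_duration_s": avg([avg(r["durations"]) for r in last]),
        "avg_tokens_per_second": avg([avg(r["tps"]) for r in last]),
    }
```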
5. Leaderboard System
Ranking Algorithm:
- Primary sort: Overall success rate
  - Calculated as (total passed cases / total cases) across all tests
- Secondary sort: Tokens per second
  - Higher speed breaks ties between equal success rates
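The ranking itself reduces to a two-key sort. A sketch, assuming each model maps to the averaged metrics produced above:

```python
def rank_models(results: dict) -> list:
    """Sort models by overall success rate, using tokens/sec to break ties."""
    # results: model_name -> {"success_rate": float, "avg_tokens_per_second": float, ...}
    return sorted(
        results.items(),
        key=lambda item: (item[1]["success_rate"], item[1]["avg_tokens_per_second"]),
        reverse=True,
    )
```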
Display Format:
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
Usage
Basic Run:
python3.10 main.py --server 'z60' --number '2'
Options:
- --model: Specify a single model to test
- --server: Custom Ollama server URL
- --runs: Number of benchmark runs
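A minimal argument parser consistent with these options might look like the following sketch. Note that the example invocation above uses `--number`, so the exact flag names and defaults shown here are assumptions.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Ollama coding-benchmark runner")
    parser.add_argument("--model", help="Specify a single model to test")
    parser.add_argument("--server", default="http://localhost:11434",
                        help="Custom Ollama server URL (or a configured alias)")
    parser.add_argument("--runs", type=int, default=4,
                        help="Number of benchmark runs")
    return parser.parse_args()
```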
Output:
- Real-time test progress
- Performance metrics per inference
- Test results summary
- Final leaderboard
- JSON results file with timestamp
Results Storage
- Results saved as: model_benchmark_YYYYMMDD_HHMMSS.json
- Contains full test details and metrics
- Enables historical comparison and analysis
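Writing the timestamped results file is straightforward; a sketch, assuming results are collected into a plain dictionary:

```python
import json
import os
from datetime import datetime

def save_results(results: dict, out_dir: str = "benchmark_results") -> str:
    """Write results to benchmark_results/model_benchmark_YYYYMMDD_HHMMSS.json."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"model_benchmark_{stamp}.json")
    with open(path, "w") as fh:
        json.dump(results, fh, indent=2)
    return path
```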
Error Handling
- API communication errors
- Code execution timeouts
- Invalid responses
- Test case failures
- Performance metric calculation errors
Future Improvements
- Add more diverse test cases
- Implement parallel testing
- Add memory usage tracking
- Create historical performance trends
- Add code quality metrics
Leaderboard Data Processing
Test Results Processing
- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios
Success Rate Calculations
- Calculates success rates based on expected total cases
- Counts failed validations as 0/expected_cases
- Uses maximum observed test cases as the baseline
- Includes validation status in success rate reporting
- Prevents skipping of failed validations in total counts
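These counting rules can be captured in a small helper, sketched here with assumed result keys: the baseline case count per test is the maximum observed across successful runs, and a failed validation contributes 0 passed cases against that baseline.

```python
def overall_success_rate(runs: list) -> tuple:
    """Return (passed, total, rate%) for the latest run, never skipping failures."""
    # runs: list of {test_name: {"validated": bool, "passed": int, "cases": int}}
    expected = {}
    for run in runs:                                   # baseline from successful runs
        for name, res in run.items():
            if res.get("validated"):
                expected[name] = max(expected.get(name, 0), res.get("cases", 0))

    latest = runs[-1]                                  # only the latest benchmark results
    passed = total = 0
    for name, exp in expected.items():
        res = latest.get(name, {})
        total += exp
        passed += res.get("passed", 0) if res.get("validated") else 0   # failure = 0/exp
    rate = 100.0 * passed / total if total else 0.0
    return passed, total, rate
```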
Performance Metrics
- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates success rate vs duration ratio
- Excellence criteria: >90% success AND success_rate > 5*duration
- Prevents duplicate model entries in metrics
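The excellence rule translates directly into a boolean check; in this sketch `success_rate` is a percentage and `duration_s` is in seconds, which is an assumption about the units used above.

```python
def is_excellent(success_rate: float, duration_s: float) -> bool:
    """Excellence criteria: >90% success AND success_rate > 5 * duration."""
    return success_rate > 90.0 and success_rate > 5.0 * duration_s
```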
Data Visualization
Development Notes
Project Structure
- main.py: Core benchmarking functionality
- lboard.py: Leaderboard visualization and results analysis
- benchmark_results/: Directory containing JSON benchmark results
Visualization Features
- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison
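A plot of this kind can be built with matplotlib by stacking two `twinx` axes on top of the bar axis; the sketch below illustrates the idea, with the data layout and styling details as assumptions.

```python
import matplotlib.pyplot as plt

def plot_leaderboard(models, tokens_per_sec, success_rates, durations):
    """Bars for tokens/sec plus two extra y-axes for success rate and duration."""
    fig, ax_tps = plt.subplots(figsize=(10, 5))
    ax_rate = ax_tps.twinx()                           # second y-axis: success rate
    ax_dur = ax_tps.twinx()                            # third y-axis: duration
    ax_dur.spines["right"].set_position(("outward", 60))

    x = range(len(models))
    ax_tps.bar(x, tokens_per_sec, color="tab:blue", label="Tokens/sec")
    ax_rate.plot(x, success_rates, "r+", markersize=12, linestyle="none",
                 label="Success rate (%)")
    ax_dur.plot(x, durations, "g_", markersize=12, linestyle="none",
                label="Duration (s)")

    ax_tps.set_xticks(list(x))
    ax_tps.set_xticklabels(models, rotation=45, ha="right")
    for label, rate in zip(ax_tps.get_xticklabels(), success_rates):
        if rate > 90:                                  # highlight >90% success in green
            label.set_color("green")

    ax_tps.set_ylabel("Tokens per second")
    ax_rate.set_ylabel("Overall success rate (%)")
    ax_dur.set_ylabel("Total duration (s)")
    fig.legend(loc="upper right")                      # combined legend for all metrics
    fig.tight_layout()
    plt.show()
```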
Running the Leaderboard
# View latest results
python lboard.py
# View specific results file
python lboard.py path/to/results.json
The generated plot includes:
- Single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- Combined legend for all metrics
Output Format
Benchmark Run Output
For each model being tested, the output shows:
- Individual test runs (1-4) with:
- Test case results
- Performance metrics
- Pass/fail status
- Cumulative Results Summary: After all runs are completed, a summary is displayed:
- Detailed test results per model
- Individual test case counts
- Validation status indicators
- Overall success rates
- Performance metrics
- Plot position information
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
Error Handling
- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations