Ollama Testing Framework Documentation
Version: 1.0 | Last Updated: 2025-02-23
Overview
The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:
- Code correctness (test cases)
- Performance metrics (inference time, tokens/sec)
- Consistency across multiple runs
Goals
- Validate model responses for correctness and functionality
- Measure and compare performance across different models
- Provide detailed insights into model behavior and reliability
- Enable easy comparison through a leaderboard system
Core Components
1. Test Suite
The test suite consists of multiple coding challenges, each with:
- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs
Current Test Cases:
a) Fibonacci Sequence
- Tests edge cases (negative, zero)
- Tests standard cases (n=1 to n=10)
- Validates performance for larger inputs
b) Binary Search
- Tests empty list case
- Tests element not found
- Tests finding elements at different positions
c) Palindrome Check
- Tests empty string
- Tests single character
- Tests various palindrome and non-palindrome cases
d) Anagram Check
- Tests empty strings
- Tests case sensitivity
- Tests strings with spaces and special characters
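To make this structure concrete, a single suite entry might be defined roughly as below. This is a hedged sketch: the field names (`name`, `prompt`, `function_name`, `cases`) are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical shape of one test suite entry (field names are assumptions).
FIBONACCI_TEST = {
    "name": "Fibonacci Sequence",
    "prompt": (
        "Write a Python function fibonacci(n) that returns the n-th "
        "Fibonacci number. Return None for negative inputs."
    ),
    "function_name": "fibonacci",   # used for function name verification
    "cases": [
        ((-1,), None),              # edge case: negative input
        ((0,), 0),                  # edge case: zero
        ((1,), 1),
        ((10,), 55),                # standard case
        ((30,), 832040),            # larger input for the performance check
    ],
}
```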
2. Inference Pipeline
Request Flow:
- Format prompt with problem description
- Send to Ollama API with timing start
- Receive response and stop timing
- Extract code from response
- Validate code syntax
- Run test cases
- Calculate performance metrics
Performance Metrics:
- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (eval_count)
- Tokens per Second: Processing speed (tokens/duration)
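A minimal sketch of this request flow and the derived metrics is shown below. It assumes the `requests` library and a simple prompt template; `response` and `eval_count` are fields of Ollama's non-streaming `/api/generate` reply, while everything else is illustrative.

```python
import time
import requests

def run_inference(server_url: str, model: str, problem: str) -> dict:
    """Send one coding prompt to Ollama and return the reply plus timing metrics."""
    prompt = f"Solve the following problem in Python:\n\n{problem}"  # assumed template
    start = time.time()
    resp = requests.post(
        f"{server_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    duration = time.time() - start                     # total duration in seconds
    data = resp.json()
    tokens = data.get("eval_count", 0)                 # tokens in the response
    return {
        "response": data.get("response", ""),
        "total_duration_s": duration,
        "total_tokens": tokens,
        "tokens_per_second": tokens / duration if duration > 0 else 0.0,
    }
```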
3. Validation System
Code Validation:
- Syntax check (is_valid_python)
- Function name verification
- Test case execution
- Together API integration for failure analysis
Test Results:
- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures
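The syntax check and test execution could look roughly like the sketch below; `is_valid_python` is implemented here with `ast.parse`, while the exec-based runner and the case format are assumptions. The Together API call that analyses failures is omitted.

```python
import ast

def is_valid_python(code: str) -> bool:
    """Syntax check: the extracted code must parse without a SyntaxError."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_test_cases(code: str, function_name: str, cases: list) -> list:
    """Execute model-generated code and compare outputs against expected values."""
    namespace: dict = {}
    exec(code, namespace)                               # run the generated code
    if function_name not in namespace:                  # function name verification
        return [(args, expected, False, "function not defined")
                for args, expected in cases]
    results = []
    for args, expected in cases:
        try:
            got = namespace[function_name](*args)
            results.append((args, expected, got == expected, got))
        except Exception as exc:                        # runtime errors count as failures
            results.append((args, expected, False, repr(exc)))
    return results
```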
4. Benchmarking System
Benchmark Process:
- Run multiple iterations (default: 4 runs)
- Use last 3 runs for final metrics
- Calculate averages across runs
- Store detailed results in JSON
Metrics Tracked:
- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second
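Putting the pieces together, the benchmark loop might look like the sketch below. It reuses the hypothetical helpers from the earlier sketches, assumes an `extract_code` helper, and treats the first of the four runs as a discarded warm-up, which is one plausible reading of "use last 3 runs for final metrics".

```python
def benchmark_model(server_url: str, model: str, tests: list, runs: int = 4) -> dict:
    """Run the full test suite `runs` times and average the last three runs."""
    all_runs = []
    for _ in range(runs):
        metrics = {"passed": 0, "total": 0, "durations": [], "tps": []}
        for test in tests:
            result = run_inference(server_url, model, test["prompt"])
            code = extract_code(result["response"])          # assumed helper
            if is_valid_python(code):
                outcomes = run_test_cases(code, test["function_name"], test["cases"])
                metrics["passed"] += sum(1 for args, expected, ok, got in outcomes if ok)
            metrics["total"] += len(test["cases"])           # failed validation = 0 passed
            metrics["durations"].append(result["total_duration_s"])
            metrics["tps"].append(result["tokens_per_second"])
        all_runs.append(metrics)

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    last = all_runs[-3:]                                     # final metrics: last 3 runs
    return {
        "success_rate": 100.0 * avg([r["passed"] / r["total"] for r in last if r["total"]]),
        "avg_duration_s": avg([avg(r["durations"]) for r in last]),
        "avg_tokens_per_second": avg([avg(r["tps"]) for r in last]),
    }
```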
5. Leaderboard System
Ranking Algorithm:
- Primary sort: Overall success rate
  - Calculated as (total passed cases / total cases) across all tests
- Secondary sort: Tokens per second
  - Higher speed breaks ties between equal success rates
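The ranking itself reduces to a two-key sort. A sketch, assuming each model maps to the averaged metrics produced above:

```python
def rank_models(results: dict) -> list:
    """Sort models by overall success rate, using tokens/sec to break ties."""
    # results: model_name -> {"success_rate": float, "avg_tokens_per_second": float, ...}
    return sorted(
        results.items(),
        key=lambda item: (item[1]["success_rate"], item[1]["avg_tokens_per_second"]),
        reverse=True,
    )
```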
Display Format:
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
Usage
Basic Run:
python3.10 main.py --server 'z60' --number '2'
Options:
- --model: Specify a single model to test
- --server: Custom Ollama server URL
- --runs: Number of benchmark runs
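A minimal argument parser consistent with these options might look like the following sketch. Note that the example invocation above uses `--number`, so the exact flag names and defaults shown here are assumptions.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Ollama coding-benchmark runner")
    parser.add_argument("--model", help="Specify a single model to test")
    parser.add_argument("--server", default="http://localhost:11434",
                        help="Custom Ollama server URL (or a configured alias)")
    parser.add_argument("--runs", type=int, default=4,
                        help="Number of benchmark runs")
    return parser.parse_args()
```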
Output:
- Real-time test progress
- Performance metrics per inference
- Test results summary
- Final leaderboard
- JSON results file with timestamp
Results Storage
- Results saved as: model_benchmark_YYYYMMDD_HHMMSS.json
- Contains full test details and metrics
- Enables historical comparison and analysis
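Writing the timestamped results file is straightforward; a sketch, assuming results are collected into a plain dictionary:

```python
import json
import os
from datetime import datetime

def save_results(results: dict, out_dir: str = "benchmark_results") -> str:
    """Write results to benchmark_results/model_benchmark_YYYYMMDD_HHMMSS.json."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"model_benchmark_{stamp}.json")
    with open(path, "w") as fh:
        json.dump(results, fh, indent=2)
    return path
```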
Error Handling
- API communication errors
- Code execution timeouts
- Invalid responses
- Test case failures
- Performance metric calculation errors
Future Improvements
- Add more diverse test cases
- Implement parallel testing
- Add memory usage tracking
- Create historical performance trends
- Add code quality metrics
Leaderboard Data Processing
Test Results Processing
- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios
Success Rate Calculations
- Calculates success rates based on expected total cases
- Counts failed validations as 0/expected_cases
- Uses maximum observed test cases as the baseline
- Includes validation status in success rate reporting
- Prevents skipping of failed validations in total counts
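These counting rules can be captured in a small helper, sketched here with assumed result keys: the baseline case count per test is the maximum observed across successful runs, and a failed validation contributes 0 passed cases against that baseline.

```python
def overall_success_rate(runs: list) -> tuple:
    """Return (passed, total, rate%) for the latest run, never skipping failures."""
    # runs: list of {test_name: {"validated": bool, "passed": int, "cases": int}}
    expected = {}
    for run in runs:                                   # baseline from successful runs
        for name, res in run.items():
            if res.get("validated"):
                expected[name] = max(expected.get(name, 0), res.get("cases", 0))

    latest = runs[-1]                                  # only the latest benchmark results
    passed = total = 0
    for name, exp in expected.items():
        res = latest.get(name, {})
        total += exp
        passed += res.get("passed", 0) if res.get("validated") else 0   # failure = 0/exp
    rate = 100.0 * passed / total if total else 0.0
    return passed, total, rate
```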
Performance Metrics
- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates success rate vs duration ratio
- Excellence criteria: >90% success AND success_rate > 5*duration
- Prevents duplicate model entries in metrics
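The excellence rule translates directly into a boolean check; in this sketch `success_rate` is a percentage and `duration_s` is in seconds, which is an assumption about the units used above.

```python
def is_excellent(success_rate: float, duration_s: float) -> bool:
    """Excellence criteria: >90% success AND success_rate > 5 * duration."""
    return success_rate > 90.0 and success_rate > 5.0 * duration_s
```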
Data Visualization
Development Notes
Project Structure
- main.py: Core benchmarking functionality
- lboard.py: Leaderboard visualization and results analysis
- benchmark_results/: Directory containing JSON benchmark results
Visualization Features
- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison
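A plot of this kind can be built with matplotlib by stacking two `twinx` axes on top of the bar axis; the sketch below illustrates the idea, with the data layout and styling details as assumptions.

```python
import matplotlib.pyplot as plt

def plot_leaderboard(models, tokens_per_sec, success_rates, durations):
    """Bars for tokens/sec plus two extra y-axes for success rate and duration."""
    fig, ax_tps = plt.subplots(figsize=(10, 5))
    ax_rate = ax_tps.twinx()                           # second y-axis: success rate
    ax_dur = ax_tps.twinx()                            # third y-axis: duration
    ax_dur.spines["right"].set_position(("outward", 60))

    x = range(len(models))
    ax_tps.bar(x, tokens_per_sec, color="tab:blue", label="Tokens/sec")
    ax_rate.plot(x, success_rates, "r+", markersize=12, linestyle="none",
                 label="Success rate (%)")
    ax_dur.plot(x, durations, "g_", markersize=12, linestyle="none",
                label="Duration (s)")

    ax_tps.set_xticks(list(x))
    ax_tps.set_xticklabels(models, rotation=45, ha="right")
    for label, rate in zip(ax_tps.get_xticklabels(), success_rates):
        if rate > 90:                                  # highlight >90% success in green
            label.set_color("green")

    ax_tps.set_ylabel("Tokens per second")
    ax_rate.set_ylabel("Overall success rate (%)")
    ax_dur.set_ylabel("Total duration (s)")
    fig.legend(loc="upper right")                      # combined legend for all metrics
    fig.tight_layout()
    plt.show()
```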
Running the Leaderboard
# View latest results
python lboard.py
# View specific results file
python lboard.py path/to/results.json
The generated plot includes:
- Single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- Combined legend for all metrics
Output Format
Benchmark Run Output
For each model being tested, the output shows:
- Individual test runs (1-4) with:
- Test case results
- Performance metrics
- Pass/fail status
- Cumulative Results Summary: After all runs are completed, a summary is displayed:
- Detailed test results per model
- Individual test case counts
- Validation status indicators
- Overall success rates
- Performance metrics
- Plot position information
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
Error Handling
- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations