# Ollama Testing Framework Documentation

Version: 1.0
Last Updated: 2025-02-23
## Overview

The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:

1. Code correctness (test cases)
2. Performance metrics (inference time, tokens/sec)
3. Consistency across multiple runs
## Goals

1. Validate model responses for correctness and functionality
2. Measure and compare performance across different models
3. Provide detailed insights into model behavior and reliability
4. Enable easy comparison through a leaderboard system
## Core Components

### 1. Test Suite

The test suite consists of multiple coding challenges, each with:

- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs
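The exact data structures live in `main.py`; as a rough illustration only, a single test entry might look like the following (field names and values here are hypothetical, not the framework's actual schema):

```python
# Hypothetical shape of one test suite entry; names and values are illustrative only.
FIBONACCI_TEST = {
    "name": "Fibonacci Sequence",
    "prompt": "Write a Python function fibonacci(n) returning the n-th Fibonacci number.",
    "function_name": "fibonacci",   # used for function name verification
    "test_cases": [
        ((0,), 0),      # edge case: zero
        ((1,), 1),      # standard case
        ((10,), 55),    # standard case
    ],
}
```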
Current Test Cases:

a) Fibonacci Sequence
- Tests edge cases (negative, zero)
- Tests standard cases (n=1 to n=10)
- Validates performance for larger inputs

b) Binary Search
- Tests empty list case
- Tests element not found
- Tests finding elements at different positions

c) Palindrome Check
- Tests empty string
- Tests single character
- Tests various palindrome and non-palindrome cases

d) Anagram Check
- Tests empty strings
- Tests case sensitivity
- Tests strings with spaces and special characters
### 2. Inference Pipeline

#### Request Flow:

1. Format prompt with problem description
2. Send to Ollama API with timing start
3. Receive response and stop timing
4. Extract code from response
5. Validate code syntax
6. Run test cases
7. Calculate performance metrics
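As a rough sketch of steps 1-3, a non-streaming request against Ollama's standard `/api/generate` endpoint can be timed like this (the URL, payload fields, and helper name below are assumptions for illustration, not taken from `main.py`):

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # replace with your --server URL

def run_inference(model: str, prompt: str) -> dict:
    """Send a prompt to Ollama, time the round trip, and return the raw fields."""
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    duration_s = time.time() - start
    data = resp.json()
    return {
        "response": data.get("response", ""),   # model output to extract code from
        "eval_count": data.get("eval_count", 0),
        "duration_s": duration_s,
    }
```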
#### Performance Metrics:

- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (`eval_count`)
- Tokens per Second: Processing speed (tokens/duration)
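Assuming the metrics are derived from the fields Ollama returns with a non-streaming response (`eval_count`, plus `total_duration` in nanoseconds), the computation reduces to something like:

```python
def compute_metrics(data: dict) -> dict:
    """data: parsed JSON response from Ollama's /api/generate (non-streaming)."""
    duration_s = data.get("total_duration", 0) / 1e9   # Ollama reports nanoseconds
    tokens = data.get("eval_count", 0)
    return {
        "total_duration_s": duration_s,
        "total_tokens": tokens,
        "tokens_per_second": tokens / duration_s if duration_s else 0.0,
    }
```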
### 3. Validation System

#### Code Validation:

1. Syntax check (`is_valid_python`)
2. Function name verification
3. Test case execution
4. Together API integration for failure analysis
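A minimal sketch of what steps 1 and 2 can look like using the standard library (`has_function` is an illustrative helper, not necessarily the framework's own):

```python
import ast

def is_valid_python(code: str) -> bool:
    """Syntax check: does the generated code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def has_function(code: str, name: str) -> bool:
    """Function name verification: is the expected entry point defined?"""
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in ast.walk(ast.parse(code))
    )
```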
#### Test Results:

- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures
### 4. Benchmarking System

#### Benchmark Process:

1. Run multiple iterations (default: 4 runs)
2. Use last 3 runs for final metrics
3. Calculate averages across runs
4. Store detailed results in JSON
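In outline, the averaging step amounts to the sketch below (per-run field names are illustrative; `main.py` is the authoritative implementation):

```python
from statistics import mean

def summarize_runs(runs: list[dict]) -> dict:
    """Average metrics over the scoring runs.

    `runs` holds one dict per benchmark iteration (4 by default);
    the first run acts as a warm-up and is excluded from the averages.
    """
    scored = runs[-3:]
    return {
        "avg_duration_s": mean(r["duration_s"] for r in scored),
        "avg_tokens_per_sec": mean(r["tokens_per_sec"] for r in scored),
    }
```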
#### Metrics Tracked:

- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second
### 5. Leaderboard System

#### Ranking Algorithm:

1. Primary sort: Overall success rate
   - Calculated as (total passed cases / total cases) across all tests
2. Secondary sort: Tokens per second
   - Higher speed breaks ties between equal success rates
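Expressed as a sort key, the ranking looks roughly like this (per-model summary field names are illustrative):

```python
def rank(model_summaries: list[dict]) -> list[dict]:
    """Sort descending: success rate first, tokens/sec breaks ties."""
    return sorted(
        model_summaries,
        key=lambda m: (m["overall_success_rate"], m["tokens_per_sec"]),
        reverse=True,
    )
```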
#### Display Format:

```
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
```
## Usage

### Basic Run:

```bash
python3.10 main.py --server 'z60' --number '2'
```
### Options:

- `--model`: Specify a single model to test
- `--server`: Custom Ollama server URL
- `--runs`: Number of benchmark runs
### Output:

1. Real-time test progress
2. Performance metrics per inference
3. Test results summary
4. Final leaderboard
5. JSON results file with timestamp
## Results Storage

- Results saved as: `model_benchmark_YYYYMMDD_HHMMSS.json`
- Contains full test details and metrics
- Enables historical comparison and analysis
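The timestamped filename can be produced along these lines (a minimal sketch; the function name and result structure are illustrative):

```python
import json
from datetime import datetime

def save_results(benchmark_results: dict) -> str:
    """Write the full benchmark details to a timestamped JSON file."""
    filename = f"model_benchmark_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, "w") as f:
        json.dump(benchmark_results, f, indent=2)
    return filename
```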
## Error Handling

1. API communication errors
2. Code execution timeouts
3. Invalid responses
4. Test case failures
5. Performance metric calculation errors
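For item 2, one common pattern is to execute generated code in a subprocess with a hard timeout; the framework may handle this differently, but the idea is:

```python
import subprocess

def run_generated_tests(script_path: str, timeout_s: int = 10) -> bool:
    """Run a generated test script in a subprocess; treat a hang as a failure."""
    try:
        proc = subprocess.run(
            ["python3", script_path],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```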
## Future Improvements

1. Add more diverse test cases
2. Implement parallel testing
3. Add memory usage tracking
4. Create historical performance trends
5. Add code quality metrics
## Leaderboard Data Processing

### Test Results Processing

- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios
### Success Rate Calculations

- Calculates success rates against the expected total number of cases
- Counts failed validations as 0/expected_cases
- Uses the maximum observed test case count as the baseline
- Includes validation status in success rate reporting
- Ensures failed validations are never skipped in the total counts
### Performance Metrics

- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates success rate vs duration ratio
- Excellence criteria: >90% success AND success_rate > 5*duration
- Prevents duplicate model entries in metrics
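Put together, the aggregation described in the last two subsections amounts to something like the sketch below (field names are illustrative, not the framework's actual schema):

```python
def overall_success_rate(test_results: dict, expected_cases: dict) -> float:
    """Failed validations count as 0 passed out of the expected case total."""
    passed = total = 0
    for test_name, result in test_results.items():
        expected = expected_cases[test_name]   # max cases observed in successful runs
        passed += result["passed"] if result.get("validated", False) else 0
        total += expected
    return 100.0 * passed / total if total else 0.0

def is_excellent(success_rate_pct: float, duration_s: float) -> bool:
    """Excellence criteria: >90% success AND success rate greater than 5x duration."""
    return success_rate_pct > 90.0 and success_rate_pct > 5 * duration_s
```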
### Data Visualization

Model results are rendered as a combined bar/marker plot by `lboard.py`; see Visualization Features under the Development Notes below.

# Development Notes
## Project Structure

- `main.py`: Core benchmarking functionality
- `lboard.py`: Leaderboard visualization and results analysis
- `benchmark_results/`: Directory containing JSON benchmark results
## Visualization Features

- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison
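A rough sketch of how such a triple-axis plot can be built with matplotlib (the data and styling here are illustrative; `lboard.py` is the authoritative implementation):

```python
import matplotlib.pyplot as plt

# Illustrative data; real values come from the benchmark JSON files.
names = ["model_a", "model_b", "model_c"]
tokens_per_sec = [42.0, 31.5, 18.2]
success_rate = [95.0, 80.0, 92.5]
duration_s = [12.3, 18.7, 25.1]

x = range(len(names))
fig, ax1 = plt.subplots()

# Blue bars: tokens per second.
ax1.bar(x, tokens_per_sec, color="tab:blue", label="Tokens/sec")
ax1.set_ylabel("Tokens per second")
ax1.set_xticks(list(x))
ax1.set_xticklabels(names)

# Red + markers: overall success rate, on a second y-axis.
ax2 = ax1.twinx()
ax2.plot(x, success_rate, "r+", markersize=12, label="Success rate (%)")
ax2.set_ylabel("Success rate (%)")

# Green - markers: total duration, on a third (offset) y-axis.
ax3 = ax1.twinx()
ax3.spines["right"].set_position(("outward", 60))
ax3.plot(x, duration_s, "g_", markersize=12, label="Duration (s)")
ax3.set_ylabel("Duration (s)")

# Color model names green when the success rate exceeds 90%.
for label, rate in zip(ax1.get_xticklabels(), success_rate):
    if rate > 90:
        label.set_color("green")

fig.tight_layout()
plt.show()
```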
## Running the Leaderboard

```bash
# View latest results
python lboard.py

# View specific results file
python lboard.py path/to/results.json
```
The resulting plot includes:

- Single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- Combined legend for all metrics
## Output Format

### Benchmark Run Output

For each model being tested, the output shows:

1. Individual test runs (1-4) with:
   - Test case results
   - Performance metrics
   - Pass/fail status

2. Cumulative Results Summary (displayed after all runs are completed):
   - Detailed test results per model
   - Individual test case counts
   - Validation status indicators
   - Overall success rates
   - Performance metrics
   - Plot position information
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
### Error Handling

- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations