
Ollama Testing Framework Documentation

Version: 1.0 | Last Updated: 2025-02-23

Overview

The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:

  1. Code correctness (test cases)
  2. Performance metrics (inference time, tokens/sec)
  3. Consistency across multiple runs

Goals

  1. Validate model responses for correctness and functionality
  2. Measure and compare performance across different models
  3. Provide detailed insights into model behavior and reliability
  4. Enable easy comparison through a leaderboard system

Core Components

1. Test Suite

The test suite consists of multiple coding challenges, each with:

  • A clear problem description
  • A validator function
  • Multiple test cases
  • Expected outputs

Current Test Cases:

a) Fibonacci Sequence

  • Tests edge cases (negative, zero)
  • Tests standard cases (n=1 to n=10)
  • Validates performance for larger inputs

b) Binary Search

  • Tests empty list case
  • Tests element not found
  • Tests finding elements at different positions

c) Palindrome Check

  • Tests empty string
  • Tests single character
  • Tests various palindrome and non-palindrome cases

d) Anagram Check

  • Tests empty strings
  • Tests case sensitivity
  • Tests strings with spaces and special characters
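
As an illustration, each challenge pairs a prompt with a validator and a set of expected input/output pairs. The structure and field names below are hypothetical and not taken from the framework's source; they only sketch how the Fibonacci challenge above might be encoded.

```python
# Hypothetical test definition (field names are illustrative, not the framework's actual schema).
FIBONACCI_TEST = {
    "name": "Fibonacci Sequence",
    "function_name": "fibonacci",
    "prompt": "Write a Python function fibonacci(n) that returns the n-th Fibonacci number.",
    "test_cases": [
        (-1, None),  # edge case: negative input (expected result here is an assumption)
        (0, 0),      # edge case: zero
        (1, 1),      # standard cases n=1..n=10
        (10, 55),
    ],
}
```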

2. Inference Pipeline

Request Flow (see the code sketch below):

  1. Format prompt with problem description
  2. Send to Ollama API with timing start
  3. Receive response and stop timing
  4. Extract code from response
  5. Validate code syntax
  6. Run test cases
  7. Calculate performance metrics

Performance Metrics:

  • Total Duration (s): Time from request to response completion
  • Total Tokens: Number of tokens in the response (eval_count)
  • Tokens per Second: Processing speed (tokens/duration)
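
A minimal sketch of this flow against Ollama's `/api/generate` endpoint is shown below. The `response` and `eval_count` fields are part of Ollama's non-streaming response format; the prompt template, helper name, and default server URL are assumptions for illustration only.

```python
import time

import requests

OLLAMA_URL = "http://localhost:11434"  # assumed default; --server would override this


def run_inference(model: str, problem_description: str) -> dict:
    """Send one prompt to the Ollama API and compute the metrics listed above."""
    prompt = f"Solve the following problem in Python:\n\n{problem_description}"  # illustrative template
    start = time.time()
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    duration = time.time() - start
    data = resp.json()
    tokens = data.get("eval_count", 0)  # tokens generated in the response
    return {
        "response": data.get("response", ""),
        "total_duration_s": duration,
        "total_tokens": tokens,
        "tokens_per_second": tokens / duration if duration > 0 else 0.0,
    }
```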

3. Validation System

Code Validation:

  1. Syntax check (is_valid_python)
  2. Function name verification
  3. Test case execution
  4. Together API integration for failure analysis
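
Steps 1-3 could look roughly like the sketch below (the Together API step is omitted). `is_valid_python` is named above, but this implementation based on Python's `ast` module is an assumption, as is the shape of `test_cases`.

```python
import ast


def is_valid_python(code: str) -> bool:
    """Step 1: syntax check via ast.parse (assumed implementation)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def run_test_cases(code: str, function_name: str, test_cases) -> list[bool]:
    """Steps 2-3: verify the expected function exists, then execute each test case."""
    if not is_valid_python(code) or f"def {function_name}" not in code:
        return [False] * len(test_cases)
    namespace: dict = {}
    exec(code, namespace)  # the real framework should sandbox and time-limit this
    func = namespace.get(function_name)
    results = []
    for arg, expected in test_cases:  # single-argument problems for brevity
        try:
            results.append(func(arg) == expected)
        except Exception:
            results.append(False)
    return results
```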

Test Results:

  • Individual test case results (pass/fail)
  • Error messages and debug info
  • Together API opinions on failures

4. Benchmarking System

Benchmark Process (see the code sketch below):

  1. Run multiple iterations (default: 4 runs)
  2. Use last 3 runs for final metrics
  3. Calculate averages across runs
  4. Store detailed results in JSON

Metrics Tracked:

  • Success rate per test
  • Overall success rate
  • Average inference time
  • Average tokens per second
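
A sketch of the averaging step, assuming the first run is treated as a warm-up and only the last three runs feed the final metrics; the helper `run_test_suite` and the result keys are illustrative, not the framework's actual names.

```python
from statistics import mean


def benchmark_model(model: str, tests, runs: int = 4) -> dict:
    """Run the full suite several times and average metrics over the last three runs."""
    all_runs = [run_test_suite(model, tests) for _ in range(runs)]  # run_test_suite: assumed helper
    scored = all_runs[-3:]  # last 3 runs only
    return {
        "success_rate": mean(r["success_rate"] for r in scored),
        "avg_duration_s": mean(r["total_duration_s"] for r in scored),
        "avg_tokens_per_second": mean(r["tokens_per_second"] for r in scored),
    }
```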

5. Leaderboard System

Ranking Algorithm:

  1. Primary sort: Overall success rate
    • Calculated as (total passed cases / total cases) across all tests
  2. Secondary sort: Tokens per second
    • Higher speed breaks ties between equal success rates
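
The two-level ranking reduces to a single sort key, roughly as follows (the result dictionary keys are assumptions):

```python
def rank_models(results: list[dict]) -> list[dict]:
    """Sort by overall success rate, breaking ties with tokens per second."""
    return sorted(
        results,
        key=lambda r: (r["overall_success_rate"], r["tokens_per_second"]),
        reverse=True,  # higher is better for both keys
    )
```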

Display Format:

```
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
```

Usage

Basic Run:

```
python3.10 main.py --server 'z60' --number '2'
```

Options:

  • --model: Specify a single model to test
  • --server: Custom Ollama server URL
  • --runs: Number of benchmark runs

Output:

  1. Real-time test progress
  2. Performance metrics per inference
  3. Test results summary
  4. Final leaderboard
  5. JSON results file with timestamp

Results Storage

  • Results saved as: model_benchmark_YYYYMMDD_HHMMSS.json
  • Contains full test details and metrics
  • Enables historical comparison and analysis
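
Writing the timestamped file might look like the sketch below; the payload layout is not specified here, and the output directory is an assumption based on the project structure notes later in this document.

```python
import json
import os
from datetime import datetime


def save_results(results: dict, out_dir: str = "benchmark_results") -> str:
    """Persist benchmark results as model_benchmark_YYYYMMDD_HHMMSS.json."""
    os.makedirs(out_dir, exist_ok=True)
    filename = f"model_benchmark_{datetime.now():%Y%m%d_%H%M%S}.json"
    path = os.path.join(out_dir, filename)
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path
```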

Error Handling

  1. API communication errors
  2. Code execution timeouts
  3. Invalid responses
  4. Test case failures
  5. Performance metric calculation errors

Future Improvements

  1. Add more diverse test cases
  2. Implement parallel testing
  3. Add memory usage tracking
  4. Create historical performance trends
  5. Add code quality metrics

Leaderboard Data Processing

Test Results Processing

  • Processes only the latest benchmark results
  • Determines maximum test cases from successful runs
  • Handles validation failures as complete failures (0 passed cases)
  • Uses dynamic test case counting based on actual successful runs
  • Maintains consistent test case counting across all scenarios

Success Rate Calculations

  • Calculates success rates based on expected total cases
  • Counts failed validations as 0/expected_cases
  • Uses maximum observed test cases as the baseline
  • Includes validation status in success rate reporting
  • Prevents skipping of failed validations in total counts
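
A sketch of this accounting, assuming each per-test result records a passed-case count and a validation flag (field names are illustrative):

```python
def overall_success_rate(test_results: list[dict], expected_cases: dict[str, int]) -> float:
    """Count failed validations as 0/expected and keep the expected case count as the baseline."""
    passed = total = 0
    for result in test_results:
        expected = expected_cases[result["test_name"]]  # max cases seen in successful runs
        total += expected                               # failed validations are never skipped
        if not result.get("validation_failed"):
            passed += result["passed_cases"]
    return passed / total if total else 0.0
```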

Performance Metrics

  • Tracks tokens per second from model responses
  • Measures total duration across all tests
  • Calculates success rate vs duration ratio
  • Excellence criteria: >90% success AND success_rate > 5*duration
  • Prevents duplicate model entries in metrics
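
The excellence criterion from the list above is then a one-liner; the sketch assumes success rate is expressed in percent and duration in seconds, as in the metrics elsewhere in this document.

```python
def is_excellent(success_rate_pct: float, duration_s: float) -> bool:
    """Excellence criterion: >90% success AND success_rate > 5 * duration."""
    return success_rate_pct > 90 and success_rate_pct > 5 * duration_s
```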

Data Visualization

Development Notes

Project Structure

  • main.py: Core benchmarking functionality
  • lboard.py: Leaderboard visualization and results analysis
  • benchmark_results/: Directory containing JSON benchmark results

Visualization Features

  • Blue bars: Tokens per second performance
  • Red + markers: Overall success rate (%)
  • Green - markers: Total duration (seconds)
  • Green model names: Models with >90% success rate
  • Triple y-axis plot for easy metric comparison
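
A minimal matplotlib sketch of the triple y-axis layout described above; this is not the plotting code from lboard.py, and the data arguments are placeholders.

```python
import matplotlib.pyplot as plt


def plot_leaderboard(models, tokens_per_sec, success_rates, durations):
    """Bars for tokens/sec plus two extra y-axes for success rate and duration."""
    fig, ax1 = plt.subplots(figsize=(10, 5))
    ax1.bar(models, tokens_per_sec, color="blue", label="Tokens/sec")
    ax1.set_ylabel("Tokens per second")

    ax2 = ax1.twinx()  # second y-axis: success rate (red + markers)
    ax2.plot(models, success_rates, "r+", markersize=12, label="Success rate (%)")
    ax2.set_ylabel("Success rate (%)")

    ax3 = ax1.twinx()  # third y-axis: duration (green - markers)
    ax3.spines["right"].set_position(("outward", 60))  # offset so the two right axes don't overlap
    ax3.plot(models, durations, "g_", markersize=12, label="Duration (s)")
    ax3.set_ylabel("Duration (s)")

    # Highlight model names in green when the success rate exceeds 90%
    for label, rate in zip(ax1.get_xticklabels(), success_rates):
        if rate > 90:
            label.set_color("green")

    fig.legend(loc="upper right")  # combined legend for all three metrics
    fig.tight_layout()
    plt.show()
```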

Running the Leaderboard

```
# View latest results
python lboard.py

# View specific results file
python lboard.py path/to/results.json
```

The generated plot includes:

  • Single plot per model (no duplicates)
  • Color-coded bars based on performance
  • Success rate indicators (red +)
  • Duration indicators (green -)
  • Dynamic axis scaling
  • Combined legend for all metrics

Output Format

Benchmark Run Output

For each model being tested, the output shows:

  1. Individual test runs (1-4) with:
    • Test case results
    • Performance metrics
    • Pass/fail status
  2. Cumulative results summary, displayed after all runs are completed, with:
    • Detailed test results per model
    • Individual test case counts
    • Validation status indicators
    • Overall success rates
    • Performance metrics
    • Plot position information

```
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
```

Error Handling

  • Handles missing test results
  • Processes validation failures appropriately
  • Maintains consistent case counting
  • Prevents data duplication
  • Ensures accurate success rate calculations