# Ollama Testing Framework Documentation

Version: 1.0
Last Updated: 2025-02-23
## Overview

The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:

1. Code correctness (test cases)
2. Performance metrics (inference time, tokens/sec)
3. Consistency across multiple runs
## Goals

1. Validate model responses for correctness and functionality
2. Measure and compare performance across different models
3. Provide detailed insights into model behavior and reliability
4. Enable easy comparison through a leaderboard system
## Core Components

### 1. Test Suite

The test suite consists of multiple coding challenges, each with:

- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs
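The exact data structures live in `main.py`; as a rough illustration only, a single test entry might look like the following (field names and values here are hypothetical, not the framework's actual schema):

```python
# Hypothetical shape of one test suite entry; names and values are illustrative only.
FIBONACCI_TEST = {
    "name": "Fibonacci Sequence",
    "prompt": "Write a Python function fibonacci(n) returning the n-th Fibonacci number.",
    "function_name": "fibonacci",   # used for function name verification
    "test_cases": [
        ((0,), 0),      # edge case: zero
        ((1,), 1),      # standard case
        ((10,), 55),    # standard case
    ],
}
```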
Current Test Cases:

a) Fibonacci Sequence
- Tests edge cases (negative, zero)
- Tests standard cases (n=1 to n=10)
- Validates performance for larger inputs

b) Binary Search
- Tests empty list case
- Tests element not found
- Tests finding elements at different positions

c) Palindrome Check
- Tests empty string
- Tests single character
- Tests various palindrome and non-palindrome cases

d) Anagram Check
- Tests empty strings
- Tests case sensitivity
- Tests strings with spaces and special characters
### 2. Inference Pipeline

#### Request Flow:

1. Format prompt with problem description
2. Send to Ollama API with timing start
3. Receive response and stop timing
4. Extract code from response
5. Validate code syntax
6. Run test cases
7. Calculate performance metrics
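As a rough sketch of steps 1-3, a non-streaming request against Ollama's standard `/api/generate` endpoint can be timed like this (the URL, payload fields, and helper name below are assumptions for illustration, not taken from `main.py`):

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # replace with your --server URL

def run_inference(model: str, prompt: str) -> dict:
    """Send a prompt to Ollama, time the round trip, and return the raw fields."""
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    duration_s = time.time() - start
    data = resp.json()
    return {
        "response": data.get("response", ""),   # model output to extract code from
        "eval_count": data.get("eval_count", 0),
        "duration_s": duration_s,
    }
```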
#### Performance Metrics:

- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (`eval_count`)
- Tokens per Second: Processing speed (tokens/duration)
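Assuming the metrics are derived from the fields Ollama returns with a non-streaming response (`eval_count`, plus `total_duration` in nanoseconds), the computation reduces to something like:

```python
def compute_metrics(data: dict) -> dict:
    """data: parsed JSON response from Ollama's /api/generate (non-streaming)."""
    duration_s = data.get("total_duration", 0) / 1e9   # Ollama reports nanoseconds
    tokens = data.get("eval_count", 0)
    return {
        "total_duration_s": duration_s,
        "total_tokens": tokens,
        "tokens_per_second": tokens / duration_s if duration_s else 0.0,
    }
```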
### 3. Validation System

#### Code Validation:

1. Syntax check (`is_valid_python`)
2. Function name verification
3. Test case execution
4. Together API integration for failure analysis
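A minimal sketch of what steps 1 and 2 can look like using the standard library (`has_function` is an illustrative helper, not necessarily the framework's own):

```python
import ast

def is_valid_python(code: str) -> bool:
    """Syntax check: does the generated code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def has_function(code: str, name: str) -> bool:
    """Function name verification: is the expected entry point defined?"""
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in ast.walk(ast.parse(code))
    )
```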
#### Test Results:

- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures
### 4. Benchmarking System

#### Benchmark Process:

1. Run multiple iterations (default: 4 runs)
2. Use last 3 runs for final metrics
3. Calculate averages across runs
4. Store detailed results in JSON
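In outline, the averaging step amounts to the sketch below (per-run field names are illustrative; `main.py` is the authoritative implementation):

```python
from statistics import mean

def summarize_runs(runs: list[dict]) -> dict:
    """Average metrics over the scoring runs.

    `runs` holds one dict per benchmark iteration (4 by default);
    the first run acts as a warm-up and is excluded from the averages.
    """
    scored = runs[-3:]
    return {
        "avg_duration_s": mean(r["duration_s"] for r in scored),
        "avg_tokens_per_sec": mean(r["tokens_per_sec"] for r in scored),
    }
```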
#### Metrics Tracked:

- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second
### 5. Leaderboard System

#### Ranking Algorithm:

1. Primary sort: Overall success rate
   - Calculated as (total passed cases / total cases) across all tests
2. Secondary sort: Tokens per second
   - Higher speed breaks ties between equal success rates
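Expressed as a sort key, the ranking looks roughly like this (per-model summary field names are illustrative):

```python
def rank(model_summaries: list[dict]) -> list[dict]:
    """Sort descending: success rate first, tokens/sec breaks ties."""
    return sorted(
        model_summaries,
        key=lambda m: (m["overall_success_rate"], m["tokens_per_sec"]),
        reverse=True,
    )
```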
#### Display Format:

```
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
```
## Usage

### Basic Run:

```bash
python3.10 main.py --server 'z60' --number '2'
```
### Options:

- `--model`: Specify a single model to test
- `--server`: Custom Ollama server URL
- `--runs`: Number of benchmark runs
### Output:

1. Real-time test progress
2. Performance metrics per inference
3. Test results summary
4. Final leaderboard
5. JSON results file with timestamp
## Results Storage

- Results saved as: `model_benchmark_YYYYMMDD_HHMMSS.json`
- Contains full test details and metrics
- Enables historical comparison and analysis
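The timestamped filename can be produced along these lines (a minimal sketch; the function name and result structure are illustrative):

```python
import json
from datetime import datetime

def save_results(benchmark_results: dict) -> str:
    """Write the full benchmark details to a timestamped JSON file."""
    filename = f"model_benchmark_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, "w") as f:
        json.dump(benchmark_results, f, indent=2)
    return filename
```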
## Error Handling

1. API communication errors
2. Code execution timeouts
3. Invalid responses
4. Test case failures
5. Performance metric calculation errors
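For item 2, one common pattern is to execute generated code in a subprocess with a hard timeout; the framework may handle this differently, but the idea is:

```python
import subprocess

def run_generated_tests(script_path: str, timeout_s: int = 10) -> bool:
    """Run a generated test script in a subprocess; treat a hang as a failure."""
    try:
        proc = subprocess.run(
            ["python3", script_path],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```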
## Future Improvements

1. Add more diverse test cases
2. Implement parallel testing
3. Add memory usage tracking
4. Create historical performance trends
5. Add code quality metrics
## Leaderboard Data Processing

### Test Results Processing

- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios
### Success Rate Calculations

- Calculates success rates against the expected total number of cases
- Counts failed validations as 0/expected_cases
- Uses the maximum observed test case count as the baseline
- Includes validation status in success rate reporting
- Ensures failed validations are never skipped in the total counts
### Performance Metrics

- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates success rate vs duration ratio
- Excellence criteria: >90% success AND success_rate > 5*duration
- Prevents duplicate model entries in metrics
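Put together, the aggregation described in the last two subsections amounts to something like the sketch below (field names are illustrative, not the framework's actual schema):

```python
def overall_success_rate(test_results: dict, expected_cases: dict) -> float:
    """Failed validations count as 0 passed out of the expected case total."""
    passed = total = 0
    for test_name, result in test_results.items():
        expected = expected_cases[test_name]   # max cases observed in successful runs
        passed += result["passed"] if result.get("validated", False) else 0
        total += expected
    return 100.0 * passed / total if total else 0.0

def is_excellent(success_rate_pct: float, duration_s: float) -> bool:
    """Excellence criteria: >90% success AND success rate greater than 5x duration."""
    return success_rate_pct > 90.0 and success_rate_pct > 5 * duration_s
```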
### Data Visualization

Model results are rendered as a combined bar/marker plot by `lboard.py`; see Visualization Features under the Development Notes below.

# Development Notes
## Project Structure

- `main.py`: Core benchmarking functionality
- `lboard.py`: Leaderboard visualization and results analysis
- `benchmark_results/`: Directory containing JSON benchmark results
## Visualization Features

- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison
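A rough sketch of how such a triple-axis plot can be built with matplotlib (the data and styling here are illustrative; `lboard.py` is the authoritative implementation):

```python
import matplotlib.pyplot as plt

# Illustrative data; real values come from the benchmark JSON files.
names = ["model_a", "model_b", "model_c"]
tokens_per_sec = [42.0, 31.5, 18.2]
success_rate = [95.0, 80.0, 92.5]
duration_s = [12.3, 18.7, 25.1]

x = range(len(names))
fig, ax1 = plt.subplots()

# Blue bars: tokens per second.
ax1.bar(x, tokens_per_sec, color="tab:blue", label="Tokens/sec")
ax1.set_ylabel("Tokens per second")
ax1.set_xticks(list(x))
ax1.set_xticklabels(names)

# Red + markers: overall success rate, on a second y-axis.
ax2 = ax1.twinx()
ax2.plot(x, success_rate, "r+", markersize=12, label="Success rate (%)")
ax2.set_ylabel("Success rate (%)")

# Green - markers: total duration, on a third (offset) y-axis.
ax3 = ax1.twinx()
ax3.spines["right"].set_position(("outward", 60))
ax3.plot(x, duration_s, "g_", markersize=12, label="Duration (s)")
ax3.set_ylabel("Duration (s)")

# Color model names green when the success rate exceeds 90%.
for label, rate in zip(ax1.get_xticklabels(), success_rate):
    if rate > 90:
        label.set_color("green")

fig.tight_layout()
plt.show()
```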
## Running the Leaderboard

```bash
# View latest results
python lboard.py

# View specific results file
python lboard.py path/to/results.json
```
The resulting plot includes:

- Single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- Combined legend for all metrics
## Output Format

### Benchmark Run Output

For each model being tested, the output shows:

1. Individual test runs (1-4) with:
   - Test case results
   - Performance metrics
   - Pass/fail status

2. Cumulative Results Summary (displayed after all runs are completed):
   - Detailed test results per model
   - Individual test case counts
   - Validation status indicators
   - Overall success rates
   - Performance metrics
   - Plot position information
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
### Error Handling

- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations