(Model comparison leaderboard image)

Codebench - Ollama Model Benchmark Tool

A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.

Components

  • Benchmarking Engine: main.py - Core benchmarking functionality with integrated plotting
  • Visualization Tool: lboard.py - Standalone visualization for benchmark results

Features

  • Test multiple Ollama models against common coding problems
  • Measure performance metrics (tokens/sec, response time)
  • Track success rates across different coding challenges
  • Support for local and remote Ollama servers
  • Automatic model download if not available locally (see the sketch after this list)
  • Detailed test results and leaderboard generation
  • CPU information tracking for benchmarks
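
For context, automatic model download can be done through Ollama's pull endpoint; the following is a minimal sketch (not necessarily how main.py does it), assuming a local server:

import requests

# Sketch of pulling a missing model through the Ollama HTTP API (assumes a local server).
# Recent Ollama versions accept "model" in the request body; older ones expect "name".
def ensure_model(model: str, base_url: str = "http://localhost:11434") -> None:
    resp = requests.post(
        f"{base_url}/api/pull",
        json={"model": model, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()

ensure_model("llama3")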

Prerequisites

  • Python 3.8+
  • Ollama server (local or remote)
  • Required Python packages (see Installation)
  • Together API key (optional, for advanced code analysis)

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/codebench.git
cd codebench
  2. Install required packages:
pip install -r requirements.txt

Or install the required packages manually:

pip install requests matplotlib py-cpuinfo
  3. (Optional) Set up Together API for advanced code analysis:
export TOGETHER_API_KEY='your_api_key_here'
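
If the key is not set, benchmarking still works; here is a minimal sketch of the kind of check involved (the exact behaviour is up to main.py):

import os

# Advanced code analysis is optional; without the key the benchmark simply skips it.
together_api_key = os.environ.get("TOGETHER_API_KEY")
if together_api_key is None:
    print("TOGETHER_API_KEY not set - running without advanced code analysis")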

Usage

Basic usage:

python3 main.py

Available options:

python3 main.py --server [local|z60] --model [model_name] --number [count|all] --verbose --plot-only --no-plot --file [results_file]

Arguments:

  • --server : Choose Ollama server (default: local)
  • --model : Test specific model only (will be automatically downloaded if not available locally)
  • --number : Number of models to test (a count, or all)
  • --verbose : Enable detailed output
  • --plot-only : Skip benchmarking and just generate graphs from existing results
  • --no-plot : Run benchmarking without plotting graphs at the end
  • --file : Specify a benchmark results file to use for plotting (only with --plot-only)
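
For reference, these flags map naturally onto argparse; the sketch below shows how the documented interface could be declared (main.py's actual parser may differ in details):

import argparse

# Sketch of the documented CLI; not necessarily identical to the parser in main.py.
parser = argparse.ArgumentParser(description="Codebench - Ollama model benchmark")
parser.add_argument("--server", default="local", help="Ollama server to use (e.g. local or z60)")
parser.add_argument("--model", help="benchmark a single model, pulling it if missing")
parser.add_argument("--number", default="all", help="number of models to test, or 'all'")
parser.add_argument("--verbose", action="store_true", help="enable detailed output")
parser.add_argument("--plot-only", action="store_true", help="only plot existing results")
parser.add_argument("--no-plot", action="store_true", help="skip plotting after benchmarks")
parser.add_argument("--file", help="results file to plot (with --plot-only)")
args = parser.parse_args()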

Supported Tests

The tool currently tests models on these coding challenges:

  1. Fibonacci Sequence
  2. Binary Search
  3. Palindrome Check
  4. Anagram Detection

Test Process & Validation

Code Generation

  1. Each model is prompted with specific coding tasks
  2. Generated code is extracted from the model's response
  3. Initial syntax validation is performed
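
In practice, a generation step like this can be driven through the Ollama HTTP API; the following is a minimal sketch (not the exact code in main.py), assuming the default local endpoint and a fenced Python code block in the reply:

import re
import requests

# Sketch of prompting a model and extracting the generated code.
def generate_code(model: str, task: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": task, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    answer = resp.json()["response"]
    # Prefer a fenced ```python block if present, otherwise return the raw answer.
    match = re.search(r"```(?:python)?\n(.*?)```", answer, re.DOTALL)
    return match.group(1) if match else answer

code = generate_code("codellama:13b", "Write a Python function fibonacci(n).")
print(code)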

Test Validation

For each test case:

  • Input values are provided to the function
  • Output is compared with expected results
  • Test results are marked as (pass) or (fail)

Example test cases:

Fibonacci:
- Input: 6      Expected: 8
- Input: 0      Expected: 0
- Input: -1     Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)    Expected: 2
- Input: ([], 1)             Expected: -1
- Input: ([1], 1)            Expected: 0
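
Conceptually, each case reduces to executing the generated code and comparing return values against the expected outputs. Below is a minimal sketch of that idea (not main.py's actual harness), using a hand-written fibonacci that satisfies the documented cases:

# Minimal sketch of per-case validation; the real harness differs in details.
generated_code = """
def fibonacci(n):
    if n < 0:
        return -1              # matches the documented expectation for negative input
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

cases = [(6, 8), (0, 0), (-1, -1)]       # (input, expected) pairs from the list above

namespace = {}
exec(generated_code, namespace)           # in the real tool this code comes from the model
fibonacci = namespace["fibonacci"]

for arg, expected in cases:
    result = fibonacci(arg)
    verdict = "pass" if result == expected else "fail"
    print(f"fibonacci({arg}) -> {result} (expected {expected}): {verdict}")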

Output

Results are saved in the benchmark_results directory with the following naming convention:

[CPU_Model]_[Server_Address].json

Example:

Apple_M1_Pro_localhost_11434.json
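
The naming convention can be reproduced with the py-cpuinfo package already listed in the requirements; this is a hypothetical reconstruction (the real code in main.py may sanitize names differently):

import re
import cpuinfo                        # provided by the py-cpuinfo package

# Hypothetical reconstruction of the naming convention, not the exact code in main.py.
cpu_model = cpuinfo.get_cpu_info().get("brand_raw", "unknown_cpu")
server_address = "localhost:11434"    # assumed default local Ollama address

def slug(text: str) -> str:
    # Replace anything that is not alphanumeric with underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")

print(f"benchmark_results/{slug(cpu_model)}_{slug(server_address)}.json")
# e.g. benchmark_results/Apple_M1_Pro_localhost_11434.json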

Visualizing Results

There are two ways to generate a visual leaderboard comparing model performance:

Option 1: Using main.py (Recommended)

By default, main.py automatically generates graphs after benchmarking. You can also use it to generate graphs without running benchmarks:

# Run benchmarks and generate graphs (default behavior)
python3 main.py

# Test a specific model (will be downloaded automatically if not available locally)
python3 main.py --model llama3

# Skip benchmarking and just generate graphs from the latest results
python3 main.py --plot-only

# Skip benchmarking and generate graphs from a specific results file
python3 main.py --plot-only --file path/to/results.json

# Run benchmarks without generating graphs
python3 main.py --no-plot

The plot will be saved as benchmark_results/model_comparison.png with high resolution (300 DPI).

Option 2: Using lboard.py (Legacy)

You can still use the standalone lboard.py script:

python3 lboard.py 

This will:

  • Automatically find the latest benchmark results (a possible lookup is sketched below)
  • Generate a graph showing:
    • Token processing speed (blue bars)
    • Success rates (red markers)
    • Duration ranges (green vertical lines)

You can also specify a specific results file:

python3 lboard.py path/to/results.json
# or
python3 lboard.py --file path/to/results.json
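
A minimal sketch of how the "latest results" lookup could work (lboard.py may implement it differently):

import glob
import os

# Pick the most recently modified results file in the benchmark_results directory.
candidates = glob.glob("benchmark_results/*.json")
if not candidates:
    raise SystemExit("no benchmark results found in benchmark_results/")
latest = max(candidates, key=os.path.getmtime)
print(f"Plotting {latest}")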

Visualization Features

The visualization includes:

  • Model performance comparison
  • Token processing speeds with min/max ranges
  • Success rates across all tests
  • Execution duration ranges
  • Color-coded model names (green for high performers)
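
For orientation, here is a rough matplotlib sketch of a chart combining these elements (blue bars, red markers, green duration lines). It uses made-up numbers purely for illustration and is not the actual lboard.py code:

import os
import matplotlib.pyplot as plt

# Made-up numbers for illustration; real values come from the benchmark JSON.
models = ["codellama:13b", "llama3"]
tokens_per_sec = [145.2, 98.7]
success_rates = [95.8, 87.5]            # percent of passed test cases
dur_min = [1.9, 2.5]                    # fastest run per model (seconds)
dur_max = [3.1, 4.0]                    # slowest run per model (seconds)

x = list(range(len(models)))
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x, tokens_per_sec, color="tab:blue")            # token speed as blue bars
ax.set_ylabel("Tokens/sec")
ax.set_xticks(x)
ax.set_xticklabels(models)

ax2 = ax.twinx()                                        # success rate as red markers
ax2.plot(x, success_rates, "r^", markersize=10)
ax2.set_ylabel("Success rate (%)")
ax2.set_ylim(0, 100)

ax3 = ax.twinx()                                        # duration range as green lines
ax3.spines["right"].set_position(("outward", 60))
ax3.vlines(x, dur_min, dur_max, color="green", linewidth=3)
ax3.set_ylabel("Duration (s)")

os.makedirs("benchmark_results", exist_ok=True)
fig.tight_layout()
fig.savefig("benchmark_results/model_comparison.png", dpi=300)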

Server Configuration

Default servers are configured directly in the code and selected with the --server option.
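
The exact mapping lives in main.py; a minimal sketch of what such a configuration might look like (the z60 address below is a placeholder, not the project's actual configuration):

# Hypothetical sketch of the mapping selected via --server.
SERVERS = {
    "local": "http://localhost:11434",
    "z60": "http://z60:11434",      # placeholder address
}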

Example Output

🏆 Final Model Leaderboard:

codellama:13b
   Overall Success Rate: 95.8% (23/24 cases)
   Average Tokens/sec: 145.23
   Average Duration: 2.34s
   Test Results:
   - Fibonacci: ✅ 6/6 cases (100.0%)
   - Binary Search: ✅ 6/6 cases (100.0%)

Output Files

The tool generates several output files in the benchmark_results directory:

  1. JSON Results File: [CPU_Model]_[Server_Address].json

    • Contains detailed benchmark results for all tested models
    • Used for later analysis and visualization
  2. Log File: [CPU_Model]_[Server_Address].log

    • Contains console output from the benchmark run
    • Useful for debugging and reviewing test details
  3. Plot Image: model_comparison.png

    • High-resolution (300 DPI) visualization of model performance
    • Shows token processing speed, success rates, and duration ranges

Recent Updates

March 2025 Updates

  • Added --plot-only option to skip benchmarking and directly generate plots
  • Added --no-plot option to run benchmarks without generating plots
  • Added --file option to specify a benchmark results file for plotting
  • Fixed plot generation to ensure high-quality output images
  • Improved visualization with better formatting and higher resolution
  • Updated documentation with comprehensive usage instructions

Troubleshooting

Common Issues

  1. Ollama Server Connection

    • Ensure your Ollama server is running and accessible (a quick connectivity check is sketched after this list)
    • Check the server URL in the --server option
  2. Missing Dependencies

    • Run pip install -r requirements.txt to install all required packages
    • Ensure matplotlib is properly installed for visualization
  3. Plot Generation

    • If plots appear empty, ensure you have the latest version of matplotlib
    • Check that the benchmark results file contains valid data
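
A quick way to verify connectivity is to hit the Ollama version endpoint; a minimal sketch assuming the default local address:

import requests

# Quick connectivity check against the Ollama HTTP API.
try:
    resp = requests.get("http://localhost:11434/api/version", timeout=5)
    resp.raise_for_status()
    print("Ollama server reachable:", resp.json())
except requests.RequestException as exc:
    print("Cannot reach the Ollama server:", exc)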

Contributing

Feel free to submit issues and enhancement requests!

License

CC BY-NC