Codebench - Ollama Model Benchmark Tool
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.
Components
- Benchmarking Engine: main.py - Core benchmarking functionality with integrated plotting
- Visualization Tool: lboard.py - Standalone visualization for benchmark results
Features
- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time) - see the sketch after this list
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks
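The speed and latency numbers come from the timing fields Ollama returns with each generation. The exact logic lives in main.py; below is a minimal sketch of how such metrics can be derived, assuming the non-streaming /api/generate endpoint (the function name and return layout are illustrative, not the tool's actual code):

import requests

OLLAMA_URL = "http://localhost:11434"  # default local server used by the tool

def benchmark_prompt(model: str, prompt: str) -> dict:
    """Send one prompt to Ollama and derive simple performance metrics."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    # Ollama reports durations in nanoseconds.
    eval_seconds = data.get("eval_duration", 0) / 1e9
    total_seconds = data.get("total_duration", 0) / 1e9
    tokens_per_sec = data.get("eval_count", 0) / eval_seconds if eval_seconds else 0.0
    return {
        "response": data.get("response", ""),
        "tokens_per_sec": tokens_per_sec,
        "duration_s": total_seconds,
    }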
Prerequisites
- Python 3.8+
- Ollama server (local or remote)
- Required Python packages (see Installation)
- Together API key (optional, for advanced code analysis)
Installation
- Clone the repository:
git clone https://github.com/yourusername/codebench.git
cd codebench
- Install required packages:
pip install -r requirements.txt
Or install the required packages manually:
pip install requests matplotlib py-cpuinfo
- (Optional) Set up Together API for advanced code analysis:
export TOGETHER_API_KEY='your_api_key_here'
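Since the key is optional, the tool can simply check for it at startup and skip the extra analysis when it is absent. A minimal sketch of such a lookup (illustrative, not necessarily how main.py does it):

import os

together_api_key = os.environ.get("TOGETHER_API_KEY")
if together_api_key is None:
    print("TOGETHER_API_KEY not set - skipping advanced code analysis")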
Usage
Basic usage:
python3 main.py
Available options:
python main.py --server [local|z60] --model [model_name] --number [count|all] --verbose --plot-only --no-plot --file [results_file]
Arguments:
- --server : Choose Ollama server (default: local)
- --model : Test specific model only
- --number : Number of models to test (a count, or all for every available model)
- --verbose : Enable detailed output
- --plot-only : Skip benchmarking and just generate graphs from existing results
- --no-plot : Run benchmarking without plotting graphs at the end
- --file : Specify a benchmark results file to use for plotting (only with --plot-only)
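These flags map naturally onto Python's argparse. A sketch of how the documented options might be declared (defaults and choices are inferred from the descriptions above, not copied from main.py):

import argparse

def parse_args() -> argparse.Namespace:
    """CLI flags mirroring the options documented above (sketch only)."""
    parser = argparse.ArgumentParser(description="Benchmark Ollama models on coding tasks")
    parser.add_argument("--server", choices=["local", "z60"], default="local",
                        help="Ollama server to benchmark against")
    parser.add_argument("--model", help="test a specific model only")
    parser.add_argument("--number", default="all", help="number of models to test, or 'all'")
    parser.add_argument("--verbose", action="store_true", help="enable detailed output")
    parser.add_argument("--plot-only", action="store_true",
                        help="skip benchmarking and only generate graphs")
    parser.add_argument("--no-plot", action="store_true",
                        help="run benchmarks without plotting graphs")
    parser.add_argument("--file", help="results file to plot (with --plot-only)")
    return parser.parse_args()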
Supported Tests
The tool currently tests models on these coding challenges:
- Fibonacci Sequence
- Binary Search
- Palindrome Check
- Anagram Detection
Test Process & Validation
Code Generation
- Each model is prompted with specific coding tasks
- Generated code is extracted from the model's response
- Initial syntax validation is performed
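Extraction typically means pulling the first fenced code block out of the response and rejecting anything that does not parse. A minimal sketch of that step (the helper name and fallback behaviour are illustrative):

import re

def extract_code(response_text: str) -> str:
    """Return the first fenced code block, or '' if it fails a syntax check."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    code = match.group(1) if match else response_text
    try:
        # Cheap syntax validation before the code is run against test cases.
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return ""  # treated as an automatic failure
    return code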
Test Validation
For each test case:
- Input values are provided to the function
- Output is compared with expected results
- Test results are marked as ✅ (pass) or ❌ (fail)
Example test cases:
Fibonacci:
- Input: 6 Expected: 8
- Input: 0 Expected: 0
- Input: -1 Expected: -1
Binary Search:
- Input: ([1,2,3,4,5], 3) Expected: 2
- Input: ([], 1) Expected: -1
- Input: ([1], 1) Expected: 0
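Those expected values translate directly into a comparison loop: call the generated function with each input and count matches. A minimal sketch, with a test table laid out to mirror the cases above (the structure is illustrative, not the tool's actual tables):

# Hypothetical test tables mirroring the cases listed above.
TESTS = {
    "fibonacci": [((6,), 8), ((0,), 0), ((-1,), -1)],
    "binary_search": [(([1, 2, 3, 4, 5], 3), 2), (([], 1), -1), (([1], 1), 0)],
}

def run_cases(func, cases):
    """Call the generated function on each input and count passing cases."""
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed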
Output
Results are saved in the benchmark_results directory with the following naming convention:
[CPU_Model]_[Server_Address].json
Example:
Apple_M1_Pro_localhost_11434.json
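The CPU model in the filename is what py-cpuinfo reports for the host. A sketch of how such a name can be assembled (the helper and sanitisation rules are illustrative; the tool's exact formatting may differ):

import cpuinfo  # py-cpuinfo, listed under Installation

def results_filename(server_url: str) -> str:
    """Build a [CPU_Model]_[Server_Address].json style name."""
    cpu = cpuinfo.get_cpu_info().get("brand_raw", "unknown_cpu").replace(" ", "_")
    host = server_url.replace("http://", "").replace(":", "_").replace(".", "_")
    return f"benchmark_results/{cpu}_{host}.json"

# results_filename("http://localhost:11434")
# -> "benchmark_results/Apple_M1_Pro_localhost_11434.json" on an M1 Pro machine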
Visualizing Results
There are two ways to generate a visual leaderboard comparing model performance:
Option 1: Using main.py (Recommended)
By default, main.py will now automatically generate graphs after benchmarking. You can also use it to just generate graphs without running benchmarks:
# Run benchmarks and generate graphs (default behavior)
python3 main.py
# Skip benchmarking and just generate graphs from the latest results
python3 main.py --plot-only
# Skip benchmarking and generate graphs from a specific results file
python3 main.py --plot-only --file path/to/results.json
# Run benchmarks without generating graphs
python3 main.py --no-plot
The plot will be saved as benchmark_results/model_comparison.png with high resolution (300 DPI).
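When --plot-only is used without --file, the newest results file in benchmark_results is picked up. One way to locate it, sketched with an illustrative helper name:

from pathlib import Path
from typing import Optional

def latest_results(results_dir: str = "benchmark_results") -> Optional[Path]:
    """Return the most recently modified results JSON, if any."""
    files = sorted(Path(results_dir).glob("*.json"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    return files[0] if files else None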
Option 2: Using lboard.py (Legacy)
You can still use the standalone lboard.py script:
python3 lboard.py
This will:
- Automatically find the latest benchmark results
- Generate a graph showing:
  - Token processing speed (blue bars)
  - Success rates (red markers)
  - Duration ranges (green vertical lines)
You can also point it at a specific results file:
python3 lboard.py path/to/results.json
# or
python3 lboard.py --file path/to/results.json
Visualization Features
The visualization includes:
- Model performance comparison
- Token processing speeds with min/max ranges
- Success rates across all tests
- Execution duration ranges
- Color-coded model names (green for high performers)
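lboard.py builds this chart with matplotlib. A rough sketch of the core plotting idea, with blue bars for speed and red markers for success rate on a secondary axis (duration-range lines and colour-coded names are omitted for brevity, and the function is illustrative rather than the script's actual code):

import matplotlib.pyplot as plt

def plot_leaderboard(names, tokens_per_sec, success_rates, out="model_comparison.png"):
    """Bar chart of token speed with success rate overlaid on a twin axis."""
    fig, ax_speed = plt.subplots(figsize=(10, 6))
    ax_speed.bar(names, tokens_per_sec, color="tab:blue", label="Tokens/sec")
    ax_speed.set_ylabel("Tokens/sec")

    ax_rate = ax_speed.twinx()  # second y-axis for percentages
    ax_rate.plot(names, success_rates, "ro", label="Success rate (%)")
    ax_rate.set_ylabel("Success rate (%)")
    ax_rate.set_ylim(0, 100)

    plt.setp(ax_speed.get_xticklabels(), rotation=30, ha="right")
    fig.tight_layout()
    fig.savefig(out, dpi=300)  # matches the 300 DPI output mentioned above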
Server Configuration
Default servers are configured in the code:
- Local: http://localhost:11434
- Z60: http://192.168.196.60:11434
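In code this is typically just a small lookup table keyed by the --server flag; a sketch (names are illustrative):

# Hypothetical mapping from the --server flag to the configured endpoints.
SERVERS = {
    "local": "http://localhost:11434",
    "z60": "http://192.168.196.60:11434",
}

def resolve_server(name: str) -> str:
    return SERVERS.get(name, SERVERS["local"])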
Example Output
🏆 Final Model Leaderboard:
codellama:13b
Overall Success Rate: 95.8% (23/24 cases)
Average Tokens/sec: 145.23
Average Duration: 2.34s
Test Results:
- Fibonacci: ✅ 6/6 cases (100.0%)
- Binary Search: ✅ 6/6 cases (100.0%)
Output Files
The tool generates several output files in the benchmark_results directory:
- JSON Results File: [CPU_Model]_[Server_Address].json
  - Contains detailed benchmark results for all tested models
  - Used for later analysis and visualization
- Log File: [CPU_Model]_[Server_Address].log
  - Contains console output from the benchmark run
  - Useful for debugging and reviewing test details
- Plot Image: model_comparison.png
  - High-resolution (300 DPI) visualization of model performance
  - Shows token processing speed, success rates, and duration ranges
Recent Updates
March 2025 Updates
- Added --plot-only option to skip benchmarking and directly generate plots
- Added --no-plot option to run benchmarks without generating plots
- Added --file option to specify a benchmark results file for plotting
- Fixed plot generation to ensure high-quality output images
- Improved visualization with better formatting and higher resolution
- Updated documentation with comprehensive usage instructions
Troubleshooting
Common Issues
- Ollama Server Connection
  - Ensure your Ollama server is running and accessible
  - Check the server URL selected by the --server option
- Missing Dependencies
  - Run pip install -r requirements.txt to install all required packages
  - Ensure matplotlib is properly installed for visualization
- Plot Generation
  - If plots appear empty, ensure you have the latest version of matplotlib
  - Check that the benchmark results file contains valid data
Contributing
Feel free to submit issues and enhancement requests!
License
CC BY-NC