# Codebench - Ollama Model Benchmark Tool
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.
## Features
- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time)
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks
## Prerequisites
- Python 3.8+
- Ollama server (local or remote)
- Together API key (optional, for advanced code analysis)
## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/codebench.git
cd codebench
```
2. Install required packages:
```bash
pip install -r requirements.txt
```
3. (Optional) Set up Together API:
```bash
export TOGETHER_API_KEY='your_api_key_here'
```
## Usage
Basic usage:
```bash
python3 main.py
```
Available options:
```bash
python3 main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
```
## Arguments
- `--server` : Choose the Ollama server (default: `local`)
- `--model` : Test a specific model only
- `--number` : Number of models to test (`all` to test every installed model)
- `--verbose` : Enable detailed output
## Supported Tests
The tool currently tests models on these coding challenges:
1. Fibonacci Sequence
2. Binary Search
3. Palindrome Check
4. Anagram Detection
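The real prompts and test cases are defined in the tool's source; a minimal sketch of how such a task suite could be structured (names and prompts here are illustrative, the Fibonacci and Binary Search cases mirror the examples below):

```python
# Illustrative task registry: each entry pairs a prompt with
# (args, expected) test cases. Identifiers are hypothetical,
# not the tool's actual internal names.
TESTS = {
    "fibonacci": {
        "prompt": "Write a Python function fibonacci(n) that returns the nth Fibonacci number.",
        "cases": [((6,), 8), ((0,), 0), ((-1,), -1)],
    },
    "binary_search": {
        "prompt": "Write a Python function binary_search(arr, target) that returns the index of target or -1.",
        "cases": [(([1, 2, 3, 4, 5], 3), 2), (([], 1), -1), (([1], 1), 0)],
    },
    "palindrome": {
        "prompt": "Write a Python function is_palindrome(s) that returns True or False.",
        "cases": [(("racecar",), True), (("hello",), False)],
    },
    "anagram": {
        "prompt": "Write a Python function is_anagram(a, b) that returns True or False.",
        "cases": [(("listen", "silent"), True), (("cat", "dog"), False)],
    },
}
```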
## Test Process & Validation
### Code Generation
1. Each model is prompted with specific coding tasks
2. Generated code is extracted from the model's response
3. Initial syntax validation is performed
### Test Validation
For each test case:
- Input values are provided to the function
- Output is compared with expected results
- Test results are marked as ✅ (pass) or ❌ (fail)
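One plausible shape of that comparison step, as a sketch (the tool's actual harness may differ):

```python
def run_cases(func, cases):
    """Run func against (args, expected) pairs; return a ✅/❌ mark per case."""
    marks = []
    for args, expected in cases:
        try:
            marks.append("✅" if func(*args) == expected else "❌")
        except Exception:
            marks.append("❌")  # a crash in generated code counts as a failure
    return marks
```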
Example test cases:
```plaintext
Fibonacci:
- Input: 6 Expected: 8
- Input: 0 Expected: 0
- Input: -1 Expected: -1
Binary Search:
- Input: ([1,2,3,4,5], 3) Expected: 2
- Input: ([], 1) Expected: -1
- Input: ([1], 1) Expected: 0
```
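For reference, implementations that satisfy the cases above look like this (note the convention that invalid Fibonacci input returns -1):

```python
def fibonacci(n):
    """Return the nth Fibonacci number (0-indexed); -1 signals invalid input."""
    if n < 0:
        return -1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```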
## Output
Results are saved in the `benchmark_results` directory with the following naming convention:
```plaintext
[CPU_Model]_[Server_Address].json
```
Example:
```plaintext
Apple_M1_Pro_localhost_11434.json
```
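A sketch of how such a filename could be built (hypothetical helper; the tool's actual CPU-model lookup may use a platform-specific call rather than `platform.processor()`):

```python
import platform
import re

def result_filename(server_url):
    """Build a '[CPU_Model]_[Server_Address].json' name, with unsafe chars replaced."""
    cpu = platform.processor() or "unknown_cpu"
    safe_cpu = re.sub(r"\W+", "_", cpu).strip("_")
    # Strip the scheme, then turn ':' and '.' into '_' for a filesystem-safe name
    host = re.sub(r"^https?://", "", server_url).replace(":", "_").replace(".", "_")
    return f"{safe_cpu}_{host}.json"
```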
## Server Configuration
Default servers are configured in the code:
- Local: http://localhost:11434
- Z60: http://192.168.196.60:11434
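A minimal sketch of that configuration, plus a reachability check against Ollama's `GET /api/tags` endpoint (which lists installed models); the function name is illustrative:

```python
import urllib.request

SERVERS = {
    "local": "http://localhost:11434",
    "z60": "http://192.168.196.60:11434",
}

def is_reachable(url, timeout=2):
    """Return True if an Ollama server answers GET /api/tags at this URL."""
    try:
        with urllib.request.urlopen(f"{url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```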
## Example Output
```plaintext
🏆 Final Model Leaderboard:
codellama:13b
Overall Success Rate: 95.8% (23/24 cases)
Average Tokens/sec: 145.23
Average Duration: 2.34s
Test Results:
- Fibonacci: ✅ 6/6 cases (100.0%)
- Binary Search: ✅ 6/6 cases (100.0%)
```
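The leaderboard above can be produced by ranking models on overall success rate; a sketch (the result-dict shape is assumed, not the tool's actual schema):

```python
def leaderboard(results):
    """Rank models by pass rate and print a summary; returns the ranked names.

    results: {model_name: {"passed": int, "total": int, "tps": float, "secs": float}}
    """
    ranked = sorted(
        results.items(),
        key=lambda kv: kv[1]["passed"] / kv[1]["total"],
        reverse=True,
    )
    print("🏆 Final Model Leaderboard:")
    for model, r in ranked:
        rate = 100 * r["passed"] / r["total"]
        print(f"{model}")
        print(f"  Overall Success Rate: {rate:.1f}% ({r['passed']}/{r['total']} cases)")
        print(f"  Average Tokens/sec: {r['tps']:.2f}")
        print(f"  Average Duration: {r['secs']:.2f}s")
    return [model for model, _ in ranked]
```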
## Contributing
Feel free to submit issues and enhancement requests!
## License
[Your chosen license]