# Codebench - Ollama Model Benchmark Tool
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.
## Features
- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time)
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks

## Prerequisites
- Python 3.8+
- Ollama server (local or remote)
- Together API key (optional, for advanced code analysis)

## Installation
1. Clone the repository:

```bash
git clone https://github.com/yourusername/codebench.git
cd codebench
```

2. Install required packages:

```bash
pip install -r requirements.txt
```

3. (Optional) Set up the Together API key:

```bash
export TOGETHER_API_KEY='your_api_key_here'
```

## Usage

Basic usage:

```bash
python3 main.py
```

Available options:

```bash
python3 main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
```

## Arguments

- `--server`: Ollama server to use (default: `local`)
- `--model`: Test a specific model only
- `--number`: Number of models to test, or `all`
- `--verbose`: Enable detailed output

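
A minimal `argparse` sketch matching the flags above (illustrative only; the tool's actual parser may differ):

```python
import argparse

# Illustrative parser mirroring the documented flags; not the tool's exact code.
parser = argparse.ArgumentParser(description="Benchmark Ollama models on coding tasks")
parser.add_argument("--server", choices=["local", "z60"], default="local",
                    help="Ollama server to use (default: local)")
parser.add_argument("--model", help="Test a specific model only")
parser.add_argument("--number", default="all",
                    help="Number of models to test, or 'all'")
parser.add_argument("--verbose", action="store_true",
                    help="Enable detailed output")
```
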
## Supported Tests

The tool currently tests models on these coding challenges:

1. Fibonacci Sequence
2. Binary Search
3. Palindrome Check
4. Anagram Detection

## Test Process & Validation

### Code Generation

1. Each model is prompted with a specific coding task
2. Generated code is extracted from the model's response
3. An initial syntax validation is performed

### Test Validation

For each test case:

- Input values are passed to the generated function
- The output is compared with the expected result
- Each case is marked ✅ (pass) or ❌ (fail)

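
The validation loop above might be sketched like this (hypothetical helper; names are illustrative, not the tool's actual code):

```python
# Hypothetical per-case validation loop; `func` stands for the function
# extracted from a model's generated code.
def run_cases(func, cases):
    marks = []
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception:
            got = None  # a crash counts as a failed case
        marks.append("✅" if got == expected else "❌")
    return marks
```
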
Example test cases:

```plaintext
Fibonacci:
- Input: 6   Expected: 8
- Input: 0   Expected: 0
- Input: -1  Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)  Expected: 2
- Input: ([], 1)           Expected: -1
- Input: ([1], 1)          Expected: 0
```

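
For reference, implementations consistent with the first two sets of cases could look like this (illustrative; each model generates its own code during a run):

```python
def fibonacci(n):
    """Return the n-th Fibonacci number; -1 signals invalid input."""
    if n < 0:
        return -1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
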
## Output

Results are saved in the `benchmark_results` directory with the following naming convention:

```plaintext
[CPU_Model]_[Server_Address].json
```

Example:

```plaintext
Apple_M1_Pro_localhost_11434.json
```

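
The naming scheme can be reproduced with a small sanitizer (an assumption about how the names are built, not the tool's actual code):

```python
import re

def result_filename(cpu_model: str, server_address: str) -> str:
    """Collapse runs of non-alphanumeric characters to underscores,
    matching the example filename above (assumed scheme)."""
    def sanitize(s):
        return re.sub(r"[^A-Za-z0-9]+", "_", s).strip("_")
    return f"{sanitize(cpu_model)}_{sanitize(server_address)}.json"
```
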
## Server Configuration

Default servers are configured in the code:

- Local: `http://localhost:11434`
- Z60: `http://192.168.196.60:11434`

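
In code, the server map presumably looks something like this (illustrative; edit the entries to point at your own hosts):

```python
# Illustrative server map keyed by the --server choices.
SERVERS = {
    "local": "http://localhost:11434",
    "z60": "http://192.168.196.60:11434",
}

def resolve_server(name: str) -> str:
    """Map a --server choice to its base URL."""
    try:
        return SERVERS[name]
    except KeyError:
        raise ValueError(f"Unknown server: {name!r}") from None
```
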
## Example Output

```plaintext
🏆 Final Model Leaderboard:

codellama:13b
Overall Success Rate: 95.8% (23/24 cases)
Average Tokens/sec: 145.23
Average Duration: 2.34s
Test Results:
- Fibonacci: ✅ 6/6 cases (100.0%)
- Binary Search: ✅ 6/6 cases (100.0%)
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

[Your chosen license]