
Codebench - Ollama Model Benchmark Tool

A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.

Features

  • Test multiple Ollama models against common coding problems
  • Measure performance metrics (tokens/sec, response time)
  • Track success rates across different coding challenges
  • Support for local and remote Ollama servers
  • Detailed test results and leaderboard generation
  • CPU information tracking for benchmarks

Prerequisites

  • Python 3.8+
  • Ollama server (local or remote)
  • Together API key (optional, for advanced code analysis)

Installation

  1. Clone the repository:

git clone https://github.com/yourusername/codebench.git
cd codebench

  2. Install required packages:

pip install -r requirements.txt

  3. (Optional) Set up your Together API key:

export TOGETHER_API_KEY='your_api_key_here'

Usage

Basic usage:

python3 main.py

Available options:

python3 main.py --server [local|z60] --model [model_name] --number [count|all] --verbose

Arguments:

  • --server : Ollama server to use, local or z60 (default: local)
  • --model : Test only the named model
  • --number : Number of models to test, a count or all
  • --verbose : Enable detailed output
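A minimal sketch of how these options could be parsed with argparse (the flag names match the list above; the defaults and choices shown here are assumptions, not necessarily the tool's actual values):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the README; defaults here are illustrative assumptions.
    parser = argparse.ArgumentParser(
        description="Benchmark Ollama models on coding tasks")
    parser.add_argument("--server", choices=["local", "z60"], default="local",
                        help="Ollama server to benchmark against")
    parser.add_argument("--model", default=None,
                        help="test only this model (default: all models)")
    parser.add_argument("--number", default="all",
                        help="number of models to test, or 'all'")
    parser.add_argument("--verbose", action="store_true",
                        help="enable detailed output")
    return parser
```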

Supported Tests

The tool currently tests models on these coding challenges:

  1. Fibonacci Sequence
  2. Binary Search
  3. Palindrome Check
  4. Anagram Detection

Test Process & Validation

Code Generation

  1. Each model is prompted with specific coding tasks
  2. Generated code is extracted from the model's response
  3. Initial syntax validation is performed
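Steps 2 and 3 can be sketched as a small helper that pulls the first fenced code block out of a response and runs Python's own compiler over it as the initial syntax check (the actual extraction logic in the tool may differ):

```python
import re
from typing import Optional

TICKS = "`" * 3  # a markdown code fence
FENCE_RE = re.compile(TICKS + r"(?:python)?\s*\n(.*?)" + TICKS, re.DOTALL)

def extract_code(response: str) -> Optional[str]:
    """Return the first fenced code block (or the raw response if no
    fence is found), but only if it passes a syntax check."""
    match = FENCE_RE.search(response)
    code = match.group(1) if match else response
    try:
        compile(code, "<generated>", "exec")  # initial syntax validation
    except SyntaxError:
        return None
    return code
```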

Test Validation

For each test case:

  • Input values are provided to the function
  • Output is compared with expected results
  • Test results are marked as (pass) or (fail)

Example test cases:

Fibonacci:
- Input: 6      Expected: 8
- Input: 0      Expected: 0
- Input: -1     Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)    Expected: 2
- Input: ([], 1)             Expected: -1
- Input: ([1], 1)            Expected: 0
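The comparison step above can be sketched as a simple harness; the `fib` function below is an illustrative reference solution for the Fibonacci cases (the real benchmark runs model-generated code instead):

```python
def run_cases(func, cases):
    """Call func on each input and compare the result to the expected value.
    Multi-argument inputs are passed as tuples, e.g. ([1, 2, 3], 1)."""
    results = []
    for args, expected in cases:
        call_args = args if isinstance(args, tuple) else (args,)
        try:
            results.append(func(*call_args) == expected)
        except Exception:
            results.append(False)  # a crashing solution counts as a failed case
    return results

# Reference solution matching the Fibonacci cases listed above.
def fib(n):
    if n < 0:
        return -1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```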

Output

Results are saved in the benchmark_results directory with the following naming convention:

[CPU_Model]_[Server_Address].json

Example:

Apple_M1_Pro_localhost_11434.json
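A hedged sketch of how such a filename might be derived from the CPU model and server address (the exact sanitization rules are an assumption):

```python
import re

def result_filename(cpu_model: str, server_address: str) -> str:
    """Build '[CPU_Model]_[Server_Address].json', replacing characters
    that are unsafe in filenames with underscores."""
    raw = f"{cpu_model}_{server_address}"
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return f"{safe}.json"
```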

Server Configuration

Default server addresses are defined directly in the code; edit them there to add or change Ollama endpoints.

Example Output

🏆 Final Model Leaderboard:

codellama:13b
   Overall Success Rate: 95.8% (23/24 cases)
   Average Tokens/sec: 145.23
   Average Duration: 2.34s
   Test Results:
   - Fibonacci: ✅ 6/6 cases (100.0%)
   - Binary Search: ✅ 6/6 cases (100.0%)
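The per-model summary lines above can be produced by a small aggregation step over the individual test runs; a sketch, where the dictionary keys (`passed`, `total`, `tokens_per_sec`, `duration_s`) are assumed names rather than the tool's actual schema:

```python
def summarize(runs):
    """Aggregate per-test runs into leaderboard metrics.
    Each run dict uses assumed keys: passed, total, tokens_per_sec, duration_s."""
    passed = sum(r["passed"] for r in runs)
    total = sum(r["total"] for r in runs)
    n = len(runs)
    return {
        "success_rate": 100.0 * passed / total if total else 0.0,
        "avg_tokens_per_sec": sum(r["tokens_per_sec"] for r in runs) / n,
        "avg_duration_s": sum(r["duration_s"] for r in runs) / n,
    }
```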

Contributing

Feel free to submit issues and enhancement requests!

License

[Your chosen license]