# Codebench - Ollama Model Benchmark Tool
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.
## Features
- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time)
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks

## Prerequisites
- Python 3.8+
- Ollama server (local or remote)
- Together API key (optional, for advanced code analysis)

## Installation
1. Clone the repository:

```bash
git clone https://github.com/yourusername/codebench.git
cd codebench
```

2. Install required packages:

```bash
pip install -r requirements.txt
```

3. (Optional) Set up the Together API key:

```bash
export TOGETHER_API_KEY='your_api_key_here'
```

## Usage

Basic usage:

```bash
python3 main.py
```

Available options:

```bash
python3 main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
```

## Arguments

- `--server`: Ollama server to use (default: `local`)
- `--model`: Test a specific model only
- `--number`: Number of models to test, or `all`
- `--verbose`: Enable detailed output

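
A minimal `argparse` sketch matching the flags above (illustrative only; the tool's actual parser may differ):

```python
import argparse

# Illustrative parser mirroring the documented flags; not the tool's exact code.
parser = argparse.ArgumentParser(description="Benchmark Ollama models on coding tasks")
parser.add_argument("--server", choices=["local", "z60"], default="local",
                    help="Ollama server to use (default: local)")
parser.add_argument("--model", help="Test a specific model only")
parser.add_argument("--number", default="all",
                    help="Number of models to test, or 'all'")
parser.add_argument("--verbose", action="store_true",
                    help="Enable detailed output")
```
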
## Supported Tests

The tool currently tests models on these coding challenges:

1. Fibonacci Sequence
2. Binary Search
3. Palindrome Check
4. Anagram Detection

## Test Process & Validation

### Code Generation

1. Each model is prompted with a specific coding task
2. Generated code is extracted from the model's response
3. An initial syntax validation is performed

### Test Validation

For each test case:

- Input values are passed to the generated function
- The output is compared with the expected result
- Each case is marked ✅ (pass) or ❌ (fail)

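
The validation loop above might be sketched like this (hypothetical helper; names are illustrative, not the tool's actual code):

```python
# Hypothetical per-case validation loop; `func` stands for the function
# extracted from a model's generated code.
def run_cases(func, cases):
    marks = []
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception:
            got = None  # a crash counts as a failed case
        marks.append("✅" if got == expected else "❌")
    return marks
```
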
Example test cases:

```plaintext
Fibonacci:
- Input: 6   Expected: 8
- Input: 0   Expected: 0
- Input: -1  Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)  Expected: 2
- Input: ([], 1)           Expected: -1
- Input: ([1], 1)          Expected: 0
```

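
For reference, implementations consistent with the first two sets of cases could look like this (illustrative; each model generates its own code during a run):

```python
def fibonacci(n):
    """Return the n-th Fibonacci number; -1 signals invalid input."""
    if n < 0:
        return -1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
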
## Output

Results are saved in the `benchmark_results` directory with the following naming convention:

```plaintext
[CPU_Model]_[Server_Address].json
```

Example:

```plaintext
Apple_M1_Pro_localhost_11434.json
```

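
The naming scheme can be reproduced with a small sanitizer (an assumption about how the names are built, not the tool's actual code):

```python
import re

def result_filename(cpu_model: str, server_address: str) -> str:
    """Collapse runs of non-alphanumeric characters to underscores,
    matching the example filename above (assumed scheme)."""
    def sanitize(s):
        return re.sub(r"[^A-Za-z0-9]+", "_", s).strip("_")
    return f"{sanitize(cpu_model)}_{sanitize(server_address)}.json"
```
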
## Server Configuration

Default servers are configured in the code:

- Local: `http://localhost:11434`
- Z60: `http://192.168.196.60:11434`

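
In code, the server map presumably looks something like this (illustrative; edit the entries to point at your own hosts):

```python
# Illustrative server map keyed by the --server choices.
SERVERS = {
    "local": "http://localhost:11434",
    "z60": "http://192.168.196.60:11434",
}

def resolve_server(name: str) -> str:
    """Map a --server choice to its base URL."""
    try:
        return SERVERS[name]
    except KeyError:
        raise ValueError(f"Unknown server: {name!r}") from None
```
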
## Example Output

```plaintext
🏆 Final Model Leaderboard:

codellama:13b
Overall Success Rate: 95.8% (23/24 cases)
Average Tokens/sec: 145.23
Average Duration: 2.34s
Test Results:
- Fibonacci: ✅ 6/6 cases (100.0%)
- Binary Search: ✅ 6/6 cases (100.0%)
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

[Your chosen license]