Leaderboard picture and bug fixes
parent dedbeceb8e
commit f538ed1bd3
README.md (122 changed lines)
@@ -1,6 +1,11 @@
|
||||
# Codebench - Ollama Model Benchmark Tool
|
||||
|
||||
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.
|
||||
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.
|
||||
|
||||
## Components
|
||||
- **Benchmarking Engine**: `main.py` - Core benchmarking functionality with integrated plotting
|
||||
- **Visualization Tool**: `lboard.py` - Standalone visualization for benchmark results
|
||||
|
||||
|
||||
## Features
|
||||
|
||||
@@ -15,6 +20,7 @@ A Python-based benchmarking tool for testing and comparing different Ollama mode
|
||||
|
||||
- Python 3.8+
|
||||
- Ollama server (local or remote)
|
||||
- Required Python packages (see Installation)
|
||||
- Together API key (optional, for advanced code analysis)
|
||||
|
||||
## Installation
|
||||
@@ -23,17 +29,21 @@ A Python-based benchmarking tool for testing and comparing different Ollama mode
|
||||
```bash
|
||||
git clone https://github.com/yourusername/codebench.git
|
||||
cd codebench
|
||||
|
||||
```
|
||||
|
||||
2. Install required packages:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
Or install the required packages manually:
|
||||
```bash
|
||||
pip install requests matplotlib py-cpuinfo
|
||||
```
|
||||
|
||||
3. (Optional) Set up Together API:
|
||||
3. (Optional) Set up Together API for advanced code analysis:
|
||||
```bash
|
||||
export TOGETHER_API_KEY='your_api_key_here'
|
||||
```
|
||||
|
||||
|
||||
## Usage
|
||||
Basic usage:
|
||||
@@ -45,7 +55,7 @@ python3 main.py
|
||||
Available options:
|
||||
|
||||
```bash
|
||||
python main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
|
||||
python main.py --server [local|z60] --model [model_name] --number [count|all] --verbose --plot-only --no-plot --file [results_file]
|
||||
```
|
||||
|
||||
## Arguments
|
||||
@@ -54,6 +64,9 @@ python main.py --server [local|z60] --model [model_name] --number [count|all] --
|
||||
- --model : Test specific model only
|
||||
- --number : Number of models to test
|
||||
- --verbose : Enable detailed output
|
||||
- --plot-only : Skip benchmarking and just generate graphs from existing results
|
||||
- --no-plot : Run benchmarking without plotting graphs at the end
|
||||
- --file : Specify a benchmark results file to use for plotting (only with --plot-only)
|
||||
|
||||
## Supported Tests
|
||||
The tool currently tests models on these coding challenges:
|
||||
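The four challenges (Fibonacci, Binary Search, Palindrome, Anagram Check) are defined as lists of `(input, expected)` tuples that the harness runs against the model-generated function. A minimal sketch of that shape, using the Fibonacci cases from `main.py` shown later in this commit; the `check` helper below is illustrative, not the repo's validator:

```python
# Test cases are (input, expected) pairs; these Fibonacci cases mirror main.py.
fibonacci_test_cases = [
    (0, 0),    # edge case: n = 0
    (1, 1),    # edge case: n = 1
    (2, 1),    # edge case: n = 2
    (6, 8),    # regular case
    (10, 55),  # larger number
    (-1, -1),  # edge case: negative input
]

def check(candidate_fn, test_cases):
    """Return the fraction of test cases a single-argument candidate passes."""
    passed = sum(1 for arg, expected in test_cases if candidate_fn(arg) == expected)
    return passed / len(test_cases)
```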
@@ -94,7 +107,7 @@ Results are saved in the benchmark_results directory with the following naming c
|
||||
|
||||
```plaintext
|
||||
[CPU_Model]_[Server_Address].json
|
||||
```
|
||||
|
||||
|
||||
Example:
|
||||
|
||||
@@ -102,6 +115,57 @@ Example:
|
||||
Apple_M1_Pro_localhost_11434.json
|
||||
```
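This naming convention is produced by `update_server_results` (its code appears later in this commit): the CPU brand reported by `py-cpuinfo` and a sanitized server address are joined into the file name. A condensed sketch of that logic:

```python
import os
from cpuinfo import get_cpu_info  # py-cpuinfo package

def results_filename(server_url: str, results_dir: str = "benchmark_results") -> str:
    """Build the [CPU_Model]_[Server_Address].json path used for benchmark results."""
    cpu_brand = get_cpu_info().get('brand_raw', 'Unknown_CPU').replace(' ', '_')
    server_id = server_url.replace('http://', '').replace(':', '_').replace('/', '_')
    return os.path.join(results_dir, f"{cpu_brand}_{server_id}.json")

# results_filename("http://localhost:11434") yields something like
# "benchmark_results/Apple_M1_Pro_localhost_11434.json" on an Apple M1 Pro.
```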
|
||||
|
||||
## Visualizing Results
|
||||
There are two ways to generate a leaderboard-style visual comparison of model performance:
|
||||
|
||||
### Option 1: Using main.py (Recommended)
|
||||
By default, main.py will now automatically generate graphs after benchmarking. You can also use it to just generate graphs without running benchmarks:
|
||||
|
||||
```bash
|
||||
# Run benchmarks and generate graphs (default behavior)
|
||||
python3 main.py
|
||||
|
||||
# Skip benchmarking and just generate graphs from the latest results
|
||||
python3 main.py --plot-only
|
||||
|
||||
# Skip benchmarking and generate graphs from a specific results file
|
||||
python3 main.py --plot-only --file path/to/results.json
|
||||
|
||||
# Run benchmarks without generating graphs
|
||||
python3 main.py --no-plot
|
||||
```
|
||||
|
||||
The plot will be saved as `benchmark_results/model_comparison.png` with high resolution (300 DPI).
|
||||
|
||||
### Option 2: Using lboard.py (Legacy)
|
||||
You can still use the standalone lboard.py script:
|
||||
|
||||
```bash
|
||||
python3 lboard.py
|
||||
```
|
||||
This will:
|
||||
|
||||
- Automatically find the latest benchmark results (a lookup sketch follows this list)
|
||||
- Generate a graph showing:
|
||||
- Token processing speed (blue bars)
|
||||
- Success rates (red markers)
|
||||
- Duration ranges (green vertical lines)
|
||||
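A minimal sketch of how the "latest results" lookup can work; the repo's actual helper is `get_latest_json_file` in `main.py` (its body is not shown in this diff), so treat the function below as an illustrative stand-in:

```python
import glob
import os
from typing import Optional

def latest_results_file(results_dir: str = "benchmark_results") -> Optional[str]:
    """Return the most recently modified .json file in the results directory, if any."""
    candidates = glob.glob(os.path.join(results_dir, "*.json"))
    return max(candidates, key=os.path.getmtime) if candidates else None
```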
|
||||
You can also pass lboard.py a specific results file:
|
||||
|
||||
```bash
|
||||
python3 lboard.py path/to/results.json
|
||||
# or
|
||||
python3 lboard.py --file path/to/results.json
|
||||
```
|
||||
## Visualization Features
|
||||
The visualization includes the following; a minimal plotting sketch follows the list:
|
||||
- Model performance comparison
|
||||
- Token processing speeds with min/max ranges
|
||||
- Success rates across all tests
|
||||
- Execution duration ranges
|
||||
- Color-coded model names (green for high performers)
|
||||
|
||||
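For reference, a minimal, self-contained sketch of this style of chart: tokens/sec as bars on one axis and success rate as markers on a twin axis. The model names and numbers are made up; the repo's real plotting code lives in `main.py` and `lboard.py`:

```python
import matplotlib.pyplot as plt

# Made-up sample data; real values come from the benchmark JSON files.
models = ["model-a", "model-b", "model-c"]
tokens_per_sec = [29.4, 19.7, 9.0]
success_rates = [90.3, 65.3, 100.0]

fig, ax_tokens = plt.subplots(figsize=(8, 4))
ax_tokens.bar(models, tokens_per_sec, color="tab:blue")
ax_tokens.set_ylabel("Tokens per second")

ax_success = ax_tokens.twinx()                # second y-axis for success rate
ax_success.plot(models, success_rates, "ro")  # red markers, as in the leaderboard
ax_success.set_ylabel("Success rate (%)")
ax_success.set_ylim(0, 110)

fig.tight_layout()
fig.savefig("model_comparison.png", dpi=300)  # README mentions 300 DPI output
```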
## Server Configuration
|
||||
Default servers are configured in the code:
|
||||
|
||||
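"Configured in the code" concretely means a small dictionary in `main.py`; the same `SERVERS` mapping appears verbatim in the deleted `main copie.py` further down this page:

```python
# Server configurations (from main.py); adjust addresses for your own setup.
SERVERS = {
    'local': 'http://localhost:11434',
    'z60': 'http://192.168.196.60:11434'
}
```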
@@ -122,8 +186,50 @@ codellama:13b
|
||||
```
|
||||
|
||||
|
||||
## Output Files
|
||||
|
||||
The tool generates several output files in the `benchmark_results` directory:
|
||||
|
||||
1. **JSON Results File**: `[CPU_Model]_[Server_Address].json`
|
||||
- Contains detailed benchmark results for all tested models
|
||||
- Used for later analysis and visualization (see the loading sketch after this list)
|
||||
|
||||
2. **Log File**: `[CPU_Model]_[Server_Address].log`
|
||||
- Contains console output from the benchmark run
|
||||
- Useful for debugging and reviewing test details
|
||||
|
||||
3. **Plot Image**: `model_comparison.png`
|
||||
- High-resolution (300 DPI) visualization of model performance
|
||||
- Shows token processing speed, success rates, and duration ranges
|
||||
|
||||
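Each JSON results file is an append-only log of runs, so it can be read back for ad-hoc analysis. A small loading sketch; the field names match the JSON shown elsewhere in this commit, and the path is the README's own example:

```python
import json

# Path follows the [CPU_Model]_[Server_Address].json convention described above.
with open("benchmark_results/Apple_M1_Pro_localhost_11434.json") as f:
    data = json.load(f)

# Every entry in "benchmarks" is one run: a timestamp plus per-model results.
for run in data["benchmarks"]:
    for model_result in run["results"]:
        print(run["timestamp"],
              model_result["model"],
              f"{model_result['overall_success_rate']:.1f}%",
              f"{model_result['tokens_per_second']:.1f} tok/s")
```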
## Recent Updates
|
||||
|
||||
### March 2025 Updates
|
||||
- Added `--plot-only` option to skip benchmarking and directly generate plots
|
||||
- Added `--no-plot` option to run benchmarks without generating plots
|
||||
- Added `--file` option to specify a benchmark results file for plotting
|
||||
- Fixed plot generation to ensure high-quality output images
|
||||
- Improved visualization with better formatting and higher resolution
|
||||
- Updated documentation with comprehensive usage instructions
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Ollama Server Connection**
|
||||
- Ensure your Ollama server is running and accessible
|
||||
- Check the server URL in the `--server` option
|
||||
|
||||
2. **Missing Dependencies**
|
||||
- Run `pip install -r requirements.txt` to install all required packages
|
||||
- Ensure matplotlib is properly installed for visualization
|
||||
|
||||
3. **Plot Generation**
|
||||
- If plots appear empty, ensure you have the latest version of matplotlib
|
||||
- Check that the benchmark results file contains valid data
|
||||
|
||||
## Contributing
|
||||
Feel free to submit issues and enhancement requests!
|
||||
|
||||
## License
|
||||
[Your chosen license]
|
||||
CC BY-NC
|
@@ -998,6 +998,251 @@
|
||||
"max_avg_duration": 12.908918361333333,
|
||||
"min_tokens_per_second": 18.377766002186945,
|
||||
"max_tokens_per_second": 18.9448229322312
|
||||
},
|
||||
{
|
||||
"model": "phi4-mini:latest",
|
||||
"total_duration": 10.860303611333332,
|
||||
"tokens_per_second": 29.361579428697542,
|
||||
"test_results": {
|
||||
"Fibonacci": {
|
||||
"success_rate": 61.111111111111114,
|
||||
"passed_cases": 11,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.6111111111111112,
|
||||
"avg_duration": 10.860303611333332,
|
||||
"avg_tokens_sec": 29.361579428697542
|
||||
},
|
||||
"Binary Search": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 10.22926025,
|
||||
"avg_tokens_sec": 29.360358027471495
|
||||
},
|
||||
"Palindrome": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 7.7338954719999995,
|
||||
"avg_tokens_sec": 29.349959100715157
|
||||
},
|
||||
"Anagram Check": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 9.66612725,
|
||||
"avg_tokens_sec": 29.794841927435822
|
||||
}
|
||||
},
|
||||
"overall_success_rate": 90.27777777777779,
|
||||
"overall_success_cases_rate": 0.9027777777777778,
|
||||
"min_avg_duration": 7.7338954719999995,
|
||||
"max_avg_duration": 10.860303611333332,
|
||||
"min_tokens_per_second": 29.349959100715157,
|
||||
"max_tokens_per_second": 29.794841927435822
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"timestamp": "20250313_051856",
|
||||
"results": [
|
||||
{
|
||||
"model": "gemma3:12b",
|
||||
"total_duration": 17.904428624666668,
|
||||
"tokens_per_second": 11.206900603314153,
|
||||
"test_results": {
|
||||
"Fibonacci": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 17.904428624666668,
|
||||
"avg_tokens_sec": 11.206900603314153
|
||||
},
|
||||
"Binary Search": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 14.096915041666666,
|
||||
"avg_tokens_sec": 11.209157987254114
|
||||
},
|
||||
"Palindrome": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 9.514898375333333,
|
||||
"avg_tokens_sec": 11.037508677057549
|
||||
},
|
||||
"Anagram Check": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 24.419397555666666,
|
||||
"avg_tokens_sec": 11.87609409055045
|
||||
}
|
||||
},
|
||||
"overall_success_rate": 100.0,
|
||||
"overall_success_cases_rate": 1.0,
|
||||
"min_avg_duration": 9.514898375333333,
|
||||
"max_avg_duration": 24.419397555666666,
|
||||
"min_tokens_per_second": 11.037508677057549,
|
||||
"max_tokens_per_second": 11.87609409055045
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"timestamp": "20250314_024439",
|
||||
"results": [
|
||||
{
|
||||
"model": "SiliconBasedWorld/Qwen2.5-7B-Instruct-1M",
|
||||
"total_duration": 20.47047556933333,
|
||||
"tokens_per_second": 19.721316911932245,
|
||||
"test_results": {
|
||||
"Fibonacci": {
|
||||
"success_rate": 61.111111111111114,
|
||||
"passed_cases": 11,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.6111111111111112,
|
||||
"avg_duration": 20.47047556933333,
|
||||
"avg_tokens_sec": 19.721316911932245
|
||||
},
|
||||
"Binary Search": {
|
||||
"success_rate": 66.66666666666666,
|
||||
"passed_cases": 12,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.6666666666666666,
|
||||
"avg_duration": 89.59582123599999,
|
||||
"avg_tokens_sec": 19.522371869517652
|
||||
},
|
||||
"Palindrome": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 29.476939527666666,
|
||||
"avg_tokens_sec": 19.835750358255293
|
||||
},
|
||||
"Anagram Check": {
|
||||
"success_rate": 33.33333333333333,
|
||||
"passed_cases": 6,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.3333333333333333,
|
||||
"avg_duration": 52.099640236333336,
|
||||
"avg_tokens_sec": 19.661776969493513
|
||||
}
|
||||
},
|
||||
"overall_success_rate": 65.27777777777779,
|
||||
"overall_success_cases_rate": 0.6527777777777778,
|
||||
"min_avg_duration": 20.47047556933333,
|
||||
"max_avg_duration": 89.59582123599999,
|
||||
"min_tokens_per_second": 19.522371869517652,
|
||||
"max_tokens_per_second": 19.835750358255293
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"timestamp": "20250314_110909",
|
||||
"results": [
|
||||
{
|
||||
"model": "olmo2:13b",
|
||||
"total_duration": 25.239670416666666,
|
||||
"tokens_per_second": 8.973277631244137,
|
||||
"test_results": {
|
||||
"Fibonacci": {
|
||||
"success_rate": 61.111111111111114,
|
||||
"passed_cases": 11,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.6111111111111112,
|
||||
"avg_duration": 25.239670416666666,
|
||||
"avg_tokens_sec": 8.973277631244137
|
||||
},
|
||||
"Binary Search": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 10.511362861,
|
||||
"avg_tokens_sec": 8.094987124683419
|
||||
},
|
||||
"Palindrome": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 7.803927528,
|
||||
"avg_tokens_sec": 8.07489922259982
|
||||
},
|
||||
"Anagram Check": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 16.829488430333335,
|
||||
"avg_tokens_sec": 8.85685146687769
|
||||
}
|
||||
},
|
||||
"overall_success_rate": 90.27777777777779,
|
||||
"overall_success_cases_rate": 0.9027777777777778,
|
||||
"min_avg_duration": 7.803927528,
|
||||
"max_avg_duration": 25.239670416666666,
|
||||
"min_tokens_per_second": 8.07489922259982,
|
||||
"max_tokens_per_second": 8.973277631244137
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"timestamp": "20250314_111430",
|
||||
"results": [
|
||||
{
|
||||
"model": "olmo2:13b-1124-instruct-q4_K_M",
|
||||
"total_duration": 27.796664694333334,
|
||||
"tokens_per_second": 9.16360668962085,
|
||||
"test_results": {
|
||||
"Fibonacci": {
|
||||
"success_rate": 27.77777777777778,
|
||||
"passed_cases": 5,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 0.2777777777777778,
|
||||
"avg_duration": 27.796664694333334,
|
||||
"avg_tokens_sec": 9.16360668962085
|
||||
},
|
||||
"Binary Search": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 21.839994722333333,
|
||||
"avg_tokens_sec": 9.000336176480124
|
||||
},
|
||||
"Palindrome": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 10.587036805333334,
|
||||
"avg_tokens_sec": 8.492606444397637
|
||||
},
|
||||
"Anagram Check": {
|
||||
"success_rate": 100.0,
|
||||
"passed_cases": 18,
|
||||
"total_cases": 18,
|
||||
"success_cases_rate": 1.0,
|
||||
"avg_duration": 9.969617250333334,
|
||||
"avg_tokens_sec": 8.499243210997909
|
||||
}
|
||||
},
|
||||
"overall_success_rate": 81.94444444444444,
|
||||
"overall_success_cases_rate": 0.8194444444444444,
|
||||
"min_avg_duration": 9.969617250333334,
|
||||
"max_avg_duration": 27.796664694333334,
|
||||
"min_tokens_per_second": 8.492606444397637,
|
||||
"max_tokens_per_second": 9.16360668962085
|
||||
}
|
||||
]
|
||||
}
|
||||
|
@@ -1,4 +1,4 @@
|
||||
Benchmark Run: 20250303_174821
|
||||
Benchmark Run: 20250314_111430
|
||||
Server: http://localhost:11434
|
||||
|
||||
CPU Information:
|
||||
@@ -15,222 +15,13 @@ Benchmark Results:
|
||||
|
||||
[38;5;147m[1m🏆 Final Model Leaderboard:[0m
|
||||
|
||||
[1m[38;5;221mqwen2.5-coder:7b-instruct-q4_K_M[0m
|
||||
[1mOverall Success Rate:[0m 100.0% (72/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 19.33 (18.75 - 19.58)
|
||||
[1mAverage Duration:[0m 17.32s
|
||||
[1mMin/Max Avg Duration:[0m 8.67s / 17.99s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mfalcon3:10b[0m
|
||||
[1mOverall Success Rate:[0m 100.0% (72/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 13.21 (12.53 - 13.31)
|
||||
[1mAverage Duration:[0m 13.46s
|
||||
[1mMin/Max Avg Duration:[0m 6.76s / 13.46s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mqwen2.5:14b[0m
|
||||
[1mOverall Success Rate:[0m 100.0% (72/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 9.78 (9.78 - 9.88)
|
||||
[1mAverage Duration:[0m 35.25s
|
||||
[1mMin/Max Avg Duration:[0m 30.09s / 35.25s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mqwen2.5-coder:14b-instruct-q4_K_M[0m
|
||||
[1mOverall Success Rate:[0m 100.0% (72/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 9.68 (9.65 - 9.88)
|
||||
[1mAverage Duration:[0m 37.18s
|
||||
[1mMin/Max Avg Duration:[0m 23.06s / 37.18s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mphi4:latest[0m
|
||||
[1mOverall Success Rate:[0m 100.0% (72/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 9.01 (8.96 - 9.32)
|
||||
[1mAverage Duration:[0m 23.44s
|
||||
[1mMin/Max Avg Duration:[0m 23.44s / 38.82s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mdeepseek-r1:14b[0m
|
||||
[1mOverall Success Rate:[0m 97.2% (70/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 9.05 (8.90 - 9.38)
|
||||
[1mAverage Duration:[0m 278.32s
|
||||
[1mMin/Max Avg Duration:[0m 174.30s / 482.10s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ❌ 16/18 cases (88.9%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mllama3.2-vision:11b-instruct-q4_K_M[0m
|
||||
[1mOverall Success Rate:[0m 95.8% (69/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 15.68 (14.92 - 15.92)
|
||||
[1mAverage Duration:[0m 22.33s
|
||||
[1mMin/Max Avg Duration:[0m 16.31s / 28.85s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 16/18 cases (88.9%)
|
||||
- Binary Search: ❌ 17/18 cases (94.4%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mllama3.2:3b[0m
|
||||
[1mOverall Success Rate:[0m 94.4% (68/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 36.09 (30.85 - 37.53)
|
||||
[1mAverage Duration:[0m 2.67s
|
||||
[1mMin/Max Avg Duration:[0m 1.04s / 2.76s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 14/18 cases (77.8%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mllama3.1:8b[0m
|
||||
[1mOverall Success Rate:[0m 94.4% (68/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 17.92 (17.92 - 18.45)
|
||||
[1mAverage Duration:[0m 18.04s
|
||||
[1mMin/Max Avg Duration:[0m 14.68s / 19.56s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 14/18 cases (77.8%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mhhao/qwen2.5-coder-tools:7b[0m
|
||||
[1mOverall Success Rate:[0m 91.7% (66/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 17.75 (16.05 - 17.75)
|
||||
[1mAverage Duration:[0m 9.35s
|
||||
[1mMin/Max Avg Duration:[0m 4.17s / 9.35s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 12/18 cases (66.7%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mQwen2.5-Coder-7B-Instruct-s1k:latest[0m
|
||||
[1mOverall Success Rate:[0m 88.9% (64/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 18.38 (18.38 - 18.94)
|
||||
[1mAverage Duration:[0m 9.95s
|
||||
[1mMin/Max Avg Duration:[0m 9.06s / 12.91s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 16/18 cases (88.9%)
|
||||
- Binary Search: ❌ 12/18 cases (66.7%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mdeepseek-r1:8b[0m
|
||||
[1mOverall Success Rate:[0m 86.1% (62/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 17.43 (17.29 - 18.01)
|
||||
[1mAverage Duration:[0m 168.97s
|
||||
[1mMin/Max Avg Duration:[0m 107.91s / 168.97s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ❌ 16/18 cases (88.9%)
|
||||
- Anagram Check: ❌ 10/18 cases (55.6%)
|
||||
|
||||
[1m[38;5;221mllama3.2:1b-instruct-q4_K_M[0m
|
||||
[1m[38;5;221molmo2:13b-1124-instruct-q4_K_M[0m
|
||||
[1mOverall Success Rate:[0m 81.9% (59/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 88.24 (88.24 - 88.93)
|
||||
[1mAverage Duration:[0m 3.64s
|
||||
[1mMin/Max Avg Duration:[0m 1.87s / 4.93s
|
||||
[1mAverage Tokens/sec:[0m 9.16 (8.49 - 9.16)
|
||||
[1mAverage Duration:[0m 27.80s
|
||||
[1mMin/Max Avg Duration:[0m 9.97s / 27.80s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 5/18 cases (27.8%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221msamantha-mistral:latest[0m
|
||||
[1mOverall Success Rate:[0m 80.6% (58/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 23.92 (23.91 - 24.79)
|
||||
[1mAverage Duration:[0m 12.21s
|
||||
[1mMin/Max Avg Duration:[0m 7.59s / 12.21s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 8/18 cases (44.4%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ❌ 16/18 cases (88.9%)
|
||||
- Anagram Check: ❌ 16/18 cases (88.9%)
|
||||
|
||||
[1m[38;5;221mmarco-o1:latest[0m
|
||||
[1mOverall Success Rate:[0m 80.6% (58/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 19.19 (19.19 - 19.39)
|
||||
[1mAverage Duration:[0m 41.14s
|
||||
[1mMin/Max Avg Duration:[0m 33.28s / 51.50s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ✅ 18/18 cases (100.0%)
|
||||
- Binary Search: ❌ 6/18 cases (33.3%)
|
||||
- Palindrome: ✅ 18/18 cases (100.0%)
|
||||
- Anagram Check: ❌ 16/18 cases (88.9%)
|
||||
|
||||
[1m[38;5;221mdeepseek-r1:7b[0m
|
||||
[1mOverall Success Rate:[0m 80.6% (58/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 18.01 (18.01 - 19.07)
|
||||
[1mAverage Duration:[0m 336.87s
|
||||
[1mMin/Max Avg Duration:[0m 78.71s / 336.87s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 10/18 cases (55.6%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ❌ 12/18 cases (66.7%)
|
||||
- Anagram Check: ✅ 18/18 cases (100.0%)
|
||||
|
||||
[1m[38;5;221mdeepseek-r1:1.5b-qwen-distill-q8_0[0m
|
||||
[1mOverall Success Rate:[0m 52.8% (38/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 57.37 (53.88 - 59.60)
|
||||
[1mAverage Duration:[0m 137.59s
|
||||
[1mMin/Max Avg Duration:[0m 41.38s / 371.13s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 11/18 cases (61.1%)
|
||||
- Binary Search: ❌ 12/18 cases (66.7%)
|
||||
- Palindrome: ❌ 6/18 cases (33.3%)
|
||||
- Anagram Check: ❌ 9/18 cases (50.0%)
|
||||
|
||||
[1m[38;5;221mopenthinker:7b[0m
|
||||
[1mOverall Success Rate:[0m 47.2% (34/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 18.16 (17.98 - 18.29)
|
||||
[1mAverage Duration:[0m 263.00s
|
||||
[1mMin/Max Avg Duration:[0m 168.91s / 302.79s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 0/18 cases (0.0%)
|
||||
- Binary Search: ✅ 18/18 cases (100.0%)
|
||||
- Palindrome: ❌ 12/18 cases (66.7%)
|
||||
- Anagram Check: ❌ 4/18 cases (22.2%)
|
||||
|
||||
[1m[38;5;221mwizard-vicuna-uncensored:latest[0m
|
||||
[1mOverall Success Rate:[0m 9.7% (7/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 22.01 (22.01 - 24.42)
|
||||
[1mAverage Duration:[0m 9.06s
|
||||
[1mMin/Max Avg Duration:[0m 5.60s / 11.45s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 0/18 cases (0.0%)
|
||||
- Binary Search: ❌ 0/18 cases (0.0%)
|
||||
- Palindrome: ❌ 6/18 cases (33.3%)
|
||||
- Anagram Check: ❌ 1/18 cases (5.6%)
|
||||
|
||||
[1m[38;5;221mmxbai-embed-large:latest[0m
|
||||
[1mOverall Success Rate:[0m 0.0% (0/72 cases)
|
||||
[1mAverage Tokens/sec:[0m 0.00 (0.00 - 0.00)
|
||||
[1mAverage Duration:[0m 0.00s
|
||||
[1mMin/Max Avg Duration:[0m 0.00s / 0.00s
|
||||
[1mTest Results:[0m
|
||||
- Fibonacci: ❌ 0/18 cases (0.0%)
|
||||
- Binary Search: ❌ 0/18 cases (0.0%)
|
||||
- Palindrome: ❌ 0/18 cases (0.0%)
|
||||
- Anagram Check: ❌ 0/18 cases (0.0%)
|
||||
|
Binary file not shown.
Before: 2.3 KiB | After: 490 KiB
lboard.py (67 changed lines)
@@ -17,15 +17,52 @@ def calculate_model_stats(model_result):
|
||||
success_rates = [test['success_rate'] for test in test_results.values()]
|
||||
overall_success_rate = sum(success_rates) / len(success_rates)
|
||||
|
||||
# Handle the case where some test results might not have avg_duration or avg_tokens_sec
|
||||
# This is for backward compatibility with older benchmark results
|
||||
min_avg_duration = max_avg_duration = None
|
||||
min_tokens_per_second = max_tokens_per_second = None
|
||||
|
||||
# First try to get these values from the model_result directly (new format)
|
||||
if 'min_avg_duration' in model_result and 'max_avg_duration' in model_result:
|
||||
min_avg_duration = model_result['min_avg_duration']
|
||||
max_avg_duration = model_result['max_avg_duration']
|
||||
|
||||
if 'min_tokens_per_second' in model_result and 'max_tokens_per_second' in model_result:
|
||||
min_tokens_per_second = model_result['min_tokens_per_second']
|
||||
max_tokens_per_second = model_result['max_tokens_per_second']
|
||||
|
||||
# If not available in the model_result, try to calculate from test_results (old format)
|
||||
if min_avg_duration is None or max_avg_duration is None:
|
||||
try:
|
||||
min_avg_duration = min(test.get('avg_duration', float('inf')) for test in test_results.values() if 'avg_duration' in test)
|
||||
max_avg_duration = max(test.get('avg_duration', 0) for test in test_results.values() if 'avg_duration' in test)
|
||||
# If no test has avg_duration, use total_duration as fallback
|
||||
if min_avg_duration == float('inf') or max_avg_duration == 0:
|
||||
min_avg_duration = max_avg_duration = model_result['total_duration']
|
||||
except (ValueError, KeyError):
|
||||
# If calculation fails, use total_duration as fallback
|
||||
min_avg_duration = max_avg_duration = model_result['total_duration']
|
||||
|
||||
if min_tokens_per_second is None or max_tokens_per_second is None:
|
||||
try:
|
||||
min_tokens_per_second = min(test.get('avg_tokens_sec', float('inf')) for test in test_results.values() if 'avg_tokens_sec' in test)
|
||||
max_tokens_per_second = max(test.get('avg_tokens_sec', 0) for test in test_results.values() if 'avg_tokens_sec' in test)
|
||||
# If no test has avg_tokens_sec, use tokens_per_second as fallback
|
||||
if min_tokens_per_second == float('inf') or max_tokens_per_second == 0:
|
||||
min_tokens_per_second = max_tokens_per_second = model_result['tokens_per_second']
|
||||
except (ValueError, KeyError):
|
||||
# If calculation fails, use tokens_per_second as fallback
|
||||
min_tokens_per_second = max_tokens_per_second = model_result['tokens_per_second']
|
||||
|
||||
return {
|
||||
'model': model_result['model'],
|
||||
'overall_success_rate': overall_success_rate,
|
||||
'tokens_per_second': model_result['tokens_per_second'],
|
||||
'total_duration': model_result['total_duration'],
|
||||
'min_avg_duration': model_result.get('min_avg_duration', min(test['avg_duration'] for test in test_results.values())),
|
||||
'max_avg_duration': model_result.get('max_avg_duration', max(test['avg_duration'] for test in test_results.values())),
|
||||
'min_tokens_per_second': model_result.get('min_tokens_per_second', min(test['avg_tokens_sec'] for test in test_results.values())),
|
||||
'max_tokens_per_second': model_result.get('max_tokens_per_second', max(test['avg_tokens_sec'] for test in test_results.values())),
|
||||
'min_avg_duration': min_avg_duration,
|
||||
'max_avg_duration': max_avg_duration,
|
||||
'min_tokens_per_second': min_tokens_per_second,
|
||||
'max_tokens_per_second': max_tokens_per_second,
|
||||
'test_results': test_results
|
||||
}
|
||||
|
||||
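To illustrate the backward-compatibility fallback added above: an old-format record without the top-level min/max fields now gets them from the per-test values. The numbers below are invented and the call assumes `calculate_model_stats` is importable from `lboard.py`:

```python
# Illustrative only; assumes: from lboard import calculate_model_stats
old_format_result = {
    "model": "example-model",
    "total_duration": 12.0,
    "tokens_per_second": 20.0,
    "test_results": {
        "Fibonacci":     {"success_rate": 100.0, "avg_duration": 8.0,  "avg_tokens_sec": 19.5},
        "Binary Search": {"success_rate": 50.0,  "avg_duration": 15.0, "avg_tokens_sec": 20.5},
    },
}

stats = calculate_model_stats(old_format_result)
# With no min_avg_duration / max_avg_duration / min_tokens_per_second /
# max_tokens_per_second keys present, the fallback computes them per test:
# stats["min_avg_duration"] == 8.0,  stats["max_avg_duration"] == 15.0
# stats["min_tokens_per_second"] == 19.5, stats["max_tokens_per_second"] == 20.5
```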
@@ -120,12 +157,26 @@ def print_leaderboard(benchmark_data):
|
||||
print("No benchmark data to display")
|
||||
return
|
||||
|
||||
# Get the latest benchmark results
|
||||
latest_benchmark = benchmark_data['benchmarks'][-1]
|
||||
model_results = latest_benchmark['results']
|
||||
# Get all benchmark results and combine them
|
||||
all_model_results = []
|
||||
model_names = set()
|
||||
|
||||
# Process all benchmarks, keeping only the latest result for each model
|
||||
for benchmark in benchmark_data['benchmarks']:
|
||||
for model_result in benchmark.get('results', []):
|
||||
model_name = model_result.get('model')
|
||||
if model_name and model_name not in model_names:
|
||||
all_model_results.append(model_result)
|
||||
model_names.add(model_name)
|
||||
elif model_name in model_names:
|
||||
# Replace existing model with newer version
|
||||
for i, existing_model in enumerate(all_model_results):
|
||||
if existing_model.get('model') == model_name:
|
||||
all_model_results[i] = model_result
|
||||
break
|
||||
|
||||
# Calculate stats and sort models
|
||||
model_stats = [calculate_model_stats(model) for model in model_results]
|
||||
model_stats = [calculate_model_stats(model) for model in all_model_results]
|
||||
sorted_stats = sorted(model_stats,
|
||||
key=lambda x: (x['overall_success_rate'], x['tokens_per_second']),
|
||||
reverse=True)
|
||||
|
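The replace-in-place loop added above keeps only the newest result for each model. A dict keyed by model name is an equivalent, shorter formulation; this is a sketch, not the repo's code:

```python
# Later benchmark entries overwrite earlier ones, since runs are stored oldest-first.
latest_by_model = {}
for benchmark in benchmark_data['benchmarks']:
    for model_result in benchmark.get('results', []):
        model_name = model_result.get('model')
        if model_name:
            latest_by_model[model_name] = model_result  # newest run wins

all_model_results = list(latest_by_model.values())
```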
main copie.py (756 changed lines)
@@ -1,756 +0,0 @@
|
||||
from tabnanny import verbose
|
||||
import ollama
|
||||
import time
|
||||
from typing import List, Dict, Any
|
||||
import json
|
||||
from statistics import mean
|
||||
import re
|
||||
import ast
|
||||
import argparse
|
||||
import requests
|
||||
import os
|
||||
from together import Together
|
||||
from cpuinfo import get_cpu_info
|
||||
import subprocess
|
||||
|
||||
|
||||
# ANSI color codes
|
||||
SUCCESS = '\033[38;5;78m' # Soft mint green for success
|
||||
ERROR = '\033[38;5;203m' # Soft coral red for errors
|
||||
INFO = '\033[38;5;75m' # Sky blue for info
|
||||
HEADER = '\033[38;5;147m' # Soft purple for headers
|
||||
WARNING = '\033[38;5;221m' # Warm gold for warnings
|
||||
EMPHASIS = '\033[38;5;159m' # Cyan for emphasis
|
||||
MUTED = '\033[38;5;246m' # Subtle gray for less important text
|
||||
ENDC = '\033[0m'
|
||||
BOLD = '\033[1m'
|
||||
|
||||
# Replace existing color usages
|
||||
GREEN = SUCCESS
|
||||
RED = ERROR
|
||||
BLUE = INFO
|
||||
YELLOW = WARNING
|
||||
WHITE = MUTED
|
||||
|
||||
# Server configurations
|
||||
SERVERS = {
|
||||
'local': 'http://localhost:11434',
|
||||
'z60': 'http://192.168.196.60:11434'
|
||||
}
|
||||
|
||||
class Timer:
|
||||
def __init__(self):
|
||||
self.start_time = None
|
||||
self.end_time = None
|
||||
|
||||
def start(self):
|
||||
self.start_time = time.time()
|
||||
|
||||
def stop(self):
|
||||
self.end_time = time.time()
|
||||
|
||||
def elapsed_time(self):
|
||||
if self.start_time is None:
|
||||
return 0
|
||||
if self.end_time is None:
|
||||
return time.time() - self.start_time
|
||||
return self.end_time - self.start_time
|
||||
|
||||
def extract_code_from_response(response: str) -> str:
|
||||
"""Extract Python code from a markdown-formatted string."""
|
||||
code_blocks = re.findall(r'```python\n(.*?)```', response, re.DOTALL)
|
||||
if code_blocks:
|
||||
return code_blocks[0].strip()
|
||||
return response
|
||||
|
||||
def is_valid_python(code: str) -> bool:
|
||||
"""Check if the code is valid Python syntax."""
|
||||
try:
|
||||
ast.parse(code)
|
||||
return True
|
||||
except SyntaxError:
|
||||
return False
|
||||
|
||||
def analyze_failed_code(code: str, test_case: tuple, expected: any, actual: any, function_name: str, model: str) -> bool:
|
||||
"""Analyze why code failed using Together API. Returns True if Together thinks the code should work."""
|
||||
prompt = f"""Analyze this Python code and explain why it failed the test case. Format your response EXACTLY as follows:
|
||||
|
||||
ASSESSMENT: [Write a one-line assessment: either "SHOULD PASS" or "SHOULD FAIL" followed by a brief reason]
|
||||
|
||||
ANALYSIS:
|
||||
[Detailed analysis of why the code failed and how to fix it]
|
||||
|
||||
Code:
|
||||
{code}
|
||||
|
||||
Test case:
|
||||
Input: {test_case}
|
||||
Expected output: {expected}
|
||||
Actual output: {actual}
|
||||
Function name required: {function_name}
|
||||
Model: {model}"""
|
||||
|
||||
try:
|
||||
TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]
|
||||
together_client = Together(api_key=TOGETHER_API_KEY)
|
||||
response = together_client.chat.completions.create(
|
||||
model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a Python expert analyzing code failures. Always format your response with ASSESSMENT and ANALYSIS sections."},
|
||||
{"role": "user", "content": prompt}
|
||||
],
|
||||
max_tokens=1000,
|
||||
temperature=0.7,
|
||||
top_p=0.7,
|
||||
top_k=50,
|
||||
repetition_penalty=1,
|
||||
stop=["<|eot_id|>", "<|eom_id|>"]
|
||||
)
|
||||
|
||||
analysis = response.choices[0].message.content
|
||||
should_pass = "SHOULD PASS" in analysis.upper()
|
||||
if verbose: print(f"\n{BLUE}[{model}] Together Analysis:{ENDC}")
|
||||
if verbose: print(f"{GREEN if should_pass else RED}{analysis}{ENDC}")
|
||||
return should_pass
|
||||
except Exception as e:
|
||||
print(f"\n{RED}Error getting Together API analysis: {e}{ENDC}")
|
||||
return False
|
||||
|
||||
def validate_with_debug(code: str, function_name: str, test_cases: List[tuple], model: str) -> tuple[bool, str, List[bool]]:
|
||||
"""Validate code with detailed debug information. Returns (success, debug_info, test_results)"""
|
||||
debug_info = []
|
||||
test_results = [] # Track individual test case results
|
||||
test_outputs = [] # Store test outputs for combined display
|
||||
|
||||
try:
|
||||
# Create a local namespace
|
||||
namespace = {}
|
||||
debug_info.append(f"Executing code:\n{code}")
|
||||
|
||||
try:
|
||||
# Redirect stdout to capture prints from the executed code
|
||||
import io
|
||||
import sys
|
||||
stdout = sys.stdout
|
||||
sys.stdout = io.StringIO()
|
||||
|
||||
# Execute the code
|
||||
exec(code, namespace)
|
||||
|
||||
# Restore stdout
|
||||
sys.stdout = stdout
|
||||
|
||||
except Exception as e:
|
||||
if 'sys' in locals(): # Restore stdout if it was changed
|
||||
sys.stdout = stdout
|
||||
if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
|
||||
return False, f"Error executing code: {str(e)}", test_results
|
||||
|
||||
if function_name not in namespace:
|
||||
if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
|
||||
together_opinion = analyze_failed_code(code, "N/A", f"Function named '{function_name}'",
|
||||
f"Found functions: {list(namespace.keys())}", function_name, model)
|
||||
print(f"\nTests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")
|
||||
return False, f"Function '{function_name}' not found in code. Available names: {list(namespace.keys())}", test_results
|
||||
|
||||
function = namespace[function_name]
|
||||
debug_info.append(f"Function {function_name} found")
|
||||
|
||||
# Run test cases
|
||||
all_passed = True
|
||||
for i, (test_input, expected) in enumerate(test_cases):
|
||||
try:
|
||||
# Redirect stdout for each test case
|
||||
stdout = sys.stdout
|
||||
sys.stdout = io.StringIO()
|
||||
|
||||
if isinstance(test_input, tuple):
|
||||
result = function(*test_input)
|
||||
else:
|
||||
result = function(test_input)
|
||||
|
||||
# Restore stdout
|
||||
sys.stdout = stdout
|
||||
|
||||
# Store result but don't print individually
|
||||
test_outputs.append(str(result))
|
||||
test_passed = result == expected
|
||||
test_results.append(test_passed)
|
||||
|
||||
if not test_passed:
|
||||
if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
|
||||
print(f"\n{RED}Test case {i+1} failed:{ENDC}")
|
||||
print(f"Input: {test_input} Expected: {expected} Got: {result}")
|
||||
|
||||
together_opinion = analyze_failed_code(code, test_input, expected, result, function_name, model)
|
||||
print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")
|
||||
|
||||
all_passed = False
|
||||
continue
|
||||
|
||||
debug_info.append(f"Test case {i+1} passed: {test_input} → {result}")
|
||||
except Exception as e:
|
||||
if 'sys' in locals(): # Restore stdout if it was changed
|
||||
sys.stdout = stdout
|
||||
test_outputs.append(f"Error: {str(e)}")
|
||||
if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
|
||||
print(f"\n{RED}{str(e)} in test case {i+1} Input: {test_input} Expected: {expected}")
|
||||
|
||||
together_opinion = analyze_failed_code(code, test_input, expected, f"Error: {str(e)}", function_name, model)
|
||||
print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")
|
||||
|
||||
test_results.append(False)
|
||||
all_passed = False
|
||||
continue
|
||||
finally:
|
||||
if 'sys' in locals(): # Always restore stdout
|
||||
sys.stdout = stdout
|
||||
|
||||
# Print all test outputs on one line
|
||||
# print(f"{WHITE}{BOLD}Test outputs: {join(test_outputs)}{ENDC}")
|
||||
print(f"{WHITE}Test outputs: {', '.join(test_outputs)}{ENDC}")
|
||||
|
||||
if all_passed:
|
||||
print(f"Tests passed: ✅")
|
||||
return True, "All tests passed!\n" + "\n".join(debug_info), test_results
|
||||
print(f"Tests passed: ❌")
|
||||
return False, "Some tests failed", test_results
|
||||
except Exception as e:
|
||||
if 'sys' in locals(): # Restore stdout if it was changed
|
||||
sys.stdout = stdout
|
||||
print(f"\n{RED}Error in validate_with_debug: {str(e)}{ENDC}")
|
||||
return False, f"Unexpected error: {str(e)}", test_results
|
||||
|
||||
def test_fibonacci():
|
||||
question = """Write a Python function named EXACTLY 'fibonacci' (not fibonacci_dp or any other name) that returns the nth Fibonacci number.
|
||||
The function signature must be: def fibonacci(n)
|
||||
|
||||
Requirements:
|
||||
1. Handle edge cases:
|
||||
- For n = 0, return 0
|
||||
- For n = 1 or n = 2, return 1
|
||||
- For negative numbers, return -1
|
||||
2. For n > 2: F(n) = F(n-1) + F(n-2)
|
||||
3. Use dynamic programming or memoization for efficiency
|
||||
4. Do NOT use any print statements - just return the values
|
||||
|
||||
Example sequence: 0,1,1,2,3,5,8,13,21,...
|
||||
Example calls:
|
||||
- fibonacci(6) returns 8
|
||||
- fibonacci(0) returns 0
|
||||
- fibonacci(-1) returns -1"""
|
||||
|
||||
test_cases = [
|
||||
(0, 0), # Edge case: n = 0
|
||||
(1, 1), # Edge case: n = 1
|
||||
(2, 1), # Edge case: n = 2
|
||||
(6, 8), # Regular case
|
||||
(10, 55), # Larger number
|
||||
(-1, -1), # Edge case: negative input
|
||||
]
|
||||
|
||||
def validate(code: str) -> bool:
|
||||
success, debug_info, test_results = validate_with_debug(code, 'fibonacci', test_cases, "N/A")
|
||||
return success
|
||||
|
||||
return (question, validate, test_cases)
|
||||
|
||||
def test_binary_search():
|
||||
question = """Write a Python function named EXACTLY 'binary_search' that performs binary search on a sorted list.
|
||||
The function signature must be: def binary_search(arr, target)
|
||||
|
||||
Requirements:
|
||||
1. The function takes two arguments:
|
||||
- arr: a sorted list of integers
|
||||
- target: the integer to find
|
||||
2. Return the index of the target if found
|
||||
3. Return -1 if the target is not in the list
|
||||
4. Do NOT use any print statements - just return the values
|
||||
|
||||
Example:
|
||||
- binary_search([1,2,3,4,5], 3) returns 2
|
||||
- binary_search([1,2,3,4,5], 6) returns -1"""
|
||||
|
||||
test_cases = [
|
||||
(([1,2,3,4,5], 3), 2), # Regular case: target in middle
|
||||
(([1,2,3,4,5], 1), 0), # Edge case: target at start
|
||||
(([1,2,3,4,5], 5), 4), # Edge case: target at end
|
||||
(([1,2,3,4,5], 6), -1), # Edge case: target not in list
|
||||
(([], 1), -1), # Edge case: empty list
|
||||
(([1], 1), 0), # Edge case: single element list
|
||||
]
|
||||
|
||||
def validate(code: str) -> bool:
|
||||
success, debug_info, test_results = validate_with_debug(code, 'binary_search', test_cases, "N/A")
|
||||
return success
|
||||
|
||||
return (question, validate, test_cases)
|
||||
|
||||
def test_palindrome():
|
||||
question = """Write a Python function named EXACTLY 'is_palindrome' that checks if a string is a palindrome.
|
||||
The function signature must be: def is_palindrome(s)
|
||||
|
||||
Requirements:
|
||||
1. The function takes one argument:
|
||||
- s: a string to check
|
||||
2. Return True if the string is a palindrome, False otherwise
|
||||
3. Ignore case (treat uppercase and lowercase as the same)
|
||||
4. Ignore non-alphanumeric characters (spaces, punctuation)
|
||||
5. Do NOT use any print statements - just return the values
|
||||
|
||||
Example:
|
||||
- is_palindrome("A man, a plan, a canal: Panama") returns True
|
||||
- is_palindrome("race a car") returns False"""
|
||||
|
||||
test_cases = [
|
||||
("A man, a plan, a canal: Panama", True), # Regular case with punctuation
|
||||
("race a car", False), # Regular case, not palindrome
|
||||
("", True), # Edge case: empty string
|
||||
("a", True), # Edge case: single character
|
||||
("Was it a car or a cat I saw?", True), # Complex case with punctuation
|
||||
("hello", False), # Simple case, not palindrome
|
||||
]
|
||||
|
||||
def validate(code: str) -> bool:
|
||||
success, debug_info, test_results = validate_with_debug(code, 'is_palindrome', test_cases, "N/A")
|
||||
return success
|
||||
|
||||
return (question, validate, test_cases)
|
||||
|
||||
def test_anagram():
|
||||
question = """Write a Python function named EXACTLY 'are_anagrams' that checks if two strings are anagrams.
|
||||
The function signature must be: def are_anagrams(str1, str2)
|
||||
|
||||
Requirements:
|
||||
1. The function takes two arguments:
|
||||
- str1: first string
|
||||
- str2: second string
|
||||
2. Return True if the strings are anagrams, False otherwise
|
||||
3. Ignore case (treat uppercase and lowercase as the same)
|
||||
4. Ignore spaces
|
||||
5. Consider only alphanumeric characters
|
||||
6. Do NOT use any print statements - just return the values
|
||||
|
||||
Example:
|
||||
- are_anagrams("listen", "silent") returns True
|
||||
- are_anagrams("hello", "world") returns False"""
|
||||
|
||||
test_cases = [
|
||||
(("listen", "silent"), True), # Regular case
|
||||
(("hello", "world"), False), # Not anagrams
|
||||
(("", ""), True), # Edge case: empty strings
|
||||
(("a", "a"), True), # Edge case: single char
|
||||
(("Debit Card", "Bad Credit"), True), # Case and space test
|
||||
(("Python", "Java"), False), # Different lengths
|
||||
]
|
||||
|
||||
def validate(code: str) -> bool:
|
||||
success, debug_info, test_results = validate_with_debug(code, 'are_anagrams', test_cases, "N/A")
|
||||
return success
|
||||
|
||||
return (question, validate, test_cases)
|
||||
|
||||
# List of all test cases
|
||||
CODING_QUESTIONS = [
|
||||
test_fibonacci(),
|
||||
test_binary_search(),
|
||||
test_palindrome(),
|
||||
test_anagram()
|
||||
]
|
||||
|
||||
# Add test names as constants
|
||||
TEST_NAMES = {
|
||||
"Write a Python func": "Fibonacci",
|
||||
"Write a Python func": "Binary Search",
|
||||
"Write a Python func": "Palindrome",
|
||||
"Write a Python func": "Anagram Check"
|
||||
}
|
||||
|
||||
def get_test_name(question: str) -> str:
|
||||
"""Get a friendly name for the test based on the question."""
|
||||
if "fibonacci" in question.lower():
|
||||
return "Fibonacci"
|
||||
elif "binary_search" in question.lower():
|
||||
return "Binary Search"
|
||||
elif "palindrome" in question.lower():
|
||||
return "Palindrome"
|
||||
elif "anagram" in question.lower():
|
||||
return "Anagram Check"
|
||||
return question[:20] + "..."
|
||||
|
||||
def get_model_stats(model: str, question_tuple: tuple, server_url: str) -> Dict:
|
||||
"""
|
||||
Get performance statistics for a specific model and validate the response.
|
||||
"""
|
||||
question, validator = question_tuple
|
||||
timer = Timer()
|
||||
results = {
|
||||
'model': model,
|
||||
'total_duration': 0,
|
||||
'tokens_per_second': 0,
|
||||
'code_valid': False,
|
||||
'tests_passed': False,
|
||||
'error': None,
|
||||
'test_results': [] # Track individual test case results
|
||||
}
|
||||
|
||||
try:
|
||||
timer.start()
|
||||
print(f'{WHITE}Requesting code from {server_url} with {model}{ENDC}')
|
||||
response = requests.post(
|
||||
f"{server_url}/api/chat",
|
||||
json={
|
||||
"model": model,
|
||||
"messages": [{'role': 'user', 'content': question}],
|
||||
"stream": False
|
||||
}
|
||||
).json()
|
||||
timer.stop()
|
||||
|
||||
# Get performance metrics from response
|
||||
total_tokens = response.get('eval_count', 0)
|
||||
total_duration = response.get('total_duration', 0)
|
||||
total_response_time = float(total_duration) / 1e9
|
||||
|
||||
results['total_duration'] = total_response_time
|
||||
if total_tokens > 0 and total_response_time > 0:
|
||||
results['tokens_per_second'] = total_tokens / total_response_time
|
||||
|
||||
# Print concise performance metrics
|
||||
print(f"Total Duration (s): {total_response_time:.2f} / Total Tokens: {total_tokens} / Tokens per Second: {results['tokens_per_second']:.2f}")
|
||||
|
||||
# Extract code from response
|
||||
if 'message' in response and 'content' in response['message']:
|
||||
code = extract_code_from_response(response['message']['content'])
|
||||
|
||||
# Validate code
|
||||
results['code_valid'] = is_valid_python(code)
|
||||
|
||||
if results['code_valid']:
|
||||
print(f"Code validation: ✅")
|
||||
# Get validation results
|
||||
print(f'{WHITE}Running tests...{ENDC}')
|
||||
for test_case in CODING_QUESTIONS:
|
||||
if test_case[0] == question: # Found matching test case
|
||||
function_name = get_function_name_from_question(question)
|
||||
test_cases = test_case[2] # Get test cases from tuple
|
||||
success, debug_info, test_results = validate_with_debug(code, function_name, test_cases, model)
|
||||
results['tests_passed'] = success
|
||||
results['test_results'] = test_results
|
||||
break
|
||||
else:
|
||||
print(f"Code Validation: ❌")
|
||||
|
||||
else:
|
||||
results['error'] = f"Unexpected response format: {response}"
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n{RED}Error in get_model_stats: {str(e)}{ENDC}")
|
||||
results['error'] = str(e)
|
||||
|
||||
return results
|
||||
|
||||
def get_function_name_from_question(question: str) -> str:
|
||||
"""Extract function name from question."""
|
||||
if "fibonacci" in question.lower():
|
||||
return "fibonacci"
|
||||
elif "binary_search" in question.lower():
|
||||
return "binary_search"
|
||||
elif "palindrome" in question.lower():
|
||||
return "is_palindrome"
|
||||
elif "anagram" in question.lower():
|
||||
return "are_anagrams"
|
||||
return ""
|
||||
|
||||
def run_model_benchmark(model: str, server_url: str, num_runs: int = 4) -> Dict:
|
||||
"""
|
||||
Run multiple benchmarks for a model and calculate average metrics.
|
||||
"""
|
||||
|
||||
metrics = []
|
||||
|
||||
for i in range(num_runs):
|
||||
print(f"\n{YELLOW}[{model}] Run {i+1}/{num_runs}:{ENDC}")
|
||||
|
||||
run_results = {}
|
||||
for question, validator, test_cases in CODING_QUESTIONS:
|
||||
test_name = get_test_name(question)
|
||||
print(f"\n{BOLD}Testing {test_name}...{ENDC}")
|
||||
try:
|
||||
result = get_model_stats(model, (question, validator), server_url)
|
||||
result['total_tests'] = len(test_cases)
|
||||
run_results[test_name] = result
|
||||
except Exception as e:
|
||||
print(f"Error in run {i+1}: {e}")
|
||||
continue
|
||||
|
||||
if run_results:
|
||||
metrics.append(run_results)
|
||||
|
||||
# Take only the last 3 runs for averaging
|
||||
metrics = metrics[-3:]
|
||||
|
||||
if not metrics:
|
||||
return {}
|
||||
|
||||
# Aggregate results
|
||||
aggregated = {
|
||||
'model': model,
|
||||
'total_duration': mean([m[list(m.keys())[0]]['total_duration'] for m in metrics if m]),
|
||||
'tokens_per_second': mean([m[list(m.keys())[0]]['tokens_per_second'] for m in metrics if m]),
|
||||
'test_results': {}
|
||||
}
|
||||
|
||||
# Print final test results summary
|
||||
print(f"\n{BLUE}[{model}] Test Results Summary (last {len(metrics)} runs):{ENDC}")
|
||||
for test_name in metrics[-1].keys():
|
||||
# Calculate success rate across all runs
|
||||
passed_cases = 0
|
||||
total_cases = 0
|
||||
for m in metrics:
|
||||
if test_name in m:
|
||||
test_results = m[test_name].get('test_results', [])
|
||||
passed_cases += sum(1 for r in test_results if r)
|
||||
total_cases += len(test_results)
|
||||
|
||||
success_rate = (passed_cases / total_cases * 100) if total_cases > 0 else 0
|
||||
|
||||
status = '✅' if success_rate == 100 else '❌'
|
||||
print(f"{test_name}: {status} ({passed_cases}/{total_cases} cases)")
|
||||
|
||||
# Calculate average duration and tokens/sec for this test
|
||||
avg_duration = mean([m[test_name]['total_duration'] for m in metrics])
|
||||
avg_tokens_sec = mean([m[test_name]['tokens_per_second'] for m in metrics])
|
||||
|
||||
aggregated['test_results'][test_name] = {
|
||||
'success_rate': success_rate,
|
||||
'passed_cases': passed_cases,
|
||||
'total_cases': total_cases,
|
||||
'avg_duration': avg_duration,
|
||||
'avg_tokens_sec': avg_tokens_sec
|
||||
}
|
||||
|
||||
return aggregated
|
||||
|
||||
def print_leaderboard(results: List[Dict]):
|
||||
"""Print leaderboard of model results."""
|
||||
if not results:
|
||||
print("No results to display")
|
||||
return
|
||||
|
||||
# Sort by success rate first, then by tokens per second
|
||||
sorted_results = sorted(results, key=lambda x: (
|
||||
sum(t['passed_cases'] for t in x['test_results'].values()) / sum(t['total_cases'] for t in x['test_results'].values()) if sum(t['total_cases'] for t in x['test_results'].values()) > 0 else 0,
|
||||
x['tokens_per_second']
|
||||
), reverse=True)
|
||||
|
||||
print(f"\n{HEADER}{BOLD}🏆 Final Model Leaderboard:{ENDC}")
|
||||
for i, result in enumerate(sorted_results, 1):
|
||||
# Calculate stats for each model
|
||||
total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
|
||||
total_cases = sum(t['total_cases'] for t in result['test_results'].values())
|
||||
success_rate = (total_passed / total_cases * 100) if total_cases > 0 else 0
|
||||
|
||||
print(f"\n{BOLD}{YELLOW}{result['model']}{ENDC}")
|
||||
print(f" {BOLD}Overall Success Rate:{ENDC} {success_rate:.1f}% ({total_passed}/{total_cases} cases)")
|
||||
print(f" {BOLD}Average Tokens/sec:{ENDC} {result['tokens_per_second']:.2f}")
|
||||
print(f" {BOLD}Average Duration:{ENDC} {result['total_duration']:.2f}s")
|
||||
print(f" {BOLD}Test Results:{ENDC}")
|
||||
for test_name, test_result in result['test_results'].items():
|
||||
status = '✅' if test_result['success_rate'] == 100 else '❌'
|
||||
print(f" - {test_name}: {status} {test_result['passed_cases']}/{test_result['total_cases']} cases ({test_result['success_rate']:.1f}%)")
|
||||
|
||||
def get_available_models(server_url: str) -> List[str]:
|
||||
"""Get list of available models from the specified Ollama server."""
|
||||
try:
|
||||
response = requests.get(f"{server_url}/api/tags").json()
|
||||
return [model['name'] for model in response['models']]
|
||||
except Exception as e:
|
||||
print(f"{RED}Error getting model list from {server_url}: {e}{ENDC}")
|
||||
return []
|
||||
|
||||
def get_model_details(model_name):
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ollama", "show", model_name],
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
encoding='utf-8',
|
||||
errors='replace'
|
||||
)
|
||||
|
||||
if result.returncode != 0:
|
||||
print(f"Error: {result.stderr.strip()}")
|
||||
return None
|
||||
|
||||
if not result.stdout.strip():
|
||||
print(f"No details available for model: {model_name}")
|
||||
return None
|
||||
|
||||
raw_output = result.stdout.strip()
|
||||
lines = raw_output.split('\n')
|
||||
current_section = None
|
||||
|
||||
for line in lines:
|
||||
line = line.rstrip()
|
||||
if line and not line.startswith(' '): # Section headers
|
||||
current_section = line.strip()
|
||||
print(f"\n {current_section}")
|
||||
elif line and current_section: # Section content
|
||||
# Split by multiple spaces and filter out empty parts
|
||||
parts = [part for part in line.split(' ') if part.strip()]
|
||||
if len(parts) >= 2:
|
||||
key, value = parts[0].strip(), parts[-1].strip()
|
||||
# Ensure consistent spacing for alignment
|
||||
print(f" {key:<16} {value}")
|
||||
elif len(parts) == 1:
|
||||
# Handle single-value lines (like license text)
|
||||
print(f" {parts[0].strip()}")
|
||||
|
||||
return None # No need to return formatted details anymore
|
||||
|
||||
except Exception as e:
|
||||
print(f"An error occurred while getting model details: {e}")
|
||||
return None
|
||||
|
||||
def update_server_results(server_url: str, results: List[Dict]) -> None:
|
||||
try:
|
||||
# Get CPU brand and format it for filename
|
||||
cpu_info = get_cpu_info()
|
||||
cpu_brand = cpu_info.get('brand_raw', 'Unknown_CPU').replace(' ', '_')
|
||||
|
||||
# Create a unique filename for this server's results
|
||||
server_id = server_url.replace('http://', '').replace(':', '_').replace('/', '_')
|
||||
results_dir = "benchmark_results"
|
||||
|
||||
# Create results directory if it doesn't exist
|
||||
os.makedirs(results_dir, exist_ok=True)
|
||||
|
||||
# Include CPU brand in filename
|
||||
filename = os.path.join(results_dir, f"{cpu_brand}_{server_id}.json")
|
||||
timestamp = time.strftime("%Y%m%d_%H%M%S")
|
||||
|
||||
# Load existing results or create new file
|
||||
try:
|
||||
with open(filename, 'r') as f:
|
||||
existing_data = json.load(f)
|
||||
except FileNotFoundError:
|
||||
existing_data = {
|
||||
'server_url': server_url,
|
||||
'benchmarks': []
|
||||
}
|
||||
|
||||
# Add new results with timestamp
|
||||
existing_data['benchmarks'].append({
|
||||
'timestamp': timestamp,
|
||||
'results': results
|
||||
})
|
||||
|
||||
# Save updated results
|
||||
with open(filename, 'w') as f:
|
||||
json.dump(existing_data, f, indent=2)
|
||||
print(f"{GREEN}Successfully saved results to {filename}{ENDC}")
|
||||
except Exception as e:
|
||||
print(f"{RED}Failed to save results: {str(e)}{ENDC}")

def main():
    parser = argparse.ArgumentParser(description='Run Ollama model benchmarks')
    parser.add_argument('--server', choices=['local', 'z60'], default='local',
                        help='Choose Ollama server (default: local)')
    parser.add_argument('--model', type=str, help='Specific model to benchmark')
    parser.add_argument('--number', type=str, help='Number of models to benchmark (number or "all")')
    parser.add_argument('--verbose', action='store_true', help='Enable verbose output')
    args = parser.parse_args()

    server_url = SERVERS[args.server]

    print()
    print(f"{HEADER}{BOLD}CPU Information:{ENDC}")
    cpu_info = get_cpu_info()
    for key, value in cpu_info.items():
        print(f"{MUTED}{key}: {value}{ENDC}")

    print()
    print(f"{INFO}Using Ollama server at {server_url}...{ENDC}")

    # Get available models or use specified model
    if args.model:
        models = [args.model]
    else:
        models = get_available_models(server_url)

    if not models:
        print(f"{RED}No models found on server {server_url}. Exiting.{ENDC}")
        return

    # Handle number of models to test
    if args.number and args.number.lower() != 'all':
        try:
            num_models = int(args.number)
            if num_models > 0:
                models = models[:num_models]
            else:
                print(f"{WARNING}Invalid number of models. Using all available models.{ENDC}")
        except ValueError:
            print(f"{WARNING}Invalid number format. Using all available models.{ENDC}")

    print(f"{INFO}Testing {len(models)} models:{ENDC}")
    for i, model in enumerate(models, 1):
        print(f"{YELLOW}{i}. {model}{ENDC}")

    # Run benchmarks
    all_results = []

    for model in models:
        print(f"\n{HEADER}{BOLD}Benchmarking {model}...{ENDC}")
        details = get_model_details(model)
        if details:
            print(f"\n{INFO}Model Details:{ENDC}")
            if "details" in details:
                for section, items in details["details"].items():
                    print(f"\n{BOLD}{section}{ENDC}")
                    for key, value in items.items():
                        print(f"  {key}: {value}")
            else:
                print(json.dumps(details, indent=2))
        result = run_model_benchmark(model, server_url)
        if 'error' not in result:
            all_results.append(result)

    # Print and save results
    print_leaderboard(all_results)
    update_server_results(server_url, all_results)
    '''
    # Create leaderboard data structure
    leaderboard = []
    for result in sorted(all_results, key=lambda x: (
        sum(t['passed_cases'] for t in x['test_results'].values()) / sum(t['total_cases'] for t in x['test_results'].values()) if sum(t['total_cases'] for t in x['test_results'].values()) > 0 else 0,
        x['tokens_per_second']
    ), reverse=True):
        total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
        total_cases = sum(t['total_cases'] for t in result['test_results'].values())
        success_rate = (total_passed / total_cases * 100) if total_cases > 0 else 0

        leaderboard.append({
            'model': result['model'],
            'success_rate': success_rate,
            'total_passed': total_passed,
            'total_cases': total_cases,
            'tokens_per_second': result['tokens_per_second'],
            'average_duration': result['total_duration']
        })

    # Save detailed results and leaderboard to file
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    filename = f"benchmark_results/model_benchmark_{timestamp}.json"
    with open(filename, 'w') as f:
        json.dump({
            'timestamp': timestamp,
            'server_url': server_url,
            'leaderboard': leaderboard,
            'detailed_results': all_results
        }, f, indent=2)
    print(f"\n{GREEN}Detailed results saved to {filename}{ENDC}")
    '''
if __name__ == "__main__":
    main()

234 main.py
@@ -9,6 +9,8 @@ import ast
import argparse
import requests
import os
import glob
import matplotlib.pyplot as plt
from together import Together
from cpuinfo import get_cpu_info
import subprocess
@@ -633,7 +635,7 @@ def get_model_details(model_name):
        print(f"An error occurred while getting model details: {e}")
        return None

def update_server_results(server_url: str, results: List[Dict]) -> None:
def update_server_results(server_url: str, results: List[Dict]) -> str:
    try:
        # Get CPU brand and format it for filename
        cpu_info = get_cpu_info()
@@ -711,8 +713,215 @@ def update_server_results(server_url: str, results: List[Dict]) -> None:

        print(f"{GREEN}Console output saved to {log_filename}{ENDC}")

        return json_filename

    except Exception as e:
        print(f"{RED}Failed to save results: {str(e)}{ENDC}")
        return None
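
# Hedged usage sketch: with the return type changed from None to str, the caller can
# capture the path of the saved JSON file and hand it straight to the plotting step,
# mirroring how main() uses it later in this diff. The guard below is illustrative.
#
#     json_file = update_server_results(server_url, all_results)
#     if json_file:
#         plot_benchmark_results(json_file)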

def plot_benchmark_results(json_file=None):
    """
    Plot benchmark results using the same functionality as lboard.py

    Args:
        json_file: Path to the JSON file with benchmark results. If None, uses the latest file.
    """
    try:
        # If no file specified, find the latest
        if not json_file:
            json_file = get_latest_json_file('benchmark_results')
            if not json_file:
                print(f"{RED}No benchmark results found{ENDC}")
                return

        with open(json_file, 'r') as f:
            benchmark_data = json.load(f)

        print(f"{INFO}Using benchmark file: {json_file}{ENDC}")

        # Get all benchmark results and combine them
        all_model_results = []
        model_names = set()

        # Process all benchmarks, keeping only the latest result for each model
        for benchmark in benchmark_data['benchmarks']:
            for model_result in benchmark.get('results', []):
                model_name = model_result.get('model')
                if model_name and model_name not in model_names:
                    all_model_results.append(model_result)
                    model_names.add(model_name)
                elif model_name in model_names:
                    # Replace existing model with newer version
                    for i, existing_model in enumerate(all_model_results):
                        if existing_model.get('model') == model_name:
                            all_model_results[i] = model_result
                            break

        # Calculate stats and sort models
        model_stats = [calculate_model_stats(model) for model in all_model_results]
        sorted_stats = sorted(model_stats,
                              key=lambda x: (x['overall_success_rate'], x['tokens_per_second']),
                              reverse=True)

        print(f"\n🏆 Final Model Leaderboard:")
        for stats in sorted_stats:
            print(f"\n{stats['model']}")
            print(f"  Overall Success Rate: {stats['overall_success_rate']:.1f}%")
            print(f"  Average Tokens/sec: {stats['tokens_per_second']:.2f} ({stats['min_tokens_per_second']:.2f} - {stats['max_tokens_per_second']:.2f})")
            print(f"  Average Duration: {stats['total_duration']:.2f}s")
            print(f"  Min/Max Avg Duration: {stats['min_avg_duration']:.2f}s / {stats['max_avg_duration']:.2f}s")
            print(f"  Test Results:")

            for test_name, test_result in stats['test_results'].items():
                status = '✅' if test_result['success_rate'] == 100 else '❌'
                print(f"    - {test_name}: {status} {test_result['passed_cases']}/{test_result['total_cases']} cases ({test_result['success_rate']:.1f}%)")

        # Generate visualization
        plot_model_comparison(sorted_stats)
    except Exception as e:
        print(f"{RED}Error loading benchmark data: {e}{ENDC}")

def calculate_model_stats(model_result):
    """Calculate average stats for a model from its test results."""
    test_results = model_result['test_results']

    # Calculate overall success rate (average of all test success rates)
    success_rates = [test['success_rate'] for test in test_results.values()]
    overall_success_rate = sum(success_rates) / len(success_rates)

    # Handle the case where some test results might not have avg_duration or avg_tokens_sec
    # This is for backward compatibility with older benchmark results
    min_avg_duration = max_avg_duration = None
    min_tokens_per_second = max_tokens_per_second = None

    # First try to get these values from the model_result directly (new format)
    if 'min_avg_duration' in model_result and 'max_avg_duration' in model_result:
        min_avg_duration = model_result['min_avg_duration']
        max_avg_duration = model_result['max_avg_duration']

    if 'min_tokens_per_second' in model_result and 'max_tokens_per_second' in model_result:
        min_tokens_per_second = model_result['min_tokens_per_second']
        max_tokens_per_second = model_result['max_tokens_per_second']

    # If not available in the model_result, try to calculate from test_results (old format)
    if min_avg_duration is None or max_avg_duration is None:
        try:
            min_avg_duration = min(test.get('avg_duration', float('inf')) for test in test_results.values() if 'avg_duration' in test)
            max_avg_duration = max(test.get('avg_duration', 0) for test in test_results.values() if 'avg_duration' in test)
            # If no test has avg_duration, use total_duration as fallback
            if min_avg_duration == float('inf') or max_avg_duration == 0:
                min_avg_duration = max_avg_duration = model_result['total_duration']
        except (ValueError, KeyError):
            # If calculation fails, use total_duration as fallback
            min_avg_duration = max_avg_duration = model_result['total_duration']

    if min_tokens_per_second is None or max_tokens_per_second is None:
        try:
            min_tokens_per_second = min(test.get('avg_tokens_sec', float('inf')) for test in test_results.values() if 'avg_tokens_sec' in test)
            max_tokens_per_second = max(test.get('avg_tokens_sec', 0) for test in test_results.values() if 'avg_tokens_sec' in test)
            # If no test has avg_tokens_sec, use tokens_per_second as fallback
            if min_tokens_per_second == float('inf') or max_tokens_per_second == 0:
                min_tokens_per_second = max_tokens_per_second = model_result['tokens_per_second']
        except (ValueError, KeyError):
            # If calculation fails, use tokens_per_second as fallback
            min_tokens_per_second = max_tokens_per_second = model_result['tokens_per_second']

    return {
        'model': model_result['model'],
        'overall_success_rate': overall_success_rate,
        'tokens_per_second': model_result['tokens_per_second'],
        'total_duration': model_result['total_duration'],
        'min_avg_duration': min_avg_duration,
        'max_avg_duration': max_avg_duration,
        'min_tokens_per_second': min_tokens_per_second,
        'max_tokens_per_second': max_tokens_per_second,
        'test_results': test_results
    }
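
# Minimal worked example (assumed input; the field names follow the accessors above).
# With two tests at 100% and 50%, overall_success_rate averages to 75.0, and because
# neither the model_result nor its tests carry avg_duration / avg_tokens_sec, the
# min/max values fall back to total_duration and tokens_per_second.
#
#     example = {
#         'model': 'llama3:latest',
#         'tokens_per_second': 40.0,
#         'total_duration': 10.0,
#         'test_results': {
#             'test_a': {'success_rate': 100.0, 'passed_cases': 5, 'total_cases': 5},
#             'test_b': {'success_rate': 50.0, 'passed_cases': 2, 'total_cases': 4},
#         },
#     }
#     calculate_model_stats(example)['overall_success_rate']  # -> 75.0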

def plot_model_comparison(model_stats):
    """Plot model comparison with dual y-axes for tokens/sec and success rate."""
    models = [stat['model'] for stat in model_stats]
    token_speeds = [stat['tokens_per_second'] for stat in model_stats]
    success_rates = [stat['overall_success_rate'] for stat in model_stats]
    durations = [stat['total_duration'] for stat in model_stats]

    # Create figure and primary axis
    fig, ax1 = plt.subplots(figsize=(15, 8))

    # Plot tokens/sec bars using min and max values
    for i, stat in enumerate(model_stats):
        min_tokens = stat['min_tokens_per_second']
        max_tokens = stat['max_tokens_per_second']

        # Plot lower part (0 to min) with slightly darker blue
        ax1.bar(i, min_tokens, color='royalblue', alpha=0.4)
        # Plot upper part (min to max) with lighter blue
        bar_height = max_tokens - min_tokens
        ax1.bar(i, bar_height, bottom=min_tokens, color='royalblue', alpha=0.3)

    ax1.set_ylabel('Tokens per Second', color='blue')
    ax1.tick_params(axis='y', labelcolor='blue')
    # Set y-axis range for tokens per second
    max_token_speed = max(stat['max_tokens_per_second'] for stat in model_stats)
    ax1.set_ylim(0, max(100, max_token_speed * 1.1))  # Add 10% padding above max value

    # Set x-axis labels
    ax1.set_xticks(range(len(models)))
    ax1.set_xticklabels(models, rotation=45, ha='right', rotation_mode='anchor')

    # Create secondary y-axis for success rate
    ax2 = ax1.twinx()
    ax2.plot(models, success_rates, 'r+', markersize=15, label='Success Rate', linestyle='None')
    ax2.set_ylabel('Success Rate (%)', color='red')
    ax2.tick_params(axis='y', labelcolor='red')
    ax2.set_ylim(0, 100)

    # Create third y-axis for duration
    ax3 = ax1.twinx()
    ax3.spines['right'].set_position(('outward', 60))  # Move third axis outward
    # Add min and max duration markers
    min_durations = [stat['min_avg_duration'] for stat in model_stats]
    max_durations = [stat['max_avg_duration'] for stat in model_stats]
    # Plot duration ranges with vertical lines and markers
    for i, (min_d, max_d) in enumerate(zip(min_durations, max_durations)):
        ax3.plot([i, i], [min_d, max_d], 'g-', linewidth=1)  # Vertical line
        ax3.plot(i, min_d, 'g-', markersize=10)  # Min marker
        ax3.plot(i, max_d, 'g-', markersize=10)  # Max marker

    ax3.set_ylabel('Duration (s)', color='green')
    ax3.tick_params(axis='y', labelcolor='green')

    # Customize x-axis labels with proper rotation
    ax1.set_xticks(range(len(models)))
    ax1.set_xticklabels(models, rotation=45, ha='right', rotation_mode='anchor')
    for i, model in enumerate(models):
        # Shorten model names by removing common suffixes
        short_name = model.replace(':latest', '').replace('-uncensored', '')
        ax1.get_xticklabels()[i].set_text(short_name)
        # Updated conditions: success rate > 95% AND success rate / duration >= 5
        if success_rates[i] > 95 and (success_rates[i] / durations[i] >= 5):
            ax1.get_xticklabels()[i].set_color('green')

    # Adjust layout to prevent label cutoff
    plt.subplots_adjust(bottom=0.25, left=0.1, right=0.85)

    plt.title('Model Performance Comparison')
    plt.tight_layout()

    # Save the figure before showing it
    output_path = 'benchmark_results/model_comparison.png'
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    print(f"{INFO}Plot saved as '{output_path}'{ENDC}")

    # Show the figure (optional - can be removed for headless environments)
    plt.show()
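
# Note (assumption, not in the source): on a headless machine plt.show() may block or
# fail to open a window. A non-interactive backend, typically selected before pyplot
# is imported, avoids that while still letting savefig() write the PNG above:
#
#     import matplotlib
#     matplotlib.use('Agg')  # set before "import matplotlib.pyplot as plt"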

def get_latest_json_file(directory):
    """Find the latest JSON file in the specified directory."""
    json_files = glob.glob(os.path.join(directory, '*.json'))
    print(f"{INFO}Found JSON files: {json_files}{ENDC}")
    latest_file = max(json_files, key=os.path.getmtime) if json_files else None
    return latest_file

def main():
    parser = argparse.ArgumentParser(description='Run Ollama model benchmarks')
@@ -721,8 +930,24 @@ def main():
    parser.add_argument('--model', type=str, help='Specific model to benchmark')
    parser.add_argument('--number', type=str, help='Number of models to benchmark (number or "all")')
    parser.add_argument('--verbose', action='store_true', help='Enable verbose output')
    parser.add_argument('--plot-only', action='store_true',
                        help='Skip benchmarking and just plot graphs from existing results')
    parser.add_argument('--no-plot', action='store_true',
                        help='Run benchmarking without plotting graphs at the end')
    parser.add_argument('--file', type=str,
                        help='Specify a benchmark results file to use for plotting (only with --plot-only)')
    args = parser.parse_args()

    # Set global verbose flag
    global verbose
    verbose = args.verbose

    # Handle plot-only mode
    if args.plot_only:
        print(f"{INFO}Running in plot-only mode...{ENDC}")
        plot_benchmark_results(args.file)
        return

    server_url = SERVERS[args.server]

    print()
@@ -780,7 +1005,12 @@ def main():

    # Print and save results
    print_leaderboard(all_results)
    update_server_results(server_url, all_results)
    json_file = update_server_results(server_url, all_results)

    # Plot results unless --no-plot is specified
    if not args.no_plot:
        print(f"{INFO}Generating performance plot...{ENDC}")
        plot_benchmark_results(json_file)

if __name__ == "__main__":
    main()

requirements.txt
@@ -1,6 +1,12 @@
# Core dependencies
requests>=2.31.0
together>=0.2.8
ollama>=0.1.6
python-dotenv>=1.0.0
GPUtil==1.4.0
py-cpuinfo
py-cpuinfo>=9.0.0

# Visualization
matplotlib>=3.7.0

# Analysis
together>=0.2.8
GPUtil==1.4.0