first commit
parent 92d4b26ac2 · commit a3b06718a2

README.md (130 lines changed)

# Codebench - Ollama Model Benchmark Tool

A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.

## Features

- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time)
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks

## Prerequisites

- Python 3.8+
- Ollama server (local or remote)
- Together API key (optional, for advanced code analysis)

## Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/codebench.git
cd codebench
```

2. Install required packages:
```bash
pip install -r requirements.txt
```

3. (Optional) Set up the Together API:
```bash
export TOGETHER_API_KEY='your_api_key_here'
```

## Usage

Basic usage:

```bash
python3 main.py
```

Available options:

```bash
python main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
```

## Arguments

- `--server`: Choose Ollama server (default: local)
- `--model`: Test a specific model only
- `--number`: Number of models to test (a number or `all`)
- `--verbose`: Enable detailed output

## Supported Tests

The tool currently tests models on these coding challenges:

1. Fibonacci Sequence
2. Binary Search
3. Palindrome Check
4. Anagram Detection

## Test Process & Validation

### Code Generation

1. Each model is prompted with specific coding tasks
2. Generated code is extracted from the model's response
3. Initial syntax validation is performed

### Test Validation

For each test case:

- Input values are provided to the function
- Output is compared with expected results
- Test results are marked as ✅ (pass) or ❌ (fail)

Example test cases:

```plaintext
Fibonacci:
- Input: 6   Expected: 8
- Input: 0   Expected: 0
- Input: -1  Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)  Expected: 2
- Input: ([], 1)           Expected: -1
- Input: ([1], 1)          Expected: 0
```

## Output

Results are saved in the `benchmark_results` directory with the following naming convention:

```plaintext
[CPU_Model]_[Server_Address].json
```

Example:

```plaintext
Apple_M1_Pro_localhost_11434.json
```

## Server Configuration

Default servers are configured in the code (see the snippet below):

- Local: http://localhost:11434
- Z60: http://192.168.196.60:11434
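
The server map lives in `main.py`; `--server` selects one of its keys. Add or edit entries here to point the benchmark at other Ollama hosts:

```python
# Server configurations (from main.py)
SERVERS = {
    'local': 'http://localhost:11434',
    'z60': 'http://192.168.196.60:11434'
}
```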

## Example Output

```plaintext
🏆 Final Model Leaderboard:

codellama:13b
   Overall Success Rate: 95.8% (23/24 cases)
   Average Tokens/sec: 145.23
   Average Duration: 2.34s
   Test Results:
   - Fibonacci: ✅ 6/6 cases (100.0%)
   - Binary Search: ✅ 6/6 cases (100.0%)
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

[Your chosen license]

benchmark_results/.DS_Store (binary file, not shown)

devbook.md (222 lines added)

# Ollama Testing Framework Documentation

Version: 1.0
Last Updated: 2025-02-23

## Overview

The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:

1. Code correctness (test cases)
2. Performance metrics (inference time, tokens/sec)
3. Consistency across multiple runs

## Goals

1. Validate model responses for correctness and functionality
2. Measure and compare performance across different models
3. Provide detailed insights into model behavior and reliability
4. Enable easy comparison through a leaderboard system

## Core Components

### 1. Test Suite

The test suite consists of multiple coding challenges, each defined in `main.py` as a `(question, validator, test_cases)` tuple (see the sketch at the end of this section), with:

- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs

Current Test Cases:

a) Fibonacci Sequence
   - Tests edge cases (negative, zero)
   - Tests standard cases (n=1 to n=10)
   - Validates performance for larger inputs

b) Binary Search
   - Tests empty list case
   - Tests element not found
   - Tests finding elements at different positions

c) Palindrome Check
   - Tests empty string
   - Tests single character
   - Tests various palindrome and non-palindrome cases

d) Anagram Check
   - Tests empty strings
   - Tests case sensitivity
   - Tests strings with spaces and special characters
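
Shape of one test-suite entry, trimmed from `test_binary_search()` in `main.py` (the full prompt text is elided here):

```python
# Trimmed from test_binary_search() in main.py
def test_binary_search():
    question = "Write a Python function named EXACTLY 'binary_search' ..."  # full prompt elided

    test_cases = [
        (([1, 2, 3, 4, 5], 3), 2),  # (input arguments, expected output)
        (([], 1), -1),              # edge case: empty list
    ]

    def validate(code: str) -> bool:
        # Delegates to validate_with_debug() in main.py, which execs the code
        # and runs every test case against the required function name.
        success, _debug_info, _results = validate_with_debug(code, 'binary_search', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

# CODING_QUESTIONS collects the tuples returned by each test_* factory.
```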

### 2. Inference Pipeline

#### Request Flow:

1. Format prompt with problem description
2. Send to Ollama API with timing started (a condensed request is shown after this list)
3. Receive response and stop timing
4. Extract code from response
5. Validate code syntax
6. Run test cases
7. Calculate performance metrics
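
Condensed from `get_model_stats()` in `main.py`: one timed, non-streaming call to the Ollama `/api/chat` endpoint. The server URL, model name, and prompt values below are illustrative placeholders:

```python
import time
import requests

server_url = "http://localhost:11434"   # one entry from the SERVERS map
model_name = "codellama:13b"            # illustrative
question = "Write a Python function named EXACTLY 'fibonacci' ..."  # full prompt elided

start = time.time()
response = requests.post(
    f"{server_url}/api/chat",
    json={
        "model": model_name,
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    },
).json()
elapsed = time.time() - start  # main.py wraps this in its Timer helper
```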

#### Performance Metrics:

- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (`eval_count`)
- Tokens per Second: Processing speed (tokens / duration); see the extraction sketch after this list
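
Ollama reports `eval_count` and `total_duration` (in nanoseconds) in the response body, so the metrics reduce to two divisions. Condensed from `get_model_stats()` in `main.py`; the `response` values here are illustrative:

```python
# Metric extraction from the Ollama response (condensed from get_model_stats() in main.py)
response = {"eval_count": 340, "total_duration": 2_340_000_000}  # illustrative; duration in ns

total_tokens = response.get("eval_count", 0)
total_response_time = float(response.get("total_duration", 0)) / 1e9  # ns -> seconds

tokens_per_second = (
    total_tokens / total_response_time
    if total_tokens > 0 and total_response_time > 0
    else 0
)
print(f"{tokens_per_second:.2f} tokens/sec over {total_response_time:.2f}s")
```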

### 3. Validation System

#### Code Validation:

1. Syntax check (`is_valid_python`)
2. Function name verification
3. Test case execution
4. Together API integration for failure analysis

#### Test Results:

- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures

### 4. Benchmarking System

#### Benchmark Process:

1. Run multiple iterations (default: 4 runs)
2. Use the last 3 runs for final metrics (a condensed sketch follows this list)
3. Calculate averages across runs
4. Store detailed results in JSON
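
The warm-up-then-average behaviour comes from keeping only the tail of the per-run metrics list. A condensed sketch of the pattern in `run_model_benchmark()` in `main.py`; `bench_one_round` is a hypothetical stand-in for the per-question loop:

```python
from statistics import mean

def average_of_last_runs(bench_one_round, num_runs: int = 4) -> float:
    """Run num_runs rounds, keep only the last 3, and average their durations."""
    metrics = [r for r in (bench_one_round(i) for i in range(num_runs)) if r]
    metrics = metrics[-3:]  # the earlier run(s) act as a warm-up and are discarded
    if not metrics:
        return 0.0
    return mean(run["total_duration"] for run in metrics)
```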

#### Metrics Tracked:

- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second

### 5. Leaderboard System

#### Ranking Algorithm:

1. Primary sort: Overall success rate
   - Calculated as (total passed cases / total cases) across all tests
2. Secondary sort: Tokens per second
   - Higher speed breaks ties between equal success rates (the sort key is shown after this list)
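
In code, the two-level ranking is a single tuple sort key, as in `print_leaderboard()` in `lboard.py` (the values below are illustrative):

```python
# Success rate first, tokens/sec as the tie-breaker, both descending
model_stats = [
    {"model": "a", "overall_success_rate": 95.8, "tokens_per_second": 145.2},
    {"model": "b", "overall_success_rate": 95.8, "tokens_per_second": 98.4},
]  # illustrative values

sorted_stats = sorted(
    model_stats,
    key=lambda x: (x["overall_success_rate"], x["tokens_per_second"]),
    reverse=True,
)
```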

#### Display Format:

```
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
```

## Usage

### Basic Run:

```bash
python3.10 main.py --server 'z60' --number '2'
```

### Options:

- `--model`: Specify a single model to test
- `--server`: Ollama server to use (`local` or `z60`)
- `--number`: Number of models to test (a number or `all`)
- `--verbose`: Enable detailed output

(The number of benchmark runs per model is currently fixed in code: 4 runs, with the last 3 averaged.)

### Output:

1. Real-time test progress
2. Performance metrics per inference
3. Test results summary
4. Final leaderboard
5. JSON results file with timestamp

## Results Storage

- Results saved under `benchmark_results/` as one JSON file per CPU/server (e.g. `[CPU_Model]_[Server_Address].json`), with a timestamp on each benchmark entry (structure sketched below)
- Contains full test details and metrics
- Enables historical comparison and analysis
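
Rough shape of a saved results file, with field names taken from `update_server_results()` in `main.py` and illustrative values borrowed from the README example:

```python
# Illustrative sketch of one results file; field names from update_server_results() in main.py
results_file = {
    "server_url": "http://localhost:11434",
    "benchmarks": [
        {
            "timestamp": "YYYYMMDD_HHMMSS",
            "results": [
                {
                    "model": "codellama:13b",
                    "total_duration": 2.34,        # seconds (averaged)
                    "tokens_per_second": 145.23,
                    "overall_success_rate": 95.8,  # percent
                    "test_results": {
                        "Fibonacci": {"success_rate": 100.0, "passed_cases": 6, "total_cases": 6},
                        # ... one entry per test
                    },
                },
            ],
        },
    ],
}
```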

## Error Handling

1. API communication errors
2. Code execution timeouts
3. Invalid responses
4. Test case failures
5. Performance metric calculation errors

## Future Improvements

1. Add more diverse test cases
2. Implement parallel testing
3. Add memory usage tracking
4. Create historical performance trends
5. Add code quality metrics

## Leaderboard Data Processing

### Test Results Processing

- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios

### Success Rate Calculations

- Calculates success rates based on expected total cases (a hedged sketch follows this list)
- Counts failed validations as 0/expected_cases
- Uses the maximum observed test cases as the baseline
- Includes validation status in success rate reporting
- Prevents skipping of failed validations in total counts
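
A minimal sketch of the counting rule described above, assuming per-run records with a passed-case count and a validation flag; the record shape and names are illustrative, not taken from `lboard.py`:

```python
def success_rate(runs: list[dict]) -> float:
    """Percentage of passed cases, counting failed validations as 0/expected_cases.

    `runs` is assumed to look like:
    [{"validated": True, "passed_cases": 6, "total_cases": 6}, ...]  # illustrative
    """
    # Baseline: maximum test cases observed among runs that validated
    expected_cases = max((r["total_cases"] for r in runs if r["validated"]), default=0)
    if expected_cases == 0 or not runs:
        return 0.0
    # Failed validations contribute 0 passed cases but still count in the denominator
    passed = sum(r["passed_cases"] if r["validated"] else 0 for r in runs)
    return passed / (expected_cases * len(runs)) * 100
```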

### Performance Metrics

- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates the success rate vs. duration ratio
- Excellence criteria: >90% success AND success_rate > 5 * duration
- Prevents duplicate model entries in metrics

### Data Visualization

# Development Notes

## Project Structure

- `main.py`: Core benchmarking functionality
- `lboard.py`: Leaderboard visualization and results analysis
- `benchmark_results/`: Directory containing JSON benchmark results

## Visualization Features

- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison

## Running the Leaderboard

```bash
# View latest results
python lboard.py

# View specific results file
python lboard.py path/to/results.json
```

The generated plot has:

- A single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- A combined legend for all metrics

### Output Format

- Detailed test results per model
- Individual test case counts
- Validation status indicators
- Overall success rates
- Performance metrics
- Plot position information

    Model: model_name
    ├─ Tokens/sec: XX.XX
    ├─ Total Duration: XX.XXs
    ├─ Test Results:
    │  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
    │  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
    └─ Overall Success Rate: X/Y (ZZ.Z%)

### Error Handling

- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations

lboard.py (144 lines added)

import json
import os
import argparse
import glob
import matplotlib.pyplot as plt

def get_latest_json_file(directory):
    json_files = glob.glob(os.path.join(directory, '*.json'))
    print(f"Found JSON files: {json_files}")
    latest_file = max(json_files, key=os.path.getmtime) if json_files else None
    return latest_file

def calculate_model_stats(model_result):
    """Calculate average stats for a model from its test results."""
    test_results = model_result['test_results']

    # Calculate overall success rate (average of all test success rates)
    success_rates = [test['success_rate'] for test in test_results.values()]
    overall_success_rate = sum(success_rates) / len(success_rates)

    return {
        'model': model_result['model'],
        'overall_success_rate': overall_success_rate,
        'tokens_per_second': model_result['tokens_per_second'],
        'total_duration': model_result['total_duration'],
        'test_results': test_results
    }

def plot_model_comparison(model_stats):
    """Plot model comparison with three y-axes: tokens/sec, success rate, and duration."""
    models = [stat['model'] for stat in model_stats]
    token_speeds = [stat['tokens_per_second'] for stat in model_stats]
    success_rates = [stat['overall_success_rate'] for stat in model_stats]
    durations = [stat['total_duration'] for stat in model_stats]

    # Create figure and primary axis
    fig, ax1 = plt.subplots(figsize=(15, 8))

    # Plot tokens/sec bars on primary y-axis with lighter blue and more transparency
    bars = ax1.bar(models, token_speeds, color='royalblue', alpha=0.3)
    ax1.set_ylabel('Tokens per Second', color='blue')
    ax1.tick_params(axis='y', labelcolor='blue')

    # Create secondary y-axis for success rate
    ax2 = ax1.twinx()
    ax2.plot(models, success_rates, 'r+', markersize=15, label='Success Rate', linestyle='None')
    ax2.set_ylabel('Success Rate (%)', color='red')
    ax2.tick_params(axis='y', labelcolor='red')
    ax2.set_ylim(0, 100)

    # Create third y-axis for duration
    ax3 = ax1.twinx()
    ax3.spines['right'].set_position(('outward', 60))  # Move third axis outward
    ax3.plot(models, durations, 'g_', markersize=15, label='Duration', linestyle='None')
    ax3.set_ylabel('Duration (s)', color='green')
    ax3.tick_params(axis='y', labelcolor='green')

    # Customize x-axis labels with proper rotation
    ax1.set_xticks(range(len(models)))
    ax1.set_xticklabels(models, rotation=45, ha='right', rotation_mode='anchor')
    for i, model in enumerate(models):
        # Shorten model names by removing common suffixes
        short_name = model.replace(':latest', '').replace('-uncensored', '')
        ax1.get_xticklabels()[i].set_text(short_name)
        if success_rates[i] > 90:
            ax1.get_xticklabels()[i].set_color('green')

    # Adjust layout to prevent label cutoff
    plt.subplots_adjust(bottom=0.25, left=0.1, right=0.85)

    '''
    # Add value labels
    for i, bar in enumerate(bars):
        ax1.text(i, token_speeds[i], f'{token_speeds[i]:.1f}',
                 ha='center', va='bottom', color='black')
        ax2.text(i, success_rates[i], f'{success_rates[i]:.1f}%',
                 ha='center', va='bottom', color='black')
        ax3.text(i, durations[i], f'{durations[i]:.1f}s',
                 ha='center', va='top', color='black')
    '''
    plt.title('Model Performance Comparison')
    plt.tight_layout()

    # Save before show(): show() blocks and may leave an empty canvas behind,
    # so saving afterwards can write a blank image.
    plt.savefig('benchmark_results/model_comparison.png')
    print("\nPlot saved as 'benchmark_results/model_comparison.png'")
    plt.show()

def print_leaderboard(benchmark_data):
    """Print leaderboard from benchmark results."""
    if not benchmark_data.get('benchmarks'):
        print("No benchmark data to display")
        return

    # Get the latest benchmark results
    latest_benchmark = benchmark_data['benchmarks'][-1]
    model_results = latest_benchmark['results']

    # Calculate stats and sort models
    model_stats = [calculate_model_stats(model) for model in model_results]
    sorted_stats = sorted(model_stats,
                          key=lambda x: (x['overall_success_rate'], x['tokens_per_second']),
                          reverse=True)

    print(f"\n🏆 Final Model Leaderboard:")
    for stats in sorted_stats:
        print(f"\n{stats['model']}")
        print(f"   Overall Success Rate: {stats['overall_success_rate']:.1f}%")
        print(f"   Average Tokens/sec: {stats['tokens_per_second']:.2f}")
        print(f"   Average Duration: {stats['total_duration']:.2f}s")
        print(f"   Test Results:")

        for test_name, test_result in stats['test_results'].items():
            status = '✅' if test_result['success_rate'] == 100 else '❌'
            print(f"   - {test_name}: {status} {test_result['success_rate']:.1f}%")

    # Generate visualization
    plot_model_comparison(sorted_stats)

def main():
    parser = argparse.ArgumentParser(description='Display benchmark leaderboard')
    parser.add_argument('filepath', nargs='?', help='Path to benchmark results JSON file')
    parser.add_argument('--file', type=str, help='Path to benchmark results JSON file (alternative way)')
    args = parser.parse_args()

    try:
        # Use filepath if provided, then --file, otherwise find latest
        if args.filepath:
            json_file = args.filepath
        elif args.file:
            json_file = args.file
        else:
            json_file = get_latest_json_file('benchmark_results')
            if not json_file:
                print("No benchmark results found")
                return

        with open(json_file, 'r') as f:
            benchmark_data = json.load(f)
            print(f"Using benchmark file: {json_file}")
            print_leaderboard(benchmark_data)

    except Exception as e:
        print(f"Error loading benchmark data: {e}")

if __name__ == "__main__":
    main()

main.py (732 lines added)

import ollama
import time
from typing import List, Dict, Any
import json
from statistics import mean
import re
import ast
import argparse
import requests
import os
from together import Together
from cpuinfo import get_cpu_info
import subprocess

# Global verbosity flag, set from the --verbose CLI option in main().
# (The original `from tabnanny import verbose` was an accidental auto-import
# that left --verbose without effect.)
verbose = False

# ANSI color codes
SUCCESS = '\033[38;5;78m'    # Soft mint green for success
ERROR = '\033[38;5;203m'     # Soft coral red for errors
INFO = '\033[38;5;75m'       # Sky blue for info
HEADER = '\033[38;5;147m'    # Soft purple for headers
WARNING = '\033[38;5;221m'   # Warm gold for warnings
EMPHASIS = '\033[38;5;159m'  # Cyan for emphasis
MUTED = '\033[38;5;246m'     # Subtle gray for less important text
ENDC = '\033[0m'
BOLD = '\033[1m'

# Aliases used throughout the script
GREEN = SUCCESS
RED = ERROR
BLUE = INFO
YELLOW = WARNING
WHITE = MUTED

# Server configurations
SERVERS = {
    'local': 'http://localhost:11434',
    'z60': 'http://192.168.196.60:11434'
}

class Timer:
    def __init__(self):
        self.start_time = None
        self.end_time = None

    def start(self):
        self.start_time = time.time()

    def stop(self):
        self.end_time = time.time()

    def elapsed_time(self):
        if self.start_time is None:
            return 0
        if self.end_time is None:
            return time.time() - self.start_time
        return self.end_time - self.start_time

def extract_code_from_response(response: str) -> str:
    """Extract Python code from a markdown-formatted string."""
    code_blocks = re.findall(r'```python\n(.*?)```', response, re.DOTALL)
    if code_blocks:
        return code_blocks[0].strip()
    return response

def is_valid_python(code: str) -> bool:
    """Check if the code is valid Python syntax."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def analyze_failed_code(code: str, test_case: tuple, expected: Any, actual: Any, function_name: str, model: str) -> bool:
    """Analyze why code failed using the Together API. Returns True if Together thinks the code should work."""
    prompt = f"""Analyze this Python code and explain why it failed the test case. Format your response EXACTLY as follows:

ASSESSMENT: [Write a one-line assessment: either "SHOULD PASS" or "SHOULD FAIL" followed by a brief reason]

ANALYSIS:
[Detailed analysis of why the code failed and how to fix it]

Code:
{code}

Test case:
Input: {test_case}
Expected output: {expected}
Actual output: {actual}
Function name required: {function_name}
Model: {model}"""

    try:
        TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]
        together_client = Together(api_key=TOGETHER_API_KEY)
        response = together_client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
            messages=[
                {"role": "system", "content": "You are a Python expert analyzing code failures. Always format your response with ASSESSMENT and ANALYSIS sections."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=1000,
            temperature=0.7,
            top_p=0.7,
            top_k=50,
            repetition_penalty=1,
            stop=["<|eot_id|>", "<|eom_id|>"]
        )

        analysis = response.choices[0].message.content
        should_pass = "SHOULD PASS" in analysis.upper()
        if verbose: print(f"\n{BLUE}[{model}] Together Analysis:{ENDC}")
        if verbose: print(f"{GREEN if should_pass else RED}{analysis}{ENDC}")
        return should_pass
    except Exception as e:
        print(f"\n{RED}Error getting Together API analysis: {e}{ENDC}")
        return False

def validate_with_debug(code: str, function_name: str, test_cases: List[tuple], model: str) -> tuple[bool, str, List[bool]]:
    """Validate code with detailed debug information. Returns (success, debug_info, test_results)."""
    debug_info = []
    test_results = []  # Track individual test case results
    test_outputs = []  # Store test outputs for combined display

    try:
        # Create a local namespace
        namespace = {}
        debug_info.append(f"Executing code:\n{code}")

        try:
            # Redirect stdout to capture prints from the executed code
            import io
            import sys
            stdout = sys.stdout
            sys.stdout = io.StringIO()

            # Execute the code
            exec(code, namespace)

            # Restore stdout
            sys.stdout = stdout

        except Exception as e:
            if 'sys' in locals():  # Restore stdout if it was changed
                sys.stdout = stdout
            if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
            return False, f"Error executing code: {str(e)}", test_results

        if function_name not in namespace:
            if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
            together_opinion = analyze_failed_code(code, "N/A", f"Function named '{function_name}'",
                                                   f"Found functions: {list(namespace.keys())}", function_name, model)
            print(f"\nTests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")
            return False, f"Function '{function_name}' not found in code. Available names: {list(namespace.keys())}", test_results

        function = namespace[function_name]
        debug_info.append(f"Function {function_name} found")

        # Run test cases
        all_passed = True
        for i, (test_input, expected) in enumerate(test_cases):
            try:
                # Redirect stdout for each test case
                stdout = sys.stdout
                sys.stdout = io.StringIO()

                if isinstance(test_input, tuple):
                    result = function(*test_input)
                else:
                    result = function(test_input)

                # Restore stdout
                sys.stdout = stdout

                # Store result but don't print individually
                test_outputs.append(str(result))
                test_passed = result == expected
                test_results.append(test_passed)

                if not test_passed:
                    if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
                    print(f"\n{RED}Test case {i+1} failed:{ENDC}")
                    print(f"Input: {test_input} Expected: {expected} Got: {result}")

                    together_opinion = analyze_failed_code(code, test_input, expected, result, function_name, model)
                    print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")

                    all_passed = False
                    continue

                debug_info.append(f"Test case {i+1} passed: {test_input} → {result}")
            except Exception as e:
                if 'sys' in locals():  # Restore stdout if it was changed
                    sys.stdout = stdout
                test_outputs.append(f"Error: {str(e)}")
                if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
                print(f"\n{RED}{str(e)} in test case {i+1} Input: {test_input} Expected: {expected}{ENDC}")

                together_opinion = analyze_failed_code(code, test_input, expected, f"Error: {str(e)}", function_name, model)
                print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")

                test_results.append(False)
                all_passed = False
                continue
            finally:
                if 'sys' in locals():  # Always restore stdout
                    sys.stdout = stdout

        # Print all test outputs on one line
        # print(f"{WHITE}{BOLD}Test outputs: {join(test_outputs)}{ENDC}")
        print(f"{WHITE}Test outputs: {', '.join(test_outputs)}{ENDC}")

        if all_passed:
            print(f"Tests passed: ✅")
            return True, "All tests passed!\n" + "\n".join(debug_info), test_results
        print(f"Tests passed: ❌")
        return False, "Some tests failed", test_results
    except Exception as e:
        if 'sys' in locals():  # Restore stdout if it was changed
            sys.stdout = stdout
        print(f"\n{RED}Error in validate_with_debug: {str(e)}{ENDC}")
        return False, f"Unexpected error: {str(e)}", test_results

def test_fibonacci():
    question = """Write a Python function named EXACTLY 'fibonacci' (not fibonacci_dp or any other name) that returns the nth Fibonacci number.
The function signature must be: def fibonacci(n)

Requirements:
1. Handle edge cases:
   - For n = 0, return 0
   - For n = 1 or n = 2, return 1
   - For negative numbers, return -1
2. For n > 2: F(n) = F(n-1) + F(n-2)
3. Use dynamic programming or memoization for efficiency
4. Do NOT use any print statements - just return the values

Example sequence: 0,1,1,2,3,5,8,13,21,...
Example calls:
- fibonacci(6) returns 8
- fibonacci(0) returns 0
- fibonacci(-1) returns -1"""

    test_cases = [
        (0, 0),     # Edge case: n = 0
        (1, 1),     # Edge case: n = 1
        (2, 1),     # Edge case: n = 2
        (6, 8),     # Regular case
        (10, 55),   # Larger number
        (-1, -1),   # Edge case: negative input
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'fibonacci', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_binary_search():
    question = """Write a Python function named EXACTLY 'binary_search' that performs binary search on a sorted list.
The function signature must be: def binary_search(arr, target)

Requirements:
1. The function takes two arguments:
   - arr: a sorted list of integers
   - target: the integer to find
2. Return the index of the target if found
3. Return -1 if the target is not in the list
4. Do NOT use any print statements - just return the values

Example:
- binary_search([1,2,3,4,5], 3) returns 2
- binary_search([1,2,3,4,5], 6) returns -1"""

    test_cases = [
        (([1,2,3,4,5], 3), 2),    # Regular case: target in middle
        (([1,2,3,4,5], 1), 0),    # Edge case: target at start
        (([1,2,3,4,5], 5), 4),    # Edge case: target at end
        (([1,2,3,4,5], 6), -1),   # Edge case: target not in list
        (([], 1), -1),            # Edge case: empty list
        (([1], 1), 0),            # Edge case: single element list
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'binary_search', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_palindrome():
    question = """Write a Python function named EXACTLY 'is_palindrome' that checks if a string is a palindrome.
The function signature must be: def is_palindrome(s)

Requirements:
1. The function takes one argument:
   - s: a string to check
2. Return True if the string is a palindrome, False otherwise
3. Ignore case (treat uppercase and lowercase as the same)
4. Ignore non-alphanumeric characters (spaces, punctuation)
5. Do NOT use any print statements - just return the values

Example:
- is_palindrome("A man, a plan, a canal: Panama") returns True
- is_palindrome("race a car") returns False"""

    test_cases = [
        ("A man, a plan, a canal: Panama", True),   # Regular case with punctuation
        ("race a car", False),                      # Regular case, not palindrome
        ("", True),                                 # Edge case: empty string
        ("a", True),                                # Edge case: single character
        ("Was it a car or a cat I saw?", True),     # Complex case with punctuation
        ("hello", False),                           # Simple case, not palindrome
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'is_palindrome', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_anagram():
    question = """Write a Python function named EXACTLY 'are_anagrams' that checks if two strings are anagrams.
The function signature must be: def are_anagrams(str1, str2)

Requirements:
1. The function takes two arguments:
   - str1: first string
   - str2: second string
2. Return True if the strings are anagrams, False otherwise
3. Ignore case (treat uppercase and lowercase as the same)
4. Ignore spaces
5. Consider only alphanumeric characters
6. Do NOT use any print statements - just return the values

Example:
- are_anagrams("listen", "silent") returns True
- are_anagrams("hello", "world") returns False"""

    test_cases = [
        (("listen", "silent"), True),          # Regular case
        (("hello", "world"), False),           # Not anagrams
        (("", ""), True),                      # Edge case: empty strings
        (("a", "a"), True),                    # Edge case: single char
        (("Debit Card", "Bad Credit"), True),  # Case and space test
        (("Python", "Java"), False),           # Different lengths
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'are_anagrams', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

# List of all test cases
CODING_QUESTIONS = [
    test_fibonacci(),
    test_binary_search(),
    test_palindrome(),
    test_anagram()
]

# Test name constants (note: the four keys below are identical, so only the last
# mapping survives; the code actually uses get_test_name() instead)
TEST_NAMES = {
    "Write a Python func": "Fibonacci",
    "Write a Python func": "Binary Search",
    "Write a Python func": "Palindrome",
    "Write a Python func": "Anagram Check"
}

def get_test_name(question: str) -> str:
    """Get a friendly name for the test based on the question."""
    if "fibonacci" in question.lower():
        return "Fibonacci"
    elif "binary_search" in question.lower():
        return "Binary Search"
    elif "palindrome" in question.lower():
        return "Palindrome"
    elif "anagram" in question.lower():
        return "Anagram Check"
    return question[:20] + "..."

def get_model_stats(model_name: str, question_tuple: tuple, server_url: str) -> Dict:
    """
    Get performance statistics for a specific model and validate the response.
    """
    question, validator, test_cases = question_tuple
    timer = Timer()
    results = {
        'model': model_name,
        'total_duration': 0,
        'tokens_per_second': 0,
        'code_valid': False,
        'tests_passed': False,
        'error': None,
        'test_results': []
    }

    try:
        timer.start()
        print(f'{WHITE}Requesting code from {server_url} with {model_name}{ENDC}')
        response = requests.post(
            f"{server_url}/api/chat",
            json={
                "model": model_name,
                "messages": [{'role': 'user', 'content': question}],
                "stream": False
            }
        ).json()
        timer.stop()

        # Get performance metrics from response
        total_tokens = response.get('eval_count', 0)
        total_duration = response.get('total_duration', 0)
        total_response_time = float(total_duration) / 1e9

        results['total_duration'] = total_response_time
        if total_tokens > 0 and total_response_time > 0:
            results['tokens_per_second'] = total_tokens / total_response_time

        # Print concise performance metrics
        print(f"Total Duration (s): {total_response_time:.2f} / Total Tokens: {total_tokens} / Tokens per Second: {results['tokens_per_second']:.2f}")

        # Extract code from response
        if 'message' in response and 'content' in response['message']:
            code = extract_code_from_response(response['message']['content'])

            # Validate code
            results['code_valid'] = is_valid_python(code)

            if results['code_valid']:
                print(f"Code validation: ✅")
                # Get validation results
                print(f'{WHITE}Running tests...{ENDC}')
                for test_case in CODING_QUESTIONS:
                    if test_case[0] == question:  # Found matching test case
                        function_name = get_function_name_from_question(question)
                        test_cases = test_case[2]  # Get test cases from tuple
                        success, debug_info, test_results = validate_with_debug(code, function_name, test_cases, model_name)
                        results['tests_passed'] = success
                        results['test_results'] = test_results
                        break
            else:
                print(f"Code validation: ❌")

        else:
            results['error'] = f"Unexpected response format: {response}"

    except Exception as e:
        print(f"\n{RED}Error in get_model_stats: {str(e)}{ENDC}")
        results['error'] = str(e)

    return results

def get_function_name_from_question(question: str) -> str:
    """Extract the required function name from the question text."""
    if "fibonacci" in question.lower():
        return "fibonacci"
    elif "binary_search" in question.lower():
        return "binary_search"
    elif "palindrome" in question.lower():
        return "is_palindrome"
    elif "anagram" in question.lower():
        return "are_anagrams"
    return ""

def run_model_benchmark(model: str, server_url: str, num_runs: int = 4) -> Dict:
    """
    Run multiple benchmarks for a model and calculate average metrics.
    """
    metrics = []

    for i in range(num_runs):
        print(f"\n{YELLOW}[{model}] Run {i+1}/{num_runs}:{ENDC}")

        run_results = {}
        for question_tuple in CODING_QUESTIONS:
            test_name = get_test_name(question_tuple[0])
            print(f"\n{BOLD}Testing {test_name}...{ENDC}")
            try:
                result = get_model_stats(model, question_tuple, server_url)
                # Count actual passed cases from test results
                result['passed_cases'] = len([r for r in result.get('test_results', []) if r])
                result['total_cases'] = len(question_tuple[2])
                run_results[test_name] = result
            except Exception as e:
                print(f"Error in run {i+1}: {e}")
                continue

        if run_results:
            metrics.append(run_results)

    # Take only the last 3 runs for averaging
    metrics = metrics[-3:]
    num_runs_used = len(metrics)  # Actual number of runs used

    if not metrics:
        return {}

    # Aggregate results
    aggregated = {
        'model': model,
        'total_duration': mean([m[list(m.keys())[0]]['total_duration'] for m in metrics if m]),
        'tokens_per_second': mean([m[list(m.keys())[0]]['tokens_per_second'] for m in metrics if m]),
        'test_results': {}
    }

    # Calculate results per test
    for test_name in metrics[-1].keys():
        # Sum up actual passed cases for this test across runs
        passed_cases = sum(m[test_name]['passed_cases'] for m in metrics)
        # Calculate total possible cases (6 cases × number of actual runs)
        total_possible_cases = 6 * num_runs_used

        success_rate = (passed_cases / total_possible_cases * 100)
        status = '✅' if success_rate == 100 else '❌'
        print(f"{test_name}: {status} ({passed_cases}/{total_possible_cases} cases)")

        aggregated['test_results'][test_name] = {
            'success_rate': success_rate,
            'passed_cases': passed_cases,
            'total_cases': total_possible_cases,
            'success_cases_rate': passed_cases / total_possible_cases,
            'avg_duration': mean([m[test_name]['total_duration'] for m in metrics]),
            'avg_tokens_sec': mean([m[test_name]['tokens_per_second'] for m in metrics])
        }

    # Calculate overall success rate across all tests
    total_passed = sum(t['passed_cases'] for t in aggregated['test_results'].values())
    total_cases = sum(t['total_cases'] for t in aggregated['test_results'].values())
    aggregated['overall_success_rate'] = (total_passed / total_cases * 100) if total_cases > 0 else 0
    aggregated['overall_success_cases_rate'] = (total_passed / total_cases) if total_cases > 0 else 0

    return aggregated

def print_leaderboard(results: List[Dict]):
    """Print leaderboard of model results."""
    if not results:
        print("No results to display")
        return

    # Sort by success rate first, then by tokens per second
    sorted_results = sorted(results, key=lambda x: (
        sum(t['passed_cases'] for t in x['test_results'].values()) / sum(t['total_cases'] for t in x['test_results'].values()) if sum(t['total_cases'] for t in x['test_results'].values()) > 0 else 0,
        x['tokens_per_second']
    ), reverse=True)

    print(f"\n{HEADER}{BOLD}🏆 Final Model Leaderboard:{ENDC}")
    for i, result in enumerate(sorted_results, 1):
        # Calculate stats for each model
        total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
        total_cases = sum(t['total_cases'] for t in result['test_results'].values())
        success_rate = (total_passed / total_cases * 100) if total_cases > 0 else 0

        print(f"\n{BOLD}{YELLOW}{result['model']}{ENDC}")
        print(f"   {BOLD}Overall Success Rate:{ENDC} {success_rate:.1f}% ({total_passed}/{total_cases} cases)")
        print(f"   {BOLD}Average Tokens/sec:{ENDC} {result['tokens_per_second']:.2f}")
        print(f"   {BOLD}Average Duration:{ENDC} {result['total_duration']:.2f}s")
        print(f"   {BOLD}Test Results:{ENDC}")
        for test_name, test_result in result['test_results'].items():
            status = '✅' if test_result['success_rate'] == 100 else '❌'
            print(f"   - {test_name}: {status} {test_result['passed_cases']}/{test_result['total_cases']} cases ({test_result['success_rate']:.1f}%)")

def get_available_models(server_url: str) -> List[str]:
    """Get list of available models from the specified Ollama server."""
    try:
        response = requests.get(f"{server_url}/api/tags").json()
        return [model['name'] for model in response['models']]
    except Exception as e:
        print(f"{RED}Error getting model list from {server_url}: {e}{ENDC}")
        return []

def get_model_details(model_name):
    """Print the output of `ollama show <model>` in a lightly formatted way."""
    try:
        result = subprocess.run(
            ["ollama", "show", model_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            encoding='utf-8',
            errors='replace'
        )

        if result.returncode != 0:
            print(f"Error: {result.stderr.strip()}")
            return None

        if not result.stdout.strip():
            print(f"No details available for model: {model_name}")
            return None

        raw_output = result.stdout.strip()
        lines = raw_output.split('\n')
        current_section = None

        for line in lines:
            line = line.rstrip()
            if line and not line.startswith(' '):  # Section headers
                current_section = line.strip()
                print(f"\n  {current_section}")
            elif line and current_section:  # Section content
                # Split by multiple spaces and filter out empty parts
                parts = [part for part in line.split('  ') if part.strip()]
                if len(parts) >= 2:
                    key, value = parts[0].strip(), parts[-1].strip()
                    # Ensure consistent spacing for alignment
                    print(f"    {key:<16} {value}")
                elif len(parts) == 1:
                    # Handle single-value lines (like license text)
                    print(f"    {parts[0].strip()}")

        return None  # Details are printed directly; nothing to return

    except Exception as e:
        print(f"An error occurred while getting model details: {e}")
        return None

def update_server_results(server_url: str, results: List[Dict]) -> None:
    """Append this run's results to the per-server JSON file in benchmark_results/."""
    try:
        # Get CPU brand and format it for filename
        cpu_info = get_cpu_info()
        cpu_brand = cpu_info.get('brand_raw', 'Unknown_CPU').replace(' ', '_')
        timestamp = time.strftime("%Y%m%d_%H%M%S")

        # Create a unique filename for this server's results
        server_id = server_url.replace('http://', '').replace(':', '_').replace('/', '_')
        results_dir = "benchmark_results"

        os.makedirs(results_dir, exist_ok=True)

        # Include CPU brand in filename
        filename = os.path.join(results_dir, f"{cpu_brand}_{server_id}.json")

        # Load existing results or create new file
        try:
            with open(filename, 'r') as f:
                existing_data = json.load(f)
        except FileNotFoundError:
            existing_data = {
                'server_url': server_url,
                'benchmarks': []
            }

        # Add new results with timestamp and ensure overall success rate is included
        benchmark_entry = {
            'timestamp': timestamp,
            'results': []
        }

        # Add overall success rate to each model's results
        for result in results:
            total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
            total_cases = sum(t['total_cases'] for t in result['test_results'].values())
            result['overall_success_rate'] = (total_passed / total_cases * 100) if total_cases > 0 else 0
            benchmark_entry['results'].append(result)

        existing_data['benchmarks'].append(benchmark_entry)

        # Save updated results
        with open(filename, 'w') as f:
            json.dump(existing_data, f, indent=2)
        print(f"{GREEN}Successfully saved results to {filename}{ENDC}")
    except Exception as e:
        print(f"{RED}Failed to save results: {str(e)}{ENDC}")

def main():
    global verbose

    parser = argparse.ArgumentParser(description='Run Ollama model benchmarks')
    parser.add_argument('--server', choices=['local', 'z60'], default='local',
                        help='Choose Ollama server (default: local)')
    parser.add_argument('--model', type=str, help='Specific model to benchmark')
    parser.add_argument('--number', type=str, help='Number of models to benchmark (number or "all")')
    parser.add_argument('--verbose', action='store_true', help='Enable verbose output')
    args = parser.parse_args()

    verbose = args.verbose  # make --verbose take effect in the helpers above
    server_url = SERVERS[args.server]

    print()
    print(f"{HEADER}{BOLD}CPU Information:{ENDC}")
    cpu_info = get_cpu_info()
    for key, value in cpu_info.items():
        print(f"{MUTED}{key}: {value}{ENDC}")

    print()
    print(f"{INFO}Using Ollama server at {server_url}...{ENDC}")

    # Get available models or use specified model
    if args.model:
        models = [args.model]
    else:
        models = get_available_models(server_url)

    if not models:
        print(f"{RED}No models found on server {server_url}. Exiting.{ENDC}")
        return

    # Handle number of models to test
    if args.number and args.number.lower() != 'all':
        try:
            num_models = int(args.number)
            if num_models > 0:
                models = models[:num_models]
            else:
                print(f"{WARNING}Invalid number of models. Using all available models.{ENDC}")
        except ValueError:
            print(f"{WARNING}Invalid number format. Using all available models.{ENDC}")

    print(f"{INFO}Testing {len(models)} models:{ENDC}")
    for i, model in enumerate(models, 1):
        print(f"{YELLOW}{i}. {model}{ENDC}")

    # Run benchmarks
    all_results = []

    for model in models:
        print(f"\n{HEADER}{BOLD}Benchmarking {model}...{ENDC}")
        details = get_model_details(model)
        if details:
            print(f"\n{INFO}Model Details:{ENDC}")
            if "details" in details:
                for section, items in details["details"].items():
                    print(f"\n{BOLD}{section}{ENDC}")
                    for key, value in items.items():
                        print(f"  {key}: {value}")
            else:
                print(json.dumps(details, indent=2))
        result = run_model_benchmark(model, server_url)
        if 'error' not in result:
            all_results.append(result)

    # Print and save results
    print_leaderboard(all_results)
    update_server_results(server_url, all_results)

if __name__ == "__main__":
    main()

requirements.txt (6 lines added)

requests>=2.31.0
together>=0.2.8
ollama>=0.1.6
python-dotenv>=1.0.0
GPUtil==1.4.0
py-cpuinfo