first commit

parent 92d4b26ac2
commit a3b06718a2

README.md
@@ -1,3 +1,129 @@

# Codebench - Ollama Model Benchmark Tool

A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks.

## Features

- Test multiple Ollama models against common coding problems
- Measure performance metrics (tokens/sec, response time)
- Track success rates across different coding challenges
- Support for local and remote Ollama servers
- Detailed test results and leaderboard generation
- CPU information tracking for benchmarks

## Prerequisites

- Python 3.8+
- Ollama server (local or remote)
- Together API key (optional, for advanced code analysis)

## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/codebench.git
cd codebench
```

2. Install required packages:

```bash
pip install -r requirements.txt
```

3. (Optional) Set up the Together API:

```bash
export TOGETHER_API_KEY='your_api_key_here'
```

## Usage

Basic usage:

```bash
python3 main.py
```

Available options:

```bash
python main.py --server [local|z60] --model [model_name] --number [count|all] --verbose
```

## Arguments

- `--server`: Choose Ollama server (default: local)
- `--model`: Test a specific model only
- `--number`: Number of models to test
- `--verbose`: Enable detailed output

## Supported Tests

The tool currently tests models on these coding challenges (each is wired up in `main.py` as shown below):

1. Fibonacci Sequence
2. Binary Search
3. Palindrome Check
4. Anagram Detection
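
Each entry in `main.py`'s `CODING_QUESTIONS` list is a `(question, validator, test_cases)` tuple. A self-contained miniature of that shape (the helper name and the shortened prompt below are illustrative; the real validators delegate to `validate_with_debug`):

```python
# Sketch only: mirrors the (question, validator, test_cases) layout used in main.py.
def make_fibonacci_challenge():
    question = "Write a Python function named EXACTLY 'fibonacci' ..."  # abbreviated prompt
    test_cases = [(0, 0), (1, 1), (6, 8), (-1, -1)]                      # (input, expected)

    def validate(code: str) -> bool:
        namespace = {}
        exec(code, namespace)                     # run the candidate solution
        fn = namespace.get("fibonacci")
        return fn is not None and all(fn(x) == want for x, want in test_cases)

    return question, validate, test_cases

if __name__ == "__main__":
    question, validate, cases = make_fibonacci_challenge()
    sample = (
        "def fibonacci(n):\n"
        "    if n < 0: return -1\n"
        "    if n < 2: return n\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n - 1):\n"
        "        a, b = b, a + b\n"
        "    return b\n"
    )
    print(validate(sample))  # True
```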

## Test Process & Validation

### Code Generation

1. Each model is prompted with specific coding tasks
2. Generated code is extracted from the model's response
3. Initial syntax validation is performed
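
Steps 2 and 3 correspond to two small helpers in `main.py`, which look essentially like this:

```python
import ast
import re

def extract_code_from_response(response: str) -> str:
    """Pull the first python-fenced code block out of a markdown-formatted answer."""
    code_blocks = re.findall(r'```python\n(.*?)```', response, re.DOTALL)
    return code_blocks[0].strip() if code_blocks else response

def is_valid_python(code: str) -> bool:
    """Step 3: reject responses that do not even parse as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```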

### Test Validation

For each test case:

- Input values are provided to the function
- Output is compared with expected results
- Test results are marked as ✅ (pass) or ❌ (fail)

Example test cases:

```plaintext
Fibonacci:
- Input: 6   Expected: 8
- Input: 0   Expected: 0
- Input: -1  Expected: -1

Binary Search:
- Input: ([1,2,3,4,5], 3)  Expected: 2
- Input: ([], 1)           Expected: -1
- Input: ([1], 1)          Expected: 0
```
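
Under the hood this is an exec-and-compare loop. A stripped-down sketch of what `validate_with_debug` in `main.py` does (the name `run_test_cases` is illustrative; the real function also captures stdout and asks the Together API for an opinion on failures):

```python
def run_test_cases(code: str, function_name: str, test_cases):
    """Return one boolean per test case, mirroring validate_with_debug's test_results."""
    namespace = {}
    exec(code, namespace)                         # execute the model's generated code
    function = namespace[function_name]
    results = []
    for test_input, expected in test_cases:
        args = test_input if isinstance(test_input, tuple) else (test_input,)
        try:
            results.append(function(*args) == expected)
        except Exception:
            results.append(False)                 # an exception counts as a failed case
    return results
```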

## Output

Results are saved in the `benchmark_results` directory with the following naming convention:

```plaintext
[CPU_Model]_[Server_Address].json
```

Example:

```plaintext
Apple_M1_Pro_localhost_11434.json
```
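
The name is assembled from the CPU brand string and the server address, roughly as `update_server_results` in `main.py` does (the helper name here is illustrative):

```python
import os
from cpuinfo import get_cpu_info  # provided by the py-cpuinfo package

def results_filename(server_url: str, results_dir: str = "benchmark_results") -> str:
    cpu_brand = get_cpu_info().get('brand_raw', 'Unknown_CPU').replace(' ', '_')
    server_id = server_url.replace('http://', '').replace(':', '_').replace('/', '_')
    return os.path.join(results_dir, f"{cpu_brand}_{server_id}.json")

# e.g. results_filename('http://localhost:11434')
# -> 'benchmark_results/Apple_M1_Pro_localhost_11434.json' on an M1 Pro machine
```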

## Server Configuration

Default servers are configured in the code:

- Local: http://localhost:11434
- Z60: http://192.168.196.60:11434
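
The defaults live in a small dictionary in `main.py`; edit it (or add entries) to point at your own hosts:

```python
# Server configurations (from main.py)
SERVERS = {
    'local': 'http://localhost:11434',
    'z60': 'http://192.168.196.60:11434'
}
```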

## Example Output

```plaintext
🏆 Final Model Leaderboard:

codellama:13b
  Overall Success Rate: 95.8% (23/24 cases)
  Average Tokens/sec: 145.23
  Average Duration: 2.34s
  Test Results:
  - Fibonacci: ✅ 6/6 cases (100.0%)
  - Binary Search: ✅ 6/6 cases (100.0%)
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

[Your chosen license]

benchmark_results/.DS_Store (binary file not shown)

devbook.md
@@ -0,0 +1,222 @@

# Ollama Testing Framework Documentation

Version: 1.0
Last Updated: 2025-02-23

## Overview

The Ollama Testing Framework is designed to benchmark and validate different Ollama models on coding tasks. It evaluates models based on:

1. Code correctness (test cases)
2. Performance metrics (inference time, tokens/sec)
3. Consistency across multiple runs

## Goals

1. Validate model responses for correctness and functionality
2. Measure and compare performance across different models
3. Provide detailed insights into model behavior and reliability
4. Enable easy comparison through a leaderboard system

## Core Components

### 1. Test Suite

The test suite consists of multiple coding challenges, each with:

- A clear problem description
- A validator function
- Multiple test cases
- Expected outputs

Current test cases:

a) Fibonacci Sequence
- Tests edge cases (negative, zero)
- Tests standard cases (n=1 to n=10)
- Validates performance for larger inputs

b) Binary Search
- Tests empty list case
- Tests element not found
- Tests finding elements at different positions

c) Palindrome Check
- Tests empty string
- Tests single character
- Tests various palindrome and non-palindrome cases

d) Anagram Check
- Tests empty strings
- Tests case sensitivity
- Tests strings with spaces and special characters

### 2. Inference Pipeline

#### Request Flow:

1. Format prompt with problem description
2. Send to Ollama API with timing start
3. Receive response and stop timing
4. Extract code from response
5. Validate code syntax
6. Run test cases
7. Calculate performance metrics
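
The request itself is a single POST to the Ollama chat endpoint. A condensed sketch of what `get_model_stats` in `main.py` sends (the helper name is illustrative; the real function also times the call and validates the reply):

```python
import requests

def request_solution(server_url: str, model: str, question: str) -> dict:
    """Steps 1-3: send the prompt to the Ollama /api/chat endpoint and return the raw JSON reply."""
    return requests.post(
        f"{server_url}/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": question}],
            "stream": False,
        },
    ).json()
```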

#### Performance Metrics:

- Total Duration (s): Time from request to response completion
- Total Tokens: Number of tokens in the response (eval_count)
- Tokens per Second: Processing speed (tokens/duration)
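
These come straight from fields of the Ollama response; `main.py` derives them roughly as follows (note that `total_duration` is reported in nanoseconds, hence the division by 1e9):

```python
def performance_metrics(response: dict) -> dict:
    """Derive the three metrics from an /api/chat JSON response (illustrative helper)."""
    total_tokens = response.get("eval_count", 0)             # tokens generated
    duration_s = response.get("total_duration", 0) / 1e9     # nanoseconds -> seconds
    return {
        "total_duration": duration_s,
        "total_tokens": total_tokens,
        "tokens_per_second": total_tokens / duration_s if total_tokens and duration_s else 0,
    }
```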

### 3. Validation System

#### Code Validation:

1. Syntax check (is_valid_python)
2. Function name verification
3. Test case execution
4. Together API integration for failure analysis

#### Test Results:

- Individual test case results (pass/fail)
- Error messages and debug info
- Together API opinions on failures

### 4. Benchmarking System

#### Benchmark Process:

1. Run multiple iterations (default: 4 runs)
2. Use last 3 runs for final metrics
3. Calculate averages across runs
4. Store detailed results in JSON
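
A stripped-down sketch of that loop as `run_model_benchmark` in `main.py` implements it; the extra first run effectively serves as a warm-up, since only the last three are averaged (`run_once` stands in for one full pass over the test suite):

```python
from statistics import mean

def benchmark(model: str, run_once, num_runs: int = 4) -> dict:
    """run_once(model) must return {'total_duration': seconds, 'tokens_per_second': rate}."""
    runs = [run_once(model) for _ in range(num_runs)]
    runs = runs[-3:]                                  # keep only the last 3 runs
    return {
        "model": model,
        "total_duration": mean(r["total_duration"] for r in runs),
        "tokens_per_second": mean(r["tokens_per_second"] for r in runs),
    }
```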

#### Metrics Tracked:

- Success rate per test
- Overall success rate
- Average inference time
- Average tokens per second

### 5. Leaderboard System

#### Ranking Algorithm:

1. Primary sort: Overall success rate
   - Calculated as (total passed cases / total cases) across all tests
2. Secondary sort: Tokens per second
   - Higher speed breaks ties between equal success rates
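
In code, the ranking is a single `sorted()` call over the per-model stats, as both `main.py` and `lboard.py` do:

```python
model_stats = [
    {"model": "a", "overall_success_rate": 95.8, "tokens_per_second": 145.2},
    {"model": "b", "overall_success_rate": 95.8, "tokens_per_second": 210.0},
]
sorted_stats = sorted(
    model_stats,
    key=lambda s: (s["overall_success_rate"], s["tokens_per_second"]),
    reverse=True,  # highest success rate first; tokens/sec breaks ties
)
print([s["model"] for s in sorted_stats])  # ['b', 'a']
```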

#### Display Format:

```
🏆 Model Leaderboard:
1. model_name
   Overall Success Rate: XX.X% (passed/total cases)
   Average Tokens/sec: XX.XX
   Average Duration: XX.XXs
   Test Results:
   - Test1: ✅/❌ passed/total cases (success_rate%)
   - Test2: ✅/❌ passed/total cases (success_rate%)
```

## Usage

### Basic Run:

```bash
python3.10 main.py --server 'z60' --number '2'
```

### Options:

- `--model`: Specify a single model to test
- `--server`: Custom Ollama server URL
- `--runs`: Number of benchmark runs

### Output:

1. Real-time test progress
2. Performance metrics per inference
3. Test results summary
4. Final leaderboard
5. JSON results file with timestamp

## Results Storage

- Results saved as: model_benchmark_YYYYMMDD_HHMMSS.json
- Contains full test details and metrics
- Enables historical comparison and analysis
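
For the comparison step, `lboard.py` picks up the newest results file by modification time when it is run without an explicit path:

```python
import glob
import os

def get_latest_json_file(directory):
    """Return the most recently modified .json file in the directory, or None."""
    json_files = glob.glob(os.path.join(directory, '*.json'))
    return max(json_files, key=os.path.getmtime) if json_files else None
```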

## Error Handling

1. API communication errors
2. Code execution timeouts
3. Invalid responses
4. Test case failures
5. Performance metric calculation errors

## Future Improvements

1. Add more diverse test cases
2. Implement parallel testing
3. Add memory usage tracking
4. Create historical performance trends
5. Add code quality metrics

## Leaderboard Data Processing

### Test Results Processing

- Processes only the latest benchmark results
- Determines maximum test cases from successful runs
- Handles validation failures as complete failures (0 passed cases)
- Uses dynamic test case counting based on actual successful runs
- Maintains consistent test case counting across all scenarios

### Success Rate Calculations

- Calculates success rates based on expected total cases
- Counts failed validations as 0/expected_cases
- Uses maximum observed test cases as the baseline
- Includes validation status in success rate reporting
- Prevents skipping of failed validations in total counts

### Performance Metrics

- Tracks tokens per second from model responses
- Measures total duration across all tests
- Calculates success rate vs duration ratio
- Excellence criteria: >90% success AND success_rate > 5*duration
- Prevents duplicate model entries in metrics
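
Expressed as a predicate, the stated excellence criterion amounts to the following sketch (success rate in percent, duration in seconds; the function name is illustrative):

```python
def is_excellent(success_rate: float, duration: float) -> bool:
    """Stated rule: >90% success AND success_rate > 5 * duration."""
    return success_rate > 90 and success_rate > 5 * duration
```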

### Data Visualization

The plotting details are covered in the Development Notes below.

## Development Notes

### Project Structure

- `main.py`: Core benchmarking functionality
- `lboard.py`: Leaderboard visualization and results analysis
- `benchmark_results/`: Directory containing JSON benchmark results

### Visualization Features

- Blue bars: Tokens per second performance
- Red + markers: Overall success rate (%)
- Green - markers: Total duration (seconds)
- Green model names: Models with >90% success rate
- Triple y-axis plot for easy metric comparison

### Running the Leaderboard

```bash
# View latest results
python lboard.py

# View specific results file
python lboard.py path/to/results.json
```

The generated comparison plot has:

- A single plot per model (no duplicates)
- Color-coded bars based on performance
- Success rate indicators (red +)
- Duration indicators (green -)
- Dynamic axis scaling
- A combined legend for all metrics

### Output Format

- Detailed test results per model
- Individual test case counts
- Validation status indicators
- Overall success rates
- Performance metrics
- Plot position information

```plaintext
Model: model_name
├─ Tokens/sec: XX.XX
├─ Total Duration: XX.XXs
├─ Test Results:
│  ├─ Test1: X/Y cases (ZZ.Z%) [validation status]
│  ├─ Test2: X/Y cases (ZZ.Z%) [validation status]
└─ Overall Success Rate: X/Y (ZZ.Z%)
```

### Error Handling

- Handles missing test results
- Processes validation failures appropriately
- Maintains consistent case counting
- Prevents data duplication
- Ensures accurate success rate calculations

lboard.py
@@ -0,0 +1,144 @@

import json
import os
import argparse
import glob
import matplotlib.pyplot as plt

def get_latest_json_file(directory):
    json_files = glob.glob(os.path.join(directory, '*.json'))
    print(f"Found JSON files: {json_files}")
    latest_file = max(json_files, key=os.path.getmtime) if json_files else None
    return latest_file

def calculate_model_stats(model_result):
    """Calculate average stats for a model from its test results."""
    test_results = model_result['test_results']

    # Calculate overall success rate (average of all test success rates)
    success_rates = [test['success_rate'] for test in test_results.values()]
    overall_success_rate = sum(success_rates) / len(success_rates)

    return {
        'model': model_result['model'],
        'overall_success_rate': overall_success_rate,
        'tokens_per_second': model_result['tokens_per_second'],
        'total_duration': model_result['total_duration'],
        'test_results': test_results
    }

def plot_model_comparison(model_stats):
    """Plot model comparison with dual y-axes for tokens/sec and success rate."""
    models = [stat['model'] for stat in model_stats]
    token_speeds = [stat['tokens_per_second'] for stat in model_stats]
    success_rates = [stat['overall_success_rate'] for stat in model_stats]
    durations = [stat['total_duration'] for stat in model_stats]

    # Create figure and primary axis
    fig, ax1 = plt.subplots(figsize=(15, 8))

    # Plot tokens/sec bars on primary y-axis with lighter blue and more transparency
    bars = ax1.bar(models, token_speeds, color='royalblue', alpha=0.3)
    ax1.set_ylabel('Tokens per Second', color='blue')
    ax1.tick_params(axis='y', labelcolor='blue')

    # Create secondary y-axis for success rate
    ax2 = ax1.twinx()
    ax2.plot(models, success_rates, 'r+', markersize=15, label='Success Rate', linestyle='None')
    ax2.set_ylabel('Success Rate (%)', color='red')
    ax2.tick_params(axis='y', labelcolor='red')
    ax2.set_ylim(0, 100)

    # Create third y-axis for duration
    ax3 = ax1.twinx()
    ax3.spines['right'].set_position(('outward', 60))  # Move third axis outward
    ax3.plot(models, durations, 'g_', markersize=15, label='Duration', linestyle='None')
    ax3.set_ylabel('Duration (s)', color='green')
    ax3.tick_params(axis='y', labelcolor='green')

    # Customize x-axis labels with proper rotation
    ax1.set_xticks(range(len(models)))
    ax1.set_xticklabels(models, rotation=45, ha='right', rotation_mode='anchor')
    for i, model in enumerate(models):
        # Shorten model names by removing common suffixes
        short_name = model.replace(':latest', '').replace('-uncensored', '')
        ax1.get_xticklabels()[i].set_text(short_name)
        if success_rates[i] > 90:
            ax1.get_xticklabels()[i].set_color('green')

    # Adjust layout to prevent label cutoff
    plt.subplots_adjust(bottom=0.25, left=0.1, right=0.85)

    '''
    # Add value labels
    for i, bar in enumerate(bars):
        ax1.text(i, token_speeds[i], f'{token_speeds[i]:.1f}',
                 ha='center', va='bottom', color='black')
        ax2.text(i, success_rates[i], f'{success_rates[i]:.1f}%',
                 ha='center', va='bottom', color='black')
        ax3.text(i, durations[i], f'{durations[i]:.1f}s',
                 ha='center', va='top', color='black')
    '''
    plt.title('Model Performance Comparison')
    plt.tight_layout()

    # Save before show(): closing the interactive window can otherwise leave an empty canvas to write out.
    plt.savefig('benchmark_results/model_comparison.png')
    print("\nPlot saved as 'benchmark_results/model_comparison.png'")
    plt.show()

def print_leaderboard(benchmark_data):
    """Print leaderboard from benchmark results."""
    if not benchmark_data.get('benchmarks'):
        print("No benchmark data to display")
        return

    # Get the latest benchmark results
    latest_benchmark = benchmark_data['benchmarks'][-1]
    model_results = latest_benchmark['results']

    # Calculate stats and sort models
    model_stats = [calculate_model_stats(model) for model in model_results]
    sorted_stats = sorted(model_stats,
                          key=lambda x: (x['overall_success_rate'], x['tokens_per_second']),
                          reverse=True)

    print(f"\n🏆 Final Model Leaderboard:")
    for stats in sorted_stats:
        print(f"\n{stats['model']}")
        print(f" Overall Success Rate: {stats['overall_success_rate']:.1f}%")
        print(f" Average Tokens/sec: {stats['tokens_per_second']:.2f}")
        print(f" Average Duration: {stats['total_duration']:.2f}s")
        print(f" Test Results:")

        for test_name, test_result in stats['test_results'].items():
            status = '✅' if test_result['success_rate'] == 100 else '❌'
            print(f" - {test_name}: {status} {test_result['success_rate']:.1f}%")

    # Generate visualization
    plot_model_comparison(sorted_stats)

def main():
    parser = argparse.ArgumentParser(description='Display benchmark leaderboard')
    parser.add_argument('filepath', nargs='?', help='Path to benchmark results JSON file')
    parser.add_argument('--file', type=str, help='Path to benchmark results JSON file (alternative way)')
    args = parser.parse_args()

    try:
        # Use filepath if provided, then --file, otherwise find latest
        if args.filepath:
            json_file = args.filepath
        elif args.file:
            json_file = args.file
        else:
            json_file = get_latest_json_file('benchmark_results')
            if not json_file:
                print("No benchmark results found")
                return

        with open(json_file, 'r') as f:
            benchmark_data = json.load(f)
        print(f"Using benchmark file: {json_file}")
        print_leaderboard(benchmark_data)
    except Exception as e:
        print(f"Error loading benchmark data: {e}")

if __name__ == "__main__":
    main()

main.py
@@ -0,0 +1,732 @@

import ollama
import time
from typing import List, Dict, Any
import json
from statistics import mean
import re
import ast
import argparse
import requests
import os
from together import Together
from cpuinfo import get_cpu_info
import subprocess


# Verbose flag; toggled by the --verbose CLI argument in main()
verbose = False

# ANSI color codes
SUCCESS = '\033[38;5;78m'    # Soft mint green for success
ERROR = '\033[38;5;203m'     # Soft coral red for errors
INFO = '\033[38;5;75m'       # Sky blue for info
HEADER = '\033[38;5;147m'    # Soft purple for headers
WARNING = '\033[38;5;221m'   # Warm gold for warnings
EMPHASIS = '\033[38;5;159m'  # Cyan for emphasis
MUTED = '\033[38;5;246m'     # Subtle gray for less important text
ENDC = '\033[0m'
BOLD = '\033[1m'

# Replace existing color usages
GREEN = SUCCESS
RED = ERROR
BLUE = INFO
YELLOW = WARNING
WHITE = MUTED

# Server configurations
SERVERS = {
    'local': 'http://localhost:11434',
    'z60': 'http://192.168.196.60:11434'
}

class Timer:
    def __init__(self):
        self.start_time = None
        self.end_time = None

    def start(self):
        self.start_time = time.time()

    def stop(self):
        self.end_time = time.time()

    def elapsed_time(self):
        if self.start_time is None:
            return 0
        if self.end_time is None:
            return time.time() - self.start_time
        return self.end_time - self.start_time

def extract_code_from_response(response: str) -> str:
    """Extract Python code from a markdown-formatted string."""
    code_blocks = re.findall(r'```python\n(.*?)```', response, re.DOTALL)
    if code_blocks:
        return code_blocks[0].strip()
    return response

def is_valid_python(code: str) -> bool:
    """Check if the code is valid Python syntax."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def analyze_failed_code(code: str, test_case: tuple, expected: any, actual: any, function_name: str, model: str) -> bool:
    """Analyze why code failed using Together API. Returns True if Together thinks the code should work."""
    prompt = f"""Analyze this Python code and explain why it failed the test case. Format your response EXACTLY as follows:

ASSESSMENT: [Write a one-line assessment: either "SHOULD PASS" or "SHOULD FAIL" followed by a brief reason]

ANALYSIS:
[Detailed analysis of why the code failed and how to fix it]

Code:
{code}

Test case:
Input: {test_case}
Expected output: {expected}
Actual output: {actual}
Function name required: {function_name}
Model: {model}"""

    try:
        TOGETHER_API_KEY = os.environ["TOGETHER_API_KEY"]
        together_client = Together(api_key=TOGETHER_API_KEY)
        response = together_client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
            messages=[
                {"role": "system", "content": "You are a Python expert analyzing code failures. Always format your response with ASSESSMENT and ANALYSIS sections."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=1000,
            temperature=0.7,
            top_p=0.7,
            top_k=50,
            repetition_penalty=1,
            stop=["<|eot_id|>", "<|eom_id|>"]
        )

        analysis = response.choices[0].message.content
        should_pass = "SHOULD PASS" in analysis.upper()
        if verbose: print(f"\n{BLUE}[{model}] Together Analysis:{ENDC}")
        if verbose: print(f"{GREEN if should_pass else RED}{analysis}{ENDC}")
        return should_pass
    except Exception as e:
        print(f"\n{RED}Error getting Together API analysis: {e}{ENDC}")
        return False

def validate_with_debug(code: str, function_name: str, test_cases: List[tuple], model: str) -> tuple[bool, str, List[bool]]:
    """Validate code with detailed debug information. Returns (success, debug_info, test_results)"""
    debug_info = []
    test_results = []  # Track individual test case results
    test_outputs = []  # Store test outputs for combined display

    try:
        # Create a local namespace
        namespace = {}
        debug_info.append(f"Executing code:\n{code}")

        try:
            # Redirect stdout to capture prints from the executed code
            import io
            import sys
            stdout = sys.stdout
            sys.stdout = io.StringIO()

            # Execute the code
            exec(code, namespace)

            # Restore stdout
            sys.stdout = stdout

        except Exception as e:
            if 'sys' in locals():  # Restore stdout if it was changed
                sys.stdout = stdout
            if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
            return False, f"Error executing code: {str(e)}", test_results

        if function_name not in namespace:
            if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
            together_opinion = analyze_failed_code(code, "N/A", f"Function named '{function_name}'",
                                                   f"Found functions: {list(namespace.keys())}", function_name, model)
            print(f"\nTests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")
            return False, f"Function '{function_name}' not found in code. Available names: {list(namespace.keys())}", test_results

        function = namespace[function_name]
        debug_info.append(f"Function {function_name} found")

        # Run test cases
        all_passed = True
        for i, (test_input, expected) in enumerate(test_cases):
            try:
                # Redirect stdout for each test case
                stdout = sys.stdout
                sys.stdout = io.StringIO()

                if isinstance(test_input, tuple):
                    result = function(*test_input)
                else:
                    result = function(test_input)

                # Restore stdout
                sys.stdout = stdout

                # Store result but don't print individually
                test_outputs.append(str(result))
                test_passed = result == expected
                test_results.append(test_passed)

                if not test_passed:
                    if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
                    print(f"\n{RED}Test case {i+1} failed:{ENDC}")
                    print(f"Input: {test_input} Expected: {expected} Got: {result}")

                    together_opinion = analyze_failed_code(code, test_input, expected, result, function_name, model)
                    print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")

                    all_passed = False
                    continue

                debug_info.append(f"Test case {i+1} passed: {test_input} → {result}")
            except Exception as e:
                if 'sys' in locals():  # Restore stdout if it was changed
                    sys.stdout = stdout
                test_outputs.append(f"Error: {str(e)}")
                if verbose: print(f"\n{RED}Failed code:{ENDC}\n{code}")
                print(f"\n{RED}{str(e)} in test case {i+1} Input: {test_input} Expected: {expected}")

                together_opinion = analyze_failed_code(code, test_input, expected, f"Error: {str(e)}", function_name, model)
                print(f"Tests passed: ❌ Together opinion: {'✅' if together_opinion else '❌'}")

                test_results.append(False)
                all_passed = False
                continue
            finally:
                if 'sys' in locals():  # Always restore stdout
                    sys.stdout = stdout

        # Print all test outputs on one line
        # print(f"{WHITE}{BOLD}Test outputs: {join(test_outputs)}{ENDC}")
        print(f"{WHITE}Test outputs: {', '.join(test_outputs)}{ENDC}")

        if all_passed:
            print(f"Tests passed: ✅")
            return True, "All tests passed!\n" + "\n".join(debug_info), test_results
        print(f"Tests passed: ❌")
        return False, "Some tests failed", test_results
    except Exception as e:
        if 'sys' in locals():  # Restore stdout if it was changed
            sys.stdout = stdout
        print(f"\n{RED}Error in validate_with_debug: {str(e)}{ENDC}")
        return False, f"Unexpected error: {str(e)}", test_results

def test_fibonacci():
    question = """Write a Python function named EXACTLY 'fibonacci' (not fibonacci_dp or any other name) that returns the nth Fibonacci number.
The function signature must be: def fibonacci(n)

Requirements:
1. Handle edge cases:
   - For n = 0, return 0
   - For n = 1 or n = 2, return 1
   - For negative numbers, return -1
2. For n > 2: F(n) = F(n-1) + F(n-2)
3. Use dynamic programming or memoization for efficiency
4. Do NOT use any print statements - just return the values

Example sequence: 0,1,1,2,3,5,8,13,21,...
Example calls:
- fibonacci(6) returns 8
- fibonacci(0) returns 0
- fibonacci(-1) returns -1"""

    test_cases = [
        (0, 0),    # Edge case: n = 0
        (1, 1),    # Edge case: n = 1
        (2, 1),    # Edge case: n = 2
        (6, 8),    # Regular case
        (10, 55),  # Larger number
        (-1, -1),  # Edge case: negative input
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'fibonacci', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_binary_search():
    question = """Write a Python function named EXACTLY 'binary_search' that performs binary search on a sorted list.
The function signature must be: def binary_search(arr, target)

Requirements:
1. The function takes two arguments:
   - arr: a sorted list of integers
   - target: the integer to find
2. Return the index of the target if found
3. Return -1 if the target is not in the list
4. Do NOT use any print statements - just return the values

Example:
- binary_search([1,2,3,4,5], 3) returns 2
- binary_search([1,2,3,4,5], 6) returns -1"""

    test_cases = [
        (([1,2,3,4,5], 3), 2),   # Regular case: target in middle
        (([1,2,3,4,5], 1), 0),   # Edge case: target at start
        (([1,2,3,4,5], 5), 4),   # Edge case: target at end
        (([1,2,3,4,5], 6), -1),  # Edge case: target not in list
        (([], 1), -1),           # Edge case: empty list
        (([1], 1), 0),           # Edge case: single element list
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'binary_search', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_palindrome():
    question = """Write a Python function named EXACTLY 'is_palindrome' that checks if a string is a palindrome.
The function signature must be: def is_palindrome(s)

Requirements:
1. The function takes one argument:
   - s: a string to check
2. Return True if the string is a palindrome, False otherwise
3. Ignore case (treat uppercase and lowercase as the same)
4. Ignore non-alphanumeric characters (spaces, punctuation)
5. Do NOT use any print statements - just return the values

Example:
- is_palindrome("A man, a plan, a canal: Panama") returns True
- is_palindrome("race a car") returns False"""

    test_cases = [
        ("A man, a plan, a canal: Panama", True),  # Regular case with punctuation
        ("race a car", False),                     # Regular case, not palindrome
        ("", True),                                # Edge case: empty string
        ("a", True),                               # Edge case: single character
        ("Was it a car or a cat I saw?", True),    # Complex case with punctuation
        ("hello", False),                          # Simple case, not palindrome
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'is_palindrome', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

def test_anagram():
    question = """Write a Python function named EXACTLY 'are_anagrams' that checks if two strings are anagrams.
The function signature must be: def are_anagrams(str1, str2)

Requirements:
1. The function takes two arguments:
   - str1: first string
   - str2: second string
2. Return True if the strings are anagrams, False otherwise
3. Ignore case (treat uppercase and lowercase as the same)
4. Ignore spaces
5. Consider only alphanumeric characters
6. Do NOT use any print statements - just return the values

Example:
- are_anagrams("listen", "silent") returns True
- are_anagrams("hello", "world") returns False"""

    test_cases = [
        (("listen", "silent"), True),          # Regular case
        (("hello", "world"), False),           # Not anagrams
        (("", ""), True),                      # Edge case: empty strings
        (("a", "a"), True),                    # Edge case: single char
        (("Debit Card", "Bad Credit"), True),  # Case and space test
        (("Python", "Java"), False),           # Different lengths
    ]

    def validate(code: str) -> bool:
        success, debug_info, test_results = validate_with_debug(code, 'are_anagrams', test_cases, "N/A")
        return success

    return (question, validate, test_cases)

# List of all test cases
CODING_QUESTIONS = [
    test_fibonacci(),
    test_binary_search(),
    test_palindrome(),
    test_anagram()
]

# Friendly test names, keyed by the required function name
TEST_NAMES = {
    "fibonacci": "Fibonacci",
    "binary_search": "Binary Search",
    "is_palindrome": "Palindrome",
    "are_anagrams": "Anagram Check"
}

def get_test_name(question: str) -> str:
    """Get a friendly name for the test based on the question."""
    if "fibonacci" in question.lower():
        return "Fibonacci"
    elif "binary_search" in question.lower():
        return "Binary Search"
    elif "palindrome" in question.lower():
        return "Palindrome"
    elif "anagram" in question.lower():
        return "Anagram Check"
    return question[:20] + "..."

def get_model_stats(model_name: str, question_tuple: tuple, server_url: str) -> Dict:
    """
    Get performance statistics for a specific model and validate the response.
    """
    question, validator, test_cases = question_tuple
    timer = Timer()
    results = {
        'model': model_name,
        'total_duration': 0,
        'tokens_per_second': 0,
        'code_valid': False,
        'tests_passed': False,
        'error': None,
        'test_results': []
    }

    try:
        timer.start()
        print(f'{WHITE}Requesting code from {server_url} with {model_name}{ENDC}')
        response = requests.post(
            f"{server_url}/api/chat",
            json={
                "model": model_name,
                "messages": [{'role': 'user', 'content': question}],
                "stream": False
            }
        ).json()
        timer.stop()

        # Get performance metrics from response
        total_tokens = response.get('eval_count', 0)
        total_duration = response.get('total_duration', 0)
        total_response_time = float(total_duration) / 1e9

        results['total_duration'] = total_response_time
        if total_tokens > 0 and total_response_time > 0:
            results['tokens_per_second'] = total_tokens / total_response_time

        # Print concise performance metrics
        print(f"Total Duration (s): {total_response_time:.2f} / Total Tokens: {total_tokens} / Tokens per Second: {results['tokens_per_second']:.2f}")

        # Extract code from response
        if 'message' in response and 'content' in response['message']:
            code = extract_code_from_response(response['message']['content'])

            # Validate code
            results['code_valid'] = is_valid_python(code)

            if results['code_valid']:
                print(f"Code validation: ✅")
                # Get validation results
                print(f'{WHITE}Running tests...{ENDC}')
                for test_case in CODING_QUESTIONS:
                    if test_case[0] == question:  # Found matching test case
                        function_name = get_function_name_from_question(question)
                        test_cases = test_case[2]  # Get test cases from tuple
                        success, debug_info, test_results = validate_with_debug(code, function_name, test_cases, model_name)  # Changed model to model_name
                        results['tests_passed'] = success
                        results['test_results'] = test_results
                        break
            else:
                print(f"Code Validation: ❌")

        else:
            results['error'] = f"Unexpected response format: {response}"

    except Exception as e:
        print(f"\n{RED}Error in get_model_stats: {str(e)}{ENDC}")
        results['error'] = str(e)

    return results

def get_function_name_from_question(question: str) -> str:
    """Extract function name from question."""
    if "fibonacci" in question.lower():
        return "fibonacci"
    elif "binary_search" in question.lower():
        return "binary_search"
    elif "palindrome" in question.lower():
        return "is_palindrome"
    elif "anagram" in question.lower():
        return "are_anagrams"
    return ""

def run_model_benchmark(model: str, server_url: str, num_runs: int = 4) -> Dict:
    """
    Run multiple benchmarks for a model and calculate average metrics.
    """
    metrics = []

    for i in range(num_runs):
        print(f"\n{YELLOW}[{model}] Run {i+1}/{num_runs}:{ENDC}")

        run_results = {}
        for question_tuple in CODING_QUESTIONS:
            test_name = get_test_name(question_tuple[0])
            print(f"\n{BOLD}Testing {test_name}...{ENDC}")
            try:
                result = get_model_stats(model, question_tuple, server_url)
                # Fix: Count actual passed cases from test results
                result['passed_cases'] = len([r for r in result.get('test_results', []) if r])
                result['total_cases'] = len(question_tuple[2])
                run_results[test_name] = result
            except Exception as e:
                print(f"Error in run {i+1}: {e}")
                continue

        if run_results:
            metrics.append(run_results)

    # Take only the last 3 runs for averaging
    metrics = metrics[-3:]
    num_runs_used = len(metrics)  # Actual number of runs used

    if not metrics:
        return {}

    # Aggregate results
    aggregated = {
        'model': model,
        'total_duration': mean([m[list(m.keys())[0]]['total_duration'] for m in metrics if m]),
        'tokens_per_second': mean([m[list(m.keys())[0]]['tokens_per_second'] for m in metrics if m]),
        'test_results': {}
    }

    # Calculate results per test
    for test_name in metrics[-1].keys():
        # Sum up actual passed cases for this test across runs
        passed_cases = sum(m[test_name]['passed_cases'] for m in metrics)
        # Calculate total possible cases (6 cases × number of actual runs)
        total_possible_cases = 6 * num_runs_used

        success_rate = (passed_cases / total_possible_cases * 100)
        status = '✅' if success_rate == 100 else '❌'
        print(f"{test_name}: {status} ({passed_cases}/{total_possible_cases} cases)")

        aggregated['test_results'][test_name] = {
            'success_rate': success_rate,
            'passed_cases': passed_cases,
            'total_cases': total_possible_cases,
            'success_cases_rate': passed_cases / total_possible_cases,  # Add success cases rate
            'avg_duration': mean([m[test_name]['total_duration'] for m in metrics]),
            'avg_tokens_sec': mean([m[test_name]['tokens_per_second'] for m in metrics])
        }

    # Calculate overall success rate across all tests
    total_passed = sum(t['passed_cases'] for t in aggregated['test_results'].values())
    total_cases = sum(t['total_cases'] for t in aggregated['test_results'].values())
    aggregated['overall_success_rate'] = (total_passed / total_cases * 100) if total_cases > 0 else 0
    aggregated['overall_success_cases_rate'] = (total_passed / total_cases) if total_cases > 0 else 0

    return aggregated

def print_leaderboard(results: List[Dict]):
    """Print leaderboard of model results."""
    if not results:
        print("No results to display")
        return

    # Sort by success rate first, then by tokens per second
    sorted_results = sorted(results, key=lambda x: (
        sum(t['passed_cases'] for t in x['test_results'].values()) / sum(t['total_cases'] for t in x['test_results'].values()) if sum(t['total_cases'] for t in x['test_results'].values()) > 0 else 0,
        x['tokens_per_second']
    ), reverse=True)

    print(f"\n{HEADER}{BOLD}🏆 Final Model Leaderboard:{ENDC}")
    for i, result in enumerate(sorted_results, 1):
        # Calculate stats for each model
        total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
        total_cases = sum(t['total_cases'] for t in result['test_results'].values())
        success_rate = (total_passed / total_cases * 100) if total_cases > 0 else 0

        print(f"\n{BOLD}{YELLOW}{result['model']}{ENDC}")
        print(f" {BOLD}Overall Success Rate:{ENDC} {success_rate:.1f}% ({total_passed}/{total_cases} cases)")
        print(f" {BOLD}Average Tokens/sec:{ENDC} {result['tokens_per_second']:.2f}")
        print(f" {BOLD}Average Duration:{ENDC} {result['total_duration']:.2f}s")
        print(f" {BOLD}Test Results:{ENDC}")
        for test_name, test_result in result['test_results'].items():
            status = '✅' if test_result['success_rate'] == 100 else '❌'
            print(f" - {test_name}: {status} {test_result['passed_cases']}/{test_result['total_cases']} cases ({test_result['success_rate']:.1f}%)")

def get_available_models(server_url: str) -> List[str]:
    """Get list of available models from the specified Ollama server."""
    try:
        response = requests.get(f"{server_url}/api/tags").json()
        return [model['name'] for model in response['models']]
    except Exception as e:
        print(f"{RED}Error getting model list from {server_url}: {e}{ENDC}")
        return []

def get_model_details(model_name):
    try:
        result = subprocess.run(
            ["ollama", "show", model_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            encoding='utf-8',
            errors='replace'
        )

        if result.returncode != 0:
            print(f"Error: {result.stderr.strip()}")
            return None

        if not result.stdout.strip():
            print(f"No details available for model: {model_name}")
            return None

        raw_output = result.stdout.strip()
        lines = raw_output.split('\n')
        current_section = None

        for line in lines:
            line = line.rstrip()
            if line and not line.startswith(' '):  # Section headers
                current_section = line.strip()
                print(f"\n {current_section}")
            elif line and current_section:  # Section content
                # Split by multiple spaces and filter out empty parts
                parts = [part for part in line.split(' ') if part.strip()]
                if len(parts) >= 2:
                    key, value = parts[0].strip(), parts[-1].strip()
                    # Ensure consistent spacing for alignment
                    print(f" {key:<16} {value}")
                elif len(parts) == 1:
                    # Handle single-value lines (like license text)
                    print(f" {parts[0].strip()}")

        return None  # No need to return formatted details anymore

    except Exception as e:
        print(f"An error occurred while getting model details: {e}")
        return None

def update_server_results(server_url: str, results: List[Dict]) -> None:
    try:
        # Get CPU brand and format it for filename
        cpu_info = get_cpu_info()
        cpu_brand = cpu_info.get('brand_raw', 'Unknown_CPU').replace(' ', '_')
        timestamp = time.strftime("%Y%m%d_%H%M%S")

        # Create a unique filename for this server's results
        server_id = server_url.replace('http://', '').replace(':', '_').replace('/', '_')
        results_dir = "benchmark_results"

        os.makedirs(results_dir, exist_ok=True)

        # Include CPU brand in filename
        filename = os.path.join(results_dir, f"{cpu_brand}_{server_id}.json")

        # Load existing results or create new file
        try:
            with open(filename, 'r') as f:
                existing_data = json.load(f)
        except FileNotFoundError:
            existing_data = {
                'server_url': server_url,
                'benchmarks': []
            }

        # Add new results with timestamp and ensure overall success rate is included
        benchmark_entry = {
            'timestamp': timestamp,
            'results': []
        }

        # Add overall success rate to each model's results
        for result in results:
            total_passed = sum(t['passed_cases'] for t in result['test_results'].values())
            total_cases = sum(t['total_cases'] for t in result['test_results'].values())
            result['overall_success_rate'] = (total_passed / total_cases * 100) if total_cases > 0 else 0
            benchmark_entry['results'].append(result)

        existing_data['benchmarks'].append(benchmark_entry)

        # Save updated results
        with open(filename, 'w') as f:
            json.dump(existing_data, f, indent=2)
        print(f"{GREEN}Successfully saved results to {filename}{ENDC}")
    except Exception as e:
        print(f"{RED}Failed to save results: {str(e)}{ENDC}")

def main():
    parser = argparse.ArgumentParser(description='Run Ollama model benchmarks')
    parser.add_argument('--server', choices=['local', 'z60'], default='local',
                        help='Choose Ollama server (default: local)')
    parser.add_argument('--model', type=str, help='Specific model to benchmark')
    parser.add_argument('--number', type=str, help='Number of models to benchmark (number or "all")')
    parser.add_argument('--verbose', action='store_true', help='Enable verbose output')
    args = parser.parse_args()

    # Propagate the --verbose flag to the module-level switch used by the helpers
    global verbose
    verbose = args.verbose

    server_url = SERVERS[args.server]

    print()
    print(f"{HEADER}{BOLD}CPU Information:{ENDC}")
    cpu_info = get_cpu_info()
    for key, value in cpu_info.items():
        print(f"{MUTED}{key}: {value}{ENDC}")

    print()
    print(f"{INFO}Using Ollama server at {server_url}...{ENDC}")

    # Get available models or use specified model
    if args.model:
        models = [args.model]
    else:
        models = get_available_models(server_url)

    if not models:
        print(f"{RED}No models found on server {server_url}. Exiting.{ENDC}")
        return

    # Handle number of models to test
    if args.number and args.number.lower() != 'all':
        try:
            num_models = int(args.number)
            if num_models > 0:
                models = models[:num_models]
            else:
                print(f"{WARNING}Invalid number of models. Using all available models.{ENDC}")
        except ValueError:
            print(f"{WARNING}Invalid number format. Using all available models.{ENDC}")

    print(f"{INFO}Testing {len(models)} models:{ENDC}")
    for i, model in enumerate(models, 1):
        print(f"{YELLOW}{i}. {model}{ENDC}")

    # Run benchmarks
    all_results = []

    for model in models:
        print(f"\n{HEADER}{BOLD}Benchmarking {model}...{ENDC}")
        details = get_model_details(model)
        if details:
            print(f"\n{INFO}Model Details:{ENDC}")
            if "details" in details:
                for section, items in details["details"].items():
                    print(f"\n{BOLD}{section}{ENDC}")
                    for key, value in items.items():
                        print(f" {key}: {value}")
            else:
                print(json.dumps(details, indent=2))
        result = run_model_benchmark(model, server_url)
        if 'error' not in result:
            all_results.append(result)

    # Print and save results
    print_leaderboard(all_results)
    update_server_results(server_url, all_results)

if __name__ == "__main__":
    main()

requirements.txt
@@ -0,0 +1,6 @@

requests>=2.31.0
together>=0.2.8
ollama>=0.1.6
python-dotenv>=1.0.0
GPUtil==1.4.0
py-cpuinfo