diff --git a/.DS_Store b/.DS_Store
index 870c31a..96bed27 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/README.md b/README.md
index 0eb7767..e436279 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Codebench - Ollama Model Benchmark Tool
+# Codebench - Ollama Models Python Benchmark Tool
 
 A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.
 
@@ -76,6 +76,9 @@ The tool currently tests models on these coding challenges:
 1. Each model is prompted with specific coding tasks
 2. Generated code is extracted from the model's response
 3. Initial syntax validation is performed
+4. Code that fails validation is passed to Together API for advanced code analysis
+5. Code that passes validation is executed and validated with given data and compared to expected results
+6. Each test is run 4 times for consistency and only the last 3 results are used for metrics
 
 ### Test Validation
 For each test case: