results consistency explanation

2025-03-15 12:12:40 +01:00 · 81dc8bdcbe
commit 81dc8bdcbe
parent ee9e3d2a04
2 changed files with 4 additions and 1 deletion
--- a/.DS_Store
+++ b/.DS_Store
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Codebench - Ollama Model Benchmark Tool
+# Codebench - Ollama Models Python Benchmark Tool

 A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.

@@ -76,6 +76,9 @@ The tool currently tests models on these coding challenges:
 1. Each model is prompted with specific coding tasks
 2. Generated code is extracted from the model's response
 3. Initial syntax validation is performed
+4. Code that fails validation is passed to the Together API for advanced code analysis
+5. Code that passes validation is executed against the given test data, and its output is compared to the expected results
+6. Each test is run 4 times for consistency; only the last 3 results are used for metrics

 ### Test Validation
 For each test case:
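
The "run 4 times, keep the last 3" rule added in step 6 of this hunk can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code Codebench actually uses: `run_repeated`, `average_metric`, and the `run_once` callable are hypothetical names, and the idea that the metric is a single float per run (for example a duration in seconds) is also an assumption.

```python
from statistics import mean
from typing import Callable, List

TOTAL_RUNS = 4  # from the README: each test is run 4 times
KEPT_RUNS = 3   # from the README: only the last 3 results are used


def run_repeated(run_once: Callable[[], float],
                 total_runs: int = TOTAL_RUNS,
                 kept_runs: int = KEPT_RUNS) -> List[float]:
    """Call run_once total_runs times and return the last kept_runs results.

    run_once is a hypothetical callable that performs one test run and
    returns one numeric measurement.
    """
    if not 1 <= kept_runs <= total_runs:
        raise ValueError("kept_runs must be between 1 and total_runs")
    results = [run_once() for _ in range(total_runs)]
    # Discard the earliest runs; only the trailing kept_runs feed the metrics.
    return results[-kept_runs:]


def average_metric(kept: List[float]) -> float:
    """Average the kept measurements (averaging is an assumed aggregation)."""
    return mean(kept)


if __name__ == "__main__":
    # Fake, deterministic measurements standing in for real model runs.
    samples = iter([9.0, 2.0, 4.0, 6.0])
    kept = run_repeated(lambda: next(samples))
    print(kept, average_metric(kept))
```

With the fake samples above, the first measurement (9.0) is dropped and the remaining three are averaged, so a single outlying first run does not affect the reported figure.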