results consistency explanation

leduc 2025-03-15 12:12:40 +01:00
parent ee9e3d2a04
commit 81dc8bdcbe
2 changed files with 4 additions and 1 deletion

BIN
.DS_Store vendored

Binary file not shown.


@@ -1,4 +1,4 @@
-# Codebench - Ollama Model Benchmark Tool
+# Codebench - Ollama Models Python Benchmark Tool
A Python-based benchmarking tool for testing and comparing different Ollama models on coding tasks. This tool allows you to benchmark multiple Ollama models against common coding problems, measure their performance, and visualize the results.
@@ -76,6 +76,9 @@ The tool currently tests models on these coding challenges:
1. Each model is prompted with specific coding tasks
2. Generated code is extracted from the model's response
3. Initial syntax validation is performed
+4. Code that fails validation is passed to the Together API for advanced code analysis
+5. Code that passes validation is executed with the given test data, and its output is compared to the expected results
+6. Each test is run 4 times for consistency; only the last 3 results are used for metrics
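
The validation and repeat-run steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the helper names `is_valid_python`, `run_and_check`, and `timed_runs` are hypothetical, the syntax check is assumed to be a plain `ast.parse`, and the first of the 4 runs is assumed to be the one discarded. The Together API fallback is omitted.

```python
import ast
import time


def is_valid_python(source: str) -> bool:
    """Syntax-only check: True if the generated code parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def run_and_check(source: str, func_name: str, args: tuple, expected) -> bool:
    """Execute the generated code and compare one call's result to the expected value."""
    namespace = {}
    exec(source, namespace)  # only acceptable for trusted or sandboxed code
    return namespace[func_name](*args) == expected


def timed_runs(task, total_runs: int = 4, discard_first: int = 1) -> list:
    """Call task() total_runs times and return only the durations kept for metrics."""
    durations = []
    for _ in range(total_runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    # Assumption: the earliest run is dropped so that one-off start-up cost
    # does not skew the averages.
    return durations[discard_first:]
```

With the defaults, `timed_runs` returns three durations, matching the "last 3 results" rule; averaging them is left to the caller.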
### Test Validation
For each test case: