Better doc
This commit is contained in:
parent 2bef2a0b7c
commit b52a0e61a1

README.md
@@ -13,6 +13,7 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
- [Command-Line Arguments](#command-line-arguments)
- [Examples](#examples)
- [Using as a Python Module](#using-as-a-python-module)
- [Leaderboard Feature](#leaderboard-feature)
- [Supported Providers](#supported-providers)
- [Adding New Models](#adding-new-models)
- [Troubleshooting](#troubleshooting)

@@ -25,9 +26,9 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
cd allendpoints
```

-2. Install the required dependencies:
+2. Install the required dependencies using the requirements.txt file:
```bash
-pip install ollama requests google-generativeai huggingface_hub together groq openai colorama
+pip install -r requirements.txt
```

3. Install Ollama (optional, for local inference):

@@ -199,6 +200,44 @@ response = run_inference(
print(response)
```

### Model Naming Conventions

The `model` parameter can be specified in two different ways:

1. **Short name** (key): The abbreviated name used in the configuration
2. **Full name** (value): The complete model identifier used by the provider's API

#### Example with both naming types:

```python
# Using the short name (key)
response1 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen2.5-coder-32b"  # Short name
)

# Using the full name (value)
response2 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen/qwen2.5-coder-32b-instruct"  # Full name
)
```

The script handles both formats automatically. If you use a short name, it will be converted to the full name internally. For Ollama models, the short and full names are typically identical.
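
In practice this resolution is just a dictionary lookup in the provider's model map. The helper below is a minimal sketch of that step (the name `resolve_model_name` is illustrative, not part of the module's API), assuming `CONFIG` is imported as in the other examples in this README:

```python
# Minimal illustrative sketch, not the module's actual API: map a short model
# name to the provider's full identifier, passing full names through unchanged.
def resolve_model_name(provider: str, model: str) -> str:
    models = CONFIG["models"].get(provider, {})
    if isinstance(models, dict) and model in models:
        return models[model]  # short name (key) -> full name (value)
    return model              # already a full name, or an Ollama-style name

# resolve_model_name("nvidia", "qwen2.5-coder-32b")
# would return "qwen/qwen2.5-coder-32b-instruct" with the mapping shown above.
```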

You can view the mapping between short and full names with:

```python
# View the mapping between short and full model names
for provider, models in CONFIG["models"].items():
    if isinstance(models, dict):
        print(f"\nModels for {provider}:")
        for short_name, full_name in models.items():
            print(f" {short_name} -> {full_name}")
```

### Advanced Usage

```python

@@ -280,6 +319,72 @@ response = run_inference(

This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.

## Leaderboard Feature

AllEndpoints includes a built-in leaderboard feature that ranks models by their response time when using the `-a/--all` option. This helps you compare the performance of different models across providers.

### How the Leaderboard Works

1. When you run AllEndpoints with the `-a/--all` flag, it executes your prompt on all available models across all providers
2. The script measures the response time for each model
3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest
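
At its core this is a simple timing loop. The snippet below is a minimal sketch of the idea, not the script's actual implementation; it assumes `run_inference` is imported as in the earlier examples, and the provider/model pairs are purely illustrative:

```python
import time

# Minimal sketch, not the script's actual code: time one prompt on each
# provider/model pair and rank the results from fastest to slowest.
def build_leaderboard(prompt, provider_models):
    results = []  # (elapsed_seconds, "provider/model")
    for provider, model in provider_models:
        start = time.perf_counter()
        run_inference(prompt=prompt, provider=provider, model=model)
        elapsed = time.perf_counter() - start
        results.append((elapsed, f"{provider}/{model}"))
    return sorted(results)  # fastest first, as in the example output below

# Illustrative usage:
# for rank, (secs, name) in enumerate(build_leaderboard(
#         "What is AI?",
#         [("ollama", "llama3.2:3b"), ("groq", "llama-3.3-70b-versatile")]), start=1):
#     print(f"{rank:<6}{name:<42}{secs:.2f}")
```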

### Example Leaderboard Output

```
==================================================
RESPONSE TIME LEADERBOARD
==================================================
Rank  Model                                    Time (seconds)
-------------------------------------------------------------
1     ollama/llama3.2:1b-instruct-q4_K_M       0.19
2     groq/deepseek-r1-distill-llama-70b       0.26
3     ollama/cogito:3b                         0.31
4     ollama/llama3.2:3b                       0.32
5     nvidia/mixtral-8x7b                      0.40
6     groq/llama-3.3-70b-versatile             0.40
7     ollama/wizard-vicuna-uncensored:latest   0.40
8     ollama/samantha-mistral:latest           0.41
9     ollama/qwen2.5-coder:7b-instruct-q4_K_M  0.50
10    ollama/qwen2.5:14b                       0.93
```

### Using the Leaderboard

To generate a leaderboard, use the `-a/--all` option with your prompt:

```bash
python allendpoints.py -a "What is the capital of France?"
```

The leaderboard helps you:

1. **Identify the fastest models** for your specific use case
2. **Compare performance across providers** (Ollama, Groq, NVIDIA, etc.)
3. **Optimize your workflow** by selecting models with the best speed-to-quality ratio
4. **Monitor performance changes** as models and APIs are updated

### Factors Affecting Response Time

Response times in the leaderboard are affected by several factors:

- **Model size**: Smaller models generally respond faster
- **Provider infrastructure**: Cloud-based providers may have different latencies
- **Network conditions**: Internet speed affects cloud provider response times
- **Local hardware**: For Ollama models, your CPU/GPU capabilities matter
- **Model complexity**: Some models are optimized for speed, others for quality
- **Query complexity**: More complex prompts may take longer to process

### Preloading Mechanism

For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:

1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
2. This eliminates the initial loading time from the performance measurement
3. The reported time reflects only the actual inference time, not model loading

This provides a more accurate comparison between local Ollama models and cloud-based providers.
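
In code, the warm-up amounts to one untimed throwaway call before the measured one. The sketch below illustrates the idea (it is not the script's actual implementation), reusing `run_inference` from the earlier examples:

```python
import time

# Minimal sketch, not the script's actual code: send an untimed "hello" query
# to load the Ollama model, then time only the real inference call.
def timed_ollama_inference(prompt, model):
    run_inference(prompt="hello", provider="ollama", model=model)  # preload (untimed)
    start = time.perf_counter()
    response = run_inference(prompt=prompt, provider="ollama", model=model)
    elapsed = time.perf_counter() - start  # excludes model loading time
    return response, elapsed
```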

## Supported Providers

### Ollama (Local)