diff --git a/README.md b/README.md
index 011e75e..651a627 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
 - [Command-Line Arguments](#command-line-arguments)
 - [Examples](#examples)
 - [Using as a Python Module](#using-as-a-python-module)
+- [Leaderboard Feature](#leaderboard-feature)
 - [Supported Providers](#supported-providers)
 - [Adding New Models](#adding-new-models)
 - [Troubleshooting](#troubleshooting)
@@ -25,9 +26,9 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
    cd allendpoints
    ```
 
-2. Install the required dependencies:
+2. Install the required dependencies from `requirements.txt`:
    ```bash
-   pip install ollama requests google-generativeai huggingface_hub together groq openai colorama
+   pip install -r requirements.txt
    ```
 
 3. Install Ollama (optional, for local inference):
@@ -199,6 +200,44 @@ response = run_inference(
 print(response)
 ```
 
+### Model Naming Conventions
+
+The `model` parameter can be specified in two ways:
+
+1. **Short name** (key): The abbreviated name used in the configuration
+2. **Full name** (value): The complete model identifier used by the provider's API
+
+#### Example with both naming types
+
+```python
+# Using the short name (key)
+response1 = run_inference(
+    prompt="What is AI?",
+    provider="nvidia",
+    model="qwen2.5-coder-32b"  # Short name
+)
+
+# Using the full name (value)
+response2 = run_inference(
+    prompt="What is AI?",
+    provider="nvidia",
+    model="qwen/qwen2.5-coder-32b-instruct"  # Full name
+)
+```
+
+The script handles both formats automatically: if you use a short name, it is converted to the full name internally. For Ollama models, the short and full names are typically identical.
+
+You can view the mapping between short and full names with:
+
+```python
+# View the mapping between short and full model names
+# (CONFIG is the provider/model configuration defined in the allendpoints module)
+from allendpoints import CONFIG
+
+for provider, models in CONFIG["models"].items():
+    if isinstance(models, dict):
+        print(f"\nModels for {provider}:")
+        for short_name, full_name in models.items():
+            print(f"  {short_name} -> {full_name}")
+```
+
 ### Advanced Usage
 
 ```python
@@ -280,6 +319,72 @@ response = run_inference(
 This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.
 
+## Leaderboard Feature
+
+AllEndpoints includes a built-in leaderboard that ranks models by their response time when you use the `-a/--all` option. This helps you compare the performance of different models across providers.
+
+### How the Leaderboard Works
+
+1. When you run AllEndpoints with the `-a/--all` flag, it executes your prompt on all available models across all providers
+2. The script measures the response time for each model
+3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest, as sketched below
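+
+The exact timing and ranking logic lives in `allendpoints.py`; conceptually it boils down to something like the sketch below. This is only an illustration, not the script's internal code: it assumes `run_inference` (and the `allendpoints` module itself) is importable as shown in [Using as a Python Module](#using-as-a-python-module), and the provider/model pairs are examples you may need to swap for models you actually have access to.
+
+```python
+import time
+
+from allendpoints import run_inference
+
+# Example provider/model pairs; adjust to what is available on your machine/accounts
+candidates = [
+    ("ollama", "llama3.2:3b"),
+    ("groq", "llama-3.3-70b-versatile"),
+]
+
+prompt = "What is the capital of France?"
+timings = []
+
+for provider, model in candidates:
+    if provider == "ollama":
+        # Warm up local models first so load time is not counted
+        # (mirrors the preloading mechanism described below)
+        run_inference(prompt="hello", provider=provider, model=model)
+    start = time.perf_counter()
+    run_inference(prompt=prompt, provider=provider, model=model)
+    timings.append((f"{provider}/{model}", time.perf_counter() - start))
+
+# Rank from fastest to slowest, like the built-in leaderboard
+for rank, (name, seconds) in enumerate(sorted(timings, key=lambda t: t[1]), start=1):
+    print(f"{rank:<6}{name:<41}{seconds:.2f}")
+```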
+
+### Example Leaderboard Output
+
+```
+==================================================
+RESPONSE TIME LEADERBOARD
+==================================================
+Rank  Model                                    Time (seconds)
+-------------------------------------------------------------
+1     ollama/llama3.2:1b-instruct-q4_K_M       0.19
+2     groq/deepseek-r1-distill-llama-70b       0.26
+3     ollama/cogito:3b                         0.31
+4     ollama/llama3.2:3b                       0.32
+5     nvidia/mixtral-8x7b                      0.40
+6     groq/llama-3.3-70b-versatile             0.40
+7     ollama/wizard-vicuna-uncensored:latest   0.40
+8     ollama/samantha-mistral:latest           0.41
+9     ollama/qwen2.5-coder:7b-instruct-q4_K_M  0.50
+10    ollama/qwen2.5:14b                       0.93
+```
+
+### Using the Leaderboard
+
+To generate a leaderboard, use the `-a/--all` option with your prompt:
+
+```bash
+python allendpoints.py -a "What is the capital of France?"
+```
+
+The leaderboard helps you:
+
+1. **Identify the fastest models** for your specific use case
+2. **Compare performance across providers** (Ollama, Groq, NVIDIA, etc.)
+3. **Optimize your workflow** by selecting models with the best speed-to-quality ratio
+4. **Monitor performance changes** as models and APIs are updated
+
+### Factors Affecting Response Time
+
+Response times in the leaderboard are affected by several factors:
+
+- **Model size**: Smaller models generally respond faster
+- **Provider infrastructure**: Cloud-based providers may have different latencies
+- **Network conditions**: Internet speed affects cloud provider response times
+- **Local hardware**: For Ollama models, your CPU/GPU capabilities matter
+- **Model complexity**: Some models are optimized for speed, others for quality
+- **Query complexity**: More complex prompts may take longer to process
+
+### Preloading Mechanism
+
+For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:
+
+1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
+2. This eliminates the initial loading time from the performance measurement
+3. The reported time reflects only the actual inference time, not model loading
+
+This provides a more accurate comparison between local Ollama models and cloud-based providers.
+
 ## Supported Providers
 
 ### Ollama (Local)