Better doc

leduc 2025-04-23 02:40:31 +02:00
parent 2bef2a0b7c
commit b52a0e61a1

README.md

@@ -13,6 +13,7 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
- [Command-Line Arguments](#command-line-arguments)
- [Examples](#examples)
- [Using as a Python Module](#using-as-a-python-module)
- [Leaderboard Feature](#leaderboard-feature)
- [Supported Providers](#supported-providers)
- [Adding New Models](#adding-new-models)
- [Troubleshooting](#troubleshooting)
@@ -25,9 +26,9 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
cd allendpoints
```
2. Install the required dependencies:
2. Install the required dependencies using the requirements.txt file:
```bash
pip install ollama requests google-generativeai huggingface_hub together groq openai colorama
pip install -r requirements.txt
```
3. Install Ollama (optional, for local inference):
@@ -199,6 +200,44 @@ response = run_inference(
print(response)
```
### Model Naming Conventions
The `model` parameter can be specified in two different ways:
1. **Short name** (key): The abbreviated name used in the configuration
2. **Full name** (value): The complete model identifier used by the provider's API
#### Example with both naming types:
```python
# Using the short name (key)
response1 = run_inference(
prompt="What is AI?",
provider="nvidia",
model="qwen2.5-coder-32b" # Short name
)
# Using the full name (value)
response2 = run_inference(
prompt="What is AI?",
provider="nvidia",
model="qwen/qwen2.5-coder-32b-instruct" # Full name
)
```
The script handles both formats automatically. If you use a short name, it will be converted to the full name internally. For Ollama models, the short and full names are typically identical.
You can view the mapping between short and full names with:
```python
# View the mapping between short and full model names
for provider, models in CONFIG["models"].items():
if isinstance(models, dict):
print(f"\nModels for {provider}:")
for short_name, full_name in models.items():
print(f" {short_name} -> {full_name}")
```
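Internally, this conversion is just a dictionary lookup against `CONFIG["models"]`. A minimal sketch of the idea (the helper name `resolve_model_name` is illustrative, not part of the module's API):
```python
def resolve_model_name(provider: str, model: str) -> str:
    """Map a short model name to its full identifier when a mapping exists."""
    models = CONFIG["models"].get(provider, {})
    if isinstance(models, dict) and model in models:
        return models[model]  # short name (key) -> full name (value)
    return model  # already a full name, or an Ollama model

# resolve_model_name("nvidia", "qwen2.5-coder-32b")
# -> "qwen/qwen2.5-coder-32b-instruct"
```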
### Advanced Usage
```python
@@ -280,6 +319,72 @@ response = run_inference(
This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.
## Leaderboard Feature
AllEndpoints includes a built-in leaderboard feature that ranks models by their response time when using the `-a/--all` option. This helps you compare the performance of different models across providers.
### How the Leaderboard Works
1. When you run AllEndpoints with the `-a/--all` flag, it executes your prompt on all available models across all providers
2. The script measures the response time for each model
3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest
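Conceptually, the measurement works like the sketch below, which reuses the `run_inference` function and `CONFIG` dictionary shown earlier (the actual code in allendpoints.py may differ in details such as error handling):
```python
import time

prompt = "What is the capital of France?"
results = []

for provider, models in CONFIG["models"].items():
    names = models.keys() if isinstance(models, dict) else models
    for model in names:
        start = time.time()
        run_inference(prompt=prompt, provider=provider, model=model)
        results.append((f"{provider}/{model}", time.time() - start))

# Rank models from fastest to slowest
for rank, (name, seconds) in enumerate(sorted(results, key=lambda r: r[1]), start=1):
    print(f"{rank:<6}{name:<45}{seconds:.2f}")
```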
### Example Leaderboard Output
```
==================================================
RESPONSE TIME LEADERBOARD
==================================================
Rank Model Time (seconds)
-------------------------------------------------------------
1 ollama/llama3.2:1b-instruct-q4_K_M 0.19
2 groq/deepseek-r1-distill-llama-70b 0.26
3 ollama/cogito:3b 0.31
4 ollama/llama3.2:3b 0.32
5 nvidia/mixtral-8x7b 0.40
6 groq/llama-3.3-70b-versatile 0.40
7 ollama/wizard-vicuna-uncensored:latest 0.40
8 ollama/samantha-mistral:latest 0.41
9 ollama/qwen2.5-coder:7b-instruct-q4_K_M 0.50
10 ollama/qwen2.5:14b 0.93
```
### Using the Leaderboard
To generate a leaderboard, use the `-a/--all` option with your prompt:
```bash
python allendpoints.py -a "What is the capital of France?"
```
The leaderboard helps you:
1. **Identify the fastest models** for your specific use case
2. **Compare performance across providers** (Ollama, Groq, NVIDIA, etc.)
3. **Optimize your workflow** by selecting models with the best speed-to-quality ratio
4. **Monitor performance changes** as models and APIs are updated
### Factors Affecting Response Time
Response times in the leaderboard are affected by several factors:
- **Model size**: Smaller models generally respond faster
- **Provider infrastructure**: Cloud-based providers may have different latencies
- **Network conditions**: Internet speed affects cloud provider response times
- **Local hardware**: For Ollama models, your CPU/GPU capabilities matter
- **Model complexity**: Some models are optimized for speed, others for quality
- **Query complexity**: More complex prompts may take longer to process
### Preloading Mechanism
For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:
1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
2. This eliminates the initial loading time from the performance measurement
3. The reported time reflects only the actual inference time, not model loading
This provides a more accurate comparison between local Ollama models and cloud-based providers.
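As an illustration, the warm-up-then-time pattern for a local model looks roughly like this, using the `ollama` Python client (`timed_ollama_inference` is a hypothetical helper, not the module's actual function):
```python
import time
import ollama

def timed_ollama_inference(model: str, prompt: str):
    # Warm-up: a throwaway "hello" request loads the model into memory,
    # so loading time is excluded from the measurement below.
    ollama.generate(model=model, prompt="hello")

    # Timed run: only the actual inference is measured.
    start = time.time()
    result = ollama.generate(model=model, prompt=prompt)
    return result["response"], time.time() - start
```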
## Supported Providers
### Ollama (Local)