Better doc
This commit is contained in:
parent 2bef2a0b7c
commit b52a0e61a1

README.md
@@ -13,6 +13,7 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
- [Command-Line Arguments](#command-line-arguments)
- [Examples](#examples)
- [Using as a Python Module](#using-as-a-python-module)
- [Leaderboard Feature](#leaderboard-feature)
- [Supported Providers](#supported-providers)
- [Adding New Models](#adding-new-models)
- [Troubleshooting](#troubleshooting)

@@ -25,9 +26,9 @@ AllEndpoints is a powerful Python module for making inferences with various LLM
cd allendpoints
```

-2. Install the required dependencies:
+2. Install the required dependencies using the requirements.txt file:
```bash
-pip install ollama requests google-generativeai huggingface_hub together groq openai colorama
+pip install -r requirements.txt
```

3. Install Ollama (optional, for local inference):

@@ -199,6 +200,44 @@ response = run_inference(
print(response)
```

### Model Naming Conventions

The `model` parameter can be specified in two different ways:

1. **Short name** (key): The abbreviated name used in the configuration
2. **Full name** (value): The complete model identifier used by the provider's API

#### Example with both naming types:

```python
# Using the short name (key)
response1 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen2.5-coder-32b"  # Short name
)

# Using the full name (value)
response2 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen/qwen2.5-coder-32b-instruct"  # Full name
)
```

The script handles both formats automatically. If you use a short name, it will be converted to the full name internally. For Ollama models, the short and full names are typically identical.
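
In practice this resolution is just a dictionary lookup in the provider's model map. The helper below is a minimal sketch of that step (the name `resolve_model_name` is illustrative, not part of the module's API), assuming `CONFIG` is imported as in the other examples in this README:

```python
# Minimal illustrative sketch, not the module's actual API: map a short model
# name to the provider's full identifier, passing full names through unchanged.
def resolve_model_name(provider: str, model: str) -> str:
    models = CONFIG["models"].get(provider, {})
    if isinstance(models, dict) and model in models:
        return models[model]  # short name (key) -> full name (value)
    return model              # already a full name, or an Ollama-style name

# resolve_model_name("nvidia", "qwen2.5-coder-32b")
# would return "qwen/qwen2.5-coder-32b-instruct" with the mapping shown above.
```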

You can view the mapping between short and full names with:

```python
# View the mapping between short and full model names
for provider, models in CONFIG["models"].items():
    if isinstance(models, dict):
        print(f"\nModels for {provider}:")
        for short_name, full_name in models.items():
            print(f" {short_name} -> {full_name}")
```

### Advanced Usage

```python

@@ -280,6 +319,72 @@ response = run_inference(

This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.

## Leaderboard Feature

AllEndpoints includes a built-in leaderboard feature that ranks models by their response time when using the `-a/--all` option. This helps you compare the performance of different models across providers.

### How the Leaderboard Works

1. When you run AllEndpoints with the `-a/--all` flag, it executes your prompt on all available models across all providers
2. The script measures the response time for each model
3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest
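
At its core this is a simple timing loop. The snippet below is a minimal sketch of the idea, not the script's actual implementation; it assumes `run_inference` is imported as in the earlier examples, and the provider/model pairs are purely illustrative:

```python
import time

# Minimal sketch, not the script's actual code: time one prompt on each
# provider/model pair and rank the results from fastest to slowest.
def build_leaderboard(prompt, provider_models):
    results = []  # (elapsed_seconds, "provider/model")
    for provider, model in provider_models:
        start = time.perf_counter()
        run_inference(prompt=prompt, provider=provider, model=model)
        elapsed = time.perf_counter() - start
        results.append((elapsed, f"{provider}/{model}"))
    return sorted(results)  # fastest first, as in the example output below

# Illustrative usage:
# for rank, (secs, name) in enumerate(build_leaderboard(
#         "What is AI?",
#         [("ollama", "llama3.2:3b"), ("groq", "llama-3.3-70b-versatile")]), start=1):
#     print(f"{rank:<6}{name:<42}{secs:.2f}")
```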

### Example Leaderboard Output

```
==================================================
RESPONSE TIME LEADERBOARD
==================================================
Rank  Model                                    Time (seconds)
-------------------------------------------------------------
1     ollama/llama3.2:1b-instruct-q4_K_M       0.19
2     groq/deepseek-r1-distill-llama-70b       0.26
3     ollama/cogito:3b                         0.31
4     ollama/llama3.2:3b                       0.32
5     nvidia/mixtral-8x7b                      0.40
6     groq/llama-3.3-70b-versatile             0.40
7     ollama/wizard-vicuna-uncensored:latest   0.40
8     ollama/samantha-mistral:latest           0.41
9     ollama/qwen2.5-coder:7b-instruct-q4_K_M  0.50
10    ollama/qwen2.5:14b                       0.93
```

### Using the Leaderboard

To generate a leaderboard, use the `-a/--all` option with your prompt:

```bash
python allendpoints.py -a "What is the capital of France?"
```

The leaderboard helps you:

1. **Identify the fastest models** for your specific use case
2. **Compare performance across providers** (Ollama, Groq, NVIDIA, etc.)
3. **Optimize your workflow** by selecting models with the best speed-to-quality ratio
4. **Monitor performance changes** as models and APIs are updated

### Factors Affecting Response Time

Response times in the leaderboard are affected by several factors:

- **Model size**: Smaller models generally respond faster
- **Provider infrastructure**: Cloud-based providers may have different latencies
- **Network conditions**: Internet speed affects cloud provider response times
- **Local hardware**: For Ollama models, your CPU/GPU capabilities matter
- **Model complexity**: Some models are optimized for speed, others for quality
- **Query complexity**: More complex prompts may take longer to process

### Preloading Mechanism

For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:

1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
2. This eliminates the initial loading time from the performance measurement
3. The reported time reflects only the actual inference time, not model loading

This provides a more accurate comparison between local Ollama models and cloud-based providers.
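
In code, the warm-up amounts to one untimed throwaway call before the measured one. The sketch below illustrates the idea (it is not the script's actual implementation), reusing `run_inference` from the earlier examples:

```python
import time

# Minimal sketch, not the script's actual code: send an untimed "hello" query
# to load the Ollama model, then time only the real inference call.
def timed_ollama_inference(prompt, model):
    run_inference(prompt="hello", provider="ollama", model=model)  # preload (untimed)
    start = time.perf_counter()
    response = run_inference(prompt=prompt, provider="ollama", model=model)
    elapsed = time.perf_counter() - start  # excludes model loading time
    return response, elapsed
```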

## Supported Providers

### Ollama (Local)