# AllEndpoints - Universal LLM Inference Tool
AllEndpoints is a Python module for running inference against multiple LLM providers through a unified interface. Supported providers include Ollama (local), HuggingFace, Together, Google Gemini, AIQL, Groq, NVIDIA, and the GitHub Copilot API.

> **Quick Start**: With [uv](https://github.com/astral-sh/uv) installed, you can run AllEndpoints without installing dependencies manually:
>
> ```bash
> uv run allendpoints.py --list
> ```

## Table of Contents

- [Installation](#installation)
- [Environment Variables](#environment-variables)
  - [Setting Up Environment Variables](#setting-up-environment-variables)
    - [Linux/macOS](#linuxmacos)
    - [Windows](#windows)
- [Usage](#usage)
  - [Command-Line Arguments](#command-line-arguments)
  - [Examples](#examples)
- [Using as a Python Module](#using-as-a-python-module)
- [Leaderboard Feature](#leaderboard-feature)
- [Supported Providers](#supported-providers)
- [Adding New Models](#adding-new-models)
- [Troubleshooting](#troubleshooting)

## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/allendpoints.git
cd allendpoints
```

2. Choose one of the following installation methods:

### Option A: Using pip

Install the required dependencies using the requirements.txt file:

```bash
pip install -r requirements.txt
```

Then run the script directly:

```bash
python allendpoints.py [arguments]
```

### Option B: Using uv (Recommended)

If you have [uv](https://github.com/astral-sh/uv) installed, you can run the script without explicitly installing dependencies:

```bash
uv run allendpoints.py [arguments]
```

This will automatically create a virtual environment and install all required dependencies on first run.

3. Install Ollama (optional, for local inference):

- [Ollama Installation Guide](https://github.com/ollama/ollama)

## Environment Variables

The script reads provider API keys from environment variables. Each provider requires the variable listed below:

| Provider      | Environment Variable | Description                         |
|---------------|----------------------|-------------------------------------|
| HuggingFace   | `HF_API_KEY`         | HuggingFace API key                 |
| Together      | `TOGETHER_API_KEY`   | Together AI API key                 |
| Google Gemini | `GEMINI_API_KEY`     | Google AI Studio API key            |
| AIQL          | `AIQL_API_KEY`       | AIQL API key                        |
| Groq          | `GROQ_API_KEY`       | Groq API key                        |
| NVIDIA        | `NVIDIA_API_KEY`     | NVIDIA API key                      |
| GitHub        | `GITHUB_TOKEN`       | GitHub token for Copilot API access |

### Setting Up Environment Variables

#### Linux/macOS

**Temporary (Current Session Only)**

```bash
export HF_API_KEY="your_huggingface_api_key"
export TOGETHER_API_KEY="your_together_api_key"
export GEMINI_API_KEY="your_gemini_api_key"
export AIQL_API_KEY="your_aiql_api_key"
export GROQ_API_KEY="your_groq_api_key"
export NVIDIA_API_KEY="your_nvidia_api_key"
export GITHUB_TOKEN="your_github_token"
```

**Permanent (Add to Shell Profile)**

Add the above export commands to your `~/.bashrc`, `~/.zshrc`, or `~/.profile` file:

```bash
echo 'export HF_API_KEY="your_huggingface_api_key"' >> ~/.bashrc
echo 'export TOGETHER_API_KEY="your_together_api_key"' >> ~/.bashrc
# Add other API keys similarly
```

Then reload your shell configuration:

```bash
source ~/.bashrc  # or ~/.zshrc or ~/.profile
```

#### Windows

**Command Prompt (Temporary)**

```cmd
set HF_API_KEY=your_huggingface_api_key
set TOGETHER_API_KEY=your_together_api_key
set GEMINI_API_KEY=your_gemini_api_key
set AIQL_API_KEY=your_aiql_api_key
set GROQ_API_KEY=your_groq_api_key
set NVIDIA_API_KEY=your_nvidia_api_key
set GITHUB_TOKEN=your_github_token
```

**PowerShell (Temporary)**

```powershell
$env:HF_API_KEY = "your_huggingface_api_key"
$env:TOGETHER_API_KEY = "your_together_api_key"
$env:GEMINI_API_KEY = "your_gemini_api_key"
$env:AIQL_API_KEY = "your_aiql_api_key"
$env:GROQ_API_KEY = "your_groq_api_key"
$env:NVIDIA_API_KEY = "your_nvidia_api_key"
$env:GITHUB_TOKEN = "your_github_token"
```

**Permanent (System Environment Variables)**

1. Right-click on "This PC" or "My Computer" and select "Properties"
2. Click on "Advanced system settings"
3. Click on "Environment Variables"
4. Under "User variables" or "System variables", click "New"
5. Enter the variable name (e.g., `HF_API_KEY`) and its value
6. Click "OK" to save

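After setting the variables, you may want to confirm they are visible to Python before running the tool. The following standalone snippet (not part of allendpoints) prints which of the keys from the table above are set:

```python
import os

# API key variables used by AllEndpoints (see the table above)
KEYS = [
    "HF_API_KEY", "TOGETHER_API_KEY", "GEMINI_API_KEY",
    "AIQL_API_KEY", "GROQ_API_KEY", "NVIDIA_API_KEY", "GITHUB_TOKEN",
]

for key in KEYS:
    status = "set" if os.environ.get(key) else "missing"
    print(f"{key}: {status}")
```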
## Usage

### Command-Line Arguments

```
usage: allendpoints.py [-h] [--provider PROVIDER] [--model MODEL] [--system SYSTEM] [--list] [--debug] [-a] [prompt]

LLM Inference Module

positional arguments:
  prompt               The prompt to send to the model (default: "Why is the sky blue?")

options:
  -h, --help           show this help message and exit
  --provider PROVIDER  The provider to use (ollama, hf, together, gemini, aiql, groq, nvidia, github)
  --model MODEL        The specific model to use
  --system SYSTEM      System content for chat models (default: "You are a helpful assistant.")
  --list               List available providers and models
  --debug              Enable debug output
  -a, --all            Run inference on all available providers and models
```

### Examples

**List all available providers and models:**

```bash
# Using python directly
python allendpoints.py --list

# Using uv run
uv run allendpoints.py --list
```

**Run inference with a specific provider and model:**

```bash
# Using python directly
python allendpoints.py "What is the capital of France?" --provider ollama --model llama3.2:3b

# Using uv run
uv run allendpoints.py "What is the capital of France?" --provider ollama --model llama3.2:3b
```

**Run inference with a specific provider and its default model:**

```bash
# Using python directly
python allendpoints.py "Explain quantum computing" --provider gemini

# Using uv run
uv run allendpoints.py "Explain quantum computing" --provider gemini
```

**Run inference with a custom system prompt:**

```bash
# Using python directly
python allendpoints.py "Write a poem about AI" --provider ollama --model llama3.2:3b --system "You are a poetic assistant."

# Using uv run
uv run allendpoints.py "Write a poem about AI" --provider ollama --model llama3.2:3b --system "You are a poetic assistant."
```

**Run inference on all available providers and models:**

```bash
# Using python directly
python allendpoints.py "What is the meaning of life?" -a

# Using uv run
uv run allendpoints.py "What is the meaning of life?" -a
```

**Run with debug output:**

```bash
# Using python directly
python allendpoints.py "How does a nuclear reactor work?" --provider nvidia --model qwen2.5-coder-32b --debug

# Using uv run
uv run allendpoints.py "How does a nuclear reactor work?" --provider nvidia --model qwen2.5-coder-32b --debug
```

## Using as a Python Module

AllEndpoints can be imported and used as a Python module in your own projects. Here's how to use it programmatically:

### Basic Usage

```python
# Import the necessary functions from allendpoints
from allendpoints import run_inference, check_available_apis, CONFIG

# Run inference with a specific provider and model
# Always specify the model parameter explicitly
response = run_inference(
    prompt="What is the capital of France?",
    provider="ollama",
    model="llama3.2:3b",
    system_content="You are a helpful assistant."
)

print(response)

# If you want to use the default model for a provider
default_model = CONFIG["defaults"]["ollama"]
response = run_inference(
    prompt="What is quantum computing?",
    provider="ollama",
    model=default_model
)

print(response)
```

### Model Naming Conventions

The `model` parameter can be specified in two different ways:

1. **Short name** (key): The abbreviated name used in the configuration
2. **Full name** (value): The complete model identifier used by the provider's API

#### Example with both naming types:

```python
# Using the short name (key)
response1 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen2.5-coder-32b"  # Short name
)

# Using the full name (value)
response2 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen/qwen2.5-coder-32b-instruct"  # Full name
)
```

The script handles both formats automatically. If you use a short name, it will be converted to the full name internally. For Ollama models, the short and full names are typically identical.

You can view the mapping between short and full names with:

```python
# View the mapping between short and full model names
for provider, models in CONFIG["models"].items():
    if isinstance(models, dict):
        print(f"\nModels for {provider}:")
        for short_name, full_name in models.items():
            print(f"  {short_name} -> {full_name}")
```

### Advanced Usage

```python
# Import more functions for advanced usage
from allendpoints import (
    run_inference,
    check_available_apis,
    get_ollama_models,
    InferenceHandler,
    CONFIG
)

# Get all available providers
available_providers = check_available_apis()
print(f"Available providers: {available_providers}")

# Get all available Ollama models
ollama_models = get_ollama_models()
print(f"Available Ollama models: {ollama_models}")

# Use a specific provider's handler directly
if "nvidia" in available_providers:
    nvidia_response = InferenceHandler.nvidia(
        prompt="Explain quantum computing",
        model="qwen/qwen2.5-coder-32b-instruct"
    )
    print(f"NVIDIA response: {nvidia_response}")

# Access the configuration
default_models = CONFIG["defaults"]
print(f"Default models: {default_models}")
```

### Batch Processing Example

```python
from allendpoints import run_inference, CONFIG

# Process multiple prompts with different providers
prompts = [
    "What is machine learning?",
    "Explain the theory of relativity",
    "How does a neural network work?"
]

providers = ["ollama", "gemini", "github"]

# Process each prompt with each provider
for prompt in prompts:
    for provider in providers:
        try:
            # Always specify the model parameter explicitly
            default_model = CONFIG["defaults"][provider]
            response = run_inference(prompt, provider, model=default_model)
            print(f"\nPrompt: {prompt}")
            print(f"Provider: {provider}")
            print(f"Response: {response[:100]}...")
        except Exception as e:
            print(f"Error with {provider}: {str(e)}")
```

### Integration with main.py

The allendpoints module is integrated with main.py for benchmarking LLM performance on coding tasks:

```python
# In main.py
from allendpoints import check_available_apis, run_inference

# Get available providers
available_apis = check_available_apis()

# Run inference with a specific model
response = run_inference(
    question,        # The coding problem to solve
    provider,        # The provider to use
    model_id,        # The specific model to use
    system_content   # Optional system prompt
)
```

This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.

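For illustration, the sketch below applies the same pattern outside main.py: it sends one coding prompt to the default model of every available provider and collects the answers. This is a hypothetical example, not code from main.py, and it assumes `check_available_apis()` yields provider names as in the earlier examples:

```python
from allendpoints import check_available_apis, run_inference, CONFIG

question = "Write a Python function that reverses a string."
results = {}

# Iterate over the providers that are usable in this environment
for provider in check_available_apis():
    try:
        model_id = CONFIG["defaults"][provider]  # the provider's default model
        answer = run_inference(question, provider, model=model_id)
        results[f"{provider}/{model_id}"] = answer
    except Exception as exc:
        results[provider] = f"error: {exc}"

# Print a short preview of each answer
for name, answer in results.items():
    print(f"{name}: {str(answer)[:80]}")
```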
## Leaderboard Feature

AllEndpoints includes a built-in leaderboard feature that ranks models by their response time when using the `-a/--all` option. This helps you compare the performance of different models across providers.

### How the Leaderboard Works

1. When you run AllEndpoints with the `-a/--all` flag, it executes your prompt on all available models across all providers
2. The script measures the response time for each model
3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest (a simplified sketch of this timing loop follows below)

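Conceptually, the measurement is a wall-clock timer around each inference call, with the results sorted afterwards. The snippet below is a simplified sketch of that idea built on the public `run_inference` API and the `CONFIG["defaults"]` mapping; it is not the tool's internal leaderboard code:

```python
import time
from allendpoints import run_inference, CONFIG

prompt = "What is the capital of France?"
timings = []

# Time each provider's default model on the same prompt (illustrative only)
for provider, model in CONFIG["defaults"].items():
    try:
        start = time.perf_counter()
        run_inference(prompt, provider, model=model)
        timings.append((f"{provider}/{model}", time.perf_counter() - start))
    except Exception:
        continue  # skip providers that are not configured in this environment

# Rank from fastest to slowest, as in the leaderboard shown below
for rank, (name, seconds) in enumerate(sorted(timings, key=lambda t: t[1]), start=1):
    print(f"{rank:<6}{name:<45}{seconds:.2f}")
```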
### Example Leaderboard Output

```
==================================================
RESPONSE TIME LEADERBOARD
==================================================
Rank  Model                                      Time (seconds)
-------------------------------------------------------------
1     ollama/llama3.2:1b-instruct-q4_K_M         0.19
2     groq/deepseek-r1-distill-llama-70b         0.26
3     ollama/cogito:3b                           0.31
4     ollama/llama3.2:3b                         0.32
5     nvidia/mixtral-8x7b                        0.40
6     groq/llama-3.3-70b-versatile               0.40
7     ollama/wizard-vicuna-uncensored:latest     0.40
8     ollama/samantha-mistral:latest             0.41
9     ollama/qwen2.5-coder:7b-instruct-q4_K_M    0.50
10    ollama/qwen2.5:14b                         0.93
```

### Using the Leaderboard

To generate a leaderboard, use the `-a/--all` option with your prompt:

```bash
python allendpoints.py -a "What is the capital of France?"
```

The leaderboard helps you:

1. **Identify the fastest models** for your specific use case
2. **Compare performance across providers** (Ollama, Groq, NVIDIA, etc.)
3. **Optimize your workflow** by selecting models with the best speed-to-quality ratio
4. **Monitor performance changes** as models and APIs are updated

### Factors Affecting Response Time

Response times in the leaderboard are affected by several factors:

- **Model size**: Smaller models generally respond faster
- **Provider infrastructure**: Cloud-based providers may have different latencies
- **Network conditions**: Internet speed affects cloud provider response times
- **Local hardware**: For Ollama models, your CPU/GPU capabilities matter
- **Model complexity**: Some models are optimized for speed, others for quality
- **Query complexity**: More complex prompts may take longer to process

### Preloading Mechanism

For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:

1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
2. This eliminates the initial loading time from the performance measurement
3. The reported time reflects only the actual inference time, not model loading

This provides a more accurate comparison between local Ollama models and cloud-based providers.

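You can apply the same warm-up idea in your own timing code: send a throwaway prompt first so the model is already loaded when the clock starts. A minimal sketch, assuming a locally available `llama3.2:3b` model:

```python
import time
from allendpoints import run_inference

model = "llama3.2:3b"

# Warm-up call: loads the Ollama model so its load time is not measured below
run_inference("hello", "ollama", model=model)

# Timed call: measures inference only, since the model is already resident
start = time.perf_counter()
answer = run_inference("Why is the sky blue?", "ollama", model=model)
print(f"{model}: {time.perf_counter() - start:.2f}s")
print(answer)
```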
## Supported Providers

### Ollama (Local)
- Runs locally on your machine
- Supports various open-source models
- No API key required, but needs Ollama installed

### HuggingFace
- Provides access to HuggingFace's Inference API
- Requires `HF_API_KEY` environment variable

### Together
- Provides access to Together AI's models
- Requires `TOGETHER_API_KEY` environment variable

### Google Gemini
- Provides access to Google's Gemini models
- Requires `GEMINI_API_KEY` environment variable

### AIQL
- Provides access to AIQL's models
- Requires `AIQL_API_KEY` environment variable

### Groq
- Provides access to Groq's models
- Requires `GROQ_API_KEY` environment variable

### NVIDIA
- Provides access to NVIDIA's models
- Requires `NVIDIA_API_KEY` environment variable

### GitHub
- Provides access to GitHub Copilot models
- Requires `GITHUB_TOKEN` environment variable

## Adding New Models

To add a new model to an existing provider, edit the `CONFIG` dictionary in the script:

```python
CONFIG = {
    "models": {
        "provider_name": {
            "model_display_name": "actual_model_id",
            # Add your new model here
            "new_model_name": "new_model_id"
        }
    }
}
```
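Once the entry exists, the new short name can be used anywhere a model is accepted. A sketch using the placeholder names from the block above (replace them with your real provider and model identifiers):

```python
from allendpoints import run_inference

# "provider_name" and "new_model_name" are the placeholders from the CONFIG
# example above; substitute the provider and short model name you registered.
response = run_inference(
    prompt="What is the capital of France?",
    provider="provider_name",
    model="new_model_name"
)
print(response)
```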
## Troubleshooting

### API Key Issues
- Ensure your API keys are correctly set in your environment variables
- Check that the API keys have not expired
- Verify that you have the necessary permissions for the models you're trying to access

### Ollama Issues
- Ensure Ollama is installed and running
- Check that the model you're trying to use is downloaded (`ollama list`)
- If a model is not available, pull it with `ollama pull model_name` (a programmatic check is sketched below)

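From Python, you can run the same check with the module's own helper, assuming `get_ollama_models()` returns the names of locally downloaded models as suggested by the Advanced Usage section:

```python
from allendpoints import get_ollama_models

wanted = "llama3.2:3b"  # the model you intend to use
local_models = get_ollama_models()

if wanted in local_models:
    print(f"{wanted} is available locally")
else:
    print(f"{wanted} not found; run: ollama pull {wanted}")
```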
### Connection Issues
- Check your internet connection
- Ensure that the API endpoints are not blocked by your network or firewall
- Some providers may have rate limits or usage quotas

### Model Loading
- Large models may take time to load, especially on the first run
- The script preloads Ollama models to ensure fair timing measurements
- If a model consistently fails to load, try a smaller model or a different provider

### Colored Error Messages
- Install the `colorama` package for colored error messages: `pip install colorama`