AllEndpoints - Universal LLM Inference Tool

AllEndpoints is a Python module for running inference against multiple LLM providers through a unified interface. Supported providers include Ollama (local), HuggingFace, Together, Google Gemini, AIQL, Groq, NVIDIA, and the GitHub Copilot API.

Quick Start: With uv installed, you can run AllEndpoints without explicit dependency installation:

uv run allendpoints.py --list

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/allendpoints.git
    cd allendpoints
    
  2. Choose one of the following installation methods:

    Option A: Using pip

    Install the required dependencies using the requirements.txt file:

    pip install -r requirements.txt
    

    Then run the script directly:

    python allendpoints.py [arguments]
    

    Option B: Using uv

    If you have uv installed, you can run the script without explicitly installing dependencies:

    uv run allendpoints.py [arguments]
    

    This will automatically create a virtual environment and install all required dependencies on first run.

  3. Install Ollama (optional, for local inference): download it from https://ollama.com and pull at least one model with ollama pull <model_name>.

Environment Variables

The script uses environment variables to store API keys for different providers. Here are the required environment variables for each provider:

Provider        Environment Variable   Description
-----------------------------------------------------------------------
HuggingFace     HF_API_KEY             HuggingFace API key
Together        TOGETHER_API_KEY       Together AI API key
Google Gemini   GEMINI_API_KEY         Google AI Studio API key
AIQL            AIQL_API_KEY           AIQL API key
Groq            GROQ_API_KEY           Groq API key
NVIDIA          NVIDIA_API_KEY         NVIDIA API key
GitHub          GITHUB_TOKEN           GitHub token for Copilot API access
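
To check which of these keys are visible to Python before running the script, a quick standard-library snippet like the one below can help. It is a convenience sketch, not part of allendpoints itself; the variable names come from the table above.

# Print which provider API keys are set in the current environment
import os

required_keys = {
    "HuggingFace": "HF_API_KEY",
    "Together": "TOGETHER_API_KEY",
    "Google Gemini": "GEMINI_API_KEY",
    "AIQL": "AIQL_API_KEY",
    "Groq": "GROQ_API_KEY",
    "NVIDIA": "NVIDIA_API_KEY",
    "GitHub": "GITHUB_TOKEN",
}

for provider, key in required_keys.items():
    status = "set" if os.environ.get(key) else "missing"
    print(f"{provider:<15} {key:<18} {status}")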

Setting Up Environment Variables

Linux/macOS

Temporary (Current Session Only)

export HF_API_KEY="your_huggingface_api_key"
export TOGETHER_API_KEY="your_together_api_key"
export GEMINI_API_KEY="your_gemini_api_key"
export AIQL_API_KEY="your_aiql_api_key"
export GROQ_API_KEY="your_groq_api_key"
export NVIDIA_API_KEY="your_nvidia_api_key"
export GITHUB_TOKEN="your_github_token"

Permanent (Add to Shell Profile)

Add the above export commands to your ~/.bashrc, ~/.zshrc, or ~/.profile file:

echo 'export HF_API_KEY="your_huggingface_api_key"' >> ~/.bashrc
echo 'export TOGETHER_API_KEY="your_together_api_key"' >> ~/.bashrc
# Add other API keys similarly

Then reload your shell configuration:

source ~/.bashrc  # or ~/.zshrc or ~/.profile

Windows

Command Prompt (Temporary)

set HF_API_KEY=your_huggingface_api_key
set TOGETHER_API_KEY=your_together_api_key
set GEMINI_API_KEY=your_gemini_api_key
set AIQL_API_KEY=your_aiql_api_key
set GROQ_API_KEY=your_groq_api_key
set NVIDIA_API_KEY=your_nvidia_api_key
set GITHUB_TOKEN=your_github_token

PowerShell (Temporary)

$env:HF_API_KEY = "your_huggingface_api_key"
$env:TOGETHER_API_KEY = "your_together_api_key"
$env:GEMINI_API_KEY = "your_gemini_api_key"
$env:AIQL_API_KEY = "your_aiql_api_key"
$env:GROQ_API_KEY = "your_groq_api_key"
$env:NVIDIA_API_KEY = "your_nvidia_api_key"
$env:GITHUB_TOKEN = "your_github_token"

Permanent (System Environment Variables)

  1. Right-click on "This PC" or "My Computer" and select "Properties"
  2. Click on "Advanced system settings"
  3. Click on "Environment Variables"
  4. Under "User variables" or "System variables", click "New"
  5. Enter the variable name (e.g., HF_API_KEY) and its value
  6. Click "OK" to save
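
Alternatively, a user variable can be set permanently from Command Prompt with setx (it takes effect in new terminal sessions, not the current one):

setx HF_API_KEY "your_huggingface_api_key"

Repeat the command for each of the other API keys.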

Usage

Command-Line Arguments

usage: allendpoints.py [-h] [--provider PROVIDER] [--model MODEL] [--system SYSTEM] [--list] [--debug] [-a] [prompt]

LLM Inference Module

positional arguments:
  prompt               The prompt to send to the model (default: "Why is the sky blue?")

options:
  -h, --help           show this help message and exit
  --provider PROVIDER  The provider to use (ollama, hf, together, gemini, aiql, groq, nvidia, github)
  --model MODEL        The specific model to use
  --system SYSTEM      System content for chat models (default: "You are a helpful assistant.")
  --list               List available providers and models
  --debug              Enable debug output
  -a, --all            Run inference on all available providers and models

Examples

List all available providers and models:

# Using python directly
python allendpoints.py --list

# Using uv run
uv run allendpoints.py --list

Run inference with a specific provider and model:

# Using python directly
python allendpoints.py "What is the capital of France?" --provider ollama --model llama3.2:3b

# Using uv run
uv run allendpoints.py "What is the capital of France?" --provider ollama --model llama3.2:3b

Run inference with a specific provider and its default model:

# Using python directly
python allendpoints.py "Explain quantum computing" --provider gemini

# Using uv run
uv run allendpoints.py "Explain quantum computing" --provider gemini

Run inference with a custom system prompt:

# Using python directly
python allendpoints.py "Write a poem about AI" --provider ollama --model llama3.2:3b --system "You are a poetic assistant."

# Using uv run
uv run allendpoints.py "Write a poem about AI" --provider ollama --model llama3.2:3b --system "You are a poetic assistant."

Run inference on all available providers and models:

# Using python directly
python allendpoints.py "What is the meaning of life?" -a

# Using uv run
uv run allendpoints.py "What is the meaning of life?" -a

Run with debug output:

# Using python directly
python allendpoints.py "How does a nuclear reactor work?" --provider nvidia --model qwen2.5-coder-32b --debug

# Using uv run
uv run allendpoints.py "How does a nuclear reactor work?" --provider nvidia --model qwen2.5-coder-32b --debug

Using as a Python Module

AllEndpoints can be imported and used as a Python module in your own projects. Here's how to use it programmatically:

Basic Usage

# Import the necessary functions from allendpoints
from allendpoints import run_inference, check_available_apis, CONFIG

# Run inference with a specific provider and model
# Always specify the model parameter explicitly
response = run_inference(
    prompt="What is the capital of France?",
    provider="ollama",
    model="llama3.2:3b",
    system_content="You are a helpful assistant."
)

print(response)

# If you want to use the default model for a provider
default_model = CONFIG["defaults"]["ollama"]
response = run_inference(
    prompt="What is quantum computing?",
    provider="ollama",
    model=default_model
)

print(response)

Model Naming Conventions

The model parameter can be specified in two different ways:

  1. Short name (key): The abbreviated name used in the configuration
  2. Full name (value): The complete model identifier used by the provider's API

Example with both naming types:

# Using the short name (key)
response1 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen2.5-coder-32b"  # Short name
)

# Using the full name (value)
response2 = run_inference(
    prompt="What is AI?",
    provider="nvidia",
    model="qwen/qwen2.5-coder-32b-instruct"  # Full name
)

The script handles both formats automatically. If you use a short name, it will be converted to the full name internally. For Ollama models, the short and full names are typically identical.
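
Internally this resolution presumably amounts to a dictionary lookup against CONFIG["models"]. The helper below is a hypothetical illustration of that behavior, not a function exported by the module:

# Hypothetical sketch of short-name -> full-name resolution
from allendpoints import CONFIG

def resolve_model_name(provider: str, model: str) -> str:
    models = CONFIG["models"].get(provider, {})
    if isinstance(models, dict):
        # A known short name returns its full identifier;
        # anything else is assumed to already be a full name
        return models.get(model, model)
    return model

print(resolve_model_name("nvidia", "qwen2.5-coder-32b"))
# -> qwen/qwen2.5-coder-32b-instruct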

You can view the mapping between short and full names with:

# View the mapping between short and full model names
for provider, models in CONFIG["models"].items():
    if isinstance(models, dict):
        print(f"\nModels for {provider}:")
        for short_name, full_name in models.items():
            print(f"  {short_name} -> {full_name}")

Advanced Usage

# Import more functions for advanced usage
from allendpoints import (
    run_inference, 
    check_available_apis, 
    get_ollama_models, 
    InferenceHandler,
    CONFIG
)

# Get all available providers
available_providers = check_available_apis()
print(f"Available providers: {available_providers}")

# Get all available Ollama models
ollama_models = get_ollama_models()
print(f"Available Ollama models: {ollama_models}")

# Use a specific provider's handler directly
if "nvidia" in available_providers:
    nvidia_response = InferenceHandler.nvidia(
        prompt="Explain quantum computing",
        model="qwen/qwen2.5-coder-32b-instruct"
    )
    print(f"NVIDIA response: {nvidia_response}")

# Access the configuration
default_models = CONFIG["defaults"]
print(f"Default models: {default_models}")

Batch Processing Example

# Process multiple prompts with different providers
prompts = [
    "What is machine learning?",
    "Explain the theory of relativity",
    "How does a neural network work?"
]

providers = ["ollama", "gemini", "github"]

# Process each prompt with each provider
for prompt in prompts:
    for provider in providers:
        try:
            # Always specify the model parameter explicitly
            default_model = CONFIG["defaults"][provider]
            response = run_inference(prompt, provider, model=default_model)
            print(f"\nPrompt: {prompt}")
            print(f"Provider: {provider}")
            print(f"Response: {response[:100]}...")
        except Exception as e:
            print(f"Error with {provider}: {str(e)}")

Integration with main.py

The allendpoints module is integrated with main.py for benchmarking LLM performance on coding tasks:

# In main.py
from allendpoints import check_available_apis, run_inference

# Get available providers
available_apis = check_available_apis()

# Run inference with a specific model
response = run_inference(
    question,      # The coding problem to solve
    provider,      # The provider to use
    model_id,      # The specific model to use
    system_content # Optional system prompt
)

This integration allows main.py to benchmark various LLM providers and models on coding tasks using a unified interface.

Leaderboard Feature

AllEndpoints includes a built-in leaderboard feature that ranks models by their response time when using the -a/--all option. This helps you compare the performance of different models across providers.

How the Leaderboard Works

  1. When you run AllEndpoints with the -a/--all flag, it executes your prompt on all available models across all providers
  2. The script measures the response time for each model
  3. After all models have completed, a leaderboard is displayed ranking models from fastest to slowest
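
The timing logic can be pictured with the sketch below, which uses only the documented run_inference, check_available_apis, and CONFIG interfaces. For brevity it times one default model per provider, whereas a real -a run covers every configured model:

# Sketch: time one prompt per provider and print a speed ranking
import time
from allendpoints import run_inference, check_available_apis, CONFIG

prompt = "What is the capital of France?"
results = []

for provider in check_available_apis():
    model = CONFIG["defaults"][provider]
    start = time.perf_counter()
    run_inference(prompt, provider, model=model)
    results.append((f"{provider}/{model}", time.perf_counter() - start))

# Rank from fastest to slowest
for rank, (name, seconds) in enumerate(sorted(results, key=lambda r: r[1]), 1):
    print(f"{rank:<5} {name:<40} {seconds:.2f}")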

Example Leaderboard Output

==================================================
RESPONSE TIME LEADERBOARD
==================================================
Rank  Model                                   Time (seconds) 
-------------------------------------------------------------
1     ollama/llama3.2:1b-instruct-q4_K_M      0.19
2     groq/deepseek-r1-distill-llama-70b      0.26
3     ollama/cogito:3b                        0.31
4     ollama/llama3.2:3b                      0.32
5     nvidia/mixtral-8x7b                     0.40
6     groq/llama-3.3-70b-versatile            0.40
7     ollama/wizard-vicuna-uncensored:latest  0.40
8     ollama/samantha-mistral:latest          0.41
9     ollama/qwen2.5-coder:7b-instruct-q4_K_M 0.50
10    ollama/qwen2.5:14b                      0.93

Using the Leaderboard

To generate a leaderboard, use the -a/--all option with your prompt:

python allendpoints.py -a "What is the capital of France?"

The leaderboard helps you:

  1. Identify the fastest models for your specific use case
  2. Compare performance across providers (Ollama, Groq, NVIDIA, etc.)
  3. Optimize your workflow by selecting models with the best speed-to-quality ratio
  4. Monitor performance changes as models and APIs are updated

Factors Affecting Response Time

Response times in the leaderboard are affected by several factors:

  • Model size: Smaller models generally respond faster
  • Provider infrastructure: Cloud-based providers may have different latencies
  • Network conditions: Internet speed affects cloud provider response times
  • Local hardware: For Ollama models, your CPU/GPU capabilities matter
  • Model complexity: Some models are optimized for speed, others for quality
  • Query complexity: More complex prompts may take longer to process

Preloading Mechanism

For Ollama models, AllEndpoints uses a preloading mechanism to ensure fair timing measurements:

  1. Before timing the actual response, the script sends a simple "hello" query to warm up the model
  2. This eliminates the initial loading time from the performance measurement
  3. The reported time reflects only the actual inference time, not model loading

This provides a more accurate comparison between local Ollama models and cloud-based providers.
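
In terms of the documented API, the idea looks roughly like this (a sketch, not the module's exact code):

# Sketch: warm up an Ollama model, then time only the real inference
import time
from allendpoints import run_inference

model = "llama3.2:3b"

# Preload: a throwaway query loads the model into memory
run_inference("hello", provider="ollama", model=model)

# Timed run: measures inference only, not model loading
start = time.perf_counter()
response = run_inference("Why is the sky blue?", provider="ollama", model=model)
print(f"Inference time: {time.perf_counter() - start:.2f}s")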

Supported Providers

Ollama (Local)

  • Runs locally on your machine
  • Supports various open-source models
  • No API key required, but needs Ollama installed

HuggingFace

  • Provides access to HuggingFace's Inference API
  • Requires HF_API_KEY environment variable

Together

  • Provides access to Together AI's models
  • Requires TOGETHER_API_KEY environment variable

Google Gemini

  • Provides access to Google's Gemini models
  • Requires GEMINI_API_KEY environment variable

AIQL

  • Provides access to AIQL's models
  • Requires AIQL_API_KEY environment variable

Groq

  • Provides access to Groq's models
  • Requires GROQ_API_KEY environment variable

NVIDIA

  • Provides access to NVIDIA's models
  • Requires NVIDIA_API_KEY environment variable

GitHub

  • Provides access to GitHub Copilot models
  • Requires GITHUB_TOKEN environment variable

Adding New Models

To add a new model to an existing provider, edit the CONFIG dictionary in the script:

CONFIG = {
    "models": {
        "provider_name": {
            "model_display_name": "actual_model_id",
            # Add your new model here
            "new_model_name": "new_model_id"
        }
    }
}
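
If the new model should also become the provider's default (used when --model is omitted), the defaults mapping referenced elsewhere in this README presumably needs the same short name. The shape below is an assumption based on the CONFIG["defaults"] usage shown earlier:

# In the same CONFIG dictionary, point the provider's default at the new model
"defaults": {
    "provider_name": "new_model_name"  # short name from the "models" section
}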

Troubleshooting

API Key Issues

  • Ensure your API keys are correctly set in your environment variables
  • Check that the API keys have not expired
  • Verify that you have the necessary permissions for the models you're trying to access

Ollama Issues

  • Ensure Ollama is installed and running
  • Check that the model you're trying to use is downloaded (ollama list)
  • If a model is not available, pull it with ollama pull model_name
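
For example, to see which models are already downloaded and fetch one of the models used in this README:

# Show models that are already downloaded
ollama list

# Download a model
ollama pull llama3.2:3b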

Connection Issues

  • Check your internet connection
  • Ensure that the API endpoints are not blocked by your network or firewall
  • Some providers may have rate limits or usage quotas

Model Loading

  • Large models may take time to load, especially on the first run
  • The script preloads Ollama models to ensure fair timing measurements
  • If a model consistently fails to load, try a smaller model or a different provider

Colored Error Messages

  • Install the colorama package for colored error messages: pip install colorama