Run any AI model locally: one unified API for chat, vision, speech, images, video, and more. Built for multi-user serving.
Drop-in replacement for the OpenAI API. Works with LangChain, LlamaIndex, and any OpenAI SDK client out of the box.
17 task types: chat, embeddings, reranking, vision, ASR, TTS, image generation, video, translation, classification, object detection, and more.
Per-model request queuing, API key management with rate limiting, and priority levels. Model A never blocks Model B.
Pull models from HuggingFace Hub, Ollama registry, or use Ollama cloud models, all through one CLI.
Multi-GPU scheduling with load balancing, VRAM-aware allocation, GGUF at full speed (113 tok/s on RTX 4090), fp16/GPTQ/AWQ/BNB quantization, and optional vLLM acceleration for chat and VLM models.
Terminal UI for real-time monitoring: GPU utilization, request queues, performance metrics, model management, all at a glance.
```shell
# Install
pip install inferall

# Pull a model
inferall pull llama3.1                    # from Ollama
inferall pull Qwen/Qwen2.5-7B-Instruct    # from HuggingFace

# Start serving
inferall serve

# That's it: OpenAI-compatible API at http://localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
```
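Because the API is OpenAI-compatible, the curl call above works from any HTTP client, not just the OpenAI SDKs. A minimal sketch using only Python's standard library (the `chat` helper name is ours for illustration; it assumes the quickstart server is running on the default port with `llama3.1` pulled):

```python
import json
import urllib.request

def chat(prompt, model="llama3.1", base_url="http://localhost:8000/v1"):
    """Send one user message to the OpenAI-style chat endpoint, return the reply text."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Same response shape as the OpenAI API
    return body["choices"][0]["message"]["content"]

# chat("Hello!")  # requires a running `inferall serve` instance
```

The official `openai` Python package works the same way: point its `base_url` at `http://localhost:8000/v1` and keep the rest of your code unchanged.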
Pull models from HuggingFace or Ollama with one command
Every model type in one place: chat, embeddings, vision, speech, and more
Start the server and connect any OpenAI-compatible client
Multiple models and users served concurrently, with no blocking
Real-time GPU monitoring with VRAM tracking and model allocation
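The non-chat task types are reached the same way. A minimal sketch for embeddings, assuming inferall also exposes the OpenAI-style `/v1/embeddings` route (implied by "drop-in replacement" plus the embeddings task type, though not shown in the quickstart); `my-embed-model` is a placeholder for whichever embedding model you have pulled:

```python
import json
import urllib.request

def embed(texts, model="my-embed-model", base_url="http://localhost:8000/v1"):
    """Embed a list of strings via the OpenAI-style embeddings endpoint."""
    payload = {"model": model, "input": texts}
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI response shape: one vector per input, in order
    return [item["embedding"] for item in body["data"]]

# vectors = embed(["hello world"])  # requires a running server
```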