Run any AI model locally: one unified API for chat, vision, speech, images, video, and more. Built for multi-user serving.
Drop-in replacement for the OpenAI API. Works with LangChain, LlamaIndex, and any OpenAI SDK client out of the box.
17 task types: chat, embeddings, reranking, vision, ASR, TTS, image generation, video, translation, classification, object detection, and more.
Per-model request queuing, API key management with rate limiting, and priority levels. Model A never blocks Model B.
Pull models from HuggingFace Hub, Ollama registry, or use Ollama cloud models, all through one CLI.
Multi-GPU scheduling with load balancing, VRAM-aware allocation, GGUF at full speed (113 tok/s on RTX 4090), fp16/GPTQ/AWQ/BNB quantization, and optional vLLM acceleration for chat and VLM models.
Terminal UI for real-time monitoring: GPU utilization, request queues, performance metrics, model management, all at a glance.
```shell
# Install
pip install inferall

# Pull a model
inferall pull llama3.1                    # from Ollama
inferall pull Qwen/Qwen2.5-7B-Instruct    # from HuggingFace

# Start serving
inferall serve

# That's it: OpenAI-compatible API at http://localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
```
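Because the API is OpenAI-compatible, the curl call above works from any HTTP client, not just the OpenAI SDKs. A minimal sketch using only Python's standard library (the `chat` helper name is ours for illustration; it assumes the quickstart server is running on the default port with `llama3.1` pulled):

```python
import json
import urllib.request

def chat(prompt, model="llama3.1", base_url="http://localhost:8000/v1"):
    """Send one user message to the OpenAI-style chat endpoint, return the reply text."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Same response shape as the OpenAI API
    return body["choices"][0]["message"]["content"]

# chat("Hello!")  # requires a running `inferall serve` instance
```

The official `openai` Python package works the same way: point its `base_url` at `http://localhost:8000/v1` and keep the rest of your code unchanged.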
Pull models from HuggingFace or Ollama with one command
Every model type in one place: chat, embeddings, vision, speech, and more
Start the server and connect any OpenAI-compatible client
Multiple models and users served concurrently, with no blocking
Real-time GPU monitoring with VRAM tracking and model allocation
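The non-chat task types are reached the same way. A minimal sketch for embeddings, assuming inferall also exposes the OpenAI-style `/v1/embeddings` route (implied by "drop-in replacement" plus the embeddings task type, though not shown in the quickstart); `my-embed-model` is a placeholder for whichever embedding model you have pulled:

```python
import json
import urllib.request

def embed(texts, model="my-embed-model", base_url="http://localhost:8000/v1"):
    """Embed a list of strings via the OpenAI-style embeddings endpoint."""
    payload = {"model": model, "input": texts}
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI response shape: one vector per input, in order
    return [item["embedding"] for item in body["data"]]

# vectors = embed(["hello world"])  # requires a running server
```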