Run any AI model locally – one unified API for chat, vision, speech, images, video, and more. Built for multi-user serving.
Drop-in replacement for the OpenAI API. Works with LangChain, LlamaIndex, and any OpenAI SDK client out of the box.
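Because the server speaks the OpenAI wire format, any HTTP client works without an SDK. A minimal sketch using only the Python standard library (the helper name `build_chat_request` is ours; the model name and URL come from the quickstart below):

```python
import json
import urllib.request


def build_chat_request(model: str, content: str,
                       base_url: str = "http://localhost:8000/v1"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# With the server running, send it and read the reply:
#   with urllib.request.urlopen(build_chat_request("llama3.1", "Hello!")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request shape is what LangChain and the official OpenAI SDKs emit under the hood, which is why they work unchanged once pointed at the local base URL.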
17 task types – chat, embeddings, reranking, vision, ASR, TTS, image generation, video, translation, classification, object detection, and more.
Per-model request queuing, API key management with rate limiting, and priority levels. Model A never blocks Model B.
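The isolation property above – a backlog on one model never delaying another – falls out naturally from giving each model its own priority queue. A hypothetical sketch of that structure (not inferall's actual implementation):

```python
import heapq
import itertools
from collections import defaultdict


class PerModelScheduler:
    """One priority heap per model, so each model's queue drains
    independently: a flood of requests for model A never sits in
    front of a request for model B."""

    def __init__(self):
        self._queues = defaultdict(list)   # model name -> heap of requests
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority

    def submit(self, model, request, priority=1):
        # Lower number = higher priority; the counter preserves arrival order.
        heapq.heappush(self._queues[model], (priority, next(self._counter), request))

    def next_request(self, model):
        # Pop the highest-priority pending request for this model only.
        if self._queues[model]:
            return heapq.heappop(self._queues[model])[2]
        return None
```

API-key rate limiting would then sit in front of `submit`, rejecting or down-prioritizing requests before they ever reach a model's queue.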
Pull models from the HuggingFace Hub or the Ollama registry, or use Ollama cloud models – all through one CLI.
Multi-GPU scheduling, VRAM-aware allocation, GGUF support at full speed (113 tok/s on RTX 4090), and fp16/GPTQ/AWQ/BNB quantization.
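VRAM-aware allocation reduces, at its core, to choosing a device whose free memory fits the model. A simplified sketch of one such placement policy (illustrative only; inferall's scheduler may weigh more factors):

```python
def pick_gpu(free_vram_mb: dict, model_vram_mb: int):
    """Return the id of the GPU with the most free VRAM that still
    fits the model, or None when no GPU has enough headroom."""
    candidates = {gpu: free for gpu, free in free_vram_mb.items()
                  if free >= model_vram_mb}
    if not candidates:
        return None
    # Worst-fit (most free VRAM) is one reasonable policy: it leaves the
    # largest remaining headroom for the next model to be loaded.
    return max(candidates, key=candidates.get)
```

For example, a 10 GB model on a box with 8 GB free on GPU 0 and 20 GB free on GPU 1 lands on GPU 1; if nothing fits, the scheduler must queue the load or fall back to quantization.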
Terminal UI for real-time monitoring – GPU utilization, request queues, performance metrics, and model management, all at a glance.
```shell
# Install
pip install inferall

# Pull a model
inferall pull llama3.1                    # from Ollama
inferall pull Qwen/Qwen2.5-7B-Instruct   # from HuggingFace

# Start serving
inferall serve

# That's it – OpenAI-compatible API at http://localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
```
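OpenAI-compatible servers typically also support streaming via server-sent events (`"stream": true`), where each line carries a `data:` JSON chunk and the stream ends with `data: [DONE]`. A sketch of parsing that format (the sample below is illustrative, not captured from inferall):

```python
import json


def parse_sse_chunks(lines):
    """Yield content deltas from OpenAI-style server-sent-event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data).get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            yield delta["content"]


# Example: "".join(parse_sse_chunks(response_lines)) reassembles the reply
# as the tokens arrive, instead of waiting for the full completion.
```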