InferAll

Run any AI model locally: one unified API for chat, vision, speech, images, video, and more. Built for multi-user serving.

$ pip install inferall

Why InferAll?

🔌

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Works with LangChain, LlamaIndex, and any OpenAI SDK client out of the box.
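Because the server speaks the OpenAI wire format, any HTTP client can talk to it. Below is a minimal sketch using only the Python standard library; the model name, port, and endpoint path follow the quickstart defaults and have not been verified against a live InferAll server.

```python
import json
import urllib.request


def build_chat_request(model, messages, base_url="http://localhost:8000/v1"):
    """Build an OpenAI-style chat completion request (not yet sent)."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


req = build_chat_request("llama3.1", [{"role": "user", "content": "Hello!"}])

SEND = False  # flip to True with a running `inferall serve` instance
if SEND:
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
```

The official `openai` Python SDK should work the same way: point it at the server with `OpenAI(base_url="http://localhost:8000/v1", api_key="unused")` and call `client.chat.completions.create(...)` as usual.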

🎯

Every Model Type

17 task types: chat, embeddings, reranking, vision, ASR, TTS, image generation, video, translation, classification, object detection, and more.

⚡

Multi-User Ready

Per-model request queuing, API key management with rate limiting, and priority levels. Model A never blocks Model B.
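To make the per-model isolation concrete, here is a toy sketch of independent per-model queues in asyncio. All names here are hypothetical and this is not InferAll's actual scheduler; it only illustrates why a slow request on one model need not delay another model's queue.

```python
import asyncio
from collections import defaultdict


class PerModelQueues:
    """One FIFO queue and one worker task per model, so a slow model
    never blocks another model's requests. Illustrative sketch only."""

    def __init__(self):
        self.queues = defaultdict(asyncio.Queue)
        self.workers = {}

    async def submit(self, model, job):
        # Lazily start a dedicated worker for each model.
        if model not in self.workers:
            self.workers[model] = asyncio.create_task(self._worker(model))
        fut = asyncio.get_running_loop().create_future()
        await self.queues[model].put((job, fut))
        return await fut

    async def _worker(self, model):
        while True:
            job, fut = await self.queues[model].get()
            fut.set_result(await job())


async def main():
    q = PerModelQueues()

    async def slow():
        await asyncio.sleep(0.2)  # stands in for a long generation
        return "model-a done"

    async def fast():
        return "model-b done"

    # Submit a slow job to model A first, then a fast one to model B:
    # model B completes without waiting behind model A.
    a = asyncio.create_task(q.submit("model-a", slow))
    b = asyncio.create_task(q.submit("model-b", fast))
    order = [await t for t in asyncio.as_completed([a, b])]
    return order


order = asyncio.run(main())
print(order)  # model B finishes first despite being submitted second
```

With a single shared queue, the fast model-B request would sit behind the slow model-A one; per-model queues remove that head-of-line blocking.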

🔄

Pull from Anywhere

Pull models from the HuggingFace Hub or the Ollama registry, or use Ollama cloud models, all through one CLI.

🖥️

GPU Optimized

Multi-GPU scheduling, VRAM-aware allocation, GGUF support at full speed (113 tok/s on RTX 4090), and fp16/GPTQ/AWQ/BNB quantization.

📊

Built-in Dashboard

Terminal UI for real-time monitoring: GPU utilization, request queues, performance metrics, and model management, all at a glance.

Get Started in 3 Commands

# Install
pip install inferall

# Pull a model
inferall pull llama3.1          # from Ollama
inferall pull Qwen/Qwen2.5-7B-Instruct  # from HuggingFace

# Start serving
inferall serve

# That's it: OpenAI-compatible API at http://localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'

Supported Model Types

Chat / LLM
Embeddings
Reranking
Vision-Language
Speech Recognition
Text-to-Speech
Image Generation
Image-to-Image
Video Generation
Translation
Summarization
Classification
Object Detection
Segmentation
Depth Estimation
Document QA
Audio Processing