InferAll

Run any AI model locally: one unified API for chat, vision, speech, images, video, and more. Built for multi-user serving.

$ pip install inferall

Why InferAll?

🔌

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Works with LangChain, LlamaIndex, and any OpenAI SDK client out of the box.
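Because the API follows the OpenAI wire format, any OpenAI SDK can talk to it just by changing the base URL (e.g. `OpenAI(base_url="http://localhost:8000/v1")` in the official Python SDK). A minimal stdlib sketch of the request any such client would send, assuming a local server on port 8000 and an example model name of `llama3.1`:

```python
import json
import urllib.request

# Same payload shape as the OpenAI chat completions API.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(req) would return an OpenAI-style completion
# response once `inferall serve` is running.
print(req.full_url)
```

Frameworks like LangChain and LlamaIndex expose the same base-URL knob, which is why they work unmodified.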

🎯

Every Model Type

17 task types: chat, embeddings, reranking, vision, ASR, TTS, image generation, video, translation, classification, object detection, and more.

⚡

Multi-User Ready

Per-model request queuing, API key management with rate limiting, and priority levels. Model A never blocks Model B.
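The idea behind per-model queuing can be sketched in a few lines (illustrative only, not InferAll's actual internals): each model gets its own priority queue, so a backlog on one model never delays another, and a lower priority number jumps ahead.

```python
import queue

# One PriorityQueue per model; entries are (priority, request).
queues = {}

def submit(model, priority, request):
    queues.setdefault(model, queue.PriorityQueue()).put((priority, request))

def next_request(model):
    # Lower priority number is served first.
    return queues[model].get_nowait()[1]

submit("model-a", 1, "slow batch job")
submit("model-a", 0, "interactive chat")   # jumps model-a's queue
submit("model-b", 1, "embedding lookup")

print(next_request("model-a"))  # interactive chat
print(next_request("model-b"))  # embedding lookup, unaffected by model-a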

🔄

Pull from Anywhere

Pull models from the HuggingFace Hub or the Ollama registry, or use Ollama cloud models, all through one CLI.

🖥️

GPU Optimized

Multi-GPU scheduling with load balancing, VRAM-aware allocation, GGUF at full speed (113 tok/s on RTX 4090), fp16/GPTQ/AWQ/BNB quantization, and optional vLLM acceleration for chat and VLM models.
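The core of VRAM-aware allocation is a placement decision. A minimal sketch of one common policy (illustrative only, not InferAll's actual scheduler): load a model onto the GPU with the most free VRAM that can still fit it, and report failure if no single GPU can.

```python
def pick_gpu(free_vram_mb, model_size_mb):
    """Return the GPU id with the most free VRAM that fits the model,
    or None if no single GPU can hold it."""
    candidates = {gpu: free for gpu, free in free_vram_mb.items()
                  if free >= model_size_mb}
    if not candidates:
        return None  # would need multi-GPU sharding or quantization
    return max(candidates, key=candidates.get)

# Hypothetical two-GPU box, one card already partially loaded.
free = {0: 6_000, 1: 20_000}
print(pick_gpu(free, 14_000))  # 1
print(pick_gpu(free, 30_000))  # None
```

Quantization (GPTQ/AWQ/BNB, or GGUF) shrinks `model_size_mb`, which is how more models fit per card.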

📊

Built-in Dashboard

Terminal UI for real-time monitoring: GPU utilization, request queues, performance metrics, and model management, all at a glance.

Get Started in 3 Commands

# Install
pip install inferall

# Pull a model
inferall pull llama3.1          # from Ollama
inferall pull Qwen/Qwen2.5-7B-Instruct  # from HuggingFace

# Start serving
inferall serve

# That's it: OpenAI-compatible API at http://localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}]}'
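The response to the curl call above follows the OpenAI chat completions schema, so extracting the reply is the same as with the hosted API. A small sketch, using a hand-written sample body in place of a live response:

```python
import json

def reply_text(body):
    """Pull the assistant's reply out of an OpenAI-style completion body."""
    return json.loads(body)["choices"][0]["message"]["content"]

# Sample body in the OpenAI chat completions response shape
# (field names per that schema; content is made up for illustration).
sample = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Hello there!"}}]
})

print(reply_text(sample))  # Hello there!
```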

See It in Action

Pull models from HuggingFace or Ollama with one command

Every model type in one place: chat, embeddings, vision, speech, and more

Start the server and connect any OpenAI-compatible client

Multiple models and users served concurrently, with no blocking

Real-time GPU monitoring with VRAM tracking and model allocation

Supported Model Types

Chat / LLM · Embeddings · Reranking · Vision-Language · Speech Recognition · Text-to-Speech · Image Generation · Image-to-Image · Video Generation · Translation · Summarization · Classification · Object Detection · Segmentation · Depth Estimation · Document QA · Audio Processing