Building Production-Ready AI APIs with FastAPI

After building multiple AI applications, I settled on FastAPI as my go-to framework for production backends. Here is everything I learned about structuring, securing, and deploying AI APIs that handle LLMs, RAG pipelines, and async workflows at scale.

Why FastAPI for AI Workloads

AI applications have unique backend requirements. Requests to large language models can take anywhere from a few seconds to over a minute. Synchronous frameworks such as Flask or Django, in their default WSGI mode, struggle under these constraints because each in-flight request occupies a worker thread for its entire duration.

FastAPI solves this with native async/await support: it is built on Starlette and runs on an ASGI server such as uvicorn. A single worker process can keep hundreds of connections open concurrently while it waits on external API responses or database queries. For AI applications, where latency is unpredictable, this makes a substantial difference.
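
To make the concurrency claim concrete, here is a minimal sketch that uses only asyncio. The function fake_llm_call is a hypothetical stand-in for a slow upstream request, simulated with asyncio.sleep rather than real network I/O.

```python
import asyncio
import time


async def fake_llm_call(request_id: int, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a slow LLM request; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)  # hands control back to the event loop while "waiting"
    return f"response-{request_id}"


async def handle_many(n: int) -> tuple[list[str], float]:
    """Run n simulated requests concurrently on one thread and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(n)))
    return list(results), time.perf_counter() - start
```

Calling asyncio.run(handle_many(100)) should finish in roughly the time of a single call rather than 100 times that, because every coroutine spends its time waiting, not computing.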

Additional benefits that made FastAPI my default choice:

Automatic API documentation: Interactive Swagger UI and ReDoc generated from type hints
Pydantic validation: Request and response models are validated automatically
Dependency injection: Clean, testable code for authentication, database sessions, and configuration
Type safety: Full support for Python type hints catches bugs before runtime

Project Architecture

A maintainable FastAPI project for AI requires clear separation of concerns. After iterating through several structures, here is the layout I use for production applications:

ai-backend/
├── app/
│   ├── __init__.py
│   ├── main.py              # Application entry point
│   ├── config.py            # Environment variables and settings
│   ├── dependencies.py      # Shared dependencies (DB, auth)
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── chat.py          # Chat and completion endpoints
│   │   ├── documents.py     # RAG document upload and management
│   │   └── health.py        # Health checks and metrics
│   ├── services/
│   │   ├── __init__.py
│   │   ├── llm_service.py   # LLM provider abstraction
│   │   ├── rag_service.py   # Retrieval and context assembly
│   │   └── embedding_service.py
│   ├── models/
│   │   ├── __init__.py
│   │   └── schemas.py       # Pydantic models
│   └── core/
│       ├── __init__.py
│       ├── security.py      # JWT, rate limiting
│       └── exceptions.py    # Custom exception handlers
├── tests/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env.example

The key insight is isolating external services behind an interface. If I switch from OpenAI to Ollama or from Pinecone to pgvector, only the service layer changes. Routers remain untouched.
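
As a rough illustration of that isolation on the retrieval side, the sketch below defines an interface and a toy backend. The names VectorStore, Chunk, InMemoryVectorStore, and retrieve_context are assumptions made for this example, not part of any library, and the word-overlap scoring only stands in for a real similarity search.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Chunk:
    document_id: str
    text: str
    score: float


class VectorStore(Protocol):
    """Hypothetical interface: callers depend on this, not on a vendor SDK."""

    async def similarity_search(self, query: str, top_k: int) -> list[Chunk]: ...


class InMemoryVectorStore:
    """Toy backend that scores by shared words; a real one would query pgvector or Pinecone."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    async def similarity_search(self, query: str, top_k: int) -> list[Chunk]:
        query_words = set(query.lower().split())
        scored = [
            Chunk(doc_id, text, float(len(query_words & set(text.lower().split()))))
            for doc_id, text in self.docs.items()
        ]
        return sorted(scored, key=lambda c: c.score, reverse=True)[:top_k]


async def retrieve_context(store: VectorStore, query: str) -> str:
    """Written against the interface only, so swapping the backend does not touch it."""
    chunks = await store.similarity_search(query, top_k=2)
    return "\n\n".join(c.text for c in chunks)
```

Any class with a matching similarity_search method satisfies the Protocol, so a pgvector-backed store could be passed to retrieve_context without changing the caller.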

Application Setup and Configuration

Configuration management is critical. I use Pydantic Settings to load environment variables with validation and defaults:

# app/config.py
from pydantic_settings import BaseSettings
from functools import lru_cache

class Settings(BaseSettings):
    app_name: str = "AI Backend API"
    debug: bool = False
    
    # LLM Providers
    openai_api_key: str | None = None
    ollama_base_url: str = "http://localhost:11434"
    default_model: str = "gpt-4"
    
    # Vector Database
    database_url: str = "postgresql://user:pass@localhost/db"
    vector_dimension: int = 1536
    
    # Security
    secret_key: str
    access_token_expire_minutes: int = 30
    rate_limit_per_minute: int = 60
    
    # Pydantic v2 style; the nested `class Config` form is deprecated there
    model_config = {"env_file": ".env"}

@lru_cache()
def get_settings() -> Settings:
    return Settings()

Using lru_cache ensures settings are loaded once and reused across requests, eliminating file I/O overhead.
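
The caching behaviour is easy to check in isolation. The sketch below uses a plain dataclass, DemoSettings, as a hypothetical stand-in for the Pydantic class so that it runs with the standard library alone.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Stand-in for the Pydantic Settings class; reads one variable from the environment."""

    app_name: str


@lru_cache
def get_demo_settings() -> DemoSettings:
    # The body runs on the first call only; later calls return the same cached object.
    return DemoSettings(app_name=os.environ.get("APP_NAME", "AI Backend API"))
```

One consequence worth knowing: changes to the environment after the first call are not picked up until get_demo_settings.cache_clear() is called, which matters mostly in tests.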

Main Application Factory

# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.config import get_settings
from app.routers import chat, documents, health
from app.core.exceptions import setup_exception_handlers

settings = get_settings()

app = FastAPI(
    title=settings.app_name,
    debug=settings.debug,
    docs_url="/docs" if settings.debug else None
)

# CORS for frontend integration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Exception handlers
setup_exception_handlers(app)

# Routers
app.include_router(health.router, prefix="/health", tags=["health"])
app.include_router(chat.router, prefix="/api/v1/chat", tags=["chat"])
app.include_router(documents.router, prefix="/api/v1/documents", tags=["documents"])

Pydantic Models for AI Requests

AI APIs benefit significantly from rigorous request validation. Here are the schemas I use for a RAG-powered chat endpoint:

# app/models/schemas.py
from pydantic import BaseModel, Field
from typing import Literal

class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"] = "user"
    content: str = Field(..., min_length=1, max_length=10000)

class ChatRequest(BaseModel):
    messages: list[ChatMessage] = Field(..., min_length=1)  # reject empty conversations
    model: str = "gpt-4"
    temperature: float = Field(0.7, ge=0, le=2)
    max_tokens: int = Field(1024, ge=1, le=4096)
    stream: bool = False
    use_rag: bool = True
    document_ids: list[str] = []

class Citation(BaseModel):
    document_id: str
    chunk_index: int
    text: str
    score: float

class ChatResponse(BaseModel):
    response: str
    citations: list[Citation] = []
    tokens_used: int
    model: str
    processing_time_ms: float

These type-safe models prevent malformed requests and self-document the API through the generated OpenAPI schema.

Abstracting LLM Providers

The biggest mistake I made early on was sprinkling OpenAI client code throughout my application. When I needed to add local model support via Ollama, I had to refactor dozens of endpoints. Now I use a unified service interface:

# app/services/llm_service.py
from abc import ABC, abstractmethod
from app.models.schemas import ChatMessage
import openai
import aiohttp

class LLMProvider(ABC):
    @abstractmethod
    async def generate(
        self, 
        messages: list[ChatMessage], 
        model: str, 
        temperature: float, 
        max_tokens: int
    ) -> str:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self, api_key: str):
        self.client = openai.AsyncOpenAI(api_key=api_key)
    
    async def generate(
        self, 
        messages: list[ChatMessage], 
        model: str, 
        temperature: float, 
        max_tokens: int
    ) -> str:
        response = await self.client.chat.completions.create(
            model=model,
            messages=[{"role": m.role, "content": m.content} for m in messages],
            temperature=temperature,
            max_tokens=max_tokens
        )
        # content is Optional in the SDK's response type; fall back to an empty string
        return response.choices[0].message.content or ""

class OllamaProvider(LLMProvider):
    def __init__(self, base_url: str):
        self.base_url = base_url
    
    async def generate(
        self, 
        messages: list[ChatMessage], 
        model: str, 
        temperature: float, 
        max_tokens: int
    ) -> str:
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/api/chat",
                json={
                    "model": model,
                    "messages": [{"role": m.role, "content": m.content} for m in messages],
                    "options": {"temperature": temperature, "num_predict": max_tokens},
                    "stream": False
                }
            ) as resp:
                # Surface HTTP errors here instead of a confusing KeyError below
                resp.raise_for_status()
                data = await resp.json()
                return data["message"]["content"]

class LLMService:
    def __init__(self, settings):
        self.providers: dict[str, LLMProvider] = {
            "ollama": OllamaProvider(settings.ollama_base_url)
        }
        # openai_api_key is optional in Settings, so register OpenAI only when a key is configured
        if settings.openai_api_key:
            self.providers["openai"] = OpenAIProvider(settings.openai_api_key)
    
    async def generate(self, provider: str, **kwargs) -> str:
        if provider not in self.providers:
            raise ValueError(f"Unknown provider: {provider}")
        return await self.providers[provider].generate(**kwargs)

Adding a new provider now requires creating one class and registering it in the dictionary. No router code changes are necessary.
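
A stripped-down sketch of that registration step follows. Provider, EchoProvider, and ProviderRegistry are illustrative names invented for this example, with a simplified generate signature; the echo provider is only a test double, not a real backend.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Trimmed-down version of the provider interface, for illustration only."""

    @abstractmethod
    async def generate(self, messages: list[dict], model: str) -> str: ...


class EchoProvider(Provider):
    """Hypothetical provider that echoes the last message; handy for local testing."""

    async def generate(self, messages: list[dict], model: str) -> str:
        return f"[{model}] {messages[-1]['content']}"


class ProviderRegistry:
    """Maps provider names to instances and dispatches generate() calls by name."""

    def __init__(self) -> None:
        self._providers: dict[str, Provider] = {}

    def register(self, name: str, provider: Provider) -> None:
        self._providers[name] = provider

    async def generate(self, provider: str, **kwargs) -> str:
        if provider not in self._providers:
            raise ValueError(f"Unknown provider: {provider}")
        return await self._providers[provider].generate(**kwargs)
```

With this shape, supporting another backend is one new class plus one register call, and callers keep passing a provider name.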

Implementing RAG Endpoints

The document upload and chat endpoints are the heart of any RAG application. Here is how I structure them for clarity and performance:

Document Upload with Background Processing

# app/routers/documents.py
from pathlib import Path

from fastapi import APIRouter, UploadFile, BackgroundTasks, Depends
from app.dependencies import get_db, get_current_user
from app.services.rag_service import process_document

router = APIRouter()

@router.post("/upload")
async def upload_document(
    file: UploadFile,
    background_tasks: BackgroundTasks,
    db = Depends(get_db),
    user = Depends(get_current_user)
):
    # get_current_user returns a dict, so index it rather than using attribute access
    user_id = user["id"]
    
    # Keep only the final path component so a crafted filename cannot escape /tmp
    safe_name = Path(file.filename or "upload").name
    file_path = f"/tmp/{user_id}_{safe_name}"
    
    # Stream to disk in 1 MB chunks instead of reading the whole upload into memory
    with open(file_path, "wb") as f:
        while chunk := await file.read(1024 * 1024):
            f.write(chunk)
    
    # Record the document, then process it in the background to avoid blocking
    doc_id = await db.documents.create(
        user_id=user_id,
        filename=file.filename,
        status="processing"
    )
    
    background_tasks.add_task(process_document, doc_id, file_path, user_id)
    
    return {
        "document_id": doc_id,
        "filename": file.filename,
        "status": "processing",
        "message": "Document queued for processing"
    }

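Using BackgroundTasks keeps the HTTP response fast: the handler returns as soon as the file is saved, and heavy operations like text extraction and embedding generation run after the response has been sent. The task still executes inside the same worker process, so very long-running jobs are better handed to a dedicated task queue.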

Chat with Retrieval

# app/routers/chat.py
import time

from fastapi import APIRouter, Depends
from app.models.schemas import ChatMessage, ChatRequest, ChatResponse, Citation
from app.dependencies import get_llm_service, get_vector_store, get_current_user

router = APIRouter()

@router.post("/", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    llm_service = Depends(get_llm_service),
    vector_store = Depends(get_vector_store),
    user = Depends(get_current_user)
):
    start_time = time.time()
    
    # Retrieve relevant context if RAG is enabled
    context_chunks = []
    if request.use_rag and request.document_ids:
        context_chunks = await vector_store.similarity_search(
            query=request.messages[-1].content,
            user_id=user["id"],  # get_current_user returns a dict
            document_ids=request.document_ids,
            top_k=5
        )
    
    # Assemble prompt with context
    system_prompt = "You are a helpful assistant."
    if context_chunks:
        context_text = "\n\n".join([c.text for c in context_chunks])
        system_prompt += f"\n\nRelevant context:\n{context_text}"
    
    # Providers expect ChatMessage objects, not dicts. model_construct skips
    # validation so a long retrieved context is not rejected by max_length.
    messages = [ChatMessage.model_construct(role="system", content=system_prompt)]
    messages += request.messages
    
    # Generate response
    response_text = await llm_service.generate(
        provider="openai",
        messages=messages,
        model=request.model,
        temperature=request.temperature,
        max_tokens=request.max_tokens
    )
    
    processing_time = (time.time() - start_time) * 1000
    
    return ChatResponse(
        response=response_text,
        citations=[
            Citation(
                document_id=c.document_id,
                chunk_index=c.chunk_index,
                text=c.text,
                score=c.score
            ) for c in context_chunks
        ],
        tokens_used=len(response_text.split()),  # Approximate
        model=request.model,
        processing_time_ms=processing_time
    )

Authentication and Security

Production AI APIs must protect both user data and expensive compute resources. I implement JWT-based authentication with per-user rate limiting:

JWT Authentication

# app/core/security.py
from datetime import datetime, timedelta, timezone
from jose import JWTError, jwt
from passlib.context import CryptContext
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from app.config import get_settings

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
security = HTTPBearer()

def create_access_token(data: dict, secret_key: str, expires_delta: timedelta):
    to_encode = data.copy()
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    expire = datetime.now(timezone.utc) + expires_delta
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, secret_key, algorithm="HS256")

async def get_current_user(
    credentials: HTTPAuthorizationCredentials = Depends(security),
    settings = Depends(get_settings)
):
    token = credentials.credentials
    try:
        payload = jwt.decode(token, settings.secret_key, algorithms=["HS256"])
        user_id: str = payload.get("sub")
        if user_id is None:
            raise HTTPException(status_code=401, detail="Invalid token")
        return {"id": user_id, "email": payload.get("email")}
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

Rate Limiting

# app/dependencies.py
from fastapi import Depends, HTTPException, Request
import time
from collections import defaultdict
from app.config import get_settings

rate_limit_store = defaultdict(list)

async def rate_limiter(
    request: Request,
    settings = Depends(get_settings)
):
    user_id = request.state.user_id if hasattr(request.state, "user_id") else "anonymous"
    now = time.time()
    window = 60  # 1 minute
    
    # Clean old entries
    rate_limit_store[user_id] = [
        t for t in rate_limit_store[user_id] if now - t < window
    ]
    
    if len(rate_limit_store[user_id]) >= settings.rate_limit_per_minute:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    
    rate_limit_store[user_id].append(now)

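This in-memory store lives in a single process, so with several uvicorn workers or replicas each one enforces its own limit. For multi-worker or distributed deployments, replace it with Redis so that all processes share one counter. The dependency's interface can remain identical.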

Streaming Responses

For chat interfaces, streaming responses dramatically improve perceived performance. FastAPI supports streaming with Server-Sent Events:

from fastapi.responses import StreamingResponse
import json

@router.post("/stream")
async def stream_chat(
    request: ChatRequest,
    llm_service = Depends(get_llm_service)
):
    async def event_generator():
        # This would connect to actual streaming LLM endpoint
        chunks = ["FastAPI ", "makes ", "streaming ", "responses ", "easy."]
        for chunk in chunks:
            yield f"data: {json.dumps({'token': chunk})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"
    
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream"
    )

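On the frontend, append tokens as they arrive to create the typewriter effect that users expect from modern AI interfaces. Note that the browser's EventSource API only issues GET requests, so a POST endpoint like this one has to be consumed with fetch and a stream reader instead.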
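
The placeholder generator above can be generalised to wrap any asynchronous token source. The sketch below shows only the framing logic, with fake_token_stream as a hypothetical stand-in for a provider's streaming API; the helper names are assumptions for this example.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_token_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a provider's streaming API."""
    for token in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield token


async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in a Server-Sent Events frame, then emit a final 'done' frame."""
    async for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield f"data: {json.dumps({'done': True})}\n\n"


async def collect(events: AsyncIterator[str]) -> list[str]:
    """Drain a stream into a list; useful for tests, not for production responses."""
    return [frame async for frame in events]
```

In an endpoint, the result of sse_events(...) would be passed to StreamingResponse in place of the hard-coded generator, with the same text/event-stream media type.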

Docker and Deployment

Containerization is essential for consistent deployments. Here is the Dockerfile I use for FastAPI AI backends:

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies for common AI libraries
RUN apt-get update && apt-get install -y \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code
COPY app/ ./app/

# Non-root user for security
RUN useradd -m appuser && chown -R appuser /app
USER appuser

# Uvicorn with multiple workers
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Docker Compose for Local Development

# docker-compose.yml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
    volumes:
      - ./app:/app/app  # Hot reload for development
    command: uvicorn app.main:app --host 0.0.0.0 --reload

  postgres:
    image: ankane/pgvector:latest
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: ai_db
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  pgdata:
  ollama_data:

This stack gives you the entire backend running locally: FastAPI, PostgreSQL with pgvector, Redis for caching and rate limiting, and Ollama for local LLM inference.

Testing Strategy

Testing AI backends requires mocking external services. I use pytest with dependency overrides to inject test doubles:

# tests/test_chat.py
import pytest
from fastapi.testclient import TestClient
from app.main import app
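from app.dependencies import get_current_user, get_llm_service, get_vector_store

class MockLLMService:
    async def generate(self, **kwargs):
        return "This is a test response from the mock LLM."

@pytest.fixture
def client():
    # Override every dependency the chat route declares, so the test needs
    # no real token, vector store, or LLM provider
    app.dependency_overrides[get_llm_service] = lambda: MockLLMService()
    app.dependency_overrides[get_vector_store] = lambda: None
    app.dependency_overrides[get_current_user] = lambda: {"id": "test-user", "email": None}
    yield TestClient(app)
    # Reset so overrides do not leak into other tests
    app.dependency_overrides.clear()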

def test_chat_endpoint(client):
    response = client.post("/api/v1/chat/", json={
        "messages": [{"role": "user", "content": "Hello"}],
        "model": "gpt-4",
        "use_rag": False
    })
    
    assert response.status_code == 200
    data = response.json()
    assert data["response"] == "This is a test response from the mock LLM."
    assert data["model"] == "gpt-4"
    assert "processing_time_ms" in data

This approach tests your routing, validation, and response formatting without making expensive API calls during test runs.

Performance Optimization

After deploying several AI APIs, these optimizations provided the most significant performance improvements:

Connection Pooling

LLM HTTP clients and database connections must be pooled. Creating a new connection per request adds a TCP (and usually TLS) handshake to every call, which can cost hundreds of milliseconds:

# Reuse clients across requests
import aiohttp

ollama_session: aiohttp.ClientSession | None = None

async def get_ollama_session() -> aiohttp.ClientSession:
    global ollama_session
    if ollama_session is None or ollama_session.closed:
        ollama_session = aiohttp.ClientSession()
    return ollama_session

Embedding Caching

Identical queries should not trigger a second embedding call. I cache query embeddings in Redis with a 1-hour TTL, reducing latency for common questions by 90%:

import hashlib
import json

async def get_cached_embedding(text: str, redis_client) -> list[float] | None:
    key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
    cached = await redis_client.get(key)
    if cached:
        return json.loads(cached)
    return None

Request Batching

When processing multiple documents, batch embedding requests rather than making individual API calls. Most embedding providers accept many texts per request (the limit varies by provider, so check its documentation), which significantly lowers per-item overhead.
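
A small helper is usually enough to do this. In the sketch below, embed_batch is a hypothetical async callable that takes a list of texts and returns one vector per text, and the default batch size of 100 is an assumption to tune for the provider in use.

```python
from typing import Awaitable, Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], Awaitable[list[list[float]]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed every text with one call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(await embed_batch(batch))
    return vectors
```

For 250 texts and a batch size of 100 this makes three calls instead of 250.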

Monitoring and Observability

Production AI systems need visibility into latency, token usage, and error rates. I add middleware to log every request:

from fastapi import Request
import time
import logging

logger = logging.getLogger("ai-backend")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    
    logger.info(
        f"method={request.method} path={request.url.path} "
        f"status={response.status_code} duration={duration:.3f}s"
    )
    
    return response

Emitting these logs as structured JSON, rather than the key=value string shown above, makes them easier to ingest into monitoring tools like Grafana or Datadog. Track these metrics specifically for AI endpoints:

Time to first token (TTFT): Latency before streaming starts
Tokens per second: Generation throughput
Cache hit rate: Percentage of queries served from cache
Error rate by provider: OpenAI vs Ollama reliability
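
For the JSON logging mentioned above, the standard library is sufficient. The sketch below is one way to do it; JsonFormatter, configure_json_logging, and the "fields" key passed through extra are conventions invented for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed as logger.info(..., extra={"fields": {...}}) are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def configure_json_logging(name: str = "ai-backend") -> logging.Logger:
    """Attach a stream handler with the JSON formatter to the named logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request log then becomes logger.info("request", extra={"fields": {"path": path, "duration_ms": ms}}), and metrics such as cache hits or tokens per second can be added as further fields.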

Common Pitfalls and Solutions

Here are mistakes I made and how I fixed them:

Pitfall 1: Blocking the Event Loop

I once called openai.ChatCompletion.create (the synchronous call from the pre-1.0 OpenAI SDK) inside an async endpoint. This blocked the event loop, and with it every other request on that worker, for 10+ seconds. Always use async clients: openai.AsyncOpenAI for OpenAI, aiohttp for Ollama.
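
When only a synchronous SDK is available, one common workaround is to push the call onto a worker thread. The sketch below assumes Python 3.9 or later for asyncio.to_thread; blocking_completion is a hypothetical blocking call simulated with time.sleep.

```python
import asyncio
import time


def blocking_completion(prompt: str) -> str:
    """Hypothetical synchronous SDK call; time.sleep stands in for blocking network I/O."""
    time.sleep(0.3)
    return f"echo: {prompt}"


async def generate(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_completion, prompt)


async def run_concurrently(prompts: list[str]) -> tuple[list[str], float]:
    """Issue all prompts at once and report how long the batch took."""
    start = time.perf_counter()
    results = await asyncio.gather(*(generate(p) for p in prompts))
    return list(results), time.perf_counter() - start
```

Threads are a fallback rather than a replacement for a native async client, since concurrency is then capped by the size of the thread pool.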

Pitfall 2: Memory Exhaustion with Large Files

Uploading a 100MB PDF and reading it entirely into memory with file.read() crashed my container. Now I stream uploads to disk and process them in chunks.
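
A minimal sketch of chunked saving with a size cap is shown below. The names save_in_chunks and UploadTooLarge, and the idea of enforcing max_bytes during the copy, are assumptions for illustration; the source can be any binary file-like object.

```python
import os
from typing import BinaryIO


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size cap."""


def save_in_chunks(
    src: BinaryIO,
    dest_path: str,
    max_bytes: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dest_path one chunk at a time; abort once max_bytes is exceeded."""
    written = 0
    try:
        with open(dest_path, "wb") as dest:
            while chunk := src.read(chunk_size):
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
                dest.write(chunk)
    except UploadTooLarge:
        os.remove(dest_path)  # do not leave a partial file behind
        raise
    return written
```

Memory use stays bounded by chunk_size regardless of the upload size, and oversized files are rejected before they fill the disk.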

Pitfall 3: Missing Timeouts

LLM requests can hang indefinitely. Always set explicit timeouts:

response = await client.post(
    url,
    json=payload,
    timeout=aiohttp.ClientTimeout(total=30)  # 30 seconds max
)

Conclusion

FastAPI provides an excellent foundation for production AI backends, but the framework alone is not enough. Success requires careful attention to async patterns, provider abstraction, security, and deployment practices.

Start with a clean project structure, abstract your LLM providers from day one, and invest in testing and monitoring early. The upfront effort pays dividends when you need to swap models, scale to more users, or debug production issues at 2 AM.

The patterns in this post are battle-tested from my own RAG SaaS and several client projects. Adapt them to your needs, and you will have a backend that is fast, secure, and maintainable.

AI Engineering Insights