FastAPI in Production for AI Engineers

Q: Why does FastAPI need a separate server like Uvicorn?

FastAPI is an ASGI application that consumes ASGI events; it cannot open sockets, parse raw HTTP, or manage the event loop. Uvicorn is the ASGI server that listens on the network, parses HTTP, translates each connection into ASGI events, and invokes your app. The split lets servers and frameworks evolve independently.

Q: When should I use async def vs def in FastAPI?

Use async def with await for non-blocking I/O such as an async HTTP client or async database driver. Use plain def for blocking libraries, which FastAPI runs in a threadpool so they cannot stall the event loop. An async def handler making a blocking call inside it is the deadliest mistake: it freezes every request in the process under load.

Q: How do I stream LLM tokens from FastAPI?

Return a StreamingResponse wrapping an async generator that yields server-sent-event frames as tokens arrive, with a terminal DONE frame. Put cost metering and logging in a finally block because clients disconnect mid-stream constantly. At the infrastructure layer, disable proxy buffering and monitor time-to-first-token separately from total latency.

Q: What's the difference between a liveness and a readiness probe?

Liveness answers whether the process is broken beyond self-recovery; its remedy is a restart, so it must check only process-local health, never the database. Readiness answers whether the pod should receive traffic now; its remedy is removal from the load balancer, so it checks critical dependencies and warm-up state. Conflating them turns a brief dependency blip into a mass-restart cascade.

Q: Is FastAPI good for serving machine learning models?

Yes, it is the de facto serving layer for inference and LLM endpoints. Models live in Python, so serving them in-process avoids a serialization boundary, async I/O multiplexes the multi-second waits of hosted LLM calls, and streaming over ASGI enables token-by-token responses. Load the model once in the lifespan handler and run CPU-bound inference off the event loop.

Q: How do I load a model once instead of on every request?

Load it in the lifespan handler and store it on app.state, then read it in handlers. Per-request loading causes seconds of transfer per call and memory exhaustion, while import-time loading couples loading to module import and breaks tooling. Lifespan loading runs once per worker before traffic and lets a readiness probe report ready only after the model is warm.

Reference: fastapi/fastapi — MIT · Python

FastAPI is a Python web framework where one type annotation does five jobs: parse the request, validate it, convert it, document it, and render the interactive docs. For AI engineers it has become the default layer between an HTTP client and a model. This guide covers that layer end to end, from a first endpoint to a production RAG service.¹

Most production failures are not framework bugs. They are architectural: a blocking call inside an async handler, an N+1 query that melts at 10,000 rows, a JWT verifier that trusts the token's own algorithm header, a liveness probe that restarts every healthy pod when the database blips. The framework is small and correct. The mistakes are ours. This guide front-loads the ones that recur, with original code you can paste and reason about.

The running examples are deliberately boring — tasks, users, an inference endpoint — because boring patterns transfer. What you learn shaping a CRUD router is exactly what you reuse shaping an LLM gateway. We will keep the foundations tight and spend the depth where production breaks.

What is FastAPI, and why has it taken over AI serving?

FastAPI is an ASGI Python framework for building typed HTTP APIs. Its own documentation describes it as "a modern, fast (high-performance), web framework for building APIs with Python based on standard Python type hints."¹ It is a thin composition of two libraries: Starlette for the web machinery (routing, middleware, WebSockets) and Pydantic for data validation. FastAPI adds dependency injection and automatic OpenAPI generation on top.

The framework "stands on the shoulders of giants," in the docs' words — Starlette "for the web parts" and Pydantic "for the data parts."¹ This layering matters in practice: when you debug a FastAPI app, you are often reading Starlette or Pydantic source. Knowing which library owns which behavior — middleware ordering is Starlette, validation errors are Pydantic, Depends() resolution is FastAPI itself — is what separates a user of the framework from an engineer who can reason about it.

AI serving converged on FastAPI for concrete reasons. Models live in Python, so serving them in-process removes a serialization boundary. Calling a hosted large-language-model (LLM) API is a multi-second I/O wait, which is exactly what an async event loop is built to multiplex. Token streaming needs server-sent events, which require ASGI. And Pydantic is already the lingua franca of the AI tooling stack, so validating a model's structured output reuses the same boundary that validates a request body. The pieces fit.

flowchart TD
  C[Web / Mobile<br/>clients] --> LB[Load<br/>balancer]
  LB --> U1[Uvicorn<br/>ASGI worker]
  LB --> U2[Uvicorn<br/>ASGI worker]
  U1 --> S[Starlette<br/>routing]
  U2 --> S
  S --> V[Pydantic<br/>validation]
  V --> H[Your<br/>handlers]
  H --> PG[(PostgreSQL)]
  H --> RD[(Redis)]
  H --> M[Model /<br/>LLM backend]

Figure 1: The production FastAPI request path. Clients reach a load balancer that fans requests across Uvicorn (ASGI) worker processes; each runs your app, where Starlette routes the request, Pydantic validates it, and your handlers talk to Postgres, Redis, and a model backend. Every layer in this diagram is a separate, replaceable component.

Why does the server interface (ASGI vs WSGI) decide your costs?

WSGI, the older Python server contract, is synchronous: one request occupies one worker for its full duration, so a handler waiting two seconds on a database does nothing for two seconds. ASGI replaces that blocking call with an async protocol, so a single event loop interleaves thousands of in-flight requests — while one awaits a row, the loop serves others.

The win is specifically for I/O-bound work: waiting on databases, caches, and upstream APIs. For CPU-bound work such as in-process inference or image processing, the event loop does not help, and blocking it actively hurts. Ignoring that distinction is the single most common FastAPI production mistake, and we return to it in the async section. ASGI also enables protocols WSGI structurally cannot support: WebSockets, server-sent events, and HTTP streaming, all essential for streaming LLM tokens.
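The effect of blocking the loop can be seen without FastAPI at all. The sketch below uses only the standard library; the handler names and the 50 ms delay are illustrative assumptions, and `time.sleep` stands in for any blocking library call.

```python
import asyncio
import time

async def blocking_handler() -> None:
    time.sleep(0.05)            # blocking call: the event loop is frozen for 50 ms

async def awaiting_handler() -> None:
    await asyncio.sleep(0.05)   # non-blocking wait: the loop runs other tasks meanwhile

async def offloaded_handler() -> None:
    # The same blocking call, pushed to a worker thread so the loop stays free.
    await asyncio.to_thread(time.sleep, 0.05)

async def run_concurrently(handler, n: int = 10) -> float:
    """Start n copies of handler at once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start

def measure() -> dict[str, float]:
    async def main() -> dict[str, float]:
        return {
            "blocking": await run_concurrently(blocking_handler),
            "awaiting": await run_concurrently(awaiting_handler),
            "offloaded": await run_concurrently(offloaded_handler),
        }
    return asyncio.run(main())

if __name__ == "__main__":
    for name, seconds in measure().items():
        print(f"{name:>9}: {seconds:.3f}s")
```

Ten blocking handlers run one after another, so their total time is roughly ten times a single call, while the awaiting and offloaded variants overlap. Exact timings depend on the machine and on the size of the default thread pool.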

The cost difference is large enough to plan around. Consider a handler that spends most of its time waiting on I/O. A synchronous worker that holds one request at a time sustains roughly 1 / total_time requests per second. An async worker holds many requests concurrently, bounded instead by the small CPU slice each one needs. The illustrative arithmetic below is not a benchmark — it is the shape of the trade-off, and it is why an I/O-heavy service can run on a handful of async workers where a sync stack would need hundreds.

Workload per request	Sync worker (1 in flight)	Async worker (many in flight)
300 ms I/O wait + 5 ms CPU	~3.3 req/s per worker	bounded by CPU: ~200 req/s per worker
To serve ~1,000 req/s	~300 workers	~5–8 workers
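The table's arithmetic can be reproduced in a few lines. This is a sketch of the same back-of-envelope model, not a benchmark; the function names are illustrative, and the async figure is a zero-headroom floor, so a real deployment would provision above it.

```python
import math

def sync_rate(io_s: float, cpu_s: float) -> float:
    """Requests per second for a worker that holds one request at a time."""
    return 1.0 / (io_s + cpu_s)

def async_rate(cpu_s: float) -> float:
    """Upper bound for an async worker: only the CPU slice is serialized."""
    return 1.0 / cpu_s

def workers_needed(target_rps: float, per_worker_rps: float) -> int:
    # Round before ceil so float noise does not add a phantom worker.
    return math.ceil(round(target_rps / per_worker_rps, 6))

io_s, cpu_s, target = 0.300, 0.005, 1_000
sync_workers = workers_needed(target, sync_rate(io_s, cpu_s))
async_workers = workers_needed(target, async_rate(cpu_s))

if __name__ == "__main__":
    print(f"sync workers:  {sync_workers}")
    print(f"async workers: {async_workers} (before headroom)")
```

The model ignores memory limits, upstream rate limits, and tail latency, which is why it gives the shape of the trade-off rather than a capacity plan.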

The FastAPI documentation claims "very high performance, on par with NodeJS and Go (thanks to Starlette and Pydantic)."¹ The performance does not come from FastAPI doing anything clever at runtime — it comes from sitting on an async toolkit and a fast validator, and from not blocking the loop. Treat that as a contract you must uphold, not a free lunch.

How does the declarative contract turn one annotation into five behaviors?

In most Python code, type hints are passive metadata that linters read. In FastAPI they are executed. When you annotate a parameter as item_id: int, the framework derives five behaviors from that one declaration: request parsing, type coercion, validation with a structured 422 error, the OpenAPI schema entry, and the docs rendering.

The annotation becomes the single source of truth for the request contract, eliminating the validation-plus-docs-plus-serializer duplication other stacks require.

Pydantic is the runtime engine behind this. Its philosophy is "parse, don't validate": instead of checking raw data and passing it along still-raw, it converts the input into a typed object that makes illegal states unrepresentable. After validation succeeds, every consumer downstream can trust the object — no defensive re-checking. The boundary does the work once; the core stays clean. Pydantic v2 made this fast by rewriting its core in Rust: the docs state plainly that "Pydantic's core validation logic is written in Rust," and that it is "among the fastest data validation libraries for Python."²

A first application shows the whole idea in a few lines. Notice that no validation code is written by hand — the annotations carry it:

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="Inference API", version="1.0.0")

class PredictIn(BaseModel):
    text: str = Field(min_length=1, max_length=4_000)  # length cap = cost control

class PredictOut(BaseModel):
    label: str
    confidence: float = Field(ge=0, le=1)
    model_version: str

@app.post("/v1/predict", response_model=PredictOut)
def predict(req: PredictIn) -> PredictOut:
    # req is already parsed, coerced, and validated — the handler never sees bad data
    return PredictOut(label="positive", confidence=0.97, model_version="sentiment-v3.2.1")

Send a body with a text field longer than the cap and the request never reaches the handler — FastAPI returns a 422 listing the offending field and the reason. The response_model works in the other direction: whatever object you return, only the fields declared on PredictOut are serialized. The annotation guards both doors.

flowchart TD
  C[Client<br/>POST JSON] --> P[Parse<br/>JSON body]
  P --> V{Validate &<br/>coerce}
  V -->|valid| H[Handler runs<br/>typed object]
  V -->|invalid| E[422 + per-field<br/>errors]
  H --> R[Filter via<br/>response_model]
  R --> O[Response]
  E --> O

Figure 2: The two-way validation boundary. An inbound JSON body is parsed, then validated and coerced by Pydantic; invalid input short-circuits into a 422 listing each bad field, so the handler only ever runs on good data. On the way out, the response_model acts as a whitelist that filters the returned object. Validation guards both directions.

A professional habit follows from this: keep request, storage, and response schemas as separate models. Inbound and outbound shapes almost always diverge — a password comes in, an id and timestamps go out. Modeling them separately is the first defense against leaking data, because a response model that simply does not declare a sensitive field cannot emit it.

What do the HTTP verbs actually promise, and how do you make POST safe to retry?

Each HTTP method carries semantics that clients, proxies, and caches rely on. GET is safe and cacheable. PUT and DELETE are idempotent: applying them twice yields the same state as once, which is what makes client retries safe. POST is neither safe nor idempotent — a retried POST may create a duplicate, which is why payment APIs add idempotency keys. PATCH applies a partial update.

Idempotency is a retry contract, not pedantry. Distributed systems retry; networks drop responses after the server has already acted. Designing PUT/DELETE to be idempotent and giving POST an idempotency-key escape hatch is what makes "the client timed out, retried, and now there are two orders" impossible. The client sends a unique key per logical operation; the server records the key before processing — inside the same transaction or lock as the side effect — and returns the stored response on replays.
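A minimal sketch of that replay logic follows, using an in-memory store as a stand-in for a database table with a unique key column. The class and method names are hypothetical, and the lock plays the role the paragraph assigns to the transaction.

```python
import threading

class IdempotentOrders:
    """In-memory stand-in for a table with a unique idempotency_key column."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._responses: dict[str, dict] = {}
        self._orders: list[dict] = []

    def create_order(self, idempotency_key: str, payload: dict) -> dict:
        # Key lookup, side effect, and stored response happen under one lock,
        # so a concurrent retry cannot slip in between them.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]   # replay: no new order
            order = {"id": len(self._orders) + 1, **payload}
            self._orders.append(order)                     # the side effect
            self._responses[idempotency_key] = order
            return order

    def order_count(self) -> int:
        return len(self._orders)

store = IdempotentOrders()
first = store.create_order("key-123", {"sku": "A1", "qty": 2})
retry = store.create_order("key-123", {"sku": "A1", "qty": 2})   # client retried
```

In a real service the store would be a database row written in the same transaction as the order, with an expiry policy for old keys; both are left out here.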

Status codes are a machine-readable contract, so they must mean what they say. Use 201 for creation, 204 for deletes, 400 for malformed requests, 401 for missing or invalid credentials, 403 for valid credentials lacking permission, 404 for absent resources, 409 for state conflicts (a duplicate unique key, a concurrent edit), 422 for schema validation failures (FastAPI's default), and 429 for rate limiting. Never return 200 with an {"error": ...} body — it breaks caches, monitoring, and every retry library, which treat 4xx as "do not retry" and 5xx as "retry."

A complete CRUD router shows the verbs, status codes, and the PATCH subtlety together:

from typing import Annotated
from fastapi import APIRouter, HTTPException, Query, status
from pydantic import BaseModel, Field

router = APIRouter(prefix="/api/v1/tasks", tags=["tasks"])

class TaskCreate(BaseModel):
    title: str = Field(min_length=1, max_length=200)
    description: str = ""

class TaskUpdate(BaseModel):                 # PATCH: every field optional, no semantic defaults
    title: str | None = Field(default=None, min_length=1, max_length=200)
    description: str | None = None

@router.post("", status_code=status.HTTP_201_CREATED, summary="Create a task")
def create_task(payload: TaskCreate):
    return repo.create(payload)

@router.get("", summary="List tasks")
def list_tasks(
    limit: Annotated[int, Query(ge=1, le=100)] = 20,   # server-enforced max
    offset: Annotated[int, Query(ge=0)] = 0,
):
    return repo.list(limit, offset)

@router.patch("/{task_id}", summary="Partially update a task")
def update_task(task_id: int, payload: TaskUpdate):
    changes = payload.model_dump(exclude_unset=True)   # only fields the client sent
    return repo.update(task_id, changes)

Two production details hide in that code. The list endpoint caps limit at 100 — an unbounded limit is an invitation to dump your table, a denial-of-service vector, and a number you cannot capacity-plan around. And the PATCH handler uses model_dump(exclude_unset=True) to distinguish "client omitted this field" from "client sent null." That distinction is the heart of correct partial updates.

A field in a PATCH body can express three client intents, and the parsed value alone cannot tell the first two apart, because both arrive as None. A field omitted means "leave unchanged"; a field set to null, as in {"description": null}, means "clear it"; a field with a value means "set it." Pydantic tracks which fields were actually supplied, so exclude_unset=True returns only the provided keys: an omitted description never appears, while an explicit null appears as None. Every PATCH-model field must therefore be optional with no semantic default, and if clearing is disallowed for some field, validate None explicitly rather than silently ignoring it.
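The three intents can be shown with plain dictionaries. In the sketch below, `changes` is assumed to be the dict a handler gets from model_dump(exclude_unset=True); the `apply_patch` helper and the `non_nullable` set are hypothetical names for illustration.

```python
def apply_patch(
    stored: dict,
    changes: dict,
    non_nullable: frozenset[str] = frozenset({"title"}),
) -> dict:
    """Merge only the keys the client actually sent into the stored record."""
    for key, value in changes.items():
        if value is None and key in non_nullable:
            # Clearing is disallowed for this field: reject it explicitly.
            raise ValueError(f"{key} cannot be cleared")
    return {**stored, **changes}

task = {"title": "write tests", "description": "cover the PATCH path"}

unchanged = apply_patch(task, {})                          # field omitted: leave as is
cleared = apply_patch(task, {"description": None})         # explicit null: clear it
updated = apply_patch(task, {"description": "new text"})   # value: set it
```

A handler would typically translate the ValueError into a 422 or 400 response rather than letting it propagate.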

Pagination has the same depth. Offset pagination (limit/offset) is simple and supports random page access, but it degrades on deep pages — the database scans and discards offset rows — and can skip or duplicate items when rows are inserted mid-scan. Cursor (keyset) pagination encodes the last-seen sort key and uses an indexed WHERE (created_at, id) > (...) filter, giving constant cost at any depth and stability under writes. The rule of thumb: offset for small admin UIs, cursor for anything public, infinite-scroll, or large.
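Keyset pagination is easier to judge with a concrete query. The sketch below uses the standard library's sqlite3 module in place of Postgres, purely so it runs anywhere; the table layout, the cursor encoding, and the function names are illustrative assumptions.

```python
import base64
import json
import sqlite3
from typing import Optional

def encode_cursor(created_at: str, row_id: int) -> str:
    return base64.urlsafe_b64encode(json.dumps([created_at, row_id]).encode()).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, row_id

def fetch_page(conn: sqlite3.Connection, limit: int, cursor: Optional[str] = None):
    """Return (rows, next_cursor); next_cursor is None on the last page."""
    sql = "SELECT id, created_at, title FROM tasks "
    params: list = []
    if cursor is not None:
        sql += "WHERE (created_at, id) > (?, ?) "    # resume after the last-seen key
        params.extend(decode_cursor(cursor))
    sql += "ORDER BY created_at, id LIMIT ?"
    params.append(limit + 1)                          # one extra row signals "more"
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if has_more else None
    return rows, next_cursor

def demo_connection() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, created_at TEXT, title TEXT)")
    conn.execute("CREATE INDEX ix_tasks_created_id ON tasks (created_at, id)")
    conn.executemany(
        "INSERT INTO tasks (created_at, title) VALUES (?, ?)",
        # Pairs of rows share a timestamp, so the id tiebreaker matters.
        [(f"2024-01-01T00:00:{i // 2:02d}", f"task {i}") for i in range(7)],
    )
    return conn
```

Including id in the sort key is what keeps pages stable when several rows share a created_at value. A public API would usually sign or encrypt the cursor so clients cannot tamper with it; that is omitted here.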

flowchart TD
  C[Client] --> VL[API version<br/>layer]
  VL --> V1[v1 router<br/>frozen contract]
  VL --> V2[v2 router<br/>current contract]
  V1 --> AD[Compatibility<br/>adapter]
  AD --> SVC[Shared service<br/>layer]
  V2 --> SVC
  SVC --> DB[(Database)]

Figure 3: URL-based API versioning. Both versions route into one shared service layer; the old v1 contract survives through a thin compatibility adapter, so breaking changes get a new router without ever duplicating business logic. Additive optional fields need no new version; renames, removals, and semantic changes do.

Where should business logic live, and how does Depends() wire the layers?

Business logic lives in a service layer, not the handler — and dependency injection is the wiring that gets it there. Instead of constructing collaborators, the handler declares what it needs and the framework supplies them: Depends(get_db) tells FastAPI to call get_db before the handler and pass the result in. Three benefits follow — decoupling, reuse, and substitutability.

The last one is what the entire testing approach rests on. In FastAPI a dependency is just a callable, and a dependency such as get_db can declare dependencies of its own; the framework resolves those first, so it works through a whole graph for every request.

Three mechanics carry the system. Yield-based dependencies wrap the request in a setup/teardown pair, the idiomatic home for sessions and locks. Chaining lets one dependency depend on another — a current-user dependency depends on a database dependency — and FastAPI resolves the whole graph. Per-request caching means that if five dependencies in one request all require get_db, it executes once and the result is shared. The graph is computed once at startup, so per-request resolution overhead is small.

flowchart TD
  R[Handler] --> CU[get_current_user]
  CU --> DB[get_db]
  R --> CFG[get_settings]
  DB --> SES[(DB session<br/>per request)]
  CU --> USR[Current user]
  CFG --> S[Settings<br/>cached once]

Figure 4: How Depends() resolves a request. The handler depends on get_current_user, which itself depends on get_db, so FastAPI walks the graph depth-first; a per-request cache means a dependency required by several others runs only once. Settings are cached once per process, while the database session is created fresh per request.

The yield-based session dependency is where transactional correctness lives. Code before yield acquires the session; the handler runs while the generator is suspended; the finally block releases it — guaranteeing the connection returns to the pool on success and on failure. On an unhandled handler exception, FastAPI throws it back into the generator at the yield point, so an except block can roll the transaction back. Without this, error paths leak connections, the pool exhausts under sustained errors, and the whole service locks up — a classic outage.

from sqlalchemy.orm import Session

def get_db():
    db = SessionLocal()
    try:
        yield db
        db.commit()          # commit on success
    except Exception:
        db.rollback()        # roll back on any unhandled error
        raise
    finally:
        db.close()           # always return the connection to the pool

flowchart TD
  A[Acquire session<br/>before yield] --> B[Handler runs<br/>suspended at yield]
  B --> C{Unhandled<br/>exception?}
  C -->|no| D[Commit]
  C -->|yes| E[Rollback]
  D --> F[Close session<br/>finally]
  E --> F
  F --> G[Connection<br/>back to pool]

Figure 5: The yield-session request lifecycle. The dependency acquires a session before yield, the handler runs while the generator is suspended, and the finally block always closes the session. If the handler raised, the except branch rolls the transaction back before closing; otherwise it commits. Either way the connection returns to the pool.
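The commit, rollback, and close ordering can be checked without a database or a framework. The sketch below drives a generator by hand, roughly the way a framework would; `FakeSession`, `session_scope`, and `run_request` are hypothetical names, and this is not FastAPI's actual resolution code.

```python
class FakeSession:
    """Stand-in for a database session that records the calls made on it."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def commit(self) -> None:
        self.calls.append("commit")

    def rollback(self) -> None:
        self.calls.append("rollback")

    def close(self) -> None:
        self.calls.append("close")

def session_scope(session: FakeSession):
    try:
        yield session
        session.commit()          # reached only if the handler returned normally
    except Exception:
        session.rollback()        # the handler's exception re-enters here
        raise
    finally:
        session.close()           # runs on both paths

def run_request(session: FakeSession, handler) -> None:
    gen = session_scope(session)
    db = next(gen)                # setup: run up to the yield
    try:
        handler(db)
    except Exception as exc:
        try:
            gen.throw(exc)        # raise the error inside the generator, at the yield
        except Exception:
            pass                  # a real framework would turn this into an error response
        return
    try:
        next(gen)                 # success: resume after the yield
    except StopIteration:
        pass

def failing_handler(db: FakeSession) -> None:
    raise RuntimeError("boom")
```

Running a normal handler records commit then close; running `failing_handler` records rollback then close, which matches the two branches in the figure.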

This wiring enables a clean three-layer architecture. Routers translate HTTP to the domain — they parse input, call a service, and map exceptions to status codes, with no business logic. Services hold the business rules and orchestrate transactions, with no HTTP and no raw SQL. Repositories encapsulate all data access, with no business rules. Dependencies point inward only: router calls service, service calls repository. The payoff is substitutability per layer — swap Postgres for a fake repository in service tests, exercise HTTP concerns without a database, move a service to a worker queue without touching SQL.

flowchart TD
  RT[Router HTTP<br/>tasks.py] --> SV[Service rules<br/>task_service.py]
  SV --> RP[Repository data<br/>task_repo.py]
  RP --> DB[(PostgreSQL)]
  RT --> SC[Schemas<br/>Pydantic]
  SV --> EX[Domain<br/>exceptions]

Figure 6: The layered FastAPI architecture. The router handles HTTP and delegates to a service that holds business rules; the service calls a repository that owns all data access and talks to Postgres. The router uses Pydantic schemas; the service raises domain exceptions rather than HTTP errors. Dependencies point in one direction only, from router to service to repository, so each layer is independently replaceable and testable.

The service raises domain exceptions, not HTTPException — the service does not know HTTP exists, which is precisely what lets you reuse it from a command-line tool, a queue worker, or a different transport later. The repository pattern earns its keep four ways: testability (services test against an in-memory fake, cutting suite time from minutes to seconds), a single query-review surface (all SQL in one layer), substitutability (swap a cache or a read-replica behind one interface), and N+1 discipline (eager-loading decisions made once, in the layer that owns them). Skip it honestly only when the app is a small CRUD service whose business logic is the CRUD.
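The layering can be sketched in plain Python. All names below (`TaskNotFound`, `InMemoryTaskRepo`, `TaskService`, `complete_task_route`) are hypothetical; the route function returns a status code and body as a tuple only to keep the example free of framework imports.

```python
from dataclasses import dataclass
from typing import Optional

class TaskNotFound(Exception):
    """Domain exception: it knows nothing about HTTP."""

@dataclass
class Task:
    id: int
    title: str
    done: bool = False

class InMemoryTaskRepo:
    """Fake repository for service tests; a real one would wrap SQL."""

    def __init__(self) -> None:
        self._rows: dict[int, Task] = {}

    def add(self, task: Task) -> None:
        self._rows[task.id] = task

    def get(self, task_id: int) -> Optional[Task]:
        return self._rows.get(task_id)

class TaskService:
    """Business rules only: no HTTP, no SQL."""

    def __init__(self, repo: InMemoryTaskRepo) -> None:
        self._repo = repo

    def complete(self, task_id: int) -> Task:
        task = self._repo.get(task_id)
        if task is None:
            raise TaskNotFound(task_id)
        task.done = True
        return task

def complete_task_route(service: TaskService, task_id: int) -> tuple[int, dict]:
    """Router-shaped function: maps domain outcomes to status codes."""
    try:
        task = service.complete(task_id)
    except TaskNotFound:
        return 404, {"detail": "Task not found"}
    return 200, {"id": task.id, "title": task.title, "done": task.done}
```

Because the service only sees the repository's `get` method, swapping the in-memory fake for a database-backed class requires no change to the service or the route.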

How do databases break under load, and how do you size the connection pool?

Databases break in three predictable ways under load: the N+1 query, an exhausted connection pool, and an unsafe migration. APIs are stateless, so their value lives in the database, and these three failures are what take a passing dev build down in production. The session-per-request yield dependency from the previous section is the foundation; the rest is sizing and discipline.

SQLAlchemy has two central objects: the engine (one per process, owning the connection pool) and the session (one per request, an identity-mapped unit of work). Use SQLite for fast local tests and PostgreSQL for production, but run integration tests against real Postgres, because SQLite is loosely typed, does not enforce foreign keys unless you enable them, and serializes writers. That behavioral drift masks bugs.

The single most common ORM performance failure is the N+1 query. The default relationship() loading strategy is lazy: accessing task.owner triggers a separate SELECT at access time. Render a list of 100 tasks with their owners and you execute 1 + 100 queries. It ships fine on a 10-row dev database and melts at 10,000 rows. The fix is explicit eager loading, and the choice between strategies is a real trade-off:

from sqlalchemy import select
from sqlalchemy.orm import selectinload, joinedload

# selectinload: 2 queries total (tasks, then owners via WHERE id IN (...))
# best for to-many collections — no row multiplication
stmt = select(Task).options(selectinload(Task.owner)).limit(100)

# joinedload: 1 query with a LEFT JOIN
# best for to-one references when you need a single round trip
stmt = select(Task).options(joinedload(Task.owner)).limit(100)

Use selectinload for collections (it avoids row multiplication) and joinedload for to-one references (a single round trip, but it multiplies rows on to-many relationships and inflates transfer). Detect N+1 by logging query counts per request in development and failing any test that exceeds a budget — for example, assert that a list endpoint runs five queries or fewer. In async SQLAlchemy, lazy loading raises an error rather than silently issuing I/O, which usefully converts a silent performance bug into a loud one.

flowchart TD
  N[List 100 tasks] --> NA[N+1: 1 query<br/>for tasks]
  NA --> NB[then 100 queries<br/>one per owner]
  N --> EA[Eager: 1 query<br/>for tasks]
  EA --> EB[1 more query<br/>owners IN list]

Figure 7: N+1 versus eager loading. The naive path fetches 100 tasks in one query, then lazily fires one more query per task to load its owner — 101 queries that scale with row count. Eager loading with selectinload fetches the tasks, then loads every owner in a single second query keyed by an IN list. Two queries instead of 101.
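The query-count difference can be measured directly. The sketch below swaps the ORM for raw sqlite3 so that it is self-contained: `list_lazy` imitates what lazy loading does, `list_eager` imitates the IN-list strategy, and `count_selects` is a hypothetical helper built on sqlite3's trace callback. It illustrates the query shapes, not SQLAlchemy's internals.

```python
import sqlite3

def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, owner_id INTEGER);"
    )
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                     [(i, f"user{i}") for i in range(1, 11)])
    conn.executemany("INSERT INTO tasks (title, owner_id) VALUES (?, ?)",
                     [(f"task{i}", i % 10 + 1) for i in range(100)])
    conn.commit()
    return conn

def count_selects(conn: sqlite3.Connection, fn) -> int:
    """Run fn(conn) and count the SELECT statements it executed."""
    statements: list[str] = []
    conn.set_trace_callback(statements.append)
    try:
        fn(conn)
    finally:
        conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))

def list_lazy(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    tasks = conn.execute("SELECT id, title, owner_id FROM tasks").fetchall()
    result = []
    for _, title, owner_id in tasks:
        # One extra query per task: this is the N in N+1.
        owner = conn.execute("SELECT name FROM users WHERE id = ?", (owner_id,)).fetchone()
        result.append((title, owner[0]))
    return result

def list_eager(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    tasks = conn.execute("SELECT id, title, owner_id FROM tasks").fetchall()
    owner_ids = sorted({owner_id for _, _, owner_id in tasks})
    marks = ",".join("?" * len(owner_ids))
    owners = dict(conn.execute(f"SELECT id, name FROM users WHERE id IN ({marks})", owner_ids))
    return [(title, owners[owner_id]) for _, title, owner_id in tasks]
```

A helper like `count_selects` is also the basis for the query-budget assertion described above: wrap the call under test and fail when the count exceeds the budget.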

Connection pool sizing is where databases fall over by configuration. By Little's Law, the connections a service needs is roughly arrival_rate × mean_db_holding_time. At 200 req/s with 25 ms of database time per request, that is about 5 connections — pools are smaller than intuition suggests. But the global constraint is the trap: total demanded connections is app_processes × (pool_size + max_overflow), and that must stay below Postgres's max_connections minus a reserve.

Variable	Example value
App topology	10 pods × 4 workers = 40 processes
Per-process pool	`pool_size=10` + `max_overflow=5` = 15
Total demanded	40 × 15 = 600 connections
Postgres default	`max_connections=100`

Six hundred demanded connections against a default of 100 is an outage by configuration. Each Postgres connection also costs roughly 5–10 MB of server RAM, so 600 connections is 3–6 GB before any query runs. That is why PgBouncer, which multiplexes thousands of client connections onto tens of server ones, appears in every serious Postgres deployment. Set pool_pre_ping=True so the pool detects connections a cloud network silently dropped, and set pool_recycle shorter than any idle timeout on the path.
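Both calculations are short enough to keep in a capacity-planning script. The sketch below reproduces the Little's Law estimate and the worst-case demand from the table; the function names, the reserve of 10 connections, and the list of candidate server limits are assumptions for illustration.

```python
import math

def connections_needed(arrival_rps: float, db_seconds_per_request: float) -> int:
    """Little's Law: average busy connections = arrival rate x holding time."""
    return math.ceil(round(arrival_rps * db_seconds_per_request, 6))

def total_demanded(pods: int, workers_per_pod: int, pool_size: int, max_overflow: int) -> int:
    """Worst case: every process opens its full pool plus overflow."""
    return pods * workers_per_pod * (pool_size + max_overflow)

def fits(demanded: int, max_connections: int, reserve: int = 10) -> bool:
    """True when worst-case demand still leaves `reserve` connections free."""
    return demanded <= max_connections - reserve

steady_state = connections_needed(arrival_rps=200, db_seconds_per_request=0.025)
worst_case = total_demanded(pods=10, workers_per_pod=4, pool_size=10, max_overflow=5)

# Which candidate server limits could absorb the worst case?
workable_limits = [limit for limit in (100, 200, 500, 1000) if fits(worst_case, limit)]
```

The gap between the steady-state figure and the worst-case figure is the argument for smaller per-process pools or a pooler in front of the database.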

flowchart TD
  APP[FastAPI app<br/>session per request] --> POOL[Connection pool<br/>engine]
  POOL --> C1[conn in use]
  POOL --> C2[conn in use]
  POOL --> C3[conn idle]
  POOL --> PG[(PostgreSQL<br/>max_connections)]
  AL[Alembic<br/>migrations] -.DDL at deploy.-> PG

Figure 8: The persistence topology. Each request borrows a pooled connection from the engine, which holds a bounded set of connections to Postgres; the pool size times the number of app processes must stay under the database's max_connections. Schema changes flow through Alembic migrations at deploy time, never through the application at request time.

Schemas change only through versioned, reviewed migrations. Alembic autogenerates diffs between your models and the live schema, but autogenerate alone is unsafe: it cannot see a column rename — it sees a drop plus an add, which destroys data. For zero-downtime deploys, write migrations by hand in the expand/contract pattern: add a nullable column, deploy code that writes both old and new, backfill, add the constraint, then remove the old column in a later release. Always review autogenerated migrations, and test the downgrade path before you need it at 3 a.m.

How do you secure an AI API without leaving the OWASP #1 hole?

Security is a property of every endpoint, not a feature you add. Authentication establishes who you are; authorization decides what you may do. The status codes encode the split: 401 means "I don't know who you are," 403 means "I know who you are, and no." Most breaches are authorization failures, not authentication failures.

Keep the two phases architecturally separate: one dependency resolves identity, separate dependencies enforce permission. The most common real breach is a logged-in user reading someone else's object.

Passwords are never stored, encrypted, or logged — only hashed with an algorithm designed to be slow, such as bcrypt or argon2. Slowness is the feature: at roughly 100–300 ms per hash, offline cracking of a stolen database becomes economically painful. SHA-256 is the wrong choice precisely because it is fast — billions of hashes per second on a GPU — so a stolen table falls to brute force regardless of salting. Hash in a threadpool, because it is CPU-bound and would otherwise block the event loop.
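A minimal sketch of slow hashing off the event loop follows. It swaps in scrypt from the standard library as a stand-in for bcrypt or argon2 so that it runs without extra packages; the cost parameters and function names are assumptions, and a production service would normally use a maintained password-hashing library.

```python
import asyncio
import hashlib
import hmac
import os

# Assumed cost parameters; tune them for your hardware and latency budget.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1}
SALT_BYTES = 16

def hash_password(password: str) -> bytes:
    salt = os.urandom(SALT_BYTES)
    digest = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return salt + digest                      # store the salt next to the digest

def verify_password(password: str, stored: bytes) -> bool:
    salt, expected = stored[:SALT_BYTES], stored[SALT_BYTES:]
    candidate = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)   # constant-time comparison

async def verify_password_async(password: str, stored: bytes) -> bool:
    # Run the CPU-bound hash in a worker thread so the event loop keeps serving.
    return await asyncio.to_thread(verify_password, password, stored)
```

An async login handler would await `verify_password_async`; a plain def handler can call `verify_password` directly, since FastAPI already runs such handlers in a threadpool.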

JWT access tokens make stateless authorization scale: a signed (not encrypted) header.payload.signature triplet that any worker can verify without a database lookup. The payload is readable by anyone, so never put secrets in it. The verification rule that matters most is pinning the algorithm. RFC 8725, the JWT Best Current Practices, is explicit: "Libraries MUST enable the caller to specify a supported set of algorithms and MUST NOT use any other algorithms when performing cryptographic operations."³

import jwt  # PyJWT
from fastapi import Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
from typing import Annotated

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/api/v1/auth/token")

def get_current_user(token: Annotated[str, Depends(oauth2_scheme)], db=Depends(get_db)):
    try:
        payload = jwt.decode(
            token,
            settings.jwt_secret,
            algorithms=["HS256"],          # pin it: never trust the token's own header
            issuer="api.example.com",
            audience="example-clients",     # validate iss and aud, not just the signature
        )
        user_id = int(payload["sub"])       # a missing or non-numeric sub is a bad token too
    except (jwt.InvalidTokenError, KeyError, ValueError):
        raise HTTPException(401, "Invalid or expired token",
                            headers={"WWW-Authenticate": "Bearer"})
    user = db.get(User, user_id)
    if user is None or not user.is_active:
        raise HTTPException(401, "User inactive or deleted")
    return user

CurrentUser = Annotated[User, Depends(get_current_user)]

The alg=none attack is why this is non-negotiable. The JWT header declares which algorithm signed it, and the header is attacker-controlled. Historic libraries that honored "alg": "none" accepted unsigned tokens that verified successfully — an attacker mints any identity. A second variant: against an RS256 service, declare HS256 and sign with the public key as the HMAC secret. The defense is to pass an explicit allowlist to decode() and never derive the algorithm from the token. It is one instance of a general rule: never let untrusted input select the security mechanism that validates it.
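To make the attack concrete, the sketch below hand-rolls HS256 signing and verification with the standard library and shows an unsigned token being rejected by the allowlist. It is for illustration only: the helper names are hypothetical, it skips exp, iss, and aud checks, and a real service should rely on a maintained JWT library rather than code like this.

```python
import base64
import hashlib
import hmac
import json

ALLOWED_ALGS = {"HS256"}   # the allowlist lives in server code, not in the token

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_hs256(payload: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(sig)}"

def verify(token: str, secret: bytes) -> dict:
    header_b64, body_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") not in ALLOWED_ALGS:
        # Rejects alg=none and any algorithm the server did not choose.
        raise ValueError("algorithm not allowed")
    expected = hmac.new(secret, f"{header_b64}.{body_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(body_b64))

def forge_unsigned(payload: dict) -> str:
    """What an attacker sends: the header claims alg=none and the signature is empty."""
    header = _b64url(json.dumps({"alg": "none"}).encode())
    return f"{header}.{_b64url(json.dumps(payload).encode())}."
```

The header is read only to check it against the allowlist; the verification algorithm itself is fixed in code, so nothing in the token can select it.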

flowchart TD
  L[Client login<br/>email + password] --> H{Verify<br/>hash}
  H -->|ok| T[Issue JWT<br/>signed token]
  H -->|fail| E[401 generic<br/>no enumeration]
  T --> R[Request with<br/>Bearer token]
  R --> D[get_current_user<br/>pin algorithm]
  D --> U[Authorized<br/>handler]

Figure 9: JWT login and the current-user dependency. A login verifies the password hash and issues a signed token; both failure cases, an unknown email and a wrong password, return the same generic 401 to avoid leaking which accounts exist. On later requests, the get_current_user dependency decodes the token with a pinned algorithm before any handler runs.

The OWASP API Security Top 10's perennial #1 is BOLA — Broken Object-Level Authorization. OWASP defines it: "Object level authorization is an access control mechanism that is usually implemented at the code level to validate that a user can only access the objects that they should have permissions to access."⁴ A GET /orders/{id} that checks you are logged in but not that the order is yours lets an attacker iterate IDs and read everyone's data. OWASP's requirement is blunt: "Every API endpoint that receives an ID of an object, and performs any action on the object, should implement object-level authorization checks."⁴

Make BOLA structurally impossible rather than reviewer-dependent: scope queries by the caller's identity in the repository (WHERE owner_id = :uid) so unauthorized rows are never loaded; use a base repository that injects the owner filter automatically; and return 404 rather than 403 for objects the caller does not own, so you do not confirm their existence. The 2022 Optus breach is the cautionary case — an unauthenticated, internet-reachable API with sequential customer identifiers let attackers iterate IDs and harvest roughly 9.8 million customers' records.⁵ Two other perimeter facts: CORS is enforced by browsers and protects browser users only — it does nothing against curl and is not an access-control mechanism. And a login that returns "email not found" versus "wrong password" enables account enumeration; return one generic message and equalize response time.
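Owner scoping in the repository can be sketched with sqlite3. The `OrderRepo` class, the `orders` table, and the status helper below are hypothetical; the point is that the only read method requires the caller's id, so an unscoped lookup cannot be written by accident.

```python
import sqlite3
from typing import Optional

class OrderRepo:
    """Every read takes the caller's id; there is no unscoped get()."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_for_owner(self, order_id: int, owner_id: int) -> Optional[tuple]:
        return self._conn.execute(
            "SELECT id, owner_id, total FROM orders WHERE id = ? AND owner_id = ?",
            (order_id, owner_id),
        ).fetchone()

def get_order_status(repo: OrderRepo, order_id: int, caller_id: int) -> int:
    """Router-shaped helper: 404 covers both 'missing' and 'not yours'."""
    return 200 if repo.get_for_owner(order_id, caller_id) is not None else 404

def demo_repo() -> OrderRepo:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, owner_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 10, 25.0), (2, 20, 40.0)])
    return OrderRepo(conn)
```

Because the owner filter is part of the query, a row belonging to someone else is never loaded into application memory, and the caller cannot distinguish it from a row that does not exist.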

The economics justify the slowness. IBM's 2024 Cost of a Data Breach report put the global average breach at USD 4.88 million, up 10% year over year — the largest jump since the pandemic.⁶ Against that, the engineering cost of pinning an algorithm, scoping a query, and rate-limiting a login endpoint is trivial. Secrets never live in code or images: inject them from a secrets manager via environment, load them through a settings layer, rotate them, and scan repositories for leaks in CI. A leaked signing secret means an attacker mints arbitrary identities — treat it like a root password.

How do you test an API so refactors do not ship incidents?

Test the contract, not just the logic. An API's regression surface is its URLs, schemas, status codes, error envelopes, and authorization rules — so a refactor that keeps unit tests green but renames a response field is still an incident for every client. Assert at two levels: unit tests for business logic in isolation, integration tests for the assembled stack.

The classic pyramid applies — many unit tests, fewer integration tests, a handful of end-to-end smoke tests — and the layered architecture is what makes the split cheap.

FastAPI's TestClient runs your ASGI app in-process — no server, no network, no ports — while exercising the full middleware, routing, and validation stack. The architectural payoff from dependency injection shows up here as one line. Because every endpoint obtains its session through Depends, a single override redirects the entire application to a test database, with no monkeypatching and no environment tricks:

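import pytest
from fastapi.testclient import TestClient
from app.main import app
from app.api.deps import get_db, get_current_user   # get_current_user assumed to live beside get_db

# `db` is a pytest fixture yielding a test-database session; make_user and make_task
# are test factories. All three are assumed to be defined in the project's test helpers.

@pytest.fixture
def client(db):
    app.dependency_overrides[get_db] = lambda: db    # the one line: every endpoint now uses the test session
    yield TestClient(app)
    app.dependency_overrides.clear()                 # overrides are global state: reset after every test

def test_create_task_returns_201_and_public_schema(client):
    resp = client.post("/api/v1/tasks", json={"title": "write tests"})
    assert resp.status_code == 201
    body = resp.json()
    assert set(body) == {"id", "title", "description", "status"}
    assert "owner_id" not in body          # leak check: internal fields must never appear

def test_user_cannot_read_anothers_task(client, db):
    alice, bob = make_user(db, "alice"), make_user(db, "bob")
    task = make_task(db, owner=alice)
    app.dependency_overrides[get_current_user] = lambda: bob
    # 404, not 403: do not confirm the object exists
    assert client.get(f"/api/v1/tasks/{task.id}").status_code == 404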

Dependency overrides beat monkeypatching for three reasons. They operate at the declared interface — the same seam production wiring uses — so tests do not break when someone moves an import. They are scoped and discoverable, one dict cleared in fixture teardown. And they compose: override auth, the database, and an LLM client independently. The caveat is that overrides bypass the real dependency entirely, so its own validation logic goes untested — cover that separately, and always clear overrides between tests or state leaks.

flowchart TD
  T[pytest test<br/>client.post] --> APP[App in-process<br/>middleware + routing]
  APP --> VAL[Validation]
  VAL --> HND[Handler]
  APP -.override.-> DBT[SQLite in-memory<br/>get_db]
  HND -.override.-> STB[Stub services<br/>get_llm_client]

Figure 10: The test topology. TestClient drives the real application stack in-process through an ASGI call, so middleware, routing, and validation all run. Dependency overrides swap externalities for fast, deterministic fakes — an in-memory SQLite database for get_db and stub services for an LLM client — without any network calls.

Two assertion habits prevent the most common escapes. Assert on the 422 error body, not just the status code, because clients parse detail[].loc and detail[].type to map errors to form fields — a custom exception handler can change that structure while every status-code-only test stays green. And test that sensitive fields are absent with an exact-shape assertion (set(resp.json()) == {expected fields}), which fails on additions, unlike a "password" not in body check that only catches known names. Coverage measures execution, not verification: a test that calls an endpoint and asserts nothing scores the same coverage as one that pins the whole contract. Use coverage as a floor and a flashlight — 95% coverage still ships incidents when the assertions are weak, the happy path dominates, and the uncovered 5% is exactly the error handling where incidents live.
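Both habits fit in two small helpers. The sketch below is illustrative only: the helper names are made up, and the sample 422 body follows the `detail[].loc` / `detail[].type` layout described above rather than being captured from a real run.

```python
from __future__ import annotations


def assert_exact_shape(body: dict, expected: set[str]) -> None:
    """Fail on any missing OR unexpected top-level key."""
    actual = set(body)
    missing, extra = expected - actual, actual - expected
    assert not missing and not extra, f"missing={sorted(missing)} extra={sorted(extra)}"


def validation_errors(body: dict) -> set[tuple[tuple, str]]:
    """Reduce a 422 body to the (loc, type) pairs that clients key on."""
    return {(tuple(err["loc"]), err["type"]) for err in body["detail"]}


# A response that leaks a field under a name nobody thought to denylist.
leaky = {"id": 1, "title": "t", "description": None, "status": "open", "hashed_password": "x"}

# An illustrative 422 body for a request that omitted the required "title" field.
sample_422 = {
    "detail": [
        {"type": "missing", "loc": ["body", "title"], "msg": "Field required", "input": {}}
    ]
}
```

Here `"password" not in leaky` is true, so the denylist check passes, while `assert_exact_shape(leaky, {"id", "title", "description", "status"})` fails and names `hashed_password`. Comparing `validation_errors(resp.json())` against an expected set pins the error contract without coupling the test to message wording.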

What happens when a blocking call runs inside an async handler?

A blocking call inside an async def handler freezes every request in the process for its duration — the single deadliest FastAPI bug. An asyncio program runs one event loop on one thread; coroutines run until they await something, then yield control so the loop serves others. The contract has one clause: never block the loop.

A single synchronous time.sleep, a blocking requests.get, or a heavy CPU computation inside a coroutine stops the whole loop dead.

FastAPI's routing rule is the relief valve. async def handlers run on the loop and must await only non-blocking calls. Plain def handlers are dispatched to a threadpool, where blocking code is safe at thread cost. The bug is hard to catch because it passes tests: single requests look fine, then under load p99 latency explodes for all endpoints at once. Detect it with a loop-lag monitor, a watchdog coroutine that sleeps a fixed interval and alerts when its actual wake-up time drifts.
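A loop-lag monitor is only a few lines of asyncio. The version below is a sketch under assumptions: the interval, the threshold, and the `on_lag` callback are illustrative choices, and in a real service the callback would more likely update a metric than append to a list.

```python
import asyncio
import time
from typing import Callable


async def monitor_loop_lag(
    interval: float,
    threshold: float,
    on_lag: Callable[[float], None],
) -> None:
    """Sleep a fixed interval forever and report how late each wake-up was."""
    loop = asyncio.get_running_loop()
    while True:
        expected = loop.time() + interval
        await asyncio.sleep(interval)
        lag = loop.time() - expected  # near zero when healthy; grows when the loop is blocked
        if lag > threshold:
            on_lag(lag)


async def demo() -> list[float]:
    """Start the monitor, then block the loop on purpose to show the detection."""
    lags: list[float] = []
    task = asyncio.create_task(monitor_loop_lag(0.01, 0.05, lags.append))
    await asyncio.sleep(0.03)  # healthy period: the monitor wakes on time
    time.sleep(0.2)            # the bug: a blocking call inside a coroutine
    await asyncio.sleep(0.03)  # yield so the monitor can wake up and notice
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return lags
```

Running `asyncio.run(demo())` should return at least one lag reading close to the length of the blocking call. Started from the lifespan handler, the same coroutine gives you the loop-lag gauge for the whole process.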

Workload	Correct form
Async-capable I/O (httpx, async DB driver)	`async def` + `await`
Blocking-only library (classic ORM, some SDKs)	plain `def` (runs in threadpool)
Light CPU (a few ms)	either
Heavy CPU (inference, parsing, crypto)	plain `def`, or offload to a process pool / worker queue

flowchart TD
  R[Request enters<br/>handler] --> K{Handler<br/>type?}
  K -->|async def| L[Runs on<br/>event loop]
  K -->|plain def| T[Runs in<br/>threadpool]
  L --> A{await I/O?}
  A -->|yes| Y[Yields loop<br/>serves others]
  A -->|blocking call| B[Stalls loop<br/>all requests freeze]

Figure 11: The async decision. A request to an async def handler runs on the single event loop; if it awaits non-blocking I/O it yields control so the loop serves other requests, but a blocking call inside it stalls the loop and freezes every request. A plain def handler runs in a threadpool, where blocking code is safe. Match the handler type to the workload.

External calls deserve three non-negotiables. Use a shared client for connection pooling — create one httpx.AsyncClient in the lifespan handler and reuse it, never one per request. Set explicit timeouts, because the default "wait forever" is an outage. And bound retries with jitter, only on idempotent operations:

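import asyncio, random, httpx
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException

UPSTREAM_URL = "https://scoring.internal.example"   # placeholder base URL for the upstream service

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.http = httpx.AsyncClient(
        base_url=UPSTREAM_URL,                       # needed so relative paths like "/v1/score" resolve
        timeout=httpx.Timeout(connect=2.0, read=5.0, write=5.0, pool=2.0),
        limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
    )
    yield
    await app.state.http.aclose()

app = FastAPI(lifespan=lifespan)

def is_transient(exc: httpx.HTTPError) -> bool:
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code >= 500       # retry 5xx, never 4xx
    return isinstance(exc, httpx.TransportError)     # timeouts and connection errors

async def call_upstream(client: httpx.AsyncClient, payload: dict, retries: int = 3) -> dict:
    # Assumes the upstream treats this POST as idempotent (a pure scoring call).
    for attempt in range(retries):
        try:
            r = await client.post("/v1/score", json=payload)
            r.raise_for_status()
            return r.json()
        except httpx.HTTPError as exc:
            if not is_transient(exc) or attempt == retries - 1:
                raise HTTPException(502, "Upstream request failed") from exc
            await asyncio.sleep(random.uniform(0, 0.2 * 2 ** attempt))  # backoff + jitter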

Retry only transient classes (timeouts, 5xx, connection errors), never 4xx; cap total attempt time below your caller's timeout; and add jitter, because synchronized retries from many workers are a self-inflicted denial of service — the thundering herd. For repeated upstream failure, a circuit breaker stops calling a dead dependency entirely, failing fast until probes succeed. Naive retries against a degraded service amplify load and turn a slowdown into an outage.
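A circuit breaker can be sketched as a small state holder wrapped around the upstream call. The version below is deliberately minimal and makes assumptions: the threshold and cool-down values are arbitrary, it is not safe across processes, and its half-open state lets every caller through after the cool-down rather than a single probe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency that is currently considered dead."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can control time
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def before_call(self) -> None:
        """Call before each upstream request; raises while the circuit is open."""
        if self._opened_at is None:
            return  # closed: calls flow normally
        if self._clock() - self._opened_at < self.reset_after:
            raise CircuitOpenError("upstream marked unavailable; failing fast")
        # Cool-down elapsed: half-open, so this call goes through as a probe.

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # a successful probe closes the circuit

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) and restart the cool-down
```

A caller would invoke `before_call()` ahead of the request, then `record_success()` or `record_failure()` depending on the outcome, and translate `CircuitOpenError` into an immediate 503 or a fallback response.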

Deferred work splits into two tiers. BackgroundTasks runs callables after the response is sent, in the same process — fine for fire-and-forget side effects (a welcome email, an audit log) where loss on a process crash is acceptable. For anything that must survive — long jobs, retries, scheduled work, work that outlives a deploy — use a real queue such as Celery with Redis or RabbitMQ as broker. The API enqueues a job, returns 202 Accepted with a job ID, and workers process independently; clients poll a status endpoint or receive a webhook. The rule: if losing the task on a pod restart is a bug, it does not belong in BackgroundTasks.

flowchart TD
  C[Client] --> API[FastAPI<br/>POST /jobs]
  API --> BG[BackgroundTasks<br/>same process]
  API --> Q[Queue broker<br/>Redis / RabbitMQ]
  Q --> W1[Celery worker]
  Q --> W2[Celery worker]
  W1 --> RS[(Result store)]
  C --> ST[GET /jobs/id<br/>poll status]
  RS -.-> ST

Figure 12: Two tiers of deferred work. Fire-and-forget side effects run in BackgroundTasks in the same process, where a crash loses them. Work that must survive goes to a queue broker; the API returns a job ID immediately, independent Celery workers process it and write to a result store, and the client polls a status endpoint. Durable jobs never live in the same process as the request.

Streaming and uploads have their own rules. StreamingResponse wraps an async generator so bytes flow as produced (the foundation for CSV exports, server-sent events, and LLM token streams) instead of buffering the full response in memory. File uploads stream to a spooled temp file rather than loading into memory. Cap sizes at the proxy and again in the handler by counting bytes as they arrive, because a client can lie about Content-Length; validate content type by sniffing magic bytes, not by trusting the client's Content-Type header. Backpressure matters: unbounded concurrency lets a few slow consumers or a flood of uploads exhaust memory, so bound in-flight work with a semaphore.
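The upload rules reduce to two checks that need no framework: count bytes as they arrive, and look at the first bytes of the file instead of a client-supplied header. The sketch below is framework-agnostic and illustrative; the signature table covers only three formats, and the function names and the 1 MiB spool size are assumptions.

```python
import tempfile
from typing import BinaryIO, Iterable, Optional

# Leading byte signatures for a few common formats (deliberately not exhaustive).
MAGIC_PREFIXES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"%PDF-": "application/pdf",
}


class UploadTooLarge(ValueError):
    """The upload crossed the size cap while streaming."""


def sniff_content_type(head: bytes) -> Optional[str]:
    """Return a MIME type based on the file's first bytes, or None if unrecognized."""
    for prefix, mime in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return mime
    return None


def copy_capped(chunks: Iterable[bytes], dest: BinaryIO, max_bytes: int) -> int:
    """Write chunks to dest, aborting as soon as the running total exceeds the cap."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        dest.write(chunk)
    return total


def store_upload(chunks: Iterable[bytes], max_bytes: int):
    """Spool an upload to a temp file and return (sniffed MIME type, open file)."""
    dest = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)  # spills to disk past 1 MiB
    try:
        copy_capped(chunks, dest, max_bytes)
        dest.seek(0)
        mime = sniff_content_type(dest.read(16))
        if mime is None:
            raise ValueError("unrecognized file type")
        dest.seek(0)
        return mime, dest
    except Exception:
        dest.close()
        raise
```

In a handler, `chunks` would be whatever iterator the framework exposes for the request body or uploaded file; the cap is enforced regardless of what the request headers claimed.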

How do you serve a model, an LLM, and a RAG pipeline in production?

An ML inference service is an ordinary FastAPI app with three unusual properties: a heavy startup artifact, CPU- or GPU-bound hot paths, and probabilistic outputs. The first rule follows directly: load models once, in the lifespan handler, and serve from memory — never per request, never at import time.

Each property bends the patterns above. Model weights take seconds to minutes to load; inference is CPU- or GPU-bound, so everything the async section said applies doubly; and correctness is a distribution, so monitoring must track quality, not just errors.

from contextlib import asynccontextmanager
from anyio import to_thread
from fastapi import FastAPI
from pydantic import BaseModel, Field

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.model = load_model("models/sentiment-v3.onnx")   # once, before traffic
    app.state.model_version = "sentiment-v3.2.1"
    yield
    app.state.model = None                                     # release on shutdown

app = FastAPI(lifespan=lifespan, title="Inference API")

class PredictIn(BaseModel):
    text: str = Field(min_length=1, max_length=4_000)

@app.post("/v1/predict")
async def predict(req: PredictIn):
    # CPU-bound: run in a threadpool, never on the event loop
    label, conf = await to_thread.run_sync(app.state.model.predict, req.text)
    return {"label": label, "confidence": conf, "model_version": app.state.model_version}

Per-request loading is catastrophic — seconds of disk or GPU transfer per call, memory churn, and concurrent requests each loading copies until the process is killed for using too much memory. Import-time loading works but couples loading to module import, so tests, linters, and tooling pay the cost or crash without a GPU, and there is no clean shutdown hook. Lifespan loading is correct: it runs once per worker before traffic and plays well with readiness probes — the pod reports ready only after the model is warm, which is critical for rolling deploys. With N workers you hold N model copies, so size worker count by memory, or put a dedicated inference server behind FastAPI when models are large.

Echoing the model version in every response is non-negotiable. When quality regresses, the first question is always "which model produced this?" — and during canary rollouts, multiple versions serve simultaneously. The same logic applies to embeddings: vectors from different models are incomparable, so an embedding API must return its model version, and upgrading the embedding model means re-embedding the entire corpus.

Wrapping an LLM in your own FastAPI service — an LLM gateway — is the pattern that pays for itself: one place for authentication, prompt templates, guardrails, caching, cost metering, provider failover, and model swaps without touching clients. Streaming is what makes the user experience tolerable: first tokens arrive in hundreds of milliseconds instead of the full completion's tens of seconds. Server-sent events over StreamingResponse carry the tokens:

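from fastapi import Depends
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

# Assumed to be defined elsewhere in the app: `app`, `CurrentUser` (an auth dependency type),
# and the `guardrails`, `llm`, and `metering` service objects.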

class ChatIn(BaseModel):
    message: str = Field(min_length=1, max_length=8_000)
    max_tokens: int = Field(default=512, le=2_048)   # a Pydantic constraint is a spending limit

@app.post("/v1/chat/stream")
async def chat_stream(req: ChatIn, user: CurrentUser):
    await guardrails.check_input(req.message)        # pre-flight
    async def tokens():
        usage = {"in": 0, "out": 0}
        try:
            async for chunk in llm.stream(req.message, max_tokens=req.max_tokens):
                usage["out"] += 1
                yield f"data: {chunk.json()}\n\n"    # SSE frame
            yield "data: [DONE]\n\n"
        finally:
            await metering.record(user.id, usage)    # runs even if the client disconnects
    return StreamingResponse(tokens(), media_type="text/event-stream")

Two details there are load-bearing. Clients disconnect mid-stream constantly, so cost metering must live in a finally block to survive that. And max_tokens validated with a ceiling is literally a spending limit — a Pydantic constraint as a cost control. At the infrastructure layer, disable proxy buffering, raise idle timeouts on the load balancer, and monitor time-to-first-token separately from total latency, because it is the real UX metric.
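The claim that a finally block survives an early exit can be checked without a server. The sketch below simulates it with plain asyncio: `fake_llm_stream` is a hypothetical stand-in for a provider SDK, the list `recorded` stands in for a metering call, and closing the generator early approximates what happens when a client goes away mid-stream.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm_stream(n: int) -> AsyncIterator[str]:
    # Stand-in for a provider's streaming API (hypothetical).
    for i in range(n):
        await asyncio.sleep(0)
        yield f"tok{i}"


async def sse_tokens(recorded: list[int]) -> AsyncIterator[str]:
    sent = 0
    try:
        async for tok in fake_llm_stream(100):
            sent += 1
            yield f"data: {tok}\n\n"      # one SSE frame per token
        yield "data: [DONE]\n\n"
    finally:
        recorded.append(sent)             # metering: runs on completion and on early close


async def client_that_disconnects() -> tuple[list[str], list[int]]:
    recorded: list[int] = []
    frames: list[str] = []
    stream = sse_tokens(recorded)
    async for frame in stream:
        frames.append(frame)
        if len(frames) == 3:
            break                         # the consumer stops reading after three frames
    await stream.aclose()                 # tear the generator down early
    return frames, recorded
```

`asyncio.run(client_that_disconnects())` returns three frames and a single metering record of 3, even though the stream never reached its `[DONE]` frame. One caveat for real services: when the teardown is driven by task cancellation, cleanup that itself awaits may be interrupted, so it may need shielding or a handoff to a background task.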

Retrieval-augmented generation (RAG) grounds LLM answers in your documents. The ingestion pipeline chunks, embeds, and stores vectors; the query pipeline embeds the question, retrieves the top-k similar chunks, optionally reranks, assembles a context-stuffed prompt, generates, and returns the answer with citations. Guardrails sit at both ends of the LLM.

flowchart TD
  U[User query] --> IG[Input<br/>guardrails]
  IG --> EQ[Embed query]
  EQ --> VDB[(Vector DB<br/>pgvector)]
  VDB --> RR[Rerank<br/>top-k]
  RR --> GEN[LLM generate<br/>cite sources]
  GEN --> OV[Output validation<br/>verify citations]
  DOC[Documents] --> CE[Chunk + embed<br/>ingest]
  CE --> VDB

Figure 13: The RAG pipeline. An ingestion path chunks and embeds documents into a vector store; the query path applies input guardrails, embeds the question, retrieves and reranks the top-k chunks, generates an answer that cites its sources, and validates the output before returning it. The guardrails at both ends are what make a stochastic system safe to expose.

For vector storage, pgvector keeps vectors in Postgres — transactional consistency with your metadata, one database to operate, and filtered retrieval (WHERE team_id = ...) in the same SQL query, which is huge for access control. It comfortably serves up to millions of vectors. Dedicated engines (Qdrant, Weaviate, Pinecone, Milvus) add horizontal scale and faster approximate search at the cost of a second system. Start with pgvector and migrate behind the repository interface when scale demands; the embedding-model version problem dwarfs the storage choice, because changing models means re-embedding everything regardless. This is the layer where owning an auditable, inspectable pipeline pays off — the same argument behind building a personal RAG you can actually audit.

Guardrails treat the LLM as a powerful but untrusted data source behind a validation boundary. On the input side: length and token caps, prompt-injection screening, and topic policy checks. On the output side, the most useful pattern in LLM engineering is forcing a stochastic system through a typed contract:

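from decimal import Decimal
from typing import Literal
from fastapi import HTTPException
from pydantic import BaseModel, Field, ValidationError

# EXTRACT_PROMPT and the async `llm` client are assumed to be defined elsewhere in the app.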

class ExtractedInvoice(BaseModel):
    vendor: str
    total: Decimal = Field(ge=0)
    currency: Literal["USD", "EUR", "GBP"]

async def extract_invoice(text: str) -> ExtractedInvoice:
    prompt = EXTRACT_PROMPT.format(text=text)
    for attempt in range(2):
        raw = await llm.complete(prompt, json_mode=True)
        try:
            return ExtractedInvoice.model_validate_json(raw)
        except ValidationError as e:
            prompt = f"{prompt}\n\nYour previous output failed validation:\n{e}\nReturn corrected JSON only."
    raise HTTPException(502, "Model output failed validation")

Generate, validate with Pydantic, feed the error back, retry — this loop converts a stochastic text generator into a typed API component with an explicit failure mode. One retry fixes most cases; the second failure routes to a fallback (a human review queue, a cheaper deterministic extractor, or an explicit 502), never silently passing malformed data downstream.

The cost of skipping output validation is documented. In Moffatt v. Air Canada (2024), the British Columbia Civil Resolution Tribunal ordered Air Canada to honor a refund policy its website chatbot had invented, awarding the customer the roughly $483 refund the bot described; the tribunal rejected Air Canada's argument that the chatbot was a separate legal entity responsible for its own actions.⁷ The engineering translation: the failure was architectural, not model quality — there was no grounding in actual policy documents, no output validation requiring citations, and no fallback to a human or the real policy page. Your model's words are your company's words. Below a confidence threshold or on guardrail triggers, route outputs to a review API where a human is a component with an SLA, not an afterthought.

How do you deploy FastAPI so it survives contact with production?

Run one Uvicorn worker per core, plus supervision — and on Kubernetes, often one worker per pod. One Uvicorn process is one event loop on one core, so I/O-bound async apps use roughly one worker per core. Memory bounds it all: N workers means N model copies, which is the constraint that actually caps worker count for AI services.

Two supervision arrangements exist. uvicorn --workers N is built-in multi-process, fine on Kubernetes where the platform supervises pods; Gunicorn managing Uvicorn workers adds crash restarts, worker recycling after a set number of requests (a guard against slow leaks), and graceful timeouts, suiting bare VMs where nothing else supervises. One worker per pod keeps resource accounting clean and lets the scheduler do the supervision.

flowchart TD
  CL[Clients] --> IN[Ingress / LB]
  IN --> P1[API pod]
  IN --> P2[API pod]
  P1 --> PG[(PostgreSQL)]
  P2 --> RD[(Redis)]
  IN -.health checks.-> P1
  Q[Queue] --> WK[Workers]
  OBS[Observability<br/>logs metrics traces] -.-> P1

Figure 14: A production deployment topology. Clients reach replicated API pods through an ingress or load balancer; the pods share state in Postgres and Redis, deferred jobs run on separate workers, and an observability plane collects logs, metrics, and traces from everything. The ingress routes traffic to pods based on their health-probe results.

The probe distinction is a classic outage waiting to happen. A liveness probe answers "is this process broken beyond self-recovery?" — its remedy is restart, so it must check only process-local health, because restarting a pod does not fix a down database. A readiness probe answers "should this pod receive traffic right now?" — its remedy is removal from the load balancer, so it checks critical dependencies and warm-up state, including whether the model is loaded.

from fastapi import Depends, HTTPException
from sqlalchemy import text

# `app` and `get_db` are the application and session dependency defined earlier.

@app.get("/health/live")     # "is the process stuck?" -> restart me
def live():
    return {"status": "ok"}   # no dependencies: just confirm the process responds

@app.get("/health/ready")    # "can I serve?" -> route traffic to me
async def ready(db=Depends(get_db)):
    await db.execute(text("SELECT 1"))         # critical dependencies only; an exception here also fails the probe
    if app.state.model is None:
        raise HTTPException(503, "model not loaded")   # model warm?
    return {"status": "ready"}

Conflating them creates a documented cascade. Point both probes at an endpoint that checks the database; a 30-second database failover begins; every pod's probes fail simultaneously. Readiness failure correctly removes pods from the load balancer — but liveness failure restarts them all. Restarted pods come up cold, with model loading taking a minute or more, fail readiness during warm-up, and the autoscaler, seeing near-zero CPU because no traffic flows, scales down. The database recovers in 30 seconds; the service takes many minutes, because the platform spent the incident killing healthy-but-waiting processes. The fix: liveness checks process health only, readiness checks dependencies with a failure threshold tolerant of blips, and a startup probe covers long warm-ups so liveness does not kill a booting pod.

Caching is high-leverage and the hardest thing to get right. Cache-aside with a TTL — check the cache, on a miss compute and store — fits idempotent expensive reads, and LLM response caching keyed on normalized prompt plus model version is often the single largest cost saver. The two hard problems are honest TTL choice (staleness budget is a product decision) and invalidation. The stampede problem is concrete: a hot key expires, hundreds of concurrent requests all miss, all recompute the expensive value, latency spikes, and under load the recompute storm can take the backend down — the cache was load-bearing. Solutions are per-key locks (one request recomputes, others wait or serve stale), probabilistic early refresh, and jittered TTLs so cohorts do not expire together. Invalidation is harder than caching because it requires knowing every write path that affects a cached value; key-versioning (bump a version suffix on write) sidesteps deletion but moves the problem to version storage.
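The per-key lock and the jittered TTL combine naturally in one cache-aside helper. The sketch below is an in-process illustration under assumptions: it guards a single event loop only (a multi-pod deployment would need a shared lock, for example in Redis), the lock table is never pruned, and the class name and defaults are made up.

```python
import asyncio
import random
import time
from typing import Any, Awaitable, Callable, Optional


class StampedeSafeCache:
    def __init__(self, ttl: float, jitter: float = 0.1) -> None:
        self._ttl = ttl
        self._jitter = jitter  # fraction of the TTL to randomize by
        self._values: dict[str, tuple[float, Any]] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    def _fresh(self, key: str) -> Optional[tuple[float, Any]]:
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    async def get(self, key: str, compute: Callable[[], Awaitable[Any]]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:                       # one recompute per key at a time
            entry = self._fresh(key)           # re-check: a previous waiter may have filled it
            if entry is not None:
                return entry[1]
            value = await compute()
            ttl = self._ttl * (1 + random.uniform(-self._jitter, self._jitter))
            self._values[key] = (time.monotonic() + ttl, value)
            return value


async def demo() -> tuple[int, list[str]]:
    """Fifty concurrent misses on one hot key should trigger a single recompute."""
    calls = 0

    async def expensive() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return "answer"

    cache = StampedeSafeCache(ttl=60.0)
    results = await asyncio.gather(*(cache.get("hot", expensive) for _ in range(50)))
    return calls, list(results)
```

The second freshness check inside the lock is the part that matters: without it, every waiter would recompute in turn, which serializes the stampede instead of preventing it.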

Observability has three pillars answering three questions. Structured logs (JSON, one event per line) answer "what happened in this specific case?" Metrics answer "how is the system behaving in aggregate?" — the rate/errors/duration trio per endpoint, plus the resource gauges this guide has flagged: pool checkout wait, queue depth, loop lag, token costs. Distributed traces answer "where did this request spend its time across services?" — indispensable once a request touches gateway, API, database, and an LLM provider. Propagate the trace ID into every log line so the pillars cross-link.

Two disciplines separate a debuggable service from a noisy one. Structured logs must never carry secrets — no passwords, tokens, full authorization headers, raw bodies containing personal data, or full LLM prompt/response text without a retention policy. One leaked credential in logs has the blast radius of a breach, because logs are widely readable and long-retained; log a content hash and token counts instead. And alert on symptoms, not causes: paging on "CPU above 80%" is an anti-pattern. An SLO sets a user-facing target — say, 99.9% of requests succeed under 500 ms over 30 days — implying an error budget, and alerts fire on budget burn rate. CPU at 80% may be perfectly healthy utilization or already too late; it pages humans for conditions users do not feel and misses user pain with idle CPU (a blocked loop, an upstream failure). Every page should mean a user is being hurt and a human must act now.
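Burn rate is simple arithmetic once the SLO is written down. The helpers below are a sketch: the function names are made up, "bad" is assumed to mean any request that failed or missed the latency target, and the paging threshold you compare against is a policy choice, not something this code decides.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to be bad, e.g. 0.999 -> about 0.001."""
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, hours until a full window's budget is gone."""
    return float("inf") if rate <= 0 else window_days * 24 / rate
```

With a 99.9% target, 144 bad requests out of 10,000 in the last hour is a 1.44% error rate, a burn rate of roughly 14.4, and a 30-day budget gone in about 50 hours. A burn rate near 1 means the service is spending its budget exactly as planned, which is why it makes a better paging signal than any resource gauge.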

A diagnostic worth internalizing: when p99 latency triples but p50 is unchanged, a subset of requests hits a slow path while the median request is fine. Read it with the pillars in order. Metrics first — slice the duration histogram by endpoint, status, pod, and time: is the tail one endpoint, one bad pod, or a global periodic event like a cache-expiry stampede? Then traces — pull exemplar traces from the slow bucket and read where the time went: a database span (pool checkout wait? a query whose plan flipped?), an external call (retries? an upstream tail?), or a gap between spans (event-loop blocking). Then logs — correlate the slow request IDs: a shared user, large payloads, one heavy tenant. The method is histogram, then exemplar trace, then correlated logs — not guessing.
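The "metrics first" step amounts to computing percentiles per slice and seeing which slice owns the tail. The sketch below works on raw samples for clarity; a real metrics backend would do this over histogram buckets, and the sample data, field names, and nearest-rank percentile method are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Iterable, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_by(samples: Iterable[dict], key: str) -> dict:
    """Group latency samples by one dimension and report p50 and p99 per group."""
    groups: dict = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["ms"])
    return {
        name: {"p50": percentile(ms, 50), "p99": percentile(ms, 99)}
        for name, ms in groups.items()
    }


# Illustrative data: pod "a" is healthy, pod "b" has a slow path on 5% of its requests.
samples = (
    [{"pod": "a", "ms": 40}] * 100
    + [{"pod": "b", "ms": 40}] * 95
    + [{"pod": "b", "ms": 900}] * 5
)
```

Across all samples the p50 is 40 ms and the p99 is 900 ms, which is the "tail moved, median did not" signature. `tail_by(samples, "pod")` shows pod "a" with a p99 of 40 ms and pod "b" with a p99 of 900 ms, so the next step is pulling exemplar traces from pod "b" only.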

The production checklist ties it together. Before the first real user: every endpoint authenticated and authorized, deny by default, with rate limits on auth and expensive routes; secrets in a manager, never in code or images; configuration validated at boot, so a missing setting crashes the boot rather than the first request; migrations reviewed and reversible; pools sized by the arithmetic; timeouts on every external call; structured logs with request IDs; metrics, traces, and alerts on symptoms not causes; liveness, readiness, and startup probes correctly separated; graceful shutdown draining in-flight requests; a load test at the latency knee plus a soak test passed; backups tested rather than assumed; dependency and image scanning in CI; runbooks for the top failure modes; and a rollback path exercised at least once on purpose.

FastAPI is a small framework with one idea, and that idea — the typed signature is the contract — carries you from a first endpoint to a production AI system. The framework rarely fails. The architecture around it does, and only when you treat its constraints as suggestions. Build for the constraint, and the model behind the API becomes the easy part.

Frequently Asked Questions

Why does FastAPI need a separate server like Uvicorn? FastAPI is an ASGI application — a callable that consumes ASGI events. It cannot open sockets, parse raw HTTP, or manage the event loop; that is the job of an ASGI server. Uvicorn listens on the network, parses HTTP, translates each connection into ASGI events, and invokes your app. The split lets servers and frameworks evolve independently.

When should I use async def vs def in FastAPI? Use async def with await for non-blocking I/O (an async HTTP client, an async database driver). Use plain def for blocking libraries — FastAPI runs those in a threadpool so they cannot stall the event loop. The deadliest mistake is an async def handler making a blocking call inside it: it freezes every request in the process under load.

How do I stop FastAPI returning the password hash in the response? Declare a separate response model that does not include the sensitive field, and set it as the endpoint's response_model. FastAPI filters whatever object you return through that model as a whitelist, so undeclared fields cannot be serialized. Do not rely on remembering to exclude fields at each call site — make leak-prevention structural with a dedicated output schema.

What is the N+1 query problem and how do I fix it? N+1 is fetching N parent rows, then lazily firing one child query per row — 1+N round trips that scale with data volume and melt at scale. Fix it with explicit eager loading: selectinload for to-many collections (a second WHERE id IN query) and joinedload for to-one references (a single join). Detect it by failing tests that exceed a per-request query budget.

How do I stream LLM tokens from FastAPI? Return a StreamingResponse wrapping an async generator that yields server-sent-event frames (data: ...\n\n) as tokens arrive, with a terminal [DONE] frame. Put cost metering and logging in a finally block, because clients disconnect mid-stream constantly. At the infrastructure layer, disable proxy buffering and monitor time-to-first-token separately from total latency.

What's the difference between a liveness and a readiness probe? Liveness answers "is this process broken beyond self-recovery?" — its remedy is a restart, so it must check only process-local health, never the database. Readiness answers "should this pod receive traffic now?" — its remedy is removal from the load balancer, so it checks critical dependencies and warm-up state. Conflating them turns a brief dependency blip into a mass-restart cascade.

Is FastAPI good for serving machine learning models? Yes — it is the de facto serving layer for inference and LLM endpoints. Models live in Python, so serving them in-process avoids a serialization boundary; async I/O multiplexes the multi-second waits of hosted LLM calls; and streaming over ASGI enables token-by-token responses. Load the model once in the lifespan handler and offload CPU-bound inference off the event loop.

How do I load a model once instead of on every request? Load it in the lifespan handler and store it on app.state, then read it in handlers. Per-request loading causes seconds of transfer per call and memory exhaustion; import-time loading couples loading to module import and breaks tooling. Lifespan loading runs once per worker before traffic and lets a readiness probe report ready only after the model is warm — critical for rolling deploys.

References

1. "FastAPI," FastAPI documentation, Sebastián Ramírez, https://fastapi.tiangolo.com/, accessed 2026-06-20. Repo anchors re-derived via curl -s https://api.github.com/repos/fastapi/fastapi => 99,425 stars, license MIT, language Python, default branch master (as of 2026-06-20).
2. "Pydantic," Pydantic documentation, https://pydantic.dev/docs/validation/latest/get-started/, accessed 2026-06-20.
3. Y. Sheffer, D. Hardt, M. Jones, "JSON Web Token Best Current Practices," RFC 8725 (BCP 225), IETF, February 2020, https://www.rfc-editor.org/rfc/rfc8725.
4. "API1:2023 Broken Object Level Authorization," OWASP API Security Top 10, 2023 edition, https://owasp.org/API-Security/editions/2023/en/0xa1-broken-object-level-authorization/, accessed 2026-06-20.
5. "2022 Optus data breach," Wikipedia, https://en.wikipedia.org/wiki/2022_Optus_data_breach, accessed 2026-06-20 (event September 2022; up to ~10M current and former customers affected). Technical breach mechanics, namely an unauthenticated, internet-facing API and incrementing (sequential) customer identifiers that left ~9.8M customers at risk over the exposure window: "How Did the Optus Data Breach Happen?", UpGuard, https://www.upguard.com/blog/how-did-the-optus-data-breach-happen, accessed 2026-06-20.
6. "IBM Report: Escalating Data Breach Disruption Pushes Costs to New Highs," IBM Newsroom, 2024-07-30, https://newsroom.ibm.com/2024-07-30-ibm-report-escalating-data-breach-disruption-pushes-costs-to-new-highs (global average breach USD 4.88M, up 10% year over year).
7. "BC Tribunal Confirms Companies Remain Liable for Information Provided by AI Chatbot," American Bar Association Business Law Today, February 2024, https://www.americanbar.org/groups/business_law/resources/business-law-today/2024-february/bc-tribunal-confirms-companies-remain-liable-information-provided-ai-chatbot/ (Moffatt v. Air Canada, 2024 BCCRT 149; ~$483 refund ordered; "separate legal entity" argument rejected). Corroborating accessible report: "Air Canada must pay refund promised by AI chatbot, tribunal rules," AOL/Reuters, https://www.aol.com/air-canada-must-pay-refund-040527166.html, accessed 2026-06-20.