ADR-024: Gunicorn Auto-Scaling and Horizontal Scaling Readiness

Status: Accepted
Date: 2026-04-18
Issue: #203

Context

The initial production Dockerfile (ADR-023) hardcoded gunicorn -w 2, which doesn't adapt to the container's CPU allocation. A 4-CPU host wastes cores; a 1-CPU Fly.io machine runs more workers than it can sustain. We also had no infrastructure-level auto-scaling — Fly.io machines didn't know when to spin up additional instances.

Decision

1. Process-level auto-tuning via gunicorn.conf.py

Workers are calculated as min(2 * CPU_CORES + 1, 4):

| Container CPU | Workers |
| --- | --- |
| 1 (Fly.io shared) | 3 |
| 2+ (compose default) | 4 (cap) |

Capped at 4 because each worker loads the Docling/PyTorch ML models (~250-400 MB RSS). With the default 2 GB container memory limit, 4 workers × ~400 MB ≈ 1.6 GB, which is the safe maximum. Each uvicorn async worker handles many concurrent connections, so four workers are enough for high throughput. Increase the cap only if you also raise deploy.resources.limits.memory.

Operators override at runtime via the WEB_CONCURRENCY environment variable (read natively by gunicorn, no custom code):

WEB_CONCURRENCY=4 make deploy-up
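
For reference, a minimal gunicorn.conf.py implementing this policy might look like the sketch below. It is illustrative rather than a copy of the project's actual file: it reads WEB_CONCURRENCY explicitly to stay self-contained (the real override relies on gunicorn's native handling of that variable), and the affinity-based CPU count and UvicornWorker class are assumptions about the runtime.

# gunicorn.conf.py -- illustrative sketch, not the project's exact file
import os

def _cpu_count() -> int:
    # Prefer the scheduler affinity mask so a CPU-limited container is not
    # sized against the full host core count (assumes a Linux runtime).
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1

# Operator override wins; otherwise 2 * cores + 1, capped at 4 so that
# 4 workers x ~400 MB of loaded ML models stays inside the 2 GB limit.
workers = int(os.environ.get("WEB_CONCURRENCY", min(2 * _cpu_count() + 1, 4)))

# One uvicorn event loop per gunicorn worker process (async workers).
worker_class = "uvicorn.workers.UvicornWorker"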

2. Machine-level auto-scaling via Fly.io concurrency limits

[http_service.concurrency]
  type = "connections"
  hard_limit = 100
  soft_limit = 80

When concurrent connections on a machine exceed soft_limit, Fly automatically starts another machine. Because auto_stop_machines = "stop" is also set, idle machines shut down, so you pay only for what you use.

3. Docker Compose horizontal scaling prep

Removed container_name from the gateway service, because Docker Compose refuses to scale a service that has container_name set. This unblocks a future docker compose up --scale gateway=3 (which requires a reverse proxy in front).

Scaling tiers

| Tier | Mechanism | Approximate capacity |
| --- | --- | --- |
| Single container, auto-tuned workers | gunicorn.conf.py adapts to CPU | 30-50 concurrent users |
| Fly.io multi-machine | Concurrency-based auto-scaling | 100+ connections per machine |
| Docker Compose multi-replica | --scale gateway=N + reverse proxy | Requires resolving blockers below |

Known blockers for multi-replica gateway

Items 1–3 were resolved in ADR-026 (#212):

  1. ~~WebSocket ConnectionManager~~ — Resolved. Redis Pub/Sub for cross-replica event distribution.
  2. ~~ChromaDB PersistentClient~~ — Resolved. Embedding writes already routed through single-writer ARQ.
  3. ~~LLM budget tracking~~ — Resolved. Budget dedup flags moved to Redis with TTL.
  4. Frontend VITE_API_URL — baked in at build time, so the frontend cannot be pointed at a different backend without rebuilding the image. Fix: runtime configuration via window.__CONFIG__ or relative URLs (see the sketch after this list).
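
One way the runtime-configuration fix could look is sketched below. It assumes the gateway is a FastAPI app; the /config.js endpoint, the PUBLIC_API_URL variable, and the config keys are hypothetical and not part of the current API.

# Illustrative sketch only -- endpoint name, env var, and config keys are
# hypothetical, not the project's current API.
import json
import os

from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

@app.get("/config.js")
def runtime_config() -> Response:
    # The SPA loads this script before its bundle and reads window.__CONFIG__
    # instead of a VITE_API_URL value frozen into the build.
    config = {"apiUrl": os.environ.get("PUBLIC_API_URL", "/")}
    body = f"window.__CONFIG__ = {json.dumps(config)};"
    return Response(content=body, media_type="application/javascript")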

Alternatives Considered

  • Keep hardcoded -w 2 — simple but wastes resources on larger machines and over-provisions on smaller ones.
  • Kubernetes HPA — auto-scales pods based on CPU/memory metrics. Over-engineered for this project's scale; adds significant operational complexity.
  • Docker Swarm mode — built-in service scaling and load balancing. Adds operational overhead; Fly.io handles this natively for cloud deployments.

Consequences

  • Gunicorn workers auto-scale to available CPU without image rebuilds
  • WEB_CONCURRENCY env var provides a simple operator override
  • Fly.io deployments auto-scale across machines based on connection load
  • Docker Compose deployments can scale gateway replicas (once blockers are resolved)
  • The multi-replica blockers are documented; three were resolved in ADR-026, leaving the frontend runtime-configuration work for the future