Docker Troubleshooting Guide¶
This guide covers common Docker and container-related issues in RAG Modulo, including networking problems, volume issues, and container debugging techniques.
Table of Contents¶
- Overview
- Container Health Issues
- Networking Problems
- Volume & Storage Issues
- Image Build Problems
- Resource Constraints
- Multi-Container Coordination
- Docker Compose Issues
Overview¶
RAG Modulo uses Docker Compose for orchestrating multiple containers:
Services:
- backend: FastAPI application (port 8000)
- frontend: React/Nginx (port 3000/8080)
- postgres: PostgreSQL database (port 5432)
- milvus-standalone: Vector database (port 19530)
- milvus-etcd: Milvus metadata store (port 2379)
- minio: Object storage (ports 9000, 9001)
- mlflow-server: Model tracking (port 5001)
Docker Compose Files:
- ./docker-compose.yml - Production deployment
- ./docker-compose-infra.yml - Infrastructure services
- ./docker-compose.dev.yml - Development overrides
- ./docker-compose.test.yml - Testing configuration
Container Health Issues¶
Issue 1: Container Immediately Exits¶
Symptoms:
Diagnosis:
# Check exit code
docker compose ps
# View logs
docker compose logs backend | tail -50
# Check last container status
docker inspect rag-modulo-backend-1 | jq '.[0].State'
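The `State` object printed by the last command includes `ExitCode` and `OOMKilled` fields. As a rough aid for reading them, the sketch below decodes exit codes above 128 as 128 plus a signal number, which is the usual convention. The helper name `describe_state` is illustrative and not part of RAG Modulo, and the sample input is hand-written rather than real output.

```python
import json
import signal


def describe_state(state: dict) -> str:
    """Summarise a `docker inspect` State object in one line."""
    code = state.get("ExitCode", 0)
    if state.get("OOMKilled"):
        return f"exit {code}: killed by the kernel OOM killer"
    if code > 128:
        # Convention: 128 + N means the process died from signal N.
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            name = f"signal {code - 128}"
        return f"exit {code}: terminated by {name}"
    if code == 0:
        return "exit 0: clean shutdown"
    return f"exit {code}: application error, check the logs"


# Hand-written State fragment for illustration only.
sample = json.loads('{"Status": "exited", "ExitCode": 137, "OOMKilled": false}')
print(describe_state(sample))
```

Piping `docker inspect rag-modulo-backend-1 | jq '.[0].State'` into a script built around this function would give a one-line verdict before digging into the logs.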
Common Causes & Solutions:
A) Missing Environment Variables:
# Check required variables
docker compose config | grep -A 50 backend
# Verify .env file exists
ls -la .env
# Check environment in container
docker compose exec backend env | grep COLLECTIONDB
Solution:
# Copy example .env
cp .env.example .env
# Edit with your values
vim .env
# Restart services
docker compose up -d
B) Database Not Ready:
# Check PostgreSQL health
docker compose ps postgres
# Wait for healthy status
docker compose up -d postgres
docker compose exec postgres pg_isready -U postgres
# Start backend after database is healthy
docker compose up -d backend
Solution: Use depends_on with health checks (already configured in docker-compose.yml)
backend:
  depends_on:
    postgres:
      condition: service_healthy
    milvus-standalone:
      condition: service_healthy
C) Application Startup Error:
# View detailed startup logs
docker compose logs backend | grep -i error
# Common errors:
# - Import errors: Check PYTHONPATH
# - Configuration errors: Validate settings
# - Port conflicts: Check if port 8000 is available
Solution:
# Check PYTHONPATH in Dockerfile
cat backend/Dockerfile.backend | grep PYTHONPATH
# Test import manually
docker compose exec backend python -c "import rag_solution; print('OK')"
# Check port availability
lsof -i :8000 || netstat -tuln | grep 8000
Issue 2: Container Health Check Failures¶
Symptoms:
Diagnosis:
# Check health check configuration
docker inspect rag-modulo-backend-1 | jq '.[0].State.Health'
# View health check logs
docker inspect rag-modulo-backend-1 | jq '.[0].State.Health.Log[-5:]'
# Manual health check
docker compose exec backend python healthcheck.py
echo $? # 0 = healthy, 1 = unhealthy
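The contents of `healthcheck.py` are not shown in this guide. Purely as an illustration of what such a script might do, the sketch below probes the `/api/health` endpoint used elsewhere in this guide with the standard library and maps the result to the 0/1 exit codes described above. The function names and the 3-second timeout are assumptions, not the project's actual implementation.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8000/api/health", timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def exit_code(healthy: bool) -> int:
    """Map a health result to the exit code Docker expects (0 = healthy)."""
    return 0 if healthy else 1
```

A script using these helpers would finish with `sys.exit(exit_code(is_healthy()))`, so the same file works both as a Docker health check and as a manual probe.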
Common Causes:
A) Backend Not Responding:
# Check if process is running
docker compose exec backend ps aux | grep uvicorn
# Test health endpoint
docker compose exec backend curl -f http://localhost:8000/api/health
# Check logs for errors
docker compose logs backend | tail -50
Solution:
# Restart backend
docker compose restart backend
# Or rebuild if code changes
docker compose up -d --build backend
B) Incorrect Health Check Path:
# Verify health check configuration
# File: docker-compose.yml
backend:
  healthcheck:
    test: ["CMD", "python", "healthcheck.py"] # Correct
    # NOT: ["CMD", "curl", "-f", "http://localhost:8000/health"] # Wrong path
Issue 3: Container Restarts Continuously¶
Symptoms:
Diagnosis:
# Check restart count
docker inspect rag-modulo-backend-1 | jq '.[0].RestartCount'
# View crash logs
docker compose logs backend | grep -i "error\|exception\|traceback"
# Check exit reason
docker inspect rag-modulo-backend-1 | jq '.[0].State'
Common Causes:
A) Out of Memory (OOM):
# Check memory usage
docker stats --no-stream rag-modulo-backend-1
# Check for OOM in kernel logs
dmesg | grep -i "out of memory"
# Check Docker daemon logs
journalctl -u docker | grep oom
Solution:
# Increase memory limit
# File: docker-compose.yml
backend:
  deploy:
    resources:
      limits:
        memory: 4G # Increase from 2G
B) Crash Loop Due to Dependencies:
# Check dependency health
docker compose ps postgres milvus-standalone
# Ensure services start in correct order
# File: docker-compose.yml (already configured)
backend:
  depends_on:
    postgres:
      condition: service_healthy
Networking Problems¶
Issue 1: Cannot Connect to Database¶
Symptoms:
Diagnosis:
# Check network connectivity
docker compose exec backend ping -c 3 postgres
# Test database port
docker compose exec backend nc -zv postgres 5432
# Check database service
docker compose ps postgres
docker compose logs postgres | tail -20
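Slim base images often ship without `ping` or `nc`, in which case the commands above fail for reasons unrelated to the network. A minimal sketch of the same TCP check using only the Python standard library follows. The function names, attempt count, and delays are illustrative choices.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name resolution.
        return False


def wait_for_port(host: str, port: int, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until the port accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Inside the backend container this would be called as `port_open("postgres", 5432)`. A `False` result with a correct service name points at the network or at the database, not at missing tooling.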
Solutions:
A) Service Name Resolution:
# Verify service name in connection string
# File: .env
COLLECTIONDB_HOST=postgres # NOT 'localhost' or '127.0.0.1'
# Test DNS resolution
docker compose exec backend nslookup postgres
docker compose exec backend getent hosts postgres
B) Network Configuration:
# Check Docker networks
docker network ls
# Inspect app-network
docker network inspect rag-modulo_app-network
# Verify all containers are on same network
docker network inspect rag-modulo_app-network | jq '.[0].Containers'
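Reading the `Containers` map by eye gets tedious with seven services. The sketch below takes the JSON printed by `docker network inspect` and reports which expected containers are not attached. The sample data is hand-written, and the container name `rag-modulo-postgres-1` is assumed by analogy with `rag-modulo-backend-1`.

```python
import json


def attached_containers(inspect_output: str) -> dict:
    """Map container name -> IPv4 address from `docker network inspect` JSON."""
    networks = json.loads(inspect_output)
    containers = networks[0].get("Containers") or {}
    return {c["Name"]: c.get("IPv4Address", "") for c in containers.values()}


def missing_from_network(inspect_output: str, expected: list) -> list:
    """Return the expected container names that are not on the network."""
    attached = attached_containers(inspect_output)
    return [name for name in expected if name not in attached]


# Hand-written sample in the shape Docker emits; ID and address are made up.
sample = json.dumps([{
    "Name": "rag-modulo_app-network",
    "Containers": {
        "abc123": {"Name": "rag-modulo-backend-1", "IPv4Address": "172.18.0.3/16"},
    },
}])
print(missing_from_network(sample, ["rag-modulo-backend-1", "rag-modulo-postgres-1"]))
```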
C) Port Conflicts:
# Check if PostgreSQL port is exposed correctly
docker compose port postgres 5432
# Check local port usage
lsof -i :5432
netstat -tuln | grep 5432
Issue 2: Cannot Access Backend from Host¶
Symptoms:
$ curl http://localhost:8000/api/health
curl: (7) Failed to connect to localhost port 8000: Connection refused
Diagnosis:
# Check port mapping
docker compose port backend 8000
# Check if backend is listening
docker compose exec backend netstat -tuln | grep 8000
# Check firewall rules
sudo iptables -L -n | grep 8000
Solutions:
A) Incorrect Port Mapping:
# File: docker-compose.yml
backend:
  ports:
    - "8000:8000" # host:container
    # NOT: "8001:8000" if you're accessing localhost:8000
B) Backend Binding to Wrong Interface:
# Check uvicorn bind address
docker compose exec backend ps aux | grep uvicorn
# Should be: --host 0.0.0.0 (not 127.0.0.1)
# File: backend/Dockerfile.backend (CMD is a Dockerfile instruction; in docker-compose.yml the equivalent key is `command:`)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
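To tell a loopback-only listener from one that is reachable through the port mapping, the `netstat -tuln` output from the diagnosis step can be parsed. The following is a small sketch that assumes the common Linux net-tools column layout. The helper names and the sample output are illustrative.

```python
def listen_addresses(netstat_output: str, port: int) -> list:
    """Return local addresses with a TCP listener on the given port."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "LISTEN":
            address, _, local_port = fields[3].rpartition(":")
            if local_port == str(port):
                found.append(address)
    return found


def reachable_from_host(addresses: list) -> bool:
    """True if at least one listener is not bound to loopback only."""
    loopback = {"127.0.0.1", "::1", "localhost"}
    return any(addr not in loopback for addr in addresses)


# Hand-written sample: uvicorn bound to 127.0.0.1 inside the container.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:8000          0.0.0.0:*               LISTEN
"""
print(reachable_from_host(listen_addresses(sample, 8000)))
```

With the sample above the result is `False`: the process is listening, but only on loopback inside the container, so the published port has nothing to forward to.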
C) Docker Desktop Networking (Mac/Windows):
# On Mac/Windows, host.docker.internal is for containers that need to reach services running on the host
# Access from the host itself still uses localhost:8000
# Test from container
docker compose exec backend curl http://localhost:8000/api/health
# Test from host
curl http://localhost:8000/api/health
Issue 3: Container Cannot Reach External APIs¶
Symptoms:
Diagnosis:
# Test external connectivity
docker compose exec backend ping -c 3 8.8.8.8
docker compose exec backend curl https://google.com
# Test specific API
docker compose exec backend curl https://us-south.ml.cloud.ibm.com
Solutions:
A) DNS Resolution Issues:
# Check DNS configuration
docker compose exec backend cat /etc/resolv.conf
# Test DNS resolution
docker compose exec backend nslookup us-south.ml.cloud.ibm.com
# Add DNS servers
# File: docker-compose.yml
backend:
  dns:
    - 8.8.8.8
    - 8.8.4.4
B) Corporate Proxy:
# File: docker-compose.yml
backend:
  environment:
    - HTTP_PROXY=http://proxy.company.com:8080
    - HTTPS_PROXY=http://proxy.company.com:8080
    - NO_PROXY=localhost,postgres,milvus-standalone
C) Firewall Blocking Outbound:
# Check Docker daemon firewall rules
sudo iptables -L DOCKER-USER -n
# Allow outbound HTTPS
sudo iptables -I DOCKER-USER -p tcp --dport 443 -j ACCEPT
Volume & Storage Issues¶
Issue 1: Volume Mount Errors¶
Symptoms:
Diagnosis:
# Check volume configuration
docker compose config | grep -A 10 volumes
# Verify paths exist
ls -la ./volumes/
ls -la ./volumes/postgres
Solutions:
A) Create Volume Directories:
# Create all volume directories
mkdir -p volumes/postgres
mkdir -p volumes/milvus
mkdir -p volumes/etcd
mkdir -p volumes/minio
mkdir -p volumes/backend
# Set permissions
chmod -R 755 volumes/
B) Use Docker-Managed Volumes (alternative):
# File: docker-compose.yml
volumes:
  postgres_data: # Docker-managed volume (no device path)
  milvus_data:
  minio_data:
services:
  postgres:
    volumes:
      - postgres_data:/var/lib/postgresql/data
Issue 2: Permission Denied Errors¶
Symptoms:
postgres_1 | FATAL: data directory "/var/lib/postgresql/data" has wrong ownership
backend_1 | PermissionError: [Errno 13] Permission denied: '/app/logs/rag_modulo.log'
Diagnosis:
# Check volume ownership
ls -la volumes/postgres
ls -la volumes/backend
# Check container user
docker compose exec backend id
docker compose exec postgres id
Solutions:
A) Fix Volume Permissions:
# For PostgreSQL (uid 999)
sudo chown -R 999:999 volumes/postgres
# For backend (uid 10001 from Dockerfile)
sudo chown -R 10001:10001 volumes/backend
# Or make world-writable (less secure)
chmod -R 777 volumes/backend/logs
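After a `chown`, it can be worth confirming that nothing under the volume directory was missed. A minimal sketch using only the standard library is shown below. The function name is illustrative, and the expected uid comes from the values quoted above.

```python
import os


def ownership_problems(path: str, expected_uid: int) -> list:
    """List directories and files under path not owned by expected_uid."""
    problems = []
    for root, _dirs, files in os.walk(path):
        # Each directory shows up once as `root`; files are checked alongside it.
        for entry in [root] + [os.path.join(root, f) for f in files]:
            if os.lstat(entry).st_uid != expected_uid:
                problems.append(entry)
    return problems
```

For example, `ownership_problems("volumes/backend", 10001)` should return an empty list once the backend volume is owned by the container user.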
B) Use Named Volumes (Docker manages permissions):
volumes:
  backend_data:
  postgres_data:
services:
  backend:
    volumes:
      - backend_data:/mnt/data
  postgres:
    volumes:
      - postgres_data:/var/lib/postgresql/data
Issue 3: Disk Space Exhausted¶
Symptoms:
Diagnosis:
# Check Docker disk usage
docker system df
# Detailed disk usage
docker system df -v
# Check host disk space
df -h
# Check specific volume
du -sh volumes/*
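The same host-level check can be scripted, for instance as a pre-flight step before a large build. The sketch below uses `shutil.disk_usage`. The 10% warning threshold is an arbitrary assumption, not a project setting.

```python
import shutil


def disk_report(path: str = "/", warn_below: float = 0.10) -> tuple:
    """Return (free_fraction, is_low) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction, free_fraction < warn_below
```

Pointing it at the directory that holds `volumes/` (or at Docker's data root on Linux) shows whether the filesystem that actually stores container data is the one running out.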
Solutions:
A) Clean Docker Resources:
# Remove unused images
docker image prune -a
# Remove unused volumes
docker volume prune
# Remove build cache
docker builder prune
# Nuclear option: Clean everything (CAUTION!)
docker system prune -a --volumes
B) Increase Docker Disk Allocation (Docker Desktop):
C) Move Volumes to Larger Disk:
# Stop services
docker compose down
# Move volumes
sudo mv volumes /mnt/large-disk/rag-modulo-volumes
# Update docker-compose.yml paths
# File: docker-compose.yml
volumes:
  postgres_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/large-disk/rag-modulo-volumes/postgres
# Restart services
docker compose up -d
Image Build Problems¶
Issue 1: Build Fails with BACKEND_CACHE_BUST¶
Symptoms:
ERROR: failed to solve: failed to compute cache key:
# Or cache not invalidating when backend files change
Diagnosis:
# Check Dockerfile
cat backend/Dockerfile.backend | grep BACKEND_CACHE_BUST
# Try build with no cache
docker build --no-cache -f backend/Dockerfile.backend -t test-build .
Solutions:
A) Local Builds (uses default value):
# Local builds use default value 'local-build' automatically
docker build -f backend/Dockerfile.backend -t rag-modulo-backend:latest .
make build-backend # Also works - uses default value
B) Force Cache Invalidation:
# Override with a new value to force cache invalidation
docker build --build-arg BACKEND_CACHE_BUST=$(date +%s) \
-f backend/Dockerfile.backend -t rag-modulo-backend:latest .
C) CI/CD Builds (content-based invalidation):
# In GitHub Actions workflows, BACKEND_CACHE_BUST is set automatically
# based on content hash of backend files:
BACKEND_CACHE_BUST=${{ hashFiles('backend/**/*.py', 'backend/Dockerfile.backend', 'pyproject.toml', 'poetry.lock') }}
D) Build with --pull:
# Pull latest base image
docker build --pull -f backend/Dockerfile.backend -t rag-modulo-backend:latest .
Understanding Cache Invalidation Strategy:
- Local builds: Use default BACKEND_CACHE_BUST=local-build; the cache invalidates only on manual rebuilds
- CI builds: Use content hash; the cache invalidates automatically when backend Python files, Dockerfile, or dependency files change
- Cache benefits: Docker layer cache is preserved when backend files are unchanged, significantly speeding up builds
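The content-based strategy can be approximated locally. The sketch below hashes the paths and contents of files matching a set of glob patterns, so the value changes only when one of those files changes. It illustrates the idea only: it does not reproduce the digest that GitHub's `hashFiles` computes, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(root: str, patterns: list) -> str:
    """Hash relative paths and contents of files matching the glob patterns."""
    base = Path(root)
    files = sorted({p for pattern in patterns for p in base.glob(pattern) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        # Include the path so that renames also change the hash.
        digest.update(path.relative_to(base).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Called as `content_hash(".", ["backend/**/*.py", "backend/Dockerfile.backend", "pyproject.toml", "poetry.lock"])`, the result could be passed as `--build-arg BACKEND_CACHE_BUST=...`, giving local builds the same "rebuild only on change" behaviour described for CI.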
Issue 2: Poetry Lock File Issues¶
Symptoms:
Diagnosis:
Solutions:
# Regenerate lock file (run from the repository root, where pyproject.toml lives)
poetry lock
# Rebuild image
docker compose build backend
# Or use build argument to skip validation
docker build --build-arg SKIP_LOCK_CHECK=1 -f backend/Dockerfile.backend .
Issue 3: Build Timeouts¶
Symptoms:
Solutions:
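# Lift BuildKit log truncation limits (these control log output, not a timeout;
# they are read by the BuildKit daemon, so exporting them in the client shell may have no effect)
export BUILDKIT_STEP_LOG_MAX_SIZE=-1
export BUILDKIT_STEP_LOG_MAX_SPEED=-1
# Build with plain progress output to see which step stalls
docker build --progress=plain -f backend/Dockerfile.backend .
# Or use docker compose (COMPOSE_HTTP_TIMEOUT is a Compose V1 setting; Compose V2 does not use it)
COMPOSE_HTTP_TIMEOUT=600 docker compose build backend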
Resource Constraints¶
Issue 1: Backend OOM (Out of Memory)¶
Symptoms:
docker compose ps
rag-modulo-backend-1 Restarting (137) 1 minute ago
# Exit code 137 = 128 + 9 (SIGKILL), most often sent by the kernel OOM killer
dmesg | tail
Out of memory: Killed process 1234 (python)
Diagnosis:
# Check memory limit
docker inspect rag-modulo-backend-1 | jq '.[0].HostConfig.Memory'
# Monitor memory usage
docker stats rag-modulo-backend-1
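# Check memory as seen from inside the container (requires psutil in the image)
# Note: psutil.virtual_memory() reads /proc/meminfo, so it typically reports host-wide figures, not the container's cgroup limit
docker compose exec backend python -c "
import psutil
print(f'Memory: {psutil.virtual_memory().percent}%')
"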
Solutions:
A) Increase Memory Limit:
# File: docker-compose.yml
backend:
  deploy:
    resources:
      limits:
        memory: 8G # Increase from 4G
      reservations:
        memory: 4G
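`HostConfig.Memory` is reported in bytes, while Compose limits are written as `4G` or `8G`, so confirming that a new limit took effect requires a conversion. The sketch below assumes Docker's binary units (1G = 1024³ bytes). The helper names are illustrative.

```python
import re

_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a Compose-style size such as '4G' or '512m' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([bkmg]?)b?\s*", value.lower())
    if not match:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def limit_applied(configured: str, inspect_memory: int) -> bool:
    """Compare a configured limit with HostConfig.Memory (0 means unlimited)."""
    return inspect_memory != 0 and inspect_memory == parse_memory(configured)


print(parse_memory("8G"))
```

If `docker inspect` still shows `0` or the old value after editing the file, the container was most likely not recreated. `docker compose up -d` recreates it when the configuration has changed.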
B) Reduce Memory Usage:
# Disable CPU-intensive operations
# File: .env
WATSONX_USE_GPU=false # Already default in container
# Reduce worker count
WEB_CONCURRENCY=2 # Default is 4
# Use CPU-only PyTorch (already configured in Dockerfile)
Issue 2: CPU Throttling¶
Symptoms:
Diagnosis:
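# Check CPU limits (a `cpus` limit is reported as NanoCpus; CpuQuota stays 0 unless a quota is set directly)
docker inspect rag-modulo-backend-1 | jq '.[0].HostConfig | {NanoCpus, CpuQuota, CpuPeriod}'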
# Monitor CPU usage
docker stats rag-modulo-backend-1
# Check Docker daemon CPU
top -p $(pgrep dockerd)
Solutions:
A) Increase CPU Limit:
# File: docker-compose.yml
backend:
  deploy:
    resources:
      limits:
        cpus: '4.0' # Increase from 2.0
      reservations:
        cpus: '2.0'
B) Scale Horizontally:
# File: docker-compose.yml
backend:
  deploy:
    replicas: 3 # Run 3 backend containers
# With load balancer (nginx)
nginx:
  image: nginx:alpine
  volumes:
    - ./nginx.conf:/etc/nginx/nginx.conf
  ports:
    - "80:80"
Multi-Container Coordination¶
Issue 1: Services Start Out of Order¶
Symptoms:
backend_1 | sqlalchemy.exc.OperationalError: could not connect to server
# Backend starts before PostgreSQL is ready
Solution: Use health checks with depends_on (already configured):
# File: docker-compose.yml
backend:
  depends_on:
    postgres:
      condition: service_healthy # Wait for health check
    milvus-standalone:
      condition: service_healthy
    mlflow-server:
      condition: service_started # No health check, just started
Issue 2: Circular Dependency¶
Symptoms:
Error: Circular dependency between services:
service1 depends on service2
service2 depends on service1
Solution: Break the cycle by removing one of the depends_on entries and letting that service retry its connection at startup:
# File: backend/rag_solution/file_management/database.py
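from sqlalchemy import create_engine, text
from tenacity import retry, stop_after_attempt, wait_fixed

# DATABASE_URL is assumed to be defined elsewhere in this module.
@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
def connect_to_database():
    """Return an engine once the database answers a trivial query."""
    engine = create_engine(DATABASE_URL)
    with engine.connect() as conn:
        # SQLAlchemy 2.x requires text() for raw SQL strings
        conn.execute(text("SELECT 1"))
    return engine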
Docker Compose Issues¶
Issue 1: Docker Compose V1 vs V2¶
Symptoms:
Solutions:
# Check Docker Compose version
docker compose version # V2
docker-compose version # V1 (deprecated)
# Install Docker Compose V2
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-compose-plugin
# Check Makefile compatibility
# File: Makefile uses docker compose (V2)
DOCKER_COMPOSE := docker compose
Issue 2: Multiple Compose Files¶
Symptoms:
Solution: Understand file precedence:
# Production (default)
docker compose up -d
# Uses: docker-compose.yml + docker-compose-infra.yml
# Development (with overrides)
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d
# Testing
docker compose -f docker-compose.test.yml up -d
# Check merged configuration
docker compose config
docker compose -f docker-compose.yml -f docker-compose.dev.yml config
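When several `-f` files are given, later files override earlier ones. The sketch below is a simplified model of that behaviour: mappings are merged recursively and other values are replaced. Real Compose applies extra rules for some list-valued options (for example, combining `ports` entries), which this deliberately ignores. The `LOG_LEVEL` variable and the `command` value are hypothetical.

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base (later file wins)."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# Stand-ins for docker-compose.yml and docker-compose.dev.yml.
base = {"services": {"backend": {
    "image": "rag-modulo-backend:latest",
    "environment": {"LOG_LEVEL": "info"},
}}}
dev = {"services": {"backend": {
    "environment": {"LOG_LEVEL": "debug"},
    "command": "uvicorn main:app --reload",
}}}

merged = merge(base, dev)
print(merged["services"]["backend"])
```

`docker compose config` remains the authoritative view; the model is only meant to explain why a value set in the base file can silently disappear once an override file is added.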
Issue 3: Environment Variable Conflicts¶
Symptoms:
Solutions:
# Check variable precedence (highest to lowest):
# 1. Shell environment: export VAR=value (applies only where the Compose file interpolates ${VAR} or passes VAR through by name)
# 2. docker-compose.yml environment section
# 3. env_file (.env)
# 4. Dockerfile ENV
# View effective configuration
docker compose config | grep -A 5 environment
# Debug specific variable
docker compose exec backend env | grep COLLECTIONDB_HOST
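The precedence order listed above can be modelled as a lookup through layers, highest priority first, which helps explain why a value in `.env` appears to be ignored. This is a simplification that treats every layer as a plain dictionary. The layer contents below are made-up examples.

```python
def effective_value(name: str, layers: list):
    """Return (value, source) from the first layer that defines name."""
    for source, values in layers:
        if name in values:
            return values[name], source
    return None, None


# Layers ordered from highest to lowest precedence; values are illustrative.
layers = [
    ("shell", {}),
    ("compose environment", {"COLLECTIONDB_HOST": "postgres"}),
    ("env_file", {"COLLECTIONDB_HOST": "localhost"}),
    ("Dockerfile ENV", {}),
]
print(effective_value("COLLECTIONDB_HOST", layers))
```

Here the `environment` section wins over `.env`, so editing `.env` alone would not change what the container sees.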
Related Documentation¶
- Debugging Guide - General debugging techniques
- Performance Troubleshooting - Container performance
- Cloud Deployment - Production Docker deployment
- Common Issues - Quick fixes for frequent problems