# Backup and Disaster Recovery
This guide covers backup and disaster recovery strategies for RAG Modulo deployment on IBM Cloud, ensuring data protection and business continuity.
## Overview
The backup and disaster recovery strategy provides:
- Data Protection: Automated backups of all critical data
- Business Continuity: Rapid recovery from disasters
- Compliance: Meet regulatory requirements for data retention
- Testing: Regular validation of backup and recovery procedures
- Documentation: Clear procedures for disaster response
## Architecture

```mermaid
graph TB
    subgraph "Production Environment"
        PG[PostgreSQL]
        OS[Object Storage]
        ZL[Zilliz Cloud]
        ES[Event Streams]
        BE[Backend App]
        FE[Frontend App]
    end

    subgraph "Backup Services"
        PG_BK[PostgreSQL Backups]
        OS_BK[Object Storage Backups]
        ZL_BK[Zilliz Cloud Backups]
        ES_BK[Event Streams Backups]
    end

    subgraph "Disaster Recovery"
        DR_REGION[DR Region]
        DR_PG[DR PostgreSQL]
        DR_OS[DR Object Storage]
        DR_ZL[DR Zilliz Cloud]
        DR_ES[DR Event Streams]
        DR_APPS[DR Applications]
    end

    subgraph "Backup Storage"
        COS[Cloud Object Storage]
        CR[Container Registry]
        SECRETS[Secrets Manager]
    end

    PG --> PG_BK
    OS --> OS_BK
    ZL --> ZL_BK
    ES --> ES_BK

    PG_BK --> COS
    OS_BK --> COS
    ZL_BK --> COS
    ES_BK --> COS

    COS --> DR_REGION
    CR --> DR_REGION
    SECRETS --> DR_REGION

    DR_REGION --> DR_PG
    DR_REGION --> DR_OS
    DR_REGION --> DR_ZL
    DR_REGION --> DR_ES
    DR_REGION --> DR_APPS
```

## Backup Strategy
### 1. PostgreSQL Database Backups

#### Automated Backups

```yaml
# PostgreSQL backup configuration
postgresql_backup:
  enabled: true
  service: "ibm-cloud-databases-for-postgresql"
  plan: "standard"

  # Backup settings
  backup_settings:
    frequency: "daily"
    retention_days: 30
    point_in_time_recovery: true
    cross_region_replication: true

  # Backup schedule
  schedule:
    time: "02:00"
    timezone: "UTC"
    days: ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

  # Backup storage
  storage:
    location: "us-south"
    bucket: "rag-modulo-postgresql-backups"
    encryption: "AES256"
    compression: true
```
Manual Backup Script¶
#!/bin/bash
# PostgreSQL backup script
set -e
# Configuration
BACKUP_DIR="/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="postgresql_backup_${DATE}.sql"
S3_BUCKET="rag-modulo-postgresql-backups"
S3_PREFIX="postgresql/"
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Create database backup
echo "Creating PostgreSQL backup..."
pg_dump "$DATABASE_URL" > "$BACKUP_DIR/$BACKUP_FILE"
# Compress backup
echo "Compressing backup..."
gzip "$BACKUP_DIR/$BACKUP_FILE"
BACKUP_FILE="${BACKUP_FILE}.gz"
# Upload to S3
echo "Uploading backup to S3..."
aws s3 cp "$BACKUP_DIR/$BACKUP_FILE" "s3://$S3_BUCKET/$S3_PREFIX$BACKUP_FILE"
# Verify upload
echo "Verifying backup upload..."
aws s3 ls "s3://$S3_BUCKET/$S3_PREFIX$BACKUP_FILE"
# Clean up local backup
echo "Cleaning up local backup..."
rm "$BACKUP_DIR/$BACKUP_FILE"
echo "Backup completed successfully: $BACKUP_FILE"
### 2. Object Storage Backups

#### Cross-Region Replication

```yaml
# Object Storage backup configuration
object_storage_backup:
  enabled: true
  service: "ibm-cloud-object-storage"

  # Replication settings
  replication:
    enabled: true
    source_region: "us-south"
    target_region: "us-east"
    target_bucket: "rag-modulo-backups-us-east"

  # Lifecycle policies
  lifecycle_policies:
    - name: "standard_to_ia"
      rule_id: "standard_to_ia"
      status: "Enabled"
      transitions:
        - days: 30
          storage_class: "STANDARD_IA"
    - name: "ia_to_glacier"
      rule_id: "ia_to_glacier"
      status: "Enabled"
      transitions:
        - days: 90
          storage_class: "GLACIER"
    - name: "glacier_to_deep_archive"
      rule_id: "glacier_to_deep_archive"
      status: "Enabled"
      transitions:
        - days: 365
          storage_class: "DEEP_ARCHIVE"
```
#### Backup Script

```bash
#!/bin/bash
# Object Storage backup script
set -e

# Configuration
SOURCE_BUCKET="rag-modulo-app-data"
BACKUP_BUCKET="rag-modulo-backups"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_PREFIX="object-storage-backup-$DATE/"

# Create backup
echo "Creating Object Storage backup..."
aws s3 sync "s3://$SOURCE_BUCKET" "s3://$BACKUP_BUCKET/$BACKUP_PREFIX" \
  --storage-class STANDARD_IA \
  --metadata "backup-date=$DATE,backup-type=object-storage"

# Verify backup
echo "Verifying backup..."
aws s3 ls "s3://$BACKUP_BUCKET/$BACKUP_PREFIX" --recursive | wc -l

echo "Object Storage backup completed successfully"
```
### 3. Vector Database Backups

#### Zilliz Cloud Backups

```yaml
# Zilliz Cloud backup configuration
zilliz_backup:
  enabled: true
  service: "zilliz-cloud"

  # Backup settings
  backup_settings:
    frequency: "daily"
    retention_days: 30
    cross_region_replication: true

  # Backup collections
  collections:
    - name: "documents"
      backup_enabled: true
    - name: "embeddings"
      backup_enabled: true
    - name: "metadata"
      backup_enabled: true

  # Backup storage
  storage:
    location: "us-south"
    bucket: "rag-modulo-zilliz-backups"
    encryption: "AES256"
```
#### Backup Script

```python
#!/usr/bin/env python3
# Zilliz Cloud backup script
import os
from datetime import datetime

import boto3
# The official SDK for Zilliz Cloud is pymilvus, which provides MilvusClient.
from pymilvus import MilvusClient


def backup_zilliz_collections():
    """Backup Zilliz Cloud collections."""
    # Configuration
    zilliz_endpoint = os.getenv('MILVUS_HOST')
    zilliz_api_key = os.getenv('MILVUS_API_KEY')
    s3_bucket = os.getenv('BACKUP_BUCKET', 'rag-modulo-zilliz-backups')
    backup_prefix = f"zilliz-backup-{datetime.now().strftime('%Y%m%d_%H%M%S')}/"

    # Initialize clients
    milvus_client = MilvusClient(uri=zilliz_endpoint, token=zilliz_api_key)
    s3_client = boto3.client('s3')

    # Get all collections
    collections = milvus_client.list_collections()

    for collection_name in collections:
        print(f"Backing up collection: {collection_name}")

        # Export collection data.
        # NOTE: export_collection is assumed to be a project-specific helper;
        # pymilvus does not ship a one-call export, so adapt this to your tooling.
        milvus_client.export_collection(
            collection_name=collection_name,
            output_path=f"/tmp/{collection_name}_backup.json"
        )

        # Upload to S3
        s3_key = f"{backup_prefix}{collection_name}_backup.json"
        s3_client.upload_file(
            f"/tmp/{collection_name}_backup.json",
            s3_bucket,
            s3_key,
            ExtraArgs={'ServerSideEncryption': 'AES256'}
        )

        # Clean up local file
        os.remove(f"/tmp/{collection_name}_backup.json")

        print(f"Collection {collection_name} backed up successfully")

    print("Zilliz Cloud backup completed successfully")


if __name__ == "__main__":
    backup_zilliz_collections()
```
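The script reads its connection details from the environment, so a run looks roughly like this (the script path and placeholder values are assumptions):

```bash
# Run the Zilliz Cloud backup (script path and placeholder values are assumptions)
export MILVUS_HOST="<zilliz-cloud-endpoint>"
export MILVUS_API_KEY="<zilliz-api-key>"
export BACKUP_BUCKET="rag-modulo-zilliz-backups"
python scripts/backup_zilliz.py
```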
### 4. Application Configuration Backups

#### Configuration Backup

```bash
#!/bin/bash
# Application configuration backup script
set -e

# Configuration
BACKUP_DIR="/backups/config"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="config_backup_${DATE}.tar.gz"
S3_BUCKET="rag-modulo-config-backups"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup configuration files
echo "Creating configuration backup..."
tar -czf "$BACKUP_DIR/$BACKUP_FILE" \
  deployment/terraform/ \
  deployment/ansible/ \
  .github/workflows/ \
  docker-compose*.yml \
  env.example

# Upload to S3
echo "Uploading configuration backup to S3..."
aws s3 cp "$BACKUP_DIR/$BACKUP_FILE" "s3://$S3_BUCKET/$BACKUP_FILE"

# Verify upload
echo "Verifying backup upload..."
aws s3 ls "s3://$S3_BUCKET/$BACKUP_FILE"

# Clean up local backup
echo "Cleaning up local backup..."
rm "$BACKUP_DIR/$BACKUP_FILE"

echo "Configuration backup completed successfully: $BACKUP_FILE"
```
## Disaster Recovery

### 1. Recovery Objectives (RTO and RPO)

The recovery time objective (RTO) is the maximum acceptable time to restore a component; the recovery point objective (RPO) is the maximum acceptable window of data loss.
| Component | RTO | RPO |
|---|---|---|
| PostgreSQL | 60 minutes | 15 minutes |
| Object Storage | 30 minutes | 5 minutes |
| Vector Database | 90 minutes | 30 minutes |
| Applications | 30 minutes | 0 minutes |
| Overall System | 60 minutes | 15 minutes |
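During recovery drills it is worth recording the measured recovery time against these targets. A minimal sketch, assuming a wrapper script that drives the procedures below (the script name is hypothetical):

```bash
# Compare a measured recovery run against the 60-minute overall RTO (sketch)
RTO_SECONDS=$((60 * 60))
START=$(date +%s)

./scripts/run_full_recovery.sh   # hypothetical wrapper around the recovery procedures below

END=$(date +%s)
ELAPSED=$((END - START))
echo "Recovery completed in $((ELAPSED / 60)) minutes"

if [ "$ELAPSED" -gt "$RTO_SECONDS" ]; then
    echo "RTO of 60 minutes exceeded" >&2
    exit 1
fi
```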
### 2. Recovery Procedures

#### PostgreSQL Recovery

```bash
#!/bin/bash
# PostgreSQL disaster recovery script
set -e

# Configuration
RESTORE_DATABASE_URL="$1"
BACKUP_FILE="$2"
S3_BUCKET="rag-modulo-postgresql-backups"

if [ -z "$RESTORE_DATABASE_URL" ] || [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <database_url> <backup_file>"
    exit 1
fi

# Download backup from S3
echo "Downloading backup from S3..."
aws s3 cp "s3://$S3_BUCKET/$BACKUP_FILE" "/tmp/$BACKUP_FILE"

# Decompress backup
echo "Decompressing backup..."
gunzip "/tmp/$BACKUP_FILE"
RESTORE_FILE="/tmp/${BACKUP_FILE%.gz}"

# Restore database
echo "Restoring PostgreSQL database..."
psql "$RESTORE_DATABASE_URL" < "$RESTORE_FILE"

# Verify restoration
echo "Verifying database restoration..."
psql "$RESTORE_DATABASE_URL" -c "SELECT COUNT(*) FROM information_schema.tables;"

# Clean up (RESTORE_FILE already contains the /tmp/ path)
echo "Cleaning up temporary files..."
rm "$RESTORE_FILE"

echo "PostgreSQL recovery completed successfully"
```
#### Object Storage Recovery

```bash
#!/bin/bash
# Object Storage disaster recovery script
set -e

# Configuration
RESTORE_BUCKET="$1"
BACKUP_PREFIX="$2"
S3_BUCKET="rag-modulo-backups"

if [ -z "$RESTORE_BUCKET" ] || [ -z "$BACKUP_PREFIX" ]; then
    echo "Usage: $0 <restore_bucket> <backup_prefix>"
    exit 1
fi

# Restore from backup
echo "Restoring Object Storage from backup..."
aws s3 sync "s3://$S3_BUCKET/$BACKUP_PREFIX" "s3://$RESTORE_BUCKET/"

# Verify restoration
echo "Verifying Object Storage restoration..."
aws s3 ls "s3://$RESTORE_BUCKET/" --recursive | wc -l

echo "Object Storage recovery completed successfully"
```
#### Vector Database Recovery

```python
#!/usr/bin/env python3
# Zilliz Cloud disaster recovery script
import os
import sys

import boto3
# The official SDK for Zilliz Cloud is pymilvus, which provides MilvusClient.
from pymilvus import MilvusClient


def restore_zilliz_collections(restore_endpoint, restore_api_key, backup_prefix):
    """Restore Zilliz Cloud collections from backup."""
    # Configuration
    s3_bucket = os.getenv('BACKUP_BUCKET', 'rag-modulo-zilliz-backups')

    # Initialize clients
    milvus_client = MilvusClient(uri=restore_endpoint, token=restore_api_key)
    s3_client = boto3.client('s3')

    # List backup files
    response = s3_client.list_objects_v2(
        Bucket=s3_bucket,
        Prefix=backup_prefix
    )

    for obj in response.get('Contents', []):
        collection_name = obj['Key'].split('/')[-1].replace('_backup.json', '')
        print(f"Restoring collection: {collection_name}")

        # Download backup file
        s3_client.download_file(
            s3_bucket,
            obj['Key'],
            f"/tmp/{collection_name}_restore.json"
        )

        # Import collection data.
        # NOTE: import_collection is assumed to be a project-specific helper;
        # pymilvus does not ship a one-call import, so adapt this to your tooling.
        milvus_client.import_collection(
            collection_name=collection_name,
            data_path=f"/tmp/{collection_name}_restore.json"
        )

        # Clean up local file
        os.remove(f"/tmp/{collection_name}_restore.json")

        print(f"Collection {collection_name} restored successfully")

    print("Zilliz Cloud recovery completed successfully")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python restore_zilliz.py <endpoint> <api_key> <backup_prefix>")
        sys.exit(1)
    restore_zilliz_collections(sys.argv[1], sys.argv[2], sys.argv[3])
```
### 3. Full System Recovery

#### Recovery Orchestration

```yaml
# Full system recovery playbook
---
- name: RAG Modulo Disaster Recovery
  hosts: localhost
  gather_facts: false

  vars:
    # Defaults; override at run time with -e, e.g. -e recovery_region=us-east
    recovery_region: "us-east"
    backup_date: "latest"
    recovery_environment: "production"

  tasks:
    - name: Validate recovery parameters
      ansible.builtin.assert:
        that:
          - recovery_region is defined
          - backup_date is defined
          - recovery_environment is defined
        fail_msg: "Recovery parameters are not defined"

    - name: Set up recovery environment
      ansible.builtin.shell: |
        ibmcloud target -r "{{ recovery_region }}"
        ibmcloud target -g "{{ resource_group_id }}"
      environment:
        IBMCLOUD_API_KEY: "{{ ibmcloud_api_key }}"

    - name: Provision recovery infrastructure
      ansible.builtin.shell: |
        cd deployment/terraform/environments/ibm
        terraform init
        terraform plan -var-file="recovery.tfvars"
        terraform apply -var-file="recovery.tfvars" -auto-approve
      environment:
        IBMCLOUD_API_KEY: "{{ ibmcloud_api_key }}"

    - name: Restore PostgreSQL database
      ansible.builtin.shell: |
        ./scripts/restore_postgresql.sh "{{ postgresql_url }}" "{{ backup_date }}"

    - name: Restore Object Storage
      ansible.builtin.shell: |
        ./scripts/restore_object_storage.sh "{{ object_storage_bucket }}" "{{ backup_date }}"

    - name: Restore Vector Database
      ansible.builtin.shell: |
        python scripts/restore_zilliz.py "{{ zilliz_endpoint }}" "{{ zilliz_api_key }}" "{{ backup_date }}"

    - name: Deploy applications
      ansible.builtin.shell: |
        ansible-playbook -i inventories/ibm/hosts.yml playbooks/deploy-rag-modulo.yml -e "environment={{ recovery_environment }}"

    - name: Verify recovery
      ansible.builtin.shell: |
        curl -f "https://{{ frontend_url }}/health" || exit 1
        curl -f "https://{{ backend_url }}/health" || exit 1
```
## Testing

### 1. Backup Testing

#### Automated Backup Testing

```bash
#!/bin/bash
# Automated backup testing script
set -e

# Configuration
TEST_DATABASE_URL="$1"
BACKUP_FILE="$2"
S3_BUCKET="rag-modulo-postgresql-backups"

if [ -z "$TEST_DATABASE_URL" ] || [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <test_database_url> <backup_file>"
    exit 1
fi

# Create test database
# createdb/dropdb take a database name, so derive it from the URL;
# connection settings are expected to come from PG* environment variables.
TEST_DB_NAME="${TEST_DATABASE_URL##*/}"
echo "Creating test database..."
createdb "$TEST_DB_NAME"

# Download and restore backup
echo "Testing backup restoration..."
aws s3 cp "s3://$S3_BUCKET/$BACKUP_FILE" "/tmp/$BACKUP_FILE"
gunzip "/tmp/$BACKUP_FILE"
RESTORE_FILE="/tmp/${BACKUP_FILE%.gz}"
psql "$TEST_DATABASE_URL" < "$RESTORE_FILE"

# Verify backup integrity
echo "Verifying backup integrity..."
TABLE_COUNT=$(psql "$TEST_DATABASE_URL" -t -c "SELECT COUNT(*) FROM information_schema.tables;")
echo "Tables found: $TABLE_COUNT"

if [ "$TABLE_COUNT" -gt 0 ]; then
    echo "✅ Backup test passed"
else
    echo "❌ Backup test failed"
    exit 1
fi

# Clean up
echo "Cleaning up test database..."
dropdb "$TEST_DB_NAME"
rm "$RESTORE_FILE"

echo "Backup testing completed successfully"
```
#### Weekly Backup Testing

```yaml
# Weekly backup testing schedule
backup_testing:
  schedule: "0 3 * * 0"  # Every Sunday at 3 AM

  tests:
    - name: "postgresql_backup_test"
      script: "scripts/test_postgresql_backup.sh"
      timeout: "30m"
    - name: "object_storage_backup_test"
      script: "scripts/test_object_storage_backup.sh"
      timeout: "15m"
    - name: "zilliz_backup_test"
      script: "scripts/test_zilliz_backup.sh"
      timeout: "45m"
```
### 2. Disaster Recovery Testing

#### Quarterly DR Drills

```yaml
# Quarterly disaster recovery testing
dr_testing:
  schedule: "0 2 1 */3 *"  # First day of every quarter at 2 AM

  tests:
    - name: "full_system_recovery"
      script: "scripts/test_full_system_recovery.sh"
      timeout: "2h"
    - name: "database_recovery"
      script: "scripts/test_database_recovery.sh"
      timeout: "1h"
    - name: "application_recovery"
      script: "scripts/test_application_recovery.sh"
      timeout: "30m"
```

#### DR Test Script

```bash
#!/bin/bash
# Disaster recovery test script
set -e

# Configuration
TEST_ENVIRONMENT="dr-test"
TEST_REGION="us-east"
BACKUP_DATE="latest"

echo "Starting disaster recovery test..."

# Create test environment
echo "Creating test environment..."
ibmcloud target -r "$TEST_REGION"
ibmcloud target -g "$RESOURCE_GROUP_ID"

# Run recovery playbook
echo "Running disaster recovery playbook..."
ansible-playbook -i inventories/ibm/hosts.yml playbooks/disaster-recovery.yml \
  -e "recovery_region=$TEST_REGION" \
  -e "backup_date=$BACKUP_DATE" \
  -e "recovery_environment=$TEST_ENVIRONMENT"

# Test application functionality
echo "Testing application functionality..."
curl -f "https://$TEST_ENVIRONMENT-frontend.example.com/health" || exit 1
curl -f "https://$TEST_ENVIRONMENT-backend.example.com/health" || exit 1

# Test data integrity
echo "Testing data integrity..."
python scripts/test_data_integrity.py "$TEST_ENVIRONMENT"

# Clean up test environment
echo "Cleaning up test environment..."
ibmcloud ce project delete "$TEST_ENVIRONMENT" --force

echo "✅ Disaster recovery test completed successfully"
```
## Monitoring and Alerting

### 1. Backup Monitoring

#### Backup Status Alerts

```yaml
# Backup monitoring alerts
backup_alerts:
  - name: "backup_failed"
    condition: "backup_status == 'failed'"
    severity: "critical"
    description: "Backup process failed"

  - name: "backup_delayed"
    condition: "backup_delay > 2h"
    severity: "warning"
    description: "Backup is delayed by more than 2 hours"

  - name: "backup_size_anomaly"
    condition: "backup_size < 0.5 * avg_backup_size OR backup_size > 2 * avg_backup_size"
    severity: "warning"
    description: "Backup size is significantly different from average"
```
#### Backup Health Checks

```bash
#!/bin/bash
# Backup health check script
set -e

# Configuration
S3_BUCKET="rag-modulo-postgresql-backups"
EXPECTED_BACKUPS=7  # 7 days of backups

# Check backup count
echo "Checking backup count..."
BACKUP_COUNT=$(aws s3 ls "s3://$S3_BUCKET/" --recursive | wc -l)

if [ "$BACKUP_COUNT" -lt "$EXPECTED_BACKUPS" ]; then
    echo "❌ Insufficient backups found: $BACKUP_COUNT (expected: $EXPECTED_BACKUPS)"
    exit 1
fi

# Check latest backup
echo "Checking latest backup..."
LATEST_BACKUP=$(aws s3 ls "s3://$S3_BUCKET/" --recursive | sort | tail -1 | awk '{print $1, $2}')
echo "Latest backup: $LATEST_BACKUP"

# Check backup age
echo "Checking backup age..."
BACKUP_AGE=$(aws s3 ls "s3://$S3_BUCKET/" --recursive | sort | tail -1 | awk '{print $1, $2}' | xargs -I {} date -d {} +%s)
CURRENT_TIME=$(date +%s)
AGE_HOURS=$(( (CURRENT_TIME - BACKUP_AGE) / 3600 ))

if [ "$AGE_HOURS" -gt 25 ]; then
    echo "❌ Latest backup is too old: $AGE_HOURS hours"
    exit 1
fi

echo "✅ Backup health check passed"
```
### 2. Recovery Monitoring

#### Recovery Time Monitoring

```yaml
# Recovery time monitoring
recovery_monitoring:
  - name: "recovery_time_exceeded"
    condition: "recovery_time > 60m"
    severity: "critical"
    description: "Recovery time exceeded RTO of 60 minutes"

  - name: "data_loss_detected"
    condition: "data_loss > 15m"
    severity: "critical"
    description: "Data loss exceeds RPO of 15 minutes"
```
## Documentation

### 1. Recovery Procedures

#### Emergency Contact List

```yaml
# Emergency contact list
emergency_contacts:
  primary:
    - name: "DevOps Team"
      phone: "+1-555-0123"
      email: "devops@company.com"
      slack: "#devops-alerts"

  secondary:
    - name: "Engineering Manager"
      phone: "+1-555-0124"
      email: "eng-manager@company.com"
      slack: "#engineering"

  escalation:
    - name: "CTO"
      phone: "+1-555-0125"
      email: "cto@company.com"
      slack: "#executive"
```
#### Recovery Checklist

```markdown
# Disaster Recovery Checklist

## Immediate Response (0-15 minutes)

- [ ] Assess the scope of the disaster
- [ ] Notify emergency contacts
- [ ] Activate incident response team
- [ ] Document initial assessment

## Assessment Phase (15-30 minutes)

- [ ] Identify affected systems
- [ ] Determine root cause
- [ ] Estimate recovery time
- [ ] Communicate status to stakeholders

## Recovery Phase (30-60 minutes)

- [ ] Activate disaster recovery environment
- [ ] Restore database from latest backup
- [ ] Restore object storage data
- [ ] Restore vector database
- [ ] Deploy applications
- [ ] Verify system functionality

## Validation Phase (60-90 minutes)

- [ ] Test critical functionality
- [ ] Verify data integrity
- [ ] Monitor system performance
- [ ] Document recovery process

## Post-Recovery (90+ minutes)

- [ ] Conduct post-incident review
- [ ] Update recovery procedures
- [ ] Communicate resolution to stakeholders
- [ ] Schedule follow-up actions
```
### 2. Runbooks

#### Database Recovery Runbook

````markdown
# PostgreSQL Database Recovery Runbook

## Prerequisites

- Access to IBM Cloud console
- Database backup files in S3
- Recovery environment provisioned

## Recovery Steps

### 1. Access Recovery Environment

```bash
ibmcloud target -r us-east
ibmcloud target -g production-resource-group
```

### 2. Provision Database

### 3. Restore Database

### 4. Verify Restoration

### 5. Test Connectivity

## Troubleshooting

- If restoration fails, try previous backup
- Check database logs for errors
- Verify network connectivity
- Contact database team if needed
````
## Best Practices
### 1. Backup Strategy
- 3-2-1 Rule: 3 copies, 2 different media, 1 off-site
- Regular Testing: Test backups weekly
- Automation: Automate all backup processes
- Monitoring: Monitor backup success/failure
### 2. Recovery Planning
- Documentation: Maintain up-to-date procedures
- Training: Regular team training on procedures
- Testing: Quarterly disaster recovery drills
- Communication: Clear communication protocols
### 3. Data Protection
- Encryption: Encrypt all backups
- Access Control: Limit backup access
- Retention: Appropriate retention policies
- Compliance: Meet regulatory requirements
### 4. Continuous Improvement
- Post-Incident Reviews: Learn from incidents
- Procedure Updates: Regular procedure updates
- Technology Updates: Stay current with technology
- Team Training: Ongoing team education