Monitoring and Observability

This guide covers monitoring and observability strategies for RAG Modulo deployments on IBM Cloud, providing comprehensive visibility into application performance, infrastructure health, and operational metrics.

Overview

The monitoring and observability strategy provides:

  • Application Performance Monitoring (APM): Real-time application metrics and traces
  • Infrastructure Monitoring: Resource utilization and health status
  • Log Management: Centralized logging and analysis
  • Alerting: Proactive notification of issues
  • Dashboards: Visual representation of system health

Architecture

graph TB
    subgraph "Applications"
        BE[Backend App]
        FE[Frontend App]
    end

    subgraph "IBM Cloud Monitoring"
        APM[Application Performance Monitoring]
        LOG[Log Analysis]
        MET[Monitoring]
        ALERT[Alerting]
    end

    subgraph "External Tools"
        GRAF[Grafana]
        PROM[Prometheus]
        ELK[ELK Stack]
    end

    subgraph "Data Sources"
        METRICS[Application Metrics]
        LOGS[Application Logs]
        TRACES[Distributed Traces]
        EVENTS[Events]
    end

    BE --> METRICS
    BE --> LOGS
    BE --> TRACES
    FE --> METRICS
    FE --> LOGS

    METRICS --> APM
    LOGS --> LOG
    TRACES --> APM
    EVENTS --> MET

    APM --> GRAF
    LOG --> ELK
    MET --> PROM
    ALERT --> GRAF

IBM Cloud Monitoring Services

1. Application Performance Monitoring

Features

  • Real-time Metrics: CPU, memory, response time, throughput
  • Distributed Tracing: Request flow across services
  • Error Tracking: Exception monitoring and alerting
  • Custom Metrics: Application-specific metrics
  • Alerting: Threshold-based notifications

Configuration

# Application monitoring configuration
monitoring:
  enabled: true
  service: "ibm-cloud-monitoring"
  plan: "lite"
  region: "us-south"

  # Custom metrics
  custom_metrics:
    - name: "rag_queries_total"
      type: "counter"
      description: "Total number of RAG queries"
    - name: "rag_query_duration_seconds"
      type: "histogram"
      description: "RAG query processing time"
    - name: "vector_search_duration_seconds"
      type: "histogram"
      description: "Vector search processing time"

  # Alerting rules
  alerts:
    - name: "high_error_rate"
      condition: "error_rate > 0.05"
      duration: "5m"
      severity: "critical"
    - name: "high_response_time"
      condition: "response_time_p95 > 2.0"
      duration: "10m"
      severity: "warning"

2. Log Analysis

Features

  • Centralized Logging: All application logs in one place
  • Log Search: Full-text search and filtering
  • Log Analytics: AI-powered log analysis
  • Retention: Configurable log retention periods
  • Export: Log export for external analysis

Configuration

# Log analysis configuration
log_analysis:
  enabled: true
  service: "ibm-cloud-log-analysis"
  plan: "lite"
  region: "us-south"

  # Log sources
  sources:
    - name: "backend-logs"
      type: "application"
      app: "rag-modulo-backend"
    - name: "frontend-logs"
      type: "application"
      app: "rag-modulo-frontend"
    - name: "system-logs"
      type: "system"
      level: "info"

  # Retention policies
  retention:
    default: "30d"
    critical: "90d"
    debug: "7d"

  # Log parsing rules
  parsing:
    - name: "error_logs"
      pattern: "ERROR.*"
      fields: ["timestamp", "level", "message", "stack_trace"]
    - name: "access_logs"
      pattern: "GET|POST|PUT|DELETE.*"
      fields: ["timestamp", "method", "path", "status", "duration"]

3. Infrastructure Monitoring

Features

  • Resource Metrics: CPU, memory, storage, network
  • Service Health: Health checks and status monitoring
  • Capacity Planning: Resource usage trends
  • Cost Monitoring: Resource cost tracking
  • Automated Scaling: Trigger scaling based on metrics

Configuration

# Infrastructure monitoring configuration
infrastructure_monitoring:
  enabled: true
  service: "ibm-cloud-monitoring"
  plan: "lite"
  region: "us-south"

  # Monitored resources
  resources:
    - name: "code-engine-project"
      type: "code_engine"
      metrics: ["cpu_usage", "memory_usage", "request_count"]
    - name: "postgresql-database"
      type: "database"
      metrics: ["connection_count", "query_duration", "storage_usage"]
    - name: "object-storage"
      type: "storage"
      metrics: ["storage_usage", "request_count", "data_transfer"]

  # Alerting thresholds
  thresholds:
    cpu_usage: 80
    memory_usage: 85
    storage_usage: 90
    error_rate: 5
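
The Automated Scaling feature listed above is enforced by Code Engine itself rather than by the monitoring stack: scaling bounds and per-instance concurrency are set on the application, while the thresholds in this configuration drive alerting on the same signals. A sketch of the corresponding Code Engine CLI call, assuming the standard autoscaling flags and using illustrative values:

# Configure Code Engine autoscaling for the backend (illustrative values)
ibmcloud ce app update --name rag-modulo-backend \
  --min-scale 1 \
  --max-scale 10 \
  --concurrency 100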

Application Metrics

1. Backend Metrics

Custom Metrics

# Backend metrics implementation
from prometheus_client import Counter, Histogram, Gauge
import time

# Request metrics
request_count = Counter('rag_requests_total', 'Total RAG requests', ['method', 'endpoint'])
request_duration = Histogram('rag_request_duration_seconds', 'Request duration', ['method', 'endpoint'])

# RAG-specific metrics
rag_queries_total = Counter('rag_queries_total', 'Total RAG queries', ['collection', 'status'])
rag_query_duration = Histogram('rag_query_duration_seconds', 'RAG query duration', ['collection'])
vector_search_duration = Histogram('vector_search_duration_seconds', 'Vector search duration', ['collection'])
embedding_duration = Histogram('embedding_duration_seconds', 'Embedding generation duration')

# Resource metrics
active_connections = Gauge('active_connections', 'Active database connections')
cache_hit_rate = Gauge('cache_hit_rate', 'Cache hit rate')
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')

# Error metrics
error_count = Counter('errors_total', 'Total errors', ['error_type', 'endpoint'])
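
These counters and histograms only produce data once they are wired into the request path and exposed for scraping. A minimal sketch of how that might be done with a FastAPI middleware and a /metrics endpoint, assuming the app object from the API and the standard prometheus_client exposition helpers:

# Recording and exposing the metrics above (sketch)
from fastapi import Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # Time every request and label it by method and path
    with request_duration.labels(request.method, request.url.path).time():
        response = await call_next(request)
    request_count.labels(request.method, request.url.path).inc()
    if response.status_code >= 500:
        error_count.labels("http_5xx", request.url.path).inc()
    return response

@app.get("/metrics")
async def metrics_endpoint():
    # Exposition endpoint for a Prometheus-compatible scraper
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)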

Health Check Endpoint

# Health check implementation
from datetime import datetime

@app.get("/health")
async def health_check():
    """Health check endpoint for monitoring"""
    try:
        # Check database connectivity
        db_status = await check_database_connection()

        # Check vector database connectivity
        vector_status = await check_vector_database_connection()

        # Check object storage connectivity
        storage_status = await check_object_storage_connection()

        # Overall health status
        overall_status = "healthy" if all([db_status, vector_status, storage_status]) else "unhealthy"

        return {
            "status": overall_status,
            "timestamp": datetime.utcnow().isoformat(),
            "checks": {
                "database": db_status,
                "vector_database": vector_status,
                "object_storage": storage_status
            }
        }
    except Exception as e:
        return {
            "status": "unhealthy",
            "timestamp": datetime.utcnow().isoformat(),
            "error": str(e)
        }
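
The health check above relies on helper probes (check_database_connection and friends) that are not defined in this guide. One possible shape for such a probe, assuming asyncpg and a DATABASE_URL setting, is sketched below:

# Hypothetical connectivity probe assumed by the health check above
import asyncpg

async def check_database_connection() -> bool:
    """Return True if a trivial query succeeds within a short timeout."""
    try:
        conn = await asyncpg.connect(DATABASE_URL, timeout=5)  # DATABASE_URL is an assumed setting
        try:
            await conn.execute("SELECT 1")
        finally:
            await conn.close()
        return True
    except Exception:
        return False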

2. Frontend Metrics

Performance Metrics

// Frontend metrics implementation
class MetricsCollector {
  constructor() {
    this.metrics = {
      pageLoadTime: new Map(),
      apiCallDuration: new Map(),
      errorCount: 0,
      userInteractions: 0
    };
  }

  // Track page load time
  trackPageLoad(pageName, loadTime) {
    this.metrics.pageLoadTime.set(pageName, loadTime);
    this.sendMetric('page_load_time', { page: pageName }, loadTime);
  }

  // Track API call duration
  trackApiCall(endpoint, duration, status) {
    this.metrics.apiCallDuration.set(endpoint, { duration, status });
    this.sendMetric('api_call_duration', { endpoint, status }, duration);
  }

  // Track errors
  trackError(error, context) {
    this.metrics.errorCount++;
    this.sendMetric('error_count', { error: error.message, context }, 1);
  }

  // Track user interactions
  trackUserInteraction(action, element) {
    this.metrics.userInteractions++;
    this.sendMetric('user_interaction', { action, element }, 1);
  }

  // Send metric to backend
  async sendMetric(name, labels, value) {
    try {
      await fetch('/api/metrics', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ name, labels, value, timestamp: Date.now() })
      });
    } catch (error) {
      console.error('Failed to send metric:', error);
    }
  }
}

// Initialize metrics collector
const metrics = new MetricsCollector();

// Track page load time
window.addEventListener('load', () => {
  const loadTime = performance.timing.loadEventEnd - performance.timing.navigationStart;
  metrics.trackPageLoad(window.location.pathname, loadTime);
});

// Track API calls
const originalFetch = window.fetch;
window.fetch = async (...args) => {
  // The first argument may be a URL string or a Request object
  const endpoint = typeof args[0] === 'string' ? args[0] : args[0].url || String(args[0]);
  const start = performance.now();
  try {
    const response = await originalFetch(...args);
    const duration = performance.now() - start;
    metrics.trackApiCall(endpoint, duration, response.status);
    return response;
  } catch (error) {
    const duration = performance.now() - start;
    metrics.trackApiCall(endpoint, duration, 'error');
    throw error;
  }
};
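
The collector posts each metric to /api/metrics, an endpoint this guide does not define elsewhere. A minimal receiving endpoint on the backend might look like the sketch below; the field names mirror the JSON body built in sendMetric, and mapping each metric onto a Prometheus series is left out:

# Hypothetical receiver for the frontend metrics POSTs (sketch)
import logging
from pydantic import BaseModel

class FrontendMetric(BaseModel):
    name: str
    labels: dict = {}
    value: float
    timestamp: int

@app.post("/api/metrics")
async def ingest_frontend_metric(metric: FrontendMetric):
    # Placeholder handling: a real implementation would update a Prometheus
    # counter/gauge/histogram keyed by metric.name instead of just logging it
    logging.getLogger("frontend_metrics").info(
        "metric=%s labels=%s value=%s", metric.name, metric.labels, metric.value)
    return {"status": "accepted"}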

Dashboards

1. Application Dashboard

Key Metrics

  • Request Rate: Requests per second
  • Response Time: Average and 95th percentile response time
  • Error Rate: Percentage of failed requests
  • Active Users: Concurrent active users
  • Resource Usage: CPU and memory utilization

Grafana Configuration

{
  "dashboard": {
    "title": "RAG Modulo Application Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(rag_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(rag_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(rag_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(errors_total[5m]) / rate(rag_requests_total[5m]) * 100",
            "legendFormat": "Error Rate %"
          }
        ]
      }
    ]
  }
}

2. Infrastructure Dashboard

Key Metrics

  • Resource Utilization: CPU, memory, storage usage
  • Service Health: Health check status
  • Cost Tracking: Resource costs over time
  • Scaling Events: Auto-scaling activities

Grafana Configuration

{
  "dashboard": {
    "title": "RAG Modulo Infrastructure Dashboard",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total[5m]) * 100",
            "legendFormat": "{{container}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "container_memory_usage_bytes / container_spec_memory_limit_bytes * 100",
            "legendFormat": "{{container}}"
          }
        ]
      },
      {
        "title": "Service Health",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=\"rag-modulo-backend\"}",
            "legendFormat": "Backend"
          },
          {
            "expr": "up{job=\"rag-modulo-frontend\"}",
            "legendFormat": "Frontend"
          }
        ]
      }
    ]
  }
}

Alerting

1. Alert Rules

Critical Alerts

# Critical alert rules
critical_alerts:
  - name: "high_error_rate"
    condition: "rate(errors_total[5m]) / rate(rag_requests_total[5m]) > 0.05"
    duration: "5m"
    severity: "critical"
    description: "Error rate is above 5%"

  - name: "high_response_time"
    condition: "histogram_quantile(0.95, rate(rag_request_duration_seconds_bucket[5m])) > 2.0"
    duration: "10m"
    severity: "critical"
    description: "95th percentile response time is above 2 seconds"

  - name: "service_down"
    condition: "up{job=\"rag-modulo-backend\"} == 0"
    duration: "1m"
    severity: "critical"
    description: "Backend service is down"

  - name: "high_cpu_usage"
    condition: "rate(container_cpu_usage_seconds_total[5m]) * 100 > 80"
    duration: "5m"
    severity: "critical"
    description: "CPU usage is above 80%"

Warning Alerts

# Warning alert rules
warning_alerts:
  - name: "high_memory_usage"
    condition: "container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 85"
    duration: "10m"
    severity: "warning"
    description: "Memory usage is above 85%"

  - name: "low_cache_hit_rate"
    condition: "cache_hit_rate < 0.8"
    duration: "15m"
    severity: "warning"
    description: "Cache hit rate is below 80%"

  - name: "high_database_connections"
    condition: "active_connections > 80"
    duration: "5m"
    severity: "warning"
    description: "Database connection count is high"

2. Notification Channels

Email Notifications

# Email notification configuration
email_notifications:
  enabled: true
  smtp_server: "smtp.gmail.com"
  smtp_port: 587
  username: "alerts@company.com"
  password: "{{ email_password }}"
  recipients:
    - "devops@company.com"
    - "oncall@company.com"

Slack Notifications

# Slack notification configuration
slack_notifications:
  enabled: true
  webhook_url: "{{ slack_webhook_url }}"
  channel: "#alerts"
  username: "RAG Modulo Monitor"
  icon_emoji: ":warning:"

PagerDuty Integration

# PagerDuty integration
pagerduty:
  enabled: true
  integration_key: "{{ pagerduty_integration_key }}"
  escalation_policy: "rag-modulo-escalation"
  severity_mapping:
    critical: "P1"
    warning: "P2"
    info: "P3"

Log Management

1. Log Collection

Application Logs

# Structured logging configuration
import logging
import json
from datetime import datetime

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)

        # Create formatter
        formatter = logging.Formatter('%(message)s')

        # Create handler
        handler = logging.StreamHandler()
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def log(self, level, message, **kwargs):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level.upper(),
            "message": message,
            "service": "rag-modulo-backend",
            **kwargs
        }
        # Emit at the requested severity so standard level filtering still applies
        self.logger.log(getattr(logging, level.upper(), logging.INFO), json.dumps(log_entry))

# Usage
logger = StructuredLogger(__name__)

# Log request
logger.log("info", "Request received",
          method="GET",
          path="/api/search",
          user_id="12345",
          request_id="req-123")

# Log error
logger.log("error", "Database connection failed",
          error="Connection timeout",
          database="postgresql",
          retry_count=3)

Access Logs

# Access log middleware
import time

from fastapi import Request

@app.middleware("http")
async def access_log_middleware(request: Request, call_next):
    start_time = time.time()

    # Process request
    response = await call_next(request)

    # Calculate duration
    duration = time.time() - start_time

    # Log access
    logger.log("info", "Request completed",
              method=request.method,
              path=request.url.path,
              status_code=response.status_code,
              duration=duration,
              user_agent=request.headers.get("user-agent"),
              ip_address=request.client.host)

    return response

2. Log Analysis

Error Analysis

# Error analysis queries
error_analysis_queries = {
    "error_rate_by_endpoint": """
        SELECT
            endpoint,
            COUNT(*) as error_count,
            COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as error_percentage
        FROM logs
        WHERE level = 'ERROR'
        AND timestamp >= NOW() - INTERVAL '1 hour'
        GROUP BY endpoint
        ORDER BY error_count DESC
    """,

    "error_trends": """
        SELECT
            DATE_TRUNC('hour', timestamp) as hour,
            COUNT(*) as error_count
        FROM logs
        WHERE level = 'ERROR'
        AND timestamp >= NOW() - INTERVAL '24 hours'
        GROUP BY hour
        ORDER BY hour
    """,

    "top_errors": """
        SELECT
            message,
            COUNT(*) as count,
            MAX(timestamp) as last_occurrence
        FROM logs
        WHERE level = 'ERROR'
        AND timestamp >= NOW() - INTERVAL '1 hour'
        GROUP BY message
        ORDER BY count DESC
        LIMIT 10
    """
}

Performance Analysis

# Performance analysis queries
performance_analysis_queries = {
    "slow_queries": """
        SELECT
            endpoint,
            AVG(duration) as avg_duration,
            MAX(duration) as max_duration,
            COUNT(*) as request_count
        FROM logs
        WHERE duration > 1.0
        AND timestamp >= NOW() - INTERVAL '1 hour'
        GROUP BY endpoint
        ORDER BY avg_duration DESC
    """,

    "response_time_trends": """
        SELECT
            DATE_TRUNC('minute', timestamp) as minute,
            AVG(duration) as avg_duration,
            PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) as p95_duration
        FROM logs
        WHERE timestamp >= NOW() - INTERVAL '1 hour'
        GROUP BY minute
        ORDER BY minute
    """
}

Troubleshooting

Common Issues

1. High Error Rate

Symptoms:

  • Error rate above 5%
  • Increased user complaints
  • Service degradation

Investigation:

# Check error logs
ibmcloud ce app logs rag-modulo-backend --tail 100 | grep ERROR

# Check error trends (URL-encode the PromQL so curl accepts the brackets)
curl -G "https://monitoring-endpoint/api/query" --data-urlencode 'query=rate(errors_total[5m])'

# Check specific errors
curl -G "https://monitoring-endpoint/api/query" --data-urlencode 'query=topk(10, count by (error_type) (errors_total))'

Solutions:

  • Check application logs for specific errors
  • Verify database connectivity
  • Check resource utilization
  • Review recent deployments

2. High Response Time

Symptoms:

  • Response time above 2 seconds
  • User experience degradation
  • Timeout errors

Investigation:

# Check response time metrics (URL-encode the PromQL so curl accepts the brackets)
curl -G "https://monitoring-endpoint/api/query" --data-urlencode 'query=histogram_quantile(0.95, rate(rag_request_duration_seconds_bucket[5m]))'

# Check resource utilization
curl -G "https://monitoring-endpoint/api/query" --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])'

# Check database performance
curl -G "https://monitoring-endpoint/api/query" --data-urlencode 'query=rate(database_query_duration_seconds[5m])'

Solutions:

  • Scale up application resources (see the example below)
  • Optimize database queries
  • Check for resource bottlenecks
  • Review application performance
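
For the first solution, resources can be raised on the running application with the Code Engine CLI; the flags and values below are illustrative and should match an allowed CPU/memory combination:

# Scale up backend resources (illustrative values)
ibmcloud ce app update --name rag-modulo-backend --cpu 2 --memory 4G

# Allow more instances if throughput, not per-request latency, is the bottleneck
ibmcloud ce app update --name rag-modulo-backend --max-scale 20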

3. Service Unavailable

Symptoms:

  • Service returns 503 errors
  • Health checks failing
  • Complete service outage

Investigation:

# Check service status
ibmcloud ce app get rag-modulo-backend

# Check health endpoint
curl "https://backend-app.example.com/health"

# Check application logs
ibmcloud ce app logs rag-modulo-backend --tail 100

Solutions:

  • Restart application
  • Check resource limits
  • Verify service bindings
  • Review error logs

Debug Commands

# Check application status
ibmcloud ce app get rag-modulo-backend --output json

# View application logs
ibmcloud ce app logs rag-modulo-backend --follow

# Check resource utilization
ibmcloud ce app get rag-modulo-backend --output json | jq '.spec.template.spec.containers[0].resources'

# Check environment variables
ibmcloud ce app get rag-modulo-backend --output json | jq '.spec.template.spec.containers[0].env'

# Check service bindings
ibmcloud ce app get rag-modulo-backend --output json | jq '.spec.template.spec.serviceBindings'

Best Practices

1. Monitoring

  • Set up comprehensive monitoring from day one
  • Use appropriate alert thresholds
  • Implement proper escalation procedures
  • Review monitoring effectiveness regularly

2. Logging

  • Use structured logging with consistent format
  • Include relevant context in log messages
  • Implement proper log levels
  • Analyze and clean up logs regularly

3. Alerting

  • Set up alerts for critical issues
  • Avoid alert fatigue with appropriate thresholds
  • Test alerting procedures regularly
  • Document alert response procedures

4. Dashboards

  • Create meaningful dashboards for different audiences
  • Keep dashboards up to date
  • Use appropriate visualization types
  • Review and optimize dashboards regularly