Monitoring & Operations

Health endpoints, logging configuration, key metrics, production troubleshooting, and backup strategies.

This guide covers operational aspects of running the MCP Hub Platform in production: health monitoring, logging, key metrics to track, troubleshooting common production issues, and backup procedures.

Health Endpoints

Every MCP Hub Platform service exposes health check endpoints.

Endpoint Summary

| Service | Endpoint | Expected Response | Method |
| --- | --- | --- | --- |
| Hub Web | /health | 200 OK with JSON body | GET |
| Registry | /healthz | 200 OK | GET |
| Scan Worker | /healthz (port 8083) | 200 OK | GET |
| PostgreSQL | pg_isready -U mcphub | exit code 0 | CLI |
| Redis | redis-cli -p 6390 ping | PONG | CLI |
| MinIO | /minio/health/live | 200 OK | GET |
| LavinMQ | lavinmqctl status | exit code 0 | CLI |

Hub Web Health Response

{
  "status": "healthy",
  "version": "1.0.0",
  "commit": "abc1234",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "amqp": "ok",
    "s3": "ok"
  }
}

If any dependency is down, the response changes to:

{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "error: connection refused",
    "amqp": "ok",
    "s3": "ok"
  }
}

The HTTP status code is 200 when healthy, 503 when degraded or unhealthy.

Monitoring with curl

# Hub web server
curl -sf http://localhost:8080/health | jq .

# Registry
curl -sf http://localhost:8081/healthz

# Scan worker
curl -sf http://localhost:8083/healthz

# MinIO
curl -sf http://localhost:9000/minio/health/live
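
The CLI checks from the endpoint summary can be run the same way through docker compose exec; a sketch, assuming the compose services are named postgres, redis, and lavinmq:

# PostgreSQL (exit code 0 when accepting connections)
docker compose -f docker-compose.local.yml exec postgres pg_isready -U mcphub

# Redis (prints PONG)
docker compose -f docker-compose.local.yml exec redis redis-cli -p 6390 ping

# LavinMQ (exit code 0 when the broker is up)
docker compose -f docker-compose.local.yml exec lavinmq lavinmqctl status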

Monitoring with Docker Compose

# Check all service health statuses
docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"

# Watch for health changes
watch -n 5 'docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"'

Logging

Log Format

All application services support structured JSON logging when configured:

{
  "level": "info",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "msg": "analysis job completed",
  "service": "scan-worker",
  "job_id": "job-abc123",
  "package": "acme/[email protected]",
  "duration_ms": 4523,
  "findings": 7
}
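
Because each line is a self-contained JSON object, logs can be filtered directly with jq; a sketch using the field names from the example above:

# Keep only error-level lines from the scan worker
# (-R plus fromjson? skips any non-JSON lines)
docker compose -f docker-compose.local.yml logs --no-log-prefix scan-worker \
  | jq -cR 'fromjson? | select(.level == "error")'

# Find analysis jobs that took longer than a minute
docker compose -f docker-compose.local.yml logs --no-log-prefix scan-worker \
  | jq -cR 'fromjson? | select((.duration_ms // 0) > 60000)'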

Log Levels

| Level | Description | Recommended Use |
| --- | --- | --- |
| debug | Detailed diagnostic information | Local development, troubleshooting |
| info | Normal operational messages | Production (default) |
| warn | Warning conditions that may need attention | Production |
| error | Error conditions that require investigation | Always enabled |

Configure the log level via the LOG_LEVEL environment variable:

LOG_LEVEL=debug  # Maximum verbosity
LOG_LEVEL=info   # Standard production level
LOG_LEVEL=warn   # Reduced verbosity, warnings and errors only
LOG_LEVEL=error  # Errors only

Log Aggregation

For production, aggregate logs from all services into a centralized system.

Docker Compose (Loki example):

services:
  hub-web:
    logging:
      driver: loki
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
        loki-batch-size: "400"
        labels: "service=hub-web"

Kubernetes (stdout/stderr):

All services log to stdout/stderr. Use your cluster’s log aggregation pipeline (Fluentd, Fluent Bit, Promtail) to collect and forward logs.
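
The same jq filtering shown above works in Kubernetes; a sketch, assuming a scan-worker Deployment in the mcp-hub namespace:

# Tail recent scan-worker logs and keep only errors
kubectl logs -n mcp-hub deploy/scan-worker --tail=200 \
  | jq -cR 'fromjson? | select(.level == "error")'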

Sensitive Data

All services redact sensitive information from logs:

  • Authentication tokens are never logged in full
  • Database passwords are masked
  • S3 credentials are redacted
  • User PII appears only at debug level, and then only as identifiers, never full values

Key Metrics

Pipeline Throughput

| Metric | What to Monitor | Healthy Range |
| --- | --- | --- |
| Ingestion rate | Jobs entering the pipeline per hour | Depends on usage |
| Analysis completion time | Time from ANALYZE to ANALYZE_COMPLETE | Fast: < 30s, Deep: < 5m |
| Analysis queue depth | Number of pending ANALYZE jobs | < 10 (scale workers if consistently higher) |
| Certification rate | Percentage of analyses that reach cert level >= 1 | Varies by code quality |

Infrastructure Health

| Metric | What to Monitor | Action Threshold |
| --- | --- | --- |
| PostgreSQL connections | Active connections vs pool size | > 80% of pool |
| PostgreSQL disk usage | Database size growth | > 80% of allocated storage |
| Redis memory | Memory usage vs limit | > 80% of maxmemory |
| Redis hit rate | Cache hit/miss ratio | < 80% hit rate |
| MinIO disk usage | Bucket sizes | > 80% of allocated storage |
| LavinMQ queue depth | Messages in mcp.jobs.analyze | > 100 (scale scan-workers) |
| LavinMQ consumers | Connected workers per queue | Should match replica count |
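
Several of these can be spot-checked from the command line. A sketch, assuming the compose service names used elsewhere in this guide (-T keeps TTY control characters out of the piped output):

# PostgreSQL: active connections against the configured maximum
docker compose exec -T postgres psql -U mcphub -c "SELECT count(*) AS active, (SELECT setting FROM pg_settings WHERE name = 'max_connections') AS max FROM pg_stat_activity;"

# Redis: memory usage and cache hit/miss counters
docker compose exec -T redis redis-cli -p 6390 INFO memory | grep used_memory_human
docker compose exec -T redis redis-cli -p 6390 INFO stats | grep -E 'keyspace_(hits|misses)'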

Application Performance

| Metric | What to Monitor | Action Threshold |
| --- | --- | --- |
| Hub web response time | P95 latency for dashboard pages | > 2s |
| Registry resolve time | P95 latency for resolve endpoint | > 500ms |
| Registry download throughput | Bytes served per second | Depends on usage |
| Worker error rate | Failed jobs / total jobs | > 5% |
| Scan worker CPU | CPU utilization during analysis | > 90% sustained (scale workers) |

LavinMQ Monitoring

The LavinMQ management UI (port 15672) provides built-in monitoring for:

  • Queue depths and message rates
  • Consumer counts and prefetch settings
  • Connection and channel status
  • Memory and disk usage

Access it at http://localhost:15672 (default credentials: guest/guest).
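
The same data is available from the management API, which is convenient for scripting; a sketch using the default credentials (the response fields follow the RabbitMQ-compatible API):

# Name, depth, and consumer count for every queue
curl -su guest:guest http://localhost:15672/api/queues \
  | jq '.[] | {name, messages, consumers}'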

MinIO Monitoring

The MinIO Console (port 9001) provides:

  • Bucket usage and object counts
  • Request rates and latency
  • Disk usage and health
  • Access audit logs

Access it at http://localhost:9001 (default credentials: minioadmin/minioadmin).
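
For scripted checks, the MinIO Client reports similar information; a sketch assuming an mc alias named myminio (as used in the backup examples later in this guide):

# Server status, health, and disk usage
mc admin info myminio

# Per-bucket usage
mc du myminio/mcp-hub-sources
mc du myminio/mcp-hub-analysis
mc du myminio/mcp-registry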

Troubleshooting Production Issues

Pipeline Stalls

Symptom: Jobs are submitted but never complete.

Diagnosis:

  1. Check LavinMQ for pending messages:

    # Via management UI at http://localhost:15672
    # Or via API:
    curl -u guest:guest http://localhost:15672/api/queues
    
  2. Check if workers are connected:

    docker compose -f docker-compose.local.yml logs scan-worker --tail 20
    docker compose -f docker-compose.local.yml logs hub-ingestion-worker --tail 20
    docker compose -f docker-compose.local.yml logs hub-results-worker --tail 20
    
  3. Common causes:

    • scan-worker not running: Analysis jobs queue up but are never processed (see the consumer check after this list)
    • results-worker not running: ANALYZE_COMPLETE messages are not processed
    • S3 unreachable: Workers cannot download/upload tarballs
    • AMQP connection lost: Workers reconnect automatically but may miss messages during disconnection
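
A sketch of that consumer check for the analyze queue (the URL-encoded default vhost %2f is an assumption):

# consumers should be > 0; 0 consumers with a growing message count
# means the corresponding worker is not connected
curl -su guest:guest http://localhost:15672/api/queues/%2f/mcp.jobs.analyze | jq '{name, messages, consumers}'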

Database Connection Exhaustion

Symptom: “too many connections” errors in logs.

Solutions:

  1. Check active connections:

    SELECT count(*) FROM pg_stat_activity;
    SELECT setting AS max_connections FROM pg_settings WHERE name = 'max_connections';
    
  2. Increase max_connections in PostgreSQL config or use connection pooling (PgBouncer).

  3. Reduce worker concurrency settings if connection usage is too high.

S3 / MinIO Issues

Symptom: “NoSuchBucket” or “AccessDenied” errors.

Solutions:

  1. Verify minio-init completed:

    docker compose -f docker-compose.local.yml ps minio-init
    # Should show "Exited (0)"
    
  2. Check bucket existence:

    docker compose exec minio mc ls myminio/
    # Should show: mcp-hub-sources, mcp-hub-analysis, mcp-registry
    
  3. Verify credentials match between services and MinIO.
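
If the checks above show a bucket is missing, it can be recreated by hand using the bucket names from step 2; a sketch:

# Recreate any missing buckets
docker compose exec minio mc mb myminio/mcp-hub-sources
docker compose exec minio mc mb myminio/mcp-hub-analysis
docker compose exec minio mc mb myminio/mcp-registry

Note that any bucket policies normally applied by minio-init may need to be reapplied as well.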

High Memory Usage on Scan Worker

Symptom: Scan worker OOM-killed during analysis.

Solutions:

  1. Reduce MAX_CONCURRENT to limit parallel analyses
  2. Increase memory limits in Docker or Kubernetes
  3. Switch to SCAN_MODE=fast for lower memory usage
  4. Reduce SCAN_TIMEOUT to prevent long-running analyses from accumulating
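
To see how close the worker is to its limit, watch its live usage; a sketch (the container name mcphub-scan-worker is an assumption based on the naming used elsewhere in this guide):

# One-shot snapshot of CPU and memory for the scan worker
docker stats --no-stream mcphub-scan-worker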

Registry Response Timeouts

Symptom: MCP Client reports timeout when resolving or downloading packages.

Solutions:

  1. Check registry health: curl http://localhost:8081/healthz
  2. Check PostgreSQL connectivity from the registry
  3. Check MinIO connectivity (presigned URL generation requires S3 access)
  4. Increase REGISTRY_TIMEOUT on the hub side if the registry is healthy but slow
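
curl's timing variables help distinguish a slow registry from an unreachable one; a sketch against the health endpoint:

# Connect, time-to-first-byte, and total latency
curl -so /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://localhost:8081/healthz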

Backup and Restore

PostgreSQL Backup

Full backup:

# Docker Compose (-T disables TTY allocation, which can corrupt the dump)
docker compose exec -T postgres pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql

# Kubernetes
kubectl exec -n mcp-hub deploy/postgres -- pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql

Per-database backup:

# Hub database
docker compose exec -T postgres pg_dump -U mcphub mcphub > mcphub_$(date +%Y%m%d).sql

# Registry database
docker compose exec -T postgres pg_dump -U mcphub mcp_registry > registry_$(date +%Y%m%d).sql

Restore:

docker compose exec -T postgres psql -U mcphub < backup_20250115.sql

MinIO / S3 Backup

Mirror to local directory:

docker compose exec minio mc mirror /data /backup

Mirror between S3 buckets (production):

# Using mc (MinIO Client)
mc mirror myminio/mcp-hub-sources backup/mcp-hub-sources
mc mirror myminio/mcp-hub-analysis backup/mcp-hub-analysis
mc mirror myminio/mcp-registry backup/mcp-registry

For managed S3 (AWS, GCS), enable cross-region replication and versioning for automated backup.

Redis

Redis data is used for caching and rate limiting. It is not critical to back up because it is reconstructed from the database. However, losing Redis causes a temporary performance degradation while the cache warms up.

If you use Redis for session storage, back it up:

docker compose exec redis redis-cli -p 6390 BGSAVE
docker cp mcphub-redis:/data/dump.rdb ./redis_backup.rdb

Backup Schedule Recommendations

| Component | Frequency | Retention | Method |
| --- | --- | --- | --- |
| PostgreSQL | Daily + WAL archiving | 30 days | pg_dump or managed backups |
| MinIO / S3 | Daily | 90 days | mc mirror or S3 replication |
| Redis | Not critical | — | Optional BGSAVE |
| LavinMQ | Not critical | — | Messages are transient |
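
A minimal cron sketch for the daily jobs (the paths, schedule, and myminio alias are assumptions; -T matters because cron provides no TTY):

# /etc/cron.d/mcp-hub-backup (sketch; % must be escaped in cron)
0 2 * * * root docker compose -f /opt/mcp-hub/docker-compose.local.yml exec -T postgres pg_dumpall -U mcphub > /backups/pg_$(date +\%Y\%m\%d).sql
0 3 * * * root mc mirror --overwrite myminio/mcp-hub-sources /backups/mcp-hub-sources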

Disaster Recovery

For disaster recovery:

  1. Restore PostgreSQL from the latest backup
  2. Restore MinIO from the latest mirror
  3. Restart all services – workers will reconnect to AMQP and resume processing
  4. Redis will rebuild its cache automatically
  5. Verify health with curl checks on all endpoints (see the loop after this list)
  6. Re-run pending jobs – any jobs that were in-flight during the failure may need to be resubmitted through the dashboard
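
A quick loop for step 5, covering the HTTP health endpoints from the table at the top of this guide:

# Prints OK/FAIL per endpoint; curl -f exits non-zero on HTTP errors (e.g. 503)
for url in \
  http://localhost:8080/health \
  http://localhost:8081/healthz \
  http://localhost:8083/healthz \
  http://localhost:9000/minio/health/live; do
  curl -sf "$url" > /dev/null && echo "OK   $url" || echo "FAIL $url"
done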