Monitoring & Operations

Health endpoints, logging configuration, key metrics, production troubleshooting, and backup strategies.

This guide covers operational aspects of running the MCP Hub Platform in production: health monitoring, logging, key metrics to track, troubleshooting common production issues, and backup procedures.

Health Endpoints

Every MCP Hub Platform service exposes health check endpoints.

Endpoint Summary

| Service | Endpoint | Expected Response | Method |
| --- | --- | --- | --- |
| Hub Web | /health | 200 OK with JSON body | GET |
| Registry | /healthz | 200 OK | GET |
| Scan Worker | /healthz (port 8083) | 200 OK | GET |
| PostgreSQL | pg_isready -U mcphub | exit code 0 | CLI |
| Redis | redis-cli -p 6390 ping | PONG | CLI |
| MinIO | /minio/health/live | 200 OK | GET |
| LavinMQ | lavinmqctl status | exit code 0 | CLI |

Hub Web Health Response

{
  "status": "healthy",
  "version": "1.0.0",
  "commit": "abc1234",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "amqp": "ok",
    "s3": "ok"
  }
}

If any dependency is down, the response changes to:

{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "error: connection refused",
    "amqp": "ok",
    "s3": "ok"
  }
}

The HTTP status code is 200 when healthy, 503 when degraded or unhealthy.

Monitoring with curl

# Hub web server
curl -sf http://localhost:8080/health | jq .

# Registry
curl -sf http://localhost:8081/healthz

# Scan worker
curl -sf http://localhost:8083/healthz

# MinIO
curl -sf http://localhost:9000/minio/health/live
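
The CLI checks from the endpoint summary can be run the same way through docker compose exec; a sketch, assuming the compose services are named postgres, redis, and lavinmq:

# PostgreSQL (exit code 0 when accepting connections)
docker compose -f docker-compose.local.yml exec postgres pg_isready -U mcphub

# Redis (prints PONG)
docker compose -f docker-compose.local.yml exec redis redis-cli -p 6390 ping

# LavinMQ (exit code 0 when the broker is up)
docker compose -f docker-compose.local.yml exec lavinmq lavinmqctl status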

Monitoring with Docker Compose

# Check all service health statuses
docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"

# Watch for health changes
watch -n 5 'docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"'

Logging

Log Format

All application services support structured JSON logging when configured:

{
  "level": "info",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "msg": "analysis job completed",
  "service": "scan-worker",
  "job_id": "job-abc123",
  "package": "acme/[email protected]",
  "duration_ms": 4523,
  "findings": 7
}
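
Because each line is a self-contained JSON object, logs can be filtered directly with jq; a sketch using the field names from the example above:

# Keep only error-level lines from the scan worker
# (-R plus fromjson? skips any non-JSON lines)
docker compose -f docker-compose.local.yml logs --no-log-prefix scan-worker \
  | jq -cR 'fromjson? | select(.level == "error")'

# Find analysis jobs that took longer than a minute
docker compose -f docker-compose.local.yml logs --no-log-prefix scan-worker \
  | jq -cR 'fromjson? | select((.duration_ms // 0) > 60000)'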

Log Levels

| Level | Description | Recommended Use |
| --- | --- | --- |
| debug | Detailed diagnostic information | Local development, troubleshooting |
| info | Normal operational messages | Production (default) |
| warn | Warning conditions that may need attention | Production |
| error | Error conditions that require investigation | Always enabled |

Configure the log level via the LOG_LEVEL environment variable:

LOG_LEVEL=debug  # Maximum verbosity
LOG_LEVEL=info   # Standard production level
LOG_LEVEL=warn   # Reduced verbosity, warnings and errors only
LOG_LEVEL=error  # Errors only

Log Aggregation

For production, aggregate logs from all services into a centralized system.

Docker Compose (Loki example):

services:
  hub-web:
    logging:
      driver: loki
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
        loki-batch-size: "400"
        labels: "service=hub-web"

Kubernetes (stdout/stderr):

All services log to stdout/stderr. Use your cluster’s log aggregation pipeline (Fluentd, Fluent Bit, Promtail) to collect and forward logs.
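
The same jq filtering shown above works in Kubernetes; a sketch, assuming a scan-worker Deployment in the mcp-hub namespace:

# Tail recent scan-worker logs and keep only errors
kubectl logs -n mcp-hub deploy/scan-worker --tail=200 \
  | jq -cR 'fromjson? | select(.level == "error")'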

Sensitive Data

All services redact sensitive information from logs:

  • Authentication tokens are never logged in full
  • Database passwords are masked
  • S3 credentials are redacted
  • User PII appears only at debug level, and then only as identifiers, never full values

Key Metrics

Pipeline Throughput

| Metric | What to Monitor | Healthy Range |
| --- | --- | --- |
| Ingestion rate | Jobs entering the pipeline per hour | Depends on usage |
| Analysis completion time | Time from ANALYZE to ANALYZE_COMPLETE | Fast: < 30s, Deep: < 5m |
| Analysis queue depth | Number of pending ANALYZE jobs | < 10 (scale workers if consistently higher) |
| Certification rate | Percentage of analyses that reach cert level >= 1 | Varies by code quality |

Infrastructure Health

| Metric | What to Monitor | Action Threshold |
| --- | --- | --- |
| PostgreSQL connections | Active connections vs pool size | > 80% of pool |
| PostgreSQL disk usage | Database size growth | > 80% of allocated storage |
| Redis memory | Memory usage vs limit | > 80% of maxmemory |
| Redis hit rate | Cache hit/miss ratio | < 80% hit rate |
| MinIO disk usage | Bucket sizes | > 80% of allocated storage |
| LavinMQ queue depth | Messages in mcp.jobs.analyze | > 100 (scale scan-workers) |
| LavinMQ consumers | Connected workers per queue | Should match replica count |
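
Several of these can be spot-checked from the command line. A sketch, assuming the compose service names used elsewhere in this guide (-T keeps TTY control characters out of the piped output):

# PostgreSQL: active connections against the configured maximum
docker compose exec -T postgres psql -U mcphub -c "SELECT count(*) AS active, (SELECT setting FROM pg_settings WHERE name = 'max_connections') AS max FROM pg_stat_activity;"

# Redis: memory usage and cache hit/miss counters
docker compose exec -T redis redis-cli -p 6390 INFO memory | grep used_memory_human
docker compose exec -T redis redis-cli -p 6390 INFO stats | grep -E 'keyspace_(hits|misses)'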

Application Performance

| Metric | What to Monitor | Action Threshold |
| --- | --- | --- |
| Hub web response time | P95 latency for dashboard pages | > 2s |
| Registry resolve time | P95 latency for resolve endpoint | > 500ms |
| Registry download throughput | Bytes served per second | Depends on usage |
| Worker error rate | Failed jobs / total jobs | > 5% |
| Scan worker CPU | CPU utilization during analysis | > 90% sustained (scale workers) |

LavinMQ Monitoring

The LavinMQ management UI (port 15672) provides built-in monitoring for:

  • Queue depths and message rates
  • Consumer counts and prefetch settings
  • Connection and channel status
  • Memory and disk usage

Access it at http://localhost:15672 (default credentials: guest/guest).
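
The same data is available from the management API, which is convenient for scripting; a sketch using the default credentials (the response fields follow the RabbitMQ-compatible API):

# Name, depth, and consumer count for every queue
curl -su guest:guest http://localhost:15672/api/queues \
  | jq '.[] | {name, messages, consumers}'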

MinIO Monitoring

The MinIO Console (port 9001) provides:

  • Bucket usage and object counts
  • Request rates and latency
  • Disk usage and health
  • Access audit logs

Access it at http://localhost:9001 (default credentials: minioadmin/minioadmin).
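
For scripted checks, the MinIO Client reports similar information; a sketch assuming an mc alias named myminio (as used in the backup examples later in this guide):

# Server status, health, and disk usage
mc admin info myminio

# Per-bucket usage
mc du myminio/mcp-hub-sources
mc du myminio/mcp-hub-analysis
mc du myminio/mcp-registry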

Troubleshooting Production Issues

Pipeline Stalls

Symptom: Jobs are submitted but never complete.

Diagnosis:

  1. Check LavinMQ for pending messages:

    # Via management UI at http://localhost:15672
    # Or via API:
    curl -u guest:guest http://localhost:15672/api/queues
    
  2. Check if workers are connected:

    docker compose -f docker-compose.local.yml logs scan-worker --tail 20
    docker compose -f docker-compose.local.yml logs hub-ingestion-worker --tail 20
    docker compose -f docker-compose.local.yml logs hub-results-worker --tail 20
    
  3. Common causes:

    • scan-worker not running: Analysis jobs queue up but are never processed (see the consumer check after this list)
    • results-worker not running: ANALYZE_COMPLETE messages are not processed
    • S3 unreachable: Workers cannot download/upload tarballs
    • AMQP connection lost: Workers reconnect automatically but may miss messages during disconnection
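
A sketch of that consumer check for the analyze queue (the URL-encoded default vhost %2f is an assumption):

# consumers should be > 0; 0 consumers with a growing message count
# means the corresponding worker is not connected
curl -su guest:guest http://localhost:15672/api/queues/%2f/mcp.jobs.analyze | jq '{name, messages, consumers}'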

Database Connection Exhaustion

Symptom: “too many connections” errors in logs.

Solutions:

  1. Check active connections:

    SELECT count(*) FROM pg_stat_activity;
    SELECT setting AS max_connections FROM pg_settings WHERE name = 'max_connections';
    
  2. Increase max_connections in PostgreSQL config or use connection pooling (PgBouncer).

  3. Reduce worker concurrency settings if connection usage is too high.

S3 / MinIO Issues

Symptom: “NoSuchBucket” or “AccessDenied” errors.

Solutions:

  1. Verify minio-init completed:

    docker compose -f docker-compose.local.yml ps minio-init
    # Should show "Exited (0)"
    
  2. Check bucket existence:

    docker compose exec minio mc ls myminio/
    # Should show: mcp-hub-sources, mcp-hub-analysis, mcp-registry
    
  3. Verify credentials match between services and MinIO.
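
If the checks above show a bucket is missing, it can be recreated by hand using the bucket names from step 2; a sketch:

# Recreate any missing buckets
docker compose exec minio mc mb myminio/mcp-hub-sources
docker compose exec minio mc mb myminio/mcp-hub-analysis
docker compose exec minio mc mb myminio/mcp-registry

Note that any bucket policies normally applied by minio-init may need to be reapplied as well.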

High Memory Usage on Scan Worker

Symptom: Scan worker OOM-killed during analysis.

Solutions:

  1. Reduce MAX_CONCURRENT to limit parallel analyses
  2. Increase memory limits in Docker or Kubernetes
  3. Switch to SCAN_MODE=fast for lower memory usage
  4. Reduce SCAN_TIMEOUT to prevent long-running analyses from accumulating
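
To see how close the worker is to its limit, watch its live usage; a sketch (the container name mcphub-scan-worker is an assumption based on the naming used elsewhere in this guide):

# One-shot snapshot of CPU and memory for the scan worker
docker stats --no-stream mcphub-scan-worker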

Registry Response Timeouts

Symptom: MCP Client reports timeout when resolving or downloading packages.

Solutions:

  1. Check registry health: curl http://localhost:8081/healthz
  2. Check PostgreSQL connectivity from the registry
  3. Check MinIO connectivity (presigned URL generation requires S3 access)
  4. Increase REGISTRY_TIMEOUT on the hub side if the registry is healthy but slow
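
curl's timing variables help distinguish a slow registry from an unreachable one; a sketch against the health endpoint:

# Connect, time-to-first-byte, and total latency
curl -so /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://localhost:8081/healthz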

Backup and Restore

PostgreSQL Backup

Full backup:

# Docker Compose (-T disables TTY allocation, which can corrupt the dump)
docker compose exec -T postgres pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql

# Kubernetes
kubectl exec -n mcp-hub deploy/postgres -- pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql

Per-database backup:

# Hub database
docker compose exec -T postgres pg_dump -U mcphub mcphub > mcphub_$(date +%Y%m%d).sql

# Registry database
docker compose exec -T postgres pg_dump -U mcphub mcp_registry > registry_$(date +%Y%m%d).sql

Restore:

docker compose exec -T postgres psql -U mcphub < backup_20250115.sql

MinIO / S3 Backup

Mirror to local directory:

docker compose exec minio mc mirror /data /backup

Mirror between S3 buckets (production):

# Using mc (MinIO Client)
mc mirror myminio/mcp-hub-sources backup/mcp-hub-sources
mc mirror myminio/mcp-hub-analysis backup/mcp-hub-analysis
mc mirror myminio/mcp-registry backup/mcp-registry

For managed S3 (AWS, GCS), enable cross-region replication and versioning for automated backup.

Redis

Redis data is used for caching and rate limiting. It is not critical to back up because it is reconstructed from the database. However, losing Redis causes a temporary performance degradation while the cache warms up.

If you use Redis for session storage, back it up:

docker compose exec redis redis-cli -p 6390 BGSAVE
docker cp mcphub-redis:/data/dump.rdb ./redis_backup.rdb

Backup Schedule Recommendations

| Component | Frequency | Retention | Method |
| --- | --- | --- | --- |
| PostgreSQL | Daily + WAL archiving | 30 days | pg_dump or managed backups |
| MinIO / S3 | Daily | 90 days | mc mirror or S3 replication |
| Redis | Not critical | — | Optional BGSAVE |
| LavinMQ | Not critical | — | Messages are transient |
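
A minimal cron sketch for the daily jobs (the paths, schedule, and myminio alias are assumptions; -T matters because cron provides no TTY):

# /etc/cron.d/mcp-hub-backup (sketch; % must be escaped in cron)
0 2 * * * root docker compose -f /opt/mcp-hub/docker-compose.local.yml exec -T postgres pg_dumpall -U mcphub > /backups/pg_$(date +\%Y\%m\%d).sql
0 3 * * * root mc mirror --overwrite myminio/mcp-hub-sources /backups/mcp-hub-sources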

Disaster Recovery

For disaster recovery:

  1. Restore PostgreSQL from the latest backup
  2. Restore MinIO from the latest mirror
  3. Restart all services – workers will reconnect to AMQP and resume processing
  4. Redis will rebuild its cache automatically
  5. Verify health with curl checks on all endpoints (see the loop after this list)
  6. Re-run pending jobs – any jobs that were in-flight during the failure may need to be resubmitted through the dashboard
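
A quick loop for step 5, covering the HTTP health endpoints from the table at the top of this guide:

# Prints OK/FAIL per endpoint; curl -f exits non-zero on HTTP errors (e.g. 503)
for url in \
  http://localhost:8080/health \
  http://localhost:8081/healthz \
  http://localhost:8083/healthz \
  http://localhost:9000/minio/health/live; do
  curl -sf "$url" > /dev/null && echo "OK   $url" || echo "FAIL $url"
done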