Monitoring & Operations
This guide covers operational aspects of running the MCP Hub Platform in production: health monitoring, logging, key metrics to track, troubleshooting common production issues, and backup procedures.
Health Endpoints
Every MCP Hub Platform service exposes a health check: HTTP endpoints for the application services and CLI commands for the infrastructure dependencies.
Endpoint Summary
| Service | Endpoint | Expected Response | Method |
|---|---|---|---|
| Hub Web | /health | 200 OK with JSON body | GET |
| Registry | /healthz | 200 OK | GET |
| Scan Worker | /healthz (port 8083) | 200 OK | GET |
| PostgreSQL | pg_isready -U mcphub | exit code 0 | CLI |
| Redis | redis-cli -p 6390 ping | PONG | CLI |
| MinIO | /minio/health/live | 200 OK | GET |
| LavinMQ | lavinmqctl status | exit code 0 | CLI |
Hub Web Health Response
{
"status": "healthy",
"version": "1.0.0",
"commit": "abc1234",
"checks": {
"database": "ok",
"redis": "ok",
"amqp": "ok",
"s3": "ok"
}
}
If any dependency is down, the response changes to:
{
"status": "degraded",
"checks": {
"database": "ok",
"redis": "error: connection refused",
"amqp": "ok",
"s3": "ok"
}
}
The HTTP status code is 200 when healthy, 503 when degraded or unhealthy.
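If you script health checks rather than relying on curl -f, you can branch on the status code directly. A minimal sketch, using the same port and path as the table above:
# Capture the HTTP status code from the hub health endpoint
code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
if [ "$code" != "200" ]; then
  echo "hub-web unhealthy (HTTP $code)" >&2
  exit 1
fi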
Monitoring with curl
# Hub web server
curl -sf http://localhost:8080/health | jq .
# Registry
curl -sf http://localhost:8081/healthz
# Scan worker
curl -sf http://localhost:8083/healthz
# MinIO
curl -sf http://localhost:9000/minio/health/live
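The non-HTTP dependencies can be checked with the CLI commands from the table. A sketch assuming the Docker Compose service names postgres, redis, and lavinmq from docker-compose.local.yml:
# PostgreSQL readiness (exit code 0 when accepting connections)
docker compose -f docker-compose.local.yml exec postgres pg_isready -U mcphub
# Redis liveness (should print PONG)
docker compose -f docker-compose.local.yml exec redis redis-cli -p 6390 ping
# LavinMQ status (exit code 0 when the broker is up)
docker compose -f docker-compose.local.yml exec lavinmq lavinmqctl status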
Monitoring with Docker Compose
# Check all service health statuses
docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"
# Watch for health changes
watch -n 5 'docker compose -f docker-compose.local.yml ps --format "table {{.Name}}\t{{.Status}}"'
Logging
Log Format
All application services support structured JSON logging when configured:
{
"level": "info",
"timestamp": "2025-01-15T10:30:00.000Z",
"msg": "analysis job completed",
"service": "scan-worker",
"job_id": "job-abc123",
"package": "acme/[email protected]",
"duration_ms": 4523,
"findings": 7
}
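Because the log lines are JSON, they can be filtered with jq. A sketch pulling only error-level entries from the scan worker (assumes the Compose setup used elsewhere in this guide; fromjson? quietly skips any non-JSON startup lines):
docker compose -f docker-compose.local.yml logs --no-log-prefix scan-worker \
  | jq -R 'fromjson? | select(.level == "error")'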
Log Levels
| Level | Description | Recommended Use |
|---|---|---|
| debug | Detailed diagnostic information | Local development, troubleshooting |
| info | Normal operational messages | Production (default) |
| warn | Warning conditions that may need attention | Production |
| error | Error conditions that require investigation | Always enabled |
Configure the log level via the LOG_LEVEL environment variable:
LOG_LEVEL=debug # Maximum verbosity
LOG_LEVEL=info # Standard production level
LOG_LEVEL=warn # Reduced verbosity, warnings and errors only
LOG_LEVEL=error # Errors only
Log Aggregation
For production, aggregate logs from all services into a centralized system.
Docker Compose (Loki example):
services:
hub-web:
logging:
driver: loki
options:
loki-url: "http://loki:3100/loki/api/v1/push"
loki-batch-size: "400"
labels: "service=hub-web"
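The loki logging driver is a Docker plugin and must be installed on each host before the configuration above will work; otherwise docker compose up fails with an unknown log driver error. A typical install (check the Grafana Loki docs for the version to pin):
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions
# Verify the plugin is installed and enabled
docker plugin ls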
Kubernetes (stdout/stderr):
All services log to stdout/stderr. Use your cluster’s log aggregation pipeline (Fluentd, Fluent Bit, Promtail) to collect and forward logs.
Sensitive Data
All services redact sensitive information from logs:
- Authentication tokens are never logged in full
- Database passwords are masked
- S3 credentials are redacted
- User PII is logged only at debug level and only as identifiers, not full values
Key Metrics
Pipeline Throughput
| Metric | What to Monitor | Healthy Range |
|---|---|---|
| Ingestion rate | Jobs entering the pipeline per hour | Depends on usage |
| Analysis completion time | Time from ANALYZE to ANALYZE_COMPLETE | Fast: < 30s, Deep: < 5m |
| Analysis queue depth | Number of pending ANALYZE jobs | < 10 (scale workers if consistently higher) |
| Certification rate | Percentage of analyses that reach cert level >= 1 | Varies by code quality |
Infrastructure Health
| Metric | What to Monitor | Action Threshold |
|---|---|---|
| PostgreSQL connections | Active connections vs pool size | > 80% of pool |
| PostgreSQL disk usage | Database size growth | > 80% of allocated storage |
| Redis memory | Memory usage vs limit | > 80% of maxmemory |
| Redis hit rate | Cache hit/miss ratio | < 80% hit rate |
| MinIO disk usage | Bucket sizes | > 80% of allocated storage |
| LavinMQ queue depth | Messages in mcp.jobs.analyze | > 100 (scale scan-workers) |
| LavinMQ consumers | Connected workers per queue | Should match replica count |
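Most of these numbers can be spot-checked from the command line. A sketch using the LavinMQ management API (RabbitMQ-compatible, assuming the default / vhost encoded as %2f), Redis INFO, and pg_stat_activity; queue name, credentials, and service names are the defaults used elsewhere in this guide:
# Analysis queue depth and consumer count
curl -s -u guest:guest http://localhost:15672/api/queues/%2f/mcp.jobs.analyze \
  | jq '{messages, consumers}'
# Redis memory usage and hit-rate inputs
docker compose -f docker-compose.local.yml exec redis \
  redis-cli -p 6390 info memory | grep used_memory_human
docker compose -f docker-compose.local.yml exec redis \
  redis-cli -p 6390 info stats | grep -E 'keyspace_hits|keyspace_misses'
# Active PostgreSQL connections
docker compose -f docker-compose.local.yml exec postgres \
  psql -U mcphub -c 'SELECT count(*) FROM pg_stat_activity;'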
Application Performance
| Metric | What to Monitor | Action Threshold |
|---|---|---|
| Hub web response time | P95 latency for dashboard pages | > 2s |
| Registry resolve time | P95 latency for resolve endpoint | > 500ms |
| Registry download throughput | Bytes served per second | Depends on usage |
| Worker error rate | Failed jobs / total jobs | > 5% |
| Scan worker CPU | CPU utilization during analysis | > 90% sustained (scale workers) |
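Without a full metrics stack you can still get a rough read on latency from curl's timing output. A sketch against the health endpoints (substitute a real dashboard or resolve URL for a more representative number):
# Rough single-request latency, in seconds
curl -s -o /dev/null -w 'hub-web   %{time_total}s\n' http://localhost:8080/health
curl -s -o /dev/null -w 'registry  %{time_total}s\n' http://localhost:8081/healthz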
LavinMQ Monitoring
The LavinMQ management UI (port 15672) provides built-in monitoring for:
- Queue depths and message rates
- Consumer counts and prefetch settings
- Connection and channel status
- Memory and disk usage
Access it at http://localhost:15672 (default credentials: guest/guest).
MinIO Monitoring
The MinIO Console (port 9001) provides:
- Bucket usage and object counts
- Request rates and latency
- Disk usage and health
- Access audit logs
Access it at http://localhost:9001 (default credentials: minioadmin/minioadmin).
Troubleshooting Production Issues
Pipeline Stalls
Symptom: Jobs are submitted but never complete.
Diagnosis:
Check LavinMQ for pending messages:
# Via management UI at http://localhost:15672
# Or via API:
curl -u guest:guest http://localhost:15672/api/queues
Check if workers are connected:
docker compose -f docker-compose.local.yml logs scan-worker --tail 20
docker compose -f docker-compose.local.yml logs hub-ingestion-worker --tail 20
docker compose -f docker-compose.local.yml logs hub-results-worker --tail 20
Common causes:
- scan-worker not running: Analysis jobs queue up but are never processed
- results-worker not running: ANALYZE_COMPLETE messages are not processed
- S3 unreachable: Workers cannot download/upload tarballs
- AMQP connection lost: Workers reconnect automatically but may miss messages during disconnection
Database Connection Exhaustion
Symptom: “too many connections” errors in logs.
Solutions:
Check active connections:
SELECT count(*) FROM pg_stat_activity;
SELECT setting FROM pg_settings WHERE name = 'max_connections';
Increase max_connections in the PostgreSQL config or use connection pooling (PgBouncer).
Reduce worker concurrency settings if connection usage is too high.
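To see which service is holding the connections, group pg_stat_activity by application name. A sketch run through the Compose postgres service (the columns are standard pg_stat_activity fields):
docker compose -f docker-compose.local.yml exec postgres psql -U mcphub -c \
  "SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY 4 DESC;"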
S3 / MinIO Issues
Symptom: “NoSuchBucket” or “AccessDenied” errors.
Solutions:
Verify minio-init completed:
docker compose -f docker-compose.local.yml ps minio-init
# Should show "Exited (0)"
Check bucket existence:
docker compose exec minio mc ls myminio/
# Should show: mcp-hub-sources, mcp-hub-analysis, mcp-registry
Verify credentials match between services and MinIO.
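If a bucket is genuinely missing (for example after a volume reset), it can be recreated with mc. A sketch assuming the myminio alias shown above is already configured inside the container:
docker compose exec minio mc mb myminio/mcp-hub-sources
docker compose exec minio mc mb myminio/mcp-hub-analysis
docker compose exec minio mc mb myminio/mcp-registry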
High Memory Usage on Scan Worker
Symptom: Scan worker OOM-killed during analysis.
Solutions:
- Reduce MAX_CONCURRENT to limit parallel analyses
- Increase memory limits in Docker or Kubernetes
- Switch to SCAN_MODE=fast for lower memory usage
- Reduce SCAN_TIMEOUT to prevent long-running analyses from accumulating
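To confirm memory pressure before and after tuning, watch the container's actual usage. A sketch using plain docker stats, resolving the container through the Compose service name:
# One-shot memory/CPU snapshot of the scan worker container
docker stats --no-stream $(docker compose -f docker-compose.local.yml ps -q scan-worker)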
Registry Response Timeouts
Symptom: MCP Client reports timeout when resolving or downloading packages.
Solutions:
- Check registry health: curl http://localhost:8081/healthz
- Check PostgreSQL connectivity from the registry
- Check MinIO connectivity (presigned URL generation requires S3 access)
- Increase REGISTRY_TIMEOUT on the hub side if the registry is healthy but slow
Backup and Restore
PostgreSQL Backup
Full backup:
# Docker Compose
docker compose exec postgres pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql
# Kubernetes
kubectl exec -n mcp-hub deploy/postgres -- pg_dumpall -U mcphub > backup_$(date +%Y%m%d).sql
Per-database backup:
# Hub database
docker compose exec postgres pg_dump -U mcphub mcphub > mcphub_$(date +%Y%m%d).sql
# Registry database
docker compose exec postgres pg_dump -U mcphub mcp_registry > registry_$(date +%Y%m%d).sql
Restore:
docker compose exec -i postgres psql -U mcphub < backup_20250115.sql
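To automate the daily dump, a small cron entry is enough. A sketch assuming backups land in /var/backups/mcp-hub and the Compose project lives in /opt/mcp-hub (adjust both paths to your deployment):
# /etc/cron.d/mcp-hub-backup — daily at 02:00 (cron requires % to be escaped)
0 2 * * * root cd /opt/mcp-hub && docker compose exec -T postgres pg_dumpall -U mcphub | gzip > /var/backups/mcp-hub/pg_$(date +\%Y\%m\%d).sql.gz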
MinIO / S3 Backup
Mirror to local directory:
docker compose exec minio mc mirror /data /backup
Mirror between S3 buckets (production):
# Using mc (MinIO Client)
mc mirror myminio/mcp-hub-sources backup/mcp-hub-sources
mc mirror myminio/mcp-hub-analysis backup/mcp-hub-analysis
mc mirror myminio/mcp-registry backup/mcp-registry
For managed S3 (AWS, GCS), enable cross-region replication and versioning for automated backup.
Redis
Redis data is used for caching and rate limiting. It is not critical to back up because it is reconstructed from the database. However, losing Redis causes a temporary performance degradation while the cache warms up.
If you use Redis for session storage, back it up:
docker compose exec redis redis-cli -p 6390 BGSAVE
docker cp mcphub-redis:/data/dump.rdb ./redis_backup.rdb
Backup Schedule Recommendations
| Component | Frequency | Retention | Method |
|---|---|---|---|
| PostgreSQL | Daily + WAL archiving | 30 days | pg_dump or managed backups |
| MinIO / S3 | Daily | 90 days | mc mirror or S3 replication |
| Redis | Not critical | – | Optional BGSAVE |
| LavinMQ | Not critical | – | Messages are transient |
Disaster Recovery
For disaster recovery:
- Restore PostgreSQL from the latest backup
- Restore MinIO from the latest mirror
- Restart all services – workers will reconnect to AMQP and resume processing
- Redis will rebuild its cache automatically
- Verify health with curl checks on all endpoints (see the sketch below)
- Re-run pending jobs – any jobs that were in-flight during the failure may need to be resubmitted through the dashboard
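A minimal verification sketch that loops over the HTTP health endpoints from the table at the top of this guide (ports assume the local Compose setup):
#!/usr/bin/env bash
# Check each HTTP health endpoint after a recovery; exit non-zero on the first failure
for url in \
  http://localhost:8080/health \
  http://localhost:8081/healthz \
  http://localhost:8083/healthz \
  http://localhost:9000/minio/health/live
do
  if curl -sf "$url" > /dev/null; then
    echo "OK   $url"
  else
    echo "FAIL $url"
    exit 1
  fi
done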