AITBC Monitoring Playbook & On-Call Guide

Overview

This document provides the monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. Its goal is to keep all services operating reliably and to resolve issues quickly.

Service Overview

Core Services

  • Coordinator API: Job management and marketplace coordination
  • Blockchain Nodes: Consensus and transaction processing
  • Explorer UI: Block explorer and transaction visualization
  • Marketplace UI: User interface for marketplace operations
  • Wallet Daemon: Cryptographic key management
  • Infrastructure: PostgreSQL, Redis, Kubernetes cluster

Critical Metrics

  • Availability: 99.9% uptime SLA
  • Performance: <200ms API response time (95th percentile)
  • Throughput: 100+ TPS sustained
  • MTTR: <2 minutes for critical incidents

On-Call Rotation

Rotation Schedule

  • Primary On-Call: 1 week rotation, Monday 00:00 UTC to Monday 00:00 UTC
  • Secondary On-Call: Shadow primary, handles escalations
  • Tertiary: Backup for both primary and secondary
  • Rotation Handoff: Every Monday at 08:00 UTC

Team Structure

Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)

Handoff Procedures

  1. Pre-handoff Check (Sunday 22:00 UTC):

    • Review active incidents
    • Check scheduled maintenance
    • Verify monitoring systems health
  2. Handoff Meeting (Monday 08:00 UTC):

    • 15-minute video call
    • Discuss current issues
    • Transfer knowledge
    • Confirm contact information
  3. Post-handoff (Monday 09:00 UTC):

    • Primary acknowledges receipt
    • Update on-call calendar
    • Test alerting systems

Contact Information

  • Primary: +1-555-ONCALL-1 (PagerDuty)
  • Secondary: +1-555-ONCALL-2 (PagerDuty)
  • Tertiary: +1-555-ONCALL-3 (PagerDuty)
  • Escalation Manager: +1-555-ESCALATE
  • Emergency: +1-555-EMERGENCY (Critical infrastructure only)

Alerting & Escalation

Alert Severity Levels

Critical (P0)

  • Service completely down
  • Data loss or corruption
  • Security breach
  • SLA violation in progress
  • Response Time: 5 minutes
  • Escalation: 15 minutes if no response

High (P1)

  • Significant degradation
  • Partial service outage
  • High error rates (>10%)
  • Response Time: 15 minutes
  • Escalation: 1 hour if no response

Medium (P2)

  • Minor degradation
  • Elevated error rates (5-10%)
  • Performance issues
  • Response Time: 1 hour
  • Escalation: 4 hours if no response

Low (P3)

  • Informational alerts
  • Non-critical issues
  • Response Time: 4 hours
  • Escalation: 24 hours if no response

Escalation Policy

  1. Level 1: Primary On-Call (5-60 minutes)
  2. Level 2: Secondary On-Call (15 minutes - 4 hours)
  3. Level 3: Tertiary On-Call (1 hour - 24 hours)
  4. Level 4: Engineering Manager (4 hours)
  5. Level 5: CTO (Critical incidents only)

Alert Channels

  • PagerDuty: Primary alerting system
  • Slack: #on-call-aitbc channel
  • Email: oncall@aitbc.io
  • SMS: Critical alerts only
  • Phone: Critical incidents only

Incident Response

Incident Classification

SEV-0 (Critical)

  • Complete service outage
  • Data loss or security breach
  • Financial impact >$10,000/hour
  • Customer impact >50%

SEV-1 (High)

  • Significant service degradation
  • Feature unavailable
  • Financial impact $1,000-$10,000/hour
  • Customer impact 10-50%

SEV-2 (Medium)

  • Minor service degradation
  • Performance issues
  • Financial impact <$1,000/hour
  • Customer impact <10%

SEV-3 (Low)

  • Informational
  • No customer impact

Incident Response Process

1. Detection & Triage (0-5 minutes)

# Check alert severity
# Verify impact
# Create incident channel
# Notify stakeholders
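
A minimal command sketch of these steps, assuming Alertmanager is reachable at the in-cluster address shown and that incident channels live in Slack; the hostnames, token variable, and channel naming pattern are assumptions, not fixed AITBC conventions:

# List currently firing alerts with their severity labels (Alertmanager v2 API)
curl -s http://alertmanager.monitoring:9093/api/v2/alerts | jq '.[] | {alert: .labels.alertname, severity: .labels.severity}'

# Quick impact check on the affected service
kubectl get pods -l app=coordinator
curl -s -o /dev/null -w '%{http_code}\n' https://<coordinator-endpoint>/health

# Create the incident channel (Slack conversations.create; the name pattern is an example)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -d name="inc-$(date +%Y%m%d)-coordinator"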

2. Assessment (5-15 minutes)

  • Determine scope
  • Identify root cause area
  • Estimate resolution time
  • Declare severity level

3. Communication (15-30 minutes)

  • Update status page
  • Notify customers (if needed)
  • Internal stakeholder updates
  • Set up war room

4. Resolution (Varies)

  • Implement fix
  • Verify resolution
  • Monitor for recurrence
  • Document actions

5. Recovery (30-60 minutes)

  • Full service restoration
  • Performance validation
  • Customer communication
  • Incident closure

Service-Specific Runbooks

Coordinator API

High Error Rate

Symptoms: 5xx errors >5%, response time >500ms

Runbook:

  1. Check pod health: kubectl get pods -l app=coordinator
  2. Review logs: kubectl logs -f deployment/coordinator
  3. Check database connectivity
  4. Verify Redis connection
  5. Scale if needed: kubectl scale deployment coordinator --replicas=5
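
A sketch of steps 3-4, assuming the coordinator image ships pg_isready and redis-cli and that the connection hosts are exposed through the environment variables shown (variable names are assumptions):

# Database connectivity from inside a coordinator pod
kubectl exec deploy/coordinator -- sh -c 'pg_isready -h "$DB_HOST" -p "${DB_PORT:-5432}"'

# Redis connectivity (expects PONG)
kubectl exec deploy/coordinator -- sh -c 'redis-cli -h "$REDIS_HOST" ping'

# Count recent errors in the logs before deciding to scale
kubectl logs deploy/coordinator --since=15m | grep -ci error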

Service Unavailable

Symptoms: 503 errors, health check failures

Runbook:

  1. Check deployment status
  2. Review recent deployments
  3. Rollback if necessary
  4. Check resource limits
  5. Verify ingress configuration
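
The matching kubectl commands for steps 1-5; the deployment and ingress names follow the labels used above and are assumptions:

kubectl rollout status deployment/coordinator                  # step 1: current deployment health
kubectl rollout history deployment/coordinator                 # step 2: recent revisions
kubectl rollout undo deployment/coordinator                    # step 3: roll back to the previous revision
kubectl describe deployment coordinator | grep -A3 -i limits   # step 4: resource limits
kubectl describe ingress coordinator                           # step 5: ingress configuration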

Blockchain Nodes

Consensus Stalled

Symptoms: No new blocks, high finality latency

Runbook:

  1. Check node sync status
  2. Verify network connectivity
  3. Review validator set
  4. Check governance proposals
  5. Restart if needed (with caution)
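
A hedged sketch for steps 1-2; the label selector, workload name, and log keywords are assumptions about how the blockchain nodes are deployed:

# Node pods, restarts, and placement
kubectl get pods -l app=blockchain-node -o wide

# Look for the most recently produced/finalized blocks in the logs
kubectl logs --tail=500 statefulset/blockchain-node | grep -iE 'block|final' | tail -20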

High Peer Drop Rate

Symptoms: Connected peers <50%, network partition

Runbook:

  1. Check network policies
  2. Verify DNS resolution
  3. Review firewall rules
  4. Check load balancer health
  5. Restart networking components
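
A sketch for steps 1-2 that uses a throwaway debug pod so the checks do not depend on tools inside the node image; the namespace, peer host, and port are placeholders:

# Network policies that could be blocking P2P traffic
kubectl get networkpolicy -n <namespace>

# DNS resolution and TCP reachability from a disposable busybox pod
kubectl run net-debug --rm -it --restart=Never --image=busybox -- \
  sh -c 'nslookup <peer-host> && nc -zv <peer-host> <p2p-port>'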

Database (PostgreSQL)

Connection Exhaustion

Symptoms: "Too many connections" errors Runbook:

  1. Check active connections
  2. Identify long-running queries
  3. Kill idle connections
  4. Increase pool size if needed
  5. Scale database
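
The SQL behind steps 1-3, runnable with psql against the primary (connection flags omitted); pg_stat_activity and pg_terminate_backend are standard PostgreSQL, and the 10-minute idle cutoff is an example value:

# Step 1: current connections vs. the configured limit
psql -c "SELECT count(*) AS connections, current_setting('max_connections') AS max FROM pg_stat_activity;"

# Step 2: longest-running non-idle queries
psql -c "SELECT pid, now() - query_start AS duration, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC LIMIT 10;"

# Step 3: terminate connections idle for more than 10 minutes
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"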

Replica Lag

Symptoms: Read replica lag >10 seconds

Runbook:

  1. Check replica status
  2. Review network latency
  3. Verify disk space
  4. Restart replication if needed
  5. Failover if necessary
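
SQL for step 1, assuming PostgreSQL 10 or newer (replay_lag was added to pg_stat_replication in version 10):

# On the primary: per-replica replication state and lag
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn, replay_lag FROM pg_stat_replication;"

# On the replica: time since the last replayed transaction
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"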

Redis

Memory Pressure

Symptoms: OOM errors, high eviction rate

Runbook:

  1. Check memory usage
  2. Review key expiration
  3. Clean up unused keys
  4. Scale Redis cluster
  5. Optimize data structures
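
redis-cli commands covering steps 1-3 (host and auth flags omitted):

# Step 1: memory usage, limit, and eviction policy
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# Step 2: eviction and expiration counters
redis-cli INFO stats | grep -E 'evicted_keys|expired_keys'

# Step 3: identify the largest keys before cleaning up (samples keys, safe to run online)
redis-cli --bigkeys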

Connection Issues

Symptoms: Connection timeouts, errors

Runbook:

  1. Check max connections
  2. Review connection pool
  3. Verify network policies
  4. Restart Redis if needed
  5. Scale horizontally
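
A quick sketch for steps 1-2 (host and auth flags omitted):

# Configured connection limit vs. currently connected clients
redis-cli CONFIG GET maxclients
redis-cli INFO clients | grep -E 'connected_clients|blocked_clients'

# Inspect individual connections (age, idle time, last command)
redis-cli CLIENT LIST | head -20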

Monitoring Dashboards

Primary Dashboards

1. System Overview

  • Service health status
  • Error rates (4xx/5xx)
  • Response times
  • Throughput metrics
  • Resource utilization

2. Infrastructure

  • Kubernetes cluster health
  • Node resource usage
  • Pod status and restarts
  • Network traffic
  • Storage capacity

3. Application Metrics

  • Job submission rates
  • Transaction processing
  • Marketplace activity
  • Wallet operations
  • Mining statistics

4. Business KPIs

  • Active users
  • Transaction volume
  • Revenue metrics
  • Customer satisfaction
  • SLA compliance

Alert Rules

Critical Alerts

  • Service down >1 minute
  • Error rate >10%
  • Response time >1 second
  • Disk space >90%
  • Memory usage >95%

Warning Alerts

  • Error rate >5%
  • Response time >500ms
  • CPU usage >80%
  • Queue depth >1000
  • Replica lag >5s
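
As an illustration, one of the critical thresholds above could be expressed as a Prometheus alerting rule roughly as follows; the metric and job names follow common client-library conventions and are assumptions about what the coordinator actually exports:

cat <<'EOF' > coordinator-critical-alerts.yaml
groups:
  - name: coordinator-critical
    rules:
      - alert: CoordinatorHighErrorRate
        # Critical: 5xx error rate above 10% for 5 minutes
        expr: |
          sum(rate(http_requests_total{job="coordinator", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="coordinator"}[5m])) > 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Coordinator 5xx error rate above 10%"
EOF

# Validate the rule file before loading it into Prometheus
promtool check rules coordinator-critical-alerts.yaml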

SLOs & SLIs

Service Level Objectives

Service          Metric          Target   Measurement
Coordinator API  Availability    99.9%    30-day rolling
Coordinator API  Latency         <200ms   95th percentile
Blockchain       Block Time      <2s      24-hour average
Marketplace      Success Rate    99.5%    Daily
Explorer         Response Time   <500ms   95th percentile

Service Level Indicators

Availability

  • HTTP status codes
  • Health check responses
  • Pod readiness status

Latency

  • Request duration histogram
  • Database query times
  • External API calls
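
For example, the 95th-percentile latency SLI can be computed from the request duration histogram with a query such as the one below, issued against the Prometheus HTTP API; the metric and job names are assumptions:

curl -sG http://prometheus.monitoring:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="coordinator"}[5m])) by (le))'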

Throughput

  • Requests per second
  • Transactions per block
  • Jobs completed per hour

Quality

  • Error rates
  • Success rates
  • Customer satisfaction

Post-Incident Process

Immediate Actions (0-1 hour)

  1. Verify full resolution
  2. Monitor for recurrence
  3. Update status page
  4. Notify stakeholders

Post-Mortem (1-24 hours)

  1. Create incident document
  2. Gather timeline and logs
  3. Identify root cause
  4. Document lessons learned

Follow-up (1-7 days)

  1. Schedule post-mortem meeting
  2. Assign action items
  3. Update runbooks
  4. Improve monitoring

Review (Weekly)

  1. Review incident trends
  2. Update SLOs if needed
  3. Adjust alerting thresholds
  4. Improve processes

Maintenance Windows

Scheduled Maintenance

  • Frequency: Weekly maintenance window
  • Time: Sunday 02:00-04:00 UTC
  • Duration: Maximum 2 hours
  • Notification: 72 hours advance

Emergency Maintenance

  • Approval: Engineering Manager required
  • Notification: 4 hours advance (if possible)
  • Duration: As needed
  • Rollback: Always required

Tools & Systems

Monitoring Stack

  • Prometheus: Metrics collection
  • Grafana: Visualization and dashboards
  • Alertmanager: Alert routing and management
  • PagerDuty: On-call scheduling and escalation

Observability

  • Jaeger: Distributed tracing
  • Loki: Log aggregation
  • Kiali: Service mesh visualization
  • Kube-state-metrics: Kubernetes metrics

Communication

  • Slack: Primary communication
  • Zoom: War room meetings
  • Status Page: Customer notifications
  • Email: Formal communications

Training & Onboarding

New On-Call Engineer

  1. Shadow primary for 1 week
  2. Review all runbooks
  3. Test alerting systems
  4. Handle low-severity incidents
  5. Solo on-call with mentor

Ongoing Training

  • Monthly incident drills
  • Quarterly runbook updates
  • Annual training refreshers
  • Cross-team knowledge sharing

Emergency Procedures

Major Outage

  1. Declare incident (SEV-0)
  2. Activate war room
  3. Customer communication
  4. Executive updates
  5. Recovery coordination

Security Incident

  1. Isolate affected systems
  2. Preserve evidence
  3. Notify security team
  4. Customer notification
  5. Regulatory compliance

Data Loss

  1. Stop affected services
  2. Assess impact
  3. Initiate recovery
  4. Customer communication
  5. Prevent recurrence

Appendix

A. Contact List

[Detailed contact information]

B. Runbook Checklist

[Quick reference checklists]

C. Alert Configuration

[Prometheus rules and thresholds]

[Grafana dashboard URLs]


Document Version: 1.0
Last Updated: 2024-12-22
Next Review: 2025-01-22
Owner: SRE Team