AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.

Runbook: Coordinator API Outage

Based on Chaos Test: chaos_test_coordinator.py

Symptoms

  • 503/504 errors on all endpoints
  • Health check failures
  • Job submission failures
  • Marketplace unresponsive

MTTR Target: 2 minutes

Immediate Actions (0-2 minutes)

# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
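
If you want to gate the restart on actual pod health, here is a minimal sketch combining steps 1 and 4 (it assumes the same namespace and label selector used above):

# Restart the coordinator only if no pod reports Ready, then wait for the rollout
READY=$(kubectl get pods -n default -l app.kubernetes.io/name=coordinator \
  -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')
if ! echo "$READY" | grep -q True; then
  kubectl rollout restart deployment/coordinator -n default
  kubectl rollout status deployment/coordinator -n default --timeout=120s
fi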

Investigation (2-10 minutes)

  1. Review Logs

    kubectl logs -n default deployment/coordinator --tail=100
    
  2. Check Resource Limits

    kubectl top pods -n default -l app.kubernetes.io/name=coordinator
    
  3. Verify Database Connectivity

    kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
    
  4. Check Redis Connection

    kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
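
If the coordinator image does not bundle redis-cli (an assumption about the image contents), the same reachability check as step 3 works for Redis, assuming the default Redis port:

kubectl exec -n default deployment/coordinator -- nc -z redis 6379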
    

Recovery Actions

  1. Scale Up if Resource Starved

    kubectl scale deployment/coordinator --replicas=5 -n default
    
  2. Manual Pod Deletion if Stuck

    kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
    
  3. Rollback Deployment

    kubectl rollout undo deployment/coordinator -n default
    

Verification

# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
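
To confirm recovery within the 2-minute MTTR target, a minimal sketch that polls the health endpoint until it responds (same host and port as above):

# Poll every 5 seconds for up to 2 minutes
for i in $(seq 1 24); do
  curl -sf http://127.0.0.2:8011/v1/health && break
  sleep 5
done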

Runbook: Network Partition

Based on Chaos Test: chaos_test_network.py

Symptoms

  • Blockchain nodes not communicating
  • Consensus stalled
  • High finality latency
  • Transaction processing delays

MTTR Target: 5 minutes

Immediate Actions (0-5 minutes)

# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default

Investigation (5-15 minutes)

  1. Identify Partitioned Nodes

    # Check each node's peer count
    for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
      echo "Pod: $pod"
      kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
    done
    
  2. Check Network Policies

    kubectl describe networkpolicy default-deny-all-ingress -n default
    kubectl describe networkpolicy blockchain-node-netpol -n default
    
  3. Verify DNS Resolution

    kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
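
The peer-count loop from step 1 can be extended to flag likely partitioned nodes; a sketch, assuming a healthy node normally sees at least 3 peers:

MIN_PEERS=3
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length')
  if [ "${count:-0}" -lt "$MIN_PEERS" ]; then
    echo "possibly partitioned: $pod (peers=$count)"
  fi
done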
    

Recovery Actions

  1. Remove Problematic Network Rules

    # Flush the iptables rules injected into each blockchain-node pod (the container needs NET_ADMIN)
    for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
      kubectl exec -n default $pod -- iptables -F
    done
    
  2. Restart Network Components

    kubectl rollout restart deployment/blockchain-node -n default
    
  3. Force Re-peering

    # Delete and recreate pods to force re-peering
    kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
    

Verification

# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
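
Consensus can also be confirmed by checking that the chain height advances between two samples; a sketch using the same consensus endpoint:

H1=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
sleep 30
H2=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
if [ "${H2:-0}" -gt "${H1:-0}" ]; then echo "consensus advancing ($H1 -> $H2)"; else echo "consensus still stalled"; fi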

Runbook: Database Failure

Based on Chaos Test: chaos_test_database.py

Symptoms

  • Database connection errors
  • Service degradation
  • Failed transactions
  • High error rates

MTTR Target: 3 minutes

Immediate Actions (0-3 minutes)

# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
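
The raw replay timestamp is hard to read at a glance; a sketch that expresses replica lag in seconds (same replica deployment and user as above):

kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"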

Investigation (3-10 minutes)

  1. Review Database Logs

    kubectl logs -n default deployment/postgresql --tail=100
    
  2. Check Resource Usage

    kubectl top pods -n default -l app.kubernetes.io/name=postgresql
    df -h /var/lib/postgresql/data
    
  3. Identify Long-running Queries

    kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
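
If step 3 surfaces queries that are blocking the service, they can be terminated before moving to the broader recovery actions below; a sketch (pg_terminate_backend is disruptive, so target only the offending sessions):

kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '5 minutes' AND pid <> pg_backend_pid();"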
    

Recovery Actions

  1. Kill Idle Connections

    kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
    
  2. Restart PostgreSQL

    kubectl rollout restart deployment/postgresql -n default
    
  3. Failover to Replica

    # Promote replica if primary fails
    kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
    

Verification

# Test database connectivity (replace the placeholder password with the real credential from the postgresql secret)
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
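
If the replica was promoted in recovery step 3, it should now report that it is no longer in recovery; a quick sketch:

kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_is_in_recovery();"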

Runbook: Redis Failure

Symptoms

  • Caching failures
  • Session loss
  • Increased database load
  • Slow response times

MTTR Target: 2 minutes

Immediate Actions (0-2 minutes)

# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients

Investigation (2-5 minutes)

  1. Review Redis Logs

    kubectl logs -n default deployment/redis --tail=100
    
  2. Check for Eviction

    kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
    
  3. Identify Large Keys

    kubectl exec -n default deployment/redis -- redis-cli --bigkeys
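
If evicted_keys is climbing, it is worth checking the memory ceiling and eviction policy Redis is operating under; a sketch:

kubectl exec -n default deployment/redis -- redis-cli config get maxmemory
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory-policy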
    

Recovery Actions

  1. Flush Stale Cache Keys

    # Deletes every key matching the pattern; run the whole scan/delete pipeline inside the Redis pod
    kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r -n 100 redis-cli del'
    
  2. Restart Redis

    kubectl rollout restart deployment/redis -n default
    
  3. Scale Redis Cluster

    kubectl scale deployment/redis --replicas=3 -n default
    

Verification

# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health

Runbook: High CPU/Memory Usage

Symptoms

  • Slow response times
  • Pod evictions
  • OOM errors
  • System degradation

MTTR Target: 5 minutes

Immediate Actions (0-5 minutes)

# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry pods and processes
kubectl top pods -n default --sort-by=cpu
kubectl exec -n default deployment/coordinator -- top -b -n 1 | head -20

# 3. Check for OOM kills
dmesg | grep -i "killed process"
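
dmesg requires access to the node itself; as a kubectl-only alternative, a sketch that reads the OOMKilled reason recorded in each pod's container status:

kubectl get pods -n default -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .status.containerStatuses[*]}{.lastState.terminated.reason}{" "}{end}{"\n"}{end}' | grep -i oomkilled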

Investigation (5-15 minutes)

  1. Analyze Resource Usage

    # Detailed pod metrics
    kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
    kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
    
  2. Check Resource Limits

    kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
    
  3. Review Application Metrics

    # Check Prometheus metrics
    curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
    

Recovery Actions

  1. Scale Services

    kubectl scale deployment/coordinator --replicas=5 -n default
    kubectl scale deployment/blockchain-node --replicas=3 -n default
    
  2. Increase Resource Limits

    kubectl patch deployment coordinator -n default -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
    
  3. Restart Affected Services

    kubectl rollout restart deployment/coordinator -n default
    

Verification

# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health

Runbook: Storage Issues

Symptoms

  • Disk space warnings
  • Write failures
  • Database errors
  • Pod crashes

MTTR Target: 10 minutes

Immediate Actions (0-10 minutes)

# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs (truncating /var/log/containers must be done on the node that hosts the pod)
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
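
To see where the PostgreSQL volume is actually being consumed, a sketch that lists the largest directories on the data mount (same data path as above):

kubectl exec -n default deployment/postgresql -- du -x -d 1 -h /var/lib/postgresql/data | sort -h | tail -10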

Investigation (10-20 minutes)

  1. Analyze Storage Usage

    du -sh /var/log/*
    du -sh /var/lib/docker/*
    
  2. Check PVC Usage

    kubectl get pvc -n default
    kubectl describe pvc postgresql-data -n default
    
  3. Review Retention Policies

    kubectl get cronjobs -n default
    kubectl describe cronjob log-cleanup -n default
    

Recovery Actions

  1. Expand Storage

    kubectl patch pvc postgresql-data -n default -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
    
  2. Force Cleanup

    # Clean old logs
    find /var/log -name "*.log" -mtime +7 -delete
    
    # Clean Docker images
    docker system prune -a
    
  3. Restart Services

    kubectl rollout restart deployment/postgresql -n default
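
The expansion in step 1 only succeeds when the PVC's storage class allows it; a quick pre-check, assuming the same PVC name:

SC=$(kubectl get pvc postgresql-data -n default -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o jsonpath='{.allowVolumeExpansion}'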
    

Verification

# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"

Emergency Contact Procedures

Escalation Matrix

  1. Level 1: On-call engineer (5 minutes)
  2. Level 2: On-call secondary (15 minutes)
  3. Level 3: Engineering manager (30 minutes)
  4. Level 4: CTO (1 hour, critical only)

War Room Activation

# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"

Customer Communication

  1. Status Page Update (5 minutes)
  2. Email Notification (15 minutes)
  3. Twitter Update (30 minutes, critical only)

Post-Incident Checklist

Immediate (0-1 hour)

  • Service fully restored
  • Monitoring normal
  • Status page updated
  • Stakeholders notified

Short-term (1-24 hours)

  • Incident document created
  • Root cause identified
  • Runbooks updated
  • Post-mortem scheduled

Long-term (1-7 days)

  • Post-mortem completed
  • Action items assigned
  • Monitoring improved
  • Process updated

Runbook Maintenance

Review Schedule

  • Monthly: Review and update runbooks
  • Quarterly: Full review and testing
  • Annually: Major revision

Update Process

  1. Test runbook procedures
  2. Document lessons learned
  3. Update procedures
  4. Train team members
  5. Update documentation

Version: 1.0 | Last Updated: 2024-12-22 | Owner: SRE Team