AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
Runbook: Coordinator API Outage
Based on Chaos Test: chaos_test_coordinator.py
Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
MTTR Target: 2 minutes
Immediate Actions (0-2 minutes)
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
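If symptoms persist after these checks, the steps above can be combined into a single triage pass. The following is a minimal sketch that reuses the labels, namespace, and health URL from this runbook; adjust the names to match your release.
# Quick triage sketch: restart only when the health endpoint is actually failing
if ! curl -sf --max-time 5 http://127.0.0.2:8011/v1/health > /dev/null; then
  echo "Coordinator health check failing; current pod state:"
  kubectl get pods -n default -l app.kubernetes.io/name=coordinator -o wide
  kubectl rollout restart deployment/coordinator -n default
  kubectl rollout status deployment/coordinator -n default --timeout=120s
fi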
Investigation (2-10 minutes)
- Review Logs
  kubectl logs -n default deployment/coordinator --tail=100
- Check Resource Limits
  kubectl top pods -n default -l app.kubernetes.io/name=coordinator
- Verify Database Connectivity
  kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
- Check Redis Connection
  kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
Recovery Actions
- Scale Up if Resource Starved
  kubectl scale deployment/coordinator --replicas=5 -n default
- Manual Pod Deletion if Stuck
  kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
- Rollback Deployment
  kubectl rollout undo deployment/coordinator -n default
Verification
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
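To confirm the 2-minute MTTR target was met, a short polling loop can record roughly how long the endpoint took to come back. A sketch using the same health URL as above:
# Poll the health endpoint for up to 2 minutes (the MTTR target)
for i in $(seq 1 24); do
  if curl -sf http://127.0.0.2:8011/v1/health > /dev/null; then
    echo "Coordinator healthy after ~$((i * 5))s"
    break
  fi
  sleep 5
done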
Runbook: Network Partition
Based on Chaos Test: chaos_test_network.py
Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
MTTR Target: 5 minutes
Immediate Actions (0-5 minutes)
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
Investigation (5-15 minutes)
- Identify Partitioned Nodes
  # Check each node's peer count
  for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
    echo "Pod: $pod"
    kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
  done
- Check Network Policies
  kubectl describe networkpolicy default-deny-all-ingress -n default
  kubectl describe networkpolicy blockchain-node-netpol -n default
- Verify DNS Resolution
  kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
Recovery Actions
- Remove Problematic Network Rules
  # Flush iptables on affected nodes
  for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
    kubectl exec -n default $pod -- iptables -F
  done
- Restart Network Components
  kubectl rollout restart deployment/blockchain-node -n default
- Force Re-peering
  # Delete and recreate pods to force re-peering
  kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
Verification
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
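Watching a single node's height does not confirm the partition has healed; comparing heights across all blockchain-node pods does. A sketch assuming the /v1/consensus response exposes a .height field, as used above:
# Partition is healed when all pods report (approximately) the same height
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  height=$(kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/consensus | jq -r '.height')
  echo "$pod height=$height"
done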
Runbook: Database Failure
Based on Chaos Test: chaos_test_database.py
Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
MTTR Target: 3 minutes
Immediate Actions (0-3 minutes)
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
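pg_last_xact_replay_timestamp() only tells you when the replica last replayed WAL; converting it to seconds of lag is more actionable. A minimal sketch, assuming standard streaming replication and the replica deployment name used above:
# Approximate replica lag in seconds (near 0 is healthy; NULL means nothing replayed yet)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"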
Investigation (3-10 minutes)
- Review Database Logs
  kubectl logs -n default deployment/postgresql --tail=100
- Check Resource Usage
  kubectl top pods -n default -l app.kubernetes.io/name=postgresql
  kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
- Identify Long-running Queries
  kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
Recovery Actions
- Kill Idle Connections
  kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
- Restart PostgreSQL
  kubectl rollout restart deployment/postgresql -n default
- Failover to Replica
  # Promote the replica if the primary cannot be recovered, then repoint clients (see the sketch below)
  kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
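After promotion, application traffic still resolves to the old primary through the postgresql Service. The sketch below repoints the Service at the promoted replica; the selector labels shown are assumptions, so verify the actual selector in your chart before patching.
# Inspect the current selector first, then repoint it (label values below are assumptions)
kubectl get service postgresql -n default -o yaml | grep -A 5 selector
kubectl patch service postgresql -n default -p '{"spec":{"selector":{"app.kubernetes.io/name":"postgresql-replica"}}}'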
Verification
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
Runbook: Redis Failure
Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
MTTR Target: 2 minutes
Immediate Actions (0-2 minutes)
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
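A quick look at the cache hit/miss counters helps distinguish "Redis is down" from "Redis is thrashing". These are standard INFO stats fields; the deployment name matches the one used above.
# Hit rate = keyspace_hits / (keyspace_hits + keyspace_misses); a falling rate plus evictions suggests memory pressure
kubectl exec -n default deployment/redis -- redis-cli info stats | grep -E "keyspace_(hits|misses)"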
Investigation (2-5 minutes)
- Review Redis Logs
  kubectl logs -n default deployment/redis --tail=100
- Check for Eviction
  kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
- Identify Large Keys
  kubectl exec -n default deployment/redis -- redis-cli --bigkeys
Recovery Actions
- Clear Cache Keys
  # Warning: this deletes every key matching the pattern, not only expired ones
  kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
- Restart Redis
  kubectl rollout restart deployment/redis -n default
- Scale Redis Cluster
  kubectl scale deployment/redis --replicas=3 -n default
Verification
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
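The timing checks here and in the High CPU runbook below reference a curl-format.txt template that is not included in this document. If one is not already checked in, a minimal version can be created as shown; all fields are standard curl -w write-out variables.
# Create a minimal curl write-out template for response-time breakdowns
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF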
Runbook: High CPU/Memory Usage
Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
MTTR Target: 5 minutes
Immediate Actions (0-5 minutes)
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry processes inside the coordinator pod
kubectl exec -n default deployment/coordinator -- top -b -n 1 | head -20
# 3. Check for OOM kills
dmesg | grep -i "killed process"
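dmesg is only visible on the node itself. From kubectl, OOM-killed containers can also be found in pod status; a sketch that assumes jq is installed on the operator machine:
# List containers whose last termination reason was OOMKilled
kubectl get pods -n default -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($p.metadata.name)/\(.name)"'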
Investigation (5-15 minutes)
- Analyze Resource Usage
  # Detailed pod metrics
  kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
  kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
- Check Resource Limits
  kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
- Review Application Metrics
  # Check Prometheus metrics
  curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
Recovery Actions
- Scale Services
  kubectl scale deployment/coordinator --replicas=5 -n default
  kubectl scale deployment/blockchain-node --replicas=3 -n default
- Increase Resource Limits
  kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
- Restart Affected Services
  kubectl rollout restart deployment/coordinator -n default
Verification
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
Runbook: Storage Issues
Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
MTTR Target: 10 minutes
Immediate Actions (0-10 minutes)
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
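If df shows pressure but the offending files are not obvious, ranking directories by size usually finds the culprit faster than find. A sketch to run on the affected node and inside the database pod:
# Largest directories on the current filesystem (-x stays on one filesystem)
du -xh / 2>/dev/null | sort -rh | head -20
# Same idea inside the database pod
kubectl exec -n default deployment/postgresql -- du -sh /var/lib/postgresql/data/*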
Investigation (10-20 minutes)
- Analyze Storage Usage
  du -sh /var/log/*
  du -sh /var/lib/docker/*
- Check PVC Usage
  kubectl get pvc -n default
  kubectl describe pvc postgresql-data -n default
- Review Retention Policies
  kubectl get cronjobs -n default
  kubectl describe cronjob log-cleanup -n default
Recovery Actions
- Expand Storage
  kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
- Force Cleanup
  # Clean old logs
  find /var/log -name "*.log" -mtime +7 -delete
  # Clean Docker images
  docker system prune -a
- Restart Services
  kubectl rollout restart deployment/postgresql -n default
Verification
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
Emergency Contact Procedures
Escalation Matrix
- Level 1: On-call engineer (5 minutes)
- Level 2: On-call secondary (15 minutes)
- Level 3: Engineering manager (30 minutes)
- Level 4: CTO (1 hour, critical only)
War Room Activation
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
Customer Communication
- Status Page Update (5 minutes)
- Email Notification (15 minutes)
- Twitter Update (30 minutes, critical only)
Post-Incident Checklist
Immediate (0-1 hour)
- Service fully restored
- Monitoring normal
- Status page updated
- Stakeholders notified
Short-term (1-24 hours)
- Incident document created
- Root cause identified
- Runbooks updated
- Post-mortem scheduled
Long-term (1-7 days)
- Post-mortem completed
- Action items assigned
- Monitoring improved
- Process updated
Runbook Maintenance
Review Schedule
- Monthly: Review and update runbooks
- Quarterly: Full review and testing
- Annually: Major revision
Update Process
- Test runbook procedures
- Document lessons learned
- Update procedures
- Train team members
- Update documentation
Version: 1.0
Last Updated: 2024-12-22
Owner: SRE Team