# AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
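
Before opening a specific runbook, a quick cluster-wide triage pass usually identifies which scenario applies. A minimal sketch, assuming the `default` namespace used throughout these runbooks:

```bash
# Overall pod and node health
kubectl get pods -n default -o wide
kubectl top nodes

# The most recent warning events usually point at the failing component
kubectl get events -n default --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp | tail -20
```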

## Runbook: Coordinator API Outage

### Based on Chaos Test: `chaos_test_coordinator.py`

### Symptoms

- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```

### Investigation (2-10 minutes)

1. **Review Logs**

   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**

   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**

   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```

### Recovery Actions

1. **Scale Up if Resource Starved**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**

   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**

   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```
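
Whichever recovery action is taken, `kubectl rollout status` gives a bounded wait on the new pods that fits inside the 2-minute MTTR target; a minimal sketch:

```bash
# Block until the restarted/rolled-back pods are Ready, or fail after 120s
kubectl rollout status deployment/coordinator -n default --timeout=120s
```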

### Verification

```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
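
If the one-off curls above flap while pods are still coming up, a short polling loop confirms the API stays healthy rather than answering once; a sketch against the same health endpoint:

```bash
# Poll the health endpoint for up to 2 minutes
for i in $(seq 1 24); do
  if curl -sf http://127.0.0.2:8011/v1/health > /dev/null; then
    echo "coordinator healthy after ~$((i * 5))s"
    break
  fi
  [ "$i" -eq 24 ] && echo "coordinator still unhealthy after 120s"
  sleep 5
done
```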

## Runbook: Network Partition

### Based on Chaos Test: `chaos_test_network.py`

### Symptoms

- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```

### Investigation (5-15 minutes)

1. **Identify Partitioned Nodes**

   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**

   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**

   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```
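
If the per-pod peer counts above are uneven, a small variation of the same loop flags the pods that are likely on the wrong side of the partition. `MIN_PEERS` is an assumed threshold, not a documented value; set it from the expected cluster size:

```bash
# Flag blockchain-node pods reporting fewer peers than expected
MIN_PEERS=2
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  peers=$(kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length')
  if [ "${peers:-0}" -lt "$MIN_PEERS" ]; then
    echo "POSSIBLY PARTITIONED: $pod sees ${peers:-0} peers"
  fi
done
```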

### Recovery Actions

1. **Remove Problematic Network Rules**

   ```bash
   # Flush chaos-injected iptables rules inside the blockchain-node pods
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default $pod -- iptables -F
   done
   ```

2. **Restart Network Components**

   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**

   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```

### Verification

```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
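
Watching the raw height shows activity, but the partition is only healed once the height advances between samples. A sketch that checks for forward progress, assuming the `/v1/consensus` response shape used above:

```bash
# Confirm the chain height advances over a 30-second window
H1=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
sleep 30
H2=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
if [ "$H2" -gt "$H1" ]; then
  echo "consensus advancing: $H1 -> $H2"
else
  echo "consensus still stalled at height $H1"
fi
```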

## Runbook: Database Failure

### Based on Chaos Test: `chaos_test_database.py`

### Symptoms

- Database connection errors
- Service degradation
- Failed transactions
- High error rates

### MTTR Target: 3 minutes

### Immediate Actions (0-3 minutes)

```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
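
The raw replay timestamp is hard to read at a glance; expressing the lag as an interval makes the 3-minute MTTR call easier. A small sketch against the same replica, using standard PostgreSQL functions:

```bash
# Replica lag as an interval (near zero means the replica is caught up)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```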

### Investigation (3-10 minutes)

1. **Review Database Logs**

   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```

### Recovery Actions

1. **Kill Idle Connections**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**

   ```bash
   # Promote replica if primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```
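
After a promotion it is worth confirming the replica has actually left recovery and accepts writes before repointing traffic at it. A minimal sketch using standard PostgreSQL checks:

```bash
# Should return 'f' once the replica has been promoted to primary
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_is_in_recovery();"

# Prove writes are accepted without leaving anything behind (temporary table only)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "CREATE TEMP TABLE failover_check(x int);"
```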

### Verification

```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```

## Runbook: Redis Failure

### Symptoms

- Caching failures
- Session loss
- Increased database load
- Slow response times

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```

### Investigation (2-5 minutes)

1. **Review Redis Logs**

   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```
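
If step 2 shows `evicted_keys` climbing, the memory ceiling and eviction policy are worth checking before deleting any data; both are standard Redis settings:

```bash
# Current memory ceiling and eviction policy
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory-policy

# Peak vs. current usage gives a sense of headroom
kubectl exec -n default deployment/redis -- redis-cli info memory | grep -E "used_memory_human|used_memory_peak_human|maxmemory_human"
```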

### Recovery Actions

1. **Clear Expired Keys**

   ```bash
   # Run the scan/delete pipeline inside the Redis pod so both commands hit the same instance
   kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
   ```

2. **Restart Redis**

   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**

   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

### Verification

```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

## Runbook: High CPU/Memory Usage

### Symptoms

- Slow response times
- Pod evictions
- OOM errors
- System degradation

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry processes (batch mode, single snapshot)
kubectl exec -n default deployment/coordinator -- top -b -n 1

# 3. Check for OOM kills (run on the affected node)
dmesg | grep -i "killed process"
```

### Investigation (5-15 minutes)

1. **Analyze Resource Usage**

   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**

   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**

   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```

### Recovery Actions

1. **Scale Services**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**

   ```bash
   kubectl patch deployment coordinator -n default -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**

   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```
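
Manual scaling covers the immediate pressure; if the same services hit their limits repeatedly, a HorizontalPodAutoscaler removes the need for a human in the loop. A minimal sketch, with the thresholds as assumptions to tune rather than agreed policy:

```bash
# CPU-based autoscaling for the coordinator
kubectl autoscale deployment coordinator -n default --cpu-percent=70 --min=3 --max=10

# Confirm the HPA is tracking the deployment
kubectl get hpa -n default
```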

### Verification

```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

## Runbook: Storage Issues

### Symptoms

- Disk space warnings
- Write failures
- Database errors
- Pod crashes

### MTTR Target: 10 minutes

### Immediate Actions (0-10 minutes)

```bash
# 1. Check disk usage (node, then database pod)
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```

### Investigation (10-20 minutes)

1. **Analyze Storage Usage**

   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**

   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**

   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```

### Recovery Actions

1. **Expand Storage**

   ```bash
   kubectl patch pvc postgresql-data -n default -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**

   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete

   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```
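
PVC expansion in step 1 only completes once the storage class supports resizing and the filesystem has been grown; checking both avoids a false sense of recovery. A sketch using standard kubectl and in-pod checks:

```bash
# Reported capacity should show the new size once the resize finishes
kubectl get pvc postgresql-data -n default -o jsonpath='{.status.capacity.storage}'; echo

# Resize conditions indicate whether a pod restart is still required
kubectl describe pvc postgresql-data -n default | grep -A 5 Conditions

# The filesystem inside the pod should reflect the new size
kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
```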

### Verification

```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```

## Emergency Contact Procedures

### Escalation Matrix

1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)

### War Room Activation

```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```

### Customer Communication

1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)

## Post-Incident Checklist

### Immediate (0-1 hour)

- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified

### Short-term (1-24 hours)

- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled

### Long-term (1-7 days)

- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated

## Runbook Maintenance

### Review Schedule

- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision

### Update Process

1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation

---

*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*