# AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
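
Before opening a specific runbook, a quick cluster-wide triage pass usually identifies which scenario applies. A minimal sketch, assuming the `default` namespace used throughout these runbooks:

```bash
# Overall pod and node health
kubectl get pods -n default -o wide
kubectl top nodes

# The most recent warning events usually point at the failing component
kubectl get events -n default --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp | tail -20
```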

## Runbook: Coordinator API Outage

### Based on Chaos Test: `chaos_test_coordinator.py`

### Symptoms

- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```

### Investigation (2-10 minutes)

1. **Review Logs**

   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**

   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**

   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```

### Recovery Actions

1. **Scale Up if Resource Starved**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**

   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**

   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```
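
Whichever recovery action is taken, `kubectl rollout status` gives a bounded wait on the new pods that fits inside the 2-minute MTTR target; a minimal sketch:

```bash
# Block until the restarted/rolled-back pods are Ready, or fail after 120s
kubectl rollout status deployment/coordinator -n default --timeout=120s
```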

### Verification

```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
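
If the one-off curls above flap while pods are still coming up, a short polling loop confirms the API stays healthy rather than answering once; a sketch against the same health endpoint:

```bash
# Poll the health endpoint for up to 2 minutes
for i in $(seq 1 24); do
  if curl -sf http://127.0.0.2:8011/v1/health > /dev/null; then
    echo "coordinator healthy after ~$((i * 5))s"
    break
  fi
  [ "$i" -eq 24 ] && echo "coordinator still unhealthy after 120s"
  sleep 5
done
```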

## Runbook: Network Partition

### Based on Chaos Test: `chaos_test_network.py`

### Symptoms

- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```

### Investigation (5-15 minutes)

1. **Identify Partitioned Nodes**

   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**

   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**

   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```
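
If the per-pod peer counts above are uneven, a small variation of the same loop flags the pods that are likely on the wrong side of the partition. `MIN_PEERS` is an assumed threshold, not a documented value; set it from the expected cluster size:

```bash
# Flag blockchain-node pods reporting fewer peers than expected
MIN_PEERS=2
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  peers=$(kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length')
  if [ "${peers:-0}" -lt "$MIN_PEERS" ]; then
    echo "POSSIBLY PARTITIONED: $pod sees ${peers:-0} peers"
  fi
done
```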

### Recovery Actions

1. **Remove Problematic Network Rules**

   ```bash
   # Flush chaos-injected iptables rules inside the blockchain-node pods
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default $pod -- iptables -F
   done
   ```

2. **Restart Network Components**

   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**

   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```

### Verification

```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
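
Watching the raw height shows activity, but the partition is only healed once the height advances between samples. A sketch that checks for forward progress, assuming the `/v1/consensus` response shape used above:

```bash
# Confirm the chain height advances over a 30-second window
H1=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
sleep 30
H2=$(kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height)
if [ "$H2" -gt "$H1" ]; then
  echo "consensus advancing: $H1 -> $H2"
else
  echo "consensus still stalled at height $H1"
fi
```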

## Runbook: Database Failure

### Based on Chaos Test: `chaos_test_database.py`

### Symptoms

- Database connection errors
- Service degradation
- Failed transactions
- High error rates

### MTTR Target: 3 minutes

### Immediate Actions (0-3 minutes)

```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
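
The raw replay timestamp is hard to read at a glance; expressing the lag as an interval makes the 3-minute MTTR call easier. A small sketch against the same replica, using standard PostgreSQL functions:

```bash
# Replica lag as an interval (near zero means the replica is caught up)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```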

### Investigation (3-10 minutes)

1. **Review Database Logs**

   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```

### Recovery Actions

1. **Kill Idle Connections**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**

   ```bash
   # Promote replica if primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```
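
After a promotion it is worth confirming the replica has actually left recovery and accepts writes before repointing traffic at it. A minimal sketch using standard PostgreSQL checks:

```bash
# Should return 'f' once the replica has been promoted to primary
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_is_in_recovery();"

# Prove writes are accepted without leaving anything behind (temporary table only)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "CREATE TEMP TABLE failover_check(x int);"
```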

### Verification

```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```

## Runbook: Redis Failure

### Symptoms

- Caching failures
- Session loss
- Increased database load
- Slow response times

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```

### Investigation (2-5 minutes)

1. **Review Redis Logs**

   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```
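
If step 2 shows `evicted_keys` climbing, the memory ceiling and eviction policy are worth checking before deleting any data; both are standard Redis settings:

```bash
# Current memory ceiling and eviction policy
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory-policy

# Peak vs. current usage gives a sense of headroom
kubectl exec -n default deployment/redis -- redis-cli info memory | grep -E "used_memory_human|used_memory_peak_human|maxmemory_human"
```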

### Recovery Actions

1. **Clear Expired Keys**

   ```bash
   # Run the scan/delete pipeline inside the Redis pod so both commands hit the same instance
   kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
   ```

2. **Restart Redis**

   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**

   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

### Verification

```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

## Runbook: High CPU/Memory Usage

### Symptoms

- Slow response times
- Pod evictions
- OOM errors
- System degradation

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry processes (batch mode, single snapshot)
kubectl exec -n default deployment/coordinator -- top -b -n 1

# 3. Check for OOM kills (run on the affected node)
dmesg | grep -i "killed process"
```

### Investigation (5-15 minutes)

1. **Analyze Resource Usage**

   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**

   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**

   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```

### Recovery Actions

1. **Scale Services**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**

   ```bash
   kubectl patch deployment coordinator -n default -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**

   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```
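
Manual scaling covers the immediate pressure; if the same services hit their limits repeatedly, a HorizontalPodAutoscaler removes the need for a human in the loop. A minimal sketch, with the thresholds as assumptions to tune rather than agreed policy:

```bash
# CPU-based autoscaling for the coordinator
kubectl autoscale deployment coordinator -n default --cpu-percent=70 --min=3 --max=10

# Confirm the HPA is tracking the deployment
kubectl get hpa -n default
```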

### Verification

```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

## Runbook: Storage Issues

### Symptoms

- Disk space warnings
- Write failures
- Database errors
- Pod crashes

### MTTR Target: 10 minutes

### Immediate Actions (0-10 minutes)

```bash
# 1. Check disk usage (node, then database pod)
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```

### Investigation (10-20 minutes)

1. **Analyze Storage Usage**

   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**

   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**

   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```

### Recovery Actions

1. **Expand Storage**

   ```bash
   kubectl patch pvc postgresql-data -n default -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**

   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete

   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```
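
PVC expansion in step 1 only completes once the storage class supports resizing and the filesystem has been grown; checking both avoids a false sense of recovery. A sketch using standard kubectl and in-pod checks:

```bash
# Reported capacity should show the new size once the resize finishes
kubectl get pvc postgresql-data -n default -o jsonpath='{.status.capacity.storage}'; echo

# Resize conditions indicate whether a pod restart is still required
kubectl describe pvc postgresql-data -n default | grep -A 5 Conditions

# The filesystem inside the pod should reflect the new size
kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
```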

### Verification

```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```

## Emergency Contact Procedures

### Escalation Matrix

1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)

### War Room Activation

```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```

### Customer Communication

1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)

## Post-Incident Checklist

### Immediate (0-1 hour)

- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified

### Short-term (1-24 hours)

- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled

### Long-term (1-7 days)

- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated

## Runbook Maintenance

### Review Schedule

- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision

### Update Process

1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation

---

*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*