# AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.

## Runbook: Coordinator API Outage

### Based on Chaos Test: `chaos_test_coordinator.py`

### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```

### Investigation (2-10 minutes)
1. **Review Logs**
   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**
   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**
   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```

### Recovery Actions
1. **Scale Up if Resource Starved**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**
   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**
   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```

### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with a sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```

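Confirming recovery within the 2-minute MTTR target is easier with a small polling loop than with hand-run curl. A minimal sketch, assuming the health URL above; the `wait_healthy` helper is illustrative, not part of our tooling:

```bash
# wait_healthy URL [TIMEOUT_SECS]: poll URL every 5s until it returns 2xx
# or the timeout elapses. Exits 0 on recovery, 1 on timeout.
wait_healthy() {
  url=$1; timeout=${2:-120}; elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    if curl -sf --max-time 5 "$url" > /dev/null; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "still unhealthy after ${timeout}s" >&2
  return 1
}
```

Usage during verification: `wait_healthy http://127.0.0.2:8011/v1/health 120`.
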
## Runbook: Network Partition

### Based on Chaos Test: `chaos_test_network.py`

### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```

### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**
   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**
   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```

### Recovery Actions
1. **Remove Problematic Network Rules**
   ```bash
   # Flush iptables on affected nodes (requires NET_ADMIN in the container)
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default "$pod" -- iptables -F
   done
   ```

2. **Restart Network Components**
   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**
   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```

### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```

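With many nodes, eyeballing the per-pod peer counts from the investigation loop gets error-prone. A small filter makes the partitioned side obvious; this is a sketch, and the minimum peer count of 3 is an assumption to tune, not a protocol constant:

```bash
# flag_low_peers [MIN]: read "pod peer_count" pairs on stdin, print pods below MIN.
flag_low_peers() {
  min=${1:-3}
  while read -r pod count; do
    [ -n "$pod" ] || continue
    if [ "$count" -lt "$min" ]; then
      echo "PARTITIONED? $pod has only $count peers"
    fi
  done
}

# Feed it from the peer-count loop, e.g.:
#   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
#     printf '%s %s\n' "$pod" "$(kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length')"
#   done | flag_low_peers 3
```
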
## Runbook: Database Failure

### Based on Chaos Test: `chaos_test_database.py`

### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates

### MTTR Target: 3 minutes

### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```

### Investigation (3-10 minutes)
1. **Review Database Logs**
   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```

### Recovery Actions
1. **Kill Idle Connections**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**
   ```bash
   # Promote the replica if the primary cannot be recovered
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```

### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```

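`pg_last_xact_replay_timestamp()` only reports when the replica last applied a transaction; during triage you usually want the lag in seconds and a pass/fail answer. A sketch with a hypothetical `lag_alert` helper; the 30-second threshold is an assumption:

```bash
# lag_alert LAG_SECONDS [THRESHOLD]: exit 0 if replica lag is acceptable, 1 otherwise.
lag_alert() {
  lag=${1%.*}          # drop fractional seconds from psql output
  threshold=${2:-30}
  if [ "${lag:-0}" -gt "$threshold" ]; then
    echo "replica lag ${lag}s exceeds ${threshold}s" >&2
    return 1
  fi
  return 0
}

# Wire it to the replica, e.g.:
#   lag=$(kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -At -c \
#     "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));")
#   lag_alert "$lag" 30
```
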
## Runbook: Redis Failure

### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```

### Investigation (2-5 minutes)
1. **Review Redis Logs**
   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```

### Recovery Actions
1. **Clear Stale Cache Keys**
   ```bash
   # Deletes every key matching the pattern; run the whole pipeline inside the
   # pod so both redis-cli invocations target the same instance.
   kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
   ```

2. **Restart Redis**
   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis**
   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

### Verification
```bash
# Test Redis connectivity from the coordinator
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

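A nonzero `evicted_keys` count usually means Redis hit its memory cap. If eviction (rather than failed writes) is the intended behavior under pressure, it must be configured explicitly; a hypothetical `redis.conf` excerpt, where the 2gb cap is illustrative rather than our actual setting:

```
# Cap memory and evict least-recently-used keys instead of rejecting writes
maxmemory 2gb
maxmemory-policy allkeys-lru
```

The same values can be applied live with `redis-cli config set maxmemory 2gb` while a permanent config change rolls out.
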
## Runbook: High CPU/Memory Usage

### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry processes
kubectl exec -n default deployment/coordinator -- top -b -n 1

# 3. Check for OOM kills (on the affected node)
dmesg | grep -i "killed process"
```

### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
   ```bash
   # Detailed per-process metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**
   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**
   ```bash
   # Check Prometheus metrics
   curl -s http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```

### Recovery Actions
1. **Scale Services**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**
   ```bash
   kubectl patch deployment coordinator -n default -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**
   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```

### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

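Manual `kubectl scale` fixes the immediate incident but has to be repeated each time load spikes. If this runbook fires repeatedly, an autoscaler is the longer-term fix; a sketch of an HPA manifest, where the 70% target and the replica bounds are assumptions to tune, and metrics-server must be installed:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coordinator
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coordinator
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
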
## Runbook: Storage Issues

### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes

### MTTR Target: 10 minutes

### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage (on the node and inside the database pod)
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Archive and truncate container logs (on the node hosting the pod)
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```

### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**
   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**
   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```

### Recovery Actions
1. **Expand Storage**
   ```bash
   # Requires a StorageClass with allowVolumeExpansion enabled
   kubectl patch pvc postgresql-data -n default -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**
   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete

   # Clean unused Docker images and build cache
   docker system prune -a
   ```

3. **Restart Services**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

### Verification
```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```

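The retention check above assumes a `log-cleanup` CronJob exists. If it turns out to be missing, a minimal sketch of one; the image, schedule, and hostPath are assumptions to adapt to the actual log volume layout:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-cleanup
  namespace: default
spec:
  schedule: "0 3 * * *"   # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: busybox:1.36
              command: ["sh", "-c", "find /var/log -name '*.log' -mtime +7 -delete"]
              volumeMounts:
                - name: logs
                  mountPath: /var/log
          volumes:
            - name: logs
              hostPath:
                path: /var/log
```
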
## Emergency Contact Procedures

### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical incidents only)

### War Room Activation
```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```

### Customer Communication
1. **Status Page Update** (within 5 minutes)
2. **Email Notification** (within 15 minutes)
3. **Twitter Update** (within 30 minutes, critical incidents only)

## Post-Incident Checklist

### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring back to normal
- [ ] Status page updated
- [ ] Stakeholders notified

### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled

### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated

## Runbook Maintenance

### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision

### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation

---

*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*