# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation and integration test suite.
## Integration Test Status (Updated 2026-01-26)

### Current Test Coverage
- ✅ 6 integration tests passing
- ✅ Security tests using real ZK proof features
- ✅ Marketplace tests connecting to live service
- ⏸️ 1 test skipped (wallet payment flow)
### Test Environment
- Tests run against both real and mock clients
- CI/CD pipeline runs full test suite
- Local development:

```bash
python -m pytest tests/integration/ -v
```
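Since the suite can run against either real or mock clients, it helps to force the mock backend explicitly for offline development. A minimal sketch, assuming an environment-variable switch; the variable name `AITBC_USE_MOCK_CLIENTS` is an assumption, not a documented flag, so check the test fixtures for the real one:

```bash
# Hypothetical: select mock clients for offline runs.
# The variable name below is an assumption for illustration only.
AITBC_USE_MOCK_CLIENTS=1 python -m pytest tests/integration/ -v
```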
## Runbook: Coordinator API Outage

*Based on Chaos Test: `chaos_test_coordinator.py`*

### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
**MTTR Target:** 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```
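During an incident it is faster to run all four checks in one shot. A convenience sketch, not part of the deployed tooling:

```bash
#!/usr/bin/env bash
# triage-coordinator.sh - run the standard first-two-minute checks in one pass.
set -euo pipefail
NS=default
SELECTOR="app.kubernetes.io/name=coordinator"

echo "== Pod status =="
kubectl get pods -n "$NS" -l "$SELECTOR"

echo "== Recent events =="
kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -20

echo "== Pod state (look for CrashLoopBackOff / OOMKilled) =="
kubectl describe pod -n "$NS" -l "$SELECTOR" | grep -E "State|Reason|Restart" || true
```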
### Investigation (2-10 minutes)
1. **Review Logs**

   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**

   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**

   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```
### Recovery Actions
1. **Scale Up if Resource Starved**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**

   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**

   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```
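Whichever action you take, confirm the rollout actually converged before declaring recovery; the 120-second timeout below mirrors the 2-minute MTTR target:

```bash
# Block until the new ReplicaSet is fully available, or fail after 120s.
kubectl rollout status deployment/coordinator -n default --timeout=120s
```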
### Verification

```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
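If the service is still coming back up, a short polling loop beats manual retries; it uses the same health endpoint as above:

```bash
# Poll the health endpoint every 5s until it returns 2xx (Ctrl-C to abort).
until curl -sf http://127.0.0.2:8011/v1/health > /dev/null; do
  echo "coordinator not healthy yet, retrying in 5s..."
  sleep 5
done
echo "coordinator healthy"
```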
## Runbook: Network Partition

*Based on Chaos Test: `chaos_test_network.py`*

### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
**MTTR Target:** 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**

   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**

   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**

   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```
### Recovery Actions
1. **Remove Problematic Network Rules**

   ```bash
   # Flush iptables on affected nodes
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default "$pod" -- iptables -F
   done
   ```

2. **Restart Network Components**

   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**

   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```
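After deleting pods, wait for the replacements to pass their readiness probes before checking consensus; re-peering only begins once the containers are up:

```bash
# Wait up to 5 minutes for all blockchain-node pods to become Ready.
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=blockchain-node -n default --timeout=300s
```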
### Verification

```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure

*Based on Chaos Test: `chaos_test_database.py`*

### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
**MTTR Target:** 3 minutes

### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
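The raw replay timestamp is hard to read at a glance; subtracting it from `now()` yields the lag directly (standard PostgreSQL functions, run on the replica):

```bash
# Report replication lag as an interval instead of a raw timestamp.
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```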
### Investigation (3-10 minutes)
1. **Review Database Logs**

   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```
### Recovery Actions
1. **Kill Idle Connections**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**

   ```bash
   # Promote replica if primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```
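Promotion alone does not move traffic: clients still resolve the old `postgresql` Service. A sketch of repointing the Service at the promoted replica; the selector label below is an assumption, so verify the actual pod labels first:

```bash
# Hypothetical: repoint the postgresql Service at the promoted replica.
# The selector label is an assumption - verify with:
#   kubectl get pods -n default --show-labels
kubectl patch service postgresql -n default \
  -p '{"spec":{"selector":{"app.kubernetes.io/name":"postgresql-replica"}}}'
```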
### Verification

```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure

### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
**MTTR Target:** 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**

   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```
### Recovery Actions
1. **Clear Expired Keys**

   ```bash
   # Caution: this deletes every key matching the pattern, not just expired
   # ones. The scan and the deletes must both run inside the pod, so wrap
   # the pipeline in a single shell there.
   kubectl exec -n default deployment/redis -- sh -c \
     'redis-cli --scan --pattern "*:*" | xargs -r -n 100 redis-cli del'
   ```

2. **Restart Redis**

   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**

   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

   Note: scaling the Deployment adds independent Redis instances; it only forms a real cluster if Redis Cluster or Sentinel is already configured.
### Verification

```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
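The `-w "@curl-format.txt"` flag reads a write-out template from the working directory; a minimal version using standard curl write-out variables looks like this:

```
# curl-format.txt - timing breakdown printed after each request
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
```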
## Runbook: High CPU/Memory Usage

### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
**MTTR Target:** 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry pods (batch mode; plain `top` needs a TTY)
kubectl exec -n default deployment/coordinator -- top -bn1 | head -20

# 3. Check for OOM kills (run on the affected node)
dmesg | grep -i "killed process"
```
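`dmesg` is only available on the node itself. From a workstation, the same signal is visible in pod status; this lists containers whose last termination was an OOM kill:

```bash
# List pods whose containers were last terminated by the OOM killer.
kubectl get pods -n default -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled
```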
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**

   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**

   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**

   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```
### Recovery Actions
1. **Scale Services**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**

   ```bash
   kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**

   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```
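If CPU pressure recurs, a HorizontalPodAutoscaler removes the manual scaling step. The thresholds below are illustrative, not tuned values:

```bash
# Autoscale coordinator between 2 and 10 replicas, targeting 70% CPU.
kubectl autoscale deployment coordinator -n default --cpu-percent=70 --min=2 --max=10
```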
### Verification

```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues

### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
**MTTR Target:** 10 minutes

### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**

   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**

   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**

   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```
### Recovery Actions
1. **Expand Storage** (see the StorageClass note after this list)

   ```bash
   kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**

   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete
   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```
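The PVC patch in step 1 only succeeds when the underlying StorageClass permits expansion; a quick pre-check:

```bash
# Confirm the StorageClass behind the PVC supports online expansion.
SC=$(kubectl get pvc postgresql-data -n default -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o jsonpath='{.allowVolumeExpansion}'
# Expect "true"; anything else means the patch above will be rejected.
```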
### Verification

```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures

### Escalation Matrix
- Level 1: On-call engineer (5 minutes)
- Level 2: On-call secondary (15 minutes)
- Level 3: Engineering manager (30 minutes)
- Level 4: CTO (1 hour, critical only)
### War Room Activation
```
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
- Status Page Update (5 minutes)
- Email Notification (15 minutes)
- Twitter Update (30 minutes, critical only)
## Post-Incident Checklist

### Immediate (0-1 hour)
- Service fully restored
- Monitoring normal
- Status page updated
- Stakeholders notified
### Short-term (1-24 hours)
- Incident document created
- Root cause identified
- Runbooks updated
- Post-mortem scheduled
### Long-term (1-7 days)
- Post-mortem completed
- Action items assigned
- Monitoring improved
- Process updated
## Runbook Maintenance

### Review Schedule
- Monthly: Review and update runbooks
- Quarterly: Full review and testing
- Annually: Major revision
### Update Process
- Test runbook procedures
- Document lessons learned
- Update procedures
- Train team members
- Update documentation
**Version:** 1.0 | **Last Updated:** 2024-12-22 | **Owner:** SRE Team