# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
## Runbook: Coordinator API Outage
### Based on Chaos Test: `chaos_test_coordinator.py`
### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
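# 5. Wait for the rollout to complete before verifying (120s matches the MTTR target)
kubectl rollout status deployment/coordinator -n default --timeout=120s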
```
### Investigation (2-10 minutes)
1. **Review Logs**
```bash
kubectl logs -n default deployment/coordinator --tail=100
```
2. **Check Resource Limits**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=coordinator
```
3. **Verify Database Connectivity**
```bash
kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
```
4. **Check Redis Connection**
```bash
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
```
### Recovery Actions
1. **Scale Up if Resource Starved**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
```
2. **Manual Pod Deletion if Stuck**
```bash
kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
```
3. **Rollback Deployment**
```bash
kubectl rollout undo deployment/coordinator -n default
```
### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
## Runbook: Network Partition
### Based on Chaos Test: `chaos_test_network.py`
### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
```bash
# Check each node's peer count
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  echo "Pod: $pod"
  kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
done
```
2. **Check Network Policies**
```bash
kubectl describe networkpolicy default-deny-all-ingress -n default
kubectl describe networkpolicy blockchain-node-netpol -n default
```
3. **Verify DNS Resolution**
```bash
kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
```
### Recovery Actions
1. **Remove Problematic Network Rules**
```bash
# Flush iptables on affected nodes
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  kubectl exec -n default $pod -- iptables -F
done
```
2. **Restart Network Components**
```bash
kubectl rollout restart deployment/blockchain-node -n default
```
3. **Force Re-peering**
```bash
# Delete and recreate pods to force re-peering
kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
```
### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
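If consensus is slow to resume, compare block heights across all nodes with the same per-pod loop used during investigation; heights should converge once the partition is healed:
```bash
# Print each node's current height (same /v1/consensus endpoint as above); values should match within a block or two
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$pod: "
  kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/consensus | jq .height
done
```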
## Runbook: Database Failure
### Based on Chaos Test: `chaos_test_database.py`
### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
### MTTR Target: 3 minutes
### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
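# Replication lag as an interval (runs on the replica; same psql access as above)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"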
```
### Investigation (3-10 minutes)
1. **Review Database Logs**
```bash
kubectl logs -n default deployment/postgresql --tail=100
```
2. **Check Resource Usage**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=postgresql
kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
```
3. **Identify Long-running Queries**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
### Recovery Actions
1. **Kill Idle Connections**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
```
2. **Restart PostgreSQL**
```bash
kubectl rollout restart deployment/postgresql -n default
```
3. **Failover to Replica**
```bash
# Promote replica if primary fails
kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
```
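Promotion alone does not repoint application traffic. One option is to switch the `postgresql` Service selector to the replica pods; the label value below is an assumption and should be verified against the actual deployment before use:
```bash
# Repoint the postgresql Service at the promoted replica (label value is an assumption; confirm with `kubectl get pods --show-labels`)
kubectl patch service postgresql -n default -p '{"spec":{"selector":{"app.kubernetes.io/name":"postgresql-replica"}}}'
```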
### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure
### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**
```bash
kubectl logs -n default deployment/redis --tail=100
```
2. **Check for Eviction**
```bash
kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
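# Eviction policy and memory ceiling give context for the evicted_keys counter
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory-policy
kubectl exec -n default deployment/redis -- redis-cli config get maxmemory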
```
3. **Identify Large Keys**
```bash
kubectl exec -n default deployment/redis -- redis-cli --bigkeys
```
### Recovery Actions
1. **Delete Stale Keys**
```bash
# Run the scan and delete inside the Redis pod so both commands hit the same instance; this removes every key matching the pattern
kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
```
2. **Restart Redis**
```bash
kubectl rollout restart deployment/redis -n default
```
3. **Scale Redis Cluster**
```bash
kubectl scale deployment/redis --replicas=3 -n default
```
### Verification
```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
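The `@curl-format.txt` timing template referenced here (and again in the CPU/memory runbook) is not included in this document; a minimal sketch using standard `curl -w` write-out variables:
```bash
# Write a timing template for curl -w "@curl-format.txt"
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
EOF
```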
## Runbook: High CPU/Memory Usage
### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry processes inside a pod (batch mode so it works without a TTY)
kubectl exec -n default deployment/coordinator -- top -b -n 1 | head -20
# 3. Check for OOM kills (run on the node hosting the affected pod)
dmesg | grep -i "killed process"
```
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
```bash
# Detailed pod metrics
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
```
2. **Check Resource Limits**
```bash
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
```
3. **Review Application Metrics**
```bash
# Check Prometheus metrics
curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
```
### Recovery Actions
1. **Scale Services**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
kubectl scale deployment/blockchain-node --replicas=3 -n default
```
2. **Increase Resource Limits**
```bash
kubectl patch deployment coordinator -n default -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
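# Equivalent without inline JSON (container name "coordinator" is an assumption; confirm in the pod spec)
kubectl set resources deployment/coordinator -n default -c coordinator --limits=cpu=2000m,memory=4Gi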
```
3. **Restart Affected Services**
```bash
kubectl rollout restart deployment/coordinator -n default
```
### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues
### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
### MTTR Target: 10 minutes
### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage (run on the affected node, then inside the database pod)
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files (first command on the node, second inside the pod)
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs (capture recent logs first; the truncate must run on the node that hosts the pod)
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
```bash
du -sh /var/log/*
du -sh /var/lib/docker/*
```
2. **Check PVC Usage**
```bash
kubectl get pvc -n default
kubectl describe pvc postgresql-data -n default
```
3. **Review Retention Policies**
```bash
kubectl get cronjobs -n default
kubectl describe cronjob log-cleanup -n default
```
### Recovery Actions
1. **Expand Storage**
```bash
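# Expansion only succeeds if the backing StorageClass allows it; check before patching
kubectl get pvc postgresql-data -n default -o jsonpath='{.spec.storageClassName}{"\n"}'
kubectl get storageclass -o custom-columns=NAME:.metadata.name,EXPANDABLE:.allowVolumeExpansion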
kubectl patch pvc postgresql-data -n default -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
2. **Force Cleanup**
```bash
# Clean old logs
find /var/log -name "*.log" -mtime +7 -delete
# Clean Docker images
docker system prune -a
```
3. **Restart Services**
```bash
kubectl rollout restart deployment/postgresql -n default
```
### Verification
```bash
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures
### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)
### War Room Activation
```text
# These are Slack / Zoom slash commands, issued from the Slack client rather than a shell
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)
## Post-Incident Checklist
### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified
### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled
### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated
## Runbook Maintenance
### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision
### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation
---
*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*