AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
Runbook: Coordinator API Outage
Based on Chaos Test: chaos_test_coordinator.py
Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
MTTR Target: 2 minutes
Immediate Actions (0-2 minutes)
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
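If symptoms persist after these checks, the steps above can be combined into a single triage pass. The following is a minimal sketch that reuses the labels, namespace, and health URL from this runbook; adjust the names to match your release.
# Quick triage sketch: restart only when the health endpoint is actually failing
if ! curl -sf --max-time 5 http://127.0.0.2:8011/v1/health > /dev/null; then
  echo "Coordinator health check failing; current pod state:"
  kubectl get pods -n default -l app.kubernetes.io/name=coordinator -o wide
  kubectl rollout restart deployment/coordinator -n default
  kubectl rollout status deployment/coordinator -n default --timeout=120s
fi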
Investigation (2-10 minutes)
- Review Logs
  kubectl logs -n default deployment/coordinator --tail=100
- Check Resource Limits
  kubectl top pods -n default -l app.kubernetes.io/name=coordinator
- Verify Database Connectivity
  kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
- Check Redis Connection
  kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
Recovery Actions
- Scale Up if Resource Starved
  kubectl scale deployment/coordinator --replicas=5 -n default
- Manual Pod Deletion if Stuck
  kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
- Rollback Deployment
  kubectl rollout undo deployment/coordinator -n default
Verification
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
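To confirm the 2-minute MTTR target was met, a short polling loop can record roughly how long the endpoint took to come back. A sketch using the same health URL as above:
# Poll the health endpoint for up to 2 minutes (the MTTR target)
for i in $(seq 1 24); do
  if curl -sf http://127.0.0.2:8011/v1/health > /dev/null; then
    echo "Coordinator healthy after ~$((i * 5))s"
    break
  fi
  sleep 5
done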
Runbook: Network Partition
Based on Chaos Test: chaos_test_network.py
Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
MTTR Target: 5 minutes
Immediate Actions (0-5 minutes)
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
Investigation (5-15 minutes)
- Identify Partitioned Nodes
  # Check each node's peer count
  for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
    echo "Pod: $pod"
    kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
  done
- Check Network Policies
  kubectl describe networkpolicy default-deny-all-ingress -n default
  kubectl describe networkpolicy blockchain-node-netpol -n default
- Verify DNS Resolution
  kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
Recovery Actions
- Remove Problematic Network Rules
  # Flush iptables on affected nodes
  for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
    kubectl exec -n default $pod -- iptables -F
  done
- Restart Network Components
  kubectl rollout restart deployment/blockchain-node -n default
- Force Re-peering
  # Delete and recreate pods to force re-peering
  kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
Verification
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
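Watching a single node's height does not confirm the partition has healed; comparing heights across all blockchain-node pods does. A sketch assuming the /v1/consensus response exposes a .height field, as used above:
# Partition is healed when all pods report (approximately) the same height
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
  height=$(kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/consensus | jq -r '.height')
  echo "$pod height=$height"
done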
Runbook: Database Failure
Based on Chaos Test: chaos_test_database.py
Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
MTTR Target: 3 minutes
Immediate Actions (0-3 minutes)
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
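pg_last_xact_replay_timestamp() only tells you when the replica last replayed WAL; converting it to seconds of lag is more actionable. A minimal sketch, assuming standard streaming replication and the replica deployment name used above:
# Approximate replica lag in seconds (near 0 is healthy; NULL means nothing replayed yet)
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"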
Investigation (3-10 minutes)
- Review Database Logs
  kubectl logs -n default deployment/postgresql --tail=100
- Check Resource Usage
  kubectl top pods -n default -l app.kubernetes.io/name=postgresql
  kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
- Identify Long-running Queries
  kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
Recovery Actions
- Kill Idle Connections
  kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
- Restart PostgreSQL
  kubectl rollout restart deployment/postgresql -n default
- Failover to Replica
  # Promote the replica if the primary cannot be recovered, then repoint clients (see the sketch below)
  kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
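After promotion, application traffic still resolves to the old primary through the postgresql Service. The sketch below repoints the Service at the promoted replica; the selector labels shown are assumptions, so verify the actual selector in your chart before patching.
# Inspect the current selector first, then repoint it (label values below are assumptions)
kubectl get service postgresql -n default -o yaml | grep -A 5 selector
kubectl patch service postgresql -n default -p '{"spec":{"selector":{"app.kubernetes.io/name":"postgresql-replica"}}}'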
Verification
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
Runbook: Redis Failure
Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
MTTR Target: 2 minutes
Immediate Actions (0-2 minutes)
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
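A quick look at the cache hit/miss counters helps distinguish "Redis is down" from "Redis is thrashing". These are standard INFO stats fields; the deployment name matches the one used above.
# Hit rate = keyspace_hits / (keyspace_hits + keyspace_misses); a falling rate plus evictions suggests memory pressure
kubectl exec -n default deployment/redis -- redis-cli info stats | grep -E "keyspace_(hits|misses)"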
Investigation (2-5 minutes)
- Review Redis Logs
  kubectl logs -n default deployment/redis --tail=100
- Check for Eviction
  kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
- Identify Large Keys
  kubectl exec -n default deployment/redis -- redis-cli --bigkeys
Recovery Actions
- Clear Cache Keys
  # Warning: this deletes every key matching the pattern, not only expired ones
  kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
- Restart Redis
  kubectl rollout restart deployment/redis -n default
- Scale Redis Cluster
  kubectl scale deployment/redis --replicas=3 -n default
Verification
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
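The timing checks here and in the High CPU runbook below reference a curl-format.txt template that is not included in this document. If one is not already checked in, a minimal version can be created as shown; all fields are standard curl -w write-out variables.
# Create a minimal curl write-out template for response-time breakdowns
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF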
Runbook: High CPU/Memory Usage
Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
MTTR Target: 5 minutes
Immediate Actions (0-5 minutes)
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry processes inside the coordinator pod
kubectl exec -n default deployment/coordinator -- top -b -n 1 | head -20
# 3. Check for OOM kills
dmesg | grep -i "killed process"
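dmesg is only visible on the node itself. From kubectl, OOM-killed containers can also be found in pod status; a sketch that assumes jq is installed on the operator machine:
# List containers whose last termination reason was OOMKilled
kubectl get pods -n default -o json | jq -r '.items[] | . as $p | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($p.metadata.name)/\(.name)"'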
Investigation (5-15 minutes)
- Analyze Resource Usage
  # Detailed pod metrics
  kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
  kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
- Check Resource Limits
  kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
- Review Application Metrics
  # Check Prometheus metrics
  curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
Recovery Actions
- Scale Services
  kubectl scale deployment/coordinator --replicas=5 -n default
  kubectl scale deployment/blockchain-node --replicas=3 -n default
- Increase Resource Limits
  kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
- Restart Affected Services
  kubectl rollout restart deployment/coordinator -n default
Verification
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
Runbook: Storage Issues
Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
MTTR Target: 10 minutes
Immediate Actions (0-10 minutes)
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
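If df shows pressure but the offending files are not obvious, ranking directories by size usually finds the culprit faster than find. A sketch to run on the affected node and inside the database pod:
# Largest directories on the current filesystem (-x stays on one filesystem)
du -xh / 2>/dev/null | sort -rh | head -20
# Same idea inside the database pod
kubectl exec -n default deployment/postgresql -- du -sh /var/lib/postgresql/data/*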
Investigation (10-20 minutes)
- Analyze Storage Usage
  du -sh /var/log/*
  du -sh /var/lib/docker/*
- Check PVC Usage
  kubectl get pvc -n default
  kubectl describe pvc postgresql-data -n default
- Review Retention Policies
  kubectl get cronjobs -n default
  kubectl describe cronjob log-cleanup -n default
Recovery Actions
- Expand Storage
  kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
- Force Cleanup
  # Clean old logs
  find /var/log -name "*.log" -mtime +7 -delete
  # Clean Docker images
  docker system prune -a
- Restart Services
  kubectl rollout restart deployment/postgresql -n default
Verification
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
Emergency Contact Procedures
Escalation Matrix
- Level 1: On-call engineer (5 minutes)
- Level 2: On-call secondary (15 minutes)
- Level 3: Engineering manager (30 minutes)
- Level 4: CTO (1 hour, critical only)
War Room Activation
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
Customer Communication
- Status Page Update (5 minutes)
- Email Notification (15 minutes)
- Twitter Update (30 minutes, critical only)
Post-Incident Checklist
Immediate (0-1 hour)
- Service fully restored
- Monitoring normal
- Status page updated
- Stakeholders notified
Short-term (1-24 hours)
- Incident document created
- Root cause identified
- Runbooks updated
- Post-mortem scheduled
Long-term (1-7 days)
- Post-mortem completed
- Action items assigned
- Monitoring improved
- Process updated
Runbook Maintenance
Review Schedule
- Monthly: Review and update runbooks
- Quarterly: Full review and testing
- Annually: Major revision
Update Process
- Test runbook procedures
- Document lessons learned
- Update procedures
- Train team members
- Update documentation
Version: 1.0
Last Updated: 2024-12-22
Owner: SRE Team