# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation and integration test suite.
## Integration Test Status (Updated 2026-01-26)

### Current Test Coverage
- ✅ 6 integration tests passing
- ✅ Security tests using real ZK proof features
- ✅ Marketplace tests connecting to live service
- ⏸️ 1 test skipped (wallet payment flow)
### Test Environment
- Tests run against both real and mock clients
- CI/CD pipeline runs full test suite
- Local development:

```bash
python -m pytest tests/integration/ -v
```
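Since the suite can run against either real or mock clients, it helps to force the mock backend explicitly for offline development. A minimal sketch, assuming an environment-variable switch; the variable name `AITBC_USE_MOCK_CLIENTS` is an assumption, not a documented flag, so check the test fixtures for the real one:

```bash
# Hypothetical: select mock clients for offline runs.
# The variable name below is an assumption for illustration only.
AITBC_USE_MOCK_CLIENTS=1 python -m pytest tests/integration/ -v
```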
## Runbook: Coordinator API Outage

*Based on Chaos Test: `chaos_test_coordinator.py`*

### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
**MTTR Target:** 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```
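During an incident it is faster to run all four checks in one shot. A convenience sketch, not part of the deployed tooling:

```bash
#!/usr/bin/env bash
# triage-coordinator.sh - run the standard first-two-minute checks in one pass.
set -euo pipefail
NS=default
SELECTOR="app.kubernetes.io/name=coordinator"

echo "== Pod status =="
kubectl get pods -n "$NS" -l "$SELECTOR"

echo "== Recent events =="
kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -20

echo "== Pod state (look for CrashLoopBackOff / OOMKilled) =="
kubectl describe pod -n "$NS" -l "$SELECTOR" | grep -E "State|Reason|Restart" || true
```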
### Investigation (2-10 minutes)
1. **Review Logs**

   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**

   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**

   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```
### Recovery Actions
1. **Scale Up if Resource Starved**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**

   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**

   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```
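Whichever action you take, confirm the rollout actually converged before declaring recovery; the 120-second timeout below mirrors the 2-minute MTTR target:

```bash
# Block until the new ReplicaSet is fully available, or fail after 120s.
kubectl rollout status deployment/coordinator -n default --timeout=120s
```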
### Verification

```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
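If the service is still coming back up, a short polling loop beats manual retries; it uses the same health endpoint as above:

```bash
# Poll the health endpoint every 5s until it returns 2xx (Ctrl-C to abort).
until curl -sf http://127.0.0.2:8011/v1/health > /dev/null; do
  echo "coordinator not healthy yet, retrying in 5s..."
  sleep 5
done
echo "coordinator healthy"
```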
## Runbook: Network Partition

*Based on Chaos Test: `chaos_test_network.py`*

### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
**MTTR Target:** 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**

   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default "$pod" -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**

   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**

   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```
### Recovery Actions
1. **Remove Problematic Network Rules**

   ```bash
   # Flush iptables on affected nodes
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default "$pod" -- iptables -F
   done
   ```

2. **Restart Network Components**

   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**

   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```
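After deleting pods, wait for the replacements to pass their readiness probes before checking consensus; re-peering only begins once the containers are up:

```bash
# Wait up to 5 minutes for all blockchain-node pods to become Ready.
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=blockchain-node -n default --timeout=300s
```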
### Verification

```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure

*Based on Chaos Test: `chaos_test_database.py`*

### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
**MTTR Target:** 3 minutes

### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
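The raw replay timestamp is hard to read at a glance; subtracting it from `now()` yields the lag directly (standard PostgreSQL functions, run on the replica):

```bash
# Report replication lag as an interval instead of a raw timestamp.
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```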
### Investigation (3-10 minutes)
1. **Review Database Logs**

   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**

   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```
### Recovery Actions
1. **Kill Idle Connections**

   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**

   ```bash
   # Promote replica if primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```
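Promotion alone does not move traffic: clients still resolve the old `postgresql` Service. A sketch of repointing the Service at the promoted replica; the selector label below is an assumption, so verify the actual pod labels first:

```bash
# Hypothetical: repoint the postgresql Service at the promoted replica.
# The selector label is an assumption - verify with:
#   kubectl get pods -n default --show-labels
kubectl patch service postgresql -n default \
  -p '{"spec":{"selector":{"app.kubernetes.io/name":"postgresql-replica"}}}'
```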
### Verification

```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure

### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
**MTTR Target:** 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**

   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**

   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```
### Recovery Actions
1. **Clear Expired Keys**

   ```bash
   # Caution: this deletes every key matching the pattern, not just expired
   # ones. The scan and the deletes must both run inside the pod, so wrap
   # the pipeline in a single shell there.
   kubectl exec -n default deployment/redis -- sh -c \
     'redis-cli --scan --pattern "*:*" | xargs -r -n 100 redis-cli del'
   ```

2. **Restart Redis**

   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**

   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

   Note: scaling the Deployment adds independent Redis instances; it only forms a real cluster if Redis Cluster or Sentinel is already configured.
### Verification

```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
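The `-w "@curl-format.txt"` flag reads a write-out template from the working directory; a minimal version using standard curl write-out variables looks like this:

```
# curl-format.txt - timing breakdown printed after each request
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
```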
## Runbook: High CPU/Memory Usage

### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
**MTTR Target:** 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry pods (batch mode; plain `top` needs a TTY)
kubectl exec -n default deployment/coordinator -- top -bn1 | head -20

# 3. Check for OOM kills (run on the affected node)
dmesg | grep -i "killed process"
```
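`dmesg` is only available on the node itself. From a workstation, the same signal is visible in pod status; this lists containers whose last termination was an OOM kill:

```bash
# List pods whose containers were last terminated by the OOM killer.
kubectl get pods -n default -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled
```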
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**

   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**

   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**

   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```
### Recovery Actions
1. **Scale Services**

   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**

   ```bash
   kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**

   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```
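If CPU pressure recurs, a HorizontalPodAutoscaler removes the manual scaling step. The thresholds below are illustrative, not tuned values:

```bash
# Autoscale coordinator between 2 and 10 replicas, targeting 70% CPU.
kubectl autoscale deployment coordinator -n default --cpu-percent=70 --min=2 --max=10
```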
### Verification

```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues

### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
**MTTR Target:** 10 minutes

### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**

   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**

   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**

   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```
### Recovery Actions
1. **Expand Storage** (see the StorageClass note after this list)

   ```bash
   kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**

   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete
   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**

   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```
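The PVC patch in step 1 only succeeds when the underlying StorageClass permits expansion; a quick pre-check:

```bash
# Confirm the StorageClass behind the PVC supports online expansion.
SC=$(kubectl get pvc postgresql-data -n default -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o jsonpath='{.allowVolumeExpansion}'
# Expect "true"; anything else means the patch above will be rejected.
```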
### Verification

```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures

### Escalation Matrix
- Level 1: On-call engineer (5 minutes)
- Level 2: On-call secondary (15 minutes)
- Level 3: Engineering manager (30 minutes)
- Level 4: CTO (1 hour, critical only)
### War Room Activation
```
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
- Status Page Update (5 minutes)
- Email Notification (15 minutes)
- Twitter Update (30 minutes, critical only)
## Post-Incident Checklist

### Immediate (0-1 hour)
- Service fully restored
- Monitoring normal
- Status page updated
- Stakeholders notified
### Short-term (1-24 hours)
- Incident document created
- Root cause identified
- Runbooks updated
- Post-mortem scheduled
### Long-term (1-7 days)
- Post-mortem completed
- Action items assigned
- Monitoring improved
- Process updated
## Runbook Maintenance

### Review Schedule
- Monthly: Review and update runbooks
- Quarterly: Full review and testing
- Annually: Major revision
### Update Process
- Test runbook procedures
- Document lessons learned
- Update procedures
- Train team members
- Update documentation
**Version:** 1.0 | **Last Updated:** 2024-12-22 | **Owner:** SRE Team