chore: standardize configuration, logging, and error handling across blockchain node and coordinator API

- Add infrastructure.md and workflow files to .gitignore to prevent sensitive info leaks
- Change blockchain node mempool backend default from memory to database for persistence
- Refactor blockchain node logger with StructuredLogFormatter and AuditLogger (consistent with coordinator)
- Add structured logging fields: service, module, function, line number
- Unify coordinator config with Database
oib
2026-02-13 22:39:43 +01:00
parent 0cbd2b507c
commit 06e48ef34b
196 changed files with 4660 additions and 20090 deletions


@@ -0,0 +1,20 @@
# Deployment & Operations
Deploy, operate, and maintain AITBC infrastructure.
## Reading Order
| # | File | What you learn |
|---|------|----------------|
| 1 | [1_remote-deployment-guide.md](./1_remote-deployment-guide.md) | Deploy to remote servers |
| 2 | [2_service-naming-convention.md](./2_service-naming-convention.md) | Systemd service names and standards |
| 3 | [3_backup-restore.md](./3_backup-restore.md) | Backup PostgreSQL, Redis, ledger data |
| 4 | [4_incident-runbooks.md](./4_incident-runbooks.md) | Handle outages and incidents |
| 5 | [5_marketplace-deployment.md](./5_marketplace-deployment.md) | Deploy GPU marketplace endpoints |
| 6 | [6_beta-release-plan.md](./6_beta-release-plan.md) | Beta release checklist and timeline |
## Related
- [Installation](../0_getting_started/2_installation.md) — Initial setup
- [Security](../9_security/) — Security architecture and hardening
- [Architecture](../6_architecture/) — System design docs


@@ -0,0 +1,138 @@
# AITBC Remote Deployment Guide
## Overview
This deployment strategy builds the blockchain node directly on the ns3 server to utilize its gigabit connection, avoiding slow uploads from localhost.
## Quick Start
### 1. Deploy Everything
```bash
./scripts/deploy/deploy-all-remote.sh
```
This will:
- Copy deployment scripts to ns3
- Copy blockchain source code from localhost
- Build blockchain node directly on server
- Deploy a lightweight HTML-based explorer
- Configure port forwarding
### 2. Access Services
**Blockchain Node RPC:**
- Internal: http://localhost:8082
- External: http://aitbc.keisanki.net:8082
**Blockchain Explorer:**
- Internal: http://localhost:3000
- External: http://aitbc.keisanki.net:3000
## Architecture
```
ns3-root (95.216.198.140)
├── Blockchain Node (port 8082)
│ ├── Auto-syncs on startup
│ └── Serves RPC API
└── Explorer (port 3000)
├── Static HTML/CSS/JS
├── Served by nginx
└── Connects to local node
```
## Key Features
### Blockchain Node
- Built directly on server from source code
- Source copied from localhost via scp
- Auto-sync on startup
- No large file uploads needed
- Uses server's gigabit connection
### Explorer
- Pure HTML/CSS/JS (no build step)
- Served by nginx
- Real-time block viewing
- Transaction details
- Auto-refresh every 30 seconds
## Manual Deployment
If you need to deploy components separately:
### Blockchain Node Only
```bash
ssh ns3-root
cd /opt
./deploy-blockchain-remote.sh
```
### Explorer Only
```bash
ssh ns3-root
cd /opt
./deploy-explorer-remote.sh
```
## Troubleshooting
### Check Services
```bash
# On ns3 server
systemctl status blockchain-node blockchain-rpc nginx
# Check logs
journalctl -u blockchain-node -f
journalctl -u blockchain-rpc -f
journalctl -u nginx -f
```
### Test RPC
```bash
# From ns3
curl http://localhost:8082/rpc/head
# From external
curl http://aitbc.keisanki.net:8082/rpc/head
```
### Port Forwarding
If port forwarding doesn't work:
```bash
# Check iptables rules
iptables -t nat -L -n
# Re-add rules
iptables -t nat -A PREROUTING -p tcp --dport 8082 -j DNAT --to-destination 192.168.100.10:8082
iptables -t nat -A POSTROUTING -p tcp -d 192.168.100.10 --dport 8082 -j MASQUERADE
```
## Configuration
### Blockchain Node
Location: `/opt/blockchain-node/.env`
- Chain ID: ait-devnet
- RPC Port: 8082
- P2P Port: 7070
- Auto-sync: enabled
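The values above might correspond to an `.env` along these lines (a sketch; the variable names are assumptions, so check the file on the server for the real keys):

```bash
# /opt/blockchain-node/.env -- hypothetical key names
CHAIN_ID=ait-devnet
RPC_PORT=8082
P2P_PORT=7070
AUTO_SYNC=true
```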
### Explorer
Location: `/opt/blockchain-explorer/index.html`
- Served by nginx on port 3000
- Connects to localhost:8082
- No configuration needed
## Security Notes
- Services run as root (simplified for dev)
- No authentication on RPC (dev only)
- Port forwarding exposes services externally
- Consider firewall rules for production
## Next Steps
1. Set up proper authentication
2. Configure HTTPS with SSL certificates
3. Add multiple peers for network resilience
4. Implement proper backup procedures
5. Set up monitoring and alerting


@@ -0,0 +1,85 @@
# AITBC Service Naming Convention
## Updated Service Names (2026-02-13)
All AITBC systemd services now follow the `aitbc-` prefix convention for consistency and easier management.
### Site A (aitbc.bubuit.net) - Production Services
| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-1.service | 8081 | Blockchain Node 1 |
| blockchain-node-2.service | aitbc-blockchain-node-2.service | 8082 | Blockchain Node 2 |
| blockchain-rpc.service | aitbc-blockchain-rpc-1.service | - | RPC API for Node 1 |
| blockchain-rpc-2.service | aitbc-blockchain-rpc-2.service | - | RPC API for Node 2 |
| coordinator-api.service | aitbc-coordinator-api.service | 8000 | Coordinator API |
| exchange-mock-api.service | aitbc-exchange-mock-api.service | - | Exchange Mock API |
### Site B (ns3 container) - Remote Node
| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-3.service | 8082 | Blockchain Node 3 |
| blockchain-rpc.service | aitbc-blockchain-rpc-3.service | - | RPC API for Node 3 |
### Already Compliant Services
These services already had the `aitbc-` prefix:
- aitbc-exchange-api.service (port 3003)
- aitbc-exchange.service (port 3002)
- aitbc-miner-dashboard.service
### Removed Services
- aitbc-blockchain.service (legacy, was on port 9080)
## Management Commands
### Check Service Status
```bash
# Site A (via SSH)
ssh aitbc-cascade "systemctl status aitbc-blockchain-node-1.service"
# Site B (via SSH)
ssh ns3-root "incus exec aitbc -- systemctl status aitbc-blockchain-node-3.service"
```
### Restart Services
```bash
# Site A
ssh aitbc-cascade "sudo systemctl restart aitbc-blockchain-node-1.service"
# Site B
ssh ns3-root "incus exec aitbc -- sudo systemctl restart aitbc-blockchain-node-3.service"
```
### View Logs
```bash
# Site A
ssh aitbc-cascade "journalctl -u aitbc-blockchain-node-1.service -f"
# Site B
ssh ns3-root "incus exec aitbc -- journalctl -u aitbc-blockchain-node-3.service -f"
```
## Service Dependencies
### Blockchain Nodes
- Node 1: `/opt/blockchain-node` → port 8081
- Node 2: `/opt/blockchain-node-2` → port 8082
- Node 3: `/opt/blockchain-node` → port 8082 (Site B)
### RPC Services
- RPC services are companion services to the main nodes
- They provide HTTP API endpoints for blockchain operations
### Coordinator API
- Main API for job submission, miner management, and receipts
- Runs on localhost:8000 inside container
- Proxied via nginx at https://aitbc.bubuit.net/api/
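The nginx proxying described above might be configured along these lines (a sketch only; the real site config is not shown in this document):

```nginx
# Hypothetical location block forwarding /api/ to the coordinator
location /api/ {
    proxy_pass http://127.0.0.1:8000/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```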
## Benefits of Standardized Naming
1. **Clarity**: Easy to identify AITBC services among system services
2. **Management**: Simpler to filter and manage with wildcards (`systemctl status aitbc-*`)
3. **Documentation**: Consistent naming across all documentation
4. **Automation**: Easier scripting and automation with predictable names
5. **Debugging**: Faster identification of service-related issues
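Benefit 4 in practice: predictable names are easy to generate in scripts. A small sketch, using just the two Site A nodes listed above:

```shell
# Build the Site A node service names from the aitbc- prefix
# convention; a predictable scheme makes loops like this safe.
for n in 1 2; do
  printf 'aitbc-blockchain-node-%s.service\n' "$n"
done
```

Feeding such a list into `systemctl status`, or using the wildcard form `systemctl status 'aitbc-*'` directly, covers every node in one command.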


@@ -0,0 +1,316 @@
# AITBC Backup and Restore Procedures
This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.
## Overview
The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently
## Components
### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)
### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)
### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)
## Automated Backups
### Kubernetes CronJob
The automated backup system runs daily at 2:00 AM UTC:
```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml
# Check CronJob status
kubectl get cronjob aitbc-backup
# View backup jobs
kubectl get jobs -l app=aitbc-backup
# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```
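For orientation, a CronJob wired up this way might look roughly like the sketch below (image, script path, and service account are assumptions; the authoritative manifest is `infra/k8s/backup-cronjob.yaml`):

```yaml
# Hypothetical sketch of backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aitbc-backup
spec:
  schedule: "0 2 * * *"   # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: aitbc/backup:latest        # assumed image name
              command: ["/scripts/run_backups.sh"]
```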
### Backup Schedule
| Time (UTC) | Component | Type | Retention |
|------------|----------------|------------|-----------|
| 02:00 | PostgreSQL | Full | 30 days |
| 02:01 | Redis | Full | 30 days |
| 02:02 | Ledger | Full | 30 days |
## Manual Backups
### PostgreSQL
```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)
# View available backups
ls -la /tmp/postgresql-backups/
# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup.sql.gz s3://aitbc-backups-default/postgresql/
```
### Redis
```bash
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
# Force background save before backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
```
### Ledger Storage
```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)
# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```
## Restore Procedures
### PostgreSQL Restore
```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/
# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/
# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz
# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```
### Redis Restore
```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default
# Clear existing data
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb
# Start Redis service
kubectl scale deployment redis --replicas=1 -n default
# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```
### Ledger Restore
```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default
# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/
# Copy ledger data
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/
# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default
# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```
## Disaster Recovery
### Recovery Time Objective (RTO)
| Component | RTO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 1 hour | Database restore from backup |
| Redis | 15 minutes | Cache rebuild from backup |
| Ledger | 2 hours | Full chain synchronization |
### Recovery Point Objective (RPO)
| Component | RPO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 24 hours | Daily backups |
| Redis | 24 hours | Daily backups |
| Ledger | 24 hours | Daily full + incremental backups|
### Disaster Recovery Steps
1. **Assess Impact**
```bash
# Check component status
kubectl get pods -n default
kubectl get events --sort-by=.metadata.creationTimestamp
```
2. **Restore Critical Services**
```bash
# Restore PostgreSQL first (critical for operations)
./infra/scripts/restore_postgresql.sh default [latest-backup]
# Restore Redis cache
./infra/scripts/restore_redis.sh default [latest-backup]
# Restore ledger data
./infra/scripts/restore_ledger.sh default [latest-backup]
```
3. **Verify System Health**
```bash
# Check all services
kubectl get pods -n default
# Verify API endpoints
curl -s http://coordinator-api:8011/v1/health
curl -s http://blockchain-node:8080/v1/health
```
## Monitoring and Alerting
### Backup Monitoring
Prometheus metrics track backup success/failure:
```yaml
# AlertManager rules for backups
- alert: BackupFailed
expr: backup_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup failed for {{ $labels.component }}"
description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```
### Log Monitoring
```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100
# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```
## Best Practices
### Backup Security
1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing
### Performance Considerations
1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up sequentially to avoid I/O contention
3. **Compression**: All backups compressed to save storage
4. **Incremental Backups**: Ledger supports incremental to reduce size
### Testing
1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current
## Troubleshooting
### Common Issues
#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```
#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3
```
#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```
### Getting Help
1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
## Configuration
### Environment Variables
| Variable | Default | Description |
|------------------------|------------------|---------------------------------|
| BACKUP_RETENTION_DAYS | 30 | Days to keep backups |
| BACKUP_SCHEDULE | 0 2 * * * | Cron schedule for backups |
| S3_BUCKET_PREFIX | aitbc-backups | S3 bucket name prefix |
| COMPRESSION_LEVEL | 6 | gzip compression level |
### Customizing Backup Schedule
Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:
```yaml
spec:
schedule: "0 3 * * *" # Change to 3 AM UTC
```
### Adjusting Retention
Modify retention in each backup script:
```bash
# In backup_*.sh scripts
RETENTION_DAYS=60 # Keep for 60 days instead of 30
```
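The retention logic inside such a script is typically a `find -mtime` sweep. A hedged sketch (directory and variable names are assumptions):

```shell
# Hypothetical cleanup step for a backup_*.sh script: delete local
# backup archives older than RETENTION_DAYS before writing new ones.
RETENTION_DAYS="${RETENTION_DAYS:-30}"
BACKUP_DIR="${BACKUP_DIR:-/tmp/postgresql-backups}"
mkdir -p "$BACKUP_DIR"
# -mtime +N matches files last modified more than N*24h ago
find "$BACKUP_DIR" -type f -name '*.gz' -mtime +"$RETENTION_DAYS" -delete
```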


@@ -0,0 +1,498 @@
# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation and integration test suite.
## Integration Test Status (Updated 2026-01-26)
### Current Test Coverage
- ✅ 6 integration tests passing
- ✅ Security tests using real ZK proof features
- ✅ Marketplace tests connecting to live service
- ⏸️ 1 test skipped (wallet payment flow)
### Test Environment
- Tests run against both real and mock clients
- CI/CD pipeline runs full test suite
- Local development: `python -m pytest tests/integration/ -v`
## Runbook: Coordinator API Outage
### Based on Chaos Test: `chaos_test_coordinator.py`
### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```
### Investigation (2-10 minutes)
1. **Review Logs**
```bash
kubectl logs -n default deployment/coordinator --tail=100
```
2. **Check Resource Limits**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=coordinator
```
3. **Verify Database Connectivity**
```bash
kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
```
4. **Check Redis Connection**
```bash
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
```
### Recovery Actions
1. **Scale Up if Resource Starved**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
```
2. **Manual Pod Deletion if Stuck**
```bash
kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
```
3. **Rollback Deployment**
```bash
kubectl rollout undo deployment/coordinator -n default
```
### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
## Runbook: Network Partition
### Based on Chaos Test: `chaos_test_network.py`
### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
```bash
# Check each node's peer count
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
echo "Pod: $pod"
kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
done
```
2. **Check Network Policies**
```bash
kubectl describe networkpolicy default-deny-all-ingress -n default
kubectl describe networkpolicy blockchain-node-netpol -n default
```
3. **Verify DNS Resolution**
```bash
kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
```
### Recovery Actions
1. **Remove Problematic Network Rules**
```bash
# Flush iptables on affected nodes
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
kubectl exec -n default $pod -- iptables -F
done
```
2. **Restart Network Components**
```bash
kubectl rollout restart deployment/blockchain-node -n default
```
3. **Force Re-peering**
```bash
# Delete and recreate pods to force re-peering
kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
```
### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure
### Based on Chaos Test: `chaos_test_database.py`
### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
### MTTR Target: 3 minutes
### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
### Investigation (3-10 minutes)
1. **Review Database Logs**
```bash
kubectl logs -n default deployment/postgresql --tail=100
```
2. **Check Resource Usage**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=postgresql
df -h /var/lib/postgresql/data
```
3. **Identify Long-running Queries**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
### Recovery Actions
1. **Kill Idle Connections**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
```
2. **Restart PostgreSQL**
```bash
kubectl rollout restart deployment/postgresql -n default
```
3. **Failover to Replica**
```bash
# Promote replica if primary fails
kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
```
### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure
### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**
```bash
kubectl logs -n default deployment/redis --tail=100
```
2. **Check for Eviction**
```bash
kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
```
3. **Identify Large Keys**
```bash
kubectl exec -n default deployment/redis -- redis-cli --bigkeys
```
### Recovery Actions
1. **Clear Expired Keys**
```bash
# Run the whole pipeline inside the pod. Narrow the pattern first:
# "*:*" matches every namespaced key, not just expired ones.
kubectl exec -n default deployment/redis -- sh -c \
  "redis-cli --scan --pattern '*:*' | xargs -r redis-cli del"
```
2. **Restart Redis**
```bash
kubectl rollout restart deployment/redis -n default
```
3. **Scale Redis Cluster**
```bash
kubectl scale deployment/redis --replicas=3 -n default
```
### Verification
```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: High CPU/Memory Usage
### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry pods
kubectl exec -n default deployment/coordinator -- top
# 3. Check for OOM kills
dmesg | grep -i "killed process"
```
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
```bash
# Detailed pod metrics
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
```
2. **Check Resource Limits**
```bash
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
```
3. **Review Application Metrics**
```bash
# Check Prometheus metrics
curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
```
### Recovery Actions
1. **Scale Services**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
kubectl scale deployment/blockchain-node --replicas=3 -n default
```
2. **Increase Resource Limits**
```bash
kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
```
3. **Restart Affected Services**
```bash
kubectl rollout restart deployment/coordinator -n default
```
### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues
### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
### MTTR Target: 10 minutes
### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
```bash
du -sh /var/log/*
du -sh /var/lib/docker/*
```
2. **Check PVC Usage**
```bash
kubectl get pvc -n default
kubectl describe pvc postgresql-data -n default
```
3. **Review Retention Policies**
```bash
kubectl get cronjobs -n default
kubectl describe cronjob log-cleanup -n default
```
### Recovery Actions
1. **Expand Storage**
```bash
kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
2. **Force Cleanup**
```bash
# Clean old logs
find /var/log -name "*.log" -mtime +7 -delete
# Clean Docker images
docker system prune -a
```
3. **Restart Services**
```bash
kubectl rollout restart deployment/postgresql -n default
```
### Verification
```bash
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures
### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)
### War Room Activation
```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)
## Post-Incident Checklist
### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified
### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled
### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated
## Runbook Maintenance
### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision
### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation
---
*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*


@@ -0,0 +1,69 @@
# Marketplace GPU Endpoints Deployment Summary
## ✅ Successfully Deployed to Remote Server (aitbc-cascade)
### What was deployed:
1. **New router file**: `/opt/coordinator-api/src/app/routers/marketplace_gpu.py`
- 9 GPU-specific endpoints implemented
- In-memory storage for quick testing
- Mock data with 3 initial GPUs
2. **Updated router configuration**:
- Added `marketplace_gpu` import to `__init__.py`
- Added router to main app with `/v1` prefix
- Service restarted successfully
### Available Endpoints:
- `POST /v1/marketplace/gpu/register` - Register GPU
- `GET /v1/marketplace/gpu/list` - List GPUs
- `GET /v1/marketplace/gpu/{gpu_id}` - Get GPU details
- `POST /v1/marketplace/gpu/{gpu_id}/book` - Book GPU
- `POST /v1/marketplace/gpu/{gpu_id}/release` - Release GPU
- `GET /v1/marketplace/gpu/{gpu_id}/reviews` - Get reviews
- `POST /v1/marketplace/gpu/{gpu_id}/reviews` - Add review
- `GET /v1/marketplace/orders` - List orders
- `GET /v1/marketplace/pricing/{model}` - Get pricing
### Test Results:
1. **GPU Registration**: ✅
- Successfully registered RTX 4060 Ti (16GB)
- GPU ID: gpu_001
- Price: $0.30/hour
2. **GPU Booking**: ✅
- Booked for 2 hours
- Total cost: $1.0
- Booking ID generated
3. **Review System**: ✅
- Added 5-star review
- Average rating updated to 5.0
4. **Order Management**: ✅
- Orders tracked
- Status: active
### Current GPU Inventory:
1. RTX 4090 (24GB) - $0.50/hr - Available
2. RTX 3080 (16GB) - $0.35/hr - Available
3. A100 (40GB) - $1.20/hr - Booked
4. **RTX 4060 Ti (16GB) - $0.30/hr - Available** (newly registered)
### Service Status:
- Coordinator API: Running on port 8000
- Service: active (running)
- Last restart: Feb 12, 2026 at 16:14:11 UTC
### Next Steps:
1. Update CLI to use remote server URL (http://aitbc-cascade:8000)
2. Test full CLI workflow against remote server
3. Consider persistent storage implementation
4. Add authentication/authorization for production
### Notes:
- Current implementation uses in-memory storage
- Data resets on service restart
- No authentication required (test API key works)
- All endpoints return proper HTTP status codes (201 for creation)
The marketplace GPU functionality is now fully operational on the remote server! 🚀


@@ -0,0 +1,273 @@
# AITBC Beta Release Plan
## Executive Summary
This document outlines the beta release plan for AITBC (AI Trusted Blockchain Computing), a blockchain platform designed for AI workloads. The release follows a phased approach: Alpha → Beta → Release Candidate (RC) → General Availability (GA).
## Release Phases
### Phase 1: Alpha Release (Completed)
- **Duration**: 2 weeks
- **Participants**: Internal team (10 members)
- **Focus**: Core functionality validation
- **Status**: ✅ Completed
### Phase 2: Beta Release (Current)
- **Duration**: 6 weeks
- **Participants**: 50-100 external testers
- **Focus**: User acceptance testing, performance validation, security assessment
- **Start Date**: 2025-01-15
- **End Date**: 2025-02-26
### Phase 3: Release Candidate
- **Duration**: 2 weeks
- **Participants**: 20 selected beta testers
- **Focus**: Final bug fixes, performance optimization
- **Start Date**: 2025-03-04
- **End Date**: 2025-03-18
### Phase 4: General Availability
- **Date**: 2025-03-25
- **Target**: Public launch
## Beta Release Timeline
### Week 1-2: Onboarding & Basic Flows
- **Jan 15-19**: Tester onboarding and environment setup
- **Jan 22-26**: Basic job submission and completion flows
- **Milestone**: 80% of testers successfully submit and complete jobs
### Week 3-4: Marketplace & Explorer Testing
- **Jan 29 - Feb 2**: Marketplace functionality testing
- **Feb 5-9**: Explorer UI validation and transaction tracking
- **Milestone**: 100 marketplace transactions completed
### Week 5-6: Stress Testing & Feedback
- **Feb 12-16**: Performance stress testing (1000+ concurrent jobs)
- **Feb 19-23**: Security testing and final feedback collection
- **Milestone**: All critical bugs resolved
## User Acceptance Testing (UAT) Scenarios
### 1. Core Job Lifecycle
- **Scenario**: Submit AI inference job → Miner picks up → Execution → Results delivery → Payment
- **Test Cases**:
- Job submission with various model types
- Job monitoring and status tracking
- Result retrieval and verification
- Payment processing and wallet updates
- **Success Criteria**: 95% success rate across 1000 test jobs
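The submit → monitor → retrieve flow above reduces to a polling loop. The sketch below is illustrative only: the status names and the shape of the status callable are assumptions, not the documented coordinator API, and the HTTP call is stubbed out.

```python
import time

# Hypothetical terminal states; the real coordinator API may use different names.
TERMINAL = {"completed", "failed"}

def wait_for_job(fetch_status, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status: zero-arg callable returning the current status string
                  (in practice, a wrapper around the job-status endpoint).
    Returns the terminal status, or raises TimeoutError after timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(poll_s)
    raise TimeoutError("job did not reach a terminal state in time")

# Stubbed status source standing in for the HTTP call:
states = iter(["pending", "running", "completed"])
result = wait_for_job(lambda: next(states), sleep=lambda _: None)
print(result)  # completed
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a live coordinator; the 5-minute default timeout mirrors the must-have job-completion benchmark.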
### 2. Marketplace Operations
- **Scenario**: Create offer → Accept offer → Execute job → Complete transaction
- **Test Cases**:
- Offer creation and management
- Bid acceptance and matching
- Price discovery mechanisms
- Dispute resolution
- **Success Criteria**: 50 successful marketplace transactions
### 3. Explorer Functionality
- **Scenario**: Transaction lookup → Job tracking → Address analysis
- **Test Cases**:
- Real-time transaction monitoring
- Job history and status visualization
- Wallet balance tracking
- Block explorer features
- **Success Criteria**: All transactions visible within 5 seconds
### 4. Wallet Management
- **Scenario**: Wallet creation → Funding → Transactions → Backup/Restore
- **Test Cases**:
- Multi-signature wallet creation
- Cross-chain transfers
- Backup and recovery procedures
- Staking and unstaking operations
- **Success Criteria**: 100% wallet recovery success rate
### 5. Mining Operations
- **Scenario**: Miner setup → Job acceptance → Mining rewards → Pool participation
- **Test Cases**:
- Miner registration and setup
- Job bidding and execution
- Reward distribution
- Pool mining operations
- **Success Criteria**: 90% of submitted jobs accepted by miners
### 6. Community Management
#### Discord Community Structure

- **#announcements**: Official updates and milestones
- **#beta-testers**: Private channel for testers only
- **#bug-reports**: Structured bug reporting format
- **#feature-feedback**: Feature requests and discussions
- **#technical-support**: 24/7 support from the team
#### Regulatory Considerations
- **KYC/AML**: Basic identity verification for testers
- **Securities Law**: Beta tokens have no monetary value
- **Tax Reporting**: Testnet transactions not taxable
- **Export Controls**: Compliance with technology export laws
#### Geographic Restrictions
Beta testing is not available in:
- North Korea, Iran, Cuba, Syria, Crimea
- Countries under US sanctions
- Jurisdictions with unclear crypto regulations
### 7. Token Economics Validation
- **Scenario**: Token issuance → Reward distribution → Staking yields → Fee mechanisms
- **Test Cases**:
- Mining reward calculations match whitepaper specs
- Staking yields and unstaking penalties
- Transaction fee burning and distribution
- Marketplace fee structures
- Token inflation/deflation mechanics
- **Success Criteria**: All token operations within 1% of theoretical values
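The "within 1% of theoretical values" criterion is a relative-error check. A minimal sketch, with made-up placeholder numbers in place of real whitepaper reward values:

```python
def within_tolerance(actual, theoretical, tol=0.01):
    """Return True if actual deviates from theoretical by at most tol (relative)."""
    if theoretical == 0:
        return actual == 0
    return abs(actual - theoretical) / abs(theoretical) <= tol

# Placeholder values: observed vs. whitepaper-specified mining reward.
print(within_tolerance(49.6, 50.0))  # True  (0.8% deviation)
print(within_tolerance(48.0, 50.0))  # False (4% deviation)
```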
## Performance Benchmarks (Go/No-Go Criteria)
### Must-Have Metrics
- **Transaction Throughput**: ≥ 100 TPS (Transactions Per Second)
- **Job Completion Time**: ≤ 5 minutes for standard inference jobs
- **API Response Time**: ≤ 200ms (95th percentile)
- **System Uptime**: ≥ 99.9% during beta period
- **MTTR (Mean Time To Recovery)**: ≤ 2 minutes (from chaos tests)
### Nice-to-Have Metrics
- **Transaction Throughput**: ≥ 500 TPS
- **Job Completion Time**: ≤ 2 minutes
- **API Response Time**: ≤ 100ms (95th percentile)
- **Concurrent Users**: ≥ 1000 simultaneous users
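The 95th-percentile latency gates can be evaluated with a nearest-rank percentile over collected samples. The latency numbers below are synthetic, purely to illustrate the check:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Synthetic API latency samples in milliseconds.
latencies_ms = [120, 150, 90, 180, 190, 130, 95, 160, 140, 110]
p95 = percentile(latencies_ms, 95)
must_have_ok = p95 <= 200     # must-have gate: <= 200 ms
nice_to_have_ok = p95 <= 100  # nice-to-have gate: <= 100 ms
print(p95, must_have_ok, nice_to_have_ok)  # 190 True False
```

In a real beta run the samples would come from the monitoring stack rather than a hard-coded list, and the same gate pattern applies to TPS and MTTR.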
## Security Testing
### Automated Security Scans
- **Smart Contract Audits**: Completed by [Security Firm]
- **Penetration Testing**: OWASP Top 10 validation
- **Dependency Scanning**: CVE scan of all dependencies
- **Chaos Testing**: Network partition and coordinator outage scenarios
### Manual Security Reviews
- **Authorization Testing**: API key validation and permissions
- **Data Privacy**: GDPR compliance validation
- **Cryptography**: Proof verification and signature validation
- **Infrastructure Security**: Kubernetes and cloud security review
## Test Environment Setup
### Beta Environment
- **Network**: Separate testnet with faucet for test tokens
- **Infrastructure**: Production-like setup with monitoring
- **Data**: Reset weekly to ensure clean testing
- **Support**: 24/7 Discord support channel
### Access Credentials
- **Testnet Faucet**: 1000 AITBC tokens per tester
- **API Keys**: Unique keys per tester with rate limits
- **Wallet Seeds**: Generated per tester with backup instructions
- **Mining Accounts**: Pre-configured mining pools for testing
## Feedback Collection Mechanisms
### Automated Collection
- **Error Reporting**: Automatic crash reports and error logs
- **Performance Metrics**: Client-side performance data
- **Usage Analytics**: Feature usage tracking (anonymized)
- **Survey System**: In-app feedback prompts
### Manual Collection
- **Weekly Surveys**: Structured feedback on specific features
- **Discord Channels**: Real-time feedback and discussions
- **Office Hours**: Weekly Q&A sessions with the team
- **Bug Bounty**: Program for critical issue discovery
## Success Criteria
### Go/No-Go Decision Points
#### Week 2 Checkpoint (Jan 26)
- **Go Criteria**: 80% of testers onboarded, basic flows working
- **Blockers**: Critical bugs in job submission/completion
#### Week 4 Checkpoint (Feb 9)
- **Go Criteria**: 50 marketplace transactions, explorer functional
- **Blockers**: Security vulnerabilities, performance < 50 TPS
#### Week 6 Final Decision (Feb 23)
- **Go Criteria**: All UAT scenarios passed, benchmarks met
- **Blockers**: Any critical security issue, MTTR > 5 minutes
### Overall Success Metrics
- **User Satisfaction**: ≥ 4.0/5.0 average rating
- **Bug Resolution**: 90% of reported bugs fixed
- **Performance**: All benchmarks met
- **Security**: No critical vulnerabilities
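The four overall success metrics combine into a single go/no-go decision. A sketch with illustrative field names and numbers (not tied to any real reporting schema):

```python
def go_no_go(metrics):
    """Evaluate the overall beta success gates; returns (ok, per-gate detail)."""
    gates = {
        "satisfaction": metrics["satisfaction"] >= 4.0,
        "bug_resolution": metrics["bugs_fixed"] / metrics["bugs_reported"] >= 0.90,
        "benchmarks": metrics["benchmarks_met"],
        "security": metrics["critical_vulns"] == 0,
    }
    return all(gates.values()), gates

# Illustrative numbers only:
ok, detail = go_no_go({
    "satisfaction": 4.2,
    "bugs_fixed": 93,
    "bugs_reported": 100,
    "benchmarks_met": True,
    "critical_vulns": 0,
})
print(ok)  # True
```

Returning the per-gate detail alongside the boolean makes it easy to report which criterion blocked a release at each checkpoint.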
## Risk Management
### Technical Risks
- **Consensus Issues**: Rollback to previous version
- **Performance Degradation**: Auto-scaling and optimization
- **Security Breaches**: Immediate patch and notification
### Operational Risks
- **Test Environment Downtime**: Backup environment ready
- **Low Tester Participation**: Incentive program adjustments
- **Feature Scope Creep**: Strict feature freeze after Week 4
### Mitigation Strategies
- **Daily Health Checks**: Automated monitoring and alerts
- **Rollback Plan**: Documented procedures for quick rollback
- **Communication Plan**: Regular updates to all stakeholders
## Communication Plan
### Internal Updates
- **Daily Standups**: Development team sync
- **Weekly Reports**: Progress to leadership
- **Bi-weekly Demos**: Feature demonstrations
### External Updates
- **Beta Newsletter**: Weekly updates to testers
- **Blog Posts**: Public progress updates
- **Social Media**: Regular platform updates
## Post-Beta Activities
### RC Phase Preparation
- **Bug Triage**: Prioritize and assign all reported issues
- **Performance Tuning**: Optimize based on beta metrics
- **Documentation Updates**: Incorporate beta feedback
### GA Preparation
- **Final Security Review**: Complete audit and penetration test
- **Infrastructure Scaling**: Prepare for production load
- **Support Team Training**: Train and equip the customer support team
## Appendix
### A. Test Case Matrix
[Detailed test case spreadsheet link]
### B. Performance Benchmark Results
[Benchmark data and graphs]
### C. Security Audit Reports
[Audit firm reports and findings]
### D. Feedback Analysis
[Summary of all user feedback and actions taken]
## Contact Information
- **Beta Program Manager**: beta@aitbc.io
- **Technical Support**: support@aitbc.io
- **Security Issues**: security@aitbc.io
- **Discord Community**: https://discord.gg/aitbc
---
*Last Updated: 2025-01-10*
*Version: 1.0*
*Next Review: 2025-01-17*