chore: standardize configuration, logging, and error handling across blockchain node and coordinator API

- Add infrastructure.md and workflow files to .gitignore to prevent sensitive info leaks
- Change blockchain node mempool backend default from memory to database for persistence
- Refactor blockchain node logger with StructuredLogFormatter and AuditLogger (consistent with coordinator)
- Add structured logging fields: service, module, function, line number
- Unify coordinator config with Database
oib
2026-02-13 22:39:43 +01:00
parent 0cbd2b507c
commit 06e48ef34b
196 changed files with 4660 additions and 20090 deletions


@@ -0,0 +1,20 @@
# Deployment & Operations
Deploy, operate, and maintain AITBC infrastructure.
## Reading Order
| # | File | What you learn |
|---|------|----------------|
| 1 | [1_remote-deployment-guide.md](./1_remote-deployment-guide.md) | Deploy to remote servers |
| 2 | [2_service-naming-convention.md](./2_service-naming-convention.md) | Systemd service names and standards |
| 3 | [3_backup-restore.md](./3_backup-restore.md) | Backup PostgreSQL, Redis, ledger data |
| 4 | [4_incident-runbooks.md](./4_incident-runbooks.md) | Handle outages and incidents |
| 5 | [5_marketplace-deployment.md](./5_marketplace-deployment.md) | Deploy GPU marketplace endpoints |
| 6 | [6_beta-release-plan.md](./6_beta-release-plan.md) | Beta release checklist and timeline |
## Related
- [Installation](../0_getting_started/2_installation.md) — Initial setup
- [Security](../9_security/) — Security architecture and hardening
- [Architecture](../6_architecture/) — System design docs


@@ -0,0 +1,138 @@
# AITBC Remote Deployment Guide
## Overview
This deployment strategy builds the blockchain node directly on the ns3 server to utilize its gigabit connection, avoiding slow uploads from localhost.
## Quick Start
### 1. Deploy Everything
```bash
./scripts/deploy/deploy-all-remote.sh
```
This will:
- Copy deployment scripts to ns3
- Copy blockchain source code from localhost
- Build blockchain node directly on server
- Deploy a lightweight HTML-based explorer
- Configure port forwarding
### 2. Access Services
**Blockchain Node RPC:**
- Internal: http://localhost:8082
- External: http://aitbc.keisanki.net:8082
**Blockchain Explorer:**
- Internal: http://localhost:3000
- External: http://aitbc.keisanki.net:3000
## Architecture
```
ns3-root (95.216.198.140)
├── Blockchain Node (port 8082)
│ ├── Auto-syncs on startup
│ └── Serves RPC API
└── Explorer (port 3000)
├── Static HTML/CSS/JS
├── Served by nginx
└── Connects to local node
```
## Key Features
### Blockchain Node
- Built directly on server from source code
- Source copied from localhost via scp
- Auto-sync on startup
- No large file uploads needed
- Uses server's gigabit connection
### Explorer
- Pure HTML/CSS/JS (no build step)
- Served by nginx
- Real-time block viewing
- Transaction details
- Auto-refresh every 30 seconds
## Manual Deployment
If you need to deploy components separately:
### Blockchain Node Only
```bash
ssh ns3-root
cd /opt
./deploy-blockchain-remote.sh
```
### Explorer Only
```bash
ssh ns3-root
cd /opt
./deploy-explorer-remote.sh
```
## Troubleshooting
### Check Services
```bash
# On ns3 server
systemctl status blockchain-node blockchain-rpc nginx
# Check logs
journalctl -u blockchain-node -f
journalctl -u blockchain-rpc -f
journalctl -u nginx -f
```
### Test RPC
```bash
# From ns3
curl http://localhost:8082/rpc/head
# From external
curl http://aitbc.keisanki.net:8082/rpc/head
```
### Port Forwarding
If port forwarding doesn't work:
```bash
# Check iptables rules
iptables -t nat -L -n
# Re-add rules
iptables -t nat -A PREROUTING -p tcp --dport 8082 -j DNAT --to-destination 192.168.100.10:8082
iptables -t nat -A POSTROUTING -p tcp -d 192.168.100.10 --dport 8082 -j MASQUERADE
```
## Configuration
### Blockchain Node
Location: `/opt/blockchain-node/.env`
- Chain ID: ait-devnet
- RPC Port: 8082
- P2P Port: 7070
- Auto-sync: enabled
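The values above might correspond to an `.env` along these lines (a sketch; the variable names are assumptions, so check the file on the server for the real keys):

```bash
# /opt/blockchain-node/.env -- hypothetical key names
CHAIN_ID=ait-devnet
RPC_PORT=8082
P2P_PORT=7070
AUTO_SYNC=true
```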
### Explorer
Location: `/opt/blockchain-explorer/index.html`
- Served by nginx on port 3000
- Connects to localhost:8082
- No configuration needed
## Security Notes
- Services run as root (simplified for dev)
- No authentication on RPC (dev only)
- Port forwarding exposes services externally
- Consider firewall rules for production
## Next Steps
1. Set up proper authentication
2. Configure HTTPS with SSL certificates
3. Add multiple peers for network resilience
4. Implement proper backup procedures
5. Set up monitoring and alerting


@@ -0,0 +1,85 @@
# AITBC Service Naming Convention
## Updated Service Names (2026-02-13)
All AITBC systemd services now follow the `aitbc-` prefix convention for consistency and easier management.
### Site A (aitbc.bubuit.net) - Production Services
| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-1.service | 8081 | Blockchain Node 1 |
| blockchain-node-2.service | aitbc-blockchain-node-2.service | 8082 | Blockchain Node 2 |
| blockchain-rpc.service | aitbc-blockchain-rpc-1.service | - | RPC API for Node 1 |
| blockchain-rpc-2.service | aitbc-blockchain-rpc-2.service | - | RPC API for Node 2 |
| coordinator-api.service | aitbc-coordinator-api.service | 8000 | Coordinator API |
| exchange-mock-api.service | aitbc-exchange-mock-api.service | - | Exchange Mock API |
### Site B (ns3 container) - Remote Node
| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-3.service | 8082 | Blockchain Node 3 |
| blockchain-rpc.service | aitbc-blockchain-rpc-3.service | - | RPC API for Node 3 |
### Already Compliant Services
These services already had the `aitbc-` prefix:
- aitbc-exchange-api.service (port 3003)
- aitbc-exchange.service (port 3002)
- aitbc-miner-dashboard.service
### Removed Services
- aitbc-blockchain.service (legacy, was on port 9080)
## Management Commands
### Check Service Status
```bash
# Site A (via SSH)
ssh aitbc-cascade "systemctl status aitbc-blockchain-node-1.service"
# Site B (via SSH)
ssh ns3-root "incus exec aitbc -- systemctl status aitbc-blockchain-node-3.service"
```
### Restart Services
```bash
# Site A
ssh aitbc-cascade "sudo systemctl restart aitbc-blockchain-node-1.service"
# Site B
ssh ns3-root "incus exec aitbc -- sudo systemctl restart aitbc-blockchain-node-3.service"
```
### View Logs
```bash
# Site A
ssh aitbc-cascade "journalctl -u aitbc-blockchain-node-1.service -f"
# Site B
ssh ns3-root "incus exec aitbc -- journalctl -u aitbc-blockchain-node-3.service -f"
```
## Service Dependencies
### Blockchain Nodes
- Node 1: `/opt/blockchain-node` → port 8081
- Node 2: `/opt/blockchain-node-2` → port 8082
- Node 3: `/opt/blockchain-node` → port 8082 (Site B)
### RPC Services
- RPC services are companion services to the main nodes
- They provide HTTP API endpoints for blockchain operations
### Coordinator API
- Main API for job submission, miner management, and receipts
- Runs on localhost:8000 inside container
- Proxied via nginx at https://aitbc.bubuit.net/api/
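The nginx proxying described above might be configured along these lines (a sketch only; the real site config is not shown in this document):

```nginx
# Hypothetical location block forwarding /api/ to the coordinator
location /api/ {
    proxy_pass http://127.0.0.1:8000/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```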
## Benefits of Standardized Naming
1. **Clarity**: Easy to identify AITBC services among system services
2. **Management**: Simpler to filter and manage with wildcards (`systemctl status aitbc-*`)
3. **Documentation**: Consistent naming across all documentation
4. **Automation**: Easier scripting and automation with predictable names
5. **Debugging**: Faster identification of service-related issues
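Benefit 4 in practice: predictable names are easy to generate in scripts. A small sketch, using just the two Site A nodes listed above:

```shell
# Build the Site A node service names from the aitbc- prefix
# convention; a predictable scheme makes loops like this safe.
for n in 1 2; do
  printf 'aitbc-blockchain-node-%s.service\n' "$n"
done
```

Feeding such a list into `systemctl status`, or using the wildcard form `systemctl status 'aitbc-*'` directly, covers every node in one command.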


@@ -0,0 +1,316 @@
# AITBC Backup and Restore Procedures
This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.
## Overview
The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently
## Components
### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)
### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)
### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)
## Automated Backups
### Kubernetes CronJob
The automated backup system runs daily at 2:00 AM UTC:
```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml
# Check CronJob status
kubectl get cronjob aitbc-backup
# View backup jobs
kubectl get jobs -l app=aitbc-backup
# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```
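For orientation, a CronJob wired up this way might look roughly like the sketch below (image, script path, and service account are assumptions; the authoritative manifest is `infra/k8s/backup-cronjob.yaml`):

```yaml
# Hypothetical sketch of backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aitbc-backup
spec:
  schedule: "0 2 * * *"   # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: aitbc/backup:latest        # assumed image name
              command: ["/scripts/run_backups.sh"]
```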
### Backup Schedule
| Time (UTC) | Component | Type | Retention |
|------------|----------------|------------|-----------|
| 02:00 | PostgreSQL | Full | 30 days |
| 02:01 | Redis | Full | 30 days |
| 02:02 | Ledger | Full | 30 days |
## Manual Backups
### PostgreSQL
```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)
# View available backups
ls -la /tmp/postgresql-backups/
# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup.sql.gz s3://aitbc-backups-default/postgresql/
```
### Redis
```bash
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
# Force background save before backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
```
### Ledger Storage
```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)
# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```
## Restore Procedures
### PostgreSQL Restore
```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/
# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/
# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz
# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```
### Redis Restore
```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default
# Clear existing data
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb
# Start Redis service
kubectl scale deployment redis --replicas=1 -n default
# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```
### Ledger Restore
```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default
# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/
# Copy ledger data
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/
# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default
# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```
## Disaster Recovery
### Recovery Time Objective (RTO)
| Component | RTO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 1 hour | Database restore from backup |
| Redis | 15 minutes | Cache rebuild from backup |
| Ledger | 2 hours | Full chain synchronization |
### Recovery Point Objective (RPO)
| Component | RPO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 24 hours | Daily backups |
| Redis | 24 hours | Daily backups |
| Ledger | 24 hours | Daily full + incremental backups|
### Disaster Recovery Steps
1. **Assess Impact**
```bash
# Check component status
kubectl get pods -n default
kubectl get events --sort-by=.metadata.creationTimestamp
```
2. **Restore Critical Services**
```bash
# Restore PostgreSQL first (critical for operations)
./infra/scripts/restore_postgresql.sh default [latest-backup]
# Restore Redis cache
./infra/scripts/restore_redis.sh default [latest-backup]
# Restore ledger data
./infra/scripts/restore_ledger.sh default [latest-backup]
```
3. **Verify System Health**
```bash
# Check all services
kubectl get pods -n default
# Verify API endpoints
curl -s http://coordinator-api:8011/v1/health
curl -s http://blockchain-node:8080/v1/health
```
## Monitoring and Alerting
### Backup Monitoring
Prometheus metrics track backup success/failure:
```yaml
# AlertManager rules for backups
- alert: BackupFailed
expr: backup_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup failed for {{ $labels.component }}"
description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```
### Log Monitoring
```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100
# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```
## Best Practices
### Backup Security
1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing
### Performance Considerations
1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up sequentially to avoid I/O contention
3. **Compression**: All backups compressed to save storage
4. **Incremental Backups**: Ledger supports incremental to reduce size
### Testing
1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current
## Troubleshooting
### Common Issues
#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```
#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3
```
#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```
### Getting Help
1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
## Configuration
### Environment Variables
| Variable | Default | Description |
|------------------------|------------------|---------------------------------|
| BACKUP_RETENTION_DAYS | 30 | Days to keep backups |
| BACKUP_SCHEDULE | 0 2 * * * | Cron schedule for backups |
| S3_BUCKET_PREFIX | aitbc-backups | S3 bucket name prefix |
| COMPRESSION_LEVEL | 6 | gzip compression level |
### Customizing Backup Schedule
Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:
```yaml
spec:
schedule: "0 3 * * *" # Change to 3 AM UTC
```
### Adjusting Retention
Modify retention in each backup script:
```bash
# In backup_*.sh scripts
RETENTION_DAYS=60 # Keep for 60 days instead of 30
```
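The retention logic inside such a script is typically a `find -mtime` sweep. A hedged sketch (directory and variable names are assumptions):

```shell
# Hypothetical cleanup step for a backup_*.sh script: delete local
# backup archives older than RETENTION_DAYS before writing new ones.
RETENTION_DAYS="${RETENTION_DAYS:-30}"
BACKUP_DIR="${BACKUP_DIR:-/tmp/postgresql-backups}"
mkdir -p "$BACKUP_DIR"
# -mtime +N matches files last modified more than N*24h ago
find "$BACKUP_DIR" -type f -name '*.gz' -mtime +"$RETENTION_DAYS" -delete
```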


@@ -0,0 +1,498 @@
# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation and integration test suite.
## Integration Test Status (Updated 2026-01-26)
### Current Test Coverage
- ✅ 6 integration tests passing
- ✅ Security tests using real ZK proof features
- ✅ Marketplace tests connecting to live service
- ⏸️ 1 test skipped (wallet payment flow)
### Test Environment
- Tests run against both real and mock clients
- CI/CD pipeline runs full test suite
- Local development: `python -m pytest tests/integration/ -v`
## Runbook: Coordinator API Outage
### Based on Chaos Test: `chaos_test_coordinator.py`
### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```
### Investigation (2-10 minutes)
1. **Review Logs**
```bash
kubectl logs -n default deployment/coordinator --tail=100
```
2. **Check Resource Limits**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=coordinator
```
3. **Verify Database Connectivity**
```bash
kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
```
4. **Check Redis Connection**
```bash
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
```
### Recovery Actions
1. **Scale Up if Resource Starved**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
```
2. **Manual Pod Deletion if Stuck**
```bash
kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
```
3. **Rollback Deployment**
```bash
kubectl rollout undo deployment/coordinator -n default
```
### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
## Runbook: Network Partition
### Based on Chaos Test: `chaos_test_network.py`
### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
```bash
# Check each node's peer count
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
echo "Pod: $pod"
kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
done
```
2. **Check Network Policies**
```bash
kubectl describe networkpolicy default-deny-all-ingress -n default
kubectl describe networkpolicy blockchain-node-netpol -n default
```
3. **Verify DNS Resolution**
```bash
kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
```
### Recovery Actions
1. **Remove Problematic Network Rules**
```bash
# Flush iptables on affected nodes
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
kubectl exec -n default $pod -- iptables -F
done
```
2. **Restart Network Components**
```bash
kubectl rollout restart deployment/blockchain-node -n default
```
3. **Force Re-peering**
```bash
# Delete and recreate pods to force re-peering
kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
```
### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure
### Based on Chaos Test: `chaos_test_database.py`
### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
### MTTR Target: 3 minutes
### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
### Investigation (3-10 minutes)
1. **Review Database Logs**
```bash
kubectl logs -n default deployment/postgresql --tail=100
```
2. **Check Resource Usage**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=postgresql
df -h /var/lib/postgresql/data
```
3. **Identify Long-running Queries**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
### Recovery Actions
1. **Kill Idle Connections**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
```
2. **Restart PostgreSQL**
```bash
kubectl rollout restart deployment/postgresql -n default
```
3. **Failover to Replica**
```bash
# Promote replica if primary fails
kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
```
### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure
### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**
```bash
kubectl logs -n default deployment/redis --tail=100
```
2. **Check for Eviction**
```bash
kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
```
3. **Identify Large Keys**
```bash
kubectl exec -n default deployment/redis -- redis-cli --bigkeys
```
### Recovery Actions
1. **Clear Expired Keys**
```bash
# Run the whole pipeline inside the pod. Narrow the pattern first:
# "*:*" matches every namespaced key, not just expired ones.
kubectl exec -n default deployment/redis -- sh -c \
  "redis-cli --scan --pattern '*:*' | xargs -r redis-cli del"
```
2. **Restart Redis**
```bash
kubectl rollout restart deployment/redis -n default
```
3. **Scale Redis Cluster**
```bash
kubectl scale deployment/redis --replicas=3 -n default
```
### Verification
```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: High CPU/Memory Usage
### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry pods
kubectl exec -n default deployment/coordinator -- top
# 3. Check for OOM kills
dmesg | grep -i "killed process"
```
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
```bash
# Detailed pod metrics
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
```
2. **Check Resource Limits**
```bash
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
```
3. **Review Application Metrics**
```bash
# Check Prometheus metrics
curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
```
### Recovery Actions
1. **Scale Services**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
kubectl scale deployment/blockchain-node --replicas=3 -n default
```
2. **Increase Resource Limits**
```bash
kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
```
3. **Restart Affected Services**
```bash
kubectl rollout restart deployment/coordinator -n default
```
### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues
### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
### MTTR Target: 10 minutes
### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
```bash
du -sh /var/log/*
du -sh /var/lib/docker/*
```
2. **Check PVC Usage**
```bash
kubectl get pvc -n default
kubectl describe pvc postgresql-data -n default
```
3. **Review Retention Policies**
```bash
kubectl get cronjobs -n default
kubectl describe cronjob log-cleanup -n default
```
### Recovery Actions
1. **Expand Storage**
```bash
kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
2. **Force Cleanup**
```bash
# Clean old logs
find /var/log -name "*.log" -mtime +7 -delete
# Clean Docker images
docker system prune -a
```
3. **Restart Services**
```bash
kubectl rollout restart deployment/postgresql -n default
```
### Verification
```bash
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures
### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)
### War Room Activation
```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)
## Post-Incident Checklist
### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified
### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled
### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated
## Runbook Maintenance
### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision
### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation
---
*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*


@@ -0,0 +1,69 @@
# Marketplace GPU Endpoints Deployment Summary
## ✅ Successfully Deployed to Remote Server (aitbc-cascade)
### What was deployed:
1. **New router file**: `/opt/coordinator-api/src/app/routers/marketplace_gpu.py`
- 9 GPU-specific endpoints implemented
- In-memory storage for quick testing
- Mock data with 3 initial GPUs
2. **Updated router configuration**:
- Added `marketplace_gpu` import to `__init__.py`
- Added router to main app with `/v1` prefix
- Service restarted successfully
### Available Endpoints:
- `POST /v1/marketplace/gpu/register` - Register GPU
- `GET /v1/marketplace/gpu/list` - List GPUs
- `GET /v1/marketplace/gpu/{gpu_id}` - Get GPU details
- `POST /v1/marketplace/gpu/{gpu_id}/book` - Book GPU
- `POST /v1/marketplace/gpu/{gpu_id}/release` - Release GPU
- `GET /v1/marketplace/gpu/{gpu_id}/reviews` - Get reviews
- `POST /v1/marketplace/gpu/{gpu_id}/reviews` - Add review
- `GET /v1/marketplace/orders` - List orders
- `GET /v1/marketplace/pricing/{model}` - Get pricing
### Test Results:
1. **GPU Registration**: ✅
- Successfully registered RTX 4060 Ti (16GB)
- GPU ID: gpu_001
- Price: $0.30/hour
2. **GPU Booking**: ✅
- Booked for 2 hours
- Total cost: $1.0
- Booking ID generated
3. **Review System**: ✅
- Added 5-star review
- Average rating updated to 5.0
4. **Order Management**: ✅
- Orders tracked
- Status: active
### Current GPU Inventory:
1. RTX 4090 (24GB) - $0.50/hr - Available
2. RTX 3080 (16GB) - $0.35/hr - Available
3. A100 (40GB) - $1.20/hr - Booked
4. **RTX 4060 Ti (16GB) - $0.30/hr - Available** (newly registered)
### Service Status:
- Coordinator API: Running on port 8000
- Service: active (running)
- Last restart: Feb 12, 2026 at 16:14:11 UTC
### Next Steps:
1. Update CLI to use remote server URL (http://aitbc-cascade:8000)
2. Test full CLI workflow against remote server
3. Consider persistent storage implementation
4. Add authentication/authorization for production
### Notes:
- Current implementation uses in-memory storage
- Data resets on service restart
- No authentication required (test API key works)
- All endpoints return proper HTTP status codes (201 for creation)
The marketplace GPU functionality is now fully operational on the remote server! 🚀


@@ -0,0 +1,273 @@
# AITBC Beta Release Plan
## Executive Summary
This document outlines the beta release plan for AITBC (AI Trusted Blockchain Computing), a blockchain platform designed for AI workloads. The release follows a phased approach: Alpha → Beta → Release Candidate (RC) → General Availability (GA).
## Release Phases
### Phase 1: Alpha Release (Completed)
- **Duration**: 2 weeks
- **Participants**: Internal team (10 members)
- **Focus**: Core functionality validation
- **Status**: ✅ Completed
### Phase 2: Beta Release (Current)
- **Duration**: 6 weeks
- **Participants**: 50-100 external testers
- **Focus**: User acceptance testing, performance validation, security assessment
- **Start Date**: 2025-01-15
- **End Date**: 2025-02-26
### Phase 3: Release Candidate
- **Duration**: 2 weeks
- **Participants**: 20 selected beta testers
- **Focus**: Final bug fixes, performance optimization
- **Start Date**: 2025-03-04
- **End Date**: 2025-03-18
### Phase 4: General Availability
- **Date**: 2025-03-25
- **Target**: Public launch
## Beta Release Timeline
### Week 1-2: Onboarding & Basic Flows
- **Jan 15-19**: Tester onboarding and environment setup
- **Jan 22-26**: Basic job submission and completion flows
- **Milestone**: 80% of testers successfully submit and complete jobs
### Week 3-4: Marketplace & Explorer Testing
- **Jan 29 - Feb 2**: Marketplace functionality testing
- **Feb 5-9**: Explorer UI validation and transaction tracking
- **Milestone**: 100 marketplace transactions completed
### Week 5-6: Stress Testing & Feedback
- **Feb 12-16**: Performance stress testing (1000+ concurrent jobs)
- **Feb 19-23**: Security testing and final feedback collection
- **Milestone**: All critical bugs resolved
## User Acceptance Testing (UAT) Scenarios
### 1. Core Job Lifecycle
- **Scenario**: Submit AI inference job → Miner picks up → Execution → Results delivery → Payment
- **Test Cases**:
- Job submission with various model types
- Job monitoring and status tracking
- Result retrieval and verification
- Payment processing and wallet updates
- **Success Criteria**: 95% success rate across 1000 test jobs
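The submit → monitor → retrieve flow above reduces to a polling loop. The sketch below is illustrative only: the status names and the shape of the status callable are assumptions, not the documented coordinator API, and the HTTP call is stubbed out.

```python
import time

# Hypothetical terminal states; the real coordinator API may use different names.
TERMINAL = {"completed", "failed"}

def wait_for_job(fetch_status, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status: zero-arg callable returning the current status string
                  (in practice, a wrapper around the job-status endpoint).
    Returns the terminal status, or raises TimeoutError after timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(poll_s)
    raise TimeoutError("job did not reach a terminal state in time")

# Stubbed status source standing in for the HTTP call:
states = iter(["pending", "running", "completed"])
result = wait_for_job(lambda: next(states), sleep=lambda _: None)
print(result)  # completed
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a live coordinator; the 5-minute default timeout mirrors the must-have job-completion benchmark.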
### 2. Marketplace Operations
- **Scenario**: Create offer → Accept offer → Execute job → Complete transaction
- **Test Cases**:
- Offer creation and management
- Bid acceptance and matching
- Price discovery mechanisms
- Dispute resolution
- **Success Criteria**: 50 successful marketplace transactions
### 3. Explorer Functionality
- **Scenario**: Transaction lookup → Job tracking → Address analysis
- **Test Cases**:
- Real-time transaction monitoring
- Job history and status visualization
- Wallet balance tracking
- Block explorer features
- **Success Criteria**: All transactions visible within 5 seconds
### 4. Wallet Management
- **Scenario**: Wallet creation → Funding → Transactions → Backup/Restore
- **Test Cases**:
- Multi-signature wallet creation
- Cross-chain transfers
- Backup and recovery procedures
- Staking and unstaking operations
- **Success Criteria**: 100% wallet recovery success rate
### 5. Mining Operations
- **Scenario**: Miner setup → Job acceptance → Mining rewards → Pool participation
- **Test Cases**:
- Miner registration and setup
- Job bidding and execution
- Reward distribution
- Pool mining operations
- **Success Criteria**: 90% of submitted jobs accepted by miners
### 6. Community Management
#### Discord Community Structure

- **#announcements**: Official updates and milestones
- **#beta-testers**: Private channel for testers only
- **#bug-reports**: Structured bug reporting format
- **#feature-feedback**: Feature requests and discussions
- **#technical-support**: 24/7 support from the team
#### Regulatory Considerations
- **KYC/AML**: Basic identity verification for testers
- **Securities Law**: Beta tokens have no monetary value
- **Tax Reporting**: Testnet transactions not taxable
- **Export Controls**: Compliance with technology export laws
#### Geographic Restrictions
Beta testing is not available in:
- North Korea, Iran, Cuba, Syria, Crimea
- Countries under US sanctions
- Jurisdictions with unclear crypto regulations
### 7. Token Economics Validation
- **Scenario**: Token issuance → Reward distribution → Staking yields → Fee mechanisms
- **Test Cases**:
- Mining reward calculations match whitepaper specs
- Staking yields and unstaking penalties
- Transaction fee burning and distribution
- Marketplace fee structures
- Token inflation/deflation mechanics
- **Success Criteria**: All token operations within 1% of theoretical values
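The "within 1% of theoretical values" criterion is a relative-error check. A minimal sketch, with made-up placeholder numbers in place of real whitepaper reward values:

```python
def within_tolerance(actual, theoretical, tol=0.01):
    """Return True if actual deviates from theoretical by at most tol (relative)."""
    if theoretical == 0:
        return actual == 0
    return abs(actual - theoretical) / abs(theoretical) <= tol

# Placeholder values: observed vs. whitepaper-specified mining reward.
print(within_tolerance(49.6, 50.0))  # True  (0.8% deviation)
print(within_tolerance(48.0, 50.0))  # False (4% deviation)
```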
## Performance Benchmarks (Go/No-Go Criteria)
### Must-Have Metrics
- **Transaction Throughput**: ≥ 100 TPS (Transactions Per Second)
- **Job Completion Time**: ≤ 5 minutes for standard inference jobs
- **API Response Time**: ≤ 200ms (95th percentile)
- **System Uptime**: ≥ 99.9% during beta period
- **MTTR (Mean Time To Recovery)**: ≤ 2 minutes (from chaos tests)
### Nice-to-Have Metrics
- **Transaction Throughput**: ≥ 500 TPS
- **Job Completion Time**: ≤ 2 minutes
- **API Response Time**: ≤ 100ms (95th percentile)
- **Concurrent Users**: ≥ 1000 simultaneous users
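The 95th-percentile latency gates can be evaluated with a nearest-rank percentile over collected samples. The latency numbers below are synthetic, purely to illustrate the check:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Synthetic API latency samples in milliseconds.
latencies_ms = [120, 150, 90, 180, 190, 130, 95, 160, 140, 110]
p95 = percentile(latencies_ms, 95)
must_have_ok = p95 <= 200     # must-have gate: <= 200 ms
nice_to_have_ok = p95 <= 100  # nice-to-have gate: <= 100 ms
print(p95, must_have_ok, nice_to_have_ok)  # 190 True False
```

In a real beta run the samples would come from the monitoring stack rather than a hard-coded list, and the same gate pattern applies to TPS and MTTR.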
## Security Testing
### Automated Security Scans
- **Smart Contract Audits**: Completed by [Security Firm]
- **Penetration Testing**: OWASP Top 10 validation
- **Dependency Scanning**: CVE scan of all dependencies
- **Chaos Testing**: Network partition and coordinator outage scenarios
### Manual Security Reviews
- **Authorization Testing**: API key validation and permissions
- **Data Privacy**: GDPR compliance validation
- **Cryptography**: Proof verification and signature validation
- **Infrastructure Security**: Kubernetes and cloud security review
## Test Environment Setup
### Beta Environment
- **Network**: Separate testnet with faucet for test tokens
- **Infrastructure**: Production-like setup with monitoring
- **Data**: Reset weekly to ensure clean testing
- **Support**: 24/7 Discord support channel
### Access Credentials
- **Testnet Faucet**: 1000 AITBC tokens per tester
- **API Keys**: Unique keys per tester with rate limits
- **Wallet Seeds**: Generated per tester with backup instructions
- **Mining Accounts**: Pre-configured mining pools for testing
## Feedback Collection Mechanisms
### Automated Collection
- **Error Reporting**: Automatic crash reports and error logs
- **Performance Metrics**: Client-side performance data
- **Usage Analytics**: Feature usage tracking (anonymized)
- **Survey System**: In-app feedback prompts
### Manual Collection
- **Weekly Surveys**: Structured feedback on specific features
- **Discord Channels**: Real-time feedback and discussions
- **Office Hours**: Weekly Q&A sessions with the team
- **Bug Bounty**: Program for critical issue discovery
## Success Criteria
### Go/No-Go Decision Points
#### Week 2 Checkpoint (Jan 26)
- **Go Criteria**: 80% of testers onboarded, basic flows working
- **Blockers**: Critical bugs in job submission/completion
#### Week 4 Checkpoint (Feb 9)
- **Go Criteria**: 50 marketplace transactions, explorer functional
- **Blockers**: Security vulnerabilities, performance < 50 TPS
#### Week 6 Final Decision (Feb 23)
- **Go Criteria**: All UAT scenarios passed, benchmarks met
- **Blockers**: Any critical security issue, MTTR > 5 minutes
### Overall Success Metrics
- **User Satisfaction**: ≥ 4.0/5.0 average rating
- **Bug Resolution**: 90% of reported bugs fixed
- **Performance**: All benchmarks met
- **Security**: No critical vulnerabilities
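The four overall success metrics combine into a single go/no-go decision. A sketch with illustrative field names and numbers (not tied to any real reporting schema):

```python
def go_no_go(metrics):
    """Evaluate the overall beta success gates; returns (ok, per-gate detail)."""
    gates = {
        "satisfaction": metrics["satisfaction"] >= 4.0,
        "bug_resolution": metrics["bugs_fixed"] / metrics["bugs_reported"] >= 0.90,
        "benchmarks": metrics["benchmarks_met"],
        "security": metrics["critical_vulns"] == 0,
    }
    return all(gates.values()), gates

# Illustrative numbers only:
ok, detail = go_no_go({
    "satisfaction": 4.2,
    "bugs_fixed": 93,
    "bugs_reported": 100,
    "benchmarks_met": True,
    "critical_vulns": 0,
})
print(ok)  # True
```

Returning the per-gate detail alongside the boolean makes it easy to report which criterion blocked a release at each checkpoint.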
## Risk Management
### Technical Risks
- **Consensus Issues**: Rollback to previous version
- **Performance Degradation**: Auto-scaling and optimization
- **Security Breaches**: Immediate patch and notification
### Operational Risks
- **Test Environment Downtime**: Backup environment ready
- **Low Tester Participation**: Incentive program adjustments
- **Feature Scope Creep**: Strict feature freeze after Week 4
### Mitigation Strategies
- **Daily Health Checks**: Automated monitoring and alerts
- **Rollback Plan**: Documented procedures for quick rollback
- **Communication Plan**: Regular updates to all stakeholders
## Communication Plan
### Internal Updates
- **Daily Standups**: Development team sync
- **Weekly Reports**: Progress to leadership
- **Bi-weekly Demos**: Feature demonstrations
### External Updates
- **Beta Newsletter**: Weekly updates to testers
- **Blog Posts**: Public progress updates
- **Social Media**: Regular platform updates
## Post-Beta Activities
### RC Phase Preparation
- **Bug Triage**: Prioritize and assign all reported issues
- **Performance Tuning**: Optimize based on beta metrics
- **Documentation Updates**: Incorporate beta feedback
### GA Preparation
- **Final Security Review**: Complete audit and penetration test
- **Infrastructure Scaling**: Prepare for production load
- **Support Team Training**: Train and equip the customer support team
## Appendix
### A. Test Case Matrix
[Detailed test case spreadsheet link]
### B. Performance Benchmark Results
[Benchmark data and graphs]
### C. Security Audit Reports
[Audit firm reports and findings]
### D. Feedback Analysis
[Summary of all user feedback and actions taken]
## Contact Information
- **Beta Program Manager**: beta@aitbc.io
- **Technical Support**: support@aitbc.io
- **Security Issues**: security@aitbc.io
- **Discord Community**: https://discord.gg/aitbc
---
*Last Updated: 2025-01-10*
*Version: 1.0*
*Next Review: 2025-01-17*