feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru
2025-12-22 10:33:23 +01:00
parent d98b2c7772
commit c8be9d7414
260 changed files with 59033 additions and 351 deletions
--- a/docs/operator/backup_restore.md
+++ b/docs/operator/backup_restore.md
@@ -0,0 +1,316 @@
+# AITBC Backup and Restore Procedures
+
+This document outlines the backup and restore procedures for all AITBC system components, including PostgreSQL, Redis, and blockchain ledger storage.
+
+## Overview
+
+The AITBC platform implements a comprehensive backup strategy with:
+- **Automated daily backups** via Kubernetes CronJobs
+- **Manual backup capabilities** for on-demand operations
+- **Incremental and full backup options** for ledger data
+- **Cloud storage integration** for off-site backups
+- **Retention policies** to manage storage efficiently
+
+## Components
+
+### 1. PostgreSQL Database
+- **Location**: Coordinator API persistent storage
+- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
+- **Backup Format**: Custom PostgreSQL dump with compression
+- **Retention**: 30 days (configurable)
+
+### 2. Redis Cache
+- **Location**: In-memory cache with persistence
+- **Data**: Session cache, temporary data, rate limiting
+- **Backup Format**: RDB snapshot + AOF (if enabled)
+- **Retention**: 30 days (configurable)
+
+### 3. Ledger Storage
+- **Location**: Blockchain node persistent storage
+- **Data**: Blocks, transactions, receipts, wallet states
+- **Backup Format**: Compressed tar archives
+- **Retention**: 30 days (configurable)
+
+## Automated Backups
+
+### Kubernetes CronJob
+
+The automated backup system runs daily at 2:00 AM UTC:
+
+```bash
+# Deploy the backup CronJob
+kubectl apply -f infra/k8s/backup-cronjob.yaml
+
+# Check CronJob status
+kubectl get cronjob aitbc-backup
+
+# View backup jobs
+kubectl get jobs -l app=aitbc-backup
+
+# View backup logs
+kubectl logs job/aitbc-backup-<timestamp>
+```
+
+### Backup Schedule
+
+| Time (UTC) | Component      | Type       | Retention |
+|------------|----------------|------------|-----------|
+| 02:00      | PostgreSQL     | Full       | 30 days   |
+| 02:01      | Redis          | Full       | 30 days   |
+| 02:02      | Ledger         | Full       | 30 days   |
+
+## Manual Backups
+
+### PostgreSQL
+
+```bash
+# Create a manual backup
+./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)
+
+# View available backups
+ls -la /tmp/postgresql-backups/
+
+# Upload to S3 manually (file name matches the backup name used above)
+aws s3 cp /tmp/postgresql-backups/my-backup-$(date +%Y%m%d).sql.gz s3://aitbc-backups-default/postgresql/
+```
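+
+Before uploading, it can be worth checking that the compressed dump is readable. The following is a minimal sketch and not part of the AITBC scripts: `verify_gzip_backup` is a hypothetical helper, and it assumes the backup is a gzip file, as the `.sql.gz` suffix above suggests.
+
+```bash
+# Hypothetical helper (not part of infra/scripts): checks that a backup
+# file exists, is non-empty, and passes gzip's built-in integrity test.
+verify_gzip_backup() {
+  local file="$1"
+  if [ ! -s "$file" ]; then
+    echo "missing or empty: $file" >&2
+    return 1
+  fi
+  if ! gzip -t "$file" 2>/dev/null; then
+    echo "corrupt gzip archive: $file" >&2
+    return 1
+  fi
+  echo "ok: $file"
+}
+```
+
+A call such as `verify_gzip_backup /tmp/postgresql-backups/<name>.sql.gz && aws s3 cp ...` would then upload only archives that pass the check. Note that `gzip -t` validates the compression layer only; it does not prove the dump restores cleanly.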
+
+### Redis
+
+```bash
+# Create a manual backup
+./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
+
+# Force background save before backup
+kubectl exec -n default deployment/redis -- redis-cli BGSAVE
+```
+
+### Ledger Storage
+
+```bash
+# Create a full backup
+./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)
+
+# Create incremental backup
+./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
+```
+
+## Restore Procedures
+
+### PostgreSQL Restore
+
+```bash
+# List available backups
+aws s3 ls s3://aitbc-backups-default/postgresql/
+
+# Download backup from S3
+aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/
+
+# Restore database
+./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz
+
+# Verify restore
+kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
+```
+
+### Redis Restore
+
+```bash
+# Stop Redis service
+kubectl scale deployment redis --replicas=0 -n default
+
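+# Clear existing data
+# NOTE: with replicas=0 no Redis pod is running, so this exec and the copy below
+# must target a pod that still mounts the Redis data volume.
+kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof
+
+# Copy backup file (redis-0 is an example pod name; use the pod that mounts /data)
+kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb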
+
+# Start Redis service
+kubectl scale deployment redis --replicas=1 -n default
+
+# Verify restore
+kubectl exec -n default deployment/redis -- redis-cli DBSIZE
+```
+
+### Ledger Restore
+
+```bash
+# Stop blockchain nodes
+kubectl scale deployment blockchain-node --replicas=0 -n default
+
+# Extract backup
+tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/
+
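+# Copy ledger data
+# NOTE: kubectl cp needs a running pod that mounts the ledger volume;
+# none exists while the deployment is scaled to 0.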
+kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
+kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
+kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/
+
+# Start blockchain nodes
+kubectl scale deployment blockchain-node --replicas=3 -n default
+
+# Verify restore
+kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
+```
+
+## Disaster Recovery
+
+### Recovery Time Objective (RTO)
+
+| Component      | RTO Target | Notes                        |
+|----------------|------------|------------------------------|
+| PostgreSQL     | 1 hour     | Database restore from backup |
+| Redis          | 15 minutes | Cache rebuild from backup    |
+| Ledger         | 2 hours    | Full chain synchronization   |
+
+### Recovery Point Objective (RPO)
+
+| Component      | RPO Target | Notes                            |
+|----------------|------------|----------------------------------|
+| PostgreSQL     | 24 hours   | Daily backups                    |
+| Redis          | 24 hours   | Daily backups                    |
+| Ledger         | 24 hours   | Daily full + incremental backups |
+
+### Disaster Recovery Steps
+
+1. **Assess Impact**
+   ```bash
+   # Check component status
+   kubectl get pods -n default
+   kubectl get events --sort-by=.metadata.creationTimestamp
+   ```
+
+2. **Restore Critical Services**
+   ```bash
+   # Restore PostgreSQL first (critical for operations)
+   ./infra/scripts/restore_postgresql.sh default [latest-backup]
+   
+   # Restore Redis cache
+   ./infra/scripts/restore_redis.sh default [latest-backup]
+
+   # Restore ledger data
+   ./infra/scripts/restore_ledger.sh default [latest-backup]
+   ```
+
+3. **Verify System Health**
+   ```bash
+   # Check all services
+   kubectl get pods -n default
+   
+   # Verify API endpoints
+   curl -s http://coordinator-api:8011/v1/health
+   curl -s http://blockchain-node:8080/v1/health
+   ```
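+
+Services may need some time after a restore before their health endpoints respond, so a single `curl` can fail spuriously. A minimal retry sketch follows; `retry_until_ok` is a hypothetical helper, and the attempt count and delay in the example are arbitrary assumptions.
+
+```bash
+# Hypothetical helper: retries a command until it succeeds or the attempt
+# budget is exhausted. Arguments: attempts, delay in seconds, command...
+retry_until_ok() {
+  local attempts="$1" delay="$2"
+  shift 2
+  local i
+  for ((i = 1; i <= attempts; i++)); do
+    if "$@"; then
+      return 0
+    fi
+    echo "attempt $i/$attempts failed: $*" >&2
+    if ((i < attempts)); then
+      sleep "$delay"
+    fi
+  done
+  return 1
+}
+
+# Example (URL taken from the health checks above; -f makes curl fail on HTTP errors):
+# retry_until_ok 30 10 curl -sf http://coordinator-api:8011/v1/health
+```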
+
+## Monitoring and Alerting
+
+### Backup Monitoring
+
+Prometheus metrics track backup success/failure:
+
+```yaml
+# Prometheus alerting rule for backups (Alertmanager handles routing and notification)
+- alert: BackupFailed
+  expr: backup_success == 0
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    summary: "Backup failed for {{ $labels.component }}"
+    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
+```
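+
+The rule above assumes a `backup_success` gauge with a `component` label. This document does not show how the backup scripts export it; one common option is the node_exporter textfile collector. The sketch below illustrates that option only: `write_backup_metric` and the output directory are hypothetical, and the directory would have to match the collector's configured path.
+
+```bash
+# Hypothetical helper: writes backup_success for one component in the
+# Prometheus text exposition format. The output directory is an assumption
+# and must match the textfile collector's configured directory.
+write_backup_metric() {
+  local component="$1" status="$2" dir="$3"
+  local tmp="$dir/backup_${component}.prom.$$"
+  {
+    echo "# HELP backup_success 1 if the last backup succeeded, 0 otherwise"
+    echo "# TYPE backup_success gauge"
+    echo "backup_success{component=\"${component}\"} ${status}"
+  } > "$tmp"
+  # Rename into place so the collector never reads a half-written file.
+  mv "$tmp" "$dir/backup_${component}.prom"
+}
+```
+
+A backup script could call `write_backup_metric postgresql 1 <dir>` on success and pass `0` on failure.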
+
+### Log Monitoring
+
+```bash
+# View backup logs
+kubectl logs -l app=aitbc-backup -n default --tail=100
+
+# Monitor backup CronJob
+kubectl get cronjob aitbc-backup -w
+```
+
+## Best Practices
+
+### Backup Security
+
+1. **Encryption**: Backups uploaded to S3 use server-side encryption
+2. **Access Control**: IAM policies restrict backup access
+3. **Retention**: Automatic cleanup of old backups
+4. **Validation**: Regular restore testing
+
+### Performance Considerations
+
+1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
+2. **Sequential Processing**: Components are backed up one after another (see Backup Schedule)
+3. **Compression**: All backups are compressed to save storage
+4. **Incremental Backups**: The ledger supports incremental backups to reduce size
+
+### Testing
+
+1. **Monthly Restore Tests**: Validate backup integrity
+2. **Disaster Recovery Drills**: Quarterly full scenario testing
+3. **Documentation Updates**: Keep procedures current
+
+## Troubleshooting
+
+### Common Issues
+
+#### Backup Fails with "Permission Denied"
+```bash
+# Check service account permissions
+kubectl describe serviceaccount backup-service-account
+kubectl describe role backup-role
+```
+
+#### Restore Fails with "Database in Use"
+```bash
+# Scale down application before restore
+kubectl scale deployment coordinator-api --replicas=0
+# Perform restore
+# Scale up after restore
+kubectl scale deployment coordinator-api --replicas=3
+```
+
+#### Ledger Restore Incomplete
+```bash
+# Verify backup integrity
+tar -tzf ledger-backup.tar.gz
+# Check metadata.json for block height
+jq '.latest_block_height' metadata.json
+```
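+
+The integrity check can be extended to confirm that the archive holds the directories copied during a ledger restore. This is a minimal sketch: `check_ledger_archive` is a hypothetical helper, and it assumes `chain/`, `wallets/`, and `receipts/` sit at the top level of the archive, as the restore steps imply.
+
+```bash
+# Hypothetical helper: confirms that a ledger archive is readable and
+# contains the three directories copied during the ledger restore.
+check_ledger_archive() {
+  local archive="$1" listing dir missing=0
+  if ! listing="$(tar -tzf "$archive" 2>/dev/null)"; then
+    echo "unreadable archive: $archive" >&2
+    return 1
+  fi
+  for dir in chain wallets receipts; do
+    # Accept entries listed as "chain/..." or "./chain/..."
+    if ! grep -Eq "^(\./)?${dir}(/|$)" <<<"$listing"; then
+      echo "missing directory: $dir" >&2
+      missing=1
+    fi
+  done
+  return "$missing"
+}
+```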
+
+### Getting Help
+
+1. Check logs: `kubectl logs -l app=aitbc-backup`
+2. Verify storage: `df -h` on backup nodes
+3. Check network: Test S3 connectivity
+4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
+
+## Configuration
+
+### Environment Variables
+
+| Variable               | Default          | Description                     |
+|------------------------|------------------|---------------------------------|
+| BACKUP_RETENTION_DAYS  | 30               | Days to keep backups            |
+| BACKUP_SCHEDULE        | 0 2 * * *        | Cron schedule for backups       |
+| S3_BUCKET_PREFIX       | aitbc-backups    | S3 bucket name prefix           |
+| COMPRESSION_LEVEL      | 6                | gzip compression level          |
+
+### Customizing Backup Schedule
+
+Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:
+
+```yaml
+spec:
+  schedule: "0 3 * * *"  # Change to 3 AM UTC
+```
+
+### Adjusting Retention
+
+Modify retention in each backup script:
+
+```bash
+# In backup_*.sh scripts
+RETENTION_DAYS=60  # Keep for 60 days instead of 30
+```
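+
+How the scripts apply `RETENTION_DAYS` is not shown here. As a rough illustration only, local cleanup could look like the sketch below; `prune_old_backups`, the directory layout, and the file suffixes are assumptions, and it does not touch copies already uploaded to S3.
+
+```bash
+# Hypothetical helper: deletes backup files older than the retention window
+# from a local directory. Falls back to RETENTION_DAYS, then to 30 days.
+prune_old_backups() {
+  local dir="$1" days="${2:-${RETENTION_DAYS:-30}}"
+  # -mtime +N matches files whose age, in whole days, exceeds N.
+  find "$dir" -maxdepth 1 -type f \( -name '*.gz' -o -name '*.rdb' \) \
+    -mtime +"$days" -print -delete
+}
+```
+
+For example, `prune_old_backups /tmp/postgresql-backups 60` would print and remove matching files older than 60 days.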