# AITBC Backup and Restore Procedures
This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.
## Overview
The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently
## Components
### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)
### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)
### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)
## Automated Backups
### Kubernetes CronJob
The automated backup system runs daily at 2:00 AM UTC:
```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml
# Check CronJob status
kubectl get cronjob aitbc-backup
# View backup jobs
kubectl get jobs -l app=aitbc-backup
# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```
### Backup Schedule
| Time (UTC) | Component | Type | Retention |
|------------|----------------|------------|-----------|
| 02:00 | PostgreSQL | Full | 30 days |
| 02:01 | Redis | Full | 30 days |
| 02:02 | Ledger | Full | 30 days |
## Manual Backups
### PostgreSQL
```bash
# Create a manual backup
BACKUP_NAME="my-backup-$(date +%Y%m%d)"
./infra/scripts/backup_postgresql.sh default "$BACKUP_NAME"
# View available backups
ls -la /tmp/postgresql-backups/
# Upload to S3 manually
aws s3 cp "/tmp/postgresql-backups/${BACKUP_NAME}.sql.gz" s3://aitbc-backups-default/postgresql/
```
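Before uploading, it can be worth confirming that the archive is a readable gzip stream. The following is a minimal sketch, assuming the backup is a gzip file as the `.sql.gz` extension suggests; `verify_gzip_backup` is a hypothetical helper and is not part of `infra/scripts`.

```bash
# Hypothetical helper (not part of infra/scripts): succeeds only if the file
# exists, is non-empty, and passes gzip's built-in integrity test.
verify_gzip_backup() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  echo "ok: $file"
}

# Usage:
# verify_gzip_backup /tmp/postgresql-backups/<backup-name>.sql.gz
```

A passing check only shows that the compressed stream is intact; it does not prove that the dump restores cleanly.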
### Redis
```bash
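# Force a background save so the snapshot on disk is current
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
# BGSAVE returns immediately; LASTSAVE changes once the snapshot is written
kubectl exec -n default deployment/redis -- redis-cli LASTSAVE
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)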
```
### Ledger Storage
```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)
# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```
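How `backup_ledger.sh` selects files for an incremental backup is not shown here. One common approach, sketched below purely as an illustration, is to archive only the files modified since a marker file left by the previous backup. The function name and the marker path are hypothetical.

```bash
# Hypothetical sketch: print files under a data directory that were modified
# after a marker file, i.e. the candidates for an incremental archive.
changed_since() {
  local data_dir="$1" marker="$2"
  find "$data_dir" -type f -newer "$marker" | LC_ALL=C sort
}

# Usage idea (paths are illustrative, not the script's actual layout):
# changed_since /app/data/chain /app/data/.last_full_backup | tar -czf incremental.tar.gz -T -
```

An incremental archive built this way is only useful together with the full backup it was taken against, so both need to be retained.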
## Restore Procedures
### PostgreSQL Restore
```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/
# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/
# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz
# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```
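Picking the newest backup from the listing can be scripted. The sketch below assumes automated backup keys embed a sortable `YYYYMMDD_HHMMSS` timestamp, as in the file name above, and contain no spaces; `latest_backup_key` is a hypothetical helper name.

```bash
# Hypothetical helper: read `aws s3 ls` output on stdin and print the newest
# backup key. Relies on the timestamp in the key sorting lexicographically.
latest_backup_key() {
  awk '$4 ~ /\.sql\.gz$/ { print $4 }' | LC_ALL=C sort | tail -n 1
}

# Usage:
# aws s3 ls s3://aitbc-backups-default/postgresql/ | latest_backup_key
```

Manually named backups (for example `my-backup-...`) sort by name rather than by time, so this only works reliably for keys that share one naming pattern.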
### Redis Restore
```bash
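# Stop Redis so it cannot overwrite the restored snapshot on shutdown
kubectl scale deployment redis --replicas=0 -n default
# NOTE: with zero replicas no Redis pod exists, so the next two commands need a
# pod that mounts the Redis data volume at /data (for example a temporary
# helper pod). Replace <redis-data-pod> with that pod's name.
# Clear existing data
kubectl exec -n default <redis-data-pod> -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/<redis-data-pod>:/data/dump.rdb
# Start Redis service
kubectl scale deployment redis --replicas=1 -n default
# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE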
```
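A quick sanity check before copying a snapshot into place is to confirm that the file looks like an RDB file at all. RDB files begin with the ASCII magic string `REDIS`; the helper below, a hypothetical name and only a sketch, tests for that prefix.

```bash
# Hypothetical helper: succeed if the file starts with the RDB magic "REDIS".
# This catches truncated downloads and wrong files, not deeper corruption.
is_rdb_file() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "REDIS" ]
}

# Usage:
# is_rdb_file /tmp/redis-backup.rdb || echo "not an RDB snapshot" >&2
```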
### Ledger Restore
```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default
# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/
# NOTE: with zero replicas no blockchain-node pod exists, so copy through a pod
# that mounts the ledger data volume at /app/data (for example a temporary
# helper pod). Replace <ledger-data-pod> with that pod's name.
# Copy ledger data
kubectl cp /tmp/chain/ default/<ledger-data-pod>:/app/data/chain/
kubectl cp /tmp/wallets/ default/<ledger-data-pod>:/app/data/wallets/
kubectl cp /tmp/receipts/ default/<ledger-data-pod>:/app/data/receipts/
# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default
# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```
## Disaster Recovery
### Recovery Time Objective (RTO)
| Component | RTO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 1 hour | Database restore from backup |
| Redis | 15 minutes | Cache rebuild from backup |
| Ledger | 2 hours | Full chain synchronization |
### Recovery Point Objective (RPO)
| Component | RPO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 24 hours | Daily backups |
| Redis | 24 hours | Daily backups |
| Ledger | 24 hours | Daily full + incremental backups|
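A 24-hour RPO only holds if a backup newer than 24 hours actually exists. The sketch below checks the age of a local backup file against that target; `backup_within_rpo` is a hypothetical helper, and checking file modification time (rather than the timestamp in the file name) is an assumption made for simplicity.

```bash
# Hypothetical check: succeed if the file was modified within the last
# max_age_minutes (1440 minutes = the 24-hour RPO target).
backup_within_rpo() {
  local file="$1" max_age_minutes="${2:-1440}"
  [ -e "$file" ] || return 1
  # find prints the path only when the file is older than the limit
  [ -z "$(find "$file" -mmin +"$max_age_minutes")" ]
}

# Usage:
# backup_within_rpo /tmp/postgresql-backups/<backup-name>.sql.gz || echo "RPO exceeded" >&2
```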
### Disaster Recovery Steps
1. **Assess Impact**
```bash
# Check component status
kubectl get pods -n default
kubectl get events --sort-by=.metadata.creationTimestamp
```
2. **Restore Critical Services**
```bash
# Restore PostgreSQL first (critical for operations)
./infra/scripts/restore_postgresql.sh default [latest-backup]
# Restore Redis cache
./infra/scripts/restore_redis.sh default [latest-backup]
# Restore ledger data
./infra/scripts/restore_ledger.sh default [latest-backup]
```
3. **Verify System Health**
```bash
# Check all services
kubectl get pods -n default
# Verify API endpoints
curl -s http://coordinator-api:8011/v1/health
curl -s http://blockchain-node:8080/v1/health
```
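Services rarely report healthy the instant their pods start, so a single health request right after a restore can fail spuriously. A small retry wrapper, sketched here with a hypothetical name and arbitrary example limits, makes the verification step more robust.

```bash
# Hypothetical helper: run a command up to max_attempts times, sleeping
# delay_seconds between failures; returns 0 as soon as the command succeeds.
retry() {
  local max_attempts="$1" delay_seconds="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Usage (endpoint from the step above; -f makes curl fail on HTTP errors):
# retry 30 10 curl -sf http://coordinator-api:8011/v1/health
```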
## Monitoring and Alerting
### Backup Monitoring
Prometheus metrics track backup success/failure:
```yaml
# Prometheus alerting rule for backups (notifications are routed by Alertmanager)
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```
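The rule above depends on a `backup_success` gauge with a `component` label. How the backup job publishes that metric is not described here; the sketch below shows one possible way to produce it in the Prometheus text exposition format. The function name and the Pushgateway URL in the usage comment are assumptions.

```bash
# Hypothetical sketch: print backup_success in the Prometheus text exposition
# format (1 = last backup succeeded, 0 = failed).
emit_backup_metric() {
  local component="$1" exit_code="$2" success=0
  [ "$exit_code" -eq 0 ] && success=1
  printf '# TYPE backup_success gauge\n'
  printf 'backup_success{component="%s"} %d\n' "$component" "$success"
}

# Usage idea: pass the backup script's exit status and push the result.
# The Pushgateway address is an assumption, not part of the documented setup:
# emit_backup_metric postgresql "$?" | curl -s --data-binary @- http://pushgateway:9091/metrics/job/aitbc-backup
```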
### Log Monitoring
```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100
# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```
## Best Practices
### Backup Security
1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing
### Performance Considerations
1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up one at a time (PostgreSQL, then Redis, then ledger)
3. **Compression**: All backups compressed to save storage
4. **Incremental Backups**: Ledger supports incremental to reduce size
### Testing
1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current
## Troubleshooting
### Common Issues
#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```
#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0 -n default
# Perform restore
./infra/scripts/restore_postgresql.sh default /tmp/<backup-file>.sql.gz
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3 -n default
```
#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```
### Getting Help
1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
## Configuration
### Environment Variables
| Variable | Default | Description |
|------------------------|------------------|---------------------------------|
| BACKUP_RETENTION_DAYS | 30 | Days to keep backups |
| BACKUP_SCHEDULE | 0 2 * * * | Cron schedule for backups |
| S3_BUCKET_PREFIX | aitbc-backups | S3 bucket name prefix |
| COMPRESSION_LEVEL | 6 | gzip compression level |
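As an illustration of how a script might consume these variables, the sketch below applies the documented defaults and derives a bucket name. The `<prefix>-<namespace>` naming is an inference from the `aitbc-backups-default` bucket used in the examples, and `backup_bucket` is a hypothetical helper.

```bash
# Hypothetical sketch of how a backup script might consume these variables.
# The ":=" expansion assigns the default only when the variable is unset or empty.
: "${BACKUP_RETENTION_DAYS:=30}"
: "${S3_BUCKET_PREFIX:=aitbc-backups}"
: "${COMPRESSION_LEVEL:=6}"

# Assumption: the bucket is "<prefix>-<namespace>", which matches the
# s3://aitbc-backups-default bucket used with the "default" namespace.
backup_bucket() {
  local namespace="$1"
  printf '%s-%s\n' "$S3_BUCKET_PREFIX" "$namespace"
}

# Usage:
# aws s3 ls "s3://$(backup_bucket default)/postgresql/"
```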
### Customizing Backup Schedule
Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:
```yaml
spec:
  schedule: "0 3 * * *" # Change to 3 AM UTC
```
### Adjusting Retention
Modify retention in each backup script:
```bash
# In backup_*.sh scripts
RETENTION_DAYS=60 # Keep for 60 days instead of 30
```
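How the scripts apply `RETENTION_DAYS` is not shown. For local backup directories, a retention window is typically enforced by deleting files older than the limit, as in this sketch. The function name is hypothetical, and the `*.gz` pattern is an assumption that covers `.sql.gz` and `.tar.gz` archives but not Redis `.rdb` snapshots.

```bash
# Hypothetical sketch: delete local backup archives older than a retention window.
prune_backups() {
  local backup_dir="$1" retention_days="${2:-30}"
  # -mtime +N matches files last modified more than N full days ago
  find "$backup_dir" -type f -name '*.gz' -mtime +"$retention_days" -exec rm -f {} +
}

# Usage:
# prune_backups /tmp/postgresql-backups 60
```

This only affects local files; objects already uploaded to S3 need their own expiry mechanism.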