# AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components, including PostgreSQL, Redis, and blockchain ledger storage.

## Overview

The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently

## Components

### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)

### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)

### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)

## Automated Backups

### Kubernetes CronJob

The automated backup system runs daily at 2:00 AM UTC:

```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```

### Backup Schedule

| Time (UTC) | Component  | Type | Retention |
|------------|------------|------|-----------|
| 02:00      | PostgreSQL | Full | 30 days   |
| 02:01      | Redis      | Full | 30 days   |
| 02:02      | Ledger     | Full | 30 days   |

## Manual Backups

### PostgreSQL

```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually (use the file name reported by the listing above)
aws s3 cp /tmp/postgresql-backups/my-backup-$(date +%Y%m%d).sql.gz s3://aitbc-backups-default/postgresql/
```

### Redis

```bash
# Force a background save before the backup
# (BGSAVE returns immediately; the snapshot is written asynchronously)
kubectl exec -n default deployment/redis -- redis-cli BGSAVE

# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
```

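
Because `BGSAVE` only starts the snapshot, a backup taken right afterwards may still pick up the previous `dump.rdb`. The following is a minimal sketch, not part of the AITBC scripts, that waits for the snapshot to finish by polling `INFO persistence`. The helper names `redis_cmd` and `wait_for_bgsave` are hypothetical; the deployment name and namespace are the ones used in the commands above.

```bash
# Hypothetical helper: run redis-cli inside the Redis deployment.
redis_cmd() {
  kubectl exec -n default deployment/redis -- redis-cli "$@"
}

# wait_for_bgsave [timeout-seconds] [poll-interval-seconds]
# Starts BGSAVE, then polls until rdb_bgsave_in_progress is 0.
# Succeeds only if Redis reports the last background save as "ok".
wait_for_bgsave() {
  local timeout="${1:-60}" interval="${2:-1}" waited=0 info
  redis_cmd BGSAVE >/dev/null || return 1
  while [ "$waited" -le "$timeout" ]; do
    info="$(redis_cmd INFO persistence)" || return 1
    if printf '%s\n' "$info" | grep -q '^rdb_bgsave_in_progress:0'; then
      # Snapshot finished; the result of this check is the function's status.
      printf '%s\n' "$info" | grep -q '^rdb_last_bgsave_status:ok'
      return
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1  # timed out
}
```

A call such as `wait_for_bgsave 120 2 && ./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)` would then run the backup only after a fresh snapshot exists.
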
### Ledger Storage

```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create an incremental backup (third argument set to true)
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```

## Restore Procedures

### PostgreSQL Restore

```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```

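
Before handing a downloaded file to the restore script, it can help to confirm that the download is complete. The sketch below assumes that the `.sql.gz` suffix means an ordinary gzip stream; the function name `verify_backup_archive` is hypothetical and does not exist in `infra/scripts`.

```bash
# verify_backup_archive <path>
# Succeeds only if the file exists, is non-empty, and passes gzip's
# built-in integrity test. It does not validate the SQL inside.
verify_backup_archive() {
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 1
  fi
  if ! gzip -t "$archive" 2>/dev/null; then
    echo "corrupt gzip stream: $archive" >&2
    return 1
  fi
  echo "ok: $archive"
}
```

For example, `verify_backup_archive /tmp/postgresql-backup-20231222_020000.sql.gz && ./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz` skips the restore when the archive is truncated.
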
### Redis Restore

```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

# NOTE: kubectl exec and kubectl cp need a running pod. With the deployment
# scaled to zero, run the next two steps against a pod that mounts the Redis
# data volume, and substitute that pod's name for the targets shown here.

# Clear existing data
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof

# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```

### Ledger Restore

```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

# Copy ledger data
# NOTE: kubectl cp needs a running pod that mounts the ledger volume;
# substitute the actual pod name for blockchain-node-0.
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```

## Disaster Recovery

### Recovery Time Objective (RTO)

| Component  | RTO Target | Notes                        |
|------------|------------|------------------------------|
| PostgreSQL | 1 hour     | Database restore from backup |
| Redis      | 15 minutes | Cache rebuild from backup    |
| Ledger     | 2 hours    | Full chain synchronization   |

### Recovery Point Objective (RPO)

| Component  | RPO Target | Notes                            |
|------------|------------|----------------------------------|
| PostgreSQL | 24 hours   | Daily backups                    |
| Redis      | 24 hours   | Daily backups                    |
| Ledger     | 24 hours   | Daily full + incremental backups |

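
A 24-hour RPO only holds if the newest backup is actually less than a day old. As a rough sketch, the age of a backup can be derived from the timestamp in its file name. This assumes the `YYYYMMDD_HHMMSS` stamp seen in names such as `postgresql-backup-20231222_020000.sql.gz` is in UTC, and it relies on GNU `date`; the function names are hypothetical.

```bash
# backup_age_hours <file-name> <now-epoch>
# Extracts the YYYYMMDD_HHMMSS stamp (assumed UTC) from the file name and
# prints the age in whole hours relative to <now-epoch>.
backup_age_hours() {
  local name stamp ts now="$2"
  name="$(basename "$1")"
  stamp="$(printf '%s\n' "$name" | sed -n \
    's/.*\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)_\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\).*/\1-\2-\3 \4:\5:\6/p')"
  [ -n "$stamp" ] || return 1
  ts="$(date -u -d "$stamp" +%s)" || return 1   # GNU date syntax
  echo $(( (now - ts) / 3600 ))
}

# within_rpo <file-name> <now-epoch> [rpo-hours]
# Succeeds when the backup is no older than the RPO target (default 24 hours).
within_rpo() {
  local age
  age="$(backup_age_hours "$1" "$2")" || return 1
  [ "$age" -le "${3:-24}" ]
}
```

Running `within_rpo postgresql-backup-20231222_020000.sql.gz "$(date -u +%s)"` against the newest object listed in the bucket gives a quick pass/fail check for the targets in the table.
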
### Disaster Recovery Steps

1. **Assess Impact**
   ```bash
   # Check component status
   kubectl get pods -n default
   kubectl get events --sort-by=.metadata.creationTimestamp
   ```

2. **Restore Critical Services**
   ```bash
   # Restore PostgreSQL first (critical for operations)
   ./infra/scripts/restore_postgresql.sh default [latest-backup]

   # Restore Redis cache
   ./infra/scripts/restore_redis.sh default [latest-backup]

   # Restore ledger data
   ./infra/scripts/restore_ledger.sh default [latest-backup]
   ```

3. **Verify System Health**
   ```bash
   # Check all services
   kubectl get pods -n default

   # Verify API endpoints
   curl -s http://coordinator-api:8011/v1/health
   curl -s http://blockchain-node:8080/v1/health
   ```

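
Services rarely answer health checks the moment their pods are rescheduled, so the verification step usually needs a few attempts. The helper below is a generic sketch; `retry` is a hypothetical name, and the attempt count and delay in the example are arbitrary choices rather than values taken from the AITBC tooling.

```bash
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example, using the endpoints from step 3 (-f makes curl fail on HTTP errors):
#   retry 30 10 curl -fsS http://coordinator-api:8011/v1/health
#   retry 30 10 curl -fsS http://blockchain-node:8080/v1/health
```
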
## Monitoring and Alerting

### Backup Monitoring

Prometheus metrics track backup success/failure:

```yaml
# Prometheus alerting rule for backups (notifications are routed by Alertmanager)
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```

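
The rule above expects a `backup_success` series with a `component` label. How the backup jobs export it is not described here, so the following is only a sketch of producing that sample in the Prometheus text exposition format; the function name `backup_metric_line` is hypothetical, and the delivery mechanism is left open.

```bash
# backup_metric_line <component> <exit-status>
# Prints one sample in the Prometheus text exposition format:
# value 1 when the exit status is 0, otherwise value 0.
backup_metric_line() {
  local component="$1" status="$2" value=0
  [ "$status" -eq 0 ] && value=1
  printf 'backup_success{component="%s"} %d\n' "$component" "$value"
}
```

A backup script could call `backup_metric_line postgresql $?` right after its dump command and hand the resulting line to whatever exporter the cluster already uses.
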
### Log Monitoring

```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```

## Best Practices

### Backup Security

1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing

### Performance Considerations

1. **Off-Peak Backups**: Scheduled during low-traffic hours (2 AM UTC)
2. **Sequential Processing**: Components are backed up one after another
3. **Compression**: All backups are compressed to save storage
4. **Incremental Backups**: The ledger supports incremental backups to reduce size

### Testing

1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current

## Troubleshooting

### Common Issues

#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```

#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0 -n default

# Perform restore (see "PostgreSQL Restore")

# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3 -n default
```

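
If the restore fails halfway, it is easy to leave the application scaled to zero. One way to guard against that is to wrap the sequence in a function that always scales back up. This is a sketch under the assumption that the deployment lives in the `default` namespace, as in the commands above; `with_scaled_down` is a hypothetical name.

```bash
# with_scaled_down <deployment> <replicas-after> <command...>
# Scales the deployment to zero, runs the command, then restores the replica
# count whether or not the command succeeded. Returns the command's status.
with_scaled_down() {
  local deployment="$1" replicas="$2" status=0
  shift 2
  kubectl scale deployment "$deployment" --replicas=0 -n default || return 1
  "$@" || status=$?
  kubectl scale deployment "$deployment" --replicas="$replicas" -n default || return 1
  return "$status"
}
```

For example: `with_scaled_down coordinator-api 3 ./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz`.
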
#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz

# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```

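
Listing the archive by eye works for a one-off check; for repeated use, the listing can be compared against the directories that the restore procedure copies (`chain`, `wallets`, `receipts`). The sketch below assumes those directories sit at the top level of the archive, which is what the extract-and-copy steps in this document imply. The function name `check_ledger_archive` is hypothetical.

```bash
# check_ledger_archive <archive.tar.gz> [entry...]
# Fails if the archive is unreadable or if any expected top-level entry is
# absent. Without explicit entries it checks chain, wallets and receipts.
check_ledger_archive() {
  local archive="$1" listing entry missing=0
  shift
  [ "$#" -gt 0 ] || set -- chain wallets receipts
  listing="$(tar -tzf "$archive" 2>/dev/null)" || return 1
  for entry in "$@"; do
    # Accept both "chain/..." and "./chain/..." style member names.
    if ! printf '%s\n' "$listing" | grep -Eq "^(\./)?$entry(/|\$)"; then
      echo "missing from archive: $entry" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

If `metadata.json` is stored inside the archive, it can be added to the check with `check_ledger_archive ledger-backup.tar.gz chain wallets receipts metadata.json`.
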
### Getting Help

1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`

## Configuration

### Environment Variables

| Variable              | Default       | Description               |
|-----------------------|---------------|---------------------------|
| BACKUP_RETENTION_DAYS | 30            | Days to keep backups      |
| BACKUP_SCHEDULE       | `0 2 * * *`   | Cron schedule for backups |
| S3_BUCKET_PREFIX      | aitbc-backups | S3 bucket name prefix     |
| COMPRESSION_LEVEL     | 6             | gzip compression level    |

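
As an illustration of how a script might consume these variables, the sketch below applies the documented defaults when a variable is unset. The bucket-name helper assumes the `<prefix>-<namespace>` pattern suggested by `aitbc-backups-default` in the examples; `backup_bucket` is a hypothetical name, and the real scripts may read their configuration differently.

```bash
# Apply the documented defaults when a variable is unset or empty.
: "${BACKUP_RETENTION_DAYS:=30}"
: "${BACKUP_SCHEDULE:=0 2 * * *}"
: "${S3_BUCKET_PREFIX:=aitbc-backups}"
: "${COMPRESSION_LEVEL:=6}"

# backup_bucket <namespace>
# Prints the bucket name, assuming the "<prefix>-<namespace>" pattern.
backup_bucket() {
  printf '%s-%s\n' "$S3_BUCKET_PREFIX" "$1"
}
```

With the defaults in place, `backup_bucket default` prints `aitbc-backups-default`.
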
### Customizing Backup Schedule

Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:

```yaml
spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC
```

### Adjusting Retention

Modify retention in each backup script:

```bash
# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
```

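
For local backup directories such as `/tmp/postgresql-backups/`, a retention setting like this is commonly enforced with `find`. The following is a sketch of that idea, not the cleanup code from the backup scripts; `prune_old_backups` is a hypothetical name, and `-maxdepth` and `-delete` are GNU/BSD `find` extensions.

```bash
# prune_old_backups <directory> [retention-days]
# Deletes regular files in the directory that fall outside the retention
# window and prints each removed path. Note that find's "-mtime +N" matches
# files whose age, in whole days, is greater than N.
prune_old_backups() {
  local dir="$1" days="${2:-${RETENTION_DAYS:-30}}"
  [ -d "$dir" ] || return 1
  find "$dir" -maxdepth 1 -type f -mtime +"$days" -print -delete
}
```

For example, `prune_old_backups /tmp/postgresql-backups 60` removes local dumps older than the 60-day window. Objects already uploaded to S3 are not affected.
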