AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.

Overview

The AITBC platform implements a comprehensive backup strategy with:

  • Automated daily backups via Kubernetes CronJobs
  • Manual backup capabilities for on-demand operations
  • Incremental and full backup options for ledger data
  • Cloud storage integration for off-site backups
  • Retention policies to manage storage efficiently

Components

1. PostgreSQL Database

  • Location: Coordinator API persistent storage
  • Data: Jobs, marketplace offers/bids, user sessions, configuration
  • Backup Format: Custom PostgreSQL dump with compression
  • Retention: 30 days (configurable)

2. Redis Cache

  • Location: In-memory cache with persistence
  • Data: Session cache, temporary data, rate limiting
  • Backup Format: RDB snapshot + AOF (if enabled)
  • Retention: 30 days (configurable)

3. Ledger Storage

  • Location: Blockchain node persistent storage
  • Data: Blocks, transactions, receipts, wallet states
  • Backup Format: Compressed tar archives
  • Retention: 30 days (configurable)

Automated Backups

Kubernetes CronJob

The automated backup system runs daily, starting at 02:00 UTC. To deploy and inspect it:

# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>

Backup Schedule

Time (UTC)  Component   Type  Retention
02:00       PostgreSQL  Full  30 days
02:01       Redis       Full  30 days
02:02       Ledger      Full  30 days

Manual Backups

PostgreSQL

# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup-$(date +%Y%m%d).sql.gz s3://aitbc-backups-default/postgresql/
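
Before uploading a dump, it can help to confirm that the file is a readable gzip stream. The following is a minimal sketch: verify_backup is a hypothetical helper that is not part of infra/scripts, and it assumes the backup is a plain gzip file, as the .sql.gz names in this document suggest.

```shell
# verify_backup FILE
# Returns 0 if FILE exists, is non-empty, and passes gzip's integrity test.
# Hypothetical helper; not part of the repository scripts.
verify_backup() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    if ! gzip -t "$1" 2>/dev/null; then
        echo "corrupt gzip stream: $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Example (commented out so the snippet has no side effects):
# file=/tmp/postgresql-backups/<backup-name>.sql.gz
# verify_backup "$file" && aws s3 cp "$file" s3://aitbc-backups-default/postgresql/
```

The check only proves that the compressed stream is intact; it does not prove that the SQL inside restores cleanly, which is what the restore tests under Testing are for.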

Redis

# Force a background save first so the snapshot on disk is current
kubectl exec -n default deployment/redis -- redis-cli BGSAVE

# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)

Ledger Storage

# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true

Restore Procedures

PostgreSQL Restore

# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
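
A freshly restored service may need a little time before its health endpoint answers, so a single curl can fail spuriously. The sketch below polls the endpoint a bounded number of times. wait_for_health is a hypothetical helper, and the attempt count and delay defaults are illustrative rather than values taken from the AITBC scripts.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a response without an HTTP error status,
# or gives up after ATTEMPTS tries. Hypothetical helper.
wait_for_health() {
    url=$1
    attempts=${2:-30}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "still unhealthy after $attempts attempt(s): $url" >&2
    return 1
}

# Example (commented out): wait_for_health http://localhost:8011/v1/health 30 5
```

The helper assumes curl can reach the URL from wherever it runs; the verification command above reaches the endpoint from inside the pod through kubectl exec.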

Redis Restore

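# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

# Clear existing data and copy the backup file.
# Once the deployment is scaled to zero no Redis pod exists, so run these two
# commands against a helper pod that mounts the Redis data volume at /data;
# <redis-data-pod> is a placeholder for that pod's name.
kubectl exec -n default <redis-data-pod> -- rm -f /data/dump.rdb /data/appendonly.aof
kubectl cp /tmp/redis-backup.rdb default/<redis-data-pod>:/data/dump.rdb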

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE

Ledger Restore

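# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

# Copy ledger data.
# With the deployment scaled to zero there is no node pod to copy into, so
# target a helper pod that mounts the ledger volume at /app/data;
# <ledger-data-pod> is a placeholder for that pod's name.
kubectl cp /tmp/chain/ default/<ledger-data-pod>:/app/data/chain/
kubectl cp /tmp/wallets/ default/<ledger-data-pod>:/app/data/wallets/
kubectl cp /tmp/receipts/ default/<ledger-data-pod>:/app/data/receipts/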

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head

Disaster Recovery

Recovery Time Objective (RTO)

Component   RTO Target  Notes
PostgreSQL  1 hour      Database restore from backup
Redis       15 minutes  Cache rebuild from backup
Ledger      2 hours     Full chain synchronization

Recovery Point Objective (RPO)

Component   RPO Target  Notes
PostgreSQL  24 hours    Daily backups
Redis       24 hours    Daily backups
Ledger      24 hours    Daily full + incremental backups
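
One way to check the 24-hour RPO in practice is to compute the age of the newest backup from its file name. The sketch below assumes that backup names embed a UTC timestamp of the form YYYYMMDD_HHMMSS, as in postgresql-backup-20231222_020000.sql.gz, and that GNU date is available. backup_age_hours is a hypothetical helper.

```shell
# backup_age_hours NAME [NOW_EPOCH]
# Prints the age, in whole hours, of a backup whose name embeds a UTC
# timestamp of the form YYYYMMDD_HHMMSS. NOW_EPOCH defaults to the current
# time. Requires GNU date for the -d option. Hypothetical helper.
backup_age_hours() {
    stamp=$(printf '%s\n' "$1" | sed -n \
        's/.*\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)_\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\).*/\1-\2-\3 \4:\5:\6/p')
    if [ -z "$stamp" ]; then
        echo "no YYYYMMDD_HHMMSS timestamp in: $1" >&2
        return 1
    fi
    taken=$(date -u -d "$stamp" +%s) || return 1
    now=${2:-$(date -u +%s)}
    echo $(( (now - taken) / 3600 ))
}

# Example (commented out):
# age=$(backup_age_hours postgresql-backup-20231222_020000.sql.gz)
# if [ "$age" -gt 24 ]; then echo "RPO exceeded: newest backup is ${age}h old"; fi
```

The name to pass in would typically be the last entry of an aws s3 ls listing for the component's prefix.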

Disaster Recovery Steps

  1. Assess Impact

    # Check component status
    kubectl get pods -n default
    kubectl get events --sort-by=.metadata.creationTimestamp
    
  2. Restore Critical Services

    # Restore PostgreSQL first (critical for operations)
    ./infra/scripts/restore_postgresql.sh default [latest-backup]
    
    # Restore Redis cache
    ./infra/scripts/restore_redis.sh default [latest-backup]
    
    # Restore ledger data
    ./infra/scripts/restore_ledger.sh default [latest-backup]
    
  3. Verify System Health

    # Check all services
    kubectl get pods -n default
    
    # Verify API endpoints
    curl -s http://coordinator-api:8011/v1/health
    curl -s http://blockchain-node:8080/v1/health
    

Monitoring and Alerting

Backup Monitoring

Prometheus metrics track backup success/failure:

# Prometheus alerting rule for backups
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
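
This document does not say how the backup_success metric reaches Prometheus. One common pattern for short-lived jobs is to push a sample to a Prometheus Pushgateway when the job finishes. The sketch below only illustrates that pattern: backup_metric_line is a hypothetical helper, and the Pushgateway address and job name are assumptions, not part of the AITBC deployment.

```shell
# backup_metric_line COMPONENT STATUS
# Prints one sample in the Prometheus text exposition format.
# STATUS is 1 for success and 0 for failure, matching the alert expression
# backup_success == 0. Hypothetical helper.
backup_metric_line() {
    case $2 in
        0|1) ;;
        *) echo "status must be 0 or 1, got: $2" >&2; return 1 ;;
    esac
    printf 'backup_success{component="%s"} %s\n' "$1" "$2"
}

# Example push (commented out; host, port and job name are assumptions):
# backup_metric_line postgresql 1 |
#   curl -fsS --data-binary @- \
#     http://pushgateway:9091/metrics/job/aitbc-backup/component/postgresql
```

Putting the component in the URL grouping key keeps one component's push from replacing another component's sample in the same job group.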

Log Monitoring

# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w

Best Practices

Backup Security

  1. Encryption: Backups uploaded to S3 use server-side encryption
  2. Access Control: IAM policies restrict backup access
  3. Retention: Automatic cleanup of old backups
  4. Validation: Regular restore testing

Performance Considerations

  1. Off-Peak Backups: Scheduled during low traffic (2 AM UTC)
  2. Sequential Processing: Components are backed up one after another (see Backup Schedule)
  3. Compression: All backups compressed to save storage
  4. Incremental Backups: Ledger supports incremental to reduce size

Testing

  1. Monthly Restore Tests: Validate backup integrity
  2. Disaster Recovery Drills: Quarterly full scenario testing
  3. Documentation Updates: Keep procedures current

Troubleshooting

Common Issues

Backup Fails with "Permission Denied"

# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
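
Reading a Role by eye can miss a verb or resource. kubectl auth can-i asks the API server directly whether the service account may perform an action. The sketch below wraps that call. sa_can is a hypothetical helper, and which permissions the backup job actually needs is an assumption to confirm against infra/k8s/backup-cronjob.yaml.

```shell
# sa_can NAMESPACE SERVICE_ACCOUNT VERB RESOURCE [EXTRA_KUBECTL_FLAGS...]
# Asks the API server whether the service account may perform VERB on
# RESOURCE in NAMESPACE. kubectl prints yes or no and exits non-zero on no.
# Hypothetical helper.
sa_can() {
    ns=$1
    sa=$2
    verb=$3
    resource=$4
    shift 4
    kubectl auth can-i "$verb" "$resource" -n "$ns" \
        --as="system:serviceaccount:$ns:$sa" "$@"
}

# Examples (commented out; the permission list is an assumption):
# sa_can default backup-service-account get pods
# sa_can default backup-service-account create pods --subresource=exec
```

The --as flag relies on impersonation, so the user running the check needs permission to impersonate service accounts.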

Restore Fails with "Database in Use"

# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3

Ledger Restore Incomplete

# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
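
Listing the archive shows whether it is readable, but an incomplete backup can still list cleanly. A quick structural check is to confirm that the directories copied during a ledger restore (chain, wallets, receipts) are all present. The sketch below assumes those directories sit at the top level of the archive, as the restore commands imply. check_ledger_archive is a hypothetical helper.

```shell
# check_ledger_archive ARCHIVE
# Returns 0 if ARCHIVE is a readable gzip-compressed tar file containing
# chain/, wallets/ and receipts/ entries at its top level. Hypothetical helper.
check_ledger_archive() {
    listing=$(tar -tzf "$1" 2>/dev/null) || {
        echo "unreadable archive: $1" >&2
        return 1
    }
    rc=0
    for dir in chain wallets receipts; do
        # Accept both "chain/..." and "./chain/..." member names.
        if ! printf '%s\n' "$listing" | grep -Eq "^(\./)?$dir/"; then
            echo "missing directory: $dir/" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (commented out): check_ledger_archive /tmp/ledger-backup-20231222_020000.tar.gz
```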

Getting Help

  1. Check logs: kubectl logs -l app=aitbc-backup
  2. Verify storage: df -h on backup nodes
  3. Check network: Test S3 connectivity
  4. Review events: kubectl get events --sort-by=.metadata.creationTimestamp

Configuration

Environment Variables

Variable               Default        Description
BACKUP_RETENTION_DAYS  30             Days to keep backups
BACKUP_SCHEDULE        0 2 * * *      Cron schedule for backups
S3_BUCKET_PREFIX       aitbc-backups  S3 bucket name prefix
COMPRESSION_LEVEL      6              gzip compression level
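
As an illustration of how a script might consume these variables, the sketch below applies the documented defaults when a variable is unset or empty. The backup_bucket helper is hypothetical, and the rule that the bucket name is the prefix followed by a hyphen and the namespace is an assumption inferred from the aitbc-backups-default bucket used elsewhere in this document.

```shell
# Apply the documented defaults when a variable is unset or empty.
: "${BACKUP_RETENTION_DAYS:=30}"
: "${BACKUP_SCHEDULE:=0 2 * * *}"
: "${S3_BUCKET_PREFIX:=aitbc-backups}"
: "${COMPRESSION_LEVEL:=6}"

# backup_bucket NAMESPACE
# Prints the S3 URL of the bucket for NAMESPACE.
# Assumption: bucket name = S3_BUCKET_PREFIX + "-" + NAMESPACE.
backup_bucket() {
    echo "s3://${S3_BUCKET_PREFIX}-$1"
}

# Example (commented out): aws s3 ls "$(backup_bucket default)/postgresql/"
```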

Customizing Backup Schedule

Edit the CronJob schedule in infra/k8s/backup-cronjob.yaml:

spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC

Adjusting Retention

Modify retention in each backup script:

# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
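
How the scripts apply RETENTION_DAYS is not shown here. For local backup directories, a typical approach is an age-based sweep with find, sketched below. prune_backups is a hypothetical helper, and the -delete option assumes GNU or BSD find.

```shell
# prune_backups DIR [DAYS]
# Deletes regular files under DIR whose age exceeds DAYS and prints each
# deleted path. find's -mtime +N counts whole 24-hour periods, so a file
# must be at least N+1 days old to match. DAYS defaults to RETENTION_DAYS,
# or 30 if that is unset. Hypothetical helper.
prune_backups() {
    dir=$1
    days=${2:-${RETENTION_DAYS:-30}}
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    find "$dir" -type f -mtime +"$days" -print -delete
}

# Example (commented out): prune_backups /tmp/postgresql-backups 60
```

Replacing -delete with nothing (leaving only -print) gives a dry run that lists what would be removed.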