AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.

Overview

The AITBC platform implements a comprehensive backup strategy with:

Automated daily backups via Kubernetes CronJobs
Manual backup capabilities for on-demand operations
Incremental and full backup options for ledger data
Cloud storage integration for off-site backups
Retention policies to manage storage efficiently

Components

1. PostgreSQL Database

Location: Coordinator API persistent storage
Data: Jobs, marketplace offers/bids, user sessions, configuration
Backup Format: Custom PostgreSQL dump with compression
Retention: 30 days (configurable)

2. Redis Cache

Location: In-memory cache with persistence
Data: Session cache, temporary data, rate limiting
Backup Format: RDB snapshot + AOF (if enabled)
Retention: 30 days (configurable)

3. Ledger Storage

Location: Blockchain node persistent storage
Data: Blocks, transactions, receipts, wallet states
Backup Format: Compressed tar archives
Retention: 30 days (configurable)

Automated Backups

Kubernetes CronJob

The automated backup system runs daily at 2:00 AM UTC:

# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>

Backup Schedule

Time (UTC)	Component	Type	Retention
02:00	PostgreSQL	Full	30 days
02:01	Redis	Full	30 days
02:02	Ledger	Full	30 days
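
The one-minute stagger in the schedule suggests the components run one after another inside a single job. A minimal sketch of such a wrapper follows, assuming the backup scripts take a namespace and a backup name as in the manual examples. The `backup_name` and `run_all_backups` helpers and the `DRY_RUN` variable are hypothetical and are not taken from the repository.

```shell
# Build a name in the <component>-backup-<YYYYmmdd_HHMMSS> form that
# appears in the restore examples (helper name is hypothetical).
backup_name() {
  printf '%s-backup-%s\n' "$1" "$(date -u +%Y%m%d_%H%M%S)"
}

# Run the three backup scripts sequentially. DRY_RUN=1 (the default here)
# only prints the commands; set DRY_RUN=0 to execute them.
run_all_backups() {
  namespace="${1:-default}"
  for component in postgresql redis ledger; do
    name="$(backup_name "$component")"
    cmd="./infra/scripts/backup_${component}.sh $namespace $name"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $cmd"
    else
      $cmd || return 1   # stop at the first failing component
    fi
  done
}
```

Calling `run_all_backups default` in dry-run mode prints the three commands in schedule order without touching the cluster.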

Manual Backups

PostgreSQL

# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually (use the file name shown by the listing above)
aws s3 cp /tmp/postgresql-backups/my-backup-$(date +%Y%m%d).sql.gz s3://aitbc-backups-default/postgresql/
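
Before uploading a dump by hand, it can help to confirm that the compressed file is intact. The sketch below assumes the backup is a gzip file, as the `.sql.gz` extension suggests. `verify_backup` is a hypothetical helper, not part of the backup scripts.

```shell
# Return 0 only if the file exists, is non-empty, and passes gzip's
# built-in integrity test.
verify_backup() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Example (file name is illustrative):
# f=/tmp/postgresql-backups/my-backup.sql.gz
# verify_backup "$f" && aws s3 cp "$f" s3://aitbc-backups-default/postgresql/
```

This only checks the compression layer; it does not prove that the SQL inside restores cleanly, which is what the restore tests are for.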

Redis

# Force a background save before the backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE

# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)

Ledger Storage

# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
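
How `backup_ledger.sh` implements incremental mode is not shown here. One common way to get full and incremental archives from the same command is GNU tar's snapshot file, sketched below as an assumption rather than a description of the script. `ledger_backup` and its arguments are hypothetical names.

```shell
# Archive a directory using a GNU tar snapshot file. The first run against
# a new snapshot file produces a full backup; later runs with the same
# snapshot file store only files added or changed since the previous run.
ledger_backup() {
  src_dir="$1"; archive="$2"; snapshot="$3"
  tar --create --gzip --file="$archive" \
      --listed-incremental="$snapshot" \
      --directory="$src_dir" .
}

# Example (paths are illustrative):
# ledger_backup /app/data/chain /tmp/full.tar.gz /tmp/ledger.snar
# ledger_backup /app/data/chain /tmp/incr.tar.gz /tmp/ledger.snar
```

Restoring from such a chain means extracting the full archive first and then each incremental archive in order.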

Restore Procedures

PostgreSQL Restore

# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health

Redis Restore

# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

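# Clear existing data
# NOTE: with the deployment scaled to 0 no Redis pod is running, so this command
# and the copy below must target a pod that still mounts the Redis data volume.
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof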

# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE

Ledger Restore

# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

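# Copy ledger data
# NOTE: with the deployment scaled to 0 no blockchain-node pod is running, so the
# copy must target a pod that still mounts the ledger data volume.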
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head

Disaster Recovery

Recovery Time Objective (RTO)

Component	RTO Target	Notes
PostgreSQL	1 hour	Database restore from backup
Redis	15 minutes	Cache rebuild from backup
Ledger	2 hours	Full chain synchronization

Recovery Point Objective (RPO)

Component	RPO Target	Notes
PostgreSQL	24 hours	Daily backups
Redis	24 hours	Daily backups
Ledger	24 hours	Daily full + incremental backups
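
A 24-hour RPO only holds if the most recent backup really is less than 24 hours old. The sketch below checks a local backup file against that target using its modification time. `within_rpo` is a hypothetical helper, and the default of 24 hours is taken from the table above.

```shell
# Succeed if FILE was modified within the last RPO_HOURS hours (default 24).
# A missing file counts as a failure.
within_rpo() {
  file="$1"; rpo_hours="${2:-24}"
  [ -n "$(find "$file" -mmin "-$(( rpo_hours * 60 ))" 2>/dev/null)" ]
}

# Example (file name is illustrative):
# within_rpo /tmp/postgresql-backups/latest.sql.gz || echo "RPO exceeded"
```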

Disaster Recovery Steps

Assess Impact

# Check component status
kubectl get pods -n default
kubectl get events --sort-by=.metadata.creationTimestamp

Restore Critical Services

# Restore PostgreSQL first (critical for operations)
./infra/scripts/restore_postgresql.sh default [latest-backup]

# Restore Redis cache
./infra/scripts/restore_redis.sh default [latest-backup]

# Restore ledger data
./infra/scripts/restore_ledger.sh default [latest-backup]

Verify System Health

# Check all services
kubectl get pods -n default

# Verify API endpoints
curl -s http://coordinator-api:8011/v1/health
curl -s http://blockchain-node:8080/v1/health
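
Services may need some time to come up after a restore, so a single `curl` can fail even when the restore succeeded. A small retry wrapper, sketched below, makes the health check more tolerant. `retry` is a hypothetical helper; the attempt count and delay in the example are arbitrary.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1   # give up after the last attempt
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: poll the coordinator API for up to about one minute.
# retry 12 5 curl -fsS http://coordinator-api:8011/v1/health
```

The `-f` flag makes `curl` return a non-zero exit status on HTTP errors, which is what lets the wrapper detect an unhealthy response.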

Monitoring and Alerting

Backup Monitoring

Prometheus metrics track backup success/failure:

# Prometheus alerting rule for backups (notifications are routed by Alertmanager)
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
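
The rule above expects a `backup_success` series with a `component` label. How the backup jobs export it is not described here; one option is to write the sample in the Prometheus text exposition format and let an exporter pick it up. The sketch below assumes that approach. `backup_metric` and the output path in the example are hypothetical.

```shell
# Print one sample in the Prometheus text exposition format.
# $1 = component name, $2 = 1 for success or 0 for failure.
backup_metric() {
  printf 'backup_success{component="%s"} %s\n' "$1" "$2"
}

# Example with the node exporter textfile collector (path is an assumption):
# backup_metric postgresql 1 > /var/lib/node_exporter/textfile/aitbc_backup.prom
```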

Log Monitoring

# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w

Best Practices

Backup Security

Encryption: Backups uploaded to S3 use server-side encryption
Access Control: IAM policies restrict backup access
Retention: Automatic cleanup of old backups
Validation: Regular restore testing

Performance Considerations

Off-Peak Backups: Scheduled during low traffic (2 AM UTC)
Sequential Processing: Components are backed up one after another, one minute apart, to limit load
Compression: All backups compressed to save storage
Incremental Backups: Ledger supports incremental to reduce size

Testing

Monthly Restore Tests: Validate backup integrity
Disaster Recovery Drills: Quarterly full scenario testing
Documentation Updates: Keep procedures current

Troubleshooting

Common Issues

Backup Fails with "Permission Denied"

# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role

Restore Fails with "Database in Use"

# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3

Ledger Restore Incomplete

# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json

Getting Help

Check logs: kubectl logs -l app=aitbc-backup
Verify storage: df -h on backup nodes
Check network: Test S3 connectivity
Review events: kubectl get events --sort-by=.metadata.creationTimestamp

Configuration

Environment Variables

Variable	Default	Description
BACKUP_RETENTION_DAYS	30	Days to keep backups
BACKUP_SCHEDULE	0 2 * * *	Cron schedule for backups
S3_BUCKET_PREFIX	aitbc-backups	S3 bucket name prefix
COMPRESSION_LEVEL	6	gzip compression level
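
A script can pick up these variables with the documented defaults using shell parameter expansion, as sketched below. The `<prefix>-<namespace>` bucket naming is inferred from the `aitbc-backups-default` bucket in the S3 examples, and `bucket_name` is a hypothetical helper.

```shell
# Apply the documented defaults when a variable is unset or empty.
: "${BACKUP_RETENTION_DAYS:=30}"
: "${BACKUP_SCHEDULE:=0 2 * * *}"
: "${S3_BUCKET_PREFIX:=aitbc-backups}"
: "${COMPRESSION_LEVEL:=6}"

# Derive the bucket name for a namespace (naming scheme is an assumption).
bucket_name() {
  printf '%s-%s\n' "$S3_BUCKET_PREFIX" "$1"
}

# Example: compress a dump with the configured gzip level.
# gzip "-$COMPRESSION_LEVEL" /tmp/postgresql-backups/my-backup.sql
```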

Customizing Backup Schedule

Edit the CronJob schedule in infra/k8s/backup-cronjob.yaml:

spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC

Adjusting Retention

Modify retention in each backup script:

# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
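
For local backup directories, retention can be enforced with `find`. The sketch below is one possible implementation and not necessarily what the backup scripts do. `prune_backups` is a hypothetical helper. Note that `-mtime +N` matches files whose age, counted in whole days, is greater than N.

```shell
# Delete regular files under DIR that are more than DAYS days old,
# printing each path as it is removed. DAYS falls back to RETENTION_DAYS,
# then to 30.
prune_backups() {
  dir="$1"; days="${2:-${RETENTION_DAYS:-30}}"
  find "$dir" -type f -mtime "+$days" -print -delete
}

# Example (directory is illustrative):
# prune_backups /tmp/postgresql-backups 60
```

Backups already uploaded to S3 are not affected by this; their cleanup would need a bucket lifecycle rule or a separate step.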
