# AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components, including PostgreSQL, Redis, and blockchain ledger storage.

## Overview

The AITBC platform implements a comprehensive backup strategy with:

- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently

## Components

### 1. PostgreSQL Database

- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)

### 2. Redis Cache

- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)

### 3. Ledger Storage

- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)

## Automated Backups

### Kubernetes CronJob

The automated backup system runs daily at 2:00 AM UTC:

```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```

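To run the same job outside the schedule, `kubectl` can create a one-off Job from a CronJob's template. The following is a minimal sketch, assuming the CronJob is named `aitbc-backup` as above. The `trigger_backup` helper, the `-manual-` job-name suffix, and the `KUBECTL` override are illustrative and not part of the AITBC scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: start a one-off backup Job from the existing CronJob.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

trigger_backup() {
  local namespace="${1:-default}"
  local suffix="${2:-$(date -u +%Y%m%d%H%M%S)}"
  # `kubectl create job --from=cronjob/<name>` copies the CronJob's job template.
  "$KUBECTL" create job "aitbc-backup-manual-${suffix}" \
    --from=cronjob/aitbc-backup -n "$namespace"
}
```

Calling `trigger_backup default` would start a job in the `default` namespace; setting `KUBECTL=echo` first prints the command instead of running it.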
### Backup Schedule

| Time (UTC) | Component  | Type | Retention |
|------------|------------|------|-----------|
| 02:00      | PostgreSQL | Full | 30 days   |
| 02:01      | Redis      | Full | 30 days   |
| 02:02      | Ledger     | Full | 30 days   |

## Manual Backups

### PostgreSQL

```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually (use the backup name chosen above)
aws s3 cp /tmp/postgresql-backups/my-backup-$(date +%Y%m%d).sql.gz s3://aitbc-backups-default/postgresql/
```

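Before uploading a dump, it can be worth confirming that the file is a readable gzip archive and recording a checksum. The sketch below assumes the backup is a `.sql.gz` file as in the commands above and that GNU coreutils (`sha256sum`) is available; `verify_backup` is a hypothetical helper, not one of the AITBC scripts.

```bash
# Hypothetical helper: check a compressed dump before uploading it.
verify_backup() {
  local file="$1"
  # Reject missing or empty files.
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # gzip -t tests the archive without writing output and fails if it is
  # truncated or corrupt.
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  # Print a checksum that can be stored next to the backup.
  sha256sum "$file"
}
```

A possible use is `verify_backup /tmp/postgresql-backups/<name>.sql.gz && aws s3 cp ...`, so that a damaged file never reaches the bucket.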
### Redis

```bash
# Optionally force a background save before the backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE

# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
```

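`BGSAVE` returns immediately while the snapshot is written in the background, so a backup taken right after it may copy an older `dump.rdb`. One way to wait is to poll the `rdb_bgsave_in_progress` field of `INFO persistence`. This is a sketch only: `bgsave_finished` and `wait_for_bgsave` are hypothetical helpers, and the retry count and sleep interval are arbitrary choices.

```bash
# Hypothetical helper: read `redis-cli INFO persistence` output on stdin and
# succeed only when no background save is running.
bgsave_finished() {
  # INFO output lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | grep -q '^rdb_bgsave_in_progress:0$'
}

# Hypothetical helper: poll Redis until the snapshot is written or we give up.
wait_for_bgsave() {
  local namespace="${1:-default}" tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if kubectl exec -n "$namespace" deployment/redis -- \
        redis-cli INFO persistence | bgsave_finished; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}
```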
### Ledger Storage

```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create an incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```

## Restore Procedures

### PostgreSQL Restore

```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```

### Redis Restore

```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

# Clear existing data
# NOTE: kubectl exec and kubectl cp need a running pod. With replicas=0 the
# Deployment has none, so run the next two steps against a pod that mounts
# the Redis data volume.
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof

# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```

### Ledger Restore

```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

# Copy ledger data
# NOTE: kubectl cp needs a running pod. With replicas=0 there is none, so
# copy into a pod that mounts the node's data volume.
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```

## Disaster Recovery

### Recovery Time Objective (RTO)

| Component  | RTO Target | Notes                        |
|------------|------------|------------------------------|
| PostgreSQL | 1 hour     | Database restore from backup |
| Redis      | 15 minutes | Cache rebuild from backup    |
| Ledger     | 2 hours    | Full chain synchronization   |

### Recovery Point Objective (RPO)

| Component  | RPO Target | Notes                            |
|------------|------------|----------------------------------|
| PostgreSQL | 24 hours   | Daily backups                    |
| Redis      | 24 hours   | Daily backups                    |
| Ledger     | 24 hours   | Daily full + incremental backups |

### Disaster Recovery Steps

1. **Assess Impact**
   ```bash
   # Check component status
   kubectl get pods -n default
   kubectl get events --sort-by=.metadata.creationTimestamp
   ```

2. **Restore Critical Services**
   ```bash
   # Restore PostgreSQL first (critical for operations)
   ./infra/scripts/restore_postgresql.sh default [latest-backup]

   # Restore Redis cache
   ./infra/scripts/restore_redis.sh default [latest-backup]

   # Restore ledger data
   ./infra/scripts/restore_ledger.sh default [latest-backup]
   ```

3. **Verify System Health**
   ```bash
   # Check all services
   kubectl get pods -n default

   # Verify API endpoints
   curl -s http://coordinator-api:8011/v1/health
   curl -s http://blockchain-node:8080/v1/health
   ```

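The `[latest-backup]` placeholder in step 2 has to be resolved by hand. The sketch below shows one way to pick the newest backup from a list of names and to check it against the 24-hour RPO target. It assumes all names share one prefix and embed a `YYYYMMDD_HHMMSS` UTC timestamp, as in `postgresql-backup-20231222_020000.sql.gz`, and it relies on GNU `date`, `grep`, and `sed`. Both helpers are hypothetical.

```bash
# Hypothetical helper: read backup names on stdin and print the newest one.
# YYYYMMDD_HHMMSS timestamps sort chronologically as plain strings.
latest_backup() {
  grep -E '[0-9]{8}_[0-9]{6}' | sort | tail -n 1
}

# Hypothetical helper: succeed if the timestamp in the backup name is no
# older than max_hours (default 24) relative to now_epoch (default: now).
backup_within_rpo() {
  local name="$1" max_hours="${2:-24}" now_epoch="${3:-$(date -u +%s)}"
  local stamp iso backup_epoch
  stamp=$(printf '%s\n' "$name" | grep -oE '[0-9]{8}_[0-9]{6}' | head -n 1)
  [ -n "$stamp" ] || return 2
  # Rewrite 20231222_020000 as "2023-12-22 02:00:00" so GNU date can parse it.
  iso=$(printf '%s\n' "$stamp" |
    sed -E 's/^(.{4})(.{2})(.{2})_(.{2})(.{2})(.{2})$/\1-\2-\3 \4:\5:\6/')
  backup_epoch=$(date -u -d "$iso" +%s) || return 2
  [ $((now_epoch - backup_epoch)) -le $((max_hours * 3600)) ]
}
```

With the AWS CLI, names could be fed in with something like `aws s3 ls s3://aitbc-backups-default/postgresql/ | awk '{print $4}' | latest_backup`.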
## Monitoring and Alerting

### Backup Monitoring

Prometheus metrics track backup success and failure:

```yaml
# Prometheus alerting rule for backups
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```

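The rule above expects a `backup_success` series with a `component` label. This document does not say how that metric is exported; common options are a file read by node_exporter's textfile collector or a Pushgateway. As a sketch under that assumption, a backup wrapper could emit the sample in the Prometheus text exposition format like this (both helpers are hypothetical):

```bash
# Hypothetical helper: print one sample in the Prometheus text format.
format_backup_metric() {
  local component="$1" status="$2"
  printf 'backup_success{component="%s"} %s\n' "$component" "$status"
}

# Hypothetical helper: run a backup command and report 1 on success, 0 on failure.
record_backup_result() {
  local component="$1"
  shift
  if "$@"; then
    format_backup_metric "$component" 1
  else
    format_backup_metric "$component" 0
  fi
}
```

For example, `record_backup_result postgresql ./infra/scripts/backup_postgresql.sh default nightly` prints the sample line, which can then be redirected to wherever the exporter picks it up.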
### Log Monitoring

```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```

## Best Practices

### Backup Security

1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing

### Performance Considerations

1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up one after another, as in the backup schedule
3. **Compression**: All backups are compressed to save storage
4. **Incremental Backups**: The ledger supports incremental backups to reduce size

### Testing

1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current

## Troubleshooting

### Common Issues

#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```

#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3
```

#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```

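The integrity check for an incomplete ledger restore can be taken one step further by confirming that the archive contains every directory the restore procedure copies back. This sketch assumes `chain`, `wallets`, and `receipts` sit at the top level of the archive, as the extract-and-copy steps in the Ledger Restore section imply. `check_ledger_archive` is a hypothetical helper.

```bash
# Hypothetical helper: confirm a ledger archive is readable and contains the
# directories the restore procedure copies back.
check_ledger_archive() {
  local archive="$1" listing dir
  # tar -tzf lists entries without extracting and fails on a corrupt archive.
  listing=$(tar -tzf "$archive" 2>/dev/null) || {
    echo "unreadable archive: $archive" >&2
    return 1
  }
  for dir in chain wallets receipts; do
    # Entries may be stored as "chain/..." or "./chain/...".
    if ! printf '%s\n' "$listing" | grep -qE "^(\./)?${dir}(/|\$)"; then
      echo "missing directory in archive: $dir" >&2
      return 1
    fi
  done
}
```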
### Getting Help

1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`

## Configuration

### Environment Variables

| Variable                | Default         | Description               |
|-------------------------|-----------------|---------------------------|
| `BACKUP_RETENTION_DAYS` | `30`            | Days to keep backups      |
| `BACKUP_SCHEDULE`       | `0 2 * * *`     | Cron schedule for backups |
| `S3_BUCKET_PREFIX`      | `aitbc-backups` | S3 bucket name prefix     |
| `COMPRESSION_LEVEL`     | `6`             | gzip compression level    |

### Customizing Backup Schedule

Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:

```yaml
spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC
```

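The defaults from the Environment Variables table can be applied in a script with shell parameter expansion, so that values already set in the environment win. This is a sketch rather than the actual script logic: the `<prefix>-<namespace>` bucket naming is inferred from the `s3://aitbc-backups-default` paths used earlier, and `bucket_for_namespace` is a hypothetical helper.

```bash
# Defaults from the Environment Variables table; values already present in
# the environment take precedence.
: "${BACKUP_RETENTION_DAYS:=30}"
: "${S3_BUCKET_PREFIX:=aitbc-backups}"
: "${COMPRESSION_LEVEL:=6}"

# Hypothetical helper: derive the bucket name for a namespace.
# The "<prefix>-<namespace>" pattern is an assumption based on
# s3://aitbc-backups-default.
bucket_for_namespace() {
  printf '%s-%s\n' "$S3_BUCKET_PREFIX" "$1"
}
```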
### Adjusting Retention

Modify retention in each backup script:

```bash
# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
```
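
How the scripts apply `RETENTION_DAYS` is not shown here. For local backup directories, a typical approach is `find` with `-mtime`. The sketch below only lists candidates so the result can be reviewed before anything is deleted; `list_expired_backups` is a hypothetical helper and the directory argument is an assumption.

```bash
# Hypothetical helper: list backup files older than the retention window.
# It prints matches only; add `-delete` (GNU and BSD find) once the list
# has been reviewed.
list_expired_backups() {
  local dir="$1" days="${2:-30}"
  # -mtime +N matches files whose age in whole days is greater than N.
  find "$dir" -type f -mtime +"$days" -print
}
```

For example, `list_expired_backups /tmp/postgresql-backups 60` would show files that fall outside a 60-day window.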