feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates, with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instrumentation
infra/scripts/README_chaos.md (new file, 330 lines)
@@ -0,0 +1,330 @@
# AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

## Overview

The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance

## Components

### Test Scripts

1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
   - Deletes coordinator pods to simulate a complete service outage
   - Measures recovery time and service availability
   - Tests load handling during and after recovery

2. **`chaos_test_network.py`** - Network partition simulation
   - Creates network partitions between blockchain nodes
   - Tests consensus resilience during the partition
   - Measures network recovery time

3. **`chaos_test_database.py`** - Database failure simulation
   - Simulates PostgreSQL connection failures
   - Tests high-latency scenarios
   - Validates application error handling

4. **`chaos_orchestrator.py`** - Test orchestration and reporting
   - Runs multiple chaos test scenarios
   - Aggregates MTTR metrics across tests
   - Generates comprehensive reports
   - Supports continuous chaos testing

## Prerequisites

- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in the target namespace
- Administrative privileges for network manipulation

## Installation

```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py
```

## Usage

### Running Individual Tests

#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```

#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```

#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000 ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```

### Running All Tests

```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```

## Test Scenarios

### 1. Coordinator API Outage

**Objective**: Test system resilience when the coordinator service becomes unavailable.

**Steps**:
1. Generate baseline load on the coordinator API
2. Delete all coordinator pods
3. Wait for the specified outage duration
4. Monitor service recovery
5. Generate post-recovery load

**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution

### 2. Network Partition

**Objective**: Test blockchain consensus during network partitions.

**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during the partition
4. Remove the network partition
5. Verify network recovery

**Metrics Collected**:
- Network recovery time
- Consensus health during the partition
- Node connectivity status

### 3. Database Failure

**Objective**: Test application behavior when the database is unavailable.

**Steps**:
1. Simulate a database connection failure or high latency
2. Monitor API behavior during the failure
3. Restore database connectivity
4. Verify application recovery

**Metrics Collected**:
- Database recovery time
- API error rates during the failure
- Application resilience metrics

## Results and Reporting

### Test Results Format

Each test generates a JSON results file with the following structure:

```json
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
```

### Orchestrator Report

The orchestrator generates a comprehensive report including:

- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics

Example report snippet:
```json
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
```

## SLO Targets

| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |

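These summary figures are embedded in the orchestrator's JSON report, so SLO compliance can also be checked from the command line. A minimal sketch, assuming `jq` and `bc` are available and using the `chaos_test_report_*.json` file written by `chaos_orchestrator.py`:

```bash
# Pull the aggregated MTTR from the latest orchestrator report and compare it
# against the 120-second average-MTTR target from the table above.
REPORT=$(ls -t chaos_test_report_*.json | head -1)
AVG_MTTR=$(jq '.orchestration.summary.average_mttr' "$REPORT")
if (( $(echo "$AVG_MTTR <= 120" | bc -l) )); then
    echo "PASS: average MTTR ${AVG_MTTR}s is within the SLO"
else
    echo "FAIL: average MTTR ${AVG_MTTR}s exceeds the SLO"
fi
```
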
## Best Practices

### Before Running Tests

1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods

### During Tests

1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted

### After Tests

1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing

## Integration with CI/CD

### GitHub Actions Example

```yaml
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: chaos-results
          path: "*.json"
```

## Troubleshooting

### Common Issues

1. **kubectl not found**
   ```bash
   # Ensure kubectl is installed and configured
   which kubectl
   kubectl version
   ```

2. **Permission denied errors**
   ```bash
   # Check RBAC permissions
   kubectl auth can-i create pods --namespace default
   kubectl auth can-i exec pods --namespace default
   ```

3. **Network rules not applying**
   ```bash
   # Check if iptables is available in pods
   kubectl exec -it <pod> -- iptables -L
   ```

4. **Tests hanging**
   ```bash
   # Check pod status
   kubectl get pods --namespace default
   kubectl describe pod <pod-name> --namespace default
   ```

### Debug Mode

Enable debug logging:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```

## Contributing

To add new chaos test scenarios:

1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods `run_test()` and `save_results()` (see the skeleton below)
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation

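For reference, a minimal scenario skeleton is sketched below. The class name, file name, and failure logic are placeholders; only the `run_test()`/`save_results()` contract and the `chaos_test_<scenario>_<timestamp>.json` results-file pattern (which the orchestrator globs for) come from the existing scripts.

```python
#!/usr/bin/env python3
"""Minimal skeleton for a new chaos scenario (illustrative names and logic)."""

import asyncio
import json
from datetime import datetime


class ChaosTestExample:
    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {"scenario": "example_failure", "mttr": None,
                        "error_count": 0, "success_count": 0}

    async def run_test(self) -> bool:
        # 1. Inject the failure, 2. wait for recovery, 3. record the recovery time.
        self.metrics["mttr"] = 0.0  # replace with the measured recovery time
        self.save_results()
        return True

    def save_results(self):
        # The file name must match the orchestrator's glob: chaos_test_<scenario>_*.json
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        with open(f"chaos_test_example_{timestamp}.json", "w") as f:
            json.dump(self.metrics, f, indent=2)


if __name__ == "__main__":
    asyncio.run(ChaosTestExample().run_test())
```
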
## Security Considerations

- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests

## Support

For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io

## License

This chaos testing framework is part of the AITBC project and follows the same license terms.

infra/scripts/backup_ledger.sh (new executable file, 233 lines)
@@ -0,0 +1,233 @@
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Create incremental backup
|
||||
create_incremental_backup() {
|
||||
local pod=$1
|
||||
local last_backup_file="$BACKUP_DIR/.last_backup_height"
|
||||
|
||||
# Get last backup height
|
||||
local last_backup_height=0
|
||||
if [[ -f "$last_backup_file" ]]; then
|
||||
last_backup_height=$(cat "$last_backup_file")
|
||||
fi
|
||||
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ $current_height -le $last_backup_height ]]; then
|
||||
log "No new blocks since last backup (height: $current_height)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
|
||||
|
||||
# Export blocks since last backup
|
||||
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
|
||||
|
||||
# Update last backup height
|
||||
echo "$current_height" > "$last_backup_file"
|
||||
|
||||
log "Incremental backup created: $incremental_file"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
local incremental=${3:-false}
|
||||
|
||||
log "Starting ledger backup process (incremental=$incremental)"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
|
||||
if [[ "$incremental" == "true" ]]; then
|
||||
create_incremental_backup "$pod"
|
||||
else
|
||||
backup_ledger_data "$pod"
|
||||
fi
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
infra/scripts/backup_postgresql.sh (new executable file, 172 lines)
@@ -0,0 +1,172 @@
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup
|
||||
    # Pass the password into the pod's environment; a variable prefixed to the
    # local kubectl process would not be visible to pg_dump inside the container
    kubectl exec -n "$NAMESPACE" "$pod" -- \
        env PGPASSWORD="$db_password" \
        pg_dump -U "$db_user" -h localhost -d "$db_name" \
        --verbose --clean --if-exists --create --format=custom \
        --file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
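# Restore sketch (illustrative, not part of this script): the backup is a gzip-compressed
# pg_dump custom-format archive, so a restore would look roughly like:
#   gunzip "/tmp/postgresql-backups/<backup_name>.sql.gz"
#   pg_restore --clean --if-exists --create -U postgres -d postgres "/tmp/postgresql-backups/<backup_name>.sql"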
infra/scripts/backup_redis.sh (new executable file, 189 lines)
@@ -0,0 +1,189 @@
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
    # Record the last completed save time, then trigger a background save
    local save_start=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
    kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

    # Wait for the background save to complete (LASTSAVE advances once BGSAVE finishes)
    log "Waiting for background save to complete..."
    local retries=60
    while [[ $retries -gt 0 ]]; do
        local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)

        if [[ "$lastsave" -gt "$save_start" ]]; then
            log "Background save completed"
            break
        fi
        sleep 2
        ((retries--))
    done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
infra/scripts/chaos_orchestrator.py (new executable file, 342 lines)
@@ -0,0 +1,342 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Orchestrator
|
||||
Runs multiple chaos test scenarios and aggregates MTTR metrics
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosOrchestrator:
|
||||
"""Orchestrates multiple chaos test scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.results = {
|
||||
"orchestration_start": None,
|
||||
"orchestration_end": None,
|
||||
"scenarios": [],
|
||||
"summary": {
|
||||
"total_scenarios": 0,
|
||||
"successful_scenarios": 0,
|
||||
"failed_scenarios": 0,
|
||||
"average_mttr": 0,
|
||||
"max_mttr": 0,
|
||||
"min_mttr": float('inf')
|
||||
}
|
||||
}
|
||||
|
||||
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
|
||||
"""Run a single chaos test scenario"""
|
||||
scenario_name = Path(script).stem.replace("chaos_test_", "")
|
||||
logger.info(f"Running scenario: {scenario_name}")
|
||||
|
||||
cmd = ["python3", script] + args
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Run the chaos test script
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE
|
||||
)
|
||||
|
||||
stdout, stderr = await process.communicate()
|
||||
|
||||
if process.returncode != 0:
|
||||
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
|
||||
logger.error(f"Error: {stderr.decode()}")
|
||||
return None
|
||||
|
||||
# Find the results file
|
||||
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
|
||||
if not result_files:
|
||||
logger.error(f"No results file found for scenario {scenario_name}")
|
||||
return None
|
||||
|
||||
# Load the most recent result file
|
||||
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
|
||||
with open(result_file, 'r') as f:
|
||||
results = json.load(f)
|
||||
|
||||
# Add execution metadata
|
||||
results["execution_time"] = time.time() - start_time
|
||||
results["scenario_name"] = scenario_name
|
||||
|
||||
logger.info(f"Scenario {scenario_name} completed successfully")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to run scenario {scenario_name}: {e}")
|
||||
return None
|
||||
|
||||
def calculate_summary_metrics(self):
|
||||
"""Calculate summary metrics across all scenarios"""
|
||||
mttr_values = []
|
||||
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("mttr"):
|
||||
mttr_values.append(scenario["mttr"])
|
||||
|
||||
if mttr_values:
|
||||
self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
|
||||
self.results["summary"]["max_mttr"] = max(mttr_values)
|
||||
self.results["summary"]["min_mttr"] = min(mttr_values)
|
||||
|
||||
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
|
||||
self.results["summary"]["successful_scenarios"] = sum(
|
||||
1 for s in self.results["scenarios"] if s.get("mttr") is not None
|
||||
)
|
||||
self.results["summary"]["failed_scenarios"] = (
|
||||
self.results["summary"]["total_scenarios"] -
|
||||
self.results["summary"]["successful_scenarios"]
|
||||
)
|
||||
|
||||
def generate_report(self, output_file: Optional[str] = None):
|
||||
"""Generate a comprehensive chaos test report"""
|
||||
report = {
|
||||
"report_generated": datetime.utcnow().isoformat(),
|
||||
"namespace": self.namespace,
|
||||
"orchestration": self.results,
|
||||
"recommendations": []
|
||||
}
|
||||
|
||||
# Add recommendations based on results
|
||||
if self.results["summary"]["average_mttr"] > 120:
|
||||
report["recommendations"].append(
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
|
||||
)
|
||||
|
||||
if self.results["summary"]["max_mttr"] > 300:
|
||||
report["recommendations"].append(
|
||||
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
|
||||
)
|
||||
|
||||
if self.results["summary"]["failed_scenarios"] > 0:
|
||||
report["recommendations"].append(
|
||||
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
|
||||
)
|
||||
|
||||
# Check for specific scenario issues
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("scenario_name") == "coordinator_outage":
|
||||
if scenario.get("mttr", 0) > 180:
|
||||
report["recommendations"].append(
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "network_partition":
|
||||
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
|
||||
report["recommendations"].append(
|
||||
"High error rate during network partition. Improve error handling."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "database_failure":
|
||||
if scenario.get("failure_type") == "connection":
|
||||
report["recommendations"].append(
|
||||
"Consider implementing database connection pooling and retry logic."
|
||||
)
|
||||
|
||||
# Save report
|
||||
if output_file:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(report, f, indent=2)
|
||||
logger.info(f"Chaos test report saved to: {output_file}")
|
||||
|
||||
# Print summary
|
||||
self.print_summary()
|
||||
|
||||
return report
|
||||
|
||||
def print_summary(self):
|
||||
"""Print a summary of all chaos test results"""
|
||||
print("\n" + "="*60)
|
||||
print("CHAOS TESTING SUMMARY REPORT")
|
||||
print("="*60)
|
||||
|
||||
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
|
||||
print(f"Namespace: {self.namespace}")
|
||||
|
||||
print(f"\nScenario Results:")
|
||||
print("-" * 40)
|
||||
for scenario in self.results["scenarios"]:
|
||||
name = scenario.get("scenario_name", "Unknown")
|
||||
mttr = scenario.get("mttr", "N/A")
|
||||
if mttr != "N/A":
|
||||
mttr = f"{mttr:.2f}s"
|
||||
print(f" {name:20} MTTR: {mttr}")
|
||||
|
||||
print(f"\nSummary Metrics:")
|
||||
print("-" * 40)
|
||||
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
|
||||
print(f" Successful: {self.results['summary']['successful_scenarios']}")
|
||||
print(f" Failed: {self.results['summary']['failed_scenarios']}")
|
||||
|
||||
if self.results["summary"]["average_mttr"] > 0:
|
||||
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
|
||||
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
|
||||
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
|
||||
|
||||
# SLO compliance
|
||||
print(f"\nSLO Compliance:")
|
||||
print("-" * 40)
|
||||
slo_target = 120 # 2 minutes
|
||||
if self.results["summary"]["average_mttr"] <= slo_target:
|
||||
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
|
||||
else:
|
||||
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
|
||||
"""Run all specified chaos test scenarios"""
|
||||
logger.info("Starting chaos testing orchestration")
|
||||
self.results["orchestration_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
for scenario in scenarios:
|
||||
args = scenario_args.get(scenario, [])
|
||||
# Add namespace to all scenarios
|
||||
args.extend(["--namespace", self.namespace])
|
||||
|
||||
result = await self.run_scenario(scenario, args)
|
||||
if result:
|
||||
self.results["scenarios"].append(result)
|
||||
|
||||
self.results["orchestration_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Calculate summary metrics
|
||||
self.calculate_summary_metrics()
|
||||
|
||||
# Generate report
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"chaos_test_report_{timestamp}.json"
|
||||
self.generate_report(report_file)
|
||||
|
||||
logger.info("Chaos testing orchestration completed")
|
||||
|
||||
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
|
||||
"""Run chaos tests continuously over time"""
|
||||
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
|
||||
|
||||
end_time = datetime.now() + timedelta(hours=duration_hours)
|
||||
interval_seconds = interval_minutes * 60
|
||||
|
||||
all_results = []
|
||||
|
||||
while datetime.now() < end_time:
|
||||
cycle_start = datetime.now()
|
||||
logger.info(f"Starting chaos test cycle at {cycle_start}")
|
||||
|
||||
# Run a random scenario
|
||||
scenarios = [
|
||||
"chaos_test_coordinator.py",
|
||||
"chaos_test_network.py",
|
||||
"chaos_test_database.py"
|
||||
]
|
||||
|
||||
import random
|
||||
selected_scenario = random.choice(scenarios)
|
||||
|
||||
# Run scenario with reduced duration for continuous testing
|
||||
args = ["--namespace", self.namespace]
|
||||
if "coordinator" in selected_scenario:
|
||||
args.extend(["--outage-duration", "30", "--load-duration", "60"])
|
||||
elif "network" in selected_scenario:
|
||||
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
|
||||
elif "database" in selected_scenario:
|
||||
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
|
||||
|
||||
result = await self.run_scenario(selected_scenario, args)
|
||||
if result:
|
||||
result["cycle_time"] = cycle_start.isoformat()
|
||||
all_results.append(result)
|
||||
|
||||
# Wait for next cycle
|
||||
elapsed = (datetime.now() - cycle_start).total_seconds()
|
||||
if elapsed < interval_seconds:
|
||||
wait_time = interval_seconds - elapsed
|
||||
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
|
||||
await asyncio.sleep(wait_time)
|
||||
|
||||
# Generate continuous testing report
|
||||
continuous_report = {
|
||||
"continuous_testing": True,
|
||||
"duration_hours": duration_hours,
|
||||
"interval_minutes": interval_minutes,
|
||||
"total_cycles": len(all_results),
|
||||
"cycles": all_results
|
||||
}
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"continuous_chaos_report_{timestamp}.json"
|
||||
with open(report_file, 'w') as f:
|
||||
json.dump(continuous_report, f, indent=2)
|
||||
|
||||
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--scenarios", nargs="+",
|
||||
choices=["coordinator", "network", "database"],
|
||||
default=["coordinator", "network", "database"],
|
||||
help="Scenarios to run")
|
||||
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
|
||||
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
|
||||
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
orchestrator = ChaosOrchestrator(args.namespace)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
|
||||
return
|
||||
|
||||
if args.continuous:
|
||||
await orchestrator.run_continuous_chaos(args.duration, args.interval)
|
||||
else:
|
||||
# Map scenario names to script files
|
||||
scenario_map = {
|
||||
"coordinator": "chaos_test_coordinator.py",
|
||||
"network": "chaos_test_network.py",
|
||||
"database": "chaos_test_database.py"
|
||||
}
|
||||
|
||||
# Get script files
|
||||
scripts = [scenario_map[s] for s in args.scenarios]
|
||||
|
||||
# Default arguments for each scenario
|
||||
scenario_args = {
|
||||
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
|
||||
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
|
||||
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
|
||||
}
|
||||
|
||||
await orchestrator.run_all_scenarios(scripts, scenario_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
infra/scripts/chaos_test_coordinator.py (new executable file, 287 lines)
@@ -0,0 +1,287 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Coordinator API Outage
|
||||
Tests system resilience when coordinator API becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestCoordinator:
|
||||
"""Chaos testing for coordinator API outage scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"outage_start": None,
|
||||
"outage_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "coordinator_outage"
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def delete_coordinator_pods(self) -> bool:
|
||||
"""Delete all coordinator pods to simulate outage"""
|
||||
try:
|
||||
cmd = [
|
||||
"kubectl", "delete", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"--force", "--grace-period=0"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info("Coordinator pods deleted successfully")
|
||||
return True
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to delete coordinator pods: {e}")
|
||||
return False
|
||||
|
||||
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
|
||||
"""Wait for all coordinator pods to terminate"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
logger.info("All coordinator pods terminated")
|
||||
return True
|
||||
await asyncio.sleep(2)
|
||||
|
||||
logger.error("Timeout waiting for pods to terminate")
|
||||
return False
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for coordinator service to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
try:
|
||||
# Check if pods are running
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Check if at least one pod is ready
|
||||
ready_cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(ready_cmd, capture_output=True, text=True)
|
||||
if result.stdout.strip():
|
||||
# Test API health
|
||||
if self.test_health_endpoint():
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Recovery check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Service did not recover within timeout")
|
||||
return False
|
||||
|
||||
def test_health_endpoint(self) -> bool:
|
||||
"""Test if coordinator health endpoint is responding"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
|
||||
"""Run the complete chaos test"""
|
||||
logger.info("Starting coordinator outage chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Generate initial load
|
||||
logger.info("Phase 1: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 2: Induce outage
|
||||
logger.info("Phase 2: Inducing coordinator outage")
|
||||
self.metrics["outage_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.delete_coordinator_pods():
|
||||
logger.error("Failed to induce outage")
|
||||
return False
|
||||
|
||||
if not await self.wait_for_pods_termination():
|
||||
logger.error("Pods did not terminate")
|
||||
return False
|
||||
|
||||
# Wait for specified outage duration
|
||||
logger.info(f"Waiting for {outage_duration} seconds outage duration")
|
||||
await asyncio.sleep(outage_duration)
|
||||
|
||||
# Phase 3: Monitor recovery
|
||||
logger.info("Phase 3: Monitoring service recovery")
|
||||
self.metrics["outage_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("Service did not recover")
|
||||
return False
|
||||
|
||||
# Phase 4: Post-recovery load test
|
||||
logger.info("Phase 4: Post-recovery load test")
|
||||
await self.generate_load(load_duration)
|
||||
|
||||
# Calculate metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_coordinator_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
|
||||
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestCoordinator(args.namespace) as test:
|
||||
success = await test.run_test(args.outage_duration, args.load_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
infra/scripts/chaos_test_database.py (new executable file, 387 lines)
@@ -0,0 +1,387 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Database Failure
|
||||
Tests system resilience when PostgreSQL database becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestDatabase:
|
||||
"""Chaos testing for database failure scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"failure_start": None,
|
||||
"failure_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "database_failure",
|
||||
"failure_type": None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_postgresql_pod(self) -> Optional[str]:
|
||||
"""Get PostgreSQL pod name"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=postgresql",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pod = result.stdout.strip()
|
||||
return pod if pod else None
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get PostgreSQL pod: {e}")
|
||||
return None
|
||||
|
||||
def simulate_database_connection_failure(self) -> bool:
|
||||
"""Simulate database connection failure by blocking port 5432"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Block incoming connections to PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
# Block outgoing connections from PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
|
||||
self.metrics["failure_type"] = "connection_blocked"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block PostgreSQL connections: {e}")
|
||||
return False
|
||||
|
||||
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
|
||||
"""Simulate high database latency using netem"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Add latency to PostgreSQL traffic
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
|
||||
self.metrics["failure_type"] = "high_latency"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to add latency to PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
def restore_database(self) -> bool:
|
||||
"""Restore database connections"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Remove iptables rules
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "INPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False) # May fail if rules don't exist
|
||||
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
# Remove netem qdisc
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "del", "dev", "eth0", "root"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
logger.info(f"Restored PostgreSQL connections on pod {pod}")
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to restore PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
async def test_database_connectivity(self) -> bool:
|
||||
"""Test if coordinator can connect to database"""
|
||||
try:
|
||||
# Get coordinator pod
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
coordinator_pod = result.stdout.strip()
|
||||
|
||||
if not coordinator_pod:
|
||||
return False
|
||||
|
||||
# Test database connection from coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
|
||||
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
return result.returncode == 0 and "OK" in result.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def test_api_health(self) -> bool:
|
||||
"""Test if coordinator API is healthy"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for database and API to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
# Test database connectivity
|
||||
db_connected = await self.test_database_connectivity()
|
||||
|
||||
# Test API health
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if db_connected and api_healthy:
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Database and API did not recover within timeout")
|
||||
return False
|
||||
|
||||
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
|
||||
"""Run the complete database chaos test"""
|
||||
logger.info(f"Starting database chaos test - failure type: {failure_type}")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if not db_connected or not api_healthy:
|
||||
logger.error("Baseline test failed - database or API not healthy")
|
||||
return False
|
||||
|
||||
logger.info("Baseline: Database and API are healthy")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Induce database failure
|
||||
logger.info("Phase 3: Inducing database failure")
|
||||
self.metrics["failure_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if failure_type == "connection":
|
||||
if not self.simulate_database_connection_failure():
|
||||
logger.error("Failed to induce database connection failure")
|
||||
return False
|
||||
elif failure_type == "latency":
|
||||
if not self.simulate_database_high_latency():
|
||||
logger.error("Failed to induce database latency")
|
||||
return False
|
||||
else:
|
||||
logger.error(f"Unknown failure type: {failure_type}")
|
||||
return False
|
||||
|
||||
# Verify failure is effective
|
||||
await asyncio.sleep(5)
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
|
||||
|
||||
# Phase 4: Monitor during failure
|
||||
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
|
||||
|
||||
# Generate load during failure
|
||||
await self.generate_load(failure_duration)
|
||||
|
||||
# Phase 5: Restore database and monitor recovery
|
||||
logger.info("Phase 5: Restoring database")
|
||||
self.metrics["failure_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.restore_database():
|
||||
logger.error("Failed to restore database")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("System did not recover after database restoration")
|
||||
return False
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Database chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_database_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Failure Type: {self.metrics['failure_type']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for database failure")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
|
||||
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestDatabase(args.namespace) as test:
|
||||
success = await test.run_test(args.failure_type, args.failure_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
372
infra/scripts/chaos_test_network.py
Executable file
@ -0,0 +1,372 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Network Partition
|
||||
Tests system resilience when blockchain nodes experience network partitions
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestNetwork:
|
||||
"""Chaos testing for network partition scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"partition_start": None,
|
||||
"partition_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "network_partition",
|
||||
"affected_nodes": []
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_blockchain_pods(self) -> List[str]:
|
||||
"""Get list of blockchain node pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=blockchain-node",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get blockchain pods: {e}")
|
||||
return []
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
|
||||
"""Apply network partition using iptables"""
|
||||
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
|
||||
|
||||
for pod in pods:
|
||||
if pod in target_pods:
|
||||
continue
|
||||
|
||||
# Block traffic from this pod to target pods
|
||||
for target_pod in target_pods:
|
||||
try:
|
||||
# Get target pod IP
|
||||
cmd = [
|
||||
"kubectl", "get", "pod", target_pod,
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.status.podIP}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
target_ip = result.stdout.strip()
|
||||
|
||||
if not target_ip:
|
||||
continue
|
||||
|
||||
# Apply iptables rule to block traffic
|
||||
iptables_cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
|
||||
]
|
||||
subprocess.run(iptables_cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
|
||||
return False
|
||||
|
||||
self.metrics["affected_nodes"] = pods + target_pods
|
||||
return True
|
||||
|
||||
def remove_network_partition(self, pods: List[str]) -> bool:
|
||||
"""Remove network partition rules"""
|
||||
logger.info("Removing network partition rules")
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Flush OUTPUT chain (remove all rules)
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info(f"Removed network rules from {pod}")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to remove network rules from {pod}: {e}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
|
||||
"""Test connectivity between pods"""
|
||||
results = {}
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Test if pod can reach coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
results[pod] = result.returncode == 0 and "ok" in result.stdout
|
||||
|
||||
except Exception:
|
||||
results[pod] = False
|
||||
|
||||
return results
|
||||
|
||||
async def monitor_consensus(self, duration: int = 60) -> bool:
|
||||
"""Monitor blockchain consensus health"""
|
||||
logger.info(f"Monitoring consensus for {duration} seconds")
|
||||
|
||||
start_time = time.time()
|
||||
last_height = 0
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
try:
|
||||
# Get block height from a random pod
|
||||
pods = self.get_blockchain_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Use first pod to check height
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
|
||||
"curl", "-s", "http://localhost:8080/v1/blocks/head"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
current_height = data.get("height", 0)
|
||||
|
||||
# Check if blockchain is progressing
|
||||
if current_height > last_height:
|
||||
last_height = current_height
|
||||
logger.info(f"Blockchain progressing, height: {current_height}")
|
||||
elif time.time() - start_time > 30: # Allow 30s for initial sync
|
||||
logger.warning(f"Blockchain stuck at height {current_height}")
|
||||
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Consensus check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
return last_height > 0
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 5):
|
||||
"""Generate synthetic load on blockchain nodes"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "blockchain-node",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
|
||||
"""Run the complete network partition chaos test"""
|
||||
logger.info("Starting network partition chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Get all blockchain pods
|
||||
all_pods = self.get_blockchain_pods()
|
||||
if not all_pods:
|
||||
logger.error("No blockchain pods found")
|
||||
return False
|
||||
|
||||
# Determine which pods to partition
|
||||
num_partition = int(len(all_pods) * partition_ratio)
|
||||
partition_pods = all_pods[:num_partition]
|
||||
remaining_pods = all_pods[num_partition:]
|
||||
|
||||
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
baseline_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Apply network partition
|
||||
logger.info("Phase 3: Applying network partition")
|
||||
self.metrics["partition_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.apply_network_partition(remaining_pods, partition_pods):
|
||||
logger.error("Failed to apply network partition")
|
||||
return False
|
||||
|
||||
# Verify partition is effective
|
||||
await asyncio.sleep(5)
|
||||
partitioned_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 4: Monitor during partition
|
||||
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
|
||||
consensus_healthy = await self.monitor_consensus(partition_duration)
|
||||
|
||||
# Phase 5: Remove partition and monitor recovery
|
||||
logger.info("Phase 5: Removing network partition")
|
||||
self.metrics["partition_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.remove_network_partition(all_pods):
|
||||
logger.error("Failed to remove network partition")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
logger.info("Waiting for network recovery...")
|
||||
await asyncio.sleep(10)
|
||||
|
||||
# Test connectivity after recovery
|
||||
recovery_connectivity = await self.test_connectivity(all_pods)
|
||||
recovery_time = time.time()
|
||||
|
||||
# Calculate recovery metrics
|
||||
all_connected = all(recovery_connectivity.values())
|
||||
if all_connected:
|
||||
self.metrics["recovery_time"] = recovery_time - (datetime.fromisoformat(self.metrics["partition_end"]).timestamp())
|
||||
logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Network partition chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_network_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for network partition")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
|
||||
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestNetwork(args.namespace) as test:
|
||||
success = await test.run_test(args.partition_duration, args.partition_ratio)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
279
infra/scripts/restore_ledger.sh
Normal file
@ -0,0 +1,279 @@
|
||||
#!/bin/bash
|
||||
# Ledger Storage Restore Script for AITBC
|
||||
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
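# Example invocation (backup path is illustrative):
#   ./restore_ledger.sh default /tmp/ledger-backups/ledger-20250101_120000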
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_DIR=${2:-}
|
||||
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v jq &> /dev/null; then
|
||||
error "jq is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup directory
|
||||
validate_backup_dir() {
|
||||
if [[ -z "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_directory]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -d "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory not found: $BACKUP_DIR"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for required files
|
||||
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
|
||||
error "metadata.json not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
error "chain.tar.gz not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Using backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Create backup of current ledger before restore
|
||||
create_pre_restore_backup() {
|
||||
local pods=($1)
|
||||
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
|
||||
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
|
||||
|
||||
warn "Creating backup of current ledger before restore..."
|
||||
mkdir -p "$pre_restore_dir"
|
||||
|
||||
# Use the first ready pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
# Create metadata
|
||||
cat > "$pre_restore_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$pre_restore_backup",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $current_height,
|
||||
"backup_type": "pre-restore"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Backup data directories
|
||||
local data_dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${data_dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
|
||||
fi
|
||||
done
|
||||
|
||||
log "Pre-restore backup created: $pre_restore_dir"
|
||||
break
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pods=($1)
|
||||
|
||||
warn "This will replace all current ledger data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down blockchain nodes
|
||||
info "Scaling down blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up blockchain nodes
|
||||
info "Scaling up blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=3 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to be ready
|
||||
local ready_pods=()
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
|
||||
local all_pods=$(get_blockchain_pods)
|
||||
for pod in $all_pods; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
ready_pods+=("$pod")
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
sleep 5
|
||||
((retries--))
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
error "No blockchain nodes became ready after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Restore data to all ready pods
|
||||
for pod in "${ready_pods[@]}"; do
|
||||
info "Restoring ledger data to pod $pod..."
|
||||
|
||||
# Create temp directory on pod
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
|
||||
|
||||
# Extract and copy chain data
|
||||
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy wallet data
|
||||
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy receipt data
|
||||
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
|
||||
|
||||
# Clean up temp directory
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
|
||||
|
||||
log "Ledger data restored to pod $pod"
|
||||
done
|
||||
|
||||
log "Ledger restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pods=($1)
|
||||
|
||||
log "Verifying ledger restore..."
|
||||
|
||||
# Read backup metadata
|
||||
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
|
||||
log "Backup contains blocks up to height: $backup_height"
|
||||
|
||||
# Verify on each pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Check if node is responding
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ "$current_height" -eq "$backup_height" ]]; then
|
||||
log "✓ Pod $pod: Block height matches backup ($current_height)"
|
||||
else
|
||||
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
|
||||
fi
|
||||
|
||||
# Check data directories
|
||||
local dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
|
||||
log "✓ Pod $pod: $dir directory contains $file_count files"
|
||||
else
|
||||
warn "⚠ Pod $pod: $dir directory not found"
|
||||
fi
|
||||
done
|
||||
else
|
||||
error "✗ Pod $pod: Not responding to health checks"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
create_pre_restore_backup "${pods[*]}"
|
||||
perform_restore "${pods[*]}"
|
||||
|
||||
# Get updated pod list after restore
|
||||
pods=($(get_blockchain_pods))
|
||||
verify_restore "${pods[*]}"
|
||||
|
||||
log "Ledger restore process completed successfully"
|
||||
warn "Please verify blockchain synchronization and application functionality"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
228
infra/scripts/restore_postgresql.sh
Executable file
@ -0,0 +1,228 @@
|
||||
#!/bin/bash
|
||||
# PostgreSQL Restore Script for AITBC
|
||||
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
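# Example invocation (backup file name is illustrative):
#   ./restore_postgresql.sh default /tmp/postgresql-backups/aitbc-20250101_120000.dump.gz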
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_restore &> /dev/null; then
|
||||
error "pg_restore is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check if file is gzipped and decompress if needed
|
||||
if [[ "$BACKUP_FILE" == *.gz ]]; then
|
||||
info "Decompressing backup file..."
|
||||
local decompressed="/tmp/restore_$(date +%s).dump"
gunzip -c "$BACKUP_FILE" > "$decompressed"
BACKUP_FILE="$decompressed"
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Create backup of current database before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current database before restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Create backup
|
||||
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
|
||||
pg_dump -U "$db_user" -h localhost -d "$db_name" \
|
||||
--format=custom --file="/tmp/${pre_restore_backup}.dump"
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace the current database. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Copy backup file to pod
|
||||
local remote_backup="/tmp/restore_$(date +%s).dump"
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
|
||||
|
||||
# Drop existing database and recreate
|
||||
log "Dropping existing database..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"
|
||||
|
||||
log "Creating new database..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"
|
||||
|
||||
# Restore database
|
||||
log "Restoring database from backup..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" pg_restore -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists "$remote_backup"
|
||||
|
||||
# Clean up remote file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
|
||||
|
||||
log "Database restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying database restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Check table count
|
||||
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
|
||||
|
||||
log "Database contains $table_count tables"
|
||||
|
||||
# Check if key tables exist
|
||||
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
|
||||
for table in "${key_tables[@]}"; do
|
||||
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
|
||||
if [[ "$exists" == "t" ]]; then
|
||||
log "✓ Table $table exists"
|
||||
else
|
||||
warn "⚠ Table $table not found"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
verify_restore "$pod"
|
||||
|
||||
log "PostgreSQL restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
223
infra/scripts/restore_redis.sh
Normal file
@ -0,0 +1,223 @@
|
||||
#!/bin/bash
|
||||
# Redis Restore Script for AITBC
|
||||
# Usage: ./restore_redis.sh [namespace] [backup_file]
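# Example invocation (backup file name is illustrative):
#   ./restore_redis.sh default /tmp/redis-backups/redis-20250101_120000.rdb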
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Create backup of current Redis data before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current Redis data before restore..."
|
||||
|
||||
# Record the last completed save, then trigger a background save
local last_save_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

# Wait for the background save to complete (LASTSAVE advances once BGSAVE finishes)
local retries=60
while [[ $retries -gt 0 ]]; do
local last_save_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$last_save_now" -gt "$last_save_before" ]]; then
break
fi
sleep 2
((retries--))
done
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"
|
||||
|
||||
# Also backup AOF if exists
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
|
||||
fi
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace all current Redis data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down Redis to ensure clean restore
|
||||
info "Scaling down Redis deployment..."
|
||||
kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pod to terminate
|
||||
kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up Redis
|
||||
info "Scaling up Redis deployment..."
|
||||
kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"
|
||||
|
||||
# Wait for new pod to be ready
|
||||
local new_pod=$(get_redis_pod)
|
||||
kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Stop Redis server
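# Note: if redis-server runs as PID 1 in this container, SHUTDOWN causes Kubernetes
# to restart the container; this flow assumes /data is a persistent volume that
# survives the restart.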
|
||||
info "Stopping Redis server..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE
|
||||
|
||||
# Clear existing data
|
||||
info "Clearing existing Redis data..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof
|
||||
|
||||
# Copy backup file
|
||||
info "Copying backup file..."
|
||||
local remote_file="/data/dump.rdb"  # Redis loads dump.rdb from the data dir on startup
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"
|
||||
|
||||
# Start Redis server
|
||||
info "Starting Redis server..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes
|
||||
|
||||
# Wait for Redis to be ready
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Redis did not start properly after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Redis restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying Redis restore..."
|
||||
|
||||
# Check database size
|
||||
local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
|
||||
log "Database contains $db_size keys"
|
||||
|
||||
# Check memory usage
|
||||
local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
|
||||
log "Memory usage: $memory"
|
||||
|
||||
# Check if Redis is responding to commands
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "✓ Redis is responding normally"
|
||||
else
|
||||
error "✗ Redis is not responding"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
|
||||
# Get new pod name after restore
|
||||
pod=$(get_redis_pod)
|
||||
verify_restore "$pod"
|
||||
|
||||
log "Redis restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||