feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru
oib committed 2025-12-22 10:33:23 +01:00
parent d98b2c7772
commit c8be9d7414
260 changed files with 59033 additions and 351 deletions


@@ -0,0 +1,330 @@
# AITBC Chaos Testing Framework
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
## Overview
The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance
## Components
### Test Scripts
1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
   - Deletes coordinator pods to simulate complete service outage
   - Measures recovery time and service availability
   - Tests load handling during and after recovery
2. **`chaos_test_network.py`** - Network partition simulation
   - Creates network partitions between blockchain nodes
   - Tests consensus resilience during partition
   - Measures network recovery time
3. **`chaos_test_database.py`** - Database failure simulation
   - Simulates PostgreSQL connection failures
   - Tests high latency scenarios
   - Validates application error handling
4. **`chaos_orchestrator.py`** - Test orchestration and reporting
   - Runs multiple chaos test scenarios
   - Aggregates MTTR metrics across tests
   - Generates comprehensive reports
   - Supports continuous chaos testing
## Prerequisites
- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation
## Installation
```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts
# Install dependencies
pip install aiohttp
# Make scripts executable
chmod +x chaos_*.py
```
## Usage
### Running Individual Tests
#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default
# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```
#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default
# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```
#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection
# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```
### Running All Tests
```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default
# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```
## Test Scenarios
### 1. Coordinator API Outage
**Objective**: Test system resilience when the coordinator service becomes unavailable.
**Steps**:
1. Generate baseline load on coordinator API
2. Delete all coordinator pods
3. Wait for specified outage duration
4. Monitor service recovery
5. Generate post-recovery load
**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution
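Under the hood the outage is induced with plain `kubectl`. A minimal sketch of what `chaos_test_coordinator.py` does, assuming the `default` namespace and the Helm chart's `app.kubernetes.io/name=coordinator` label:
```bash
# Force-delete all coordinator pods to start the outage
kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
# Watch recovery: wait for a replacement pod, then probe the health endpoint via the service
kubectl wait --for=condition=ready pod -n default -l app.kubernetes.io/name=coordinator --timeout=300s
# (the clusterIP is only reachable from inside the cluster)
curl -s --max-time 5 "http://$(kubectl get svc coordinator -n default -o jsonpath='{.spec.clusterIP}:{.spec.ports[0].port}')/v1/health"
```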
### 2. Network Partition
**Objective**: Test blockchain consensus during network partitions.
**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during partition
4. Remove network partition
5. Verify network recovery
**Metrics Collected**:
- Network recovery time
- Consensus health during partition
- Node connectivity status
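The partition itself is a set of per-pod `iptables` rules. Roughly what `chaos_test_network.py` applies and later removes (pod names are placeholders):
```bash
# Partition: drop all traffic from <node-a> towards <node-b>'s pod IP
TARGET_IP=$(kubectl get pod <node-b> -n default -o jsonpath='{.status.podIP}')
kubectl exec -n default <node-a> -- iptables -A OUTPUT -d "$TARGET_IP" -j DROP
# Heal: flush the OUTPUT chain on the affected pod
kubectl exec -n default <node-a> -- iptables -F OUTPUT
```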
### 3. Database Failure
**Objective**: Test application behavior when database is unavailable.
**Steps**:
1. Simulate database connection failure or high latency
2. Monitor API behavior during failure
3. Restore database connectivity
4. Verify application recovery
**Metrics Collected**:
- Database recovery time
- API error rates during failure
- Application resilience metrics
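Both failure modes are injected inside the PostgreSQL pod, which therefore needs `iptables` and `tc` available. Roughly what `chaos_test_database.py` runs (pod name is a placeholder):
```bash
# Connection failure: drop traffic to and from port 5432
kubectl exec -n default <postgresql-pod> -- iptables -A INPUT -p tcp --dport 5432 -j DROP
kubectl exec -n default <postgresql-pod> -- iptables -A OUTPUT -p tcp --sport 5432 -j DROP
# High latency: add 5000ms of delay with netem
kubectl exec -n default <postgresql-pod> -- tc qdisc add dev eth0 root netem delay 5000ms
# Restore: flush the iptables rules and remove the qdisc
kubectl exec -n default <postgresql-pod> -- iptables -F INPUT
kubectl exec -n default <postgresql-pod> -- iptables -F OUTPUT
kubectl exec -n default <postgresql-pod> -- tc qdisc del dev eth0 root
```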
## Results and Reporting
### Test Results Format
Each test generates a JSON results file with the following structure:
```json
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
```
### Orchestrator Report
The orchestrator generates a comprehensive report including:
- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics
Example report snippet:
```json
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
```
## SLO Targets
| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |
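SLO compliance can be checked directly against the orchestrator report. A small sketch (it assumes a `chaos_test_report_*.json` produced by `chaos_orchestrator.py` is present in the working directory):
```python
#!/usr/bin/env python3
"""Check the latest orchestrator report against the MTTR SLO targets above."""
import json
import sys
from pathlib import Path

SLO_AVG_MTTR = 120  # seconds
SLO_MAX_MTTR = 300  # seconds

reports = sorted(Path(".").glob("chaos_test_report_*.json"))
if not reports:
    sys.exit("no chaos_test_report_*.json found")

# The report nests the orchestrator results under "orchestration"
summary = json.loads(reports[-1].read_text())["orchestration"]["summary"]
ok = summary["average_mttr"] <= SLO_AVG_MTTR and summary["max_mttr"] <= SLO_MAX_MTTR
print(f"average_mttr={summary['average_mttr']:.1f}s "
      f"max_mttr={summary['max_mttr']:.1f}s -> {'PASS' if ok else 'FAIL'}")
sys.exit(0 if ok else 1)
```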
## Best Practices
### Before Running Tests
1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods
### During Tests
1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted
### After Tests
1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing
## Integration with CI/CD
### GitHub Actions Example
```yaml
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: chaos-results
          path: "*.json"
```
## Troubleshooting
### Common Issues
1. **kubectl not found**
   ```bash
   # Ensure kubectl is installed and configured
   which kubectl
   kubectl version
   ```
2. **Permission denied errors**
   ```bash
   # Check RBAC permissions
   kubectl auth can-i create pods --namespace default
   kubectl auth can-i exec pods --namespace default
   ```
3. **Network rules not applying**
   ```bash
   # Check if iptables is available in pods
   kubectl exec -it <pod> -- iptables -L
   ```
4. **Tests hanging**
   ```bash
   # Check pod status
   kubectl get pods --namespace default
   kubectl describe pod <pod-name> --namespace default
   ```
### Debug Mode
Enable debug logging:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```
## Contributing
To add new chaos test scenarios:
1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods: `run_test()` and `save_results()` (a skeleton is sketched below)
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation
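A minimal skeleton for a new scenario, modeled on the existing scripts (the class name, metrics fields, and `<scenario>` placeholder are illustrative, not prescribed):
```python
#!/usr/bin/env python3
"""
Chaos Testing Script - <Scenario>
Skeleton for a new scenario, following the structure of the existing chaos_test_*.py scripts.
"""
import asyncio
import json
from datetime import datetime


class ChaosTestScenario:
    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {"scenario": "<scenario>", "mttr": None,
                        "error_count": 0, "success_count": 0}

    async def run_test(self) -> bool:
        """Inject the failure, monitor recovery, and record metrics."""
        self.metrics["test_start"] = datetime.utcnow().isoformat()
        # TODO: induce the failure, wait for recovery, set self.metrics["mttr"]
        self.metrics["test_end"] = datetime.utcnow().isoformat()
        return True

    def save_results(self):
        """Write results as chaos_test_<scenario>_<timestamp>.json so the orchestrator can find them."""
        filename = f"chaos_test_scenario_{datetime.now():%Y%m%d_%H%M%S}.json"
        with open(filename, "w") as f:
            json.dump(self.metrics, f, indent=2)


async def main():
    test = ChaosTestScenario()
    if await test.run_test():
        test.save_results()


if __name__ == "__main__":
    asyncio.run(main())
```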
## Security Considerations
- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests
## Support
For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io
## License
This chaos testing framework is part of the AITBC project and follows the same license terms.

infra/scripts/backup_ledger.sh Executable file

@@ -0,0 +1,233 @@
#!/bin/bash
# Ledger Storage Backup Script for AITBC
# Usage: ./backup_ledger.sh [namespace] [backup_name] [incremental true|false]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/ledger-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get blockchain node pods
get_blockchain_pods() {
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pods" ]]; then
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pods" ]]; then
error "Could not find blockchain node pods in namespace $NAMESPACE"
exit 1
fi
echo $pods
}
# Wait for blockchain node to be ready
wait_for_blockchain_node() {
local pod=$1
log "Waiting for blockchain node pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if node is responding
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
log "Blockchain node is ready"
return 0
fi
sleep 2
((retries--))
done
error "Blockchain node did not become ready within timeout"
exit 1
}
# Backup ledger data
backup_ledger_data() {
local pod=$1
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
mkdir -p "$ledger_backup_dir"
log "Starting ledger backup from pod $pod"
# Get the latest block height before backup
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
log "Latest block height: $latest_block"
# Backup blockchain data directory
local blockchain_data_dir="/app/data/chain"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
log "Backing up blockchain data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
fi
# Backup wallet data
local wallet_data_dir="/app/data/wallets"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
log "Backing up wallet data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
fi
# Backup receipts
local receipts_data_dir="/app/data/receipts"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
log "Backing up receipts directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
fi
# Create metadata file
cat > "$ledger_backup_dir/metadata.json" << EOF
{
"backup_name": "$BACKUP_NAME",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"namespace": "$NAMESPACE",
"source_pod": "$pod",
"latest_block_height": $latest_block,
"backup_type": "full"
}
EOF
log "Ledger backup completed: $ledger_backup_dir"
# Verify backup
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
log "Total backup size: $total_size"
}
# Create incremental backup
create_incremental_backup() {
local pod=$1
local last_backup_file="$BACKUP_DIR/.last_backup_height"
# Get last backup height
local last_backup_height=0
if [[ -f "$last_backup_file" ]]; then
last_backup_height=$(cat "$last_backup_file")
fi
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
if [[ $current_height -le $last_backup_height ]]; then
log "No new blocks since last backup (height: $current_height)"
return 0
fi
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
# Export blocks since last backup
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
# Update last backup height
echo "$current_height" > "$last_backup_file"
log "Incremental backup created: $incremental_file"
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_dir="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
# Upload entire backup directory
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
local incremental=${3:-false}
log "Starting ledger backup process (incremental=$incremental)"
check_dependencies
create_backup_dir
local pods=($(get_blockchain_pods))
# Use the first ready pod for backup
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
wait_for_blockchain_node "$pod"
if [[ "$incremental" == "true" ]]; then
create_incremental_backup "$pod"
else
backup_ledger_data "$pod"
fi
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
upload_to_cloud "$backup_dir"
break
fi
done
cleanup_old_backups
log "Ledger backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/backup_postgresql.sh Executable file

@@ -0,0 +1,172 @@
#!/bin/bash
# PostgreSQL Backup Script for AITBC
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/postgresql-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v pg_dump &> /dev/null; then
error "pg_dump is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get PostgreSQL pod name
get_postgresql_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for PostgreSQL to be ready
wait_for_postgresql() {
local pod=$1
log "Waiting for PostgreSQL pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if PostgreSQL is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
log "PostgreSQL is ready"
return 0
fi
sleep 2
((retries--))
done
error "PostgreSQL did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
log "Starting PostgreSQL backup to $backup_file"
# Get database credentials from secret
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Perform the backup (PGPASSWORD must be set inside the pod; exporting it only in the local shell would not reach pg_dump)
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" pg_dump -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists --create --format=custom \
--file="/tmp/${BACKUP_NAME}.dump"
# Copy backup from pod
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
# Clean up remote backup file
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
# Compress backup
gzip "$backup_file"
backup_file="${backup_file}.gz"
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="postgresql/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting PostgreSQL backup process"
check_dependencies
create_backup_dir
local pod=$(get_postgresql_pod)
wait_for_postgresql "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
upload_to_cloud "$backup_file"
log "PostgreSQL backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/backup_redis.sh Executable file

@@ -0,0 +1,189 @@
#!/bin/bash
# Redis Backup Script for AITBC
# Usage: ./backup_redis.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/redis-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get Redis pod name
get_redis_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find Redis pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for Redis to be ready
wait_for_redis() {
local pod=$1
log "Waiting for Redis pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if Redis is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "Redis is ready"
return 0
fi
sleep 2
((retries--))
done
error "Redis did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
log "Starting Redis backup to $backup_file"
# Record the last save timestamp, then trigger a background save
local initial_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
# Wait for background save to complete (LASTSAVE advances once the RDB snapshot is written)
log "Waiting for background save to complete..."
local retries=60
while [[ $retries -gt 0 ]]; do
local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$lastsave" -gt "$initial_lastsave" ]]; then
log "Background save completed"
break
fi
sleep 2
((retries--))
done
if [[ $retries -eq 0 ]]; then
error "Background save did not complete within timeout"
exit 1
fi
# Copy RDB file from pod
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
# Also create an append-only file backup if enabled
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
if [[ "$aof_enabled" == "yes" ]]; then
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
log "AOF backup created: $aof_backup"
fi
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="redis/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
# Upload AOF file if exists
local aof_file="${backup_file%.rdb}.aof"
if [[ -f "$aof_file" ]]; then
local aof_key="redis/$(basename "$aof_file")"
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
fi
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting Redis backup process"
check_dependencies
create_backup_dir
local pod=$(get_redis_pod)
wait_for_redis "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
upload_to_cloud "$backup_file"
log "Redis backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/chaos_orchestrator.py Executable file

@@ -0,0 +1,342 @@
#!/usr/bin/env python3
"""
Chaos Testing Orchestrator
Runs multiple chaos test scenarios and aggregates MTTR metrics
"""
import asyncio
import argparse
import json
import logging
import subprocess
import sys
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosOrchestrator:
"""Orchestrates multiple chaos test scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.results = {
"orchestration_start": None,
"orchestration_end": None,
"scenarios": [],
"summary": {
"total_scenarios": 0,
"successful_scenarios": 0,
"failed_scenarios": 0,
"average_mttr": 0,
"max_mttr": 0,
"min_mttr": float('inf')
}
}
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
"""Run a single chaos test scenario"""
scenario_name = Path(script).stem.replace("chaos_test_", "")
logger.info(f"Running scenario: {scenario_name}")
cmd = ["python3", script] + args
start_time = time.time()
try:
# Run the chaos test script
process = await asyncio.create_subprocess_exec(
*cmd,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE
)
stdout, stderr = await process.communicate()
if process.returncode != 0:
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
logger.error(f"Error: {stderr.decode()}")
return None
# Find the results file
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
if not result_files:
logger.error(f"No results file found for scenario {scenario_name}")
return None
# Load the most recent result file
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
with open(result_file, 'r') as f:
results = json.load(f)
# Add execution metadata
results["execution_time"] = time.time() - start_time
results["scenario_name"] = scenario_name
logger.info(f"Scenario {scenario_name} completed successfully")
return results
except Exception as e:
logger.error(f"Failed to run scenario {scenario_name}: {e}")
return None
def calculate_summary_metrics(self):
"""Calculate summary metrics across all scenarios"""
mttr_values = []
for scenario in self.results["scenarios"]:
if scenario.get("mttr"):
mttr_values.append(scenario["mttr"])
if mttr_values:
self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
self.results["summary"]["max_mttr"] = max(mttr_values)
self.results["summary"]["min_mttr"] = min(mttr_values)
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
self.results["summary"]["successful_scenarios"] = sum(
1 for s in self.results["scenarios"] if s.get("mttr") is not None
)
self.results["summary"]["failed_scenarios"] = (
self.results["summary"]["total_scenarios"] -
self.results["summary"]["successful_scenarios"]
)
def generate_report(self, output_file: Optional[str] = None):
"""Generate a comprehensive chaos test report"""
report = {
"report_generated": datetime.utcnow().isoformat(),
"namespace": self.namespace,
"orchestration": self.results,
"recommendations": []
}
# Add recommendations based on results
if self.results["summary"]["average_mttr"] > 120:
report["recommendations"].append(
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
)
if self.results["summary"]["max_mttr"] > 300:
report["recommendations"].append(
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
)
if self.results["summary"]["failed_scenarios"] > 0:
report["recommendations"].append(
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
)
# Check for specific scenario issues
for scenario in self.results["scenarios"]:
if scenario.get("scenario_name") == "coordinator_outage":
if scenario.get("mttr", 0) > 180:
report["recommendations"].append(
"Coordinator recovery is slow. Consider reducing pod startup time."
)
elif scenario.get("scenario_name") == "network_partition":
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
report["recommendations"].append(
"High error rate during network partition. Improve error handling."
)
elif scenario.get("scenario_name") == "database_failure":
if scenario.get("failure_type") == "connection":
report["recommendations"].append(
"Consider implementing database connection pooling and retry logic."
)
# Save report
if output_file:
with open(output_file, 'w') as f:
json.dump(report, f, indent=2)
logger.info(f"Chaos test report saved to: {output_file}")
# Print summary
self.print_summary()
return report
def print_summary(self):
"""Print a summary of all chaos test results"""
print("\n" + "="*60)
print("CHAOS TESTING SUMMARY REPORT")
print("="*60)
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
print(f"Namespace: {self.namespace}")
print(f"\nScenario Results:")
print("-" * 40)
for scenario in self.results["scenarios"]:
name = scenario.get("scenario_name", "Unknown")
mttr = scenario.get("mttr", "N/A")
if mttr != "N/A":
mttr = f"{mttr:.2f}s"
print(f" {name:20} MTTR: {mttr}")
print(f"\nSummary Metrics:")
print("-" * 40)
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
print(f" Successful: {self.results['summary']['successful_scenarios']}")
print(f" Failed: {self.results['summary']['failed_scenarios']}")
if self.results["summary"]["average_mttr"] > 0:
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
# SLO compliance
print(f"\nSLO Compliance:")
print("-" * 40)
slo_target = 120 # 2 minutes
if self.results["summary"]["average_mttr"] <= slo_target:
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
else:
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
print("\n" + "="*60)
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
"""Run all specified chaos test scenarios"""
logger.info("Starting chaos testing orchestration")
self.results["orchestration_start"] = datetime.utcnow().isoformat()
for scenario in scenarios:
args = scenario_args.get(scenario, [])
# Add namespace to all scenarios
args.extend(["--namespace", self.namespace])
result = await self.run_scenario(scenario, args)
if result:
self.results["scenarios"].append(result)
self.results["orchestration_end"] = datetime.utcnow().isoformat()
# Calculate summary metrics
self.calculate_summary_metrics()
# Generate report
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_file = f"chaos_test_report_{timestamp}.json"
self.generate_report(report_file)
logger.info("Chaos testing orchestration completed")
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
"""Run chaos tests continuously over time"""
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
end_time = datetime.now() + timedelta(hours=duration_hours)
interval_seconds = interval_minutes * 60
all_results = []
while datetime.now() < end_time:
cycle_start = datetime.now()
logger.info(f"Starting chaos test cycle at {cycle_start}")
# Run a random scenario
scenarios = [
"chaos_test_coordinator.py",
"chaos_test_network.py",
"chaos_test_database.py"
]
import random
selected_scenario = random.choice(scenarios)
# Run scenario with reduced duration for continuous testing
args = ["--namespace", self.namespace]
if "coordinator" in selected_scenario:
args.extend(["--outage-duration", "30", "--load-duration", "60"])
elif "network" in selected_scenario:
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
elif "database" in selected_scenario:
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
result = await self.run_scenario(selected_scenario, args)
if result:
result["cycle_time"] = cycle_start.isoformat()
all_results.append(result)
# Wait for next cycle
elapsed = (datetime.now() - cycle_start).total_seconds()
if elapsed < interval_seconds:
wait_time = interval_seconds - elapsed
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
await asyncio.sleep(wait_time)
# Generate continuous testing report
continuous_report = {
"continuous_testing": True,
"duration_hours": duration_hours,
"interval_minutes": interval_minutes,
"total_cycles": len(all_results),
"cycles": all_results
}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_file = f"continuous_chaos_report_{timestamp}.json"
with open(report_file, 'w') as f:
json.dump(continuous_report, f, indent=2)
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
async def main():
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--scenarios", nargs="+",
choices=["coordinator", "network", "database"],
default=["coordinator", "network", "database"],
help="Scenarios to run")
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
orchestrator = ChaosOrchestrator(args.namespace)
if args.dry_run:
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
return
if args.continuous:
await orchestrator.run_continuous_chaos(args.duration, args.interval)
else:
# Map scenario names to script files
scenario_map = {
"coordinator": "chaos_test_coordinator.py",
"network": "chaos_test_network.py",
"database": "chaos_test_database.py"
}
# Get script files
scripts = [scenario_map[s] for s in args.scenarios]
# Default arguments for each scenario
scenario_args = {
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
}
await orchestrator.run_all_scenarios(scripts, scenario_args)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_coordinator.py Executable file

@@ -0,0 +1,287 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Coordinator API Outage
Tests system resilience when coordinator API becomes unavailable
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestCoordinator:
"""Chaos testing for coordinator API outage scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"outage_start": None,
"outage_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "coordinator_outage"
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_coordinator_pods(self) -> List[str]:
"""Get list of coordinator pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get coordinator pods: {e}")
return []
def delete_coordinator_pods(self) -> bool:
"""Delete all coordinator pods to simulate outage"""
try:
cmd = [
"kubectl", "delete", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"--force", "--grace-period=0"
]
subprocess.run(cmd, check=True)
logger.info("Coordinator pods deleted successfully")
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to delete coordinator pods: {e}")
return False
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
"""Wait for all coordinator pods to terminate"""
start_time = time.time()
while time.time() - start_time < timeout:
pods = self.get_coordinator_pods()
if not pods:
logger.info("All coordinator pods terminated")
return True
await asyncio.sleep(2)
logger.error("Timeout waiting for pods to terminate")
return False
async def wait_for_recovery(self, timeout: int = 300) -> bool:
"""Wait for coordinator service to recover"""
start_time = time.time()
while time.time() - start_time < timeout:
try:
# Check if pods are running
pods = self.get_coordinator_pods()
if not pods:
await asyncio.sleep(5)
continue
# Check if at least one pod is ready
ready_cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
]
result = subprocess.run(ready_cmd, capture_output=True, text=True)
if result.stdout.strip():
# Test API health
if self.test_health_endpoint():
recovery_time = time.time() - start_time
self.metrics["recovery_time"] = recovery_time
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
return True
except Exception as e:
logger.debug(f"Recovery check failed: {e}")
await asyncio.sleep(5)
logger.error("Service did not recover within timeout")
return False
def test_health_endpoint(self) -> bool:
"""Test if coordinator health endpoint is responding"""
try:
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
service_url = f"http://{result.stdout.strip()}/v1/health"
# Test health endpoint
response = subprocess.run(
["curl", "-s", "--max-time", "5", service_url],
capture_output=True, text=True
)
return response.returncode == 0 and "ok" in response.stdout
except Exception:
return False
async def generate_load(self, duration: int, concurrent: int = 10):
"""Generate synthetic load on coordinator API"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
tasks = []
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
tasks.extend(batch)
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
"""Run the complete chaos test"""
logger.info("Starting coordinator outage chaos test")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Phase 1: Generate initial load
logger.info("Phase 1: Generating initial load")
await self.generate_load(30)
# Phase 2: Induce outage
logger.info("Phase 2: Inducing coordinator outage")
self.metrics["outage_start"] = datetime.utcnow().isoformat()
if not self.delete_coordinator_pods():
logger.error("Failed to induce outage")
return False
if not await self.wait_for_pods_termination():
logger.error("Pods did not terminate")
return False
# Wait for specified outage duration
logger.info(f"Waiting for {outage_duration} seconds outage duration")
await asyncio.sleep(outage_duration)
# Phase 3: Monitor recovery
logger.info("Phase 3: Monitoring service recovery")
self.metrics["outage_end"] = datetime.utcnow().isoformat()
if not await self.wait_for_recovery():
logger.error("Service did not recover")
return False
# Phase 4: Post-recovery load test
logger.info("Phase 4: Post-recovery load test")
await self.generate_load(load_duration)
# Calculate metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_coordinator_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestCoordinator(args.namespace) as test:
success = await test.run_test(args.outage_duration, args.load_duration)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_database.py Executable file

@@ -0,0 +1,387 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Database Failure
Tests system resilience when PostgreSQL database becomes unavailable
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestDatabase:
"""Chaos testing for database failure scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"failure_start": None,
"failure_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "database_failure",
"failure_type": None
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_postgresql_pod(self) -> Optional[str]:
"""Get PostgreSQL pod name"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=postgresql",
"-o", "jsonpath={.items[0].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pod = result.stdout.strip()
return pod if pod else None
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get PostgreSQL pod: {e}")
return None
def simulate_database_connection_failure(self) -> bool:
"""Simulate database connection failure by blocking port 5432"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Block incoming connections to PostgreSQL
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
]
subprocess.run(cmd, check=True)
# Block outgoing connections from PostgreSQL
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
]
subprocess.run(cmd, check=True)
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
self.metrics["failure_type"] = "connection_blocked"
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to block PostgreSQL connections: {e}")
return False
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
"""Simulate high database latency using netem"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Add latency to PostgreSQL traffic
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
]
subprocess.run(cmd, check=True)
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
self.metrics["failure_type"] = "high_latency"
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to add latency to PostgreSQL: {e}")
return False
def restore_database(self) -> bool:
"""Restore database connections"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Remove iptables rules
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "INPUT"
]
subprocess.run(cmd, check=False) # May fail if rules don't exist
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "OUTPUT"
]
subprocess.run(cmd, check=False)
# Remove netem qdisc
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"tc", "qdisc", "del", "dev", "eth0", "root"
]
subprocess.run(cmd, check=False)
logger.info(f"Restored PostgreSQL connections on pod {pod}")
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to restore PostgreSQL: {e}")
return False
async def test_database_connectivity(self) -> bool:
"""Test if coordinator can connect to database"""
try:
# Get coordinator pod
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[0].metadata.name}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
coordinator_pod = result.stdout.strip()
if not coordinator_pod:
return False
# Test database connection from coordinator
cmd = [
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
]
result = subprocess.run(cmd, capture_output=True, text=True)
return result.returncode == 0 and "OK" in result.stdout
except Exception:
return False
async def test_api_health(self) -> bool:
"""Test if coordinator API is healthy"""
try:
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
service_url = f"http://{result.stdout.strip()}/v1/health"
# Test health endpoint
response = subprocess.run(
["curl", "-s", "--max-time", "5", service_url],
capture_output=True, text=True
)
return response.returncode == 0 and "ok" in response.stdout
except Exception:
return False
async def generate_load(self, duration: int, concurrent: int = 10):
"""Generate synthetic load on coordinator API"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
tasks = []
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
tasks.extend(batch)
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def wait_for_recovery(self, timeout: int = 300) -> bool:
"""Wait for database and API to recover"""
start_time = time.time()
while time.time() - start_time < timeout:
# Test database connectivity
db_connected = await self.test_database_connectivity()
# Test API health
api_healthy = await self.test_api_health()
if db_connected and api_healthy:
recovery_time = time.time() - start_time
self.metrics["recovery_time"] = recovery_time
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
return True
await asyncio.sleep(5)
logger.error("Database and API did not recover within timeout")
return False
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
"""Run the complete database chaos test"""
logger.info(f"Starting database chaos test - failure type: {failure_type}")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Phase 1: Baseline test
logger.info("Phase 1: Baseline connectivity test")
db_connected = await self.test_database_connectivity()
api_healthy = await self.test_api_health()
if not db_connected or not api_healthy:
logger.error("Baseline test failed - database or API not healthy")
return False
logger.info("Baseline: Database and API are healthy")
# Phase 2: Generate initial load
logger.info("Phase 2: Generating initial load")
await self.generate_load(30)
# Phase 3: Induce database failure
logger.info("Phase 3: Inducing database failure")
self.metrics["failure_start"] = datetime.utcnow().isoformat()
if failure_type == "connection":
if not self.simulate_database_connection_failure():
logger.error("Failed to induce database connection failure")
return False
elif failure_type == "latency":
if not self.simulate_database_high_latency():
logger.error("Failed to induce database latency")
return False
else:
logger.error(f"Unknown failure type: {failure_type}")
return False
# Verify failure is effective
await asyncio.sleep(5)
db_connected = await self.test_database_connectivity()
api_healthy = await self.test_api_health()
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
# Phase 4: Monitor during failure
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
# Generate load during failure
await self.generate_load(failure_duration)
# Phase 5: Restore database and monitor recovery
logger.info("Phase 5: Restoring database")
self.metrics["failure_end"] = datetime.utcnow().isoformat()
if not self.restore_database():
logger.error("Failed to restore database")
return False
# Wait for recovery
if not await self.wait_for_recovery():
logger.error("System did not recover after database restoration")
return False
# Phase 6: Post-recovery load test
logger.info("Phase 6: Post-recovery load test")
await self.generate_load(60)
# Final metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Database chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_database_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Failure Type: {self.metrics['failure_type']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for database failure")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestDatabase(args.namespace) as test:
success = await test.run_test(args.failure_type, args.failure_duration)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_network.py Executable file

@@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Network Partition
Tests system resilience when blockchain nodes experience network partitions
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestNetwork:
"""Chaos testing for network partition scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"partition_start": None,
"partition_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "network_partition",
"affected_nodes": []
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_blockchain_pods(self) -> List[str]:
"""Get list of blockchain node pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=blockchain-node",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get blockchain pods: {e}")
return []
def get_coordinator_pods(self) -> List[str]:
"""Get list of coordinator pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get coordinator pods: {e}")
return []
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
"""Apply network partition using iptables"""
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
for pod in pods:
if pod in target_pods:
continue
# Block traffic from this pod to target pods
for target_pod in target_pods:
try:
# Get target pod IP
cmd = [
"kubectl", "get", "pod", target_pod,
"-n", self.namespace,
"-o", "jsonpath={.status.podIP}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
target_ip = result.stdout.strip()
if not target_ip:
continue
# Apply iptables rule to block traffic
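# Note: assumes iptables is available in the node image and the pod has the NET_ADMIN capability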
iptables_cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
]
subprocess.run(iptables_cmd, check=True)
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
except subprocess.CalledProcessError as e:
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
return False
self.metrics["affected_nodes"] = pods + target_pods
return True
def remove_network_partition(self, pods: List[str]) -> bool:
"""Remove network partition rules"""
logger.info("Removing network partition rules")
for pod in pods:
try:
# Flush OUTPUT chain (remove all rules)
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "OUTPUT"
]
subprocess.run(cmd, check=True)
logger.info(f"Removed network rules from {pod}")
except subprocess.CalledProcessError as e:
logger.error(f"Failed to remove network rules from {pod}: {e}")
return False
return True
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
"""Test connectivity between pods"""
results = {}
for pod in pods:
try:
# Test if pod can reach coordinator
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
]
result = subprocess.run(cmd, capture_output=True, text=True)
results[pod] = result.returncode == 0 and "ok" in result.stdout
except Exception:
results[pod] = False
return results
async def monitor_consensus(self, duration: int = 60) -> bool:
"""Monitor blockchain consensus health"""
logger.info(f"Monitoring consensus for {duration} seconds")
start_time = time.time()
last_height = 0
while time.time() - start_time < duration:
try:
# Get block height from a random pod
pods = self.get_blockchain_pods()
if not pods:
await asyncio.sleep(5)
continue
# Use first pod to check height
cmd = [
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
"curl", "-s", "http://localhost:8080/v1/blocks/head"
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
try:
data = json.loads(result.stdout)
current_height = data.get("height", 0)
# Check if blockchain is progressing
if current_height > last_height:
last_height = current_height
logger.info(f"Blockchain progressing, height: {current_height}")
elif time.time() - start_time > 30: # Allow 30s for initial sync
logger.warning(f"Blockchain stuck at height {current_height}")
except json.JSONDecodeError:
pass
except Exception as e:
logger.debug(f"Consensus check failed: {e}")
await asyncio.sleep(5)
return last_height > 0
async def generate_load(self, duration: int, concurrent: int = 5):
"""Generate synthetic load on blockchain nodes"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "blockchain-node",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
"""Run the complete network partition chaos test"""
logger.info("Starting network partition chaos test")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Get all blockchain pods
all_pods = self.get_blockchain_pods()
if not all_pods:
logger.error("No blockchain pods found")
return False
# Determine which pods to partition
num_partition = int(len(all_pods) * partition_ratio)
partition_pods = all_pods[:num_partition]
remaining_pods = all_pods[num_partition:]
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
# Phase 1: Baseline test
logger.info("Phase 1: Baseline connectivity test")
baseline_connectivity = await self.test_connectivity(all_pods)
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
# Phase 2: Generate initial load
logger.info("Phase 2: Generating initial load")
await self.generate_load(30)
# Phase 3: Apply network partition
logger.info("Phase 3: Applying network partition")
self.metrics["partition_start"] = datetime.utcnow().isoformat()
if not self.apply_network_partition(remaining_pods, partition_pods):
logger.error("Failed to apply network partition")
return False
# Verify partition is effective
await asyncio.sleep(5)
partitioned_connectivity = await self.test_connectivity(all_pods)
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
# Phase 4: Monitor during partition
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
consensus_healthy = await self.monitor_consensus(partition_duration)
# Phase 5: Remove partition and monitor recovery
logger.info("Phase 5: Removing network partition")
self.metrics["partition_end"] = datetime.utcnow().isoformat()
if not self.remove_network_partition(all_pods):
logger.error("Failed to remove network partition")
return False
# Wait for recovery
logger.info("Waiting for network recovery...")
await asyncio.sleep(10)
# Test connectivity after recovery
recovery_connectivity = await self.test_connectivity(all_pods)
recovery_time = time.time()
# Calculate recovery metrics
all_connected = all(recovery_connectivity.values())
if all_connected:
self.metrics["recovery_time"] = recovery_time - (datetime.fromisoformat(self.metrics["partition_end"]).timestamp())
logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
# Phase 6: Post-recovery load test
logger.info("Phase 6: Post-recovery load test")
await self.generate_load(60)
# Final metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Network partition chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_network_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for network partition")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestNetwork(args.namespace) as test:
success = await test.run_test(args.partition_duration, args.partition_ratio)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,279 @@
#!/bin/bash
# Ledger Storage Restore Script for AITBC
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_DIR=${2:-}
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v jq &> /dev/null; then
error "jq is not installed or not in PATH"
exit 1
fi
}
# Validate backup directory
validate_backup_dir() {
if [[ -z "$BACKUP_DIR" ]]; then
error "Backup directory must be specified"
echo "Usage: $0 [namespace] [backup_directory]"
exit 1
fi
if [[ ! -d "$BACKUP_DIR" ]]; then
error "Backup directory not found: $BACKUP_DIR"
exit 1
fi
# Check for required files
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
error "metadata.json not found in backup directory"
exit 1
fi
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
error "chain.tar.gz not found in backup directory"
exit 1
fi
log "Using backup directory: $BACKUP_DIR"
}
# Get blockchain node pods
get_blockchain_pods() {
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pods" ]]; then
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pods" ]]; then
error "Could not find blockchain node pods in namespace $NAMESPACE"
exit 1
fi
echo $pods
}
# Create backup of current ledger before restore
create_pre_restore_backup() {
local pods=($1)
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
warn "Creating backup of current ledger before restore..."
mkdir -p "$pre_restore_dir"
# Use the first ready pod
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
# Create metadata
cat > "$pre_restore_dir/metadata.json" << EOF
{
"backup_name": "$pre_restore_backup",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"namespace": "$NAMESPACE",
"source_pod": "$pod",
"latest_block_height": $current_height,
"backup_type": "pre-restore"
}
EOF
# Backup data directories
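# Note: kubectl cp relies on tar being available inside the pod's container image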
local data_dirs=("chain" "wallets" "receipts")
for dir in "${data_dirs[@]}"; do
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
fi
done
log "Pre-restore backup created: $pre_restore_dir"
break
fi
done
}
# Perform restore
perform_restore() {
local pods=($1)
warn "This will replace all current ledger data. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Scale down blockchain nodes, remembering the current replica count so it can be restored
info "Scaling down blockchain node deployment..."
local replicas=$(kubectl get deployment blockchain-node -n "$NAMESPACE" -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "3")
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
# Wait for pods to terminate
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
# Scale up blockchain nodes to the original replica count
info "Scaling up blockchain node deployment..."
kubectl scale deployment blockchain-node --replicas="${replicas:-3}" -n "$NAMESPACE"
# Wait for pods to be ready
local ready_pods=()
local retries=30
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
local all_pods=$(get_blockchain_pods)
for pod in $all_pods; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
ready_pods+=("$pod")
fi
done
if [[ ${#ready_pods[@]} -eq 0 ]]; then
sleep 5
((retries--))
fi
done
if [[ ${#ready_pods[@]} -eq 0 ]]; then
error "No blockchain nodes became ready after restore"
exit 1
fi
# Restore data to all ready pods
for pod in "${ready_pods[@]}"; do
info "Restoring ledger data to pod $pod..."
# Create temp directory on pod
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
# Extract and copy chain data
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
fi
# Extract and copy wallet data
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
fi
# Extract and copy receipt data
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
fi
# Set correct permissions
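# Assumes the blockchain-node image runs as user/group app:app; adjust if the image uses a different UID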
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
# Clean up temp directory
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
log "Ledger data restored to pod $pod"
done
log "Ledger restore completed successfully"
}
# Verify restore
verify_restore() {
local pods=($1)
log "Verifying ledger restore..."
# Read backup metadata
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
log "Backup contains blocks up to height: $backup_height"
# Verify on each pod
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
# Check if node is responding
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
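# An exact match assumes no new blocks were produced after the backup; a height ahead of the backup after resync can also be acceptable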
if [[ "$current_height" -eq "$backup_height" ]]; then
log "✓ Pod $pod: Block height matches backup ($current_height)"
else
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
fi
# Check data directories
local dirs=("chain" "wallets" "receipts")
for dir in "${dirs[@]}"; do
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
log "✓ Pod $pod: $dir directory contains $file_count files"
else
warn "⚠ Pod $pod: $dir directory not found"
fi
done
else
error "✗ Pod $pod: Not responding to health checks"
fi
fi
done
}
# Main execution
main() {
log "Starting ledger restore process"
check_dependencies
validate_backup_dir
local pods=($(get_blockchain_pods))
create_pre_restore_backup "${pods[*]}"
perform_restore "${pods[*]}"
# Get updated pod list after restore
pods=($(get_blockchain_pods))
verify_restore "${pods[*]}"
log "Ledger restore process completed successfully"
warn "Please verify blockchain synchronization and application functionality"
}
# Run main function
main "$@"


@@ -0,0 +1,228 @@
#!/bin/bash
# PostgreSQL Restore Script for AITBC
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_FILE=${2:-}
BACKUP_DIR="/tmp/postgresql-backups"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v pg_restore &> /dev/null; then
error "pg_restore is not installed or not in PATH"
exit 1
fi
}
# Validate backup file
validate_backup_file() {
if [[ -z "$BACKUP_FILE" ]]; then
error "Backup file must be specified"
echo "Usage: $0 [namespace] [backup_file]"
exit 1
fi
# If file doesn't exist locally, try to find it in backup dir
if [[ ! -f "$BACKUP_FILE" ]]; then
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
if [[ -f "$potential_file" ]]; then
BACKUP_FILE="$potential_file"
else
error "Backup file not found: $BACKUP_FILE"
exit 1
fi
fi
# Check if file is gzipped and decompress if needed
if [[ "$BACKUP_FILE" == *.gz ]]; then
info "Decompressing backup file..."
gunzip -c "$BACKUP_FILE" > "/tmp/restore_$(date +%s).dump"
BACKUP_FILE="/tmp/restore_$(date +%s).dump"
fi
log "Using backup file: $BACKUP_FILE"
}
# Get PostgreSQL pod name
get_postgresql_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for PostgreSQL to be ready
wait_for_postgresql() {
local pod=$1
log "Waiting for PostgreSQL pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if PostgreSQL is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
log "PostgreSQL is ready"
return 0
fi
sleep 2
((retries--))
done
error "PostgreSQL did not become ready within timeout"
exit 1
}
# Create backup of current database before restore
create_pre_restore_backup() {
local pod=$1
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
warn "Creating backup of current database before restore..."
# Get database credentials
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Create backup
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
pg_dump -U "$db_user" -h localhost -d "$db_name" \
--format=custom --file="/tmp/${pre_restore_backup}.dump"
# Copy backup locally
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
}
# Perform restore
perform_restore() {
local pod=$1
warn "This will replace the current database. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Get database credentials
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Copy backup file to pod
local remote_backup="/tmp/restore_$(date +%s).dump"
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
# Drop existing database and recreate
log "Dropping existing database..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"
log "Creating new database..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"
# Restore database
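# Running pg_restore inside the pod keeps the client tooling matched to the server version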
log "Restoring database from backup..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
pg_restore -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists "$remote_backup"
# Clean up remote file
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
log "Database restore completed successfully"
}
# Verify restore
verify_restore() {
local pod=$1
log "Verifying database restore..."
# Get database credentials
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Check table count
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
log "Database contains $table_count tables"
# Check if key tables exist
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
for table in "${key_tables[@]}"; do
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
if [[ "$exists" == "t" ]]; then
log "✓ Table $table exists"
else
warn "⚠ Table $table not found"
fi
done
}
# Main execution
main() {
log "Starting PostgreSQL restore process"
check_dependencies
validate_backup_file
local pod=$(get_postgresql_pod)
wait_for_postgresql "$pod"
create_pre_restore_backup "$pod"
perform_restore "$pod"
verify_restore "$pod"
log "PostgreSQL restore process completed successfully"
warn "Please verify application functionality after restore"
}
# Run main function
main "$@"


@@ -0,0 +1,223 @@
#!/bin/bash
# Redis Restore Script for AITBC
# Usage: ./restore_redis.sh [namespace] [backup_file]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_FILE=${2:-}
BACKUP_DIR="/tmp/redis-backups"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Validate backup file
validate_backup_file() {
if [[ -z "$BACKUP_FILE" ]]; then
error "Backup file must be specified"
echo "Usage: $0 [namespace] [backup_file]"
exit 1
fi
# If file doesn't exist locally, try to find it in backup dir
if [[ ! -f "$BACKUP_FILE" ]]; then
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
if [[ -f "$potential_file" ]]; then
BACKUP_FILE="$potential_file"
else
error "Backup file not found: $BACKUP_FILE"
exit 1
fi
fi
log "Using backup file: $BACKUP_FILE"
}
# Get Redis pod name
get_redis_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find Redis pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Create backup of current Redis data before restore
create_pre_restore_backup() {
local pod=$1
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
warn "Creating backup of current Redis data before restore..."
# Record the current LASTSAVE timestamp, then trigger a background save
local initial_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
# Wait for the save to complete (LASTSAVE advances once BGSAVE finishes)
local retries=60
while [[ $retries -gt 0 ]]; do
local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$lastsave" -gt "$initial_lastsave" ]]; then
break
fi
sleep 2
((retries--))
done
# Copy backup locally
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"
# Also backup AOF if exists
if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
fi
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
}
# Perform restore
perform_restore() {
local pod=$1
warn "This will replace all current Redis data. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Scale down Redis to ensure clean restore
info "Scaling down Redis deployment..."
kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"
# Wait for pod to terminate
kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s
# Scale up Redis
info "Scaling up Redis deployment..."
kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"
# Wait for new pod to be ready
local new_pod=$(get_redis_pod)
kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s
# Stop Redis server
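# Note: if redis-server is the container's entrypoint (PID 1), SHUTDOWN restarts the container; the steps below assume the container keeps running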
info "Stopping Redis server..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE
# Clear existing data
info "Clearing existing Redis data..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
info "Copying backup file..."
local remote_file="/data/restore.rdb"
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"
# Set correct permissions
kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"
# Start Redis server
info "Starting Redis server..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes
# Wait for Redis to be ready
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "Redis is ready"
break
fi
sleep 2
((retries--))
done
if [[ $retries -eq 0 ]]; then
error "Redis did not start properly after restore"
exit 1
fi
log "Redis restore completed successfully"
}
# Verify restore
verify_restore() {
local pod=$1
log "Verifying Redis restore..."
# Check database size
local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
log "Database contains $db_size keys"
# Check memory usage
local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
log "Memory usage: $memory"
# Check if Redis is responding to commands
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "✓ Redis is responding normally"
else
error "✗ Redis is not responding"
exit 1
fi
}
# Main execution
main() {
log "Starting Redis restore process"
check_dependencies
validate_backup_file
local pod=$(get_redis_pod)
create_pre_restore_backup "$pod"
perform_restore "$pod"
# Get new pod name after restore
pod=$(get_redis_pod)
verify_restore "$pod"
log "Redis restore process completed successfully"
warn "Please verify application functionality after restore"
}
# Run main function
main "$@"