feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates, with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instrumentation
infra/scripts/README_chaos.md (new file, 330 lines)
@@ -0,0 +1,330 @@
# AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

## Overview

The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance

## Components

### Test Scripts

1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
   - Deletes coordinator pods to simulate a complete service outage
   - Measures recovery time and service availability
   - Tests load handling during and after recovery

2. **`chaos_test_network.py`** - Network partition simulation
   - Creates network partitions between blockchain nodes
   - Tests consensus resilience during the partition
   - Measures network recovery time

3. **`chaos_test_database.py`** - Database failure simulation
   - Simulates PostgreSQL connection failures
   - Tests high-latency scenarios
   - Validates application error handling

4. **`chaos_orchestrator.py`** - Test orchestration and reporting
   - Runs multiple chaos test scenarios
   - Aggregates MTTR metrics across tests
   - Generates comprehensive reports
   - Supports continuous chaos testing

## Prerequisites

- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in the target namespace
- Administrative privileges for network manipulation

## Installation

```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py
```

## Usage

### Running Individual Tests

#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```

#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```

#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000 ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```

### Running All Tests

```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```

## Test Scenarios

### 1. Coordinator API Outage

**Objective**: Test system resilience when the coordinator service becomes unavailable.

**Steps**:
1. Generate baseline load on the coordinator API
2. Delete all coordinator pods
3. Wait for the specified outage duration
4. Monitor service recovery
5. Generate post-recovery load

**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution

### 2. Network Partition

**Objective**: Test blockchain consensus during network partitions.

**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during the partition
4. Remove the network partition
5. Verify network recovery

**Metrics Collected**:
- Network recovery time
- Consensus health during the partition
- Node connectivity status

### 3. Database Failure

**Objective**: Test application behavior when the database is unavailable.

**Steps**:
1. Simulate a database connection failure or high latency
2. Monitor API behavior during the failure
3. Restore database connectivity
4. Verify application recovery

**Metrics Collected**:
- Database recovery time
- API error rates during the failure
- Application resilience metrics

## Results and Reporting

### Test Results Format

Each test generates a JSON results file with the following structure:

```json
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
```

### Orchestrator Report

The orchestrator generates a comprehensive report including:

- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics

Example report snippet:
```json
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
```

## SLO Targets

| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |

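These summary figures are embedded in the orchestrator's JSON report, so SLO compliance can also be checked from the command line. A minimal sketch, assuming `jq` and `bc` are available and using the `chaos_test_report_*.json` file written by `chaos_orchestrator.py`:

```bash
# Pull the aggregated MTTR from the latest orchestrator report and compare it
# against the 120-second average-MTTR target from the table above.
REPORT=$(ls -t chaos_test_report_*.json | head -1)
AVG_MTTR=$(jq '.orchestration.summary.average_mttr' "$REPORT")
if (( $(echo "$AVG_MTTR <= 120" | bc -l) )); then
    echo "PASS: average MTTR ${AVG_MTTR}s is within the SLO"
else
    echo "FAIL: average MTTR ${AVG_MTTR}s exceeds the SLO"
fi
```
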
## Best Practices

### Before Running Tests

1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods

### During Tests

1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted

### After Tests

1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing

## Integration with CI/CD

### GitHub Actions Example

```yaml
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: chaos-results
          path: "*.json"
```

## Troubleshooting

### Common Issues

1. **kubectl not found**
   ```bash
   # Ensure kubectl is installed and configured
   which kubectl
   kubectl version
   ```

2. **Permission denied errors**
   ```bash
   # Check RBAC permissions
   kubectl auth can-i create pods --namespace default
   kubectl auth can-i exec pods --namespace default
   ```

3. **Network rules not applying**
   ```bash
   # Check if iptables is available in pods
   kubectl exec -it <pod> -- iptables -L
   ```

4. **Tests hanging**
   ```bash
   # Check pod status
   kubectl get pods --namespace default
   kubectl describe pod <pod-name> --namespace default
   ```

### Debug Mode

Enable debug logging:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```

## Contributing

To add new chaos test scenarios:

1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods `run_test()` and `save_results()` (see the skeleton below)
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation

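For reference, a minimal scenario skeleton is sketched below. The class name, file name, and failure logic are placeholders; only the `run_test()`/`save_results()` contract and the `chaos_test_<scenario>_<timestamp>.json` results-file pattern (which the orchestrator globs for) come from the existing scripts.

```python
#!/usr/bin/env python3
"""Minimal skeleton for a new chaos scenario (illustrative names and logic)."""

import asyncio
import json
from datetime import datetime


class ChaosTestExample:
    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {"scenario": "example_failure", "mttr": None,
                        "error_count": 0, "success_count": 0}

    async def run_test(self) -> bool:
        # 1. Inject the failure, 2. wait for recovery, 3. record the recovery time.
        self.metrics["mttr"] = 0.0  # replace with the measured recovery time
        self.save_results()
        return True

    def save_results(self):
        # The file name must match the orchestrator's glob: chaos_test_<scenario>_*.json
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        with open(f"chaos_test_example_{timestamp}.json", "w") as f:
            json.dump(self.metrics, f, indent=2)


if __name__ == "__main__":
    asyncio.run(ChaosTestExample().run_test())
```
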
## Security Considerations

- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests

## Support

For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io

## License

This chaos testing framework is part of the AITBC project and follows the same license terms.

infra/scripts/backup_ledger.sh (new executable file, 233 lines)
@@ -0,0 +1,233 @@
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Create incremental backup
|
||||
create_incremental_backup() {
|
||||
local pod=$1
|
||||
local last_backup_file="$BACKUP_DIR/.last_backup_height"
|
||||
|
||||
# Get last backup height
|
||||
local last_backup_height=0
|
||||
if [[ -f "$last_backup_file" ]]; then
|
||||
last_backup_height=$(cat "$last_backup_file")
|
||||
fi
|
||||
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ $current_height -le $last_backup_height ]]; then
|
||||
log "No new blocks since last backup (height: $current_height)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
|
||||
|
||||
# Export blocks since last backup
|
||||
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
|
||||
|
||||
# Update last backup height
|
||||
echo "$current_height" > "$last_backup_file"
|
||||
|
||||
log "Incremental backup created: $incremental_file"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
local incremental=${3:-false}
|
||||
|
||||
log "Starting ledger backup process (incremental=$incremental)"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
|
||||
if [[ "$incremental" == "true" ]]; then
|
||||
create_incremental_backup "$pod"
|
||||
else
|
||||
backup_ledger_data "$pod"
|
||||
fi
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
infra/scripts/backup_postgresql.sh (new executable file, 172 lines)
@@ -0,0 +1,172 @@
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup
|
||||
    # Pass the password into the pod's environment; a variable prefixed to the
    # local kubectl process would not be visible to pg_dump inside the container
    kubectl exec -n "$NAMESPACE" "$pod" -- \
        env PGPASSWORD="$db_password" \
        pg_dump -U "$db_user" -h localhost -d "$db_name" \
        --verbose --clean --if-exists --create --format=custom \
        --file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
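# Restore sketch (illustrative, not part of this script): the backup is a gzip-compressed
# pg_dump custom-format archive, so a restore would look roughly like:
#   gunzip "/tmp/postgresql-backups/<backup_name>.sql.gz"
#   pg_restore --clean --if-exists --create -U postgres -d postgres "/tmp/postgresql-backups/<backup_name>.sql"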
infra/scripts/backup_redis.sh (new executable file, 189 lines)
@@ -0,0 +1,189 @@
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
    # Record the last completed save time, then trigger a background save
    local save_start=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
    kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

    # Wait for the background save to complete (LASTSAVE advances once BGSAVE finishes)
    log "Waiting for background save to complete..."
    local retries=60
    while [[ $retries -gt 0 ]]; do
        local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)

        if [[ "$lastsave" -gt "$save_start" ]]; then
            log "Background save completed"
            break
        fi
        sleep 2
        ((retries--))
    done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
infra/scripts/chaos_orchestrator.py (new executable file, 342 lines)
@@ -0,0 +1,342 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Orchestrator
|
||||
Runs multiple chaos test scenarios and aggregates MTTR metrics
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosOrchestrator:
|
||||
"""Orchestrates multiple chaos test scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.results = {
|
||||
"orchestration_start": None,
|
||||
"orchestration_end": None,
|
||||
"scenarios": [],
|
||||
"summary": {
|
||||
"total_scenarios": 0,
|
||||
"successful_scenarios": 0,
|
||||
"failed_scenarios": 0,
|
||||
"average_mttr": 0,
|
||||
"max_mttr": 0,
|
||||
"min_mttr": float('inf')
|
||||
}
|
||||
}
|
||||
|
||||
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
|
||||
"""Run a single chaos test scenario"""
|
||||
scenario_name = Path(script).stem.replace("chaos_test_", "")
|
||||
logger.info(f"Running scenario: {scenario_name}")
|
||||
|
||||
cmd = ["python3", script] + args
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Run the chaos test script
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE
|
||||
)
|
||||
|
||||
stdout, stderr = await process.communicate()
|
||||
|
||||
if process.returncode != 0:
|
||||
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
|
||||
logger.error(f"Error: {stderr.decode()}")
|
||||
return None
|
||||
|
||||
# Find the results file
|
||||
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
|
||||
if not result_files:
|
||||
logger.error(f"No results file found for scenario {scenario_name}")
|
||||
return None
|
||||
|
||||
# Load the most recent result file
|
||||
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
|
||||
with open(result_file, 'r') as f:
|
||||
results = json.load(f)
|
||||
|
||||
# Add execution metadata
|
||||
results["execution_time"] = time.time() - start_time
|
||||
results["scenario_name"] = scenario_name
|
||||
|
||||
logger.info(f"Scenario {scenario_name} completed successfully")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to run scenario {scenario_name}: {e}")
|
||||
return None
|
||||
|
||||
def calculate_summary_metrics(self):
|
||||
"""Calculate summary metrics across all scenarios"""
|
||||
mttr_values = []
|
||||
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("mttr"):
|
||||
mttr_values.append(scenario["mttr"])
|
||||
|
||||
if mttr_values:
|
||||
self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
|
||||
self.results["summary"]["max_mttr"] = max(mttr_values)
|
||||
self.results["summary"]["min_mttr"] = min(mttr_values)
|
||||
|
||||
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
|
||||
self.results["summary"]["successful_scenarios"] = sum(
|
||||
1 for s in self.results["scenarios"] if s.get("mttr") is not None
|
||||
)
|
||||
self.results["summary"]["failed_scenarios"] = (
|
||||
self.results["summary"]["total_scenarios"] -
|
||||
self.results["summary"]["successful_scenarios"]
|
||||
)
|
||||
|
||||
def generate_report(self, output_file: Optional[str] = None):
|
||||
"""Generate a comprehensive chaos test report"""
|
||||
report = {
|
||||
"report_generated": datetime.utcnow().isoformat(),
|
||||
"namespace": self.namespace,
|
||||
"orchestration": self.results,
|
||||
"recommendations": []
|
||||
}
|
||||
|
||||
# Add recommendations based on results
|
||||
if self.results["summary"]["average_mttr"] > 120:
|
||||
report["recommendations"].append(
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
|
||||
)
|
||||
|
||||
if self.results["summary"]["max_mttr"] > 300:
|
||||
report["recommendations"].append(
|
||||
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
|
||||
)
|
||||
|
||||
if self.results["summary"]["failed_scenarios"] > 0:
|
||||
report["recommendations"].append(
|
||||
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
|
||||
)
|
||||
|
||||
# Check for specific scenario issues
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("scenario_name") == "coordinator_outage":
|
||||
if scenario.get("mttr", 0) > 180:
|
||||
report["recommendations"].append(
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "network_partition":
|
||||
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
|
||||
report["recommendations"].append(
|
||||
"High error rate during network partition. Improve error handling."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "database_failure":
|
||||
if scenario.get("failure_type") == "connection":
|
||||
report["recommendations"].append(
|
||||
"Consider implementing database connection pooling and retry logic."
|
||||
)
|
||||
|
||||
# Save report
|
||||
if output_file:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(report, f, indent=2)
|
||||
logger.info(f"Chaos test report saved to: {output_file}")
|
||||
|
||||
# Print summary
|
||||
self.print_summary()
|
||||
|
||||
return report
|
||||
|
||||
def print_summary(self):
|
||||
"""Print a summary of all chaos test results"""
|
||||
print("\n" + "="*60)
|
||||
print("CHAOS TESTING SUMMARY REPORT")
|
||||
print("="*60)
|
||||
|
||||
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
|
||||
print(f"Namespace: {self.namespace}")
|
||||
|
||||
print(f"\nScenario Results:")
|
||||
print("-" * 40)
|
||||
for scenario in self.results["scenarios"]:
|
||||
name = scenario.get("scenario_name", "Unknown")
|
||||
mttr = scenario.get("mttr", "N/A")
|
||||
if mttr != "N/A":
|
||||
mttr = f"{mttr:.2f}s"
|
||||
print(f" {name:20} MTTR: {mttr}")
|
||||
|
||||
print(f"\nSummary Metrics:")
|
||||
print("-" * 40)
|
||||
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
|
||||
print(f" Successful: {self.results['summary']['successful_scenarios']}")
|
||||
print(f" Failed: {self.results['summary']['failed_scenarios']}")
|
||||
|
||||
if self.results["summary"]["average_mttr"] > 0:
|
||||
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
|
||||
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
|
||||
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
|
||||
|
||||
# SLO compliance
|
||||
print(f"\nSLO Compliance:")
|
||||
print("-" * 40)
|
||||
slo_target = 120 # 2 minutes
|
||||
if self.results["summary"]["average_mttr"] <= slo_target:
|
||||
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
|
||||
else:
|
||||
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
|
||||
"""Run all specified chaos test scenarios"""
|
||||
logger.info("Starting chaos testing orchestration")
|
||||
self.results["orchestration_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
for scenario in scenarios:
|
||||
args = scenario_args.get(scenario, [])
|
||||
# Add namespace to all scenarios
|
||||
args.extend(["--namespace", self.namespace])
|
||||
|
||||
result = await self.run_scenario(scenario, args)
|
||||
if result:
|
||||
self.results["scenarios"].append(result)
|
||||
|
||||
self.results["orchestration_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Calculate summary metrics
|
||||
self.calculate_summary_metrics()
|
||||
|
||||
# Generate report
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"chaos_test_report_{timestamp}.json"
|
||||
self.generate_report(report_file)
|
||||
|
||||
logger.info("Chaos testing orchestration completed")
|
||||
|
||||
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
|
||||
"""Run chaos tests continuously over time"""
|
||||
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
|
||||
|
||||
end_time = datetime.now() + timedelta(hours=duration_hours)
|
||||
interval_seconds = interval_minutes * 60
|
||||
|
||||
all_results = []
|
||||
|
||||
while datetime.now() < end_time:
|
||||
cycle_start = datetime.now()
|
||||
logger.info(f"Starting chaos test cycle at {cycle_start}")
|
||||
|
||||
# Run a random scenario
|
||||
scenarios = [
|
||||
"chaos_test_coordinator.py",
|
||||
"chaos_test_network.py",
|
||||
"chaos_test_database.py"
|
||||
]
|
||||
|
||||
import random
|
||||
selected_scenario = random.choice(scenarios)
|
||||
|
||||
# Run scenario with reduced duration for continuous testing
|
||||
args = ["--namespace", self.namespace]
|
||||
if "coordinator" in selected_scenario:
|
||||
args.extend(["--outage-duration", "30", "--load-duration", "60"])
|
||||
elif "network" in selected_scenario:
|
||||
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
|
||||
elif "database" in selected_scenario:
|
||||
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
|
||||
|
||||
result = await self.run_scenario(selected_scenario, args)
|
||||
if result:
|
||||
result["cycle_time"] = cycle_start.isoformat()
|
||||
all_results.append(result)
|
||||
|
||||
# Wait for next cycle
|
||||
elapsed = (datetime.now() - cycle_start).total_seconds()
|
||||
if elapsed < interval_seconds:
|
||||
wait_time = interval_seconds - elapsed
|
||||
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
|
||||
await asyncio.sleep(wait_time)
|
||||
|
||||
# Generate continuous testing report
|
||||
continuous_report = {
|
||||
"continuous_testing": True,
|
||||
"duration_hours": duration_hours,
|
||||
"interval_minutes": interval_minutes,
|
||||
"total_cycles": len(all_results),
|
||||
"cycles": all_results
|
||||
}
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"continuous_chaos_report_{timestamp}.json"
|
||||
with open(report_file, 'w') as f:
|
||||
json.dump(continuous_report, f, indent=2)
|
||||
|
||||
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--scenarios", nargs="+",
|
||||
choices=["coordinator", "network", "database"],
|
||||
default=["coordinator", "network", "database"],
|
||||
help="Scenarios to run")
|
||||
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
|
||||
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
|
||||
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
orchestrator = ChaosOrchestrator(args.namespace)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
|
||||
return
|
||||
|
||||
if args.continuous:
|
||||
await orchestrator.run_continuous_chaos(args.duration, args.interval)
|
||||
else:
|
||||
# Map scenario names to script files
|
||||
scenario_map = {
|
||||
"coordinator": "chaos_test_coordinator.py",
|
||||
"network": "chaos_test_network.py",
|
||||
"database": "chaos_test_database.py"
|
||||
}
|
||||
|
||||
# Get script files
|
||||
scripts = [scenario_map[s] for s in args.scenarios]
|
||||
|
||||
# Default arguments for each scenario
|
||||
scenario_args = {
|
||||
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
|
||||
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
|
||||
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
|
||||
}
|
||||
|
||||
await orchestrator.run_all_scenarios(scripts, scenario_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
infra/scripts/chaos_test_coordinator.py (new executable file, 287 lines)
@@ -0,0 +1,287 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Coordinator API Outage
|
||||
Tests system resilience when coordinator API becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestCoordinator:
|
||||
"""Chaos testing for coordinator API outage scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"outage_start": None,
|
||||
"outage_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "coordinator_outage"
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def delete_coordinator_pods(self) -> bool:
|
||||
"""Delete all coordinator pods to simulate outage"""
|
||||
try:
|
||||
cmd = [
|
||||
"kubectl", "delete", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"--force", "--grace-period=0"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info("Coordinator pods deleted successfully")
|
||||
return True
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to delete coordinator pods: {e}")
|
||||
return False
|
||||
|
||||
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
|
||||
"""Wait for all coordinator pods to terminate"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
logger.info("All coordinator pods terminated")
|
||||
return True
|
||||
await asyncio.sleep(2)
|
||||
|
||||
logger.error("Timeout waiting for pods to terminate")
|
||||
return False
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for coordinator service to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
try:
|
||||
# Check if pods are running
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Check if at least one pod is ready
|
||||
ready_cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(ready_cmd, capture_output=True, text=True)
|
||||
if result.stdout.strip():
|
||||
# Test API health
|
||||
if self.test_health_endpoint():
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Recovery check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Service did not recover within timeout")
|
||||
return False
|
||||
|
||||
def test_health_endpoint(self) -> bool:
|
||||
"""Test if coordinator health endpoint is responding"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
|
||||
"""Run the complete chaos test"""
|
||||
logger.info("Starting coordinator outage chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Generate initial load
|
||||
logger.info("Phase 1: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 2: Induce outage
|
||||
logger.info("Phase 2: Inducing coordinator outage")
|
||||
self.metrics["outage_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.delete_coordinator_pods():
|
||||
logger.error("Failed to induce outage")
|
||||
return False
|
||||
|
||||
if not await self.wait_for_pods_termination():
|
||||
logger.error("Pods did not terminate")
|
||||
return False
|
||||
|
||||
# Wait for specified outage duration
|
||||
logger.info(f"Waiting for {outage_duration} seconds outage duration")
|
||||
await asyncio.sleep(outage_duration)
|
||||
|
||||
# Phase 3: Monitor recovery
|
||||
logger.info("Phase 3: Monitoring service recovery")
|
||||
self.metrics["outage_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("Service did not recover")
|
||||
return False
|
||||
|
||||
# Phase 4: Post-recovery load test
|
||||
logger.info("Phase 4: Post-recovery load test")
|
||||
await self.generate_load(load_duration)
|
||||
|
||||
# Calculate metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_coordinator_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
|
||||
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestCoordinator(args.namespace) as test:
|
||||
success = await test.run_test(args.outage_duration, args.load_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
infra/scripts/chaos_test_database.py (new executable file, 387 lines)
@@ -0,0 +1,387 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Database Failure
|
||||
Tests system resilience when PostgreSQL database becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestDatabase:
|
||||
"""Chaos testing for database failure scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"failure_start": None,
|
||||
"failure_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "database_failure",
|
||||
"failure_type": None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_postgresql_pod(self) -> Optional[str]:
|
||||
"""Get PostgreSQL pod name"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=postgresql",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pod = result.stdout.strip()
|
||||
return pod if pod else None
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get PostgreSQL pod: {e}")
|
||||
return None
|
||||
|
||||
def simulate_database_connection_failure(self) -> bool:
|
||||
"""Simulate database connection failure by blocking port 5432"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Block incoming connections to PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
# Block outgoing connections from PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
|
||||
self.metrics["failure_type"] = "connection_blocked"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block PostgreSQL connections: {e}")
|
||||
return False
|
||||
|
||||
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
|
||||
"""Simulate high database latency using netem"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Add latency to PostgreSQL traffic
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
|
||||
self.metrics["failure_type"] = "high_latency"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to add latency to PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
def restore_database(self) -> bool:
|
||||
"""Restore database connections"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Remove iptables rules
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "INPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False) # May fail if rules don't exist
|
||||
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
# Remove netem qdisc
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "del", "dev", "eth0", "root"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
logger.info(f"Restored PostgreSQL connections on pod {pod}")
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to restore PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
async def test_database_connectivity(self) -> bool:
|
||||
"""Test if coordinator can connect to database"""
|
||||
try:
|
||||
# Get coordinator pod
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
coordinator_pod = result.stdout.strip()
|
||||
|
||||
if not coordinator_pod:
|
||||
return False
|
||||
|
||||
# Test database connection from coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
|
||||
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
return result.returncode == 0 and "OK" in result.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def test_api_health(self) -> bool:
|
||||
"""Test if coordinator API is healthy"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for database and API to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
# Test database connectivity
|
||||
db_connected = await self.test_database_connectivity()
|
||||
|
||||
# Test API health
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if db_connected and api_healthy:
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Database and API did not recover within timeout")
|
||||
return False
|
||||
|
||||
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
|
||||
"""Run the complete database chaos test"""
|
||||
logger.info(f"Starting database chaos test - failure type: {failure_type}")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if not db_connected or not api_healthy:
|
||||
logger.error("Baseline test failed - database or API not healthy")
|
||||
return False
|
||||
|
||||
logger.info("Baseline: Database and API are healthy")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Induce database failure
|
||||
logger.info("Phase 3: Inducing database failure")
|
||||
self.metrics["failure_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if failure_type == "connection":
|
||||
if not self.simulate_database_connection_failure():
|
||||
logger.error("Failed to induce database connection failure")
|
||||
return False
|
||||
elif failure_type == "latency":
|
||||
if not self.simulate_database_high_latency():
|
||||
logger.error("Failed to induce database latency")
|
||||
return False
|
||||
else:
|
||||
logger.error(f"Unknown failure type: {failure_type}")
|
||||
return False
|
||||
|
||||
# Verify failure is effective
|
||||
await asyncio.sleep(5)
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
|
||||
|
||||
# Phase 4: Monitor during failure
|
||||
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
|
||||
|
||||
# Generate load during failure
|
||||
await self.generate_load(failure_duration)
|
||||
|
||||
# Phase 5: Restore database and monitor recovery
|
||||
logger.info("Phase 5: Restoring database")
|
||||
self.metrics["failure_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.restore_database():
|
||||
logger.error("Failed to restore database")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("System did not recover after database restoration")
|
||||
return False
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Database chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_database_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Failure Type: {self.metrics['failure_type']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for database failure")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
|
||||
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestDatabase(args.namespace) as test:
|
||||
success = await test.run_test(args.failure_type, args.failure_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
372
infra/scripts/chaos_test_network.py
Executable file
@ -0,0 +1,372 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Network Partition
|
||||
Tests system resilience when blockchain nodes experience network partitions
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestNetwork:
|
||||
"""Chaos testing for network partition scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"partition_start": None,
|
||||
"partition_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "network_partition",
|
||||
"affected_nodes": []
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_blockchain_pods(self) -> List[str]:
|
||||
"""Get list of blockchain node pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=blockchain-node",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get blockchain pods: {e}")
|
||||
return []
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
|
||||
"""Apply network partition using iptables"""
|
||||
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
|
||||
|
||||
for pod in pods:
|
||||
if pod in target_pods:
|
||||
continue
|
||||
|
||||
# Block traffic from this pod to target pods
|
||||
for target_pod in target_pods:
|
||||
try:
|
||||
# Get target pod IP
|
||||
cmd = [
|
||||
"kubectl", "get", "pod", target_pod,
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.status.podIP}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
target_ip = result.stdout.strip()
|
||||
|
||||
if not target_ip:
|
||||
continue
|
||||
|
||||
# Apply iptables rule to block traffic
|
||||
iptables_cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
|
||||
]
|
||||
subprocess.run(iptables_cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
|
||||
return False
|
||||
|
||||
self.metrics["affected_nodes"] = pods + target_pods
|
||||
return True
|
||||
|
||||
def remove_network_partition(self, pods: List[str]) -> bool:
|
||||
"""Remove network partition rules"""
|
||||
logger.info("Removing network partition rules")
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Flush OUTPUT chain (remove all rules)
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info(f"Removed network rules from {pod}")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to remove network rules from {pod}: {e}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
|
||||
"""Test connectivity between pods"""
|
||||
results = {}
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Test if pod can reach coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
results[pod] = result.returncode == 0 and "ok" in result.stdout
|
||||
|
||||
except Exception:
|
||||
results[pod] = False
|
||||
|
||||
return results
|
||||
|
||||
async def monitor_consensus(self, duration: int = 60) -> bool:
|
||||
"""Monitor blockchain consensus health"""
|
||||
logger.info(f"Monitoring consensus for {duration} seconds")
|
||||
|
||||
start_time = time.time()
|
||||
last_height = 0
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
try:
|
||||
# Get block height from a random pod
|
||||
pods = self.get_blockchain_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Use first pod to check height
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
|
||||
"curl", "-s", "http://localhost:8080/v1/blocks/head"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
current_height = data.get("height", 0)
|
||||
|
||||
# Check if blockchain is progressing
|
||||
if current_height > last_height:
|
||||
last_height = current_height
|
||||
logger.info(f"Blockchain progressing, height: {current_height}")
|
||||
elif time.time() - start_time > 30: # Allow 30s for initial sync
|
||||
logger.warning(f"Blockchain stuck at height {current_height}")
|
||||
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Consensus check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
return last_height > 0
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 5):
|
||||
"""Generate synthetic load on blockchain nodes"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "blockchain-node",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
|
||||
"""Run the complete network partition chaos test"""
|
||||
logger.info("Starting network partition chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Get all blockchain pods
|
||||
all_pods = self.get_blockchain_pods()
|
||||
if not all_pods:
|
||||
logger.error("No blockchain pods found")
|
||||
return False
|
||||
|
||||
# Determine which pods to partition
|
||||
num_partition = int(len(all_pods) * partition_ratio)
|
||||
partition_pods = all_pods[:num_partition]
|
||||
remaining_pods = all_pods[num_partition:]
|
||||
|
||||
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
baseline_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Apply network partition
|
||||
logger.info("Phase 3: Applying network partition")
|
||||
self.metrics["partition_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.apply_network_partition(remaining_pods, partition_pods):
|
||||
logger.error("Failed to apply network partition")
|
||||
return False
|
||||
|
||||
# Verify partition is effective
|
||||
await asyncio.sleep(5)
|
||||
partitioned_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 4: Monitor during partition
|
||||
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
|
||||
consensus_healthy = await self.monitor_consensus(partition_duration)
|
||||
|
||||
# Phase 5: Remove partition and monitor recovery
|
||||
logger.info("Phase 5: Removing network partition")
|
||||
self.metrics["partition_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.remove_network_partition(all_pods):
|
||||
logger.error("Failed to remove network partition")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
logger.info("Waiting for network recovery...")
|
||||
await asyncio.sleep(10)
|
||||
|
||||
# Test connectivity after recovery
|
||||
recovery_connectivity = await self.test_connectivity(all_pods)
|
||||
recovery_time = time.time()
|
||||
|
||||
# Calculate recovery metrics
|
||||
all_connected = all(recovery_connectivity.values())
|
||||
if all_connected:
|
||||
self.metrics["recovery_time"] = recovery_time - (datetime.fromisoformat(self.metrics["partition_end"]).timestamp())
|
||||
logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Network partition chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_network_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for network partition")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
|
||||
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestNetwork(args.namespace) as test:
|
||||
success = await test.run_test(args.partition_duration, args.partition_ratio)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
279
infra/scripts/restore_ledger.sh
Normal file
@ -0,0 +1,279 @@
|
||||
#!/bin/bash
|
||||
# Ledger Storage Restore Script for AITBC
|
||||
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
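# Example invocation (backup path is illustrative):
#   ./restore_ledger.sh default /tmp/ledger-backups/ledger-20250101_120000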
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_DIR=${2:-}
|
||||
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v jq &> /dev/null; then
|
||||
error "jq is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup directory
|
||||
validate_backup_dir() {
|
||||
if [[ -z "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_directory]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -d "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory not found: $BACKUP_DIR"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for required files
|
||||
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
|
||||
error "metadata.json not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
error "chain.tar.gz not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Using backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Create backup of current ledger before restore
|
||||
create_pre_restore_backup() {
|
||||
local pods=($1)
|
||||
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
|
||||
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
|
||||
|
||||
warn "Creating backup of current ledger before restore..."
|
||||
mkdir -p "$pre_restore_dir"
|
||||
|
||||
# Use the first ready pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
# Create metadata
|
||||
cat > "$pre_restore_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$pre_restore_backup",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $current_height,
|
||||
"backup_type": "pre-restore"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Backup data directories
|
||||
local data_dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${data_dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
|
||||
fi
|
||||
done
|
||||
|
||||
log "Pre-restore backup created: $pre_restore_dir"
|
||||
break
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pods=($1)
|
||||
|
||||
warn "This will replace all current ledger data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down blockchain nodes
|
||||
info "Scaling down blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up blockchain nodes
|
||||
info "Scaling up blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=3 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to be ready
|
||||
local ready_pods=()
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
|
||||
local all_pods=$(get_blockchain_pods)
|
||||
for pod in $all_pods; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
ready_pods+=("$pod")
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
sleep 5
|
||||
((retries--))
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
error "No blockchain nodes became ready after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Restore data to all ready pods
|
||||
for pod in "${ready_pods[@]}"; do
|
||||
info "Restoring ledger data to pod $pod..."
|
||||
|
||||
# Create temp directory on pod
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
|
||||
|
||||
# Extract and copy chain data
|
||||
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy wallet data
|
||||
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy receipt data
|
||||
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
|
||||
|
||||
# Clean up temp directory
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
|
||||
|
||||
log "Ledger data restored to pod $pod"
|
||||
done
|
||||
|
||||
log "Ledger restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pods=($1)
|
||||
|
||||
log "Verifying ledger restore..."
|
||||
|
||||
# Read backup metadata
|
||||
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
|
||||
log "Backup contains blocks up to height: $backup_height"
|
||||
|
||||
# Verify on each pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Check if node is responding
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ "$current_height" -eq "$backup_height" ]]; then
|
||||
log "✓ Pod $pod: Block height matches backup ($current_height)"
|
||||
else
|
||||
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
|
||||
fi
|
||||
|
||||
# Check data directories
|
||||
local dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
|
||||
log "✓ Pod $pod: $dir directory contains $file_count files"
|
||||
else
|
||||
warn "⚠ Pod $pod: $dir directory not found"
|
||||
fi
|
||||
done
|
||||
else
|
||||
error "✗ Pod $pod: Not responding to health checks"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
create_pre_restore_backup "${pods[*]}"
|
||||
perform_restore "${pods[*]}"
|
||||
|
||||
# Get updated pod list after restore
|
||||
pods=($(get_blockchain_pods))
|
||||
verify_restore "${pods[*]}"
|
||||
|
||||
log "Ledger restore process completed successfully"
|
||||
warn "Please verify blockchain synchronization and application functionality"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
228
infra/scripts/restore_postgresql.sh
Executable file
@ -0,0 +1,228 @@
|
||||
#!/bin/bash
|
||||
# PostgreSQL Restore Script for AITBC
|
||||
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
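# Example invocation (backup file name is illustrative):
#   ./restore_postgresql.sh default /tmp/postgresql-backups/aitbc-20250101_120000.dump.gz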
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_restore &> /dev/null; then
|
||||
error "pg_restore is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check if file is gzipped and decompress if needed
|
||||
if [[ "$BACKUP_FILE" == *.gz ]]; then
|
||||
info "Decompressing backup file..."
|
||||
local decompressed="/tmp/restore_$(date +%s).dump"
gunzip -c "$BACKUP_FILE" > "$decompressed"
BACKUP_FILE="$decompressed"
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Create backup of current database before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current database before restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Create backup
|
||||
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
|
||||
pg_dump -U "$db_user" -h localhost -d "$db_name" \
|
||||
--format=custom --file="/tmp/${pre_restore_backup}.dump"
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace the current database. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Copy backup file to pod
|
||||
local remote_backup="/tmp/restore_$(date +%s).dump"
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
|
||||
|
||||
# Drop existing database and recreate
|
||||
log "Dropping existing database..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"
|
||||
|
||||
log "Creating new database..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"
|
||||
|
||||
# Restore database
|
||||
log "Restoring database from backup..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" pg_restore -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists "$remote_backup"
|
||||
|
||||
# Clean up remote file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
|
||||
|
||||
log "Database restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying database restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Check table count
|
||||
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
|
||||
|
||||
log "Database contains $table_count tables"
|
||||
|
||||
# Check if key tables exist
|
||||
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
|
||||
for table in "${key_tables[@]}"; do
|
||||
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
|
||||
if [[ "$exists" == "t" ]]; then
|
||||
log "✓ Table $table exists"
|
||||
else
|
||||
warn "⚠ Table $table not found"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
verify_restore "$pod"
|
||||
|
||||
log "PostgreSQL restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
223
infra/scripts/restore_redis.sh
Normal file
@ -0,0 +1,223 @@
|
||||
#!/bin/bash
|
||||
# Redis Restore Script for AITBC
|
||||
# Usage: ./restore_redis.sh [namespace] [backup_file]
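# Example invocation (backup file name is illustrative):
#   ./restore_redis.sh default /tmp/redis-backups/redis-20250101_120000.rdb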
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Create backup of current Redis data before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current Redis data before restore..."
|
||||
|
||||
# Record the last completed save, then trigger a background save
local last_save_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

# Wait for the background save to complete (LASTSAVE advances once BGSAVE finishes)
local retries=60
while [[ $retries -gt 0 ]]; do
local last_save_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$last_save_now" -gt "$last_save_before" ]]; then
break
fi
sleep 2
((retries--))
done
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"
|
||||
|
||||
# Also backup AOF if exists
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
|
||||
fi
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace all current Redis data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down Redis to ensure clean restore
|
||||
info "Scaling down Redis deployment..."
|
||||
kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pod to terminate
|
||||
kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up Redis
|
||||
info "Scaling up Redis deployment..."
|
||||
kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"
|
||||
|
||||
# Wait for new pod to be ready
|
||||
local new_pod=$(get_redis_pod)
|
||||
kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Stop Redis server
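# Note: if redis-server runs as PID 1 in this container, SHUTDOWN causes Kubernetes
# to restart the container; this flow assumes /data is a persistent volume that
# survives the restart.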
|
||||
info "Stopping Redis server..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE
|
||||
|
||||
# Clear existing data
|
||||
info "Clearing existing Redis data..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof
|
||||
|
||||
# Copy backup file
|
||||
info "Copying backup file..."
|
||||
local remote_file="/data/dump.rdb"  # Redis loads dump.rdb from the data dir on startup
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"
|
||||
|
||||
# Start Redis server
|
||||
info "Starting Redis server..."
|
||||
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes
|
||||
|
||||
# Wait for Redis to be ready
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Redis did not start properly after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Redis restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying Redis restore..."
|
||||
|
||||
# Check database size
|
||||
local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
|
||||
log "Database contains $db_size keys"
|
||||
|
||||
# Check memory usage
|
||||
local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
|
||||
log "Memory usage: $memory"
|
||||
|
||||
# Check if Redis is responding to commands
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "✓ Redis is responding normally"
|
||||
else
|
||||
error "✗ Redis is not responding"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
|
||||
# Get new pod name after restore
|
||||
pod=$(get_redis_pod)
|
||||
verify_restore "$pod"
|
||||
|
||||
log "Redis restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||