
AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

Overview

The chaos testing framework simulates real-world failure scenarios to:

  • Test system resilience under adverse conditions
  • Measure Mean-Time-To-Recovery (MTTR) metrics
  • Identify single points of failure
  • Validate recovery procedures
  • Ensure SLO compliance

Components

Test Scripts

  1. chaos_test_coordinator.py - Coordinator API outage simulation

    • Deletes coordinator pods to simulate complete service outage
    • Measures recovery time and service availability
    • Tests load handling during and after recovery
  2. chaos_test_network.py - Network partition simulation

    • Creates network partitions between blockchain nodes
    • Tests consensus resilience during partition
    • Measures network recovery time
  3. chaos_test_database.py - Database failure simulation

    • Simulates PostgreSQL connection failures
    • Tests high latency scenarios
    • Validates application error handling
  4. chaos_orchestrator.py - Test orchestration and reporting

    • Runs multiple chaos test scenarios
    • Aggregates MTTR metrics across tests
    • Generates comprehensive reports
    • Supports continuous chaos testing

Prerequisites

  • Python 3.8+
  • kubectl configured with cluster access
  • Helm charts deployed in target namespace
  • Administrative privileges for network manipulation

Installation

# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py

Usage

Running Individual Tests

Coordinator Outage Test

# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run

Network Partition Test

# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3

Database Failure Test

# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency

Running All Tests

# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
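The continuous mode can be sketched as a simple scheduler loop. The `run_scenario`, `now`, and `sleep` hooks below are hypothetical stand-ins for the orchestrator's internals, injectable so the loop can be tested without waiting hours:

```python
import time

def continuous_chaos(run_scenario, scenarios, duration_hours, interval_minutes,
                     now=time.time, sleep=time.sleep):
    """Run every scenario each cycle, sleeping `interval_minutes` between
    cycles, until `duration_hours` have elapsed. Returns the executed
    (timestamp, scenario) pairs."""
    runs = []
    deadline = now() + duration_hours * 3600
    while now() < deadline:
        for name in scenarios:
            runs.append((now(), name))
            run_scenario(name)
        sleep(interval_minutes * 60)
    return runs
```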

Test Scenarios

1. Coordinator API Outage

Objective: Test system resilience when the coordinator service becomes unavailable.

Steps:

  1. Generate baseline load on coordinator API
  2. Delete all coordinator pods
  3. Wait for specified outage duration
  4. Monitor service recovery
  5. Generate post-recovery load

Metrics Collected:

  • MTTR (Mean-Time-To-Recovery)
  • Success/error request counts
  • Recovery time distribution
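MTTR measurement reduces to polling the service until it answers again. A minimal sketch, where `health_check` stands in for an HTTP probe of the coordinator API and the clock hooks are injectable for testing:

```python
import time

def measure_mttr(health_check, poll_interval=1.0, timeout=300.0,
                 now=time.monotonic, sleep=time.sleep):
    """Poll `health_check` (returns True once the service is up) and
    return the seconds elapsed until recovery, or None on timeout."""
    start = now()
    while now() - start < timeout:
        if health_check():
            return now() - start
        sleep(poll_interval)
    return None
```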

2. Network Partition

Objective: Test blockchain consensus during network partitions.

Steps:

  1. Identify blockchain node pods
  2. Apply iptables rules to partition nodes
  3. Monitor consensus during partition
  4. Remove network partition
  5. Verify network recovery

Metrics Collected:

  • Network recovery time
  • Consensus health during partition
  • Node connectivity status
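The partition step amounts to running `iptables` inside the targeted pods via `kubectl exec`. A sketch of how those command lines might be assembled; the pod names, IPs, and the choice of the OUTPUT chain are illustrative assumptions, not the script's exact rules:

```python
def partition_commands(namespace, isolated_pods, peer_ips, remove=False):
    """Build kubectl exec argument lists that DROP (-A) or restore (-D)
    traffic from each isolated pod to every peer IP."""
    action = "-D" if remove else "-A"
    cmds = []
    for pod in isolated_pods:
        for ip in peer_ips:
            cmds.append(
                ["kubectl", "exec", "-n", namespace, pod, "--",
                 "iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"]
            )
    return cmds
```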

3. Database Failure

Objective: Test application behavior when the database is unavailable.

Steps:

  1. Simulate database connection failure or high latency
  2. Monitor API behavior during failure
  3. Restore database connectivity
  4. Verify application recovery

Metrics Collected:

  • Database recovery time
  • API error rates during failure
  • Application resilience metrics
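API error rates during the failure window can be collected by firing a fixed batch of requests and counting failures. A sketch, where `send_request` is a hypothetical callable returning an HTTP status code (or raising on connection failure):

```python
def measure_error_rate(send_request, n_requests=100):
    """Issue n requests; count a raised exception or a status >= 500 as an
    error. Returns (success_count, error_count, error_rate)."""
    success = errors = 0
    for _ in range(n_requests):
        try:
            status = send_request()
        except Exception:
            errors += 1
            continue
        if status >= 500:
            errors += 1
        else:
            success += 1
    return success, errors, errors / n_requests
```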

Results and Reporting

Test Results Format

Each test generates a JSON results file with the following structure:

{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
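A consumer of these files can validate the fields and derive a request success rate. A minimal sketch (the field set mirrors the example above; `summarize_result` is not part of the framework):

```python
REQUIRED_KEYS = {"test_start", "test_end", "scenario", "mttr",
                 "error_count", "success_count", "recovery_time"}

def summarize_result(result):
    """Validate a per-test results dict and derive the request success rate."""
    missing = REQUIRED_KEYS - set(result)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    total = result["success_count"] + result["error_count"]
    return {
        "scenario": result["scenario"],
        "mttr": result["mttr"],
        "success_rate": result["success_count"] / total if total else 0.0,
    }
```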

Orchestrator Report

The orchestrator generates a comprehensive report including:

  • Summary metrics across all scenarios
  • SLO compliance analysis
  • Recommendations for improvements
  • MTTR trends and statistics

Example report snippet:

{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 135.7,
    "max_mttr": 210.4,
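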
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
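The summary block above can be produced by folding the per-scenario results. A sketch of that aggregation, assuming each result dict carries an `mttr` field that is None when the scenario failed to recover:

```python
def aggregate_report(results):
    """Aggregate per-scenario result dicts into the orchestrator's
    summary shape; scenarios without an MTTR count as unsuccessful."""
    mttrs = [r["mttr"] for r in results if r.get("mttr") is not None]
    return {
        "total_scenarios": len(results),
        "successful_scenarios": len(mttrs),
        "average_mttr": sum(mttrs) / len(mttrs) if mttrs else None,
        "max_mttr": max(mttrs) if mttrs else None,
        "min_mttr": min(mttrs) if mttrs else None,
    }
```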

SLO Targets

Metric            Target           Current
MTTR (Average)    ≤ 120 seconds    TBD
MTTR (Maximum)    ≤ 300 seconds    TBD
Success Rate      ≥ 99.9%          TBD
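Checking a run against these targets is a simple comparison per metric. A sketch (the threshold names are illustrative, not the framework's configuration keys):

```python
SLO = {"avg_mttr_s": 120.0, "max_mttr_s": 300.0, "min_success_rate": 0.999}

def check_slo(avg_mttr, max_mttr, success_rate, slo=SLO):
    """Return a pass/fail flag per SLO target."""
    return {
        "avg_mttr": avg_mttr <= slo["avg_mttr_s"],
        "max_mttr": max_mttr <= slo["max_mttr_s"],
        "success_rate": success_rate >= slo["min_success_rate"],
    }
```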

Best Practices

Before Running Tests

  1. Backup Critical Data: Ensure recent backups are available
  2. Notify Team: Inform stakeholders about chaos testing
  3. Check Cluster Health: Verify all components are healthy
  4. Schedule Appropriately: Run during low-traffic periods

During Tests

  1. Monitor Logs: Watch for unexpected errors
  2. Have Rollback Plan: Be ready to manually intervene
  3. Document Observations: Note any unusual behavior
  4. Stop if Critical: Abort tests if production is impacted

After Tests

  1. Review Results: Analyze MTTR and error rates
  2. Update Documentation: Record findings and improvements
  3. Address Issues: Fix any discovered problems
  4. Schedule Follow-up: Plan regular chaos testing

Integration with CI/CD

GitHub Actions Example

name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: "*.json"

Troubleshooting

Common Issues

  1. kubectl not found

    # Ensure kubectl is installed and configured
    which kubectl
    kubectl version
    
  2. Permission denied errors

    # Check RBAC permissions
    kubectl auth can-i create pods --namespace default
    kubectl auth can-i exec pods --namespace default
    
  3. Network rules not applying

    # Check if iptables is available in pods
    kubectl exec -it <pod> -- iptables -L
    
  4. Tests hanging

    # Check pod status
    kubectl get pods --namespace default
    kubectl describe pod <pod-name> --namespace default
    

Debug Mode

Enable debug logging:

export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log

Contributing

To add new chaos test scenarios:

  1. Create a new script following the naming pattern chaos_test_<scenario>.py
  2. Implement the required methods: run_test(), save_results()
  3. Add the scenario to chaos_orchestrator.py
  4. Update documentation
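A new scenario can start from a skeleton like the one below. The class shape is a sketch; match the actual interface expected by chaos_orchestrator.py:

```python
import json
import time

class ChaosTestExample:
    """Skeleton for a chaos_test_<scenario>.py script."""

    def __init__(self, namespace="default"):
        self.namespace = namespace
        self.results = {"scenario": "example", "mttr": None,
                        "error_count": 0, "success_count": 0}

    def run_test(self):
        self.results["test_start"] = time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                                   time.gmtime())
        # 1. inject the failure
        # 2. measure recovery (populate mttr, error/success counts)
        # 3. restore normal operation
        self.results["test_end"] = time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                                 time.gmtime())
        return self.results

    def save_results(self, path="chaos_example_results.json"):
        with open(path, "w") as fh:
            json.dump(self.results, fh, indent=2)
```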

Security Considerations

  • Chaos tests require elevated privileges
  • Only run in authorized environments
  • Ensure test isolation from production data
  • Review network rules before deployment
  • Monitor for security violations during tests

Support

For issues or questions:

  • Check the troubleshooting section
  • Review test logs for error details
  • Contact the DevOps team at devops@aitbc.io

License

This chaos testing framework is part of the AITBC project and follows the same license terms.