Files
aitbc/infra/scripts/README_chaos.md
oib c8be9d7414 feat: add marketplace metrics, privacy features, and service registry endpoints
- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru
2025-12-22 10:33:23 +01:00

8.1 KiB

AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

Overview

The chaos testing framework simulates real-world failure scenarios to:

  • Test system resilience under adverse conditions
  • Measure Mean-Time-To-Recovery (MTTR) metrics
  • Identify single points of failure
  • Validate recovery procedures
  • Ensure SLO compliance

Components

Test Scripts

  1. chaos_test_coordinator.py - Coordinator API outage simulation

    • Deletes coordinator pods to simulate complete service outage
    • Measures recovery time and service availability
    • Tests load handling during and after recovery
  2. chaos_test_network.py - Network partition simulation

    • Creates network partitions between blockchain nodes
    • Tests consensus resilience during partition
    • Measures network recovery time
  3. chaos_test_database.py - Database failure simulation

    • Simulates PostgreSQL connection failures
    • Tests high latency scenarios
    • Validates application error handling
  4. chaos_orchestrator.py - Test orchestration and reporting

    • Runs multiple chaos test scenarios
    • Aggregates MTTR metrics across tests
    • Generates comprehensive reports
    • Supports continuous chaos testing

Prerequisites

  • Python 3.8+
  • kubectl configured with cluster access
  • Helm charts deployed in target namespace
  • Administrative privileges for network manipulation

Installation

# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py

Usage

Running Individual Tests

Coordinator Outage Test

# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run

Network Partition Test

# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3

Database Failure Test

# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency

Running All Tests

# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60

Test Scenarios

1. Coordinator API Outage

Objective: Test system resilience when the coordinator service becomes unavailable.

Steps:

  1. Generate baseline load on coordinator API
  2. Delete all coordinator pods
  3. Wait for specified outage duration
  4. Monitor service recovery
  5. Generate post-recovery load

Metrics Collected:

  • MTTR (Mean-Time-To-Recovery)
  • Success/error request counts
  • Recovery time distribution

2. Network Partition

Objective: Test blockchain consensus during network partitions.

Steps:

  1. Identify blockchain node pods
  2. Apply iptables rules to partition nodes
  3. Monitor consensus during partition
  4. Remove network partition
  5. Verify network recovery

Metrics Collected:

  • Network recovery time
  • Consensus health during partition
  • Node connectivity status

3. Database Failure

Objective: Test application behavior when database is unavailable.

Steps:

  1. Simulate database connection failure or high latency
  2. Monitor API behavior during failure
  3. Restore database connectivity
  4. Verify application recovery

Metrics Collected:

  • Database recovery time
  • API error rates during failure
  • Application resilience metrics

Results and Reporting

Test Results Format

Each test generates a JSON results file with the following structure:

{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}

Orchestrator Report

The orchestrator generates a comprehensive report including:

  • Summary metrics across all scenarios
  • SLO compliance analysis
  • Recommendations for improvements
  • MTTR trends and statistics

Example report snippet:

{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}

SLO Targets

Metric Target Current
MTTR (Average) ≤ 120 seconds TBD
MTTR (Maximum) ≤ 300 seconds TBD
Success Rate ≥ 99.9% TBD

Best Practices

Before Running Tests

  1. Backup Critical Data: Ensure recent backups are available
  2. Notify Team: Inform stakeholders about chaos testing
  3. Check Cluster Health: Verify all components are healthy
  4. Schedule Appropriately: Run during low-traffic periods

During Tests

  1. Monitor Logs: Watch for unexpected errors
  2. Have Rollback Plan: Be ready to manually intervene
  3. Document Observations: Note any unusual behavior
  4. Stop if Critical: Abort tests if production is impacted

After Tests

  1. Review Results: Analyze MTTR and error rates
  2. Update Documentation: Record findings and improvements
  3. Address Issues: Fix any discovered problems
  4. Schedule Follow-up: Plan regular chaos testing

Integration with CI/CD

GitHub Actions Example

name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: chaos-results
          path: "*.json"

Troubleshooting

Common Issues

  1. kubectl not found

    # Ensure kubectl is installed and configured
    which kubectl
    kubectl version
    
  2. Permission denied errors

    # Check RBAC permissions
    kubectl auth can-i create pods --namespace default
    kubectl auth can-i exec pods --namespace default
    
  3. Network rules not applying

    # Check if iptables is available in pods
    kubectl exec -it <pod> -- iptables -L
    
  4. Tests hanging

    # Check pod status
    kubectl get pods --namespace default
    kubectl describe pod <pod-name> --namespace default
    

Debug Mode

Enable debug logging:

export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log

Contributing

To add new chaos test scenarios:

  1. Create a new script following the naming pattern chaos_test_<scenario>.py
  2. Implement the required methods: run_test(), save_results()
  3. Add the scenario to chaos_orchestrator.py
  4. Update documentation

Security Considerations

  • Chaos tests require elevated privileges
  • Only run in authorized environments
  • Ensure test isolation from production data
  • Review network rules before deployment
  • Monitor for security violations during tests

Support

For issues or questions:

  • Check the troubleshooting section
  • Review test logs for error details
  • Contact the DevOps team at devops@aitbc.io

License

This chaos testing framework is part of the AITBC project and follows the same license terms.