oib/aitbc

oib f97ace74cb chore: clean up stray .md files — delete junk, move chaos doc

- Delete apps/explorer-web/test-results/ (stale Playwright snapshots)
- Delete scripts/dev/dev-utils-README.md (orphaned from deleted dev-utils/)
- Move infra/scripts/README_chaos.md → docs/9_security/3_chaos-testing.md

2026-02-13 23:37:04 +01:00

AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

Overview

The chaos testing framework simulates real-world failure scenarios to:

Test system resilience under adverse conditions
Measure Mean-Time-To-Recovery (MTTR) metrics
Identify single points of failure
Validate recovery procedures
Ensure SLO compliance

Components

Test Scripts

chaos_test_coordinator.py - Coordinator API outage simulation
- Deletes coordinator pods to simulate complete service outage
- Measures recovery time and service availability
- Tests load handling during and after recovery
chaos_test_network.py - Network partition simulation
- Creates network partitions between blockchain nodes
- Tests consensus resilience during partition
- Measures network recovery time
chaos_test_database.py - Database failure simulation
- Simulates PostgreSQL connection failures
- Tests high latency scenarios
- Validates application error handling
chaos_orchestrator.py - Test orchestration and reporting
- Runs multiple chaos test scenarios
- Aggregates MTTR metrics across tests
- Generates comprehensive reports
- Supports continuous chaos testing

Prerequisites

Python 3.8+
kubectl configured with cluster access
Helm charts deployed in target namespace
Administrative privileges for network manipulation
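
The first two prerequisites can be checked before any chaos is injected. The following is a minimal sketch, not part of the framework: the function name `check_prerequisites` is hypothetical, and it does not verify cluster access, Helm deployments, or network privileges.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8), which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `which` is injectable so the check can be exercised without kubectl.
    """
    problems = []
    # Compare the running interpreter against the documented minimum.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # shutil.which returns None when the executable is not on PATH.
    if which("kubectl") is None:
        problems.append("kubectl was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

An empty result only means the local tooling is present; whether kubectl can actually reach the cluster still has to be confirmed separately.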

Installation

# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py

Usage

Running Individual Tests

Coordinator Outage Test

# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run

Network Partition Test

# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
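
To make the effect of --partition-ratio concrete, the sketch below splits a list of pod names into an isolated group and a remaining group. How chaos_test_network.py actually selects and rounds is not documented here, so the function name `split_partition`, the round-down rule, and the "at least one pod" rule are all assumptions.

```python
import math


def split_partition(pods, ratio):
    """Split pod names into (isolated, remaining) groups.

    Assumption: the isolated count is rounded down, but never below one.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1 (exclusive)")
    if len(pods) < 2:
        raise ValueError("need at least two pods to form a partition")
    count = max(1, math.floor(len(pods) * ratio))
    return pods[:count], pods[count:]


if __name__ == "__main__":
    nodes = ["node-0", "node-1", "node-2", "node-3"]
    isolated, remaining = split_partition(nodes, 0.5)
    print("isolated:", isolated)
    print("remaining:", remaining)
```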

Database Failure Test

# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency

Running All Tests

# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
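
As a rough guide to how many rounds a continuous run produces, the sketch below counts the rounds that start inside the window. It assumes --duration is in hours and --interval in minutes, as the comment above suggests, and that the first round starts immediately; the orchestrator's real scheduling may differ, and `planned_runs` is a hypothetical name.

```python
def planned_runs(duration_hours, interval_minutes):
    """Count rounds starting at 0, interval, 2*interval, ... within the window."""
    if duration_hours <= 0 or interval_minutes <= 0:
        raise ValueError("duration and interval must be positive")
    total_minutes = duration_hours * 60
    # Ceiling division: a round starts at every multiple of the interval
    # that is strictly less than the total duration.
    return -(-total_minutes // interval_minutes)


if __name__ == "__main__":
    print(planned_runs(24, 60))
```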

Test Scenarios

1. Coordinator API Outage

Objective: Test system resilience when the coordinator service becomes unavailable.

Steps:

1. Generate baseline load on coordinator API
2. Delete all coordinator pods
3. Wait for specified outage duration
4. Monitor service recovery
5. Generate post-recovery load

Metrics Collected:

MTTR (Mean-Time-To-Recovery)
Success/error request counts
Recovery time distribution
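
Recovery time for a single outage can be measured by polling a health check until it succeeds. This is a minimal sketch under stated assumptions: `measure_recovery` and the `is_healthy` callable are hypothetical names, and the real script may probe the coordinator differently (for example over HTTP with aiohttp).

```python
import time


def measure_recovery(is_healthy, timeout=300.0, poll_interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True and return elapsed seconds.

    Returns None if the service has not recovered within `timeout`.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

A monotonic clock is used so that wall-clock adjustments during the outage do not distort the measurement. Averaging the returned values over several outages gives the MTTR.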

2. Network Partition

Objective: Test blockchain consensus during network partitions.

Steps:

1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during partition
4. Remove network partition
5. Verify network recovery

Metrics Collected:

Network recovery time
Consensus health during partition
Node connectivity status
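
The partition step can be pictured as a set of iptables DROP rules applied inside each isolated pod. The sketch below only builds the kubectl command lines and does not run them. The exact rules used by chaos_test_network.py are not shown in this document, so the rule layout and the name `partition_commands` are assumptions; running iptables inside a pod also typically requires the NET_ADMIN capability.

```python
def partition_commands(isolated, others, namespace="default", undo=False):
    """Build kubectl argv lists that drop traffic between two pod groups.

    isolated / others: dicts mapping pod name -> pod IP.
    With undo=True the same rules are deleted (-D) instead of appended (-A).
    """
    action = "-D" if undo else "-A"
    commands = []
    for pod in isolated:
        for peer_ip in others.values():
            # Block both directions from the isolated pod's point of view.
            for chain, selector in (("INPUT", "-s"), ("OUTPUT", "-d")):
                commands.append([
                    "kubectl", "exec", pod, "--namespace", namespace, "--",
                    "iptables", action, chain, selector, peer_ip, "-j", "DROP",
                ])
    return commands


if __name__ == "__main__":
    for argv in partition_commands({"node-0": "10.0.0.1"}, {"node-1": "10.0.0.2"}):
        print(" ".join(argv))
```

Generating the undo commands from the same function keeps the "remove network partition" step symmetrical with the "apply" step, which reduces the chance of leaving stale rules behind.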

3. Database Failure

Objective: Test application behavior when database is unavailable.

Steps:

1. Simulate database connection failure or high latency
2. Monitor API behavior during failure
3. Restore database connectivity
4. Verify application recovery

Metrics Collected:

Database recovery time
API error rates during failure
Application resilience metrics
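
Two of the metrics above can be derived from a series of timestamped API probes. The sketch below is illustrative only: the sample format, the key names, and the definition of recovery as "first successful probe after connectivity is restored" are assumptions, not the behavior of chaos_test_database.py.

```python
def summarize_probes(samples, restored_at):
    """Summarize API probes taken around a database failure.

    samples: list of (timestamp_seconds, ok) tuples in time order.
    restored_at: timestamp at which database connectivity was restored.
    """
    # Error rate is computed over probes taken while the failure was active.
    during = [ok for ts, ok in samples if ts < restored_at]
    error_rate = during.count(False) / len(during) if during else 0.0

    # Recovery time is the delay until the first success after restoration.
    recovery_time = None
    for ts, ok in samples:
        if ts >= restored_at and ok:
            recovery_time = ts - restored_at
            break

    return {
        "error_rate_during_failure": error_rate,
        "recovery_time": recovery_time,
    }
```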

Results and Reporting

Test Results Format

Each test generates a JSON results file with the following structure:

{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
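
A results file in this shape can be post-processed with the standard library. The sketch below derives the test duration and success rate from the fields shown above; `summarize_result` is a hypothetical helper, not something the framework ships.

```python
import json
from datetime import datetime


def _parse_utc(value):
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
    # so it is rewritten to an explicit offset for older interpreters.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_result(raw_json):
    """Derive duration and success rate from one results document."""
    result = json.loads(raw_json)
    duration = _parse_utc(result["test_end"]) - _parse_utc(result["test_start"])
    total = result["success_count"] + result["error_count"]
    return {
        "scenario": result["scenario"],
        "duration_seconds": duration.total_seconds(),
        "success_rate": result["success_count"] / total if total else None,
        "mttr": result["mttr"],
    }
```

Applied to the example above, this yields a duration of 300 seconds and a success rate of 0.844.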

Orchestrator Report

The orchestrator generates a comprehensive report including:

Summary metrics across all scenarios
SLO compliance analysis
Recommendations for improvements
MTTR trends and statistics

Example report snippet (values are illustrative and need not be mutually consistent):

{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
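
The summary block can be thought of as a simple aggregation over per-scenario results. The following sketch shows one way to compute it; treating a scenario as successful when it reports an MTTR is an assumption, and `build_summary` is a hypothetical name rather than the orchestrator's own function.

```python
from statistics import mean


def build_summary(results):
    """Aggregate per-scenario result dicts into a summary dict.

    Assumption: a scenario counts as successful when it has an "mttr" value.
    """
    mttrs = [r["mttr"] for r in results if r.get("mttr") is not None]
    return {
        "total_scenarios": len(results),
        "successful_scenarios": len(mttrs),
        "average_mttr": round(mean(mttrs), 1) if mttrs else None,
        "max_mttr": max(mttrs) if mttrs else None,
        "min_mttr": min(mttrs) if mttrs else None,
    }
```

Note that the average always lies between the minimum and maximum, which is a quick sanity check for any generated report.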

SLO Targets

Metric	Target	Current
MTTR (Average)	≤ 120 seconds	TBD
MTTR (Maximum)	≤ 300 seconds	TBD
Success Rate	≥ 99.9%	TBD
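
The targets in the table translate directly into threshold checks. The sketch below is a minimal, hypothetical helper (`check_slos` and the key names are assumptions) that reports which targets a measured run meets.

```python
# Thresholds taken from the SLO table; MTTR values are in seconds.
SLO_TARGETS = {
    "average_mttr": 120.0,
    "max_mttr": 300.0,
    "success_rate": 0.999,
}


def check_slos(average_mttr, max_mttr, success_rate, targets=SLO_TARGETS):
    """Return a dict mapping each SLO name to True (met) or False (missed)."""
    return {
        "average_mttr": average_mttr <= targets["average_mttr"],
        "max_mttr": max_mttr <= targets["max_mttr"],
        "success_rate": success_rate >= targets["success_rate"],
    }


if __name__ == "__main__":
    for name, met in check_slos(67.8, 120.5, 0.844).items():
        print(name, "OK" if met else "MISSED")
```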

Best Practices

Before Running Tests

Backup Critical Data: Ensure recent backups are available
Notify Team: Inform stakeholders about chaos testing
Check Cluster Health: Verify all components are healthy
Schedule Appropriately: Run during low-traffic periods

During Tests

Monitor Logs: Watch for unexpected errors
Have Rollback Plan: Be ready to manually intervene
Document Observations: Note any unusual behavior
Stop if Critical: Abort tests if production is impacted

After Tests

Review Results: Analyze MTTR and error rates
Update Documentation: Record findings and improvements
Address Issues: Fix any discovered problems
Schedule Follow-up: Plan regular chaos testing

Integration with CI/CD

GitHub Actions Example

name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly, Sundays at 02:00 UTC
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          # Assumes result files are written to the scripts' working directory
          path: "infra/scripts/*.json"

Troubleshooting

Common Issues

kubectl not found

# Ensure kubectl is installed and configured
which kubectl
kubectl version

Permission denied errors

# Check RBAC permissions (pod deletion and exec into pods)
kubectl auth can-i delete pods --namespace default
kubectl auth can-i create pods/exec --namespace default

Network rules not applying

# Check if iptables is available in pods
kubectl exec -it <pod> -- iptables -L

Tests hanging

# Check pod status
kubectl get pods --namespace default
kubectl describe pod <pod-name> --namespace default

Debug Mode

To capture unbuffered output in a log file for debugging:

export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log

Contributing

To add new chaos test scenarios:

1. Create a new script following the naming pattern chaos_test_<scenario>.py
2. Implement the required methods: run_test(), save_results()
3. Add the scenario to chaos_orchestrator.py
4. Update documentation
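
A new scenario script might start from a skeleton like the one below. Only the method names run_test() and save_results() come from this document; the class name, constructor arguments, result fields, and the synchronous style are assumptions (the existing scripts may well be asynchronous, given the aiohttp dependency).

```python
import json
import time
from datetime import datetime, timezone


class ChaosTestExample:
    """Hypothetical skeleton for a chaos_test_<scenario>.py script."""

    def __init__(self, namespace="default"):
        self.namespace = namespace
        self.results = {
            "scenario": "example",
            "mttr": None,
            "error_count": 0,
            "success_count": 0,
        }

    def run_test(self):
        """Inject the failure, wait for recovery, and record timings."""
        self.results["test_start"] = datetime.now(timezone.utc).isoformat()
        started = time.monotonic()
        # Placeholder: inject the failure and poll for recovery here.
        self.results["mttr"] = time.monotonic() - started
        self.results["test_end"] = datetime.now(timezone.utc).isoformat()
        return self.results

    def save_results(self, path):
        """Write the collected results as JSON and return the path."""
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self.results, handle, indent=2)
        return path
```

Keeping the result keys aligned with the format under "Test Results Format" should make it easier for the orchestrator to aggregate the new scenario alongside the existing ones.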

Security Considerations

Chaos tests require elevated privileges
Only run in authorized environments
Ensure test isolation from production data
Review network rules before deployment
Monitor for security violations during tests

Support

For issues or questions:

Check the troubleshooting section
Review test logs for error details
Contact the DevOps team at devops@aitbc.io

License

This chaos testing framework is part of the AITBC project and follows the same license terms.
