AITBC Chaos Testing Framework
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
Overview
The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance
Components
Test Scripts
- chaos_test_coordinator.py: Coordinator API outage simulation
  - Deletes coordinator pods to simulate complete service outage
  - Measures recovery time and service availability
  - Tests load handling during and after recovery
- chaos_test_network.py: Network partition simulation
  - Creates network partitions between blockchain nodes
  - Tests consensus resilience during partition
  - Measures network recovery time
- chaos_test_database.py: Database failure simulation
  - Simulates PostgreSQL connection failures
  - Tests high latency scenarios
  - Validates application error handling
- chaos_orchestrator.py: Test orchestration and reporting
  - Runs multiple chaos test scenarios
  - Aggregates MTTR metrics across tests
  - Generates comprehensive reports
  - Supports continuous chaos testing
Prerequisites
- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation
Installation
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts
# Install dependencies
pip install aiohttp
# Make scripts executable
chmod +x chaos_*.py
Usage
Running Individual Tests
Coordinator Outage Test
# Basic test
python3 chaos_test_coordinator.py --namespace default
# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
Network Partition Test
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default
# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
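The network test flags shown above could be parsed with a small argparse setup. This is a minimal sketch rather than the script's actual code: the flag names and the 60-second and 50% defaults are taken from the examples above, while `build_parser`, the `ratio` validator, and the `default` namespace fallback are assumptions made for illustration.

```python
import argparse


def ratio(value: str) -> float:
    """Parse a partition ratio; the 0 < ratio < 1 bound is an assumption."""
    parsed = float(value)
    if not 0.0 < parsed < 1.0:
        raise argparse.ArgumentTypeError("partition ratio must be between 0 and 1")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag names come from the usage examples.
    parser = argparse.ArgumentParser(description="Network partition chaos test")
    parser.add_argument("--namespace", default="default")
    parser.add_argument("--partition-duration", type=int, default=60)
    parser.add_argument("--partition-ratio", type=ratio, default=0.5)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--partition-duration", "90", "--partition-ratio", "0.3"]
)
print(args.namespace, args.partition_duration, args.partition_ratio)
```

Validating the ratio at parse time means a bad value is rejected before any network rule is applied.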
Database Failure Test
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection
# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency
Running All Tests
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default
# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
Test Scenarios
1. Coordinator API Outage
Objective: Test system resilience when the coordinator service becomes unavailable.
Steps:
- Generate baseline load on coordinator API
- Delete all coordinator pods
- Wait for specified outage duration
- Monitor service recovery
- Generate post-recovery load
Metrics Collected:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution
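The recovery measurement in the steps above can be sketched as a polling loop that records how long the service takes to pass a health check again. This is a hedged illustration, not the framework's implementation: `measure_recovery`, its parameters, and the injected `clock` and `sleep` hooks are hypothetical names. A single run yields one time-to-recovery sample; MTTR is the mean over several such runs.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    poll_interval: float = 1.0,
    timeout: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until is_healthy() first succeeds, or None on timeout."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)


# Usage with a fake clock so the example runs instantly: the probe fails
# three times and succeeds on the fourth poll.
fake_now = [0.0]
answers = iter([False, False, False, True])


def advance(seconds: float) -> None:
    fake_now[0] += seconds


recovery = measure_recovery(
    is_healthy=lambda: next(answers),
    poll_interval=5.0,
    clock=lambda: fake_now[0],
    sleep=advance,
)
print(recovery)
```

Injecting the clock and sleep functions keeps the loop testable without waiting for a real outage; in a real run `is_healthy` would probe the coordinator API.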
2. Network Partition
Objective: Test blockchain consensus during network partitions.
Steps:
- Identify blockchain node pods
- Apply iptables rules to partition nodes
- Monitor consensus during partition
- Remove network partition
- Verify network recovery
Metrics Collected:
- Network recovery time
- Consensus health during partition
- Node connectivity status
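As an illustration of the partition step, the sketch below splits a list of node pods by ratio and builds the iptables commands that would isolate one group from the other. It only constructs command lists and does not execute anything. The pod names, IP addresses, and the helper names `split_nodes` and `drop_rules` are hypothetical, and the real script may select nodes and write rules differently.

```python
import math
from typing import List, Tuple


def split_nodes(pods: List[str], ratio: float) -> Tuple[List[str], List[str]]:
    """Split pods into (isolated, remaining) groups, isolating at least one."""
    ordered = sorted(pods)
    count = max(1, math.floor(len(ordered) * ratio)) if ordered else 0
    return ordered[:count], ordered[count:]


def drop_rules(peer_ips: List[str], action: str = "-A") -> List[List[str]]:
    """Build iptables commands for each peer IP.

    Use action "-A" to append the DROP rules and "-D" to delete them again.
    """
    rules = []
    for ip in peer_ips:
        rules.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        rules.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


# Hypothetical pod-to-IP mapping for four blockchain nodes.
pods = {
    "node-0": "10.0.0.10",
    "node-1": "10.0.0.11",
    "node-2": "10.0.0.12",
    "node-3": "10.0.0.13",
}
isolated, remaining = split_nodes(list(pods), 0.5)
rules = drop_rules([pods[name] for name in remaining])
for command in rules:
    print(" ".join(command))
```

Building the removal commands from the same function (with `-D`) makes it harder to leave a stale rule behind when the partition is lifted.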
3. Database Failure
Objective: Test application behavior when the database is unavailable or slow.
Steps:
- Simulate database connection failure or high latency
- Monitor API behavior during failure
- Restore database connectivity
- Verify application recovery
Metrics Collected:
- Database recovery time
- API error rates during failure
- Application resilience metrics
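One way to report API error rates during the failure is to label each probe request with the phase it was sent in and compute the failure fraction per phase. This is a sketch under assumptions: the phase labels and the function name `error_rate_by_phase` are hypothetical, and the sample data is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_phase(samples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of failed requests for each phase label."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for phase, ok in samples:
        totals[phase] += 1
        if not ok:
            failures[phase] += 1
    return {phase: failures[phase] / totals[phase] for phase in totals}


# Made-up probe results: (phase, request succeeded?)
samples = (
    [("baseline", True)] * 10
    + [("failure", False)] * 6
    + [("failure", True)] * 4
    + [("recovered", True)] * 9
    + [("recovered", False)]
)
rates = error_rate_by_phase(samples)
print(rates)
```

Separating the phases keeps expected errors during the injected failure from being confused with errors that persist after connectivity is restored.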
Results and Reporting
Test Results Format
Each test generates a JSON results file with the following structure:
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
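A results file with this structure could be produced along the following lines. This is a minimal sketch, not the scripts' own `save_results()`: the helper names and signatures are assumptions, and setting `mttr` equal to `recovery_time` simply mirrors the example above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def iso_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime like 2024-12-22T10:00:00.000Z."""
    text = moment.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def build_result(scenario, start, end, recovery_time, success_count, error_count):
    """Assemble a result record with the fields shown in the example."""
    return {
        "test_start": iso_utc(start),
        "test_end": iso_utc(end),
        "scenario": scenario,
        "mttr": recovery_time,  # assumption: one run, so MTTR equals recovery time
        "error_count": error_count,
        "success_count": success_count,
        "recovery_time": recovery_time,
    }


def save_results(result: dict, path: str) -> None:
    """Write the record as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2)


result = build_result(
    "coordinator_outage",
    datetime(2024, 12, 22, 10, 0, tzinfo=timezone.utc),
    datetime(2024, 12, 22, 10, 5, tzinfo=timezone.utc),
    recovery_time=45.2,
    success_count=844,
    error_count=156,
)
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "coordinator_outage.json")
    save_results(result, path)
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)
print(loaded["scenario"], loaded["mttr"])
```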
Orchestrator Report
The orchestrator generates a comprehensive report including:
- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics
Example report snippet:
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
SLO Targets
| Metric | Target | Current |
|---|---|---|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |
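A compliance check against these targets could look like the following sketch. The thresholds come from the table above; the function name `slo_violations`, the message wording, and the choice to compute the success rate from raw request counts are assumptions. Whether requests sent during the injected outage should count toward the success rate is a policy decision this example does not settle.

```python
from typing import List

# Targets from the SLO table.
SLO_AVG_MTTR_S = 120.0
SLO_MAX_MTTR_S = 300.0
SLO_SUCCESS_RATE = 0.999


def slo_violations(
    average_mttr: float, max_mttr: float, success_count: int, error_count: int
) -> List[str]:
    """Return a human-readable message for each SLO target that is missed."""
    violations = []
    total = success_count + error_count
    success_rate = success_count / total if total else 0.0
    if average_mttr > SLO_AVG_MTTR_S:
        violations.append(f"average MTTR {average_mttr:.1f}s exceeds 120s")
    if max_mttr > SLO_MAX_MTTR_S:
        violations.append(f"maximum MTTR {max_mttr:.1f}s exceeds 300s")
    if success_rate < SLO_SUCCESS_RATE:
        violations.append(f"success rate {success_rate:.1%} is below 99.9%")
    return violations


# Using the figures from the example results and report in this document.
violations = slo_violations(67.8, 120.5, success_count=844, error_count=156)
print(violations)
```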
Best Practices
Before Running Tests
- Backup Critical Data: Ensure recent backups are available
- Notify Team: Inform stakeholders about chaos testing
- Check Cluster Health: Verify all components are healthy
- Schedule Appropriately: Run during low-traffic periods
During Tests
- Monitor Logs: Watch for unexpected errors
- Have Rollback Plan: Be ready to manually intervene
- Document Observations: Note any unusual behavior
- Stop if Critical: Abort tests if production is impacted
After Tests
- Review Results: Analyze MTTR and error rates
- Update Documentation: Record findings and improvements
- Address Issues: Fix any discovered problems
- Schedule Follow-up: Plan regular chaos testing
Integration with CI/CD
GitHub Actions Example
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0' # Weekly at 2 AM Sunday (UTC)
  workflow_dispatch:
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: "*.json"
Troubleshooting
Common Issues
- kubectl not found
  # Ensure kubectl is installed and configured
  which kubectl
  kubectl version
- Permission denied errors
  # Check RBAC permissions
  kubectl auth can-i create pods --namespace default
  kubectl auth can-i delete pods --namespace default
  kubectl auth can-i create pods/exec --namespace default
- Network rules not applying
  # Check if iptables is available in pods
  kubectl exec -it <pod> -- iptables -L
- Tests hanging
  # Check pod status
  kubectl get pods --namespace default
  kubectl describe pod <pod-name> --namespace default
Debug Mode
Run a test with unbuffered output (-u) and capture both stdout and stderr in debug.log:
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
Contributing
To add new chaos test scenarios:
- Create a new script following the naming pattern chaos_test_<scenario>.py
- Implement the required methods: run_test(), save_results()
- Add the scenario to chaos_orchestrator.py
- Update documentation
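A new scenario script might start from a skeleton like the one below. Only the method names `run_test()` and `save_results()` come from the list above. The base class, constructor arguments, result fields, and the `example` scenario are hypothetical, and the existing scripts may be structured differently (for instance, as async code built on aiohttp).

```python
import json
from abc import ABC, abstractmethod
from typing import Dict


class ChaosTest(ABC):
    """Hypothetical base class for chaos test scenarios."""

    scenario = "unnamed"

    def __init__(self, namespace: str = "default", dry_run: bool = False):
        self.namespace = namespace
        self.dry_run = dry_run
        self.results: Dict = {"scenario": self.scenario}

    @abstractmethod
    def run_test(self) -> Dict:
        """Inject the failure, measure recovery, and return the results."""

    def save_results(self, path: str) -> None:
        """Write the collected results as JSON."""
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self.results, handle, indent=2)


class ExampleTest(ChaosTest):
    """Placeholder scenario that would live in chaos_test_example.py."""

    scenario = "example"

    def run_test(self) -> Dict:
        if self.dry_run:
            # A dry run records that nothing was injected.
            self.results["skipped"] = True
            return self.results
        raise NotImplementedError("inject the failure and measure recovery here")


outcome = ExampleTest(namespace="staging", dry_run=True).run_test()
print(outcome)
```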
Security Considerations
- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests
Support
For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io
License
This chaos testing framework is part of the AITBC project and follows the same license terms.