# AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

## Overview

The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance

## Components

### Test Scripts

1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
   - Deletes coordinator pods to simulate complete service outage
   - Measures recovery time and service availability
   - Tests load handling during and after recovery

2. **`chaos_test_network.py`** - Network partition simulation
   - Creates network partitions between blockchain nodes
   - Tests consensus resilience during partition
   - Measures network recovery time

3. **`chaos_test_database.py`** - Database failure simulation
   - Simulates PostgreSQL connection failures
   - Tests high latency scenarios
   - Validates application error handling

4. **`chaos_orchestrator.py`** - Test orchestration and reporting
   - Runs multiple chaos test scenarios
   - Aggregates MTTR metrics across tests
   - Generates comprehensive reports
   - Supports continuous chaos testing

## Prerequisites

- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation

## Installation

```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py
```

## Usage

### Running Individual Tests

#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```

#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```

#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000 ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```

### Running All Tests

```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```

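Continuous mode can be pictured as a fixed-interval loop. The sketch below is not taken from `chaos_orchestrator.py`; it assumes `--duration` is in hours and `--interval` in minutes (as the comment on the command suggests), and `run_once` is a hypothetical callable standing in for one full pass over the scenarios. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, List


def run_continuously(
    run_once: Callable[[], dict],
    duration_hours: float,
    interval_minutes: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call run_once() once per interval until the duration has elapsed."""
    deadline = clock() + duration_hours * 3600
    interval = interval_minutes * 60
    reports = []
    while clock() < deadline:
        started = clock()
        reports.append(run_once())
        # Sleep for the rest of the interval, but never past the deadline.
        remaining = min(interval - (clock() - started), deadline - clock())
        if remaining > 0:
            sleep(remaining)
    return reports
```

With these assumptions, `--duration 24 --interval 60` would correspond to 24 passes, provided each pass finishes within its interval.
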
## Test Scenarios

### 1. Coordinator API Outage

**Objective**: Test system resilience when the coordinator service becomes unavailable.

**Steps**:
1. Generate baseline load on coordinator API
2. Delete all coordinator pods
3. Wait for specified outage duration
4. Monitor service recovery
5. Generate post-recovery load

**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution

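The "monitor service recovery" step amounts to polling a health check until it succeeds and recording the elapsed time. A minimal sketch, assuming a caller-supplied `is_healthy` probe (hypothetical; the actual script may check pod readiness or an HTTP endpoint instead) and an arbitrary 300-second timeout:

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    is_healthy: Callable[[], bool],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until is_healthy() first succeeds, or None on timeout."""
    start = clock()
    while clock() - start < timeout:
        if is_healthy():
            return clock() - start
        sleep(poll_interval)
    return None
```

The resolution of the measurement is bounded by `poll_interval`, so a shorter interval gives a more precise recovery time at the cost of more probe traffic.
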
### 2. Network Partition

**Objective**: Test blockchain consensus during network partitions.

**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during partition
4. Remove network partition
5. Verify network recovery

**Metrics Collected**:
- Network recovery time
- Consensus health during partition
- Node connectivity status

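Before any iptables rules are applied, the node pods have to be divided into an isolated group and a remaining group. The sketch below shows one way to do that from a partition ratio; `split_nodes` and its rounding rule are assumptions for illustration, not the logic in `chaos_test_network.py`.

```python
import math
import random
from typing import List, Optional, Tuple


def split_nodes(
    pods: List[str], partition_ratio: float = 0.5, seed: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """Split pod names into (isolated, remaining) groups by ratio."""
    if len(pods) < 2:
        raise ValueError("need at least two pods to create a partition")
    if not 0.0 < partition_ratio < 1.0:
        raise ValueError("partition_ratio must be between 0 and 1 (exclusive)")
    shuffled = list(pods)
    random.Random(seed).shuffle(shuffled)
    # Round down, but isolate at least one node so the test has an effect.
    count = max(1, math.floor(len(shuffled) * partition_ratio))
    return shuffled[:count], shuffled[count:]
```

Passing a fixed `seed` makes the split reproducible, which helps when comparing consensus behaviour across runs.
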
### 3. Database Failure

**Objective**: Test application behavior when database is unavailable.

**Steps**:
1. Simulate database connection failure or high latency
2. Monitor API behavior during failure
3. Restore database connectivity
4. Verify application recovery

**Metrics Collected**:
- Database recovery time
- API error rates during failure
- Application resilience metrics

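The API error rate during the failure window can be derived from timestamped request outcomes. A minimal sketch, assuming each sample is a `(timestamp_seconds, http_status)` pair and that a status of `0` marks a request that never got a response; both conventions are illustrative and not taken from `chaos_test_database.py`.

```python
from typing import Iterable, Tuple


def error_rate(
    samples: Iterable[Tuple[float, int]], window_start: float, window_end: float
) -> float:
    """Fraction of requests inside [window_start, window_end] that failed."""
    total = 0
    errors = 0
    for timestamp, status in samples:
        if window_start <= timestamp <= window_end:
            total += 1
            # Count server errors and requests with no response as failures.
            if status == 0 or status >= 500:
                errors += 1
    return errors / total if total else 0.0
```
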
## Results and Reporting

### Test Results Format

Each test generates a JSON results file with the following structure:

```json
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
```

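A results file in this shape can be post-processed with the standard library alone. The sketch below derives the test duration and success rate from the fields shown above; `summarize_result` is a hypothetical helper, not part of the framework.

```python
import json
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset for older interpreters.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_result(raw: str) -> dict:
    """Derive duration and success rate from one JSON results document."""
    result = json.loads(raw)
    total = result["success_count"] + result["error_count"]
    duration = parse_timestamp(result["test_end"]) - parse_timestamp(result["test_start"])
    return {
        "scenario": result["scenario"],
        "duration_seconds": duration.total_seconds(),
        "success_rate": result["success_count"] / total if total else None,
        "mttr": result["mttr"],
    }
```

For the sample file above, this gives a duration of 300 seconds and a success rate of 0.844 (844 of 1000 requests).
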
### Orchestrator Report

The orchestrator generates a comprehensive report including:

- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics

Example report snippet (illustrative values):
```json
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
```

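The `summary` fields can be computed from the per-scenario results with a few lines of aggregation. This is a sketch under stated assumptions: `build_summary` is a hypothetical function, and a scenario is treated as successful when it reported an MTTR, which may not match the orchestrator's own definition.

```python
from statistics import mean
from typing import Dict, List


def build_summary(results: List[Dict]) -> Dict:
    """Aggregate per-scenario results into the report's summary fields."""
    # Assumption: a scenario counts as successful if it reported an MTTR.
    mttrs = [r["mttr"] for r in results if r.get("mttr") is not None]
    summary = {
        "total_scenarios": len(results),
        "successful_scenarios": len(mttrs),
    }
    if mttrs:
        summary.update(
            average_mttr=round(mean(mttrs), 1),
            max_mttr=max(mttrs),
            min_mttr=min(mttrs),
        )
    return summary
```
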
## SLO Targets

| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |

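These targets translate directly into a pass/fail check over collected results. A minimal sketch, assuming the success rate is computed from request counts such as `success_count` and `error_count`; `check_slos` and the constant names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Targets copied from the SLO table.
MAX_AVERAGE_MTTR_S = 120.0
MAX_SINGLE_MTTR_S = 300.0
MIN_SUCCESS_RATE = 0.999


def check_slos(mttrs: List[float], success_count: int, error_count: int) -> Dict[str, bool]:
    """Return, per SLO, whether the measured values meet the target."""
    if not mttrs:
        raise ValueError("at least one MTTR measurement is required")
    total = success_count + error_count
    success_rate = success_count / total if total else 0.0
    return {
        "mttr_average": mean(mttrs) <= MAX_AVERAGE_MTTR_S,
        "mttr_maximum": max(mttrs) <= MAX_SINGLE_MTTR_S,
        "success_rate": success_rate >= MIN_SUCCESS_RATE,
    }
```

Note that request counts gathered while a fault is active will usually fail the success-rate target; whether that target applies during the outage or only outside it is not specified here.
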
## Best Practices

### Before Running Tests

1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods

### During Tests

1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted

### After Tests

1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing

## Integration with CI/CD

### GitHub Actions Example

```yaml
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday (UTC)
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          # Assumes result files are written to the directory the tests ran in
          path: "infra/scripts/*.json"
```

## Troubleshooting

### Common Issues

1. **kubectl not found**
   ```bash
   # Ensure kubectl is installed and configured
   which kubectl
   kubectl version
   ```

2. **Permission denied errors**
   ```bash
   # Check RBAC permissions
   kubectl auth can-i create pods --namespace default
   kubectl auth can-i delete pods --namespace default
   # exec is a subresource of pods, so it is checked as "create pods/exec"
   kubectl auth can-i create pods/exec --namespace default
   ```

3. **Network rules not applying**
   ```bash
   # Check if iptables is available in pods
   kubectl exec -it <pod> -- iptables -L
   ```

4. **Tests hanging**
   ```bash
   # Check pod status
   kubectl get pods --namespace default
   kubectl describe pod <pod-name> --namespace default
   ```

### Debug Mode

To debug a run, disable Python's output buffering and capture everything the script prints in `debug.log`:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```

## Contributing

To add new chaos test scenarios:

1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods: `run_test()` and `save_results()`
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation

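A new scenario script might start from a skeleton like the one below. Only the method names `run_test()` and `save_results()` come from the list above; the class name, the `inject_failure()` and `wait_for_recovery()` placeholders, the use of `async`, and the result fields are assumptions modelled on the results format, so check an existing script for the actual conventions.

```python
import asyncio
import json
from datetime import datetime, timezone
from pathlib import Path


class ChaosTestExample:
    """Hypothetical skeleton for a chaos_test_<scenario>.py script."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.results = {"scenario": "example", "mttr": None,
                        "error_count": 0, "success_count": 0}

    async def inject_failure(self) -> None:
        """Placeholder: introduce the fault, e.g. by calling kubectl."""

    async def wait_for_recovery(self) -> float:
        """Placeholder: return seconds until the service is healthy again."""
        return 0.0

    async def run_test(self) -> dict:
        """Run the scenario once and return the collected results."""
        self.results["test_start"] = datetime.now(timezone.utc).isoformat()
        await self.inject_failure()
        self.results["mttr"] = await self.wait_for_recovery()
        self.results["test_end"] = datetime.now(timezone.utc).isoformat()
        return self.results

    def save_results(self, path: str) -> None:
        """Write the results dictionary to a JSON file."""
        Path(path).write_text(json.dumps(self.results, indent=2))
```

A script built this way could be driven with `asyncio.run(ChaosTestExample("staging").run_test())` followed by a call to `save_results()`.
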
## Security Considerations

- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests

## Support

For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io

## License

This chaos testing framework is part of the AITBC project and follows the same license terms.