feat: add marketplace metrics, privacy features, and service registry endpoints
- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru
330
infra/scripts/README_chaos.md
Normal file
@@ -0,0 +1,330 @@
# AITBC Chaos Testing Framework

This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.

## Overview

The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance

## Components

### Test Scripts

1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
   - Deletes coordinator pods to simulate complete service outage
   - Measures recovery time and service availability
   - Tests load handling during and after recovery

2. **`chaos_test_network.py`** - Network partition simulation
   - Creates network partitions between blockchain nodes
   - Tests consensus resilience during partition
   - Measures network recovery time

3. **`chaos_test_database.py`** - Database failure simulation
   - Simulates PostgreSQL connection failures
   - Tests high latency scenarios
   - Validates application error handling

4. **`chaos_orchestrator.py`** - Test orchestration and reporting
   - Runs multiple chaos test scenarios
   - Aggregates MTTR metrics across tests
   - Generates comprehensive reports
   - Supports continuous chaos testing

## Prerequisites

- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation

## Installation

```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts

# Install dependencies
pip install aiohttp

# Make scripts executable
chmod +x chaos_*.py
```

## Usage

### Running Individual Tests

#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default

# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120

# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```

#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default

# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```

#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection

# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```

### Running All Tests

```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default

# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network

# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```
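
To make the continuous mode concrete, the sketch below shows one way such a schedule could be computed and driven. It assumes that `--duration` is given in hours and `--interval` in minutes, as the comment above suggests. The function names `planned_offsets` and `run_continuously` are hypothetical and are not taken from `chaos_orchestrator.py`.

```python
import time
from typing import Callable, List


def planned_offsets(duration_hours: float, interval_minutes: float) -> List[float]:
    """Return the start offset (in seconds) of every run inside the window."""
    if duration_hours <= 0 or interval_minutes <= 0:
        raise ValueError("duration and interval must be positive")
    window = duration_hours * 3600
    step = interval_minutes * 60
    offsets = []
    t = 0.0
    while t < window:
        offsets.append(t)
        t += step
    return offsets


def run_continuously(
    run_once: Callable[[], None],
    duration_hours: float,
    interval_minutes: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once for each planned slot and return the number of runs.

    Simplification: the pause between runs is the full interval, so the
    time spent inside run_once is not subtracted from it.
    """
    offsets = planned_offsets(duration_hours, interval_minutes)
    for index in range(len(offsets)):
        run_once()
        if index < len(offsets) - 1:
            sleep(interval_minutes * 60)
    return len(offsets)
```

Under these assumptions, a 24-hour window with a 60-minute interval yields 24 runs. Passing `sleep` as a parameter keeps the loop testable without waiting in real time.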

## Test Scenarios

### 1. Coordinator API Outage

**Objective**: Test system resilience when the coordinator service becomes unavailable.

**Steps**:
1. Generate baseline load on coordinator API
2. Delete all coordinator pods
3. Wait for specified outage duration
4. Monitor service recovery
5. Generate post-recovery load

**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution
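
The recovery-monitoring step can be sketched as a polling loop that times how long the service takes to report healthy again. This is a minimal illustration, not the code in `chaos_test_coordinator.py`: the function name, the `is_healthy` callback, and the default timeout and poll interval are all assumptions. Each call produces one recovery-time sample; MTTR is the mean over several such samples.

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    is_healthy: Callable[[], bool],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll is_healthy until it returns True.

    Returns the elapsed seconds until the first healthy probe,
    or None if the service did not recover within the timeout.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

In a real test, `is_healthy` would wrap an HTTP request to the coordinator's health endpoint (whatever that endpoint is in the deployment). Injecting `clock` and `sleep` lets the loop be exercised with a fake clock.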

### 2. Network Partition

**Objective**: Test blockchain consensus during network partitions.

**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during partition
4. Remove network partition
5. Verify network recovery

**Metrics Collected**:
- Network recovery time
- Consensus health during partition
- Node connectivity status
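
As an illustration of the partitioning steps, the sketch below selects a fraction of the nodes and builds the `kubectl exec ... iptables` commands that would drop traffic between the isolated group and the rest. It only builds the command lists and does not execute anything. The pod names, IP addresses, and the functions `split_nodes` and `partition_commands` are hypothetical; the actual rules applied by `chaos_test_network.py` may differ.

```python
from typing import Dict, List, Tuple


def split_nodes(pods: List[str], ratio: float) -> Tuple[List[str], List[str]]:
    """Split pods into (isolated, remaining) using the partition ratio."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1 (exclusive)")
    count = max(1, int(len(pods) * ratio))
    return pods[:count], pods[count:]


def partition_commands(
    pod_ips: Dict[str, str],
    ratio: float,
    namespace: str,
    action: str = "-A",
) -> List[List[str]]:
    """Build iptables commands run inside each isolated pod.

    action="-A" appends DROP rules (create the partition);
    action="-D" deletes the same rules (heal the partition).
    """
    isolated, remaining = split_nodes(sorted(pod_ips), ratio)
    commands = []
    for pod in isolated:
        for peer in remaining:
            # Drop traffic from the peer (INPUT) and to the peer (OUTPUT).
            for chain, match in (("INPUT", "-s"), ("OUTPUT", "-d")):
                commands.append([
                    "kubectl", "exec", "-n", namespace, pod, "--",
                    "iptables", action, chain, match, pod_ips[peer],
                    "-j", "DROP",
                ])
    return commands
```

Calling the same function with `action="-D"` produces the matching cleanup commands, which keeps the create and remove steps symmetric.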

### 3. Database Failure

**Objective**: Test application behavior when the database is unavailable.

**Steps**:
1. Simulate database connection failure or high latency
2. Monitor API behavior during failure
3. Restore database connectivity
4. Verify application recovery

**Metrics Collected**:
- Database recovery time
- API error rates during failure
- Application resilience metrics
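
One way to derive the "API error rates during failure" metric is to tag each probe request with the test phase and compare failure fractions per phase. The sketch below assumes that a request counts as failed when it raised an exception (recorded as `None`) or returned a 5xx status. The phase labels and the function name are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def error_rates_by_phase(
    samples: Iterable[Tuple[str, Optional[int]]],
) -> Dict[str, float]:
    """Return the fraction of failed requests for each phase.

    Each sample is (phase, status), where status is the HTTP status code
    or None if the request raised (timeout, connection refused, ...).
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for phase, status in samples:
        totals[phase] += 1
        if status is None or status >= 500:
            failures[phase] += 1
    return {phase: failures[phase] / totals[phase] for phase in totals}
```

Comparing the rate during the failure phase with the rate after connectivity is restored gives a quick check that the application actually recovered.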

## Results and Reporting

### Test Results Format

Each test generates a JSON results file with the following structure:

```json
{
  "test_start": "2024-12-22T10:00:00.000Z",
  "test_end": "2024-12-22T10:05:00.000Z",
  "scenario": "coordinator_outage",
  "mttr": 45.2,
  "error_count": 156,
  "success_count": 844,
  "recovery_time": 45.2
}
```
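
A results file in this shape can be post-processed with the standard library alone. The sketch below derives the success rate and the test duration from the fields shown above. The function name `summarize_result` is hypothetical, and the timestamp format is inferred from the sample values.

```python
import json
from datetime import datetime

# Format inferred from the sample timestamps, e.g. 2024-12-22T10:00:00.000Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize_result(raw: str) -> dict:
    """Parse one results document and derive a few summary values."""
    result = json.loads(raw)
    total = result["success_count"] + result["error_count"]
    start = datetime.strptime(result["test_start"], TIMESTAMP_FORMAT)
    end = datetime.strptime(result["test_end"], TIMESTAMP_FORMAT)
    return {
        "scenario": result["scenario"],
        "success_rate": result["success_count"] / total if total else None,
        "duration_seconds": (end - start).total_seconds(),
        "mttr": result["mttr"],
    }
```

For the sample above this gives a success rate of 0.844 (844 of 1000 requests) over a 300-second test.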

### Orchestrator Report

The orchestrator generates a comprehensive report including:

- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics

Example report snippet:
```json
{
  "summary": {
    "total_scenarios": 3,
    "successful_scenarios": 3,
    "average_mttr": 67.8,
    "max_mttr": 120.5,
    "min_mttr": 45.2
  },
  "recommendations": [
    "Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
    "Coordinator recovery is slow. Consider reducing pod startup time."
  ]
}
```
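
The `summary` block could be produced by a small aggregation over the per-scenario results. The sketch below is an assumption about how that might work, not the orchestrator's actual code. In particular, it treats a scenario as successful when it reports a non-null `mttr`, and the name `summarize_mttr` is made up for this example.

```python
from statistics import mean
from typing import Dict, List


def summarize_mttr(results: List[Dict]) -> Dict:
    """Aggregate per-scenario results into the summary fields of the report."""
    # Assumption: a scenario without an MTTR value did not complete successfully.
    mttrs = [r["mttr"] for r in results if r.get("mttr") is not None]
    return {
        "total_scenarios": len(results),
        "successful_scenarios": len(mttrs),
        "average_mttr": mean(mttrs) if mttrs else None,
        "max_mttr": max(mttrs) if mttrs else None,
        "min_mttr": min(mttrs) if mttrs else None,
    }
```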

## SLO Targets

| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |
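
The targets in the table translate directly into a compliance check. The following sketch uses the thresholds above; the function name and the shape of the returned dictionary are assumptions made for illustration.

```python
from typing import Dict

# Thresholds copied from the SLO table: seconds for MTTR, fraction for success rate.
SLO_TARGETS = {
    "average_mttr": 120.0,
    "max_mttr": 300.0,
    "success_rate": 0.999,
}


def check_slo(average_mttr: float, max_mttr: float, success_rate: float) -> Dict[str, bool]:
    """Return, per metric, whether the measured value meets its target."""
    return {
        "average_mttr": average_mttr <= SLO_TARGETS["average_mttr"],
        "max_mttr": max_mttr <= SLO_TARGETS["max_mttr"],
        "success_rate": success_rate >= SLO_TARGETS["success_rate"],
    }
```

For example, an average MTTR of 67.8 s and a maximum of 120.5 s meet both MTTR targets, while a success rate of 84.4% falls well short of 99.9%.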

## Best Practices

### Before Running Tests

1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods

### During Tests

1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted

### After Tests

1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing

## Integration with CI/CD

### GitHub Actions Example

```yaml
name: Chaos Testing
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 02:00 UTC on Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: "*.json"
```

## Troubleshooting

### Common Issues

1. **kubectl not found**
   ```bash
   # Ensure kubectl is installed and configured
   which kubectl
   kubectl version
   ```

2. **Permission denied errors**
   ```bash
   # Check RBAC permissions (exec is the pods/exec subresource with the create verb)
   kubectl auth can-i create pods --namespace default
   kubectl auth can-i delete pods --namespace default
   kubectl auth can-i create pods/exec --namespace default
   ```

3. **Network rules not applying**
   ```bash
   # Check if iptables is available in pods
   kubectl exec -it <pod> -- iptables -L
   ```

4. **Tests hanging**
   ```bash
   # Check pod status
   kubectl get pods --namespace default
   kubectl describe pod <pod-name> --namespace default
   ```

### Debug Mode

Run a test with unbuffered output and capture both stdout and stderr in a log file:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```

## Contributing

To add new chaos test scenarios:

1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods: `run_test()`, `save_results()`
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation
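
A new scenario script might start from a skeleton like the one below. Only the method names `run_test()` and `save_results()` come from the list above. The class name, the constructor arguments, the helper hooks, and the use of synchronous methods are assumptions for illustration; the existing scripts may structure these differently (for example, as coroutines, given that they depend on `aiohttp`).

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Union


def _utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


class ChaosTestExample:
    """Hypothetical skeleton for a chaos_test_<scenario>.py script."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.results = {
            "scenario": "example",
            "mttr": None,
            "error_count": 0,
            "success_count": 0,
        }

    def inject_failure(self) -> None:
        """Placeholder: trigger the failure (delete pods, add rules, ...)."""

    def wait_for_recovery(self) -> None:
        """Placeholder: block until the system is healthy again."""

    def run_test(self) -> dict:
        """Run the scenario once and record timing in self.results."""
        self.results["test_start"] = _utc_now()
        started = time.monotonic()
        self.inject_failure()
        self.wait_for_recovery()
        self.results["recovery_time"] = time.monotonic() - started
        self.results["mttr"] = self.results["recovery_time"]
        self.results["test_end"] = _utc_now()
        return self.results

    def save_results(self, path: Union[str, Path]) -> Path:
        """Write the collected results as JSON and return the file path."""
        target = Path(path)
        target.write_text(json.dumps(self.results, indent=2))
        return target
```

Keeping the result keys aligned with the documented results format lets the orchestrator aggregate a new scenario without special handling.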

## Security Considerations

- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests

## Support

For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io

## License

This chaos testing framework is part of the AITBC project and follows the same license terms.