✅ v0.2 Release Preparation: - Update version to 0.2.0 in pyproject.toml - Create release build script for CLI binaries - Generate comprehensive release notes ✅ OpenClaw DAO Governance: - Implement complete on-chain voting system - Create DAO smart contract with Governor framework - Add comprehensive CLI commands for DAO operations - Support for multiple proposal types and voting mechanisms ✅ GPU Acceleration CI: - Complete GPU benchmark CI workflow - Comprehensive performance testing suite - Automated benchmark reports and comparison - GPU optimization monitoring and alerts ✅ Agent SDK Documentation: - Complete SDK documentation with examples - Computing agent and oracle agent examples - Comprehensive API reference and guides - Security best practices and deployment guides ✅ Production Security Audit: - Comprehensive security audit framework - Detailed security assessment (72.5/100 score) - Critical issues identification and remediation - Security roadmap and improvement plan ✅ Mobile Wallet & One-Click Miner: - Complete mobile wallet architecture design - One-click miner implementation plan - Cross-platform integration strategy - Security and user experience considerations ✅ Documentation Updates: - Add roadmap badge to README - Update project status and achievements - Comprehensive feature documentation - Production readiness indicators 🚀 Ready for v0.2.0 release with agent-first architecture
8.1 KiB
AITBC Chaos Testing Framework
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
Overview
The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance
Components
Test Scripts
-
chaos_test_coordinator.py- Coordinator API outage simulation- Deletes coordinator pods to simulate complete service outage
- Measures recovery time and service availability
- Tests load handling during and after recovery
-
chaos_test_network.py- Network partition simulation- Creates network partitions between blockchain nodes
- Tests consensus resilience during partition
- Measures network recovery time
-
chaos_test_database.py- Database failure simulation- Simulates PostgreSQL connection failures
- Tests high latency scenarios
- Validates application error handling
-
chaos_orchestrator.py- Test orchestration and reporting- Runs multiple chaos test scenarios
- Aggregates MTTR metrics across tests
- Generates comprehensive reports
- Supports continuous chaos testing
Prerequisites
- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation
Installation
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts
# Install dependencies
pip install aiohttp
# Make scripts executable
chmod +x chaos_*.py
Usage
Running Individual Tests
Coordinator Outage Test
# Basic test
python3 chaos_test_coordinator.py --namespace default
# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
Network Partition Test
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default
# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
Database Failure Test
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection
# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency
Running All Tests
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default
# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
Test Scenarios
1. Coordinator API Outage
Objective: Test system resilience when the coordinator service becomes unavailable.
Steps:
- Generate baseline load on coordinator API
- Delete all coordinator pods
- Wait for specified outage duration
- Monitor service recovery
- Generate post-recovery load
Metrics Collected:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution
2. Network Partition
Objective: Test blockchain consensus during network partitions.
Steps:
- Identify blockchain node pods
- Apply iptables rules to partition nodes
- Monitor consensus during partition
- Remove network partition
- Verify network recovery
Metrics Collected:
- Network recovery time
- Consensus health during partition
- Node connectivity status
3. Database Failure
Objective: Test application behavior when database is unavailable.
Steps:
- Simulate database connection failure or high latency
- Monitor API behavior during failure
- Restore database connectivity
- Verify application recovery
Metrics Collected:
- Database recovery time
- API error rates during failure
- Application resilience metrics
Results and Reporting
Test Results Format
Each test generates a JSON results file with the following structure:
{
"test_start": "2024-12-22T10:00:00.000Z",
"test_end": "2024-12-22T10:05:00.000Z",
"scenario": "coordinator_outage",
"mttr": 45.2,
"error_count": 156,
"success_count": 844,
"recovery_time": 45.2
}
Orchestrator Report
The orchestrator generates a comprehensive report including:
- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics
Example report snippet:
{
"summary": {
"total_scenarios": 3,
"successful_scenarios": 3,
"average_mttr": 67.8,
"max_mttr": 120.5,
"min_mttr": 45.2
},
"recommendations": [
"Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
"Coordinator recovery is slow. Consider reducing pod startup time."
]
}
SLO Targets
| Metric | Target | Current |
|---|---|---|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |
Best Practices
Before Running Tests
- Backup Critical Data: Ensure recent backups are available
- Notify Team: Inform stakeholders about chaos testing
- Check Cluster Health: Verify all components are healthy
- Schedule Appropriately: Run during low-traffic periods
During Tests
- Monitor Logs: Watch for unexpected errors
- Have Rollback Plan: Be ready to manually intervene
- Document Observations: Note any unusual behavior
- Stop if Critical: Abort tests if production is impacted
After Tests
- Review Results: Analyze MTTR and error rates
- Update Documentation: Record findings and improvements
- Address Issues: Fix any discovered problems
- Schedule Follow-up: Plan regular chaos testing
Integration with CI/CD
GitHub Actions Example
name: Chaos Testing
on:
schedule:
- cron: '0 2 * * 0' # Weekly at 2 AM Sunday
workflow_dispatch:
jobs:
chaos-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: '3.9'
- name: Install dependencies
run: |
pip install aiohttp
- name: Run chaos tests
run: |
cd infra/scripts
python3 chaos_orchestrator.py --namespace staging
- name: Upload results
uses: actions/upload-artifact@v2
with:
name: chaos-results
path: "*.json"
Troubleshooting
Common Issues
-
kubectl not found
# Ensure kubectl is installed and configured which kubectl kubectl version -
Permission denied errors
# Check RBAC permissions kubectl auth can-i create pods --namespace default kubectl auth can-i exec pods --namespace default -
Network rules not applying
# Check if iptables is available in pods kubectl exec -it <pod> -- iptables -L -
Tests hanging
# Check pod status kubectl get pods --namespace default kubectl describe pod <pod-name> --namespace default
Debug Mode
Enable debug logging:
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
Contributing
To add new chaos test scenarios:
- Create a new script following the naming pattern
chaos_test_<scenario>.py - Implement the required methods:
run_test(),save_results() - Add the scenario to
chaos_orchestrator.py - Update documentation
Security Considerations
- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests
Support
For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io
License
This chaos testing framework is part of the AITBC project and follows the same license terms.