--- description: Autonomous AI skill for blockchain troubleshooting and recovery across multi-node AITBC setup title: Blockchain Troubleshoot & Recovery version: 1.0 --- # Blockchain Troubleshoot & Recovery Skill ## Purpose Autonomous AI skill for diagnosing and resolving blockchain communication issues between aitbc (genesis) and aitbc1 (follower) nodes running on port 8006 across different physical machines. ## Activation Activate this skill when: - Blockchain communication tests fail - Nodes become unreachable - Block synchronization lags (>10 blocks) - Transaction propagation times exceed thresholds - Git synchronization fails - Network latency issues detected - Service health checks fail ## Input Schema ```json { "issue_type": { "type": "string", "enum": ["connectivity", "sync_lag", "transaction_timeout", "service_failure", "git_sync_failure", "network_latency", "unknown"], "description": "Type of blockchain communication issue" }, "affected_nodes": { "type": "array", "items": {"type": "string", "enum": ["aitbc", "aitbc1", "both"]}, "description": "Nodes affected by the issue" }, "severity": { "type": "string", "enum": ["low", "medium", "high", "critical"], "description": "Severity level of the issue" }, "diagnostic_data": { "type": "object", "properties": { "error_logs": {"type": "string"}, "test_results": {"type": "object"}, "metrics": {"type": "object"} }, "description": "Diagnostic data from failed tests" }, "auto_recovery": { "type": "boolean", "default": true, "description": "Enable autonomous recovery actions" }, "recovery_timeout": { "type": "integer", "default": 300, "description": "Maximum time (seconds) for recovery attempts" } } ``` ## Output Schema ```json { "diagnosis": { "root_cause": {"type": "string"}, "affected_components": {"type": "array", "items": {"type": "string"}}, "confidence": {"type": "number", "minimum": 0, "maximum": 1} }, "recovery_actions": { "type": "array", "items": { "type": "object", "properties": { "action": {"type": "string"}, "command": {"type": "string"}, "target_node": {"type": "string"}, "status": {"type": "string", "enum": ["pending", "in_progress", "completed", "failed"]}, "result": {"type": "string"} } } }, "recovery_status": { "type": "string", "enum": ["successful", "partial", "failed", "manual_intervention_required"] }, "post_recovery_validation": { "tests_passed": {"type": "integer"}, "tests_failed": {"type": "integer"}, "metrics_restored": {"type": "boolean"} }, "recommendations": { "type": "array", "items": {"type": "string"} }, "escalation_required": { "type": "boolean" } } ``` ## Process ### 1. Diagnose Issue ```bash # Collect diagnostic information tail -100 /var/log/aitbc/blockchain-communication-test.log > /tmp/diagnostic_logs.txt tail -50 /var/log/aitbc/blockchain-test-errors.txt >> /tmp/diagnostic_logs.txt # Check service status systemctl status aitbc-blockchain-rpc --no-pager >> /tmp/diagnostic_logs.txt ssh aitbc1 'systemctl status aitbc-blockchain-rpc --no-pager' >> /tmp/diagnostic_logs.txt # Check network connectivity ping -c 5 10.1.223.40 >> /tmp/diagnostic_logs.txt ping -c 5 >> /tmp/diagnostic_logs.txt # Check port accessibility netstat -tlnp | grep 8006 >> /tmp/diagnostic_logs.txt # Check blockchain status NODE_URL=http://10.1.223.40:8006 ./aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt NODE_URL=http://:8006 ./aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt ``` ### 2. Analyze Root Cause Based on diagnostic data, identify: - Network connectivity issues (firewall, routing) - Service failures (crashes, hangs) - Synchronization problems (git, blockchain) - Resource exhaustion (CPU, memory, disk) - Configuration errors ### 3. Execute Recovery Actions #### Connectivity Recovery ```bash # Restart network services systemctl restart aitbc-blockchain-p2p ssh aitbc1 'systemctl restart aitbc-blockchain-p2p' # Check and fix firewall rules iptables -L -n | grep 8006 if [ $? -ne 0 ]; then iptables -A INPUT -p tcp --dport 8006 -j ACCEPT iptables -A OUTPUT -p tcp --sport 8006 -j ACCEPT fi # Test connectivity curl -f -s http://10.1.223.40:8006/health curl -f -s http://:8006/health ``` #### Service Recovery ```bash # Restart blockchain services systemctl restart aitbc-blockchain-rpc ssh aitbc1 'systemctl restart aitbc-blockchain-rpc' # Restart coordinator if needed systemctl restart aitbc-coordinator ssh aitbc1 'systemctl restart aitbc-coordinator' # Check service logs journalctl -u aitbc-blockchain-rpc -n 50 --no-pager ``` #### Synchronization Recovery ```bash # Force blockchain sync ./aitbc-cli cluster sync --all --yes # Git sync recovery cd /opt/aitbc git fetch origin main git reset --hard origin/main ssh aitbc1 'cd /opt/aitbc && git fetch origin main && git reset --hard origin/main' # Verify sync git log --oneline -5 ssh aitbc1 'cd /opt/aitbc && git log --oneline -5' ``` #### Resource Recovery ```bash # Clear system caches sync && echo 3 > /proc/sys/vm/drop_caches # Restart if resource exhausted systemctl restart aitbc-* ssh aitbc1 'systemctl restart aitbc-*' ``` ### 4. Validate Recovery ```bash # Run full communication test ./scripts/blockchain-communication-test.sh --full --debug # Verify all services are healthy curl http://10.1.223.40:8006/health curl http://:8006/health curl http://10.1.223.40:8001/health curl http://10.1.223.40:8000/health # Check blockchain sync NODE_URL=http://10.1.223.40:8006 ./aitbc-cli blockchain height NODE_URL=http://:8006 ./aitbc-cli blockchain height ``` ### 5. Report and Escalate - Document recovery actions taken - Provide metrics before/after recovery - Recommend preventive measures - Escalate if recovery fails or manual intervention needed ## Constraints - Maximum recovery attempts: 3 per issue type - Recovery timeout: 300 seconds per action - Cannot restart services during peak hours (9AM-5PM local time) without confirmation - Must preserve blockchain data integrity - Cannot modify wallet keys or cryptographic material - Must log all recovery actions - Escalate to human if recovery fails after 3 attempts ## Environment Assumptions - Genesis node IP: 10.1.223.40 - Follower node IP: (replace with actual IP) - Both nodes use port 8006 for blockchain RPC - SSH access to aitbc1 configured and working - AITBC CLI accessible at /opt/aitbc/aitbc-cli - Git repository: http://gitea.bubuit.net:3000/oib/aitbc.git - Log directory: /var/log/aitbc/ - Test script: /opt/aitbc/scripts/blockchain-communication-test.sh - Systemd services: aitbc-blockchain-rpc, aitbc-coordinator, aitbc-blockchain-p2p ## Error Handling ### Recovery Action Failure - Log specific failure reason - Attempt alternative recovery method - Increment failure counter - Escalate after 3 failures ### Service Restart Failure - Check service logs for errors - Verify configuration files - Check system resources - Escalate if service cannot be restarted ### Network Unreachable - Check physical network connectivity - Verify firewall rules - Check routing tables - Escalate if network issue persists ### Data Integrity Concerns - Stop all recovery actions - Preserve current state - Escalate immediately for manual review - Do not attempt automated recovery ### Timeout Exceeded - Stop current recovery action - Log timeout event - Attempt next recovery method - Escalate if all methods timeout ## Example Usage Prompts ### Basic Troubleshooting "Blockchain communication test failed on aitbc1 node. Diagnose and recover." ### Specific Issue Type "Block synchronization lag detected (>15 blocks). Perform autonomous recovery." ### Service Failure "aitbc-blockchain-rpc service crashed on genesis node. Restart and validate." ### Network Issue "Cannot reach aitbc1 node on port 8006. Troubleshoot network connectivity." ### Full Recovery "Complete blockchain communication test failed with multiple issues. Perform full autonomous recovery." ### Escalation Scenario "Recovery actions failed after 3 attempts. Prepare escalation report with diagnostic data." ## Expected Output Example ```json { "diagnosis": { "root_cause": "Network firewall blocking port 8006 on follower node", "affected_components": ["network", "firewall", "aitbc1"], "confidence": 0.95 }, "recovery_actions": [ { "action": "Check firewall rules", "command": "iptables -L -n | grep 8006", "target_node": "aitbc1", "status": "completed", "result": "Port 8006 not in allowed rules" }, { "action": "Add firewall rule", "command": "iptables -A INPUT -p tcp --dport 8006 -j ACCEPT", "target_node": "aitbc1", "status": "completed", "result": "Rule added successfully" }, { "action": "Test connectivity", "command": "curl -f -s http://:8006/health", "target_node": "aitbc1", "status": "completed", "result": "Node reachable" } ], "recovery_status": "successful", "post_recovery_validation": { "tests_passed": 5, "tests_failed": 0, "metrics_restored": true }, "recommendations": [ "Add persistent firewall rules to /etc/iptables/rules.v4", "Monitor firewall changes for future prevention", "Consider implementing network monitoring alerts" ], "escalation_required": false } ``` ## Model Routing - **Fast Model**: Use for simple, routine recoveries (service restarts, basic connectivity) - **Reasoning Model**: Use for complex diagnostics, root cause analysis, multi-step recovery - **Reasoning Model**: Use when recovery fails and escalation planning is needed ## Performance Notes - **Diagnosis Time**: 10-30 seconds depending on issue complexity - **Recovery Time**: 30-120 seconds per recovery action - **Validation Time**: 60-180 seconds for full test suite - **Memory Usage**: <500MB during recovery operations - **Network Impact**: Minimal during diagnostics, moderate during git sync - **Concurrency**: Can handle single issue recovery; multiple issues should be queued - **Optimization**: Cache diagnostic data to avoid repeated collection - **Rate Limiting**: Limit service restarts to prevent thrashing - **Logging**: All actions logged with timestamps for audit trail ## Related Skills - [aitbc-node-coordinator](/aitbc-node-coordinator.md) - For cross-node coordination during recovery - [openclaw-error-handler](/openclaw-error-handler.md) - For error handling and escalation - [openclaw-coordination-orchestrator](/openclaw-coordination-orchestrator.md) - For multi-node recovery coordination ## Related Workflows - [Blockchain Communication Test](/workflows/blockchain-communication-test.md) - Testing workflow that triggers this skill - [Multi-Node Operations](/workflows/multi-node-blockchain-operations.md) - General node operations