description: Autonomous AI skill for blockchain troubleshooting and recovery across multi-node AITBC setup
title: Blockchain Troubleshoot & Recovery
version: 1.0

Blockchain Troubleshoot & Recovery Skill

Purpose

Autonomous AI skill for diagnosing and resolving blockchain communication issues between aitbc (genesis) and aitbc1 (follower) nodes running on port 8006 across different physical machines.

Activation

Activate this skill when:

  • Blockchain communication tests fail
  • Nodes become unreachable
  • Block synchronization lags (>10 blocks)
  • Transaction propagation times exceed thresholds
  • Git synchronization fails
  • Network latency issues are detected
  • Service health checks fail

Input Schema

{
  "issue_type": {
    "type": "string",
    "enum": ["connectivity", "sync_lag", "transaction_timeout", "service_failure", "git_sync_failure", "network_latency", "unknown"],
    "description": "Type of blockchain communication issue"
  },
  "affected_nodes": {
    "type": "array",
    "items": {"type": "string", "enum": ["aitbc", "aitbc1", "both"]},
    "description": "Nodes affected by the issue"
  },
  "severity": {
    "type": "string",
    "enum": ["low", "medium", "high", "critical"],
    "description": "Severity level of the issue"
  },
  "diagnostic_data": {
    "type": "object",
    "properties": {
      "error_logs": {"type": "string"},
      "test_results": {"type": "object"},
      "metrics": {"type": "object"}
    },
    "description": "Diagnostic data from failed tests"
  },
  "auto_recovery": {
    "type": "boolean",
    "default": true,
    "description": "Enable autonomous recovery actions"
  },
  "recovery_timeout": {
    "type": "integer",
    "default": 300,
    "description": "Maximum time (seconds) for recovery attempts"
  }
}
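
Before any diagnostics run, a caller may want to reject input that does not match the schema. The following is a minimal sketch, assuming the enum values and the `recovery_timeout` default shown above; the function names are hypothetical and are not part of the AITBC tooling.

```shell
# Hypothetical helpers; enum values are copied from the Input Schema.

# Succeeds only if $1 is one of the schema's issue_type values.
valid_issue_type() {
    case "$1" in
        connectivity|sync_lag|transaction_timeout|service_failure|git_sync_failure|network_latency|unknown)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Succeeds only if $1 is one of the schema's severity values.
valid_severity() {
    case "$1" in
        low|medium|high|critical) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the recovery timeout in seconds, falling back to the schema
# default (300) when $1 is empty, non-numeric, or zero.
effective_timeout() {
    case "$1" in
        ''|*[!0-9]*) echo 300 ;;
        *) if [ "$1" -gt 0 ]; then echo "$1"; else echo 300; fi ;;
    esac
}
```

For example, `valid_issue_type sync_lag && valid_severity high` succeeds, while an unlisted value such as `disk_full` is rejected.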

Output Schema

{
  "diagnosis": {
    "type": "object",
    "properties": {
      "root_cause": {"type": "string"},
      "affected_components": {"type": "array", "items": {"type": "string"}},
      "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    }
  },
  "recovery_actions": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "action": {"type": "string"},
        "command": {"type": "string"},
        "target_node": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "in_progress", "completed", "failed"]},
        "result": {"type": "string"}
      }
    }
  },
  "recovery_status": {
    "type": "string",
    "enum": ["successful", "partial", "failed", "manual_intervention_required"]
  },
  "post_recovery_validation": {
    "type": "object",
    "properties": {
      "tests_passed": {"type": "integer"},
      "tests_failed": {"type": "integer"},
      "metrics_restored": {"type": "boolean"}
    }
  },
  "recommendations": {
    "type": "array",
    "items": {"type": "string"}
  },
  "escalation_required": {
    "type": "boolean"
  }
}
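
The schema lists the allowed `recovery_status` values but does not say how to choose between them. One possible rule, sketched below, derives the status from how many recovery actions ended as `completed` versus `failed`. The mapping and the function name are assumptions; `manual_intervention_required` is left to the caller because it cannot be inferred from counts alone.

```shell
# Hypothetical mapping from action counts to the recovery_status enum.
# $1 = number of actions with status "completed"
# $2 = number of actions with status "failed"
recovery_status() (
    completed=$1
    failed=$2
    if [ "$completed" -gt 0 ] && [ "$failed" -eq 0 ]; then
        echo successful
    elif [ "$completed" -gt 0 ]; then
        echo partial
    else
        echo failed
    fi
)
```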

Process

1. Diagnose Issue

# Collect diagnostic information
tail -n 100 /var/log/aitbc/blockchain-communication-test.log > /tmp/diagnostic_logs.txt
tail -n 50 /var/log/aitbc/blockchain-test-errors.txt >> /tmp/diagnostic_logs.txt

# Check service status (2>&1 keeps error output in the diagnostic file)
systemctl status aitbc-blockchain-rpc --no-pager >> /tmp/diagnostic_logs.txt 2>&1
ssh aitbc1 'systemctl status aitbc-blockchain-rpc --no-pager' >> /tmp/diagnostic_logs.txt 2>&1

# Check network connectivity
ping -c 5 10.1.223.40 >> /tmp/diagnostic_logs.txt 2>&1
ping -c 5 <aitbc1-ip> >> /tmp/diagnostic_logs.txt 2>&1

# Check port accessibility
netstat -tlnp | grep ':8006' >> /tmp/diagnostic_logs.txt

# Check blockchain status (absolute CLI path, so the working directory does not matter)
NODE_URL=http://10.1.223.40:8006 /opt/aitbc/aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt 2>&1
NODE_URL=http://<aitbc1-ip>:8006 /opt/aitbc/aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt 2>&1

2. Analyze Root Cause

Based on diagnostic data, identify:

  • Network connectivity issues (firewall, routing)
  • Service failures (crashes, hangs)
  • Synchronization problems (git, blockchain)
  • Resource exhaustion (CPU, memory, disk)
  • Configuration errors
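
The probe results from step 1 can be turned into an `issue_type` value with a simple ordered check. The triage order below is an assumption (network first, then service state, then the health endpoint, then sync lag), and `classify_issue` is a hypothetical name. Treat it as a sketch of the idea, not as the skill's actual decision logic.

```shell
# Hypothetical triage; each argument is a probe result gathered in step 1.
# $1 = ping result (ok|fail)
# $2 = RPC service state as printed by `systemctl is-active` (e.g. active, failed)
# $3 = /health result (ok|fail)
# $4 = block height difference between the two nodes
classify_issue() (
    ping=$1 service=$2 health=$3 lag=$4
    if [ "$ping" != ok ]; then
        echo connectivity        # host unreachable: routing or physical link
    elif [ "$service" != active ]; then
        echo service_failure     # host is up but the RPC service is not
    elif [ "$health" != ok ]; then
        echo connectivity        # service is up but port 8006 is not reachable
    elif [ "$lag" -gt 10 ]; then
        echo sync_lag            # same >10 block threshold as the activation list
    else
        echo unknown
    fi
)
```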

3. Execute Recovery Actions

Connectivity Recovery

# Restart network services
systemctl restart aitbc-blockchain-p2p
ssh aitbc1 'systemctl restart aitbc-blockchain-p2p'

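# Check and fix firewall rules (-C tests whether the exact rule already exists).
# Rules added with -A go last, so an earlier DROP/REJECT rule still takes precedence.
if ! iptables -C INPUT -p tcp --dport 8006 -j ACCEPT 2>/dev/null; then
    iptables -A INPUT -p tcp --dport 8006 -j ACCEPT
fi
if ! iptables -C OUTPUT -p tcp --sport 8006 -j ACCEPT 2>/dev/null; then
    iptables -A OUTPUT -p tcp --sport 8006 -j ACCEPT
fi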

# Test connectivity
curl -f -s http://10.1.223.40:8006/health
curl -f -s http://<aitbc1-ip>:8006/health
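
A service may need a few seconds after a restart before its health endpoint answers, so a single `curl` can report a false failure. The helper below is a minimal sketch of a bounded polling loop. `wait_until` is a hypothetical name, and the attempt count and delay are left to the caller.

```shell
# Hypothetical polling helper: runs a probe command until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = probe command
wait_until() (
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            exit 0
        fi
        i=$((i + 1))
        # Skip the sleep once the last attempt has failed.
        if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
    done
    echo "not ready after $attempts attempt(s)"
    exit 1
)
```

A call such as `wait_until 10 3 curl -f -s http://10.1.223.40:8006/health` would probe the genesis node up to ten times, three seconds apart.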

Service Recovery

# Restart blockchain services
systemctl restart aitbc-blockchain-rpc
ssh aitbc1 'systemctl restart aitbc-blockchain-rpc'

# Restart coordinator if needed
systemctl restart aitbc-coordinator
ssh aitbc1 'systemctl restart aitbc-coordinator'

# Check service logs
journalctl -u aitbc-blockchain-rpc -n 50 --no-pager

Synchronization Recovery

# Run from the repository root so ./aitbc-cli resolves
cd /opt/aitbc

# Force blockchain sync
./aitbc-cli cluster sync --all --yes

# Git sync recovery (reset --hard discards uncommitted local changes)
git fetch origin main
git reset --hard origin/main
ssh aitbc1 'cd /opt/aitbc && git fetch origin main && git reset --hard origin/main'

# Verify sync
git log --oneline -5
ssh aitbc1 'cd /opt/aitbc && git log --oneline -5'

Resource Recovery

# Clear system caches
sync && echo 3 > /proc/sys/vm/drop_caches

# Restart if resource exhausted (quote the glob so systemctl, not the shell, expands it)
systemctl restart 'aitbc-*'
ssh aitbc1 "systemctl restart 'aitbc-*'"
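
Restarting every service is disruptive, so it helps to confirm that a resource is actually exhausted first. The sketch below reads disk usage from `df -P` output; the 90% default threshold and both function names are assumptions, not values taken from the AITBC configuration.

```shell
# Reads `df -P` output on stdin and prints the used percentage
# (without the % sign) from the first data row.
used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage ($1) is at or above the threshold
# ($2, default 90, which is an assumed value).
exhausted() {
    [ "$1" -ge "${2:-90}" ]
}
```

For the log directory this could be used as `exhausted "$(df -P /var/log/aitbc | used_pct)" && echo "disk nearly full"`.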

4. Validate Recovery

# Run full communication test
/opt/aitbc/scripts/blockchain-communication-test.sh --full --debug

# Verify all services are healthy (-f makes curl exit non-zero on HTTP errors)
curl -f -s http://10.1.223.40:8006/health
curl -f -s http://<aitbc1-ip>:8006/health
curl -f -s http://10.1.223.40:8001/health
curl -f -s http://10.1.223.40:8000/health

# Check blockchain sync
NODE_URL=http://10.1.223.40:8006 /opt/aitbc/aitbc-cli blockchain height
NODE_URL=http://<aitbc1-ip>:8006 /opt/aitbc/aitbc-cli blockchain height
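
The two heights only show that recovery worked once they are compared. A small sketch of that comparison follows, using the 10-block limit from the activation criteria as the default. The helper names are hypothetical, and the sketch assumes that `blockchain height` prints a bare integer, which the source does not confirm.

```shell
# Hypothetical helper: prints the absolute difference between two block heights.
height_lag() (
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
    echo "$diff"
)

# Succeeds when the nodes are within the allowed lag ($3, default 10 blocks).
in_sync() {
    [ "$(height_lag "$1" "$2")" -le "${3:-10}" ]
}
```

With the genesis height in `$g` and the follower height in `$f`, `in_sync "$g" "$f" || echo "sync lag: $(height_lag "$g" "$f") blocks"` reports a remaining gap.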

5. Report and Escalate

  • Document recovery actions taken
  • Provide metrics before/after recovery
  • Recommend preventive measures
  • Escalate if recovery fails or manual intervention is needed

Constraints

  • Maximum recovery attempts: 3 per issue type
  • Recovery timeout: 300 seconds per action
  • Cannot restart services during peak hours (9AM-5PM local time) without confirmation
  • Must preserve blockchain data integrity
  • Cannot modify wallet keys or cryptographic material
  • Must log all recovery actions
  • Escalate to a human operator if recovery fails after 3 attempts
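
The attempt limit and the peak-hours rule can be checked together before any restart. The guard below is a sketch under two assumptions: the peak window is read as 09:00 through 16:59, and the function names are hypothetical.

```shell
# Hypothetical guards for the restart constraints.

# Succeeds when the hour in $1 (0-23) falls inside the 9AM-5PM window.
# Hour 17 itself is treated as off-peak, which is an interpretation.
in_peak_hours() {
    [ "$1" -ge 9 ] && [ "$1" -lt 17 ]
}

# Prints what the caller should do next.
# $1 = current hour, $2 = attempts already made for this issue type,
# $3 = "yes" if an operator confirmed the restart
restart_decision() {
    if [ "$2" -ge 3 ]; then
        echo escalate
    elif in_peak_hours "$1" && [ "$3" != yes ]; then
        echo needs_confirmation
    else
        echo proceed
    fi
}
```

The current hour can be passed as `"$(date +%H | sed 's/^0//')"`, which strips the leading zero so the value compares as a plain integer.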

Environment Assumptions

  • Genesis node IP: 10.1.223.40
  • Follower node IP: <aitbc1-ip> (replace with the actual IP)
  • Both nodes use port 8006 for blockchain RPC
  • SSH access to aitbc1 configured and working
  • AITBC CLI accessible at /opt/aitbc/aitbc-cli
  • Git repository: http://gitea.bubuit.net:3000/oib/aitbc.git
  • Log directory: /var/log/aitbc/
  • Test script: /opt/aitbc/scripts/blockchain-communication-test.sh
  • Systemd services: aitbc-blockchain-rpc, aitbc-coordinator, aitbc-blockchain-p2p

Error Handling

Recovery Action Failure

  • Log specific failure reason
  • Attempt alternative recovery method
  • Increment failure counter
  • Escalate after 3 failures

Service Restart Failure

  • Check service logs for errors
  • Verify configuration files
  • Check system resources
  • Escalate if service cannot be restarted

Network Unreachable

  • Check physical network connectivity
  • Verify firewall rules
  • Check routing tables
  • Escalate if network issue persists

Data Integrity Concerns

  • Stop all recovery actions
  • Preserve current state
  • Escalate immediately for manual review
  • Do not attempt automated recovery

Timeout Exceeded

  • Stop current recovery action
  • Log timeout event
  • Attempt next recovery method
  • Escalate if all methods time out
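
The timeout handling above amounts to a fallback chain: try a method, give up on it when the time limit is hit, move to the next one, and escalate when none is left. A minimal sketch follows, assuming the coreutils `timeout` command is available (it exits with status 124 when the limit is reached). `try_methods` is a hypothetical name.

```shell
# Hypothetical fallback chain: tries each recovery command in order and
# gives each one at most $1 seconds.
# $1 = time limit in seconds, rest = one command string per method
try_methods() (
    limit=$1
    shift
    for method in "$@"; do
        status=0
        timeout "$limit" sh -c "$method" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 0 ]; then
            echo "recovered: $method"
            exit 0
        elif [ "$status" -eq 124 ]; then
            echo "timeout: $method"
        else
            echo "failed: $method"
        fi
    done
    echo "escalate"
    exit 1
)
```

A call might look like `try_methods 300 "systemctl restart aitbc-blockchain-rpc" "systemctl restart aitbc-coordinator"`, where 300 matches the per-action timeout listed under Constraints.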

Example Usage Prompts

Basic Troubleshooting

"Blockchain communication test failed on aitbc1 node. Diagnose and recover."

Specific Issue Type

"Block synchronization lag detected (>15 blocks). Perform autonomous recovery."

Service Failure

"aitbc-blockchain-rpc service crashed on genesis node. Restart and validate."

Network Issue

"Cannot reach aitbc1 node on port 8006. Troubleshoot network connectivity."

Full Recovery

"Complete blockchain communication test failed with multiple issues. Perform full autonomous recovery."

Escalation Scenario

"Recovery actions failed after 3 attempts. Prepare escalation report with diagnostic data."

Expected Output Example

{
  "diagnosis": {
    "root_cause": "Network firewall blocking port 8006 on follower node",
    "affected_components": ["network", "firewall", "aitbc1"],
    "confidence": 0.95
  },
  "recovery_actions": [
    {
      "action": "Check firewall rules",
      "command": "iptables -L -n | grep 8006",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Port 8006 not in allowed rules"
    },
    {
      "action": "Add firewall rule",
      "command": "iptables -A INPUT -p tcp --dport 8006 -j ACCEPT",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Rule added successfully"
    },
    {
      "action": "Test connectivity",
      "command": "curl -f -s http://<aitbc1-ip>:8006/health",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Node reachable"
    }
  ],
  "recovery_status": "successful",
  "post_recovery_validation": {
    "tests_passed": 5,
    "tests_failed": 0,
    "metrics_restored": true
  },
  "recommendations": [
    "Add persistent firewall rules to /etc/iptables/rules.v4",
    "Monitor firewall changes for future prevention",
    "Consider implementing network monitoring alerts"
  ],
  "escalation_required": false
}

Model Routing

  • Fast Model: Use for simple, routine recoveries (service restarts, basic connectivity)
  • Reasoning Model: Use for complex diagnostics, root cause analysis, multi-step recovery
  • Reasoning Model: Use when recovery fails and escalation planning is needed
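
The routing rules can be written as a small lookup. The sketch below treats `service_failure` and `connectivity` as the routine cases and sends everything else, along with any issue that has already had a failed attempt, to the reasoning model. That split is an assumed reading of the bullets above, and `route_model` is a hypothetical name.

```shell
# Hypothetical router: maps an issue to a model tier.
# $1 = issue_type, $2 = number of failed recovery attempts so far (default 0)
route_model() {
    if [ "${2:-0}" -gt 0 ]; then
        # A prior attempt failed: the next step or the escalation needs planning.
        echo reasoning
        return
    fi
    case "$1" in
        service_failure|connectivity) echo fast ;;
        *) echo reasoning ;;
    esac
}
```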

Performance Notes

  • Diagnosis Time: 10-30 seconds depending on issue complexity
  • Recovery Time: 30-120 seconds per recovery action
  • Validation Time: 60-180 seconds for full test suite
  • Memory Usage: <500MB during recovery operations
  • Network Impact: Minimal during diagnostics, moderate during git sync
  • Concurrency: Can handle single issue recovery; multiple issues should be queued
  • Optimization: Cache diagnostic data to avoid repeated collection
  • Rate Limiting: Limit service restarts to prevent thrashing
  • Logging: All actions logged with timestamps for audit trail
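
The rate-limiting and logging notes can be combined into two small helpers. This is a sketch only: the state-file approach, the minimum gap between restarts, and both function names are assumptions, and the log line format is illustrative.

```shell
# Hypothetical restart throttle backed by a timestamp file.
# $1 = state file, $2 = minimum seconds between restarts, $3 = current Unix time
# Succeeds and records the time if a restart is allowed; fails otherwise.
restart_allowed() (
    state=$1 min_gap=$2 now=$3
    last=0
    if [ -f "$state" ]; then last=$(cat "$state"); fi
    if [ $(( now - last )) -lt "$min_gap" ]; then
        exit 1
    fi
    echo "$now" > "$state"
)

# Appends a UTC-timestamped line to the log file in $1 for the audit trail.
log_action() (
    file=$1
    shift
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
)
```

A guarded restart could then read `restart_allowed /tmp/aitbc-restart.ts 60 "$(date +%s)" && log_action /var/log/aitbc/recovery.log restart aitbc-blockchain-rpc`, where both file paths are placeholders.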