

description	title	version
Autonomous AI skill for blockchain troubleshooting and recovery across multi-node AITBC setup	Blockchain Troubleshoot & Recovery	1.0

Blockchain Troubleshoot & Recovery Skill

Purpose

Autonomous AI skill for diagnosing and resolving blockchain communication issues between aitbc (genesis) and aitbc1 (follower) nodes running on port 8006 across different physical machines.

Activation

Activate this skill when:

Blockchain communication tests fail
Nodes become unreachable
Block synchronization lags (>10 blocks)
Transaction propagation times exceed thresholds
Git synchronization fails
Network latency issues detected
Service health checks fail
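
The synchronization-lag trigger above can be expressed as a small check. This is a minimal sketch, assuming both block heights have already been obtained as plain integers; the function name `sync_lag_exceeded` is hypothetical and not part of the AITBC tooling.

```shell
# Hypothetical helper: succeed (exit 0) when the follower trails the genesis
# node by more than the allowed number of blocks.
# Arguments: genesis_height follower_height [max_lag, default 10]
sync_lag_exceeded() {
    local genesis_height=$1 follower_height=$2 max_lag=${3:-10}
    local lag=$(( genesis_height - follower_height ))
    # Strictly greater than the threshold, matching ">10 blocks" above.
    [ "$lag" -gt "$max_lag" ]
}
```

For example, `sync_lag_exceeded 120 105` succeeds (lag of 15), while `sync_lag_exceeded 120 110` does not (lag of exactly 10).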

Input Schema

{
  "issue_type": {
    "type": "string",
    "enum": ["connectivity", "sync_lag", "transaction_timeout", "service_failure", "git_sync_failure", "network_latency", "unknown"],
    "description": "Type of blockchain communication issue"
  },
  "affected_nodes": {
    "type": "array",
    "items": {"type": "string", "enum": ["aitbc", "aitbc1", "both"]},
    "description": "Nodes affected by the issue"
  },
  "severity": {
    "type": "string",
    "enum": ["low", "medium", "high", "critical"],
    "description": "Severity level of the issue"
  },
  "diagnostic_data": {
    "type": "object",
    "properties": {
      "error_logs": {"type": "string"},
      "test_results": {"type": "object"},
      "metrics": {"type": "object"}
    },
    "description": "Diagnostic data from failed tests"
  },
  "auto_recovery": {
    "type": "boolean",
    "default": true,
    "description": "Enable autonomous recovery actions"
  },
  "recovery_timeout": {
    "type": "integer",
    "default": 300,
    "description": "Maximum time (seconds) for recovery attempts"
  }
}
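
Before acting on an input, the enum fields of the schema above can be checked with plain `case` statements. The sketch below mirrors the `issue_type` and `severity` values listed in the schema; the function names are hypothetical.

```shell
# Hypothetical validators mirroring the enum values of the input schema.
valid_issue_type() {
    case "$1" in
        connectivity|sync_lag|transaction_timeout|service_failure|git_sync_failure|network_latency|unknown)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

valid_severity() {
    case "$1" in
        low|medium|high|critical) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could reject a request early with `valid_issue_type "$issue_type" || echo "unsupported issue_type" >&2`.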

Output Schema

{
  "diagnosis": {
    "root_cause": {"type": "string"},
    "affected_components": {"type": "array", "items": {"type": "string"}},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "recovery_actions": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "action": {"type": "string"},
        "command": {"type": "string"},
        "target_node": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "in_progress", "completed", "failed"]},
        "result": {"type": "string"}
      }
    }
  },
  "recovery_status": {
    "type": "string",
    "enum": ["successful", "partial", "failed", "manual_intervention_required"]
  },
  "post_recovery_validation": {
    "tests_passed": {"type": "integer"},
    "tests_failed": {"type": "integer"},
    "metrics_restored": {"type": "boolean"}
  },
  "recommendations": {
    "type": "array",
    "items": {"type": "string"}
  },
  "escalation_required": {
    "type": "boolean"
  }
}

Process

1. Diagnose Issue

# Collect diagnostic information
tail -n 100 /var/log/aitbc/blockchain-communication-test.log > /tmp/diagnostic_logs.txt
tail -n 50 /var/log/aitbc/blockchain-test-errors.txt >> /tmp/diagnostic_logs.txt

# Check service status
systemctl status aitbc-blockchain-rpc --no-pager >> /tmp/diagnostic_logs.txt
ssh aitbc1 'systemctl status aitbc-blockchain-rpc --no-pager' >> /tmp/diagnostic_logs.txt

# Check network connectivity
ping -c 5 10.1.223.40 >> /tmp/diagnostic_logs.txt
ping -c 5 <aitbc1-ip> >> /tmp/diagnostic_logs.txt

# Check port accessibility
netstat -tlnp | grep 8006 >> /tmp/diagnostic_logs.txt

# Check blockchain status
NODE_URL=http://10.1.223.40:8006 ./aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt
NODE_URL=http://<aitbc1-ip>:8006 ./aitbc-cli blockchain info --verbose >> /tmp/diagnostic_logs.txt

2. Analyze Root Cause

Based on diagnostic data, identify:

Network connectivity issues (firewall, routing)
Service failures (crashes, hangs)
Synchronization problems (git, blockchain)
Resource exhaustion (CPU, memory, disk)
Configuration errors
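
A first-pass classification of the collected diagnostics can be sketched with `grep`. This is only a heuristic under stated assumptions: the patterns are common strings from `systemctl status`, `ping`, and kernel or filesystem errors, the category names are hypothetical, and the order of the checks (resources, then services, then network) is an arbitrary choice rather than something this skill prescribes.

```shell
# Hypothetical heuristic: print a coarse root-cause category for a log file.
# Argument: path to the collected diagnostics (e.g. /tmp/diagnostic_logs.txt)
classify_root_cause() {
    local log_file=$1
    if grep -Eq 'No space left on device|Out of memory|Cannot allocate memory' "$log_file"; then
        echo "resource_exhaustion"
    elif grep -Eq 'Active: (failed|inactive)' "$log_file"; then
        echo "service_failure"
    elif grep -Eq '100% packet loss|Connection refused|No route to host|Network is unreachable' "$log_file"; then
        echo "network_connectivity"
    else
        # Nothing matched: leave the decision to deeper analysis.
        echo "unknown"
    fi
}
```

A result of `unknown` does not mean the system is healthy; it only means none of the assumed patterns appeared in the log.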

3. Execute Recovery Actions

Connectivity Recovery

# Restart network services
systemctl restart aitbc-blockchain-p2p
ssh aitbc1 'systemctl restart aitbc-blockchain-p2p'

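# Check and fix firewall rules
# -C tests for the exact rule; a plain grep for 8006 would also match DROP or REJECT rules
if ! iptables -C INPUT -p tcp --dport 8006 -j ACCEPT 2>/dev/null; then
    iptables -A INPUT -p tcp --dport 8006 -j ACCEPT
fi
if ! iptables -C OUTPUT -p tcp --sport 8006 -j ACCEPT 2>/dev/null; then
    iptables -A OUTPUT -p tcp --sport 8006 -j ACCEPT
fi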

# Test connectivity
curl -f -s http://10.1.223.40:8006/health
curl -f -s http://<aitbc1-ip>:8006/health

Service Recovery

# Restart blockchain services
systemctl restart aitbc-blockchain-rpc
ssh aitbc1 'systemctl restart aitbc-blockchain-rpc'

# Restart coordinator if needed
systemctl restart aitbc-coordinator
ssh aitbc1 'systemctl restart aitbc-coordinator'

# Check service logs
journalctl -u aitbc-blockchain-rpc -n 50 --no-pager

Synchronization Recovery

# Force blockchain sync
./aitbc-cli cluster sync --all --yes

# Git sync recovery
cd /opt/aitbc
git fetch origin main
git reset --hard origin/main
ssh aitbc1 'cd /opt/aitbc && git fetch origin main && git reset --hard origin/main'

# Verify sync
git log --oneline -5
ssh aitbc1 'cd /opt/aitbc && git log --oneline -5'

Resource Recovery

# Clear system caches
sync && echo 3 > /proc/sys/vm/drop_caches

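# Restart if resource exhausted
# Quote the pattern so systemctl, not the shell, matches it against loaded units
# (unquoted, it would expand to files such as aitbc-cli in the current directory)
systemctl restart 'aitbc-*'
ssh aitbc1 "systemctl restart 'aitbc-*'"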

4. Validate Recovery

# Run full communication test
./scripts/blockchain-communication-test.sh --full --debug

# Verify all services are healthy
curl -f -s http://10.1.223.40:8006/health
curl -f -s http://<aitbc1-ip>:8006/health
curl -f -s http://10.1.223.40:8001/health
curl -f -s http://10.1.223.40:8000/health

# Check blockchain sync
NODE_URL=http://10.1.223.40:8006 ./aitbc-cli blockchain height
NODE_URL=http://<aitbc1-ip>:8006 ./aitbc-cli blockchain height
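
The health checks above can be tallied into the `tests_passed` and `tests_failed` fields of the output schema. The sketch below is one possible shape, assuming each check is a command that exits non-zero on failure; `count_checks` is a hypothetical name, and the check command is passed in so the helper can be exercised without network access.

```shell
# Hypothetical helper: run a check command against each target and print a
# summary using the tests_passed / tests_failed field names from the schema.
# Usage: count_checks CHECK_COMMAND TARGET...
count_checks() {
    local check=$1 passed=0 failed=0 target
    shift
    for target in "$@"; do
        # $check is left unquoted on purpose so it can carry its own flags.
        if $check "$target" >/dev/null 2>&1; then
            passed=$(( passed + 1 ))
        else
            failed=$(( failed + 1 ))
        fi
    done
    echo "tests_passed=$passed tests_failed=$failed"
}
```

With the endpoints used in this step, a call might look like `count_checks "curl -f -s" http://10.1.223.40:8006/health http://10.1.223.40:8001/health`.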

5. Report and Escalate

Document recovery actions taken
Provide metrics before/after recovery
Recommend preventive measures
Escalate if recovery fails or manual intervention is needed
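
Documenting each action is easier when every step goes through one logging helper. The following is a sketch, assuming a log file under the `/var/log/aitbc/` directory named in this document; the file name `recovery-actions.log`, the `RECOVERY_LOG` variable, and the line format are all illustrative assumptions.

```shell
# Hypothetical helper: append one timestamped line per recovery action.
# RECOVERY_LOG is an assumed variable; the default file name is illustrative.
RECOVERY_LOG=${RECOVERY_LOG:-/var/log/aitbc/recovery-actions.log}

# Usage: log_action NODE STATUS DESCRIPTION...
# STATUS is expected to be one of: pending, in_progress, completed, failed
log_action() {
    local node=$1 status=$2
    shift 2
    # UTC timestamp in ISO 8601 form, then node, status, and free-form text.
    printf '%s node=%s status=%s action="%s"\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$node" "$status" "$*" >> "$RECOVERY_LOG"
}
```

For instance, `log_action aitbc1 completed Restart aitbc-blockchain-rpc` appends a single line that can later be quoted in the escalation report.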

Constraints

Maximum recovery attempts: 3 per issue type
Recovery timeout: 300 seconds per action
Cannot restart services during peak hours (9AM-5PM local time) without confirmation
Must preserve blockchain data integrity
Cannot modify wallet keys or cryptographic material
Must log all recovery actions
Escalate to human if recovery fails after 3 attempts
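
The peak-hours constraint can be enforced with a small guard before any restart. This sketch assumes that "9AM-5PM" means 09:00 through 16:59 local time; the function name `is_peak_hour` is hypothetical.

```shell
# Hypothetical guard for the peak-hours constraint.
# Assumption: the window covers hours 9 through 16 (09:00-16:59 local time).
# Argument: hour of day (0-23); defaults to the current local hour.
is_peak_hour() {
    local hour=${1:-$(date +%H)}
    # Drop one leading zero ("09" -> "9") so the value is a plain decimal.
    hour=${hour#0}
    hour=${hour:-0}
    [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]
}
```

A restart wrapper could then start with `if is_peak_hour; then echo "confirmation required before restart" >&2; fi` and proceed only once confirmation is given.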

Environment Assumptions

Genesis node IP: 10.1.223.40
Follower node IP: (replace with actual IP)
Both nodes use port 8006 for blockchain RPC
SSH access to aitbc1 configured and working
AITBC CLI accessible at /opt/aitbc/aitbc-cli
Git repository: http://gitea.bubuit.net:3000/oib/aitbc.git
Log directory: /var/log/aitbc/
Test script: /opt/aitbc/scripts/blockchain-communication-test.sh
Systemd services: aitbc-blockchain-rpc, aitbc-coordinator, aitbc-blockchain-p2p

Error Handling

Recovery Action Failure

Log specific failure reason
Attempt alternative recovery method
Increment failure counter
Escalate after 3 failures
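
The failure counter and three-attempt limit can be sketched as a retry wrapper. Note that this simplified version retries the same command rather than switching to an alternative method; the name `run_with_retries`, the `MAX_ATTEMPTS` variable, and the use of exit status 2 to signal escalation are assumptions for illustration.

```shell
# Hypothetical wrapper: try a recovery command up to MAX_ATTEMPTS times and
# signal escalation (exit status 2) once the limit is reached.
MAX_ATTEMPTS=3

# Usage: run_with_retries COMMAND [ARG...]
run_with_retries() {
    local attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        if "$@"; then
            echo "attempt $attempt succeeded: $*"
            return 0
        fi
        # Log the failure and increment the counter before the next try.
        echo "attempt $attempt failed: $*" >&2
        attempt=$(( attempt + 1 ))
    done
    echo "escalation required after $MAX_ATTEMPTS failed attempts: $*" >&2
    return 2
}
```

An alternative recovery method can be tried by chaining calls, for example `run_with_retries first_method || run_with_retries second_method`, where both method names are placeholders.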

Service Restart Failure

Check service logs for errors
Verify configuration files
Check system resources
Escalate if service cannot be restarted

Network Unreachable

Check physical network connectivity
Verify firewall rules
Check routing tables
Escalate if network issue persists

Data Integrity Concerns

Stop all recovery actions
Preserve current state
Escalate immediately for manual review
Do not attempt automated recovery

Timeout Exceeded

Stop current recovery action
Log timeout event
Attempt next recovery method
Escalate if all methods time out
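
Timeout handling can be sketched with the coreutils `timeout` command, which exits with status 124 when the limit is reached. The wrapper name `run_with_timeout` and the `RECOVERY_TIMEOUT` variable are assumptions; the default of 300 seconds matches the limit stated under Constraints.

```shell
# Hypothetical wrapper around coreutils `timeout`.
# RECOVERY_TIMEOUT is an assumed variable holding the limit in seconds.
RECOVERY_TIMEOUT=${RECOVERY_TIMEOUT:-300}

# Usage: run_with_timeout COMMAND [ARG...]
run_with_timeout() {
    local rc=0
    timeout "$RECOVERY_TIMEOUT" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status `timeout` uses when the time limit was hit.
        echo "timeout after ${RECOVERY_TIMEOUT}s: $*" >&2
    fi
    return "$rc"
}
```

The caller can test for status 124 to decide whether to move on to the next recovery method or to escalate.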

Example Usage Prompts

Basic Troubleshooting

"Blockchain communication test failed on aitbc1 node. Diagnose and recover."

Specific Issue Type

"Block synchronization lag detected (>15 blocks). Perform autonomous recovery."

Service Failure

"aitbc-blockchain-rpc service crashed on genesis node. Restart and validate."

Network Issue

"Cannot reach aitbc1 node on port 8006. Troubleshoot network connectivity."

Full Recovery

"Complete blockchain communication test failed with multiple issues. Perform full autonomous recovery."

Escalation Scenario

"Recovery actions failed after 3 attempts. Prepare escalation report with diagnostic data."

Expected Output Example

{
  "diagnosis": {
    "root_cause": "Network firewall blocking port 8006 on follower node",
    "affected_components": ["network", "firewall", "aitbc1"],
    "confidence": 0.95
  },
  "recovery_actions": [
    {
      "action": "Check firewall rules",
      "command": "iptables -L -n | grep 8006",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Port 8006 not in allowed rules"
    },
    {
      "action": "Add firewall rule",
      "command": "iptables -A INPUT -p tcp --dport 8006 -j ACCEPT",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Rule added successfully"
    },
    {
      "action": "Test connectivity",
      "command": "curl -f -s http://<aitbc1-ip>:8006/health",
      "target_node": "aitbc1",
      "status": "completed",
      "result": "Node reachable"
    }
  ],
  "recovery_status": "successful",
  "post_recovery_validation": {
    "tests_passed": 5,
    "tests_failed": 0,
    "metrics_restored": true
  },
  "recommendations": [
    "Add persistent firewall rules to /etc/iptables/rules.v4",
    "Monitor firewall changes for future prevention",
    "Consider implementing network monitoring alerts"
  ],
  "escalation_required": false
}

Model Routing

Fast Model: Use for simple, routine recoveries (service restarts, basic connectivity)
Reasoning Model: Use for complex diagnostics, root cause analysis, multi-step recovery
Reasoning Model: Use when recovery fails and escalation planning is needed
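
The routing rules above could be encoded as a simple lookup. This is a sketch only: mapping `service_failure` and `connectivity` to the fast model is an interpretation of "service restarts, basic connectivity", and the function name and the `fast` / `reasoning` labels are hypothetical.

```shell
# Hypothetical router: print which model class to use for an issue.
# Arguments: issue_type [failed_attempts, default 0]
route_model() {
    local issue_type=$1 failed_attempts=${2:-0}
    # Any earlier failed recovery goes to the reasoning model.
    if [ "$failed_attempts" -gt 0 ]; then
        echo "reasoning"
        return 0
    fi
    case "$issue_type" in
        service_failure|connectivity) echo "fast" ;;
        *) echo "reasoning" ;;
    esac
}
```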

Performance Notes

Diagnosis Time: 10-30 seconds depending on issue complexity
Recovery Time: 30-120 seconds per recovery action
Validation Time: 60-180 seconds for full test suite
Memory Usage: <500MB during recovery operations
Network Impact: Minimal during diagnostics, moderate during git sync
Concurrency: Can handle single issue recovery; multiple issues should be queued
Optimization: Cache diagnostic data to avoid repeated collection
Rate Limiting: Limit service restarts to prevent thrashing
Logging: All actions logged with timestamps for audit trail
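
Restart rate limiting can be sketched with a timestamp file. The interval of 300 seconds, the function name `restart_allowed`, and the stamp-file approach are assumptions for illustration; this document does not specify how thrashing is prevented.

```shell
# Hypothetical rate limiter: allow an action only if at least MIN_INTERVAL
# seconds have passed since the time recorded in the stamp file.
# Usage: restart_allowed STAMP_FILE [MIN_INTERVAL, default 300]
restart_allowed() {
    local stamp_file=$1 min_interval=${2:-300} now last
    now=$(date +%s)
    # A missing or empty stamp file counts as "never restarted".
    last=$(cat "$stamp_file" 2>/dev/null || true)
    last=${last:-0}
    if [ $(( now - last )) -ge "$min_interval" ]; then
        echo "$now" > "$stamp_file"
        return 0
    fi
    return 1
}
```

A restart step could be guarded with `restart_allowed /tmp/aitbc-rpc.stamp && systemctl restart aitbc-blockchain-rpc`, where the stamp path is a placeholder.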

Related Skills

aitbc-node-coordinator - For cross-node coordination during recovery
openclaw-error-handler - For error handling and escalation
openclaw-coordination-orchestrator - For multi-node recovery coordination

Related Workflows

Blockchain Communication Test - Testing workflow that triggers this skill
Multi-Node Operations - General node operations
