oib/aitbc

aitbc 40ddf89b9c

docs: update CLI command syntax across workflow documentation

- Updated marketplace commands: `marketplace --action` → `market` subcommands
- Updated wallet commands: direct flags → `wallet` subcommands
- Updated AI commands: `ai-submit`, `ai-status` → `ai submit`, `ai status`
- Updated blockchain commands: `chain` → `blockchain info`
- Standardized command structure across all workflow files
- Affected files: MULTI_NODE_MASTER_INDEX.md, TEST_MASTER_INDEX.md, multi-node-blockchain-marketplace

2026-04-08 12:10:21 +02:00

Node Monitoring

Monitor your blockchain node performance and health.

Dashboard

aitbc-chain dashboard

Shows:

Block height
Peers connected
Mempool size
CPU/Memory/GPU usage
Network traffic

Prometheus Metrics

# Enable metrics
aitbc-chain metrics --port 9090

Available metrics:

aitbc_block_height - Current block height
aitbc_peers_count - Number of connected peers
aitbc_mempool_size - Transactions in mempool
aitbc_block_production_time - Block production time
aitbc_cpu_usage - CPU utilization
aitbc_memory_usage - Memory utilization
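
To read these metrics from a script rather than a Prometheus server, the text exposition format can be parsed directly. The following is a minimal sketch, not the project's tooling: `parse_prometheus_text` is a hypothetical helper, the sample text is illustrative rather than captured from a node, and labelled series are ignored for brevity.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition output."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, value = parts[0], parts[1]
        # This sketch ignores labelled series such as metric{label="x"}.
        if "{" in name:
            continue
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


# Illustrative sample only; real values come from the metrics endpoint.
SAMPLE = """\
# HELP aitbc_block_height Current block height
# TYPE aitbc_block_height gauge
aitbc_block_height 1024
aitbc_peers_count 8
aitbc_mempool_size 37
"""

if __name__ == "__main__":
    for name, value in parse_prometheus_text(SAMPLE).items():
        print(f"{name} = {value}")
```

In practice the input would be the response body scraped from the port passed to `aitbc-chain metrics`; the exact scrape path on that port is not stated here, so it is left out of the sketch.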

Coordinator API Metrics

The coordinator API now exposes a JSON metrics endpoint for dashboard consumption in addition to the Prometheus /metrics endpoint.

Live JSON Metrics

curl http://localhost:8000/v1/metrics

Includes:

API request and error counters
Average API response time
Cache hit/miss and hit-rate data
Lightweight process memory and CPU snapshot
Alert threshold evaluation state
Alert delivery result metadata

Dashboard Flow

The web dashboard at /opt/aitbc/website/dashboards/metrics.html consumes:

GET /v1/metrics for live JSON metrics
GET /v1/health for API health-state checks
GET /metrics for Prometheus-compatible scraping
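
The polling flow above can be approximated outside the browser. This is a rough sketch under stated assumptions: `fetch_json` and `poll_once` are hypothetical names, the base URL is taken from the curl examples in this document, and it assumes `/v1/health` returns a JSON body, which this document does not confirm.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8000"  # host and port from the curl examples


def fetch_json(path: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET a coordinator path and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll_once(
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
) -> Dict[str, Any]:
    """Collect one dashboard refresh: live metrics plus health state."""
    result: Dict[str, Any] = {}
    for key, path in (("metrics", "/v1/metrics"), ("health", "/v1/health")):
        try:
            result[key] = fetch(path)
        except Exception as exc:  # network error, bad JSON, non-2xx status
            result[key] = {"error": str(exc)}
    return result


if __name__ == "__main__":
    print(json.dumps(poll_once(), indent=2))
```

Passing the fetch function as a parameter keeps the polling logic testable without a running coordinator.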

Alert Configuration

Set Alerts

# Low peers alert
aitbc-chain alert --metric peers --threshold 3 --action notify

# High mempool alert
aitbc-chain alert --metric mempool --threshold 5000 --action notify

# Sync delay alert
aitbc-chain alert --metric sync_delay --threshold 100 --action notify

Alert Actions

| Action | Description |
|--------|-------------|
| notify | Send notification |
| restart | Restart node |
| pause | Pause block production |
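
To make the relationship between a metric, its threshold, and an action concrete, here is a small illustrative model. It is not how `aitbc-chain` implements alerts: `AlertRule`, the `below` flag, and `triggered_actions` are hypothetical, and the comparison direction for each metric (peers fire when low, mempool and sync delay when high) is an assumption inferred from the comments in the commands above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    action: str          # "notify", "restart", or "pause"
    below: bool = False  # assumption: True fires when value < threshold


# Mirrors the three example commands; directions are assumed.
RULES = [
    AlertRule("peers", 3, "notify", below=True),
    AlertRule("mempool", 5000, "notify"),
    AlertRule("sync_delay", 100, "notify"),
]


def triggered_actions(values: Dict[str, float], rules: List[AlertRule]) -> List[str]:
    """Return "action:metric" for every rule whose threshold is breached."""
    fired = []
    for rule in rules:
        value = values.get(rule.metric)
        if value is None:
            continue  # metric not reported; nothing to evaluate
        if rule.below:
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            fired.append(f"{rule.action}:{rule.metric}")
    return fired


if __name__ == "__main__":
    print(triggered_actions({"peers": 2, "mempool": 100, "sync_delay": 250}, RULES))
```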

Log Monitoring

# Real-time logs
aitbc-chain logs --tail

# Search logs
aitbc-chain logs --grep "error" --since "1h"

# Export logs
aitbc-chain logs --export /var/log/aitbc-chain/

Health Checks

# Run health check
aitbc-chain health

# Detailed report
aitbc-chain health --detailed

Checks:

Disk space
Memory
P2P connectivity
RPC availability
Database sync

Coordinator Metrics Verification

Verify JSON Metrics Endpoint

# Check live JSON metrics for dashboard consumption
curl http://localhost:8000/v1/metrics | jq

Expected fields:

api_requests - Total API request count
api_errors - Total API error count
error_rate_percent - Calculated error rate percentage
avg_response_time_ms - Average API response time
cache_hit_rate_percent - Cache hit rate percentage
alerts - Alert threshold evaluation states
alert_delivery - Alert delivery result metadata
uptime_seconds - Service uptime in seconds
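
A quick way to confirm that the response carries every expected field is to pipe the curl output into a small checker. The sketch below assumes the fields appear as top-level keys of the JSON object, which this document implies but does not state; `missing_fields` is a hypothetical helper.

```python
import json
import sys
from typing import Any, Dict, List

# Field names copied from the list above.
EXPECTED_FIELDS = (
    "api_requests",
    "api_errors",
    "error_rate_percent",
    "avg_response_time_ms",
    "cache_hit_rate_percent",
    "alerts",
    "alert_delivery",
    "uptime_seconds",
)


def missing_fields(payload: Dict[str, Any]) -> List[str]:
    """Return expected field names absent from the payload's top level."""
    return [name for name in EXPECTED_FIELDS if name not in payload]


if __name__ == "__main__":
    # Usage: curl -s http://localhost:8000/v1/metrics | python check_fields.py
    gaps = missing_fields(json.load(sys.stdin))
    print("missing:", ", ".join(gaps) if gaps else "none")
    sys.exit(1 if gaps else 0)
```

The script name `check_fields.py` is only a placeholder for wherever the snippet is saved.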

Verify Prometheus Metrics

# Check Prometheus-compatible metrics
curl http://localhost:8000/metrics

Verify Alert History

# Get recent production alerts (requires admin key)
curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?limit=10" | jq

Filter by severity:

curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?severity=critical" | jq

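The same alert-history queries can be issued from Python. This is a sketch rather than an official client: `build_alert_request` is a hypothetical helper, the endpoint path, `X-API-Key` header, and the `limit` and `severity` parameters come from the curl examples above, and whether the two parameters can be combined in one request is not confirmed here.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Optional

ALERTS_PATH = "/agents/integration/production/alerts"


def build_alert_request(
    api_key: str,
    limit: Optional[int] = None,
    severity: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a GET request for the alert history endpoint."""
    params: Dict[str, str] = {}
    if limit is not None:
        params["limit"] = str(limit)
    if severity is not None:
        params["severity"] = severity
    url = base_url + ALERTS_PATH
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


if __name__ == "__main__":
    # Replace the placeholder with a real admin key before running.
    request = build_alert_request("your-admin-key", limit=10)
    with urllib.request.urlopen(request, timeout=5.0) as resp:
        print(json.dumps(json.loads(resp.read().decode("utf-8")), indent=2))
```
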
Verify Dashboard Access

# Open the metrics dashboard in a browser
# File location: /opt/aitbc/website/dashboards/metrics.html

The dashboard polls:

GET /v1/metrics for live JSON metrics
GET /v1/health for API health-state checks
GET /metrics for Prometheus-compatible scraping

Troubleshooting

Metrics Not Updating

If /v1/metrics shows stale or zeroed metrics:

1. Check middleware is active
   - Verify request metrics middleware is registered in app/main.py
   - Check that metrics_collector is imported and used
2. Check cache stats integration
   - Verify cache_manager.get_stats() is called in the metrics endpoint
   - Check that cache manager is properly initialized
3. Check system snapshot capture
   - Verify capture_system_snapshot() is not raising exceptions
   - Check that os.getloadavg() and resource module are available on your platform
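
To check the platform dependencies from the last step in isolation, a standalone probe such as the following may help. It is a diagnostic sketch, not the project's `capture_system_snapshot()`: the function name `capture_snapshot_safely` and the returned keys are hypothetical. It relies on two documented behaviours of the standard library: `os.getloadavg()` is unavailable on Windows and can raise `OSError`, and the `resource` module exists only on Unix.

```python
import os
from typing import Any, Dict

try:
    import resource  # Unix only; the import fails on Windows
except ImportError:
    resource = None


def capture_snapshot_safely() -> Dict[str, Any]:
    """Probe load average and peak RSS, recording failures instead of raising."""
    snapshot: Dict[str, Any] = {"load_avg_1m": None, "max_rss": None, "errors": []}

    try:
        snapshot["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError) as exc:
        # AttributeError: function missing on this platform.
        # OSError: the load average could not be obtained.
        snapshot["errors"].append(f"getloadavg: {exc}")

    if resource is None:
        snapshot["errors"].append("resource: module not available")
    else:
        # ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
        snapshot["max_rss"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    return snapshot


if __name__ == "__main__":
    print(capture_snapshot_safely())
```

A non-empty `errors` list on the coordinator host suggests the snapshot step is the likely source of zeroed system metrics.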

Alert Delivery Not Working

If alerts are not being delivered:

1. Check webhook configuration
   - Verify AITBC_ALERT_WEBHOOK_URL environment variable is set
   - Test webhook URL with a simple curl POST request
   - Check webhook server logs for incoming requests
2. Check alert suppression
   - Alert dispatcher uses 5-minute cooldown by default
   - Check if alerts are being suppressed due to recent deliveries
   - Verify cooldown logic in alert_dispatcher._is_suppressed()
3. Check alert history
   - Use /agents/integration/production/alerts to see recent alert attempts
   - Check delivery_status field: sent, suppressed, or failed
   - Check error field for failed deliveries
4. Check log fallback
   - If webhook URL is not configured, alerts fall back to log output
   - Check coordinator API logs for warning messages about alerts
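
When reasoning about why an alert shows `suppressed`, it can help to have a reference model of cooldown-based suppression. The sketch below is one plausible shape for such logic and is not the code behind `alert_dispatcher._is_suppressed()`: the `CooldownTracker` class, its method names, and per-key tracking are assumptions; only the 5-minute default comes from this document.

```python
import time
from typing import Dict, Optional


class CooldownTracker:
    """Suppress repeat deliveries of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        # 300 seconds matches the 5-minute default mentioned above.
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def is_suppressed(self, alert_key: str, now: Optional[float] = None) -> bool:
        """True if this alert was delivered less than one cooldown ago."""
        current = time.monotonic() if now is None else now
        last = self._last_sent.get(alert_key)
        return last is not None and (current - last) < self.cooldown_seconds

    def record_delivery(self, alert_key: str, now: Optional[float] = None) -> None:
        """Remember when an alert was last delivered."""
        self._last_sent[alert_key] = time.monotonic() if now is None else now


if __name__ == "__main__":
    tracker = CooldownTracker()
    tracker.record_delivery("error_rate", now=0.0)
    print(tracker.is_suppressed("error_rate", now=120.0))
    print(tracker.is_suppressed("error_rate", now=300.0))
```

Under this model, an alert that fires again two minutes after a delivery is suppressed, while a different alert key is unaffected.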

Dashboard Not Loading

If the metrics dashboard is not displaying data:

1. Check API endpoints are accessible
   - Verify /v1/metrics returns valid JSON
   - Verify /v1/health returns healthy status
   - Check browser console for CORS or network errors
2. Check dashboard file path
   - Ensure dashboard is served from correct location
   - Verify static file serving is configured in web server
3. Check browser console
   - Look for JavaScript errors
   - Check for failed API requests
   - Verify polling interval is reasonable (default 5 seconds)

Alert Thresholds Not Triggering

If alerts should trigger but do not:

1. Verify threshold values
   - Error rate threshold: 1%
   - Average response time threshold: 500ms
   - Memory usage threshold: 90%
   - Cache hit rate threshold: 70%
2. Check metrics calculation
   - Verify metrics are being collected correctly
   - Check that response times are recorded in seconds (not milliseconds)
   - Verify cache hit rate calculation includes both hits and misses
3. Check alert evaluation logic
   - Verify get_alert_states() is called during metrics collection
   - Check that alert states are included in /v1/metrics response
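
The calculations and thresholds above can be cross-checked by hand with a small reference model. This is a sketch under assumptions, not the coordinator's `get_alert_states()`: all function names are hypothetical, the `memory_percent` key is invented for illustration, and the comparison direction and strictness (cache hit rate fires when below 70%, the others when strictly above their limit) are assumed.

```python
from typing import Dict

# Threshold values from the list above; comparison direction is assumed.
ERROR_RATE_MAX_PERCENT = 1.0
AVG_RESPONSE_MAX_MS = 500.0
MEMORY_MAX_PERCENT = 90.0
CACHE_HIT_MIN_PERCENT = 70.0


def error_rate_percent(requests: int, errors: int) -> float:
    """Errors as a percentage of all requests (0.0 when there are none)."""
    return 0.0 if requests == 0 else 100.0 * errors / requests


def cache_hit_rate_percent(hits: int, misses: int) -> float:
    """Hits as a percentage of hits plus misses."""
    total = hits + misses
    return 0.0 if total == 0 else 100.0 * hits / total


def seconds_to_ms(seconds: float) -> float:
    """Convert a duration recorded in seconds to milliseconds."""
    return seconds * 1000.0


def evaluate_thresholds(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Return True for each alert whose threshold is breached."""
    return {
        "error_rate": metrics["error_rate_percent"] > ERROR_RATE_MAX_PERCENT,
        "response_time": metrics["avg_response_time_ms"] > AVG_RESPONSE_MAX_MS,
        "memory": metrics["memory_percent"] > MEMORY_MAX_PERCENT,
        "cache_hit_rate": metrics["cache_hit_rate_percent"] < CACHE_HIT_MIN_PERCENT,
    }


if __name__ == "__main__":
    print(evaluate_thresholds({
        "error_rate_percent": error_rate_percent(200, 3),
        "avg_response_time_ms": seconds_to_ms(0.25),
        "memory_percent": 50.0,
        "cache_hit_rate_percent": cache_hit_rate_percent(13, 7),
    }))
```

A common cause of a response-time alert that never fires is comparing a value in seconds against the 500 ms threshold without converting it first.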

Next

Quick Start - Get started
Configuration - Configure your node
Operations - Day-to-day ops
