# Node Monitoring
Monitor your blockchain node's performance and health.
## Dashboard
```bash
aitbc-chain dashboard
```
Shows:
- Block height
- Peers connected
- Mempool size
- CPU/Memory/GPU usage
- Network traffic
## Prometheus Metrics
```bash
# Enable metrics
aitbc-chain metrics --port 9090
```
Available metrics:
- `aitbc_block_height` - Current block height
- `aitbc_peers_count` - Number of connected peers
- `aitbc_mempool_size` - Transactions in mempool
- `aitbc_block_production_time` - Block production time
- `aitbc_cpu_usage` - CPU utilization
- `aitbc_memory_usage` - Memory utilization
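To read these metrics from a script rather than a Prometheus server, the text exposition output can be parsed directly. The following is a minimal sketch, assuming the node serves the metrics as plain un-labelled samples; the sample text and its values are made up for illustration, and `parse_prometheus_text` is a hypothetical helper, not part of `aitbc-chain`.

```python
def parse_prometheus_text(text):
    """Parse un-labelled samples from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, value = parts[0], parts[1]
        # Labelled samples (name{label="x"} 1) are ignored in this sketch.
        if "{" in name:
            continue
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


# Illustrative input only; real values come from the metrics endpoint.
SAMPLE = """\
# HELP aitbc_block_height Current block height
# TYPE aitbc_block_height gauge
aitbc_block_height 1024
aitbc_peers_count 8
aitbc_mempool_size 37
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["aitbc_block_height"], metrics["aitbc_peers_count"])
```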
## Coordinator API Metrics
The coordinator API now exposes a JSON metrics endpoint for dashboard consumption in addition to the Prometheus `/metrics` endpoint.
### Live JSON Metrics
```bash
curl http://localhost:8000/v1/metrics
```
Includes:
- API request and error counters
- Average API response time
- Cache hit/miss and hit-rate data
- Lightweight process memory and CPU snapshot
- Alert threshold evaluation state
- Alert delivery result metadata
### Dashboard Flow
The web dashboard at `/opt/aitbc/website/dashboards/metrics.html` consumes:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping
## Alert Configuration
### Set Alerts
```bash
# Low peers alert
aitbc-chain alert --metric peers --threshold 3 --action notify
# High mempool alert
aitbc-chain alert --metric mempool --threshold 5000 --action notify
# Sync delay alert
aitbc-chain alert --metric sync_delay --threshold 100 --action notify
```
### Alert Actions
| Action | Description |
|--------|-------------|
| notify | Send notification |
| restart | Restart node |
| pause | Pause block production |
## Log Monitoring
```bash
# Real-time logs
aitbc-chain logs --tail
# Search logs
aitbc-chain logs --grep "error" --since "1h"
# Export logs
aitbc-chain logs --export /var/log/aitbc-chain/
```
## Health Checks
```bash
# Run health check
aitbc-chain health
# Detailed report
aitbc-chain health --detailed
```
Checks:
- Disk space
- Memory
- P2P connectivity
- RPC availability
- Database sync
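For an independent cross-check of the disk space item, a short script can report free space on the data volume. This is only a sketch of one such check: `check_disk_space`, the default path, and the 10% floor are assumptions made for illustration and do not describe how `aitbc-chain health` works internally.

```python
import shutil


def check_disk_space(path="/", min_free_percent=10.0):
    """Report free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total of zero.
    free_percent = (usage.free / usage.total * 100.0) if usage.total else 0.0
    return {
        "check": "disk_space",
        "path": path,
        "free_percent": round(free_percent, 1),
        "ok": free_percent >= min_free_percent,
    }


print(check_disk_space("."))
```

Point `path` at the directory that holds the node's data so the result reflects the right volume.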
## Coordinator Metrics Verification
### Verify JSON Metrics Endpoint
```bash
# Check live JSON metrics for dashboard consumption
# (-s hides curl's progress meter so only the JSON reaches jq)
curl -s http://localhost:8000/v1/metrics | jq .
```
Expected fields:
- `api_requests` - Total API request count
- `api_errors` - Total API error count
- `error_rate_percent` - Calculated error rate percentage
- `avg_response_time_ms` - Average API response time
- `cache_hit_rate_percent` - Cache hit rate percentage
- `alerts` - Alert threshold evaluation states
- `alert_delivery` - Alert delivery result metadata
- `uptime_seconds` - Service uptime in seconds
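The field list above can be checked mechanically. A minimal sketch, assuming the fields appear as top-level keys of the JSON object (the source does not state the nesting); `missing_metric_fields` is a hypothetical helper and the sample body is made up.

```python
import json

EXPECTED_FIELDS = (
    "api_requests",
    "api_errors",
    "error_rate_percent",
    "avg_response_time_ms",
    "cache_hit_rate_percent",
    "alerts",
    "alert_delivery",
    "uptime_seconds",
)


def missing_metric_fields(payload_text):
    """Return the expected field names absent from a /v1/metrics JSON body."""
    payload = json.loads(payload_text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    return [name for name in EXPECTED_FIELDS if name not in payload]


# Made-up body with two fields deliberately left out.
sample = json.dumps({
    "api_requests": 120,
    "api_errors": 3,
    "error_rate_percent": 2.5,
    "avg_response_time_ms": 41.7,
    "cache_hit_rate_percent": 88.0,
    "alerts": {},
})
print(missing_metric_fields(sample))
```

In practice, pass the body returned by the `curl` command above instead of `sample`.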
### Verify Prometheus Metrics
```bash
# Check Prometheus-compatible metrics
curl http://localhost:8000/metrics
```
### Verify Alert History
```bash
# Get recent production alerts (requires admin key)
curl -s -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?limit=10" | jq .
```
Filter by severity:
```bash
curl -s -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?severity=critical" | jq .
```
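The same requests can be built from Python with the standard library. This sketch only constructs the request; `build_alerts_request` is a hypothetical helper, and combining `limit` and `severity` in one call is an assumption, since the examples above show each parameter on its own.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"
ALERTS_PATH = "/agents/integration/production/alerts"


def build_alerts_request(api_key, limit=None, severity=None):
    """Build a GET request for the alert history endpoint."""
    params = {}
    if limit is not None:
        params["limit"] = limit
    if severity is not None:
        params["severity"] = severity
    url = BASE_URL + ALERTS_PATH
    if params:
        url += "?" + urlencode(params)
    # The admin key travels in the X-API-Key header, as in the curl examples.
    return Request(url, headers={"X-API-Key": api_key})


req = build_alerts_request("your-admin-key", limit=10)
print(req.get_method(), req.full_url)
```

Send it with `urllib.request.urlopen(req)` and decode the body with `json.load`.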
### Verify Dashboard Access
```bash
# Open the metrics dashboard in a browser
# File location: /opt/aitbc/website/dashboards/metrics.html
```
The dashboard polls:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping
## Troubleshooting
### Metrics Not Updating
If `/v1/metrics` shows stale or zeroed metrics:
1. **Check middleware is active**
   - Verify request metrics middleware is registered in `app/main.py`
   - Check that `metrics_collector` is imported and used
2. **Check cache stats integration**
   - Verify `cache_manager.get_stats()` is called in the metrics endpoint
   - Check that cache manager is properly initialized
3. **Check system snapshot capture**
   - Verify `capture_system_snapshot()` is not raising exceptions
   - Check that `os.getloadavg()` and `resource` module are available on your platform
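To test the platform check in step 3 in isolation, the two calls can be probed outside the API process. This is a sketch, not the project's `capture_system_snapshot()`: the function name and the returned keys are made up for illustration.

```python
import os

try:
    import resource  # Unix-only; the import fails on Windows
except ImportError:
    resource = None


def probe_system_snapshot():
    """Try the calls a snapshot depends on and record what is unavailable."""
    snapshot = {"load_avg_1m": None, "max_rss": None, "errors": []}
    try:
        snapshot["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError) as exc:
        # AttributeError: function missing on this platform.
        # OSError: the load average could not be obtained.
        snapshot["errors"].append(f"os.getloadavg unavailable: {exc}")
    if resource is None:
        snapshot["errors"].append("resource module unavailable")
    else:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
        snapshot["max_rss"] = usage.ru_maxrss
    return snapshot


print(probe_system_snapshot())
```

A non-empty `errors` list points to a platform limitation rather than a bug in the metrics endpoint.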
### Alert Delivery Not Working
If alerts are not being delivered:
1. **Check webhook configuration**
   - Verify `AITBC_ALERT_WEBHOOK_URL` environment variable is set
   - Test webhook URL with a simple curl POST request
   - Check webhook server logs for incoming requests
2. **Check alert suppression**
   - Alert dispatcher uses 5-minute cooldown by default
   - Check if alerts are being suppressed due to recent deliveries
   - Verify cooldown logic in `alert_dispatcher._is_suppressed()`
3. **Check alert history**
   - Use `/agents/integration/production/alerts` to see recent alert attempts
   - Check `delivery_status` field: `sent`, `suppressed`, or `failed`
   - Check `error` field for failed deliveries
4. **Check log fallback**
   - If webhook URL is not configured, alerts fall back to log output
   - Check coordinator API logs for warning messages about alerts
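The cooldown in step 2 is the most common reason an alert shows `suppressed`. The sketch below shows how a 5-minute cooldown typically behaves. It is not the code behind `alert_dispatcher._is_suppressed()`: the class name, the per-alert-name keying, and the injectable clock are assumptions made for illustration.

```python
import time


class CooldownSuppressor:
    """Allow at most one delivery per alert name within the cooldown window."""

    def __init__(self, cooldown_seconds=300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_sent = {}

    def should_send(self, alert_name):
        now = self._clock()
        last = self._last_sent.get(alert_name)
        if last is not None and now - last < self.cooldown_seconds:
            # Still inside the window: suppress, and keep the original timestamp.
            return False
        self._last_sent[alert_name] = now
        return True


suppressor = CooldownSuppressor()
print(suppressor.should_send("error_rate"))  # first delivery is allowed
print(suppressor.should_send("error_rate"))  # immediate repeat is suppressed
```

Under this model a repeated alert is delivered again only once the window has elapsed since the last successful send.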
### Dashboard Not Loading
If the metrics dashboard is not displaying data:
1. **Check API endpoints are accessible**
   - Verify `/v1/metrics` returns valid JSON
   - Verify `/v1/health` returns healthy status
   - Check browser console for CORS or network errors
2. **Check dashboard file path**
   - Ensure dashboard is served from correct location
   - Verify static file serving is configured in web server
3. **Check browser console**
   - Look for JavaScript errors
   - Check for failed API requests
   - Verify polling interval is reasonable (default 5 seconds)
### Alert Thresholds Not Triggering
If alerts should trigger but do not:
1. **Verify threshold values**
   - Error rate threshold: 1%
   - Average response time threshold: 500 ms
   - Memory usage threshold: 90%
   - Cache hit rate threshold: 70%
2. **Check metrics calculation**
   - Verify metrics are being collected correctly
   - Check that response times are recorded in seconds (not milliseconds); the threshold and the `avg_response_time_ms` field are expressed in milliseconds, so a unit mix-up skews the comparison by a factor of 1000
   - Verify cache hit rate calculation includes both hits and misses
3. **Check alert evaluation logic**
   - Verify `get_alert_states()` is called during metrics collection
   - Check that alert states are included in `/v1/metrics` response
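The checks above can be reproduced by hand to see whether a given set of readings should trigger. This is a sketch under stated assumptions, not the logic of `get_alert_states()`: the function names are hypothetical, comparisons are strict, the cache alert is assumed to fire when the hit rate falls below 70%, and a cache with no lookups is treated as not alerting.

```python
def cache_hit_rate_percent(hits, misses):
    """Hit rate over all lookups; None when there have been no lookups."""
    total = hits + misses
    return 100.0 * hits / total if total else None


def evaluate_alerts(error_rate_percent, avg_response_time_s, memory_percent,
                    hit_rate_percent):
    """Compare readings against the documented thresholds."""
    # Response times arrive in seconds; the threshold is in milliseconds.
    avg_response_time_ms = avg_response_time_s * 1000.0
    return {
        "error_rate": error_rate_percent > 1.0,
        "avg_response_time": avg_response_time_ms > 500.0,
        "memory_usage": memory_percent > 90.0,
        # Assumed direction: a low hit rate is the alert condition.
        "cache_hit_rate": hit_rate_percent is not None and hit_rate_percent < 70.0,
    }


hit_rate = cache_hit_rate_percent(hits=60, misses=40)
print(evaluate_alerts(2.0, 0.75, 50.0, hit_rate))
```

If the states computed here differ from the `alerts` field in `/v1/metrics`, the unit conversion and the hit-rate denominator are the first things to inspect.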
## Next
- [Quick Start](./1_quick-start.md) - Get started
- [Configuration](./2_configuration.md) - Configure your node
- [Operations](./3_operations.md) - Day-to-day ops