# Node Monitoring

Monitor your blockchain node performance and health.

## Dashboard

```bash
aitbc-chain dashboard
```

Shows:
- Block height
- Peers connected
- Mempool size
- CPU/Memory/GPU usage
- Network traffic

## Prometheus Metrics

```bash
# Enable metrics
aitbc-chain metrics --port 9090
```

Available metrics:
- `aitbc_block_height` - Current block height
- `aitbc_peers_count` - Number of connected peers
- `aitbc_mempool_size` - Transactions in mempool
- `aitbc_block_production_time` - Block production time
- `aitbc_cpu_usage` - CPU utilization
- `aitbc_memory_usage` - Memory utilization

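
To spot-check the metrics listed above from a script, the Prometheus text exposition format can be parsed with a few lines of Python. This is a minimal sketch: it assumes unlabelled samples of the form `name value`, and the sample payload is invented for illustration rather than captured from a real node.

```python
def parse_prometheus_text(text):
    """Parse unlabelled `name value` samples from Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Labelled samples (name{label="x"} value) are ignored in this sketch.
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


# Illustrative payload only; real values come from the node's metrics endpoint.
SAMPLE = """\
# HELP aitbc_block_height Current block height
# TYPE aitbc_block_height gauge
aitbc_block_height 1024
aitbc_peers_count 8
aitbc_mempool_size 37
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["aitbc_peers_count"])
```

For anything beyond a quick check, a full Prometheus client or parser library is likely a better fit, since this sketch skips labels and timestamps.
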
## Coordinator API Metrics

The coordinator API now exposes a JSON metrics endpoint for dashboard consumption in addition to the Prometheus `/metrics` endpoint.

### Live JSON Metrics

```bash
curl http://localhost:8000/v1/metrics
```

Includes:
- API request and error counters
- Average API response time
- Cache hit/miss and hit-rate data
- Lightweight process memory and CPU snapshot
- Alert threshold evaluation state
- Alert delivery result metadata

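
The counters above are usually combined into rates before they are displayed. The following sketch shows one way to derive an error rate and a cache hit rate from raw counters. The argument names `cache_hits` and `cache_misses` are hypothetical, and the coordinator's own calculation may differ.

```python
def percent(part, total):
    """Return part/total as a percentage, or 0.0 when total is zero."""
    if total <= 0:
        return 0.0
    return round(100.0 * part / total, 2)


def derive_rates(api_requests, api_errors, cache_hits, cache_misses):
    # The hit rate is computed over hits + misses, not over hits alone.
    return {
        "error_rate_percent": percent(api_errors, api_requests),
        "cache_hit_rate_percent": percent(cache_hits, cache_hits + cache_misses),
    }


rates = derive_rates(api_requests=200, api_errors=3, cache_hits=45, cache_misses=15)
print(rates)
```

Guarding the zero-total case keeps a freshly started service from reporting a division error instead of a zero rate.
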
### Dashboard Flow

The web dashboard at `/opt/aitbc/website/dashboards/metrics.html` consumes:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping

## Alert Configuration

### Set Alerts

```bash
# Low peers alert
aitbc-chain alert --metric peers --threshold 3 --action notify

# High mempool alert
aitbc-chain alert --metric mempool --threshold 5000 --action notify

# Sync delay alert
aitbc-chain alert --metric sync_delay --threshold 100 --action notify
```

### Alert Actions

| Action | Description |
|--------|-------------|
| notify | Send notification |
| restart | Restart node |
| pause | Pause block production |

## Log Monitoring

```bash
# Real-time logs
aitbc-chain logs --tail

# Search logs
aitbc-chain logs --grep "error" --since "1h"

# Export logs
aitbc-chain logs --export /var/log/aitbc-chain/
```

## Health Checks

```bash
# Run health check
aitbc-chain health

# Detailed report
aitbc-chain health --detailed
```

Checks:
- Disk space
- Memory
- P2P connectivity
- RPC availability
- Database sync

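
For scripted checks outside the CLI, the disk-space item can be approximated with the Python standard library. This is only a sketch: the 10% minimum and the default path are hypothetical values, and it does not reproduce what `aitbc-chain health` checks internally.

```python
import shutil


def disk_check(path=".", min_free_percent=10.0):
    """Report free disk space for `path` against a minimum percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_percent = 100.0 * usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_percent": round(free_percent, 1),
        "ok": free_percent >= min_free_percent,
    }


result = disk_check()
print(result)
```

Point `path` at the volume that holds the node's data directory to make the result meaningful.
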
## Coordinator Metrics Verification

### Verify JSON Metrics Endpoint

```bash
# Check live JSON metrics for dashboard consumption
curl http://localhost:8000/v1/metrics | jq
```

Expected fields:
- `api_requests` - Total API request count
- `api_errors` - Total API error count
- `error_rate_percent` - Calculated error rate percentage
- `avg_response_time_ms` - Average API response time
- `cache_hit_rate_percent` - Cache hit rate percentage
- `alerts` - Alert threshold evaluation states
- `alert_delivery` - Alert delivery result metadata
- `uptime_seconds` - Service uptime in seconds

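
The field list above can be turned into an automated check. The sketch below assumes the fields appear as top-level keys of the JSON object, which the list does not guarantee, and it uses an invented sample payload instead of a live response.

```python
import json

EXPECTED_FIELDS = (
    "api_requests",
    "api_errors",
    "error_rate_percent",
    "avg_response_time_ms",
    "cache_hit_rate_percent",
    "alerts",
    "alert_delivery",
    "uptime_seconds",
)


def missing_fields(payload):
    """Return the expected field names that are absent from the payload."""
    return [name for name in EXPECTED_FIELDS if name not in payload]


# Illustrative payload only; in practice, parse the body of GET /v1/metrics.
sample = json.loads('{"api_requests": 120, "api_errors": 2, "uptime_seconds": 3600}')
print(missing_fields(sample))
```

An empty list means every documented field is present; anything else names the fields to investigate.
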
### Verify Prometheus Metrics

```bash
# Check Prometheus-compatible metrics
curl http://localhost:8000/metrics
```

### Verify Alert History

```bash
# Get recent production alerts (requires admin key)
curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?limit=10" | jq
```

Filter by severity:
```bash
curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?severity=critical" | jq
```

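
The same requests can be prepared from Python. This sketch only builds the request object and does not send it. Combining `limit` and `severity` in one query is an assumption, since the commands above use them separately, and `build_alerts_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"
ALERTS_PATH = "/agents/integration/production/alerts"


def build_alerts_request(api_key, limit=None, severity=None):
    """Build (but do not send) a GET request for the alert history."""
    params = {}
    if limit is not None:
        params["limit"] = limit
    if severity is not None:
        params["severity"] = severity
    url = BASE_URL + ALERTS_PATH
    if params:
        url += "?" + urlencode(params)
    # The admin key travels in the X-API-Key header, as in the curl commands.
    return Request(url, headers={"X-API-Key": api_key})


req = build_alerts_request("your-admin-key", limit=10, severity="critical")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; keep the real admin key out of scripts and shell history.
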
### Verify Dashboard Access

```bash
# Open the metrics dashboard in a browser
# File location: /opt/aitbc/website/dashboards/metrics.html
```

The dashboard polls:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping

## Troubleshooting

### Metrics Not Updating

If `/v1/metrics` shows stale or zeroed metrics:

1. **Check middleware is active**
   - Verify request metrics middleware is registered in `app/main.py`
   - Check that `metrics_collector` is imported and used

2. **Check cache stats integration**
   - Verify `cache_manager.get_stats()` is called in the metrics endpoint
   - Check that cache manager is properly initialized

3. **Check system snapshot capture**
   - Verify `capture_system_snapshot()` is not raising exceptions
   - Check that `os.getloadavg()` and `resource` module are available on your platform

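
The platform check in step 3 can be scripted. The sketch below probes whether `os.getloadavg()` and the `resource` module are usable on the current host. `snapshot_support` is a hypothetical helper, not part of the coordinator code.

```python
import os


def snapshot_support():
    """Report which system-snapshot primitives work on this platform."""
    support = {"getloadavg": False, "resource": False}

    # os.getloadavg() is missing on some platforms and can raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            os.getloadavg()
            support["getloadavg"] = True
        except OSError:
            pass

    # The resource module is Unix-only, so the import itself may fail.
    try:
        import resource

        resource.getrusage(resource.RUSAGE_SELF)
        support["resource"] = True
    except (ImportError, OSError):
        pass

    return support


support = snapshot_support()
print(support)
```

A `False` entry suggests the snapshot code needs a fallback on that host.
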
### Alert Delivery Not Working

If alerts are not being delivered:

1. **Check webhook configuration**
   - Verify `AITBC_ALERT_WEBHOOK_URL` environment variable is set
   - Test webhook URL with a simple curl POST request
   - Check webhook server logs for incoming requests

2. **Check alert suppression**
   - Alert dispatcher uses 5-minute cooldown by default
   - Check if alerts are being suppressed due to recent deliveries
   - Verify cooldown logic in `alert_dispatcher._is_suppressed()`

3. **Check alert history**
   - Use `/agents/integration/production/alerts` to see recent alert attempts
   - Check `delivery_status` field: `sent`, `suppressed`, or `failed`
   - Check `error` field for failed deliveries

4. **Check log fallback**
   - If webhook URL is not configured, alerts fall back to log output
   - Check coordinator API logs for warning messages about alerts

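
To reason about the suppression behavior in step 2, it can help to see how a cooldown typically works. The following is a generic sketch of per-alert cooldown tracking with a 5-minute default. `CooldownTracker` and its methods are hypothetical and are not the code behind `alert_dispatcher._is_suppressed()`.

```python
import time


class CooldownTracker:
    """Suppress repeat deliveries of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds=300.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> time of last delivery

    def is_suppressed(self, alert_key, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_sent.get(alert_key)
        # An alert that was never delivered is never suppressed.
        return last is not None and (now - last) < self.cooldown_seconds

    def record_delivery(self, alert_key, now=None):
        self._last_sent[alert_key] = time.monotonic() if now is None else now


tracker = CooldownTracker()
tracker.record_delivery("error_rate", now=0.0)
print(tracker.is_suppressed("error_rate", now=120.0))
print(tracker.is_suppressed("error_rate", now=301.0))
```

Under a scheme like this, an alert that fires again two minutes after a delivery would be suppressed, which matches a `suppressed` value in `delivery_status`.
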
### Dashboard Not Loading

If the metrics dashboard is not displaying data:

1. **Check API endpoints are accessible**
   - Verify `/v1/metrics` returns valid JSON
   - Verify `/v1/health` returns healthy status
   - Check browser console for CORS or network errors

2. **Check dashboard file path**
   - Ensure dashboard is served from correct location
   - Verify static file serving is configured in web server

3. **Check browser console**
   - Look for JavaScript errors
   - Check for failed API requests
   - Verify polling interval is reasonable (default 5 seconds)

### Alert Thresholds Not Triggering

If alerts should trigger but do not:

1. **Verify threshold values**
   - Error rate threshold: 1%
   - Average response time threshold: 500ms
   - Memory usage threshold: 90%
   - Cache hit rate threshold: 70%

2. **Check metrics calculation**
   - Verify metrics are being collected correctly
   - Check that response times are recorded in seconds (not milliseconds)
   - Verify cache hit rate calculation includes both hits and misses

3. **Check alert evaluation logic**
   - Verify `get_alert_states()` is called during metrics collection
   - Check that alert states are included in `/v1/metrics` response

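
The threshold values above can be checked against a metrics payload offline. This sketch makes several assumptions: that the first three thresholds trigger when exceeded and the cache hit rate triggers when it drops below its threshold, that `memory_percent` is a usable key name (it is hypothetical), and that response times recorded in seconds are converted to milliseconds before comparison. It is not the logic of `get_alert_states()`.

```python
# (limit, direction) pairs; the directions are assumptions, not documented behavior.
THRESHOLDS = {
    "error_rate_percent": (1.0, "above"),
    "avg_response_time_ms": (500.0, "above"),
    "memory_percent": (90.0, "above"),
    "cache_hit_rate_percent": (70.0, "below"),
}


def seconds_to_ms(seconds):
    """Convert a duration recorded in seconds to milliseconds."""
    return seconds * 1000.0


def evaluate_alerts(metrics):
    """Return 'triggered', 'ok', or 'unknown' for each threshold."""
    states = {}
    for name, (limit, direction) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            states[name] = "unknown"
        elif direction == "above":
            states[name] = "triggered" if value > limit else "ok"
        else:
            states[name] = "triggered" if value < limit else "ok"
    return states


sample = {
    "error_rate_percent": 2.5,
    "avg_response_time_ms": seconds_to_ms(0.12),
    "cache_hit_rate_percent": 64.0,
}
states = evaluate_alerts(sample)
print(states)
```

An `unknown` state here is a hint that the metric is missing from the payload, which points back to the collection checks in step 2.
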
## Next

- [Quick Start](./1_quick-start.md) - Get started
- [Configuration](./2_configuration.md) - Configure your node
- [Operations](./3_operations.md) - Day-to-day ops