aitbc/docs/advanced/01_blockchain/7_monitoring.md

# Node Monitoring
Monitor your blockchain node performance and health.

## Dashboard

```bash
aitbc-chain dashboard
```

Shows:
- Block height
- Peers connected
- Mempool size
- CPU/Memory/GPU usage
- Network traffic

## Prometheus Metrics

```bash
# Enable metrics
aitbc-chain metrics --port 9090
```

Available metrics:
- `aitbc_block_height` - Current block height
- `aitbc_peers_count` - Number of connected peers
- `aitbc_mempool_size` - Transactions in mempool
- `aitbc_block_production_time` - Block production time
- `aitbc_cpu_usage` - CPU utilization
- `aitbc_memory_usage` - Memory utilization

## Coordinator API Metrics

The coordinator API now exposes a JSON metrics endpoint for dashboard consumption in addition to the Prometheus `/metrics` endpoint.

### Live JSON Metrics

```bash
curl http://localhost:8000/v1/metrics
```

Includes:
- API request and error counters
- Average API response time
- Cache hit/miss and hit-rate data
- Lightweight process memory and CPU snapshot
- Alert threshold evaluation state
- Alert delivery result metadata

### Dashboard Flow

The web dashboard at `/opt/aitbc/website/dashboards/metrics.html` consumes:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping

## Alert Configuration

### Set Alerts

```bash
# Low peers alert
aitbc-chain alert --metric peers --threshold 3 --action notify

# High mempool alert
aitbc-chain alert --metric mempool --threshold 5000 --action notify

# Sync delay alert
aitbc-chain alert --metric sync_delay --threshold 100 --action notify
```

### Alert Actions

| Action | Description |
|--------|-------------|
| notify | Send notification |
| restart | Restart node |
| pause | Pause block production |

## Log Monitoring

```bash
# Real-time logs
aitbc-chain logs --tail

# Search logs
aitbc-chain logs --grep "error" --since "1h"

# Export logs
aitbc-chain logs --export /var/log/aitbc-chain/
```

## Health Checks

```bash
# Run health check
aitbc-chain health

# Detailed report
aitbc-chain health --detailed
```

Checks:
- Disk space
- Memory
- P2P connectivity
- RPC availability
- Database sync

## Coordinator Metrics Verification

### Verify JSON Metrics Endpoint

```bash
# Check live JSON metrics for dashboard consumption
curl http://localhost:8000/v1/metrics | jq
```

Expected fields:
- `api_requests` - Total API request count
- `api_errors` - Total API error count
- `error_rate_percent` - Calculated error rate percentage
- `avg_response_time_ms` - Average API response time
- `cache_hit_rate_percent` - Cache hit rate percentage
- `alerts` - Alert threshold evaluation states
- `alert_delivery` - Alert delivery result metadata
- `uptime_seconds` - Service uptime in seconds

### Verify Prometheus Metrics

```bash
# Check Prometheus-compatible metrics
curl http://localhost:8000/metrics
```

### Verify Alert History

```bash
# Get recent production alerts (requires admin key)
curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?limit=10" | jq
```

Filter by severity:
```bash
curl -H "X-API-Key: your-admin-key" \
  "http://localhost:8000/agents/integration/production/alerts?severity=critical" | jq
```

### Verify Dashboard Access

```bash
# Open the metrics dashboard in a browser
# File location: /opt/aitbc/website/dashboards/metrics.html
```

The dashboard polls:
- `GET /v1/metrics` for live JSON metrics
- `GET /v1/health` for API health-state checks
- `GET /metrics` for Prometheus-compatible scraping

## Troubleshooting

### Metrics Not Updating

If `/v1/metrics` shows stale or zeroed metrics:

1. **Check middleware is active**
   - Verify request metrics middleware is registered in `app/main.py`
   - Check that `metrics_collector` is imported and used

2. **Check cache stats integration**
   - Verify `cache_manager.get_stats()` is called in the metrics endpoint
   - Check that cache manager is properly initialized

3. **Check system snapshot capture**
   - Verify `capture_system_snapshot()` is not raising exceptions
   - Check that `os.getloadavg()` and `resource` module are available on your platform

### Alert Delivery Not Working

If alerts are not being delivered:

1. **Check webhook configuration**
   - Verify `AITBC_ALERT_WEBHOOK_URL` environment variable is set
   - Test webhook URL with a simple curl POST request
   - Check webhook server logs for incoming requests

2. **Check alert suppression**
   - Alert dispatcher uses 5-minute cooldown by default
   - Check if alerts are being suppressed due to recent deliveries
   - Verify cooldown logic in `alert_dispatcher._is_suppressed()`

3. **Check alert history**
   - Use `/agents/integration/production/alerts` to see recent alert attempts
   - Check `delivery_status` field: `sent`, `suppressed`, or `failed`
   - Check `error` field for failed deliveries

4. **Check log fallback**
   - If webhook URL is not configured, alerts fall back to log output
   - Check coordinator API logs for warning messages about alerts

### Dashboard Not Loading

If the metrics dashboard is not displaying data:

1. **Check API endpoints are accessible**
   - Verify `/v1/metrics` returns valid JSON
   - Verify `/v1/health` returns healthy status
   - Check browser console for CORS or network errors

2. **Check dashboard file path**
   - Ensure dashboard is served from correct location
   - Verify static file serving is configured in web server

3. **Check browser console**
   - Look for JavaScript errors
   - Check for failed API requests
   - Verify polling interval is reasonable (default 5 seconds)

### Alert Thresholds Not Triggering

If alerts should trigger but do not:

1. **Verify threshold values**
   - Error rate threshold: 1%
   - Average response time threshold: 500ms
   - Memory usage threshold: 90%
   - Cache hit rate threshold: 70%

2. **Check metrics calculation**
   - Verify metrics are being collected correctly
   - Check that response times are recorded in seconds (not milliseconds)
   - Verify cache hit rate calculation includes both hits and misses

3. **Check alert evaluation logic**
   - Verify `get_alert_states()` is called during metrics collection
   - Check that alert states are included in `/v1/metrics` response

## Next

- [Quick Start](./1_quick-start.md) — Get started
- [Configuration](./2_configuration.md) - Configure your node
- [Operations](./3_operations.md) — Day-to-day ops