docs: update CLI command syntax across workflow documentation

- Updated marketplace commands: `marketplace --action` → `market` subcommands
- Updated wallet commands: direct flags → `wallet` subcommands
- Updated AI commands: `ai-submit`, `ai-status` → `ai submit`, `ai status`
- Updated blockchain commands: `chain` → `blockchain info`
- Standardized command structure across all workflow files
- Affected files: MULTI_NODE_MASTER_INDEX.md, TEST_MASTER_INDEX.md, multi-node-blockchain-marketplace
2026-04-08 12:10:21 +02:00
parent ef4a1c0e87
commit 40ddf89b9c
251 changed files with 3555 additions and 61407 deletions
--- a/docs/advanced/01_blockchain/7_monitoring.md
+++ b/docs/advanced/01_blockchain/7_monitoring.md
@@ -29,6 +29,31 @@ Available metrics:
 - `aitbc_cpu_usage` - CPU utilization
 - `aitbc_memory_usage` - Memory utilization

+## Coordinator API Metrics
+
+The coordinator API now exposes a JSON metrics endpoint for dashboard consumption in addition to the Prometheus `/metrics` endpoint.
+
+### Live JSON Metrics
+
+```bash
+curl http://localhost:8000/v1/metrics
+```
+
+Includes:
+- API request and error counters
+- Average API response time
+- Cache hit/miss and hit-rate data
+- Lightweight process memory and CPU snapshot
+- Alert threshold evaluation state
+- Alert delivery result metadata
+
+### Dashboard Flow
+
+The web dashboard at `/opt/aitbc/website/dashboards/metrics.html` consumes:
+- `GET /v1/metrics` for live JSON metrics
+- `GET /v1/health` for API health-state checks
+- `GET /metrics` for Prometheus-compatible scraping
+
 ## Alert Configuration

 ### Set Alerts
@@ -82,6 +107,136 @@ Checks:
 - RPC availability
 - Database sync

+## Coordinator Metrics Verification
+
+### Verify JSON Metrics Endpoint
+
+```bash
+# Check live JSON metrics for dashboard consumption
+curl http://localhost:8000/v1/metrics | jq
+```
+
+Expected fields:
+- `api_requests` - Total API request count
+- `api_errors` - Total API error count
+- `error_rate_percent` - Calculated error rate percentage
+- `avg_response_time_ms` - Average API response time
+- `cache_hit_rate_percent` - Cache hit rate percentage
+- `alerts` - Alert threshold evaluation states
+- `alert_delivery` - Alert delivery result metadata
+- `uptime_seconds` - Service uptime in seconds
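The field check above can also be scripted. The following is a minimal sketch, not part of the commit: it assumes the expected fields appear as top-level keys of the JSON object, and the helper names `fetch_metrics` and `missing_fields` are hypothetical.

```python
import json
from urllib.request import urlopen

# Field names copied from the "Expected fields" list.
# Assumption: they are top-level keys of the /v1/metrics response.
EXPECTED_FIELDS = (
    "api_requests",
    "api_errors",
    "error_rate_percent",
    "avg_response_time_ms",
    "cache_hit_rate_percent",
    "alerts",
    "alert_delivery",
    "uptime_seconds",
)


def fetch_metrics(url: str = "http://localhost:8000/v1/metrics") -> dict:
    """Fetch and decode the JSON metrics document (hypothetical helper)."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def missing_fields(payload: dict) -> list[str]:
    """Return the expected field names that are absent from the payload."""
    return [name for name in EXPECTED_FIELDS if name not in payload]
```

Calling `missing_fields(fetch_metrics())` against a running coordinator should return an empty list if every documented field is present; any names it returns are the ones to investigate.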
+
+### Verify Prometheus Metrics
+
+```bash
+# Check Prometheus-compatible metrics
+curl http://localhost:8000/metrics
+```
+
+### Verify Alert History
+
+```bash
+# Get recent production alerts (requires admin key)
+curl -H "X-API-Key: your-admin-key" \
+  "http://localhost:8000/agents/integration/production/alerts?limit=10" | jq
+```
+
+Filter by severity:
+```bash
+curl -H "X-API-Key: your-admin-key" \
+  "http://localhost:8000/agents/integration/production/alerts?severity=critical" | jq
+```
+
+### Verify Dashboard Access
+
+```bash
+# Open the metrics dashboard in a browser
+# File location: /opt/aitbc/website/dashboards/metrics.html
+```
+
+The dashboard polls:
+- `GET /v1/metrics` for live JSON metrics
+- `GET /v1/health` for API health-state checks
+- `GET /metrics` for Prometheus-compatible scraping
+
+## Troubleshooting
+
+### Metrics Not Updating
+
+If `/v1/metrics` shows stale or zeroed metrics:
+
+1. **Check middleware is active**
+   - Verify request metrics middleware is registered in `app/main.py`
+   - Check that `metrics_collector` is imported and used
+
+2. **Check cache stats integration**
+   - Verify `cache_manager.get_stats()` is called in the metrics endpoint
+   - Check that cache manager is properly initialized
+
+3. **Check system snapshot capture**
+   - Verify `capture_system_snapshot()` is not raising exceptions
+   - Check that `os.getloadavg()` and `resource` module are available on your platform
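The platform check in step 3 can be made concrete. This is a sketch under stated assumptions, not the coordinator's actual `capture_system_snapshot()`: the function name `capture_system_snapshot_sketch` and the returned keys are hypothetical. It shows how a snapshot can degrade gracefully where `os.getloadavg()` or the `resource` module is unavailable (for example on Windows).

```python
import os

try:
    import resource  # Unix-only module; the import fails on Windows
except ImportError:
    resource = None


def capture_system_snapshot_sketch() -> dict:
    """Collect a lightweight CPU/memory snapshot without raising.

    Hypothetical stand-in for illustration; unavailable values stay None.
    """
    snapshot = {"load_avg_1m": None, "max_rss_raw": None}

    # os.getloadavg() does not exist on every platform and can raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass

    if resource is not None:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
        snapshot["max_rss_raw"] = usage.ru_maxrss

    return snapshot
```

If both values come back as `None` on your platform, zeroed system metrics in `/v1/metrics` are more likely a platform limitation than a middleware problem.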
+
+### Alert Delivery Not Working
+
+If alerts are not being delivered:
+
+1. **Check webhook configuration**
+   - Verify `AITBC_ALERT_WEBHOOK_URL` environment variable is set
+   - Test webhook URL with a simple curl POST request
+   - Check webhook server logs for incoming requests
+
+2. **Check alert suppression**
+   - The alert dispatcher applies a 5-minute cooldown by default
+   - Check if alerts are being suppressed due to recent deliveries
+   - Verify cooldown logic in `alert_dispatcher._is_suppressed()`
+
+3. **Check alert history**
+   - Use `/agents/integration/production/alerts` to see recent alert attempts
+   - Check `delivery_status` field: `sent`, `suppressed`, or `failed`
+   - Check `error` field for failed deliveries
+
+4. **Check log fallback**
+   - If webhook URL is not configured, alerts fall back to log output
+   - Check coordinator API logs for warning messages about alerts
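To reason about why an alert shows `suppressed` rather than `sent`, it can help to see cooldown logic in isolation. The class below is an illustrative sketch only; it is not the code behind `alert_dispatcher._is_suppressed()`, and the names `CooldownTracker`, `is_suppressed`, and `record_sent` are hypothetical. The only value taken from the text above is the 5-minute (300-second) default.

```python
import time


class CooldownTracker:
    """Suppress repeat alerts for the same key inside a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_sent: dict[str, float] = {}

    def is_suppressed(self, alert_key: str) -> bool:
        """True if this key was delivered less than cooldown_seconds ago."""
        last = self._last_sent.get(alert_key)
        if last is None:
            return False
        return (self._clock() - last) < self.cooldown_seconds

    def record_sent(self, alert_key: str) -> None:
        """Remember the delivery time for this key."""
        self._last_sent[alert_key] = self._clock()
```

In this sketch the cooldown is tracked per alert key, so a recent `error_rate` delivery would not suppress a `memory` alert. Whether the real dispatcher keys its cooldown the same way is something to confirm in its source.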
+
+### Dashboard Not Loading
+
+If the metrics dashboard is not displaying data:
+
+1. **Check API endpoints are accessible**
+   - Verify `/v1/metrics` returns valid JSON
+   - Verify `/v1/health` returns healthy status
+   - Check browser console for CORS or network errors
+
+2. **Check dashboard file path**
+   - Ensure the dashboard is served from the correct location
+   - Verify that static file serving is configured in the web server
+
+3. **Check browser console**
+   - Look for JavaScript errors
+   - Check for failed API requests
+   - Verify polling interval is reasonable (default 5 seconds)
+
+### Alert Thresholds Not Triggering
+
+If alerts should trigger but do not:
+
+1. **Verify threshold values**
+   - Error rate threshold: 1%
+   - Average response time threshold: 500ms
+   - Memory usage threshold: 90%
+   - Cache hit rate threshold: 70%
+
+2. **Check metrics calculation**
+   - Verify metrics are being collected correctly
+   - Check that response times are recorded in seconds (not milliseconds)
+   - Verify cache hit rate calculation includes both hits and misses
+
+3. **Check alert evaluation logic**
+   - Verify `get_alert_states()` is called during metrics collection
+   - Check that alert states are included in `/v1/metrics` response
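The threshold checks above can be sketched as a small evaluation function. This is a hedged illustration, not the coordinator's `get_alert_states()`: the threshold numbers come from step 1, but the comparison directions (alert when error rate, response time, or memory is above the threshold, and when cache hit rate is below it), the input key names, and the seconds-to-milliseconds conversion are all assumptions.

```python
# Threshold values from the list above; comparison directions are assumed.
ERROR_RATE_MAX_PERCENT = 1.0
AVG_RESPONSE_MAX_MS = 500.0
MEMORY_MAX_PERCENT = 90.0
CACHE_HIT_RATE_MIN_PERCENT = 70.0


def rate_percent(part: float, total: float) -> float:
    """Percentage of part in total, returning 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


def evaluate_alerts(raw: dict) -> dict:
    """Map raw counters (hypothetical key names) to alert states."""
    error_rate = rate_percent(raw["api_errors"], raw["api_requests"])
    # The hit rate must be computed over hits plus misses, not hits alone.
    lookups = raw["cache_hits"] + raw["cache_misses"]
    cache_hit_rate = rate_percent(raw["cache_hits"], lookups)
    # Assumes response times are recorded in seconds and compared in ms.
    avg_response_ms = raw["avg_response_seconds"] * 1000.0

    return {
        "error_rate": error_rate > ERROR_RATE_MAX_PERCENT,
        "avg_response_time": avg_response_ms > AVG_RESPONSE_MAX_MS,
        "memory": raw["memory_percent"] > MEMORY_MAX_PERCENT,
        # With zero lookups the rate is 0.0, so this alert fires on a cold cache.
        "cache_hit_rate": cache_hit_rate < CACHE_HIT_RATE_MIN_PERCENT,
    }
```

A unit mix-up is easy to spot with a function like this: a response time already in milliseconds and multiplied by 1000 again would trip the response-time alert on almost every request, while a value in seconds compared directly against 500 would never trip it.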
+
 ## Next

 - [Quick Start](./1_quick-start.md) — Get started