# AITBC Performance Monitoring Setup

## Overview

This document describes the performance monitoring setup for AITBC, including block processing time, job processing time, and uptime monitoring using systemd services.

## Monitoring Infrastructure

### Components

1. **Prometheus** - Metrics collection and storage (systemd service)
2. **Grafana** - Visualization and dashboards (systemd service)
3. **Node Exporter** - System-level metrics (systemd service)
4. **Custom Metrics Exporters** - Application-specific metrics

### Systemd Service Management

AITBC uses systemd for service orchestration. Monitoring services are managed as systemd units.

#### Starting Monitoring Services

```bash
# Unit names below are the ones shipped by the Debian and Grafana packages

# Start Prometheus
sudo systemctl start prometheus
sudo systemctl enable prometheus

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Start Node Exporter
sudo systemctl start prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter

# Check service status
sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status prometheus-node-exporter
```

## Block Processing Time Monitoring

### Metrics to Track

- `block_processing_duration_seconds` - Time to process a block
- `block_height` - Current blockchain height
- `block_validation_duration_seconds` - Time to validate a block
- `block_propagation_duration_seconds` - Time to propagate block to peers

### Implementation

Add metrics to blockchain node:

```python
from prometheus_client import Counter, Histogram, Gauge

block_processing_duration = Histogram(
    'block_processing_duration_seconds',
    'Time to process a block',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

block_height = Gauge(
    'block_height',
    'Current blockchain height'
)

block_validation_duration = Histogram(
    'block_validation_duration_seconds',
    'Time to validate a block',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

block_propagation_duration = Histogram(
    'block_propagation_duration_seconds',
    'Time to propagate block to peers',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)
```

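
One way to feed these histograms is to wrap the block-handling code in a timer and pass the elapsed seconds to a histogram's `observe` method. The sketch below is an assumption about how a node might do this, not AITBC's actual code: `timed` and `recorded` are hypothetical names, and the list stands in for a histogram so the example runs without `prometheus_client` installed.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(observe: Callable[[float], None]) -> Iterator[None]:
    """Measure the wrapped code and report elapsed seconds to `observe`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even when the wrapped code raises, so slow failures are counted.
        observe(time.perf_counter() - start)


# Stand-in for block_processing_duration.observe (hypothetical, for illustration).
recorded: list[float] = []

with timed(recorded.append):
    time.sleep(0.01)  # placeholder for the real block-processing work
```

In a real node, `timed(block_processing_duration.observe)` would record into the histogram itself. `prometheus_client` histograms also provide a built-in `.time()` context manager that serves the same purpose.
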
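### Monitoring Endpoint

Add a `/metrics` endpoint to the blockchain node:

```python
from fastapi import FastAPI
from prometheus_client import make_asgi_app

# Stands in for the node's existing FastAPI app so the snippet is self-contained
app = FastAPI()

# ASGI app that serves the default registry in Prometheus text format
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
```
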
## Job Processing Time Monitoring

### Metrics to Track

- `job_submission_duration_seconds` - Time to submit a job
- `job_processing_duration_seconds` - Time to complete a job, from submission to result
- `job_queue_duration_seconds` - Time a job spends in the queue before assignment
- `job_execution_duration_seconds` - Time for actual GPU execution
- `jobs_total` - Total number of jobs processed, labeled by `status`
- `jobs_failed_total` - Total number of failed jobs
- `jobs_in_queue` - Number of jobs currently in the queue

### Implementation

Add metrics to coordinator API:

```python
from prometheus_client import Counter, Histogram, Gauge

job_submission_duration = Histogram(
    'job_submission_duration_seconds',
    'Time to submit a job',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

job_processing_duration = Histogram(
    'job_processing_duration_seconds',
    'Time to complete a job from submission to result',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
)

job_queue_duration = Histogram(
    'job_queue_duration_seconds',
    'Time job spends in queue before assignment',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0]
)

job_execution_duration = Histogram(
    'job_execution_duration_seconds',
    'Time for actual GPU execution',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
)

jobs_total = Counter(
    'jobs_total',
    'Total number of jobs processed',
    ['status']
)

# Listed under "Metrics to Track" and used by the HighFailureRate alert
jobs_failed_total = Counter(
    'jobs_failed_total',
    'Total number of failed jobs'
)

jobs_in_queue = Gauge(
    'jobs_in_queue',
    'Number of jobs currently in queue'
)
```

### Instrumentation Points

1. **Job Submission** - Track submission duration
2. **Job Assignment** - Track queue duration
3. **Job Execution** - Track execution duration
4. **Job Completion** - Track total processing duration

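
The queue, execution, and total durations can all be derived from timestamps captured at these points. The following is a minimal sketch under assumptions: `JobTimestamps` and `job_durations` are hypothetical names, and the sample timestamps are invented. Submission duration is not included because it is the latency of the submit request itself and would be timed around that handler.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    """Unix timestamps (seconds) captured at each instrumentation point."""
    submitted: float  # job accepted by the coordinator
    assigned: float   # job handed to a worker
    started: float    # GPU execution begins
    finished: float   # result returned


def job_durations(ts: JobTimestamps) -> dict[str, float]:
    """Map the timestamps onto the duration metrics listed above."""
    return {
        "job_queue_duration_seconds": ts.assigned - ts.submitted,
        "job_execution_duration_seconds": ts.finished - ts.started,
        "job_processing_duration_seconds": ts.finished - ts.submitted,
    }


# Invented sample values for illustration.
durations = job_durations(
    JobTimestamps(submitted=100.0, assigned=104.0, started=105.0, finished=135.0)
)
```

Each value would then be passed to the matching histogram, for example `job_queue_duration.observe(durations["job_queue_duration_seconds"])`.
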
## Uptime Monitoring

### Metrics to Track

- `up` - Service availability (1 = up, 0 = down)
- `service_uptime_seconds` - Total uptime duration
- `service_downtime_seconds` - Total downtime duration
- `service_restart_count` - Number of service restarts

### Implementation

Use the Prometheus blackbox exporter for external uptime monitoring:

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://coordinator-api:8011/v1/health
          - http://blockchain-node:8080/v1/health
          - http://marketplace:8102/v1/health
    relabel_configs:
      # Pass the health URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself
      # (address is an assumption; 9115 is the exporter's default port)
      - target_label: __address__
        replacement: localhost:9115
```

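
With the blackbox exporter, Prometheus does not scrape the health URLs directly. It asks the exporter to probe each one by calling the exporter's `/probe` endpoint with `module` and `target` query parameters. The sketch below builds that request URL to make the flow concrete. `BLACKBOX_ADDRESS` and `probe_url` are hypothetical names, and the exporter address is an assumption (9115 is the exporter's default port).

```python
from urllib.parse import urlencode

# Assumed blackbox exporter address; adjust to where the exporter actually runs.
BLACKBOX_ADDRESS = "localhost:9115"


def probe_url(target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL that is requested for one health-check target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{BLACKBOX_ADDRESS}/probe?{query}"


url = probe_url("http://coordinator-api:8011/v1/health")
```

Fetching this URL by hand (for example with `curl`) is a quick way to check what the exporter reports for a target; the `probe_success` metric in its output is 1 when the probe passed.
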
### Internal Uptime Metrics

Add to each service:

```python
from prometheus_client import Gauge, Counter

service_uptime = Gauge(
    'service_uptime_seconds',
    'Service uptime in seconds'
)

# prometheus_client exposes counters with a `_total` suffix,
# so this one is scraped as `service_restart_count_total`
service_restart_count = Counter(
    'service_restart_count',
    'Number of service restarts'
)
```

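
The uptime gauge needs a value source. A simple approach, sketched here under assumptions, is to remember when the process started and compute the difference on demand. `_PROCESS_START` and `uptime_seconds` are hypothetical names, not part of AITBC.

```python
import time

# Captured once at import time, i.e. when the service process starts.
_PROCESS_START = time.monotonic()


def uptime_seconds() -> float:
    """Seconds elapsed since this process started."""
    # monotonic() is unaffected by wall-clock adjustments such as NTP steps.
    return time.monotonic() - _PROCESS_START


first = uptime_seconds()
time.sleep(0.01)
second = uptime_seconds()
```

With `prometheus_client`, calling `service_uptime.set_function(uptime_seconds)` would make the gauge evaluate this function at every scrape, so no background update loop is needed.
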
## Alerting Rules

### Critical Alerts

```yaml
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      # histogram_quantile needs the per-bucket rate, aggregated by `le`
      - alert: BlockProcessingTooSlow
        expr: histogram_quantile(0.95, sum by (le) (rate(block_processing_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block processing time exceeds 1s (p95)"

      - alert: JobProcessingTooSlow
        expr: histogram_quantile(0.95, sum by (le) (rate(job_processing_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Job processing time exceeds 5s (p95)"
```

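
The p95 thresholds depend on how a quantile is estimated from histogram buckets: Prometheus finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below is a simplified re-implementation of that idea for illustration only, not Prometheus's code. The function name mirrors the PromQL function, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: fall back to the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for the block_processing_duration_seconds buckets.
sample_buckets = [
    (0.1, 50), (0.5, 80), (1.0, 90), (2.0, 98),
    (5.0, 100), (10.0, 100), (math.inf, 100),
]
p95 = histogram_quantile(0.95, sample_buckets)
```

For these sample counts the 95th observation falls in the 1.0 to 2.0 bucket, so the estimate lands at about 1.625 s and the 1 s threshold would be exceeded. Because the estimate is interpolated, its precision is limited by the bucket boundaries chosen when the histogram is defined.
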
### Warning Alerts

```yaml
# Append to the `groups:` list
  - name: warnings
    rules:
      - alert: HighJobQueue
        expr: jobs_in_queue > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job queue backlog exceeds 100 jobs"

      # sum() drops the `status` label on jobs_total so both sides of the division match
      - alert: HighFailureRate
        expr: sum(rate(jobs_failed_total[5m])) / sum(rate(jobs_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job failure rate exceeds 5%"
```

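
The failure-rate expression divides the per-second growth of the failed-jobs counter by the per-second growth of the total-jobs counter over the same window. A worked version of that arithmetic is sketched below. The function names and counter readings are invented for illustration, and counter resets are ignored for simplicity.

```python
def per_second_rate(earlier: float, later: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window (no reset handling)."""
    return (later - earlier) / window_seconds


def failure_ratio(
    failed: tuple[float, float],
    total: tuple[float, float],
    window_seconds: float = 300.0,
) -> float:
    """Share of jobs that failed during the window, given (start, end) counter readings."""
    failed_rate = per_second_rate(*failed, window_seconds)
    total_rate = per_second_rate(*total, window_seconds)
    # With no jobs in the window there is nothing to alert on.
    return failed_rate / total_rate if total_rate else 0.0


# Invented readings at the start and end of a 5-minute window.
ratio = failure_ratio(failed=(12.0, 30.0), total=(1000.0, 1200.0))
should_alert = ratio > 0.05
```

Here 18 of 200 jobs failed in the window, a ratio of roughly 9%, which is above the 5% threshold.
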
## Grafana Dashboards

### Dashboard: AITBC System Overview

**Panels:**
1. Service Uptime (uptime gauge)
2. Request Rate (requests per second)
3. Error Rate (errors per second)
4. Response Time (p50, p95, p99)
5. Queue Length (jobs in queue)
6. Blockchain Height (current block)
7. Block Processing Time (histogram)
8. Job Processing Time (histogram)

### Dashboard: Blockchain Performance

**Panels:**
1. Block Processing Time (p95)
2. Block Validation Time (p95)
3. Block Propagation Time (p95)
4. Block Height (current)
5. Transactions per Block
6. Network Peer Count

### Dashboard: Job Processing Performance

**Panels:**
1. Job Submission Rate (jobs/second)
2. Job Processing Time (p95)
3. Job Queue Duration (p95)
4. Job Execution Time (p95)
5. Jobs in Queue (current)
6. Job Success Rate (percentage)

## Installation

### Prerequisites

```bash
# Install Prometheus and Node Exporter (available in Debian stable)
# promtool ships inside the prometheus package
sudo apt update
sudo apt install prometheus prometheus-node-exporter

# Grafana is NOT available in Debian stable
# Install from the official Grafana APT repository or download the .deb
# (apt-key is deprecated, so the signing key goes into a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana
```

### Setup

```bash
# Create systemd service for Prometheus
# Only needed for a manual binary install under /usr/local/bin;
# the Debian package already ships its own prometheus.service
# The quoted 'EOF' keeps the backslashes so they reach the unit file unchanged
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin (Grafana prompts for a new password on first login)
```

### Import Dashboards

1. Navigate to Grafana
2. Go to Dashboards → Import
3. Import dashboard JSON files from `infra/monitoring/grafana/dashboards/`

## Configuration Files

### Prometheus Config

Location: `infra/monitoring/prometheus.yml`

### Grafana Datasources

Location: `infra/monitoring/grafana/datasources/prometheus.yml`

### Grafana Dashboards

Location: `infra/monitoring/grafana/dashboards/`

## Testing

### Verify Metrics Endpoint

```bash
# Test coordinator API metrics
curl http://localhost:8011/metrics

# Test blockchain node metrics
curl http://localhost:8080/metrics

# Test marketplace metrics
curl http://localhost:8102/metrics
```

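
Reading raw `/metrics` output by eye gets tedious, so a small check for expected metric names can help. The sketch below parses Prometheus text-format output and reports which expected names are absent. `metric_names` and `missing_metrics` are hypothetical helpers, and the sample output is invented rather than captured from a real AITBC node.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text: str, expected: set[str]) -> set[str]:
    """Return the expected sample names that do not appear in the output."""
    return expected - metric_names(exposition_text)


# Invented sample of what the blockchain node's /metrics endpoint might return.
sample = """\
# HELP block_height Current blockchain height
# TYPE block_height gauge
block_height 1024.0
block_processing_duration_seconds_bucket{le="0.1"} 3.0
block_processing_duration_seconds_count 3.0
"""

missing = missing_metrics(
    sample, {"block_height", "block_validation_duration_seconds_count"}
)
```

Note that histograms appear under their `_bucket`, `_sum`, and `_count` sample names, so those suffixed names are the ones to look for.
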
### Verify Prometheus

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query metrics (URL quoted so the shell does not interpret the `?`)
curl 'http://localhost:9090/api/v1/query?query=up'
```

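
The query endpoint returns JSON, so the `up` check is easy to automate. The sketch below lists instances whose `up` value is 0 in an instant-query response. `down_instances` is a hypothetical helper, and the sample response, including its job names, is invented to match the shape of the Prometheus HTTP API.

```python
import json


def down_instances(response_body: str) -> list[str]:
    """Return instances whose `up` sample is 0 in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


# Invented response body for illustration.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "localhost:8011", "job": "coordinator"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "localhost:8080", "job": "blockchain"},
             "value": [1700000000.0, "0"]},
        ],
    },
})
down = down_instances(sample_response)
```

In practice the response body would come from the `curl` query above or from `urllib.request.urlopen` against the same URL.
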
## Maintenance

### Regular Tasks

1. **Review Alerts** - Weekly review of alert rules
2. **Update Dashboards** - Monthly dashboard updates
3. **Review Retention** - Quarterly review of data retention policies
4. **Capacity Planning** - Quarterly review of storage needs

### Backup

```bash
# Backup Prometheus data
sudo tar -czf /tmp/prometheus-backup.tar.gz /var/lib/prometheus

# Backup Grafana data
sudo tar -czf /tmp/grafana-backup.tar.gz /var/lib/grafana
```

## Troubleshooting

### Metrics Not Appearing

1. Check service is running: `sudo systemctl status prometheus`
2. Check metrics endpoint: `curl http://service:port/metrics`
3. Check Prometheus logs: `sudo journalctl -u prometheus -n 50`
4. Check Prometheus targets: http://localhost:9090/targets

### High Memory Usage

1. Reduce the retention period (set with the `--storage.tsdb.retention.time` flag on the Prometheus command line, not in `prometheus.yml`)
2. Increase the scrape interval in `prometheus.yml` so targets are scraped less often
3. Add more memory or storage to the Prometheus host

### Alerts Not Firing

1. Check alert rule syntax: `promtool check rules <rules-file>`
2. Check Alertmanager configuration
3. Check Grafana notification channels

### Service Won't Start

1. Check service logs: `sudo journalctl -u [service] -n 50`
2. Check configuration: `sudo systemctl cat [service]`
3. Check port conflicts: `sudo ss -tulpn`

## Next Steps

1. Implement metrics instrumentation in services
2. Create Grafana dashboards
3. Set up alert notifications
4. Configure external uptime monitoring (e.g., UptimeRobot, Pingdom)
5. Integrate with incident management system