AITBC Performance Monitoring Setup
Overview
This document describes the performance monitoring setup for AITBC, including block processing time, job processing time, and uptime monitoring using systemd services.
Monitoring Infrastructure
Components
- Prometheus - Metrics collection and storage (systemd service)
- Grafana - Visualization and dashboards (systemd service)
- Node Exporter - System-level metrics (systemd service)
- Custom Metrics Exporters - Application-specific metrics
Systemd Service Management
AITBC uses systemd for service orchestration. Monitoring services are managed as systemd units.
Starting Monitoring Services
# Start Prometheus
sudo systemctl start prometheus
sudo systemctl enable prometheus
# Start Grafana (the Grafana package installs its unit as grafana-server)
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
# Start Node Exporter (the Debian package installs its unit as prometheus-node-exporter)
sudo systemctl start prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
# Check service status
sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status prometheus-node-exporter
Block Processing Time Monitoring
Metrics to Track
- block_processing_duration_seconds - Time to process a block
- block_height - Current blockchain height
- block_validation_duration_seconds - Time to validate a block
- block_propagation_duration_seconds - Time to propagate block to peers
Implementation
Add metrics to blockchain node:
from prometheus_client import Counter, Histogram, Gauge

block_processing_duration = Histogram(
    'block_processing_duration_seconds',
    'Time to process a block',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

block_height = Gauge(
    'block_height',
    'Current blockchain height'
)

block_validation_duration = Histogram(
    'block_validation_duration_seconds',
    'Time to validate a block',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

block_propagation_duration = Histogram(
    'block_propagation_duration_seconds',
    'Time to propagate block to peers',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)
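The following is a minimal sketch of how these metrics might be recorded. The process_block function and the dict-shaped block with a height key are hypothetical and do not come from the AITBC codebase. A private CollectorRegistry is used so the snippet runs in isolation; a real service would normally register on the default registry.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, Histogram

# Private registry so this sketch does not clash with metrics defined elsewhere.
registry = CollectorRegistry()

block_processing_duration = Histogram(
    'block_processing_duration_seconds',
    'Time to process a block',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    registry=registry,
)
block_height = Gauge(
    'block_height',
    'Current blockchain height',
    registry=registry,
)


def process_block(block):
    """Hypothetical block handler; the sleep stands in for real work."""
    # Histogram.time() observes the elapsed seconds when the with-block exits.
    with block_processing_duration.time():
        time.sleep(0.01)
    # Record the height only after the block was processed successfully.
    block_height.set(block['height'])


process_block({'height': 42})
```

The validation and propagation histograms could be wrapped around their respective code paths in the same way.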
Monitoring Endpoint
Add /metrics endpoint to blockchain node:
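from fastapi import FastAPI
from prometheus_client import make_asgi_app

# In FastAPI app (reuse the service's existing app instance if there is one)
app = FastAPI()

# Serve the default registry in Prometheus text format under /metrics
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)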
Job Processing Time Monitoring
Metrics to Track
- job_submission_duration_seconds - Time to submit a job
- job_processing_duration_seconds - Time to complete a job
- job_queue_duration_seconds - Time job spends in queue
- job_execution_duration_seconds - Time for actual GPU execution
- jobs_total - Total number of jobs processed
- jobs_failed_total - Total number of failed jobs
Implementation
Add metrics to coordinator API:
from prometheus_client import Counter, Histogram, Gauge

job_submission_duration = Histogram(
    'job_submission_duration_seconds',
    'Time to submit a job',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

job_processing_duration = Histogram(
    'job_processing_duration_seconds',
    'Time to complete a job from submission to result',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
)

job_queue_duration = Histogram(
    'job_queue_duration_seconds',
    'Time job spends in queue before assignment',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0]
)

job_execution_duration = Histogram(
    'job_execution_duration_seconds',
    'Time for actual GPU execution',
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
)

jobs_total = Counter(
    'jobs_total',
    'Total number of jobs processed',
    ['status']
)

# Listed under "Metrics to Track" and used by the HighFailureRate alert
jobs_failed_total = Counter(
    'jobs_failed_total',
    'Total number of failed jobs'
)

jobs_in_queue = Gauge(
    'jobs_in_queue',
    'Number of jobs currently in queue'
)
Instrumentation Points
- Job Submission - Track submission duration
- Job Assignment - Track queue duration
- Job Execution - Track execution duration
- Job Completion - Track total processing duration
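As a rough sketch of how the queue, execution, and total processing durations relate, the snippet below derives them from per-stage timestamps. The JobTimestamps record and its field names are hypothetical and are not taken from the coordinator API. Submission duration is left out because it would be timed around the submit request handler itself.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    """Hypothetical record of when a job reached each stage (Unix seconds)."""
    submitted: float   # job accepted by the coordinator
    assigned: float    # job handed to a worker
    started: float     # GPU execution began
    completed: float   # result returned


def job_durations(ts: JobTimestamps) -> dict:
    """Return the durations that would be observed at each instrumentation point."""
    return {
        'queue': ts.assigned - ts.submitted,        # job_queue_duration_seconds
        'execution': ts.completed - ts.started,     # job_execution_duration_seconds
        'processing': ts.completed - ts.submitted,  # job_processing_duration_seconds
    }


durations = job_durations(
    JobTimestamps(submitted=100.0, assigned=103.0, started=104.0, completed=124.0)
)
print(durations)
```

Each value would then be passed to the observe() method of the matching histogram, for example job_queue_duration.observe(durations['queue']).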
Uptime Monitoring
Metrics to Track
- up - Service availability (1 = up, 0 = down)
- service_uptime_seconds - Total uptime duration
- service_downtime_seconds - Total downtime duration
- service_restart_count - Number of service restarts
Implementation
Use Prometheus blackbox exporter for external uptime monitoring:
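scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://coordinator-api:8011/v1/health
          - http://blockchain-node:8080/v1/health
          - http://marketplace:8102/v1/health
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself, not the target
      # (the host is an assumption; 9115 is the exporter's default port)
      - target_label: __address__
        replacement: localhost:9115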
Internal Uptime Metrics
Add to each service:
from prometheus_client import Gauge, Counter

service_uptime = Gauge(
    'service_uptime_seconds',
    'Service uptime in seconds'
)

service_restart_count = Counter(
    'service_restart_count',
    'Number of service restarts'
)
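One way the uptime gauge might be kept current is to let it compute its value at scrape time instead of updating it in a background loop. This is a sketch under assumptions: the private registry and the module-level start timestamp are illustrative choices, not part of the AITBC services.

```python
import time

from prometheus_client import CollectorRegistry, Gauge

# Private registry so the sketch runs in isolation.
registry = CollectorRegistry()

service_uptime = Gauge(
    'service_uptime_seconds',
    'Service uptime in seconds',
    registry=registry,
)

# Record the start once; a monotonic clock is unaffected by wall-clock changes.
_started_at = time.monotonic()

# set_function makes the gauge call this function whenever it is collected.
service_uptime.set_function(lambda: time.monotonic() - _started_at)
```

Because the value is derived from the process start time, it resets to zero whenever the service restarts.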
Alerting Rules
Critical Alerts
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
      - alert: BlockProcessingTooSlow
        expr: histogram_quantile(0.95, sum by (le) (rate(block_processing_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block processing time exceeds 1s (p95)"
      - alert: JobProcessingTooSlow
        expr: histogram_quantile(0.95, sum by (le) (rate(job_processing_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Job processing time exceeds 5s (p95)"
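The p95 alerts rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The simplified sketch below illustrates that idea with made-up counts for the block processing buckets. It is not Prometheus's actual implementation, which works on rates of the _bucket series.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError('q must be in (0, 1]')
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for the block processing buckets.
observed = [(0.1, 10), (0.5, 60), (1.0, 90), (2.0, 98),
            (5.0, 100), (10.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, observed)
print(round(p95, 3))
```

With these counts the estimate lands between the 1.0 and 2.0 bucket bounds, so the precision of a threshold such as 1s depends on how finely the buckets are spaced around it.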
Warning Alerts
  - name: warnings
    rules:
      - alert: HighJobQueue
        expr: jobs_in_queue > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job queue backlog exceeds 100 jobs"
      - alert: HighFailureRate
        # sum() drops the labels so the two counters can be divided
        expr: sum(rate(jobs_failed_total[5m])) / sum(rate(jobs_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job failure rate exceeds 5%"
Grafana Dashboards
Dashboard: AITBC System Overview
Panels:
- Service Uptime (uptime gauge)
- Request Rate (requests per second)
- Error Rate (errors per second)
- Response Time (p50, p95, p99)
- Queue Length (jobs in queue)
- Blockchain Height (current block)
- Block Processing Time (histogram)
- Job Processing Time (histogram)
Dashboard: Blockchain Performance
Panels:
- Block Processing Time (p95)
- Block Validation Time (p95)
- Block Propagation Time (p95)
- Block Height (current)
- Transactions per Block
- Network Peer Count
Dashboard: Job Processing Performance
Panels:
- Job Submission Rate (jobs/second)
- Job Processing Time (p95)
- Job Queue Duration (p95)
- Job Execution Time (p95)
- Jobs in Queue (current)
- Job Success Rate (percentage)
Installation
Prerequisites
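# Install Prometheus and Node Exporter (available in Debian stable)
# promtool ships inside the prometheus package, not as a separate package
sudo apt update
sudo apt install prometheus prometheus-node-exporter
# Grafana is NOT available in Debian stable
# Install from official Grafana repository or download .deb
# apt-key is deprecated, so the signing key goes into a dedicated keyring
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana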
Setup
# Create systemd service for Prometheus
# (only needed for a manual install under /usr/local/bin; the Debian package
# already ships a unit, a prometheus user, and /usr/bin/prometheus)
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Create the service account if it does not exist yet
id prometheus >/dev/null 2>&1 || sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
Import Dashboards
- Navigate to Grafana
- Go to Dashboards → Import
- Import dashboard JSON files from infra/monitoring/grafana/dashboards/
Configuration Files
Prometheus Config
Location: infra/monitoring/prometheus.yml
Grafana Datasources
Location: infra/monitoring/grafana/datasources/prometheus.yml
Grafana Dashboards
Location: infra/monitoring/grafana/dashboards/
Testing
Verify Metrics Endpoint
# Test coordinator API metrics
curl http://localhost:8011/metrics
# Test blockchain node metrics
curl http://localhost:8080/metrics
# Test marketplace metrics
curl http://localhost:8102/metrics
Verify Prometheus
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Query metrics
curl 'http://localhost:9090/api/v1/query?query=up'
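The same check can be scripted. The sketch below runs the up query against the Prometheus HTTP API and lists the instances reporting 0. The function names are illustrative, and the sample response uses invented targets so that the parsing step can be tried without a running Prometheus.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def down_instances(response: dict) -> list:
    """Return the instance labels whose 'up' sample is 0 in an instant-query response."""
    if response.get('status') != 'success':
        raise RuntimeError(f"query failed: {response.get('error')}")
    return [
        series['metric'].get('instance', '<unknown>')
        for series in response['data']['result']
        # Each value is a [timestamp, "sample value as string"] pair.
        if float(series['value'][1]) == 0.0
    ]


def query_up(base_url: str = 'http://localhost:9090') -> dict:
    """Run the instant query 'up' against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': 'up'})}"
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Invented sample in the shape of an instant-query response, for offline use.
sample = {
    'status': 'success',
    'data': {
        'resultType': 'vector',
        'result': [
            {'metric': {'__name__': 'up', 'instance': 'coordinator-api:8011'},
             'value': [1700000000.0, '1']},
            {'metric': {'__name__': 'up', 'instance': 'blockchain-node:8080'},
             'value': [1700000000.0, '0']},
        ],
    },
}
print(down_instances(sample))
```

Against a live server, down_instances(query_up()) would perform the same check on real targets.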
Maintenance
Regular Tasks
- Review Alerts - Weekly review of alert rules
- Update Dashboards - Monthly dashboard updates
- Review Retention - Quarterly review of data retention policies
- Capacity Planning - Quarterly review of storage needs
Backup
# Backup Prometheus data
sudo tar -czf /tmp/prometheus-backup.tar.gz /var/lib/prometheus
# Backup Grafana data
sudo tar -czf /tmp/grafana-backup.tar.gz /var/lib/grafana
Troubleshooting
Metrics Not Appearing
- Check service is running: sudo systemctl status prometheus
- Check metrics endpoint: curl http://service:port/metrics
- Check Prometheus logs: sudo journalctl -u prometheus -n 50
- Check Prometheus targets: http://localhost:9090/targets
High Memory Usage
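- Reduce the retention period (set with the --storage.tsdb.retention.time command-line flag, not in prometheus.yml)
- Increase the scrape interval so that fewer samples are ingested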
- Add more storage to Prometheus
Alerts Not Firing
- Check alert rules syntax
- Check Alertmanager configuration
- Check Grafana notification channels
Service Won't Start
- Check service logs: sudo journalctl -u [service] -n 50
- Check configuration: sudo systemctl cat [service]
- Check port conflicts: sudo netstat -tulpn
Next Steps
- Implement metrics instrumentation in services
- Create Grafana dashboards
- Set up alert notifications
- Configure external uptime monitoring (e.g., UptimeRobot, Pingdom)
- Integrate with incident management system