SLA Monitoring Guide

This guide covers SLA (Service Level Agreement) monitoring and billing instrumentation for coordinator/pool hub services in the AITBC ecosystem.

Overview

The SLA monitoring system provides:

  • Real-time tracking of miner performance metrics
  • Automated SLA violation detection and alerting
  • Capacity planning with forecasting and scaling recommendations
  • Integration with coordinator-api billing system
  • Comprehensive API endpoints for monitoring and management

Architecture

┌─────────────────┐
│   Pool-Hub      │
│                 │
│  SLA Collector  │──────┐
│  Capacity       │      │
│  Planner        │      │
│                 │      │
└────────┬────────┘      │
         │               │
         │ HTTP API      │
         │               │
┌────────▼────────┐      │
│ Coordinator-API │◀─────┘
│                 │
│ Usage Tracking  │
│ Billing Service │
│ Multi-tenant DB │
└─────────────────┘

SLA Metrics

Miner Uptime

  • Definition: Percentage of time a miner is available and responsive
  • Calculation: Based on heartbeat intervals (5-minute threshold)
  • Threshold: 95%
  • Alert Levels:
    • Critical: <85.5% (threshold * 0.9)
    • High: <95% (threshold)

Response Time

  • Definition: Average time for miner to respond to match requests
  • Calculation: Average of eta_ms from match results (last 100 results)
  • Threshold: 1000ms (P95)
  • Alert Levels:
    • Critical: >2000ms (threshold * 2)
    • High: >1000ms (threshold)

Job Completion Rate

  • Definition: Percentage of jobs completed successfully
  • Calculation: Successful outcomes / total outcomes (last 7 days)
  • Threshold: 90%
  • Alert Levels:
    • Critical: <90% (threshold)

Capacity Availability

  • Definition: Percentage of miners available (not busy)
  • Calculation: Active miners / Total miners
  • Threshold: 80%
  • Alert Levels:
    • High: <80% (threshold)
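To make the uptime definition above concrete, here is a minimal Python sketch of interval-based uptime: each heartbeat keeps a miner "up" for the 5-minute threshold, and uptime is the covered fraction of the measurement window. The helper name and signature are illustrative, not the pool-hub implementation:

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_THRESHOLD = timedelta(minutes=5)  # miner counts as up for 5 min after each heartbeat

def uptime_pct(heartbeats, window_start, window_end):
    """Percentage of the window covered by heartbeat intervals.

    Overlapping intervals are merged so consecutive heartbeats
    do not double-count coverage.
    """
    covered = timedelta()
    up_until = window_start
    for hb in sorted(heartbeats):
        start = max(hb, up_until)
        end = min(hb + HEARTBEAT_THRESHOLD, window_end)
        if end > start:
            covered += end - start
            up_until = end
    return 100.0 * covered / (window_end - window_start)
```

With heartbeats at minutes 0, 5, 10, and 30 of a one-hour window, this yields 20 covered minutes, i.e. roughly 33.3% uptime.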

Configuration

Environment Variables

Add to pool-hub .env:

# Coordinator-API Billing Integration
COORDINATOR_BILLING_URL=http://localhost:8011
COORDINATOR_API_KEY=your_api_key_here

# SLA Configuration
SLA_UPTIME_THRESHOLD=95.0
SLA_RESPONSE_TIME_THRESHOLD=1000.0
SLA_COMPLETION_RATE_THRESHOLD=90.0
SLA_CAPACITY_THRESHOLD=80.0

# Capacity Planning
CAPACITY_FORECAST_HOURS=168
CAPACITY_ALERT_THRESHOLD_PCT=80.0

# Billing Sync
BILLING_SYNC_INTERVAL_HOURS=1

# SLA Collection
SLA_COLLECTION_INTERVAL_SECONDS=300

Settings File

Configuration can also be set in poolhub/settings.py:

from typing import Dict

from pydantic import Field
from pydantic_settings import BaseSettings  # pydantic v1: from pydantic import BaseSettings


class Settings(BaseSettings):
    # Coordinator-API Billing Integration
    coordinator_billing_url: str = Field(default="http://localhost:8011")
    coordinator_api_key: str | None = Field(default=None)

    # SLA Configuration
    sla_thresholds: Dict[str, float] = Field(
        default_factory=lambda: {
            "uptime_pct": 95.0,
            "response_time_ms": 1000.0,
            "completion_rate_pct": 90.0,
            "capacity_availability_pct": 80.0,
        }
    )

    # Capacity Planning Configuration
    capacity_forecast_hours: int = Field(default=168)
    capacity_alert_threshold_pct: float = Field(default=80.0)

    # Billing Sync Configuration
    billing_sync_interval_hours: int = Field(default=1)

    # SLA Collection Configuration
    sla_collection_interval_seconds: int = Field(default=300)

Database Schema

SLA Metrics Table

CREATE TABLE sla_metrics (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    metric_type VARCHAR(32) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    is_violation BOOLEAN DEFAULT FALSE,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_metrics_miner_id ON sla_metrics(miner_id);
CREATE INDEX ix_sla_metrics_timestamp ON sla_metrics(timestamp);
CREATE INDEX ix_sla_metrics_metric_type ON sla_metrics(metric_type);

SLA Violations Table

CREATE TABLE sla_violations (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    violation_type VARCHAR(32) NOT NULL,
    severity VARCHAR(16) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    violation_duration_ms INTEGER,
    resolved_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_violations_miner_id ON sla_violations(miner_id);
CREATE INDEX ix_sla_violations_created_at ON sla_violations(created_at);
CREATE INDEX ix_sla_violations_severity ON sla_violations(severity);

Capacity Snapshots Table

CREATE TABLE capacity_snapshots (
    id UUID PRIMARY KEY,
    total_miners INTEGER NOT NULL,
    active_miners INTEGER NOT NULL,
    total_parallel_capacity INTEGER NOT NULL,
    total_queue_length INTEGER NOT NULL,
    capacity_utilization_pct FLOAT NOT NULL,
    forecast_capacity INTEGER NOT NULL,
    recommended_scaling VARCHAR(32) NOT NULL,
    scaling_reason TEXT,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_capacity_snapshots_timestamp ON capacity_snapshots(timestamp);

Database Migration

Run the migration to add SLA and capacity tables:

cd apps/pool-hub
alembic upgrade head

API Endpoints

SLA Metrics Endpoints

Get SLA Metrics for a Miner

GET /sla/metrics/{miner_id}?hours=24

Response:

[
  {
    "id": "uuid",
    "miner_id": "miner_001",
    "metric_type": "uptime_pct",
    "metric_value": 98.5,
    "threshold": 95.0,
    "is_violation": false,
    "timestamp": "2026-04-22T15:00:00Z",
    "metadata": {}
  }
]
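A client consuming this endpoint often wants a per-metric violation summary rather than raw rows. A small sketch over the JSON shape shown above (the summarize helper is hypothetical):

```python
def summarize_metrics(metrics):
    """Count samples and violations per metric_type."""
    summary = {}
    for m in metrics:
        s = summary.setdefault(m["metric_type"], {"count": 0, "violations": 0})
        s["count"] += 1
        if m["is_violation"]:
            s["violations"] += 1
    return summary

# Sample payload in the response shape documented above
sample = [
    {"metric_type": "uptime_pct", "metric_value": 98.5, "is_violation": False},
    {"metric_type": "uptime_pct", "metric_value": 82.0, "is_violation": True},
    {"metric_type": "response_time_ms", "metric_value": 640.0, "is_violation": False},
]
```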

Get All SLA Metrics

GET /sla/metrics?hours=24

Get SLA Violations

GET /sla/violations?resolved=false&miner_id=miner_001

Trigger SLA Metrics Collection

POST /sla/metrics/collect

Response:

{
  "miners_processed": 10,
  "metrics_collected": [...],
  "violations_detected": 2,
  "capacity": {
    "total_miners": 10,
    "active_miners": 8,
    "capacity_availability_pct": 80.0
  }
}

Capacity Planning Endpoints

Get Capacity Snapshots

GET /sla/capacity/snapshots?hours=24

Get Capacity Forecast

GET /sla/capacity/forecast?hours_ahead=168

Response:

{
  "forecast_horizon_hours": 168,
  "current_capacity": 1000,
  "projected_capacity": 1500,
  "recommended_scaling": "+50%",
  "confidence": 0.85,
  "source": "coordinator_api"
}
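The projected_capacity in this response can be approximated with a simple linear trend over recent capacity snapshots. A hedged sketch using a least-squares slope (the actual forecasting model in coordinator-api may differ):

```python
def forecast_capacity(samples, hours_ahead):
    """Extrapolate capacity hours_ahead past the last sample.

    samples: list of (hour, capacity) points; fits a least-squares
    line and evaluates it at the last hour plus hours_ahead.
    """
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in samples) / denom if denom else 0.0
    return my + slope * (xs[-1] + hours_ahead - mx)
```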

Get Scaling Recommendations

GET /sla/capacity/recommendations

Response:

{
  "current_state": "healthy",
  "recommendations": [
    {
      "action": "add_miners",
      "quantity": 2,
      "reason": "Projected capacity shortage in 2 weeks",
      "priority": "medium"
    }
  ],
  "source": "coordinator_api"
}
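One plausible way such a recommendation could be derived is by comparing capacity utilization against the configured alert threshold (illustrative logic only; coordinator-api's actual policy may differ):

```python
def scaling_recommendation(utilization_pct, threshold_pct=80.0):
    """Map current utilization to a coarse scaling action."""
    if utilization_pct >= threshold_pct * 1.2:
        return {"action": "add_miners", "priority": "high"}
    if utilization_pct >= threshold_pct:
        return {"action": "add_miners", "priority": "medium"}
    if utilization_pct < threshold_pct * 0.5:
        return {"action": "remove_miners", "priority": "low"}
    return {"action": "none", "priority": "none"}
```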

Configure Capacity Alerts

POST /sla/capacity/alerts/configure

Request:

{
  "threshold_pct": 80.0,
  "notification_email": "admin@example.com"
}

Billing Integration Endpoints

Get Billing Usage

GET /sla/billing/usage?hours=24&tenant_id=tenant_001

Sync Billing Usage

POST /sla/billing/sync

Request:

{
  "miner_id": "miner_001",
  "hours_back": 24
}

Record Usage Event

POST /sla/billing/usage/record

Request:

{
  "tenant_id": "tenant_001",
  "resource_type": "gpu_hours",
  "quantity": 10.5,
  "unit_price": 0.50,
  "job_id": "job_123",
  "metadata": {}
}
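For a usage event like the one above, the billed line total is quantity × unit_price. A tiny sketch (helper name hypothetical):

```python
def usage_event_total(event):
    """Line total for a usage event, rounded to cents."""
    return round(event["quantity"] * event["unit_price"], 2)
```

For the example request (10.5 gpu_hours at 0.50), this gives 5.25.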

Generate Invoice

POST /sla/billing/invoice/generate

Request:

{
  "tenant_id": "tenant_001",
  "period_start": "2026-03-01T00:00:00Z",
  "period_end": "2026-03-31T23:59:59Z"
}

Status Endpoint

Get SLA Status

GET /sla/status

Response:

{
  "status": "healthy",
  "active_violations": 0,
  "recent_metrics_count": 50,
  "timestamp": "2026-04-22T15:00:00Z"
}

Automated Collection

SLA Collection Scheduler

The SLA collector can be run as a background service to automatically collect metrics:

import asyncio

from poolhub.database import get_db
from poolhub.services.sla_collector import SLACollector, SLACollectorScheduler


async def main() -> None:
    # Acquire a database session and wire up the collector
    db = next(get_db())
    sla_collector = SLACollector(db)
    scheduler = SLACollectorScheduler(sla_collector)

    # Start automated collection (every 5 minutes)
    await scheduler.start(collection_interval_seconds=300)


asyncio.run(main())

Billing Sync Scheduler

The billing integration can be run as a background service to automatically sync usage:

import asyncio

from poolhub.database import get_db
from poolhub.services.billing_integration import BillingIntegration, BillingIntegrationScheduler


async def main() -> None:
    # Acquire a database session and wire up the billing integration
    db = next(get_db())
    billing_integration = BillingIntegration(db)
    scheduler = BillingIntegrationScheduler(billing_integration)

    # Start automated sync (every hour)
    await scheduler.start(sync_interval_hours=1)


asyncio.run(main())

Monitoring and Alerting

Prometheus Metrics

SLA metrics are exposed to Prometheus with the namespace poolhub:

  • poolhub_sla_uptime_pct - Miner uptime percentage
  • poolhub_sla_response_time_ms - Response time in milliseconds
  • poolhub_sla_completion_rate_pct - Job completion rate percentage
  • poolhub_sla_capacity_availability_pct - Capacity availability percentage
  • poolhub_sla_violations_total - Total SLA violations
  • poolhub_billing_sync_errors_total - Billing sync errors

Alert Rules

Example Prometheus alert rules:

groups:
  - name: poolhub_sla
    rules:
      - alert: HighSLAViolationRate
        expr: rate(poolhub_sla_violations_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High SLA violation rate

      - alert: LowMinerUptime
        expr: poolhub_sla_uptime_pct < 95
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Miner uptime below threshold

      - alert: HighResponseTime
        expr: poolhub_sla_response_time_ms > 1000
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Response time above threshold

Troubleshooting

SLA Metrics Not Recording

Symptom: SLA metrics are not being recorded in the database

Solutions:

  1. Check SLA collector is running: ps aux | grep sla_collector
  2. Verify database connection: Check pool-hub database logs
  3. Check SLA collection interval: Ensure sla_collection_interval_seconds is configured
  4. Verify miner heartbeats: Check miner_status.last_heartbeat_at is being updated

Billing Sync Failing

Symptom: Billing sync to coordinator-api is failing

Solutions:

  1. Verify coordinator-api is accessible: curl http://localhost:8011/health
  2. Check API key: Ensure COORDINATOR_API_KEY is set correctly
  3. Check network connectivity: Ensure pool-hub can reach coordinator-api
  4. Review billing integration logs: Check for HTTP errors or timeouts

Capacity Alerts Not Triggering

Symptom: Capacity alerts are not being generated

Solutions:

  1. Verify capacity snapshots are being created: Check capacity_snapshots table
  2. Check alert thresholds: Ensure capacity_alert_threshold_pct is configured
  3. Verify alert configuration: Check alert configuration endpoint
  4. Review coordinator-api capacity planning: Ensure it's receiving pool-hub data

Testing

Run the SLA and billing integration tests:

cd apps/pool-hub

# Run all SLA and billing tests
pytest tests/test_sla_collector.py
pytest tests/test_billing_integration.py
pytest tests/test_sla_endpoints.py
pytest tests/test_integration_coordinator.py

# Run with coverage
pytest --cov=poolhub.services.sla_collector tests/test_sla_collector.py
pytest --cov=poolhub.services.billing_integration tests/test_billing_integration.py

Best Practices

  1. Monitor SLA Metrics Regularly: Set up automated monitoring dashboards to track SLA metrics in real-time
  2. Configure Appropriate Thresholds: Adjust SLA thresholds based on your service requirements
  3. Review Violations Promptly: Investigate and resolve SLA violations quickly to maintain service quality
  4. Plan Capacity Proactively: Use capacity forecasting to anticipate scaling needs
  5. Test Billing Integration: Regularly test billing sync to ensure accurate usage tracking
  6. Keep Documentation Updated: Maintain up-to-date documentation for SLA configurations and procedures

Integration with Existing Systems

Coordinator-API Integration

The pool-hub integrates with coordinator-api's billing system via HTTP API:

  1. Usage Recording: Pool-hub sends usage events to coordinator-api's /api/billing/usage endpoint
  2. Billing Metrics: Pool-hub can query billing metrics from coordinator-api
  3. Invoice Generation: Pool-hub can trigger invoice generation in coordinator-api
  4. Capacity Planning: Pool-hub provides capacity data to coordinator-api's capacity planning system

Prometheus Integration

SLA metrics are automatically exposed to Prometheus:

  • Metrics are labeled by miner_id, metric_type, and other dimensions
  • Use Prometheus query language to create custom dashboards
  • Set up alert rules based on SLA thresholds

Alerting Integration

SLA violations can trigger alerts through:

  • Prometheus Alertmanager
  • Custom webhook integrations
  • Email notifications (via coordinator-api)
  • Slack/Discord integrations (via coordinator-api)

Security Considerations

  1. API Key Security: Store coordinator-api API keys securely (use environment variables or secret management)
  2. Database Access: Ensure database connections use SSL/TLS in production
  3. Rate Limiting: Implement rate limiting on billing sync endpoints to prevent abuse
  4. Audit Logging: Enable audit logging for SLA and billing operations
  5. Access Control: Restrict access to SLA and billing endpoints to authorized users

Performance Considerations

  1. Batch Operations: Use batch operations for billing sync to reduce HTTP overhead
  2. Index Optimization: Ensure database indexes are properly configured for SLA queries
  3. Caching: Use Redis caching for frequently accessed SLA metrics
  4. Async Processing: Use async operations for SLA collection and billing sync
  5. Data Retention: Implement data retention policies for SLA metrics and capacity snapshots

Maintenance

Regular Tasks

  1. Review SLA Thresholds: Quarterly review and adjust SLA thresholds based on service performance
  2. Clean Up Old Data: Regularly clean up old SLA metrics and capacity snapshots (e.g., keep 90 days)
  3. Review Capacity Forecasts: Monthly review of capacity forecasts and scaling recommendations
  4. Audit Billing Records: Monthly audit of billing records for accuracy
  5. Update Documentation: Keep documentation updated with any configuration changes
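The 90-day cleanup in item 2 can be expressed directly against the tables defined earlier (illustrative SQL; adjust the retention window to your policy, and keep unresolved violations until they are resolved):

```sql
DELETE FROM sla_metrics WHERE timestamp < NOW() - INTERVAL '90 days';
DELETE FROM capacity_snapshots WHERE timestamp < NOW() - INTERVAL '90 days';
DELETE FROM sla_violations
  WHERE resolved_at IS NOT NULL
    AND resolved_at < NOW() - INTERVAL '90 days';
```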

Backup and Recovery

  1. Database Backups: Ensure regular backups of SLA and billing tables
  2. Configuration Backups: Backup configuration files and environment variables
  3. Recovery Procedures: Document recovery procedures for SLA and billing systems
  4. Testing Backups: Regularly test backup and recovery procedures
