SLA Monitoring Guide

This guide covers SLA (Service Level Agreement) monitoring and billing instrumentation for coordinator/pool hub services in the AITBC ecosystem.

Overview

The SLA monitoring system provides:

  • Real-time tracking of miner performance metrics
  • Automated SLA violation detection and alerting
  • Capacity planning with forecasting and scaling recommendations
  • Integration with coordinator-api billing system
  • Comprehensive API endpoints for monitoring and management

Architecture

┌─────────────────┐
│   Pool-Hub      │
│                 │
│  SLA Collector  │──────┐
│  Capacity       │      │
│  Planner        │      │
│                 │      │
└────────┬────────┘      │
         │               │
         │ HTTP API      │
         │               │
┌────────▼────────┐      │
│ Coordinator-API │◀─────┘
│                 │
│ Usage Tracking  │
│ Billing Service │
│ Multi-tenant DB │
└─────────────────┘

SLA Metrics

Miner Uptime

  • Definition: Percentage of time a miner is available and responsive
  • Calculation: Based on heartbeat intervals (5-minute threshold)
  • Threshold: 95%
  • Alert Levels:
    • Critical: <85.5% (threshold * 0.9)
    • High: <95% (threshold)

Response Time

  • Definition: Average time for miner to respond to match requests
  • Calculation: Average of eta_ms from match results (last 100 results)
  • Threshold: 1000ms (P95)
  • Alert Levels:
    • Critical: >2000ms (threshold * 2)
    • High: >1000ms (threshold)

Job Completion Rate

  • Definition: Percentage of jobs completed successfully
  • Calculation: Successful outcomes / total outcomes (last 7 days)
  • Threshold: 90%
  • Alert Levels:
    • Critical: <90% (threshold)

Capacity Availability

  • Definition: Percentage of miners available (not busy)
  • Calculation: Active miners / Total miners
  • Threshold: 80%
  • Alert Levels:
    • High: <80% (threshold)
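To make the uptime definition above concrete, here is a minimal Python sketch of interval-based uptime: each heartbeat keeps a miner "up" for the 5-minute threshold, and uptime is the covered fraction of the measurement window. The helper name and signature are illustrative, not the pool-hub implementation:

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_THRESHOLD = timedelta(minutes=5)  # miner counts as up for 5 min after each heartbeat

def uptime_pct(heartbeats, window_start, window_end):
    """Percentage of the window covered by heartbeat intervals.

    Overlapping intervals are merged so consecutive heartbeats
    do not double-count coverage.
    """
    covered = timedelta()
    up_until = window_start
    for hb in sorted(heartbeats):
        start = max(hb, up_until)
        end = min(hb + HEARTBEAT_THRESHOLD, window_end)
        if end > start:
            covered += end - start
            up_until = end
    return 100.0 * covered / (window_end - window_start)
```

With heartbeats at minutes 0, 5, 10, and 30 of a one-hour window, this yields 20 covered minutes, i.e. roughly 33.3% uptime.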

Configuration

Environment Variables

Add to pool-hub .env:

# Coordinator-API Billing Integration
COORDINATOR_BILLING_URL=http://localhost:8011
COORDINATOR_API_KEY=your_api_key_here

# SLA Configuration
SLA_UPTIME_THRESHOLD=95.0
SLA_RESPONSE_TIME_THRESHOLD=1000.0
SLA_COMPLETION_RATE_THRESHOLD=90.0
SLA_CAPACITY_THRESHOLD=80.0

# Capacity Planning
CAPACITY_FORECAST_HOURS=168
CAPACITY_ALERT_THRESHOLD_PCT=80.0

# Billing Sync
BILLING_SYNC_INTERVAL_HOURS=1

# SLA Collection
SLA_COLLECTION_INTERVAL_SECONDS=300

Settings File

Configuration can also be set in poolhub/settings.py:

from typing import Dict

from pydantic import Field
from pydantic_settings import BaseSettings  # pydantic v1: from pydantic import BaseSettings


class Settings(BaseSettings):
    # Coordinator-API Billing Integration
    coordinator_billing_url: str = Field(default="http://localhost:8011")
    coordinator_api_key: str | None = Field(default=None)

    # SLA Configuration
    sla_thresholds: Dict[str, float] = Field(
        default_factory=lambda: {
            "uptime_pct": 95.0,
            "response_time_ms": 1000.0,
            "completion_rate_pct": 90.0,
            "capacity_availability_pct": 80.0,
        }
    )

    # Capacity Planning Configuration
    capacity_forecast_hours: int = Field(default=168)
    capacity_alert_threshold_pct: float = Field(default=80.0)

    # Billing Sync Configuration
    billing_sync_interval_hours: int = Field(default=1)

    # SLA Collection Configuration
    sla_collection_interval_seconds: int = Field(default=300)

Database Schema

SLA Metrics Table

CREATE TABLE sla_metrics (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    metric_type VARCHAR(32) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    is_violation BOOLEAN DEFAULT FALSE,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_metrics_miner_id ON sla_metrics(miner_id);
CREATE INDEX ix_sla_metrics_timestamp ON sla_metrics(timestamp);
CREATE INDEX ix_sla_metrics_metric_type ON sla_metrics(metric_type);

SLA Violations Table

CREATE TABLE sla_violations (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    violation_type VARCHAR(32) NOT NULL,
    severity VARCHAR(16) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    violation_duration_ms INTEGER,
    resolved_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_violations_miner_id ON sla_violations(miner_id);
CREATE INDEX ix_sla_violations_created_at ON sla_violations(created_at);
CREATE INDEX ix_sla_violations_severity ON sla_violations(severity);

Capacity Snapshots Table

CREATE TABLE capacity_snapshots (
    id UUID PRIMARY KEY,
    total_miners INTEGER NOT NULL,
    active_miners INTEGER NOT NULL,
    total_parallel_capacity INTEGER NOT NULL,
    total_queue_length INTEGER NOT NULL,
    capacity_utilization_pct FLOAT NOT NULL,
    forecast_capacity INTEGER NOT NULL,
    recommended_scaling VARCHAR(32) NOT NULL,
    scaling_reason TEXT,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_capacity_snapshots_timestamp ON capacity_snapshots(timestamp);

Database Migration

Run the migration to add SLA and capacity tables:

cd apps/pool-hub
alembic upgrade head

API Endpoints

SLA Metrics Endpoints

Get SLA Metrics for a Miner

GET /sla/metrics/{miner_id}?hours=24

Response:

[
  {
    "id": "uuid",
    "miner_id": "miner_001",
    "metric_type": "uptime_pct",
    "metric_value": 98.5,
    "threshold": 95.0,
    "is_violation": false,
    "timestamp": "2026-04-22T15:00:00Z",
    "metadata": {}
  }
]
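A client consuming this endpoint often wants a per-metric violation summary rather than raw rows. A small sketch over the JSON shape shown above (the summarize helper is hypothetical):

```python
def summarize_metrics(metrics):
    """Count samples and violations per metric_type."""
    summary = {}
    for m in metrics:
        s = summary.setdefault(m["metric_type"], {"count": 0, "violations": 0})
        s["count"] += 1
        if m["is_violation"]:
            s["violations"] += 1
    return summary

# Sample payload in the response shape documented above
sample = [
    {"metric_type": "uptime_pct", "metric_value": 98.5, "is_violation": False},
    {"metric_type": "uptime_pct", "metric_value": 82.0, "is_violation": True},
    {"metric_type": "response_time_ms", "metric_value": 640.0, "is_violation": False},
]
```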

Get All SLA Metrics

GET /sla/metrics?hours=24

Get SLA Violations

GET /sla/violations?resolved=false&miner_id=miner_001

Trigger SLA Metrics Collection

POST /sla/metrics/collect

Response:

{
  "miners_processed": 10,
  "metrics_collected": [...],
  "violations_detected": 2,
  "capacity": {
    "total_miners": 10,
    "active_miners": 8,
    "capacity_availability_pct": 80.0
  }
}

Capacity Planning Endpoints

Get Capacity Snapshots

GET /sla/capacity/snapshots?hours=24

Get Capacity Forecast

GET /sla/capacity/forecast?hours_ahead=168

Response:

{
  "forecast_horizon_hours": 168,
  "current_capacity": 1000,
  "projected_capacity": 1500,
  "recommended_scaling": "+50%",
  "confidence": 0.85,
  "source": "coordinator_api"
}
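The projected_capacity in this response can be approximated with a simple linear trend over recent capacity snapshots. A hedged sketch using a least-squares slope (the actual forecasting model in coordinator-api may differ):

```python
def forecast_capacity(samples, hours_ahead):
    """Extrapolate capacity hours_ahead past the last sample.

    samples: list of (hour, capacity) points; fits a least-squares
    line and evaluates it at the last hour plus hours_ahead.
    """
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in samples) / denom if denom else 0.0
    return my + slope * (xs[-1] + hours_ahead - mx)
```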

Get Scaling Recommendations

GET /sla/capacity/recommendations

Response:

{
  "current_state": "healthy",
  "recommendations": [
    {
      "action": "add_miners",
      "quantity": 2,
      "reason": "Projected capacity shortage in 2 weeks",
      "priority": "medium"
    }
  ],
  "source": "coordinator_api"
}
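One plausible way such a recommendation could be derived is by comparing capacity utilization against the configured alert threshold (illustrative logic only; coordinator-api's actual policy may differ):

```python
def scaling_recommendation(utilization_pct, threshold_pct=80.0):
    """Map current utilization to a coarse scaling action."""
    if utilization_pct >= threshold_pct * 1.2:
        return {"action": "add_miners", "priority": "high"}
    if utilization_pct >= threshold_pct:
        return {"action": "add_miners", "priority": "medium"}
    if utilization_pct < threshold_pct * 0.5:
        return {"action": "remove_miners", "priority": "low"}
    return {"action": "none", "priority": "none"}
```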

Configure Capacity Alerts

POST /sla/capacity/alerts/configure

Request:

{
  "threshold_pct": 80.0,
  "notification_email": "admin@example.com"
}

Billing Integration Endpoints

Get Billing Usage

GET /sla/billing/usage?hours=24&tenant_id=tenant_001

Sync Billing Usage

POST /sla/billing/sync

Request:

{
  "miner_id": "miner_001",
  "hours_back": 24
}

Record Usage Event

POST /sla/billing/usage/record

Request:

{
  "tenant_id": "tenant_001",
  "resource_type": "gpu_hours",
  "quantity": 10.5,
  "unit_price": 0.50,
  "job_id": "job_123",
  "metadata": {}
}
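For a usage event like the one above, the billed line total is quantity × unit_price. A tiny sketch (helper name hypothetical):

```python
def usage_event_total(event):
    """Line total for a usage event, rounded to cents."""
    return round(event["quantity"] * event["unit_price"], 2)
```

For the example request (10.5 gpu_hours at 0.50), this gives 5.25.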

Generate Invoice

POST /sla/billing/invoice/generate

Request:

{
  "tenant_id": "tenant_001",
  "period_start": "2026-03-01T00:00:00Z",
  "period_end": "2026-03-31T23:59:59Z"
}

Status Endpoint

Get SLA Status

GET /sla/status

Response:

{
  "status": "healthy",
  "active_violations": 0,
  "recent_metrics_count": 50,
  "timestamp": "2026-04-22T15:00:00Z"
}

Automated Collection

SLA Collection Scheduler

The SLA collector can be run as a background service to automatically collect metrics:

import asyncio

from poolhub.database import get_db
from poolhub.services.sla_collector import SLACollector, SLACollectorScheduler


async def main() -> None:
    # Acquire a database session and wire up the collector
    db = next(get_db())
    sla_collector = SLACollector(db)
    scheduler = SLACollectorScheduler(sla_collector)

    # Start automated collection (every 5 minutes)
    await scheduler.start(collection_interval_seconds=300)


asyncio.run(main())

Billing Sync Scheduler

The billing integration can be run as a background service to automatically sync usage:

import asyncio

from poolhub.database import get_db
from poolhub.services.billing_integration import BillingIntegration, BillingIntegrationScheduler


async def main() -> None:
    # Acquire a database session and wire up the billing integration
    db = next(get_db())
    billing_integration = BillingIntegration(db)
    scheduler = BillingIntegrationScheduler(billing_integration)

    # Start automated sync (every hour)
    await scheduler.start(sync_interval_hours=1)


asyncio.run(main())

Monitoring and Alerting

Prometheus Metrics

SLA metrics are exposed to Prometheus with the namespace poolhub:

  • poolhub_sla_uptime_pct - Miner uptime percentage
  • poolhub_sla_response_time_ms - Response time in milliseconds
  • poolhub_sla_completion_rate_pct - Job completion rate percentage
  • poolhub_sla_capacity_availability_pct - Capacity availability percentage
  • poolhub_sla_violations_total - Total SLA violations
  • poolhub_billing_sync_errors_total - Billing sync errors

Alert Rules

Example Prometheus alert rules:

groups:
  - name: poolhub_sla
    rules:
      - alert: HighSLAViolationRate
        expr: rate(poolhub_sla_violations_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High SLA violation rate

      - alert: LowMinerUptime
        expr: poolhub_sla_uptime_pct < 95
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Miner uptime below threshold

      - alert: HighResponseTime
        expr: poolhub_sla_response_time_ms > 1000
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Response time above threshold

Troubleshooting

SLA Metrics Not Recording

Symptom: SLA metrics are not being recorded in the database

Solutions:

  1. Check SLA collector is running: ps aux | grep sla_collector
  2. Verify database connection: Check pool-hub database logs
  3. Check SLA collection interval: Ensure sla_collection_interval_seconds is configured
  4. Verify miner heartbeats: Check miner_status.last_heartbeat_at is being updated

Billing Sync Failing

Symptom: Billing sync to coordinator-api is failing

Solutions:

  1. Verify coordinator-api is accessible: curl http://localhost:8011/health
  2. Check API key: Ensure COORDINATOR_API_KEY is set correctly
  3. Check network connectivity: Ensure pool-hub can reach coordinator-api
  4. Review billing integration logs: Check for HTTP errors or timeouts

Capacity Alerts Not Triggering

Symptom: Capacity alerts are not being generated

Solutions:

  1. Verify capacity snapshots are being created: Check capacity_snapshots table
  2. Check alert thresholds: Ensure capacity_alert_threshold_pct is configured
  3. Verify alert configuration: Check alert configuration endpoint
  4. Review coordinator-api capacity planning: Ensure it's receiving pool-hub data

Testing

Run the SLA and billing integration tests:

cd apps/pool-hub

# Run all SLA and billing tests
pytest tests/test_sla_collector.py
pytest tests/test_billing_integration.py
pytest tests/test_sla_endpoints.py
pytest tests/test_integration_coordinator.py

# Run with coverage
pytest --cov=poolhub.services.sla_collector tests/test_sla_collector.py
pytest --cov=poolhub.services.billing_integration tests/test_billing_integration.py

Best Practices

  1. Monitor SLA Metrics Regularly: Set up automated monitoring dashboards to track SLA metrics in real-time
  2. Configure Appropriate Thresholds: Adjust SLA thresholds based on your service requirements
  3. Review Violations Promptly: Investigate and resolve SLA violations quickly to maintain service quality
  4. Plan Capacity Proactively: Use capacity forecasting to anticipate scaling needs
  5. Test Billing Integration: Regularly test billing sync to ensure accurate usage tracking
  6. Keep Documentation Updated: Maintain up-to-date documentation for SLA configurations and procedures

Integration with Existing Systems

Coordinator-API Integration

The pool-hub integrates with coordinator-api's billing system via HTTP API:

  1. Usage Recording: Pool-hub sends usage events to coordinator-api's /api/billing/usage endpoint
  2. Billing Metrics: Pool-hub can query billing metrics from coordinator-api
  3. Invoice Generation: Pool-hub can trigger invoice generation in coordinator-api
  4. Capacity Planning: Pool-hub provides capacity data to coordinator-api's capacity planning system

Prometheus Integration

SLA metrics are automatically exposed to Prometheus:

  • Metrics are labeled by miner_id, metric_type, and other dimensions
  • Use Prometheus query language to create custom dashboards
  • Set up alert rules based on SLA thresholds

Alerting Integration

SLA violations can trigger alerts through:

  • Prometheus Alertmanager
  • Custom webhook integrations
  • Email notifications (via coordinator-api)
  • Slack/Discord integrations (via coordinator-api)

Security Considerations

  1. API Key Security: Store coordinator-api API keys securely (use environment variables or secret management)
  2. Database Access: Ensure database connections use SSL/TLS in production
  3. Rate Limiting: Implement rate limiting on billing sync endpoints to prevent abuse
  4. Audit Logging: Enable audit logging for SLA and billing operations
  5. Access Control: Restrict access to SLA and billing endpoints to authorized users

Performance Considerations

  1. Batch Operations: Use batch operations for billing sync to reduce HTTP overhead
  2. Index Optimization: Ensure database indexes are properly configured for SLA queries
  3. Caching: Use Redis caching for frequently accessed SLA metrics
  4. Async Processing: Use async operations for SLA collection and billing sync
  5. Data Retention: Implement data retention policies for SLA metrics and capacity snapshots

Maintenance

Regular Tasks

  1. Review SLA Thresholds: Quarterly review and adjust SLA thresholds based on service performance
  2. Clean Up Old Data: Regularly clean up old SLA metrics and capacity snapshots (e.g., keep 90 days)
  3. Review Capacity Forecasts: Monthly review of capacity forecasts and scaling recommendations
  4. Audit Billing Records: Monthly audit of billing records for accuracy
  5. Update Documentation: Keep documentation updated with any configuration changes
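The 90-day cleanup in item 2 can be expressed directly against the tables defined earlier (illustrative SQL; adjust the retention window to your policy, and keep unresolved violations until they are resolved):

```sql
DELETE FROM sla_metrics WHERE timestamp < NOW() - INTERVAL '90 days';
DELETE FROM capacity_snapshots WHERE timestamp < NOW() - INTERVAL '90 days';
DELETE FROM sla_violations
  WHERE resolved_at IS NOT NULL
    AND resolved_at < NOW() - INTERVAL '90 days';
```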

Backup and Recovery

  1. Database Backups: Ensure regular backups of SLA and billing tables
  2. Configuration Backups: Backup configuration files and environment variables
  3. Recovery Procedures: Document recovery procedures for SLA and billing systems
  4. Testing Backups: Regularly test backup and recovery procedures
