# SLA Monitoring Guide

This guide covers SLA (Service Level Agreement) monitoring and billing instrumentation for coordinator/pool hub services in the AITBC ecosystem.

## Overview

The SLA monitoring system provides:

- Real-time tracking of miner performance metrics
- Automated SLA violation detection and alerting
- Capacity planning with forecasting and scaling recommendations
- Integration with coordinator-api billing system
- Comprehensive API endpoints for monitoring and management

## Architecture

```
┌─────────────────┐
│ Pool-Hub        │
│                 │
│ SLA Collector   │──────┐
│ Capacity        │      │
│ Planner         │      │
│                 │      │
└────────┬────────┘      │
         │               │
         │ HTTP API      │
         │               │
┌────────▼────────┐      │
│ Coordinator-API │◀─────┘
│                 │
│ Usage Tracking  │
│ Billing Service │
│ Multi-tenant DB │
└─────────────────┘
```

## SLA Metrics

### Miner Uptime

- **Definition**: Percentage of time a miner is available and responsive
- **Calculation**: Based on heartbeat intervals (5-minute threshold)
- **Threshold**: 95%
- **Alert Levels**:
  - Critical: <85.5% (threshold * 0.9)
  - High: <95% (threshold)

### Response Time

- **Definition**: Average time for miner to respond to match requests
- **Calculation**: Average of `eta_ms` from match results (last 100 results)
- **Threshold**: 1000ms (P95)
- **Alert Levels**:
  - Critical: >2000ms (threshold * 2)
  - High: >1000ms (threshold)

### Job Completion Rate

- **Definition**: Percentage of jobs completed successfully
- **Calculation**: Successful outcomes / total outcomes (last 7 days)
- **Threshold**: 90%
- **Alert Levels**:
  - Critical: <90% (threshold)

### Capacity Availability

- **Definition**: Percentage of miners available (not busy)
- **Calculation**: Active miners / Total miners
- **Threshold**: 80%
- **Alert Levels**:
  - High: <80% (threshold)

## Configuration

### Environment Variables

Add to pool-hub `.env`:

```bash
# Coordinator-API Billing Integration
COORDINATOR_BILLING_URL=http://localhost:8011
COORDINATOR_API_KEY=your_api_key_here

# SLA Configuration
SLA_UPTIME_THRESHOLD=95.0
SLA_RESPONSE_TIME_THRESHOLD=1000.0
SLA_COMPLETION_RATE_THRESHOLD=90.0
SLA_CAPACITY_THRESHOLD=80.0

# Capacity Planning
CAPACITY_FORECAST_HOURS=168
CAPACITY_ALERT_THRESHOLD_PCT=80.0

# Billing Sync
BILLING_SYNC_INTERVAL_HOURS=1

# SLA Collection
SLA_COLLECTION_INTERVAL_SECONDS=300
```

### Settings File

Configuration can also be set in `poolhub/settings.py`:

```python
from typing import Dict

from pydantic import Field
from pydantic_settings import BaseSettings  # pydantic v1: from pydantic import BaseSettings


class Settings(BaseSettings):
    # Coordinator-API Billing Integration
    coordinator_billing_url: str = Field(default="http://localhost:8011")
    coordinator_api_key: str | None = Field(default=None)

    # SLA Configuration
    sla_thresholds: Dict[str, float] = Field(
        default_factory=lambda: {
            "uptime_pct": 95.0,
            "response_time_ms": 1000.0,
            "completion_rate_pct": 90.0,
            "capacity_availability_pct": 80.0,
        }
    )

    # Capacity Planning Configuration
    capacity_forecast_hours: int = Field(default=168)
    capacity_alert_threshold_pct: float = Field(default=80.0)

    # Billing Sync Configuration
    billing_sync_interval_hours: int = Field(default=1)

    # SLA Collection Configuration
    sla_collection_interval_seconds: int = Field(default=300)
```

## Database Schema

### SLA Metrics Table

```sql
CREATE TABLE sla_metrics (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    metric_type VARCHAR(32) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    is_violation BOOLEAN DEFAULT FALSE,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_metrics_miner_id ON sla_metrics(miner_id);
CREATE INDEX ix_sla_metrics_timestamp ON sla_metrics(timestamp);
CREATE INDEX ix_sla_metrics_metric_type ON sla_metrics(metric_type);
```

### SLA Violations Table

```sql
CREATE TABLE sla_violations (
    id UUID PRIMARY KEY,
    miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
    violation_type VARCHAR(32) NOT NULL,
    severity VARCHAR(16) NOT NULL,
    metric_value FLOAT NOT NULL,
    threshold FLOAT NOT NULL,
    violation_duration_ms INTEGER,
    resolved_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_sla_violations_miner_id ON sla_violations(miner_id);
CREATE INDEX ix_sla_violations_created_at ON sla_violations(created_at);
CREATE INDEX ix_sla_violations_severity ON sla_violations(severity);
```

### Capacity Snapshots Table

```sql
CREATE TABLE capacity_snapshots (
    id UUID PRIMARY KEY,
    total_miners INTEGER NOT NULL,
    active_miners INTEGER NOT NULL,
    total_parallel_capacity INTEGER NOT NULL,
    total_queue_length INTEGER NOT NULL,
    capacity_utilization_pct FLOAT NOT NULL,
    forecast_capacity INTEGER NOT NULL,
    recommended_scaling VARCHAR(32) NOT NULL,
    scaling_reason TEXT,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ix_capacity_snapshots_timestamp ON capacity_snapshots(timestamp);
```

## Database Migration

Run the migration to add SLA and capacity tables:

```bash
cd apps/pool-hub
alembic upgrade head
```

## API Endpoints

### SLA Metrics Endpoints

#### Get SLA Metrics for a Miner

```bash
GET /sla/metrics/{miner_id}?hours=24
```

Response:

```json
[
  {
    "id": "uuid",
    "miner_id": "miner_001",
    "metric_type": "uptime_pct",
    "metric_value": 98.5,
    "threshold": 95.0,
    "is_violation": false,
    "timestamp": "2026-04-22T15:00:00Z",
    "metadata": {}
  }
]
```

#### Get All SLA Metrics

```bash
GET /sla/metrics?hours=24
```

#### Get SLA Violations

```bash
GET /sla/violations?resolved=false&miner_id=miner_001
```

#### Trigger SLA Metrics Collection

```bash
POST /sla/metrics/collect
```

Response:

```json
{
  "miners_processed": 10,
  "metrics_collected": [...],
  "violations_detected": 2,
  "capacity": {
    "total_miners": 10,
    "active_miners": 8,
    "capacity_availability_pct": 80.0
  }
}
```

### Capacity Planning Endpoints

#### Get Capacity Snapshots

```bash
GET /sla/capacity/snapshots?hours=24
```

#### Get Capacity Forecast

```bash
GET /sla/capacity/forecast?hours_ahead=168
```

Response:

```json
{
  "forecast_horizon_hours": 168,
  "current_capacity": 1000,
  "projected_capacity": 1500,
  "recommended_scaling": "+50%",
  "confidence": 0.85,
  "source": "coordinator_api"
}
```

#### Get Scaling Recommendations

```bash
GET /sla/capacity/recommendations
```

Response:

```json
{
  "current_state": "healthy",
  "recommendations": [
    {
      "action": "add_miners",
      "quantity": 2,
      "reason": "Projected capacity shortage in 2 weeks",
      "priority": "medium"
    }
  ],
  "source": "coordinator_api"
}
```

#### Configure Capacity Alerts

```bash
POST /sla/capacity/alerts/configure
```

Request:

```json
{
  "threshold_pct": 80.0,
  "notification_email": "admin@example.com"
}
```

### Billing Integration Endpoints

#### Get Billing Usage

```bash
GET /sla/billing/usage?hours=24&tenant_id=tenant_001
```

#### Sync Billing Usage

```bash
POST /sla/billing/sync
```

Request:

```json
{
  "miner_id": "miner_001",
  "hours_back": 24
}
```

#### Record Usage Event

```bash
POST /sla/billing/usage/record
```

Request:

```json
{
  "tenant_id": "tenant_001",
  "resource_type": "gpu_hours",
  "quantity": 10.5,
  "unit_price": 0.50,
  "job_id": "job_123",
  "metadata": {}
}
```

#### Generate Invoice

```bash
POST /sla/billing/invoice/generate
```

Request:

```json
{
  "tenant_id": "tenant_001",
  "period_start": "2026-03-01T00:00:00Z",
  "period_end": "2026-03-31T23:59:59Z"
}
```

### Status Endpoint

#### Get SLA Status

```bash
GET /sla/status
```

Response:

```json
{
  "status": "healthy",
  "active_violations": 0,
  "recent_metrics_count": 50,
  "timestamp": "2026-04-22T15:00:00Z"
}
```

## Automated Collection

### SLA Collection Scheduler

The SLA collector can be run as a background service to automatically collect metrics:

```python
from poolhub.services.sla_collector import SLACollector, SLACollectorScheduler
from poolhub.database import get_db


async def start_sla_collection() -> None:
    """Call from a running event loop, e.g. an application startup hook."""
    # Initialize
    db = next(get_db())
    sla_collector = SLACollector(db)
    scheduler = SLACollectorScheduler(sla_collector)

    # Start automated collection (every 5 minutes)
    await scheduler.start(collection_interval_seconds=300)
```

### Billing Sync Scheduler

The billing integration can be run as a background service to automatically sync usage:

```python
from poolhub.services.billing_integration import BillingIntegration, BillingIntegrationScheduler
from poolhub.database import get_db


async def start_billing_sync() -> None:
    """Call from a running event loop, e.g. an application startup hook."""
    # Initialize
    db = next(get_db())
    billing_integration = BillingIntegration(db)
    scheduler = BillingIntegrationScheduler(billing_integration)

    # Start automated sync (every 1 hour)
    await scheduler.start(sync_interval_hours=1)
```

## Monitoring and Alerting

### Prometheus Metrics

SLA metrics are exposed to Prometheus with the namespace `poolhub`:

- `poolhub_sla_uptime_pct` - Miner uptime percentage
- `poolhub_sla_response_time_ms` - Response time in milliseconds
- `poolhub_sla_completion_rate_pct` - Job completion rate percentage
- `poolhub_sla_capacity_availability_pct` - Capacity availability percentage
- `poolhub_sla_violations_total` - Total SLA violations
- `poolhub_billing_sync_errors_total` - Billing sync errors

### Alert Rules

Example Prometheus alert rules:

```yaml
groups:
  - name: poolhub_sla
    rules:
      - alert: HighSLAViolationRate
        expr: rate(poolhub_sla_violations_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High SLA violation rate

      - alert: LowMinerUptime
        expr: poolhub_sla_uptime_pct < 95
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Miner uptime below threshold

      - alert: HighResponseTime
        expr: poolhub_sla_response_time_ms > 1000
        for: 5m
        labels:
          severity: high
        annotations:
          summary: Response time above threshold
```

## Troubleshooting

### SLA Metrics Not Recording

**Symptom**: SLA metrics are not being recorded in the database

**Solutions**:

1. Check that the SLA collector is running: `ps aux | grep sla_collector`
2. Verify database connection: Check pool-hub database logs
3. Check SLA collection interval: Ensure `sla_collection_interval_seconds` is configured
4. Verify miner heartbeats: Check `miner_status.last_heartbeat_at` is being updated

### Billing Sync Failing

**Symptom**: Billing sync to coordinator-api is failing

**Solutions**:

1. Verify coordinator-api is accessible: `curl http://localhost:8011/health`
2. Check API key: Ensure `COORDINATOR_API_KEY` is set correctly
3. Check network connectivity: Ensure pool-hub can reach coordinator-api
4. Review billing integration logs: Check for HTTP errors or timeouts

### Capacity Alerts Not Triggering

**Symptom**: Capacity alerts are not being generated

**Solutions**:

1. Verify capacity snapshots are being created: Check `capacity_snapshots` table
2. Check alert thresholds: Ensure `capacity_alert_threshold_pct` is configured
3. Verify alert configuration: Check alert configuration endpoint
4. Review coordinator-api capacity planning: Ensure it's receiving pool-hub data

## Testing

Run the SLA and billing integration tests:

```bash
cd apps/pool-hub

# Run all SLA and billing tests
pytest tests/test_sla_collector.py
pytest tests/test_billing_integration.py
pytest tests/test_sla_endpoints.py
pytest tests/test_integration_coordinator.py

# Run with coverage
pytest --cov=poolhub.services.sla_collector tests/test_sla_collector.py
pytest --cov=poolhub.services.billing_integration tests/test_billing_integration.py
```

## Best Practices

1. **Monitor SLA Metrics Regularly**: Set up automated monitoring dashboards to track SLA metrics in real-time
2. **Configure Appropriate Thresholds**: Adjust SLA thresholds based on your service requirements
3. **Review Violations Promptly**: Investigate and resolve SLA violations quickly to maintain service quality
4. **Plan Capacity Proactively**: Use capacity forecasting to anticipate scaling needs
5. **Test Billing Integration**: Regularly test billing sync to ensure accurate usage tracking
6. **Keep Documentation Updated**: Maintain up-to-date documentation for SLA configurations and procedures

## Integration with Existing Systems

### Coordinator-API Integration

The pool-hub integrates with coordinator-api's billing system via HTTP API:

1. **Usage Recording**: Pool-hub sends usage events to coordinator-api's `/api/billing/usage` endpoint
2. **Billing Metrics**: Pool-hub can query billing metrics from coordinator-api
3. **Invoice Generation**: Pool-hub can trigger invoice generation in coordinator-api
4. **Capacity Planning**: Pool-hub provides capacity data to coordinator-api's capacity planning system

### Prometheus Integration

SLA metrics are automatically exposed to Prometheus:

- Metrics are labeled by `miner_id`, `metric_type`, and other dimensions
- Use Prometheus query language to create custom dashboards
- Set up alert rules based on SLA thresholds

### Alerting Integration

SLA violations can trigger alerts through:

- Prometheus Alertmanager
- Custom webhook integrations
- Email notifications (via coordinator-api)
- Slack/Discord integrations (via coordinator-api)

## Security Considerations

1. **API Key Security**: Store coordinator-api API keys securely (use environment variables or secret management)
2. **Database Access**: Ensure database connections use SSL/TLS in production
3. **Rate Limiting**: Implement rate limiting on billing sync endpoints to prevent abuse
4. **Audit Logging**: Enable audit logging for SLA and billing operations
5. **Access Control**: Restrict access to SLA and billing endpoints to authorized users

## Performance Considerations

1. **Batch Operations**: Use batch operations for billing sync to reduce HTTP overhead
2. **Index Optimization**: Ensure database indexes are properly configured for SLA queries
3. **Caching**: Use Redis caching for frequently accessed SLA metrics
4. **Async Processing**: Use async operations for SLA collection and billing sync
5. **Data Retention**: Implement data retention policies for SLA metrics and capacity snapshots

## Maintenance

### Regular Tasks

1. **Review SLA Thresholds**: Quarterly review and adjust SLA thresholds based on service performance
2. **Clean Up Old Data**: Regularly clean up old SLA metrics and capacity snapshots (e.g., keep 90 days)
3. **Review Capacity Forecasts**: Monthly review of capacity forecasts and scaling recommendations
4. **Audit Billing Records**: Monthly audit of billing records for accuracy
5. **Update Documentation**: Keep documentation updated with any configuration changes

### Backup and Recovery

1. **Database Backups**: Ensure regular backups of SLA and billing tables
2. **Configuration Backups**: Back up configuration files and environment variables
3. **Recovery Procedures**: Document recovery procedures for SLA and billing systems
4. **Testing Backups**: Regularly test backup and recovery procedures

## References

- [Pool-Hub README](/opt/aitbc/apps/pool-hub/README.md)
- [Coordinator-API Billing Documentation](/opt/aitbc/apps/coordinator-api/README.md)
- [Roadmap](/opt/aitbc/docs/beginner/02_project/2_roadmap.md)
- [Deployment Guide](/opt/aitbc/docs/advanced/04_deployment/0_index.md)
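
## Appendix: Alert Level Sketch

The alert levels listed under SLA Metrics amount to a small set of threshold comparisons. The following is a minimal sketch of that logic, not the pool-hub implementation: the names `THRESHOLDS` and `classify_severity` are hypothetical, and the threshold values are copied from the defaults quoted in this guide. It may help when checking whether a given metric value should count as a violation.

```python
from typing import Optional

# Default thresholds as quoted in this guide (hypothetical constant name).
THRESHOLDS = {
    "uptime_pct": 95.0,
    "response_time_ms": 1000.0,
    "completion_rate_pct": 90.0,
    "capacity_availability_pct": 80.0,
}


def classify_severity(metric_type: str, value: float) -> Optional[str]:
    """Return "critical", "high", or None (no violation) for one metric value.

    Illustrative helper only; it mirrors the alert levels documented above.
    """
    if metric_type not in THRESHOLDS:
        raise ValueError(f"unknown metric type: {metric_type}")
    threshold = THRESHOLDS[metric_type]

    if metric_type == "uptime_pct":
        # Lower is worse: critical below threshold * 0.9, high below threshold.
        if value < threshold * 0.9:
            return "critical"
        return "high" if value < threshold else None

    if metric_type == "response_time_ms":
        # Higher is worse: critical above threshold * 2, high above threshold.
        if value > threshold * 2:
            return "critical"
        return "high" if value > threshold else None

    if metric_type == "completion_rate_pct":
        # The guide lists a single level: critical below the threshold.
        return "critical" if value < threshold else None

    # capacity_availability_pct: single level, high below the threshold.
    return "high" if value < threshold else None


# Example usage with values taken from or near the examples in this guide.
print(classify_severity("uptime_pct", 98.5))          # None
print(classify_severity("uptime_pct", 90.0))          # high
print(classify_severity("response_time_ms", 2500.0))  # critical
```

Keeping the comparison direction explicit per metric avoids a common mistake: uptime, completion rate, and capacity availability violate when they fall below the threshold, while response time violates when it rises above it.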