Some checks failed
API Endpoint Tests / test-api-endpoints (push) Successful in 11s
CLI Tests / test-cli (push) Failing after 7s
Documentation Validation / validate-docs (push) Successful in 8s
Documentation Validation / validate-policies-strict (push) Successful in 3s
Integration Tests / test-service-integration (push) Successful in 38s
Python Tests / test-python (push) Successful in 11s
Security Scanning / security-scan (push) Successful in 29s
Multi-Node Blockchain Health Monitoring / health-check (push) Successful in 1s
CLI Blockchain Features: - Added block operations: import, export, import-chain, blocks-range - Added messaging system commands (deploy, state, topics, create-topic, messages, post, vote, search, reputation, moderate) - Added network force-sync operation - Replaced marketplace handlers with actual RPC calls - Replaced AI handlers with actual RPC calls - Added account operations (account get) - Added transaction query operations - Added mempool query operations - Created keystore_auth.py for authentication - Removed extended features interception - All handlers use keystore credentials for authenticated endpoints Pool Hub Enhancements: - Added SLA monitoring and capacity tables - Added billing integration service - Added SLA collector service - Added SLA router endpoints - Updated pool hub models and settings - Added integration tests for billing and SLA - Updated documentation with SLA monitoring guide
585 lines
16 KiB
Markdown
585 lines
16 KiB
Markdown
# SLA Monitoring Guide
|
|
|
|
This guide covers SLA (Service Level Agreement) monitoring and billing instrumentation for coordinator/pool hub services in the AITBC ecosystem.
|
|
|
|
## Overview
|
|
|
|
The SLA monitoring system provides:
|
|
- Real-time tracking of miner performance metrics
|
|
- Automated SLA violation detection and alerting
|
|
- Capacity planning with forecasting and scaling recommendations
|
|
- Integration with coordinator-api billing system
|
|
- Comprehensive API endpoints for monitoring and management
|
|
|
|
## Architecture
|
|
|
|
```
|
|
┌─────────────────┐
|
|
│ Pool-Hub │
|
|
│ │
|
|
│ SLA Collector │──────┐
|
|
│ Capacity │ │
|
|
│ Planner │ │
|
|
│ │ │
|
|
└────────┬────────┘ │
|
|
│ │
|
|
│ HTTP API │
|
|
│ │
|
|
┌────────▼────────┐ │
|
|
│ Coordinator-API │◀────┘
|
|
│ │
|
|
│ Usage Tracking │
|
|
│ Billing Service │
|
|
│ Multi-tenant DB │
|
|
└─────────────────┘
|
|
```
|
|
|
|
## SLA Metrics
|
|
|
|
### Miner Uptime
|
|
- **Definition**: Percentage of time a miner is available and responsive
|
|
- **Calculation**: Based on heartbeat intervals (5-minute threshold)
|
|
- **Threshold**: 95%
|
|
- **Alert Levels**:
|
|
- Critical: <85.5% (threshold * 0.9)
|
|
- High: <95% (threshold)
|
|
|
|
### Response Time
|
|
- **Definition**: Average time for miner to respond to match requests
|
|
- **Calculation**: Average of `eta_ms` from match results (last 100 results)
|
|
- **Threshold**: 1000ms (P95)
|
|
- **Alert Levels**:
|
|
- Critical: >2000ms (threshold * 2)
|
|
- High: >1000ms (threshold)
|
|
|
|
### Job Completion Rate
|
|
- **Definition**: Percentage of jobs completed successfully
|
|
- **Calculation**: Successful outcomes / total outcomes (last 7 days)
|
|
- **Threshold**: 90%
|
|
- **Alert Levels**:
|
|
- Critical: <90% (threshold)
|
|
|
|
### Capacity Availability
|
|
- **Definition**: Percentage of miners available (not busy)
|
|
- **Calculation**: Active miners / Total miners
|
|
- **Threshold**: 80%
|
|
- **Alert Levels**:
|
|
- High: <80% (threshold)
|
|
|
|
## Configuration
|
|
|
|
### Environment Variables
|
|
|
|
Add to pool-hub `.env`:
|
|
|
|
```bash
|
|
# Coordinator-API Billing Integration
|
|
COORDINATOR_BILLING_URL=http://localhost:8011
|
|
COORDINATOR_API_KEY=your_api_key_here
|
|
|
|
# SLA Configuration
|
|
SLA_UPTIME_THRESHOLD=95.0
|
|
SLA_RESPONSE_TIME_THRESHOLD=1000.0
|
|
SLA_COMPLETION_RATE_THRESHOLD=90.0
|
|
SLA_CAPACITY_THRESHOLD=80.0
|
|
|
|
# Capacity Planning
|
|
CAPACITY_FORECAST_HOURS=168
|
|
CAPACITY_ALERT_THRESHOLD_PCT=80.0
|
|
|
|
# Billing Sync
|
|
BILLING_SYNC_INTERVAL_HOURS=1
|
|
|
|
# SLA Collection
|
|
SLA_COLLECTION_INTERVAL_SECONDS=300
|
|
```
|
|
|
|
### Settings File
|
|
|
|
Configuration can also be set in `poolhub/settings.py`:
|
|
|
|
```python
|
|
class Settings(BaseSettings):
|
|
# Coordinator-API Billing Integration
|
|
coordinator_billing_url: str = Field(default="http://localhost:8011")
|
|
coordinator_api_key: str | None = Field(default=None)
|
|
|
|
# SLA Configuration
|
|
sla_thresholds: Dict[str, float] = Field(
|
|
default_factory=lambda: {
|
|
"uptime_pct": 95.0,
|
|
"response_time_ms": 1000.0,
|
|
"completion_rate_pct": 90.0,
|
|
"capacity_availability_pct": 80.0,
|
|
}
|
|
)
|
|
|
|
# Capacity Planning Configuration
|
|
capacity_forecast_hours: int = Field(default=168)
|
|
capacity_alert_threshold_pct: float = Field(default=80.0)
|
|
|
|
# Billing Sync Configuration
|
|
billing_sync_interval_hours: int = Field(default=1)
|
|
|
|
# SLA Collection Configuration
|
|
sla_collection_interval_seconds: int = Field(default=300)
|
|
```
|
|
|
|
## Database Schema
|
|
|
|
### SLA Metrics Table
|
|
|
|
```sql
|
|
CREATE TABLE sla_metrics (
|
|
id UUID PRIMARY KEY,
|
|
miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
|
|
metric_type VARCHAR(32) NOT NULL,
|
|
metric_value FLOAT NOT NULL,
|
|
threshold FLOAT NOT NULL,
|
|
is_violation BOOLEAN DEFAULT FALSE,
|
|
timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
|
|
metadata JSONB DEFAULT '{}'
|
|
);
|
|
|
|
CREATE INDEX ix_sla_metrics_miner_id ON sla_metrics(miner_id);
|
|
CREATE INDEX ix_sla_metrics_timestamp ON sla_metrics(timestamp);
|
|
CREATE INDEX ix_sla_metrics_metric_type ON sla_metrics(metric_type);
|
|
```
|
|
|
|
### SLA Violations Table
|
|
|
|
```sql
|
|
CREATE TABLE sla_violations (
|
|
id UUID PRIMARY KEY,
|
|
miner_id VARCHAR(64) NOT NULL REFERENCES miners(miner_id) ON DELETE CASCADE,
|
|
violation_type VARCHAR(32) NOT NULL,
|
|
severity VARCHAR(16) NOT NULL,
|
|
metric_value FLOAT NOT NULL,
|
|
threshold FLOAT NOT NULL,
|
|
violation_duration_ms INTEGER,
|
|
resolved_at TIMESTAMP WITH TIME ZONE,
|
|
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
|
|
metadata JSONB DEFAULT '{}'
|
|
);
|
|
|
|
CREATE INDEX ix_sla_violations_miner_id ON sla_violations(miner_id);
|
|
CREATE INDEX ix_sla_violations_created_at ON sla_violations(created_at);
|
|
CREATE INDEX ix_sla_violations_severity ON sla_violations(severity);
|
|
```
|
|
|
|
### Capacity Snapshots Table
|
|
|
|
```sql
|
|
CREATE TABLE capacity_snapshots (
|
|
id UUID PRIMARY KEY,
|
|
total_miners INTEGER NOT NULL,
|
|
active_miners INTEGER NOT NULL,
|
|
total_parallel_capacity INTEGER NOT NULL,
|
|
total_queue_length INTEGER NOT NULL,
|
|
capacity_utilization_pct FLOAT NOT NULL,
|
|
forecast_capacity INTEGER NOT NULL,
|
|
recommended_scaling VARCHAR(32) NOT NULL,
|
|
scaling_reason TEXT,
|
|
timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
|
|
metadata JSONB DEFAULT '{}'
|
|
);
|
|
|
|
CREATE INDEX ix_capacity_snapshots_timestamp ON capacity_snapshots(timestamp);
|
|
```
|
|
|
|
## Database Migration
|
|
|
|
Run the migration to add SLA and capacity tables:
|
|
|
|
```bash
|
|
cd apps/pool-hub
|
|
alembic upgrade head
|
|
```
|
|
|
|
## API Endpoints
|
|
|
|
### SLA Metrics Endpoints
|
|
|
|
#### Get SLA Metrics for a Miner
|
|
```bash
|
|
GET /sla/metrics/{miner_id}?hours=24
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
[
|
|
{
|
|
"id": "uuid",
|
|
"miner_id": "miner_001",
|
|
"metric_type": "uptime_pct",
|
|
"metric_value": 98.5,
|
|
"threshold": 95.0,
|
|
"is_violation": false,
|
|
"timestamp": "2026-04-22T15:00:00Z",
|
|
"metadata": {}
|
|
}
|
|
]
|
|
```
|
|
|
|
#### Get All SLA Metrics
|
|
```bash
|
|
GET /sla/metrics?hours=24
|
|
```
|
|
|
|
#### Get SLA Violations
|
|
```bash
|
|
GET /sla/violations?resolved=false&miner_id=miner_001
|
|
```
|
|
|
|
#### Trigger SLA Metrics Collection
|
|
```bash
|
|
POST /sla/metrics/collect
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"miners_processed": 10,
|
|
"metrics_collected": [...],
|
|
"violations_detected": 2,
|
|
"capacity": {
|
|
"total_miners": 10,
|
|
"active_miners": 8,
|
|
"capacity_availability_pct": 80.0
|
|
}
|
|
}
|
|
```
|
|
|
|
### Capacity Planning Endpoints
|
|
|
|
#### Get Capacity Snapshots
|
|
```bash
|
|
GET /sla/capacity/snapshots?hours=24
|
|
```
|
|
|
|
#### Get Capacity Forecast
|
|
```bash
|
|
GET /sla/capacity/forecast?hours_ahead=168
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"forecast_horizon_hours": 168,
|
|
"current_capacity": 1000,
|
|
"projected_capacity": 1500,
|
|
"recommended_scaling": "+50%",
|
|
"confidence": 0.85,
|
|
"source": "coordinator_api"
|
|
}
|
|
```
|
|
|
|
#### Get Scaling Recommendations
|
|
```bash
|
|
GET /sla/capacity/recommendations
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"current_state": "healthy",
|
|
"recommendations": [
|
|
{
|
|
"action": "add_miners",
|
|
"quantity": 2,
|
|
"reason": "Projected capacity shortage in 2 weeks",
|
|
"priority": "medium"
|
|
}
|
|
],
|
|
"source": "coordinator_api"
|
|
}
|
|
```
|
|
|
|
#### Configure Capacity Alerts
|
|
```bash
|
|
POST /sla/capacity/alerts/configure
|
|
```
|
|
|
|
Request:
|
|
```json
|
|
{
|
|
"threshold_pct": 80.0,
|
|
"notification_email": "admin@example.com"
|
|
}
|
|
```
|
|
|
|
### Billing Integration Endpoints
|
|
|
|
#### Get Billing Usage
|
|
```bash
|
|
GET /sla/billing/usage?hours=24&tenant_id=tenant_001
|
|
```
|
|
|
|
#### Sync Billing Usage
|
|
```bash
|
|
POST /sla/billing/sync
|
|
```
|
|
|
|
Request:
|
|
```json
|
|
{
|
|
"miner_id": "miner_001",
|
|
"hours_back": 24
|
|
}
|
|
```
|
|
|
|
#### Record Usage Event
|
|
```bash
|
|
POST /sla/billing/usage/record
|
|
```
|
|
|
|
Request:
|
|
```json
|
|
{
|
|
"tenant_id": "tenant_001",
|
|
"resource_type": "gpu_hours",
|
|
"quantity": 10.5,
|
|
"unit_price": 0.50,
|
|
"job_id": "job_123",
|
|
"metadata": {}
|
|
}
|
|
```
|
|
|
|
#### Generate Invoice
|
|
```bash
|
|
POST /sla/billing/invoice/generate
|
|
```
|
|
|
|
Request:
|
|
```json
|
|
{
|
|
"tenant_id": "tenant_001",
|
|
"period_start": "2026-03-01T00:00:00Z",
|
|
"period_end": "2026-03-31T23:59:59Z"
|
|
}
|
|
```
|
|
|
|
### Status Endpoint
|
|
|
|
#### Get SLA Status
|
|
```bash
|
|
GET /sla/status
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"status": "healthy",
|
|
"active_violations": 0,
|
|
"recent_metrics_count": 50,
|
|
"timestamp": "2026-04-22T15:00:00Z"
|
|
}
|
|
```
|
|
|
|
## Automated Collection
|
|
|
|
### SLA Collection Scheduler
|
|
|
|
The SLA collector can be run as a background service to automatically collect metrics:
|
|
|
|
```python
|
|
from poolhub.services.sla_collector import SLACollector, SLACollectorScheduler
|
|
from poolhub.database import get_db
|
|
|
|
# Initialize
|
|
db = next(get_db())
|
|
sla_collector = SLACollector(db)
|
|
scheduler = SLACollectorScheduler(sla_collector)
|
|
|
|
# Start automated collection (every 5 minutes)
|
|
await scheduler.start(collection_interval_seconds=300)
|
|
```
|
|
|
|
### Billing Sync Scheduler
|
|
|
|
The billing integration can be run as a background service to automatically sync usage:
|
|
|
|
```python
|
|
from poolhub.services.billing_integration import BillingIntegration, BillingIntegrationScheduler
|
|
from poolhub.database import get_db
|
|
|
|
# Initialize
|
|
db = next(get_db())
|
|
billing_integration = BillingIntegration(db)
|
|
scheduler = BillingIntegrationScheduler(billing_integration)
|
|
|
|
# Start automated sync (every 1 hour)
|
|
await scheduler.start(sync_interval_hours=1)
|
|
```
|
|
|
|
## Monitoring and Alerting
|
|
|
|
### Prometheus Metrics
|
|
|
|
SLA metrics are exposed to Prometheus with the namespace `poolhub`:
|
|
|
|
- `poolhub_sla_uptime_pct` - Miner uptime percentage
|
|
- `poolhub_sla_response_time_ms` - Response time in milliseconds
|
|
- `poolhub_sla_completion_rate_pct` - Job completion rate percentage
|
|
- `poolhub_sla_capacity_availability_pct` - Capacity availability percentage
|
|
- `poolhub_sla_violations_total` - Total SLA violations
|
|
- `poolhub_billing_sync_errors_total` - Billing sync errors
|
|
|
|
### Alert Rules
|
|
|
|
Example Prometheus alert rules:
|
|
|
|
```yaml
|
|
groups:
|
|
- name: poolhub_sla
|
|
rules:
|
|
- alert: HighSLAViolationRate
|
|
expr: rate(poolhub_sla_violations_total[5m]) > 0.1
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: High SLA violation rate
|
|
|
|
- alert: LowMinerUptime
|
|
expr: poolhub_sla_uptime_pct < 95
|
|
for: 5m
|
|
labels:
|
|
severity: high
|
|
annotations:
|
|
summary: Miner uptime below threshold
|
|
|
|
- alert: HighResponseTime
|
|
expr: poolhub_sla_response_time_ms > 1000
|
|
for: 5m
|
|
labels:
|
|
severity: high
|
|
annotations:
|
|
summary: Response time above threshold
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### SLA Metrics Not Recording
|
|
|
|
**Symptom**: SLA metrics are not being recorded in the database
|
|
|
|
**Solutions**:
|
|
1. Check SLA collector is running: `ps aux | grep sla_collector`
|
|
2. Verify database connection: Check pool-hub database logs
|
|
3. Check SLA collection interval: Ensure `sla_collection_interval_seconds` is configured
|
|
4. Verify miner heartbeats: Check `miner_status.last_heartbeat_at` is being updated
|
|
|
|
### Billing Sync Failing
|
|
|
|
**Symptom**: Billing sync to coordinator-api is failing
|
|
|
|
**Solutions**:
|
|
1. Verify coordinator-api is accessible: `curl http://localhost:8011/health`
|
|
2. Check API key: Ensure `COORDINATOR_API_KEY` is set correctly
|
|
3. Check network connectivity: Ensure pool-hub can reach coordinator-api
|
|
4. Review billing integration logs: Check for HTTP errors or timeouts
|
|
|
|
### Capacity Alerts Not Triggering
|
|
|
|
**Symptom**: Capacity alerts are not being generated
|
|
|
|
**Solutions**:
|
|
1. Verify capacity snapshots are being created: Check `capacity_snapshots` table
|
|
2. Check alert thresholds: Ensure `capacity_alert_threshold_pct` is configured
|
|
3. Verify alert configuration: Check alert configuration endpoint
|
|
4. Review coordinator-api capacity planning: Ensure it's receiving pool-hub data
|
|
|
|
## Testing
|
|
|
|
Run the SLA and billing integration tests:
|
|
|
|
```bash
|
|
cd apps/pool-hub
|
|
|
|
# Run all SLA and billing tests
|
|
pytest tests/test_sla_collector.py
|
|
pytest tests/test_billing_integration.py
|
|
pytest tests/test_sla_endpoints.py
|
|
pytest tests/test_integration_coordinator.py
|
|
|
|
# Run with coverage
|
|
pytest --cov=poolhub.services.sla_collector tests/test_sla_collector.py
|
|
pytest --cov=poolhub.services.billing_integration tests/test_billing_integration.py
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
1. **Monitor SLA Metrics Regularly**: Set up automated monitoring dashboards to track SLA metrics in real-time
|
|
2. **Configure Appropriate Thresholds**: Adjust SLA thresholds based on your service requirements
|
|
3. **Review Violations Promptly**: Investigate and resolve SLA violations quickly to maintain service quality
|
|
4. **Plan Capacity Proactively**: Use capacity forecasting to anticipate scaling needs
|
|
5. **Test Billing Integration**: Regularly test billing sync to ensure accurate usage tracking
|
|
6. **Keep Documentation Updated**: Maintain up-to-date documentation for SLA configurations and procedures
|
|
|
|
## Integration with Existing Systems
|
|
|
|
### Coordinator-API Integration
|
|
|
|
The pool-hub integrates with coordinator-api's billing system via HTTP API:
|
|
|
|
1. **Usage Recording**: Pool-hub sends usage events to coordinator-api's `/api/billing/usage` endpoint
|
|
2. **Billing Metrics**: Pool-hub can query billing metrics from coordinator-api
|
|
3. **Invoice Generation**: Pool-hub can trigger invoice generation in coordinator-api
|
|
4. **Capacity Planning**: Pool-hub provides capacity data to coordinator-api's capacity planning system
|
|
|
|
### Prometheus Integration
|
|
|
|
SLA metrics are automatically exposed to Prometheus:
|
|
- Metrics are labeled by miner_id, metric_type, and other dimensions
|
|
- Use Prometheus query language to create custom dashboards
|
|
- Set up alert rules based on SLA thresholds
|
|
|
|
### Alerting Integration
|
|
|
|
SLA violations can trigger alerts through:
|
|
- Prometheus Alertmanager
|
|
- Custom webhook integrations
|
|
- Email notifications (via coordinator-api)
|
|
- Slack/Discord integrations (via coordinator-api)
|
|
|
|
## Security Considerations
|
|
|
|
1. **API Key Security**: Store coordinator-api API keys securely (use environment variables or secret management)
|
|
2. **Database Access**: Ensure database connections use SSL/TLS in production
|
|
3. **Rate Limiting**: Implement rate limiting on billing sync endpoints to prevent abuse
|
|
4. **Audit Logging**: Enable audit logging for SLA and billing operations
|
|
5. **Access Control**: Restrict access to SLA and billing endpoints to authorized users
|
|
|
|
## Performance Considerations
|
|
|
|
1. **Batch Operations**: Use batch operations for billing sync to reduce HTTP overhead
|
|
2. **Index Optimization**: Ensure database indexes are properly configured for SLA queries
|
|
3. **Caching**: Use Redis caching for frequently accessed SLA metrics
|
|
4. **Async Processing**: Use async operations for SLA collection and billing sync
|
|
5. **Data Retention**: Implement data retention policies for SLA metrics and capacity snapshots
|
|
|
|
## Maintenance
|
|
|
|
### Regular Tasks
|
|
|
|
1. **Review SLA Thresholds**: Quarterly review and adjust SLA thresholds based on service performance
|
|
2. **Clean Up Old Data**: Regularly clean up old SLA metrics and capacity snapshots (e.g., keep 90 days)
|
|
3. **Review Capacity Forecasts**: Monthly review of capacity forecasts and scaling recommendations
|
|
4. **Audit Billing Records**: Monthly audit of billing records for accuracy
|
|
5. **Update Documentation**: Keep documentation updated with any configuration changes
|
|
|
|
### Backup and Recovery
|
|
|
|
1. **Database Backups**: Ensure regular backups of SLA and billing tables
|
|
2. **Configuration Backups**: Backup configuration files and environment variables
|
|
3. **Recovery Procedures**: Document recovery procedures for SLA and billing systems
|
|
4. **Testing Backups**: Regularly test backup and recovery procedures
|
|
|
|
## References
|
|
|
|
- [Pool-Hub README](/opt/aitbc/apps/pool-hub/README.md)
|
|
- [Coordinator-API Billing Documentation](/opt/aitbc/apps/coordinator-api/README.md)
|
|
- [Roadmap](/opt/aitbc/docs/beginner/02_project/2_roadmap.md)
|
|
- [Deployment Guide](/opt/aitbc/docs/advanced/04_deployment/0_index.md)
|