# AITBC Monitoring Playbook & On-Call Guide
## Overview
This document provides monitoring procedures, on-call rotation details, and incident response playbooks for the AITBC platform. Its goal is reliable operation of all services and fast resolution of incidents.
## Service Overview
### Core Services
- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster
### Critical Metrics
- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile)
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 minutes for critical incidents
## On-Call Rotation
### Rotation Schedule
- **Primary On-Call**: 1-week rotation, Monday 08:00 UTC to the following Monday 08:00 UTC (coverage transfers at the handoff)
- **Secondary On-Call**: Shadows the primary and handles escalations
- **Tertiary**: Backup for both primary and secondary
- **Rotation Handoff**: Every Monday at 08:00 UTC
### Team Structure
```
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
```
### Handoff Procedures
1. **Pre-handoff Check** (Sunday 22:00 UTC):
- Review active incidents
- Check scheduled maintenance
- Verify monitoring systems health
2. **Handoff Meeting** (Monday 08:00 UTC):
- 15-minute video call
- Discuss current issues
- Transfer knowledge
- Confirm contact information
3. **Post-handoff** (Monday 09:00 UTC):
- Primary acknowledges receipt
- Update on-call calendar
- Test alerting systems
### Contact Information
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
- **Escalation Manager**: +1-555-ESCALATE
- **Emergency**: +1-555-EMERGENCY (Critical infrastructure only)
## Alerting & Escalation
### Alert Severity Levels
#### Critical (P0)
- Service completely down
- Data loss or corruption
- Security breach
- SLA violation in progress
- **Response Time**: 5 minutes
- **Escalation**: 15 minutes if no response
#### High (P1)
- Significant degradation
- Partial service outage
- High error rates (>10%)
- **Response Time**: 15 minutes
- **Escalation**: 1 hour if no response
#### Medium (P2)
- Minor degradation
- Elevated error rates (5-10%)
- Performance issues
- **Response Time**: 1 hour
- **Escalation**: 4 hours if no response
#### Low (P3)
- Informational alerts
- Non-critical issues
- **Response Time**: 4 hours
- **Escalation**: 24 hours if no response
### Escalation Policy
1. **Level 1**: Primary On-Call (responds within 5-60 minutes, depending on severity)
2. **Level 2**: Secondary On-Call (engaged 15 minutes to 4 hours after the alert, depending on severity)
3. **Level 3**: Tertiary On-Call (engaged 1 to 24 hours after the alert, depending on severity)
4. **Level 4**: Engineering Manager (engaged after 4 hours without resolution)
5. **Level 5**: CTO (critical incidents only)
### Alert Channels
- **PagerDuty**: Primary alerting system
- **Slack**: #on-call-aitbc channel
- **Email**: oncall@aitbc.io
- **SMS**: Critical alerts only
- **Phone**: Critical incidents only
## Incident Response
### Incident Classification
#### SEV-0 (Critical)
- Complete service outage
- Data loss or security breach
- Financial impact >$10,000/hour
- Customer impact >50%
#### SEV-1 (High)
- Significant service degradation
- Feature unavailable
- Financial impact $1,000-$10,000/hour
- Customer impact 10-50%
#### SEV-2 (Medium)
- Minor service degradation
- Performance issues
- Financial impact <$1,000/hour
- Customer impact <10%
#### SEV-3 (Low)
- Informational
- No customer impact
### Incident Response Process
#### 1. Detection & Triage (0-5 minutes)
```bash
# Confirm the alert and its severity (is it firing on real traffic?)
kubectl get pods -A | grep -vE 'Running|Completed'   # any unhealthy workloads?
# Verify customer-facing impact on the System Overview dashboard
# Create the incident channel in Slack (e.g. #inc-<date>-<short-description>)
# Page stakeholders via PagerDuty and post in #on-call-aitbc
```
#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level
#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room
#### 4. Resolution (Varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions
#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure
## Service-Specific Runbooks
### Coordinator API
#### High Error Rate
**Symptoms**: 5xx errors >5%, response time >500ms
**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity
4. Verify Redis connection (see the probe sketch below)
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
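Steps 3-4 can be verified from inside the cluster. A minimal probe sketch, assuming the coordinator pods expose `DATABASE_URL`/`REDIS_URL` environment variables and using stock `postgres`/`redis` images as throwaway probes (variable names and hosts are placeholders):
```bash
# Locate the configured database and Redis endpoints (variable names are assumptions)
kubectl exec deploy/coordinator -- env | grep -E 'DATABASE_URL|REDIS_URL'
# Probe PostgreSQL and Redis from short-lived debug pods
kubectl run db-probe --rm -it --restart=Never --image=postgres:16 -- \
  pg_isready -h <db-host> -p 5432
kubectl run redis-probe --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h <redis-host> ping
```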
#### Service Unavailable
**Symptoms**: 503 errors, health check failures
**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Roll back if necessary (see the rollout sketch below)
4. Check resource limits
5. Verify ingress configuration
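A sketch of steps 1-5 with `kubectl`, assuming the service runs as the `coordinator` Deployment and Ingress in the current namespace (adjust names to your environment):
```bash
# 1-2. Deployment status and recent rollouts
kubectl rollout status deployment/coordinator --timeout=60s
kubectl rollout history deployment/coordinator
# 3. Roll back if the latest rollout is the likely cause
kubectl rollout undo deployment/coordinator
# 4. Resource limits and recent events (OOMKilled, FailedScheduling, ...)
kubectl describe deployment coordinator | grep -i -A4 'limits'
kubectl get events --sort-by=.lastTimestamp | tail -n 20
# 5. Ingress configuration
kubectl describe ingress coordinator
```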
### Blockchain Nodes
#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency
**Runbook**:
1. Check node sync status
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
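The node-specific sync and consensus commands depend on the blockchain client, which this playbook does not pin down; a Kubernetes-level sketch, assuming the nodes run under a hypothetical `app=blockchain-node` label:
```bash
# Pod health and restart counts across the node set
kubectl get pods -l app=blockchain-node -o wide
# Look for stalled block height or consensus errors in recent logs
kubectl logs --tail=200 -l app=blockchain-node | grep -iE 'height|consensus|round|timeout'
# Restart a single node only as a last resort, one at a time (pod name is illustrative)
kubectl delete pod blockchain-node-0
```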
#### High Peer Drop Rate
**Symptoms**: Connected peer count below 50% of expected, suspected network partition
**Runbook**:
1. Check network policies
2. Verify DNS resolution
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
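A sketch of the network checks using standard `kubectl`; the label, hostname, and Service name are placeholders:
```bash
# 1. Network policies that could be blocking peer traffic
kubectl get networkpolicy -A
# 2. DNS resolution from a short-lived debug pod
kubectl run dns-probe --rm -it --restart=Never --image=busybox -- nslookup <peer-hostname>
# 3-4. Load balancer / Service health and endpoints
kubectl get svc,endpoints -l app=blockchain-node
kubectl describe svc <node-p2p-service>
```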
### Database (PostgreSQL)
#### Connection Exhaustion
**Symptoms**: "Too many connections" errors
**Runbook**:
1. Check active connections
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
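Steps 1-3 as `psql` commands against the primary (connection flags omitted; these use only standard PostgreSQL catalog views):
```bash
# 1. Connections by state
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"
# 2. Longest-running active queries
psql -c "SELECT pid, now() - query_start AS runtime, left(query, 60) AS query
         FROM pg_stat_activity WHERE state <> 'idle'
         ORDER BY runtime DESC LIMIT 10;"
# 3. Terminate connections idle for more than 10 minutes (use with care)
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
         WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```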
#### Replica Lag
**Symptoms**: Read replica lag >10 seconds
**Runbook**:
1. Check replica status
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Failover if necessary
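Replication state and the lag itself can be read from standard PostgreSQL views; a sketch (connection flags omitted):
```bash
# On the primary: per-replica replication state, WAL positions, and lag
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn, replay_lag
         FROM pg_stat_replication;"
# On the replica: approximate lag behind the primary
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```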
### Redis
#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate
**Runbook**:
1. Check memory usage
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
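Steps 1-3 with `redis-cli` (host and auth flags omitted):
```bash
# 1. Memory usage, fragmentation, and eviction counters
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
redis-cli INFO stats | grep evicted_keys
# 2. Eviction policy; scan for oversized keys
redis-cli CONFIG GET maxmemory-policy
redis-cli --bigkeys
# 3. Check TTLs on suspect keys before deleting anything
redis-cli TTL <suspect-key>
```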
#### Connection Issues
**Symptoms**: Connection timeouts, errors
**Runbook**:
1. Check max connections
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally
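Connection diagnostics with `redis-cli` (host and auth flags omitted):
```bash
# Configured limit vs. currently connected and blocked clients
redis-cli CONFIG GET maxclients
redis-cli INFO clients | grep -E 'connected_clients|blocked_clients'
redis-cli INFO stats | grep rejected_connections
# Who is connected, and from where
redis-cli CLIENT LIST | head -n 20
```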
## Monitoring Dashboards
### Primary Dashboards
#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization
#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity
#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics
#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance
### Alert Rules
#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk space >90%
- Memory usage >95%
#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s
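The rules themselves live in Prometheus/Alertmanager configuration (see Appendix C). The error-rate thresholds above can also be spot-checked ad hoc against the Prometheus HTTP API; the metric name `http_requests_total` and the Prometheus address are illustrative assumptions:
```bash
# Current 5xx ratio over the last 5 minutes, to compare against the 5% / 10% thresholds
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
```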
## SLOs & SLIs
### Service Level Objectives
| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
### Service Level Indicators
#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status
#### Latency
- Request duration histogram
- Database query times
- External API calls
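For reference, the 95th-percentile latency SLI is typically computed from the request-duration histogram; a sketch assuming a `http_request_duration_seconds` histogram and a `coordinator` job label (both assumptions):
```bash
# p95 Coordinator API latency over the last 5 minutes (target: <200ms)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="coordinator"}[5m])) by (le))'
```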
#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour
#### Quality
- Error rates
- Success rates
- Customer satisfaction
## Post-Incident Process
### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders
### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned
### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring
### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes
## Maintenance Windows
### Scheduled Maintenance
- **Frequency**: Weekly
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours in advance
### Emergency Maintenance
- **Approval**: Engineering Manager required
- **Notification**: 4 hours in advance (if possible)
- **Duration**: As needed
- **Rollback**: Always required
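Alerts expected to fire during either window can be silenced in Alertmanager so the on-call engineer is not paged for planned downtime; a sketch using `amtool` (the Alertmanager URL and label matcher are placeholders):
```bash
# Silence matching alerts for the 2-hour window, with an audit trail
amtool silence add service="coordinator" \
  --duration=2h --author="oncall@aitbc.io" \
  --comment="Weekly maintenance window" \
  --alertmanager.url=http://alertmanager:9093
# Review active silences, and expire them if maintenance finishes early
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
```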
## Tools & Systems
### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation
### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **Kube-state-metrics**: Kubernetes metrics
### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications
## Training & Onboarding
### New On-Call Engineer
1. Shadow primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. Solo on-call with mentor
### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing
## Emergency Procedures
### Major Outage
1. Declare incident (SEV-0)
2. Activate war room
3. Customer communication
4. Executive updates
5. Recovery coordination
### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify security team
4. Customer notification
5. Regulatory compliance
### Data Loss
1. Stop affected services
2. Assess impact
3. Initiate recovery
4. Customer communication
5. Prevent recurrence
## Appendix
### A. Contact List
[Detailed contact information]
### B. Runbook Checklist
[Quick reference checklists]
### C. Alert Configuration
[Prometheus rules and thresholds]
### D. Dashboard Links
[Grafana dashboard URLs]
---
*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*