# AITBC Monitoring Playbook & On-Call Guide

## Overview

This document describes the monitoring procedures, on-call rotation, and incident response playbooks for the AITBC platform. The goal is reliable operation of all services and fast resolution of incidents.

## Service Overview

### Core Services
- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster

### Critical Metrics
- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile)
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 hours for critical incidents
## On-Call Rotation

### Rotation Schedule
- **Primary On-Call**: 1-week rotation, Monday 08:00 UTC to the following Monday 08:00 UTC (aligned with the handoff)
- **Secondary On-Call**: Shadows the primary, handles escalations
- **Tertiary**: Backup for both primary and secondary
- **Rotation Handoff**: Every Monday at 08:00 UTC

### Team Structure
```
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
```

### Handoff Procedures
1. **Pre-handoff Check** (Sunday 22:00 UTC):
   - Review active incidents
   - Check scheduled maintenance
   - Verify monitoring systems health

2. **Handoff Meeting** (Monday 08:00 UTC):
   - 15-minute video call
   - Discuss current issues
   - Transfer knowledge
   - Confirm contact information

3. **Post-handoff** (Monday 09:00 UTC):
   - Primary acknowledges receipt
   - Update on-call calendar
   - Test alerting systems

### Contact Information
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
- **Escalation Manager**: +1-555-ESCALATE
- **Emergency**: +1-555-EMERGENCY (Critical infrastructure only)
## Alerting & Escalation

### Alert Severity Levels

#### Critical (P0)
- Service completely down
- Data loss or corruption
- Security breach
- SLA violation in progress
- **Response Time**: 5 minutes
- **Escalation**: 15 minutes if no response

#### High (P1)
- Significant degradation
- Partial service outage
- High error rates (>10%)
- **Response Time**: 15 minutes
- **Escalation**: 1 hour if no response

#### Medium (P2)
- Minor degradation
- Elevated error rates (5-10%)
- Performance issues
- **Response Time**: 1 hour
- **Escalation**: 4 hours if no response

#### Low (P3)
- Informational alerts
- Non-critical issues
- **Response Time**: 4 hours
- **Escalation**: 24 hours if no response

### Escalation Policy
1. **Level 1**: Primary On-Call (5-60 minutes)
2. **Level 2**: Secondary On-Call (15 minutes - 4 hours)
3. **Level 3**: Tertiary On-Call (1 hour - 24 hours)
4. **Level 4**: Engineering Manager (4 hours)
5. **Level 5**: CTO (Critical incidents only)

### Alert Channels
- **PagerDuty**: Primary alerting system
- **Slack**: #on-call-aitbc channel
- **Email**: oncall@aitbc.io
- **SMS**: Critical alerts only
- **Phone**: Critical incidents only
## Incident Response

### Incident Classification

#### SEV-0 (Critical)
- Complete service outage
- Data loss or security breach
- Financial impact >$10,000/hour
- Customer impact >50%

#### SEV-1 (High)
- Significant service degradation
- Feature unavailable
- Financial impact $1,000-$10,000/hour
- Customer impact 10-50%

#### SEV-2 (Medium)
- Minor service degradation
- Performance issues
- Financial impact <$1,000/hour
- Customer impact <10%

#### SEV-3 (Low)
- Informational
- No customer impact

### Incident Response Process

#### 1. Detection & Triage (0-5 minutes)
```bash
# Check alert severity
# Verify impact
# Create incident channel
# Notify stakeholders
```
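
A minimal first-pass triage sketch is shown below. It assumes the core services run in an `aitbc` namespace, carry `app` labels matching their names, and expose an internal `/health` endpoint; all of these are assumptions about this deployment, so adjust names and URLs before relying on it.

```bash
#!/usr/bin/env bash
# Quick first-pass triage. Run from a pod inside the cluster (or adapt the
# URLs to port-forwarded endpoints). Namespace, labels, and health paths
# below are placeholders for this deployment.
set -euo pipefail

NAMESPACE="aitbc"

# Pod-level view of the core services
kubectl get pods -n "$NAMESPACE" -l 'app in (coordinator, explorer, marketplace, wallet-daemon)'

# HTTP health checks against assumed internal /health endpoints
for svc in coordinator explorer marketplace; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://${svc}.${NAMESPACE}.svc.cluster.local/health" || echo "000")
  echo "${svc}: HTTP ${code}"
done
```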

#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level

#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room

#### 4. Resolution (Varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions

#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure

## Service-Specific Runbooks

### Coordinator API

#### High Error Rate
**Symptoms**: 5xx error rate >5%, response time >500ms

**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity (see the connectivity sketch below)
4. Verify Redis connection
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
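
Steps 3-4 can be spot-checked from inside the cluster with throwaway debug pods, as sketched below; the database and Redis hostnames are placeholders and must be replaced with the values from the coordinator's configuration.

```bash
# Test PostgreSQL reachability (hostname and port are placeholders)
kubectl run pg-debug --rm -it --restart=Never --image=postgres:16 -- \
  pg_isready -h postgres.aitbc.svc.cluster.local -p 5432

# Test Redis reachability (hostname and port are placeholders)
kubectl run redis-debug --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h redis.aitbc.svc.cluster.local -p 6379 PING
```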

#### Service Unavailable
**Symptoms**: 503 errors, health check failures

**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Roll back if necessary (see the rollback sketch below)
4. Check resource limits
5. Verify ingress configuration
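
For step 3, a standard Kubernetes rollback looks roughly like the following; the deployment name matches the commands above, but verify it against the actual manifests first.

```bash
# Inspect recent rollouts, roll back to the previous revision, then confirm.
kubectl rollout history deployment/coordinator
kubectl rollout undo deployment/coordinator      # or: --to-revision=<n>
kubectl rollout status deployment/coordinator --timeout=120s
```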

### Blockchain Nodes

#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency

**Runbook**:
1. Check node sync status (see the sketch below)
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
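
A generic starting point for step 1 is sketched below; it assumes the nodes are labeled `app=blockchain-node` and log each produced block, both of which are assumptions about this deployment rather than confirmed behavior.

```bash
# Compare pod status across nodes
kubectl get pods -l app=blockchain-node -o wide

# Show the most recent block-related log lines per node
# (the "block" log pattern is an assumption about the node's log format)
for pod in $(kubectl get pods -l app=blockchain-node -o name); do
  echo "== ${pod}"
  kubectl logs "${pod}" --tail=500 | grep -i "block" | tail -n 3
done
```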

#### High Peer Drop Rate
**Symptoms**: Connected peer count below 50% of normal, possible network partition

**Runbook**:
1. Check network policies
2. Verify DNS resolution (see the sketch below)
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
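
For steps 1-2, the following checks assume cluster DNS is in use and that peers discover each other through a `blockchain-node` service; the service name is a placeholder.

```bash
# NetworkPolicies that could be blocking peer traffic
kubectl get networkpolicy -A

# Resolve the peer service from inside the cluster (service name is a placeholder)
kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup blockchain-node.aitbc.svc.cluster.local
```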

### Database (PostgreSQL)

#### Connection Exhaustion
**Symptoms**: "too many connections" errors

**Runbook**:
1. Check active connections (see the queries below)
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
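
Steps 1-3 correspond to standard `pg_stat_activity` queries, sketched below; the connection string is a placeholder, and the 10-minute idle cutoff is an example threshold, not policy.

```bash
# Placeholder connection string -- use the real host and credentials/secret.
PSQL="psql postgresql://postgres@postgres.aitbc.svc.cluster.local:5432/aitbc"

# 1. Connections grouped by state
$PSQL -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# 2. Longest-running active queries
$PSQL -c "SELECT pid, now() - query_start AS runtime, left(query, 80)
          FROM pg_stat_activity WHERE state = 'active'
          ORDER BY runtime DESC LIMIT 10;"

# 3. Terminate connections idle for more than 10 minutes (example cutoff)
$PSQL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
          WHERE state = 'idle'
            AND state_change < now() - interval '10 minutes'
            AND pid <> pg_backend_pid();"
```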

#### Replica Lag
**Symptoms**: Read replica lag >10 seconds

**Runbook**:
1. Check replica status (see the queries below)
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Fail over if necessary
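
Replication lag can be measured from either side with the standard PostgreSQL replication views; `$PRIMARY_DSN` and `$REPLICA_DSN` below are placeholder connection strings.

```bash
# On the primary: per-replica replay lag (PostgreSQL 10+)
psql "$PRIMARY_DSN" -c "SELECT application_name, replay_lag FROM pg_stat_replication;"

# On the replica: time since the last replayed transaction
psql "$REPLICA_DSN" -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"
```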

### Redis

#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate

**Runbook**:
1. Check memory usage (see the commands below)
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
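
Steps 1-3 can be checked with stock `redis-cli` commands, sketched below; the hostname is a placeholder.

```bash
REDIS="redis-cli -h redis.aitbc.svc.cluster.local"   # placeholder host

# Memory usage, fragmentation, and eviction counters
$REDIS INFO memory | grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
$REDIS INFO stats  | grep evicted_keys

# Eviction policy currently in effect
$REDIS CONFIG GET maxmemory-policy

# Sampling-based scan for unusually large keys
$REDIS --bigkeys
```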

#### Connection Issues
**Symptoms**: Connection timeouts and errors

**Runbook**:
1. Check max connections (see the commands below)
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally
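
For step 1, the configured client limit and current usage can be read directly from Redis; the hostname is again a placeholder.

```bash
REDIS="redis-cli -h redis.aitbc.svc.cluster.local"   # placeholder host

$REDIS CONFIG GET maxclients      # configured client limit
$REDIS INFO clients               # connected_clients, blocked_clients
$REDIS CLIENT LIST | wc -l        # one line per current connection
```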

## Monitoring Dashboards

### Primary Dashboards

#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization

#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity

#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics

#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance

### Alert Rules

#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk usage >90%
- Memory usage >95%

#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s
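
These thresholds are normally evaluated as Prometheus alert rules. The sketch below shows an ad-hoc check of the error-rate threshold against the Prometheus HTTP API; the metric name `http_requests_total`, the `job="coordinator"` label, and the Prometheus URL are assumptions about this stack, not confirmed names.

```bash
# 5xx error ratio over the last 5 minutes, queried via the Prometheus API.
PROM="http://prometheus.monitoring.svc.cluster.local:9090"   # placeholder URL
QUERY='sum(rate(http_requests_total{job="coordinator",status=~"5.."}[5m]))
       / sum(rate(http_requests_total{job="coordinator"}[5m]))'

curl -sG "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" | jq '.data.result'
```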

## SLOs & SLIs

### Service Level Objectives

| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
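
For reference, a 99.9% availability target over a 30-day rolling window leaves an error budget of about 0.1% × 30 × 24 × 60 ≈ 43 minutes of downtime per window, and a 99.5% daily success-rate target allows roughly 1 failed request in 200.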

### Service Level Indicators

#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status

#### Latency
- Request duration histogram
- Database query times
- External API calls

#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour

#### Quality
- Error rates
- Success rates
- Customer satisfaction

## Post-Incident Process

### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders

### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned

### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring

### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes

## Maintenance Windows

### Scheduled Maintenance
- **Frequency**: Weekly maintenance window
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours in advance

### Emergency Maintenance
- **Approval**: Engineering Manager approval required
- **Notification**: 4 hours in advance (if possible)
- **Duration**: As needed
- **Rollback plan**: Always required

## Tools & Systems

### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation

### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **kube-state-metrics**: Kubernetes object-state metrics

### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications

## Training & Onboarding

### New On-Call Engineer
1. Shadow the primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. First solo on-call week with a mentor on standby

### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing

## Emergency Procedures

### Major Outage
1. Declare an incident (SEV-0)
2. Activate the war room
3. Communicate with customers
4. Provide executive updates
5. Coordinate recovery

### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify the security team
4. Notify customers
5. Meet regulatory compliance obligations

### Data Loss
1. Stop affected services
2. Assess the impact
3. Initiate recovery
4. Communicate with customers
5. Prevent recurrence

## Appendix

### A. Contact List
[Detailed contact information]

### B. Runbook Checklist
[Quick reference checklists]

### C. Alert Configuration
[Prometheus rules and thresholds]

### D. Dashboard Links
[Grafana dashboard URLs]

---

*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*