# AITBC Monitoring Playbook & On-Call Guide

## Overview

This document provides comprehensive monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. It ensures reliable operation of all services and quick resolution of issues.

## Service Overview

### Core Services
- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster

### Critical Metrics
- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile)
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 hours for critical incidents

## On-Call Rotation

### Rotation Schedule
- **Primary On-Call**: One-week rotation, Monday 00:00 UTC to Monday 00:00 UTC
- **Secondary On-Call**: Shadows the primary and handles escalations
- **Tertiary**: Backup for both primary and secondary
- **Rotation Handoff**: Every Monday at 08:00 UTC

### Team Structure
```
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
```

### Handoff Procedures
1. **Pre-handoff Check** (Sunday 22:00 UTC):
   - Review active incidents
   - Check scheduled maintenance
   - Verify monitoring systems health

2. **Handoff Meeting** (Monday 08:00 UTC):
   - 15-minute video call
   - Discuss current issues
   - Transfer knowledge
   - Confirm contact information

3. **Post-handoff** (Monday 09:00 UTC):
   - Primary acknowledges receipt
   - Update on-call calendar
   - Test alerting systems (see the sketch below)
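
A minimal sketch for the alert test, assuming Alertmanager is reachable in-cluster at `alertmanager:9093` (the hostname and labels are placeholders); it fires a synthetic low-severity alert through the v2 API so the route to Slack and PagerDuty can be confirmed end to end:

```bash
# Fire a synthetic alert; it should appear in #on-call-aitbc and PagerDuty
curl -s -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "OnCallHandoffTest", "severity": "low"},
        "annotations": {"summary": "Synthetic alert: on-call handoff verification"}}]'
```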

### Contact Information
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
- **Escalation Manager**: +1-555-ESCALATE
- **Emergency**: +1-555-EMERGENCY (critical infrastructure only)

## Alerting & Escalation

### Alert Severity Levels

#### Critical (P0)
- Service completely down
- Data loss or corruption
- Security breach
- SLA violation in progress
- **Response Time**: 5 minutes
- **Escalation**: 15 minutes if no response

#### High (P1)
- Significant degradation
- Partial service outage
- High error rates (>10%)
- **Response Time**: 15 minutes
- **Escalation**: 1 hour if no response

#### Medium (P2)
- Minor degradation
- Elevated error rates (5-10%)
- Performance issues
- **Response Time**: 1 hour
- **Escalation**: 4 hours if no response

#### Low (P3)
- Informational alerts
- Non-critical issues
- **Response Time**: 4 hours
- **Escalation**: 24 hours if no response

### Escalation Policy
1. **Level 1**: Primary On-Call (5-60 minutes)
2. **Level 2**: Secondary On-Call (15 minutes to 4 hours)
3. **Level 3**: Tertiary On-Call (1 to 24 hours)
4. **Level 4**: Engineering Manager (4 hours)
5. **Level 5**: CTO (critical incidents only)

### Alert Channels
- **PagerDuty**: Primary alerting system
- **Slack**: #on-call-aitbc channel
- **Email**: oncall@aitbc.io
- **SMS**: Critical alerts only
- **Phone**: Critical incidents only

## Incident Response

### Incident Classification

#### SEV-0 (Critical)
- Complete service outage
- Data loss or security breach
- Financial impact >$10,000/hour
- Customer impact >50%

#### SEV-1 (High)
- Significant service degradation
- Feature unavailable
- Financial impact $1,000-$10,000/hour
- Customer impact 10-50%

#### SEV-2 (Medium)
- Minor service degradation
- Performance issues
- Financial impact <$1,000/hour
- Customer impact <10%

#### SEV-3 (Low)
- Informational
- No customer impact

### Incident Response Process

#### 1. Detection & Triage (0-5 minutes)
```bash
# Check alert severity in PagerDuty and verify impact on the System Overview dashboard
# Create a dedicated incident channel (Slack API shown; token and naming are placeholders)
curl -s -X POST -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "name=inc-$(date +%Y%m%d-%H%M)" \
  https://slack.com/api/conversations.create
# Notify stakeholders in #on-call-aitbc and page the secondary if needed
```

#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level

#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room

#### 4. Resolution (varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions

#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure

## Service-Specific Runbooks

### Coordinator API

#### High Error Rate
**Symptoms**: 5xx errors >5%, response time >500ms
**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity (see the sketch below)
4. Verify Redis connection (see the sketch below)
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
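
A hedged sketch for steps 3-4, assuming the coordinator pods carry the `app=coordinator` label used above and that the image ships `pg_isready` and `redis-cli` (the `postgres` and `redis` service names are placeholders):

```bash
# Grab one coordinator pod and probe its backing stores from inside the cluster
POD=$(kubectl get pods -l app=coordinator -o jsonpath='{.items[0].metadata.name}')

# PostgreSQL: pg_isready exits non-zero if the server is unreachable
kubectl exec "$POD" -- pg_isready -h postgres -p 5432

# Redis: a healthy server answers PING with PONG
kubectl exec "$POD" -- redis-cli -h redis ping
```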

#### Service Unavailable
**Symptoms**: 503 errors, health check failures
**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Rollback if necessary (see the sketch below)
4. Check resource limits
5. Verify ingress configuration
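
If a recent deploy is the likely cause, a rollback sketch (assuming the Deployment is named `coordinator` as in the commands above):

```bash
kubectl rollout history deployment/coordinator                 # inspect recent revisions
kubectl rollout undo deployment/coordinator                    # revert to the previous revision
kubectl rollout status deployment/coordinator --timeout=120s   # wait for pods to settle
```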

### Blockchain Nodes

#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency
**Runbook**:
1. Check node sync status (see the sketch below)
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
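
A generic sketch for steps 1 and 5; the `app=blockchain-node` label, StatefulSet pod names, and log strings are assumptions to adapt to the actual node deployment:

```bash
kubectl get pods -l app=blockchain-node -o wide     # are all nodes Running and Ready?
kubectl logs --tail=200 -l app=blockchain-node | grep -iE 'consensus|finali|stall'

# If a restart is unavoidable, recycle ONE node at a time, never a quorum at once
kubectl delete pod blockchain-node-0                # the StatefulSet recreates it
```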

#### High Peer Drop Rate
**Symptoms**: Connected peers <50% of expected, signs of network partition
**Runbook**:
1. Check network policies (see the sketch below)
2. Verify DNS resolution (see the sketch below)
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
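
For steps 1-2, a sketch of the first checks; the pod name and headless-service DNS name are illustrative, and `nslookup` must exist in the node image:

```bash
kubectl get networkpolicy -A          # is any policy isolating the node pods?
kubectl exec blockchain-node-0 -- nslookup blockchain-node-1.blockchain-node \
  || echo "peer DNS resolution failed"
```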

### Database (PostgreSQL)

#### Connection Exhaustion
**Symptoms**: "Too many connections" errors
**Runbook** (queries sketched below):
1. Check active connections
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
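
A sketch of the triage queries for steps 1-3, assuming superuser `psql` access to a `postgres` host (adjust host and credentials); coordinate with service owners before terminating pooled connections:

```bash
# Current connection count versus max_connections
psql -h postgres -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Longest-running active queries first
psql -h postgres -U postgres -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
  FROM pg_stat_activity WHERE state <> 'idle'
  ORDER BY runtime DESC NULLS LAST LIMIT 10;"

# Terminate connections that have been idle for more than 10 minutes
psql -h postgres -U postgres -c "
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```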

#### Replica Lag
**Symptoms**: Read replica lag >10 seconds
**Runbook** (queries sketched below):
1. Check replica status
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Failover if necessary
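
A sketch for step 1, using PostgreSQL's built-in replication views (hostnames are placeholders):

```bash
# On the replica: how far behind is WAL replay?
psql -h postgres-replica -U postgres -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

# On the primary: the state of each attached replica
psql -h postgres -U postgres -c \
  "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
```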

### Redis

#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate
**Runbook** (commands sketched below):
1. Check memory usage
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
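
A sketch for steps 1-3 using stock `redis-cli` commands (the `redis` hostname is a placeholder):

```bash
redis-cli -h redis INFO memory | grep -E 'used_memory_human|maxmemory_human'
redis-cli -h redis INFO stats | grep evicted_keys   # eviction pressure over time
redis-cli -h redis CONFIG GET maxmemory-policy      # noeviction turns pressure into OOM errors
redis-cli -h redis --bigkeys                        # sampling scan for oversized keys
```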

#### Connection Issues
**Symptoms**: Connection timeouts and errors
**Runbook** (commands sketched below):
1. Check max connections
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally
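
For steps 1-2, a quick look at connection counts and limits (hostname again a placeholder):

```bash
redis-cli -h redis INFO clients | grep connected_clients
redis-cli -h redis CONFIG GET maxclients       # compare against connected_clients
redis-cli -h redis CLIENT LIST | head -n 20    # which clients hold the connections?
```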

## Monitoring Dashboards

### Primary Dashboards

#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization

#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity

#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics

#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance

### Alert Rules

#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk usage >90%
- Memory usage >95%

#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s
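
When tuning these thresholds, the alert condition can be spot-checked against Prometheus directly. A hedged sketch, assuming Prometheus at `prometheus:9090` and a conventional `http_requests_total` counter (substitute whatever the coordinator actually exports):

```bash
# Ratio of 5xx responses over the last 5 minutes; compare with the 5%/10% thresholds
curl -s 'http://prometheus:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
```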

## SLOs & SLIs

### Service Level Objectives

| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
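
As a worked example, the 99.9% availability target over a 30-day rolling window leaves an error budget of 0.1%, i.e. roughly 43 minutes of allowable downtime per window:

```bash
# Error budget in minutes for the 99.9% / 30-day SLO (prints ~43.2)
echo "30 * 24 * 60 * 0.001" | bc -l
```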

### Service Level Indicators

#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status

#### Latency
- Request duration histogram
- Database query times
- External API calls

#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour

#### Quality
- Error rates
- Success rates
- Customer satisfaction

## Post-Incident Process

### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders

### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned

### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring

### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes

## Maintenance Windows

### Scheduled Maintenance
- **Frequency**: Weekly maintenance window
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours in advance

### Emergency Maintenance
- **Approval**: Engineering Manager required
- **Notification**: 4 hours in advance (if possible)
- **Duration**: As needed
- **Rollback plan**: Always required

## Tools & Systems

### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation

### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **kube-state-metrics**: Kubernetes metrics

### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications

## Training & Onboarding

### New On-Call Engineer
1. Shadow primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. Solo on-call with a mentor on standby

### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing

## Emergency Procedures

### Major Outage
1. Declare incident (SEV-0)
2. Activate war room
3. Customer communication
4. Executive updates
5. Recovery coordination

### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify security team
4. Customer notification
5. Regulatory compliance

### Data Loss
1. Stop affected services
2. Assess impact
3. Initiate recovery
4. Customer communication
5. Prevent recurrence

## Appendix

### A. Contact List
[Detailed contact information]

### B. Runbook Checklist
[Quick reference checklists]

### C. Alert Configuration
[Prometheus rules and thresholds]

### D. Dashboard Links
[Grafana dashboard URLs]

---

*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*