# AITBC Monitoring Playbook & On-Call Guide

## Overview

This document provides comprehensive monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. It ensures reliable operation of all services and quick resolution of issues.

## Service Overview

### Core Services
- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster

### Critical Metrics
- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile)
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 hours for critical incidents

## On-Call Rotation

### Rotation Schedule
- **Primary On-Call**: One-week rotation, Monday 00:00 UTC to Monday 00:00 UTC
- **Secondary On-Call**: Shadows the primary and handles escalations
- **Tertiary**: Backup for both primary and secondary
- **Rotation Handoff**: Every Monday at 08:00 UTC

### Team Structure
```
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
```

### Handoff Procedures
1. **Pre-handoff Check** (Sunday 22:00 UTC):
   - Review active incidents
   - Check scheduled maintenance
   - Verify monitoring systems health

2. **Handoff Meeting** (Monday 08:00 UTC):
   - 15-minute video call
   - Discuss current issues
   - Transfer knowledge
   - Confirm contact information

3. **Post-handoff** (Monday 09:00 UTC):
   - Primary acknowledges receipt
   - Update on-call calendar
   - Test alerting systems (see the sketch below)
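
A minimal sketch for the alert test, assuming Alertmanager is reachable in-cluster at `alertmanager:9093` (the hostname and labels are placeholders); it fires a synthetic low-severity alert through the v2 API so the route to Slack and PagerDuty can be confirmed end to end:

```bash
# Fire a synthetic alert; it should appear in #on-call-aitbc and PagerDuty
curl -s -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "OnCallHandoffTest", "severity": "low"},
        "annotations": {"summary": "Synthetic alert: on-call handoff verification"}}]'
```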

### Contact Information
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
- **Escalation Manager**: +1-555-ESCALATE
- **Emergency**: +1-555-EMERGENCY (critical infrastructure only)

## Alerting & Escalation

### Alert Severity Levels

#### Critical (P0)
- Service completely down
- Data loss or corruption
- Security breach
- SLA violation in progress
- **Response Time**: 5 minutes
- **Escalation**: 15 minutes if no response

#### High (P1)
- Significant degradation
- Partial service outage
- High error rates (>10%)
- **Response Time**: 15 minutes
- **Escalation**: 1 hour if no response

#### Medium (P2)
- Minor degradation
- Elevated error rates (5-10%)
- Performance issues
- **Response Time**: 1 hour
- **Escalation**: 4 hours if no response

#### Low (P3)
- Informational alerts
- Non-critical issues
- **Response Time**: 4 hours
- **Escalation**: 24 hours if no response

### Escalation Policy
1. **Level 1**: Primary On-Call (5-60 minutes)
2. **Level 2**: Secondary On-Call (15 minutes to 4 hours)
3. **Level 3**: Tertiary On-Call (1 to 24 hours)
4. **Level 4**: Engineering Manager (4 hours)
5. **Level 5**: CTO (critical incidents only)

### Alert Channels
- **PagerDuty**: Primary alerting system
- **Slack**: #on-call-aitbc channel
- **Email**: oncall@aitbc.io
- **SMS**: Critical alerts only
- **Phone**: Critical incidents only

## Incident Response

### Incident Classification

#### SEV-0 (Critical)
- Complete service outage
- Data loss or security breach
- Financial impact >$10,000/hour
- Customer impact >50%

#### SEV-1 (High)
- Significant service degradation
- Feature unavailable
- Financial impact $1,000-$10,000/hour
- Customer impact 10-50%

#### SEV-2 (Medium)
- Minor service degradation
- Performance issues
- Financial impact <$1,000/hour
- Customer impact <10%

#### SEV-3 (Low)
- Informational
- No customer impact

### Incident Response Process

#### 1. Detection & Triage (0-5 minutes)
```bash
# Check alert severity in PagerDuty and verify impact on the System Overview dashboard
# Create a dedicated incident channel (Slack API shown; token and naming are placeholders)
curl -s -X POST -H "Authorization: Bearer $SLACK_TOKEN" \
  -d "name=inc-$(date +%Y%m%d-%H%M)" \
  https://slack.com/api/conversations.create
# Notify stakeholders in #on-call-aitbc and page the secondary if needed
```

#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level

#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room

#### 4. Resolution (varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions

#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure

## Service-Specific Runbooks

### Coordinator API

#### High Error Rate
**Symptoms**: 5xx errors >5%, response time >500ms
**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity (see the sketch below)
4. Verify Redis connection (see the sketch below)
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
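
A hedged sketch for steps 3-4, assuming the coordinator pods carry the `app=coordinator` label used above and that the image ships `pg_isready` and `redis-cli` (the `postgres` and `redis` service names are placeholders):

```bash
# Grab one coordinator pod and probe its backing stores from inside the cluster
POD=$(kubectl get pods -l app=coordinator -o jsonpath='{.items[0].metadata.name}')

# PostgreSQL: pg_isready exits non-zero if the server is unreachable
kubectl exec "$POD" -- pg_isready -h postgres -p 5432

# Redis: a healthy server answers PING with PONG
kubectl exec "$POD" -- redis-cli -h redis ping
```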

#### Service Unavailable
**Symptoms**: 503 errors, health check failures
**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Rollback if necessary (see the sketch below)
4. Check resource limits
5. Verify ingress configuration
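
If a recent deploy is the likely cause, a rollback sketch (assuming the Deployment is named `coordinator` as in the commands above):

```bash
kubectl rollout history deployment/coordinator                 # inspect recent revisions
kubectl rollout undo deployment/coordinator                    # revert to the previous revision
kubectl rollout status deployment/coordinator --timeout=120s   # wait for pods to settle
```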

### Blockchain Nodes

#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency
**Runbook**:
1. Check node sync status (see the sketch below)
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
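
A generic sketch for steps 1 and 5; the `app=blockchain-node` label, StatefulSet pod names, and log strings are assumptions to adapt to the actual node deployment:

```bash
kubectl get pods -l app=blockchain-node -o wide     # are all nodes Running and Ready?
kubectl logs --tail=200 -l app=blockchain-node | grep -iE 'consensus|finali|stall'

# If a restart is unavoidable, recycle ONE node at a time, never a quorum at once
kubectl delete pod blockchain-node-0                # the StatefulSet recreates it
```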

#### High Peer Drop Rate
**Symptoms**: Connected peers <50% of expected, signs of network partition
**Runbook**:
1. Check network policies (see the sketch below)
2. Verify DNS resolution (see the sketch below)
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
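
For steps 1-2, a sketch of the first checks; the pod name and headless-service DNS name are illustrative, and `nslookup` must exist in the node image:

```bash
kubectl get networkpolicy -A          # is any policy isolating the node pods?
kubectl exec blockchain-node-0 -- nslookup blockchain-node-1.blockchain-node \
  || echo "peer DNS resolution failed"
```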

### Database (PostgreSQL)

#### Connection Exhaustion
**Symptoms**: "Too many connections" errors
**Runbook** (queries sketched below):
1. Check active connections
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
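
A sketch of the triage queries for steps 1-3, assuming superuser `psql` access to a `postgres` host (adjust host and credentials); coordinate with service owners before terminating pooled connections:

```bash
# Current connection count versus max_connections
psql -h postgres -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Longest-running active queries first
psql -h postgres -U postgres -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
  FROM pg_stat_activity WHERE state <> 'idle'
  ORDER BY runtime DESC NULLS LAST LIMIT 10;"

# Terminate connections that have been idle for more than 10 minutes
psql -h postgres -U postgres -c "
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```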

#### Replica Lag
**Symptoms**: Read replica lag >10 seconds
**Runbook** (queries sketched below):
1. Check replica status
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Failover if necessary
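
A sketch for step 1, using PostgreSQL's built-in replication views (hostnames are placeholders):

```bash
# On the replica: how far behind is WAL replay?
psql -h postgres-replica -U postgres -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

# On the primary: the state of each attached replica
psql -h postgres -U postgres -c \
  "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
```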

### Redis

#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate
**Runbook** (commands sketched below):
1. Check memory usage
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
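
A sketch for steps 1-3 using stock `redis-cli` commands (the `redis` hostname is a placeholder):

```bash
redis-cli -h redis INFO memory | grep -E 'used_memory_human|maxmemory_human'
redis-cli -h redis INFO stats | grep evicted_keys   # eviction pressure over time
redis-cli -h redis CONFIG GET maxmemory-policy      # noeviction turns pressure into OOM errors
redis-cli -h redis --bigkeys                        # sampling scan for oversized keys
```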

#### Connection Issues
**Symptoms**: Connection timeouts and errors
**Runbook** (commands sketched below):
1. Check max connections
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally
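
For steps 1-2, a quick look at connection counts and limits (hostname again a placeholder):

```bash
redis-cli -h redis INFO clients | grep connected_clients
redis-cli -h redis CONFIG GET maxclients       # compare against connected_clients
redis-cli -h redis CLIENT LIST | head -n 20    # which clients hold the connections?
```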

## Monitoring Dashboards

### Primary Dashboards

#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization

#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity

#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics

#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance

### Alert Rules

#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk usage >90%
- Memory usage >95%

#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s
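
When tuning these thresholds, the alert condition can be spot-checked against Prometheus directly. A hedged sketch, assuming Prometheus at `prometheus:9090` and a conventional `http_requests_total` counter (substitute whatever the coordinator actually exports):

```bash
# Ratio of 5xx responses over the last 5 minutes; compare with the 5%/10% thresholds
curl -s 'http://prometheus:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
```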

## SLOs & SLIs

### Service Level Objectives

| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
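
As a worked example, the 99.9% availability target over a 30-day rolling window leaves an error budget of 0.1%, i.e. roughly 43 minutes of allowable downtime per window:

```bash
# Error budget in minutes for the 99.9% / 30-day SLO (prints ~43.2)
echo "30 * 24 * 60 * 0.001" | bc -l
```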

### Service Level Indicators

#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status

#### Latency
- Request duration histogram
- Database query times
- External API calls

#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour

#### Quality
- Error rates
- Success rates
- Customer satisfaction

## Post-Incident Process

### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders

### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned

### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring

### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes

## Maintenance Windows

### Scheduled Maintenance
- **Frequency**: Weekly maintenance window
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours in advance

### Emergency Maintenance
- **Approval**: Engineering Manager required
- **Notification**: 4 hours in advance (if possible)
- **Duration**: As needed
- **Rollback plan**: Always required

## Tools & Systems

### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation

### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **kube-state-metrics**: Kubernetes metrics

### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications

## Training & Onboarding

### New On-Call Engineer
1. Shadow primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. Solo on-call with a mentor on standby

### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing

## Emergency Procedures

### Major Outage
1. Declare incident (SEV-0)
2. Activate war room
3. Customer communication
4. Executive updates
5. Recovery coordination

### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify security team
4. Customer notification
5. Regulatory compliance

### Data Loss
1. Stop affected services
2. Assess impact
3. Initiate recovery
4. Customer communication
5. Prevent recurrence

## Appendix

### A. Contact List
[Detailed contact information]

### B. Runbook Checklist
[Quick reference checklists]

### C. Alert Configuration
[Prometheus rules and thresholds]

### D. Dashboard Links
[Grafana dashboard URLs]

---

*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*