AITBC Monitoring Playbook & On-Call Guide

Overview

This document provides the monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. Its goal is to keep all services operating reliably and to resolve issues quickly.

Service Overview

Core Services

  • Coordinator API: Job management and marketplace coordination
  • Blockchain Nodes: Consensus and transaction processing
  • Explorer UI: Block explorer and transaction visualization
  • Marketplace UI: User interface for marketplace operations
  • Wallet Daemon: Cryptographic key management
  • Infrastructure: PostgreSQL, Redis, Kubernetes cluster

Critical Metrics

  • Availability: 99.9% uptime SLA
  • Performance: <200ms API response time (95th percentile)
  • Throughput: 100+ TPS sustained
  • MTTR: <2 minutes for critical incidents

On-Call Rotation

Rotation Schedule

  • Primary On-Call: 1 week rotation, Monday 00:00 UTC to Monday 00:00 UTC
  • Secondary On-Call: Shadow primary, handles escalations
  • Tertiary: Backup for both primary and secondary
  • Rotation Handoff: Every Monday at 08:00 UTC

Team Structure

Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)

Handoff Procedures

  1. Pre-handoff Check (Sunday 22:00 UTC):

    • Review active incidents
    • Check scheduled maintenance
    • Verify monitoring systems health
  2. Handoff Meeting (Monday 08:00 UTC):

    • 15-minute video call
    • Discuss current issues
    • Transfer knowledge
    • Confirm contact information
  3. Post-handoff (Monday 09:00 UTC):

    • Primary acknowledges receipt
    • Update on-call calendar
    • Test alerting systems

Contact Information

  • Primary: +1-555-ONCALL-1 (PagerDuty)
  • Secondary: +1-555-ONCALL-2 (PagerDuty)
  • Tertiary: +1-555-ONCALL-3 (PagerDuty)
  • Escalation Manager: +1-555-ESCALATE
  • Emergency: +1-555-EMERGENCY (Critical infrastructure only)

Alerting & Escalation

Alert Severity Levels

Critical (P0)

  • Service completely down
  • Data loss or corruption
  • Security breach
  • SLA violation in progress
  • Response Time: 5 minutes
  • Escalation: 15 minutes if no response

High (P1)

  • Significant degradation
  • Partial service outage
  • High error rates (>10%)
  • Response Time: 15 minutes
  • Escalation: 1 hour if no response

Medium (P2)

  • Minor degradation
  • Elevated error rates (5-10%)
  • Performance issues
  • Response Time: 1 hour
  • Escalation: 4 hours if no response

Low (P3)

  • Informational alerts
  • Non-critical issues
  • Response Time: 4 hours
  • Escalation: 24 hours if no response

Escalation Policy

  1. Level 1: Primary On-Call (5-60 minutes)
  2. Level 2: Secondary On-Call (15 minutes - 4 hours)
  3. Level 3: Tertiary On-Call (1 hour - 24 hours)
  4. Level 4: Engineering Manager (4 hours)
  5. Level 5: CTO (Critical incidents only)

Alert Channels

  • PagerDuty: Primary alerting system
  • Slack: #on-call-aitbc channel
  • Email: oncall@aitbc.io
  • SMS: Critical alerts only
  • Phone: Critical incidents only

Incident Response

Incident Classification

SEV-0 (Critical)

  • Complete service outage
  • Data loss or security breach
  • Financial impact >$10,000/hour
  • Customer impact >50%

SEV-1 (High)

  • Significant service degradation
  • Feature unavailable
  • Financial impact $1,000-$10,000/hour
  • Customer impact 10-50%

SEV-2 (Medium)

  • Minor service degradation
  • Performance issues
  • Financial impact <$1,000/hour
  • Customer impact <10%

SEV-3 (Low)

  • Informational
  • No customer impact

Incident Response Process

1. Detection & Triage (0-5 minutes)

# Check alert severity
# Verify impact
# Create incident channel
# Notify stakeholders
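
A minimal command sketch of these steps, assuming Alertmanager is reachable at the in-cluster address shown and that incident channels live in Slack; the hostnames, token variable, and channel naming pattern are assumptions, not fixed AITBC conventions:

# List currently firing alerts with their severity labels (Alertmanager v2 API)
curl -s http://alertmanager.monitoring:9093/api/v2/alerts | jq '.[] | {alert: .labels.alertname, severity: .labels.severity}'

# Quick impact check on the affected service
kubectl get pods -l app=coordinator
curl -s -o /dev/null -w '%{http_code}\n' https://<coordinator-endpoint>/health

# Create the incident channel (Slack conversations.create; the name pattern is an example)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -d name="inc-$(date +%Y%m%d)-coordinator"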

2. Assessment (5-15 minutes)

  • Determine scope
  • Identify root cause area
  • Estimate resolution time
  • Declare severity level

3. Communication (15-30 minutes)

  • Update status page
  • Notify customers (if needed)
  • Internal stakeholder updates
  • Set up war room

4. Resolution (Varies)

  • Implement fix
  • Verify resolution
  • Monitor for recurrence
  • Document actions

5. Recovery (30-60 minutes)

  • Full service restoration
  • Performance validation
  • Customer communication
  • Incident closure

Service-Specific Runbooks

Coordinator API

High Error Rate

Symptoms: 5xx errors >5%, response time >500ms

Runbook:

  1. Check pod health: kubectl get pods -l app=coordinator
  2. Review logs: kubectl logs -f deployment/coordinator
  3. Check database connectivity
  4. Verify Redis connection
  5. Scale if needed: kubectl scale deployment coordinator --replicas=5
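
A sketch of steps 3-4, assuming the coordinator image ships pg_isready and redis-cli and that the connection hosts are exposed through the environment variables shown (variable names are assumptions):

# Database connectivity from inside a coordinator pod
kubectl exec deploy/coordinator -- sh -c 'pg_isready -h "$DB_HOST" -p "${DB_PORT:-5432}"'

# Redis connectivity (expects PONG)
kubectl exec deploy/coordinator -- sh -c 'redis-cli -h "$REDIS_HOST" ping'

# Count recent errors in the logs before deciding to scale
kubectl logs deploy/coordinator --since=15m | grep -ci error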

Service Unavailable

Symptoms: 503 errors, health check failures

Runbook:

  1. Check deployment status
  2. Review recent deployments
  3. Rollback if necessary
  4. Check resource limits
  5. Verify ingress configuration
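
The matching kubectl commands for steps 1-5; the deployment and ingress names follow the labels used above and are assumptions:

kubectl rollout status deployment/coordinator                  # step 1: current deployment health
kubectl rollout history deployment/coordinator                 # step 2: recent revisions
kubectl rollout undo deployment/coordinator                    # step 3: roll back to the previous revision
kubectl describe deployment coordinator | grep -A3 -i limits   # step 4: resource limits
kubectl describe ingress coordinator                           # step 5: ingress configuration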

Blockchain Nodes

Consensus Stalled

Symptoms: No new blocks, high finality latency

Runbook:

  1. Check node sync status
  2. Verify network connectivity
  3. Review validator set
  4. Check governance proposals
  5. Restart if needed (with caution)
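
A hedged sketch for steps 1-2; the label selector, workload name, and log keywords are assumptions about how the blockchain nodes are deployed:

# Node pods, restarts, and placement
kubectl get pods -l app=blockchain-node -o wide

# Look for the most recently produced/finalized blocks in the logs
kubectl logs --tail=500 statefulset/blockchain-node | grep -iE 'block|final' | tail -20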

High Peer Drop Rate

Symptoms: Connected peers <50%, network partition

Runbook:

  1. Check network policies
  2. Verify DNS resolution
  3. Review firewall rules
  4. Check load balancer health
  5. Restart networking components
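
A sketch for steps 1-2 that uses a throwaway debug pod so the checks do not depend on tools inside the node image; the namespace, peer host, and port are placeholders:

# Network policies that could be blocking P2P traffic
kubectl get networkpolicy -n <namespace>

# DNS resolution and TCP reachability from a disposable busybox pod
kubectl run net-debug --rm -it --restart=Never --image=busybox -- \
  sh -c 'nslookup <peer-host> && nc -zv <peer-host> <p2p-port>'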

Database (PostgreSQL)

Connection Exhaustion

Symptoms: "Too many connections" errors Runbook:

  1. Check active connections
  2. Identify long-running queries
  3. Kill idle connections
  4. Increase pool size if needed
  5. Scale database
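
The SQL behind steps 1-3, runnable with psql against the primary (connection flags omitted); pg_stat_activity and pg_terminate_backend are standard PostgreSQL, and the 10-minute idle cutoff is an example value:

# Step 1: current connections vs. the configured limit
psql -c "SELECT count(*) AS connections, current_setting('max_connections') AS max FROM pg_stat_activity;"

# Step 2: longest-running non-idle queries
psql -c "SELECT pid, now() - query_start AS duration, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC LIMIT 10;"

# Step 3: terminate connections idle for more than 10 minutes
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"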

Replica Lag

Symptoms: Read replica lag >10 seconds

Runbook:

  1. Check replica status
  2. Review network latency
  3. Verify disk space
  4. Restart replication if needed
  5. Failover if necessary
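
SQL for step 1, assuming PostgreSQL 10 or newer (replay_lag was added to pg_stat_replication in version 10):

# On the primary: per-replica replication state and lag
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn, replay_lag FROM pg_stat_replication;"

# On the replica: time since the last replayed transaction
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"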

Redis

Memory Pressure

Symptoms: OOM errors, high eviction rate

Runbook:

  1. Check memory usage
  2. Review key expiration
  3. Clean up unused keys
  4. Scale Redis cluster
  5. Optimize data structures
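
redis-cli commands covering steps 1-3 (host and auth flags omitted):

# Step 1: memory usage, limit, and eviction policy
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'

# Step 2: eviction and expiration counters
redis-cli INFO stats | grep -E 'evicted_keys|expired_keys'

# Step 3: identify the largest keys before cleaning up (samples keys, safe to run online)
redis-cli --bigkeys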

Connection Issues

Symptoms: Connection timeouts, errors

Runbook:

  1. Check max connections
  2. Review connection pool
  3. Verify network policies
  4. Restart Redis if needed
  5. Scale horizontally
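
A quick sketch for steps 1-2 (host and auth flags omitted):

# Configured connection limit vs. currently connected clients
redis-cli CONFIG GET maxclients
redis-cli INFO clients | grep -E 'connected_clients|blocked_clients'

# Inspect individual connections (age, idle time, last command)
redis-cli CLIENT LIST | head -20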

Monitoring Dashboards

Primary Dashboards

1. System Overview

  • Service health status
  • Error rates (4xx/5xx)
  • Response times
  • Throughput metrics
  • Resource utilization

2. Infrastructure

  • Kubernetes cluster health
  • Node resource usage
  • Pod status and restarts
  • Network traffic
  • Storage capacity

3. Application Metrics

  • Job submission rates
  • Transaction processing
  • Marketplace activity
  • Wallet operations
  • Mining statistics

4. Business KPIs

  • Active users
  • Transaction volume
  • Revenue metrics
  • Customer satisfaction
  • SLA compliance

Alert Rules

Critical Alerts

  • Service down >1 minute
  • Error rate >10%
  • Response time >1 second
  • Disk space >90%
  • Memory usage >95%

Warning Alerts

  • Error rate >5%
  • Response time >500ms
  • CPU usage >80%
  • Queue depth >1000
  • Replica lag >5s
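
As an illustration, one of the critical thresholds above could be expressed as a Prometheus alerting rule roughly as follows; the metric and job names follow common client-library conventions and are assumptions about what the coordinator actually exports:

cat <<'EOF' > coordinator-critical-alerts.yaml
groups:
  - name: coordinator-critical
    rules:
      - alert: CoordinatorHighErrorRate
        # Critical: 5xx error rate above 10% for 5 minutes
        expr: |
          sum(rate(http_requests_total{job="coordinator", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="coordinator"}[5m])) > 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Coordinator 5xx error rate above 10%"
EOF

# Validate the rule file before loading it into Prometheus
promtool check rules coordinator-critical-alerts.yaml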

SLOs & SLIs

Service Level Objectives

Service          Metric          Target   Measurement
Coordinator API  Availability    99.9%    30-day rolling
Coordinator API  Latency         <200ms   95th percentile
Blockchain       Block Time      <2s      24-hour average
Marketplace      Success Rate    99.5%    Daily
Explorer         Response Time   <500ms   95th percentile

Service Level Indicators

Availability

  • HTTP status codes
  • Health check responses
  • Pod readiness status

Latency

  • Request duration histogram
  • Database query times
  • External API calls
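
For example, the 95th-percentile latency SLI can be computed from the request duration histogram with a query such as the one below, issued against the Prometheus HTTP API; the metric and job names are assumptions:

curl -sG http://prometheus.monitoring:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="coordinator"}[5m])) by (le))'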

Throughput

  • Requests per second
  • Transactions per block
  • Jobs completed per hour

Quality

  • Error rates
  • Success rates
  • Customer satisfaction

Post-Incident Process

Immediate Actions (0-1 hour)

  1. Verify full resolution
  2. Monitor for recurrence
  3. Update status page
  4. Notify stakeholders

Post-Mortem (1-24 hours)

  1. Create incident document
  2. Gather timeline and logs
  3. Identify root cause
  4. Document lessons learned

Follow-up (1-7 days)

  1. Schedule post-mortem meeting
  2. Assign action items
  3. Update runbooks
  4. Improve monitoring

Review (Weekly)

  1. Review incident trends
  2. Update SLOs if needed
  3. Adjust alerting thresholds
  4. Improve processes

Maintenance Windows

Scheduled Maintenance

  • Frequency: Weekly maintenance window
  • Time: Sunday 02:00-04:00 UTC
  • Duration: Maximum 2 hours
  • Notification: 72 hours advance

Emergency Maintenance

  • Approval: Engineering Manager required
  • Notification: 4 hours advance (if possible)
  • Duration: As needed
  • Rollback: Always required

Tools & Systems

Monitoring Stack

  • Prometheus: Metrics collection
  • Grafana: Visualization and dashboards
  • Alertmanager: Alert routing and management
  • PagerDuty: On-call scheduling and escalation

Observability

  • Jaeger: Distributed tracing
  • Loki: Log aggregation
  • Kiali: Service mesh visualization
  • Kube-state-metrics: Kubernetes metrics

Communication

  • Slack: Primary communication
  • Zoom: War room meetings
  • Status Page: Customer notifications
  • Email: Formal communications

Training & Onboarding

New On-Call Engineer

  1. Shadow primary for 1 week
  2. Review all runbooks
  3. Test alerting systems
  4. Handle low-severity incidents
  5. Solo on-call with mentor

Ongoing Training

  • Monthly incident drills
  • Quarterly runbook updates
  • Annual training refreshers
  • Cross-team knowledge sharing

Emergency Procedures

Major Outage

  1. Declare incident (SEV-0)
  2. Activate war room
  3. Customer communication
  4. Executive updates
  5. Recovery coordination

Security Incident

  1. Isolate affected systems
  2. Preserve evidence
  3. Notify security team
  4. Customer notification
  5. Regulatory compliance

Data Loss

  1. Stop affected services
  2. Assess impact
  3. Initiate recovery
  4. Customer communication
  5. Prevent recurrence

Appendix

A. Contact List

[Detailed contact information]

B. Runbook Checklist

[Quick reference checklists]

C. Alert Configuration

[Prometheus rules and thresholds]

[Grafana dashboard URLs]


Document Version: 1.0
Last Updated: 2024-12-22
Next Review: 2025-01-22
Owner: SRE Team