feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru…

docs/operator/backup_restore.md (new file)
@@ -0,0 +1,316 @@

# AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.

## Overview

The AITBC platform implements a comprehensive backup strategy with:

- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently

## Components

### 1. PostgreSQL Database

- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression (see the sketch below)
- **Retention**: 30 days (configurable)
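
The exact dump command lives in `infra/scripts/backup_postgresql.sh` and is not part of this document; a plausible sketch that matches the `.sql.gz` artifacts referenced below (host, user, and output path are assumptions):

```bash
# Sketch only — see infra/scripts/backup_postgresql.sh for the real implementation.
pg_dump --host=postgresql --username=aitbc aitbc \
  | gzip -6 > /tmp/postgresql-backups/postgresql-backup-$(date +%Y%m%d_%H%M%S).sql.gz
```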

### 2. Redis Cache

- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled); see the check below
- **Retention**: 30 days (configurable)
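
The RDB/AOF files only exist if persistence is enabled on the Redis deployment; a quick way to confirm the running configuration (illustrative, using the same `kubectl exec` pattern as the restore steps below):

```bash
# Confirm persistence settings on the running instance.
kubectl exec -n default deployment/redis -- redis-cli CONFIG GET save
kubectl exec -n default deployment/redis -- redis-cli CONFIG GET appendonly
```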

### 3. Ledger Storage

- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)

## Automated Backups

### Kubernetes CronJob

The automated backup system runs daily at 2:00 AM UTC:

```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```

### Backup Schedule

| Time (UTC) | Component  | Type | Retention |
|------------|------------|------|-----------|
| 02:00      | PostgreSQL | Full | 30 days   |
| 02:01      | Redis      | Full | 30 days   |
| 02:02      | Ledger     | Full | 30 days   |

## Manual Backups

### PostgreSQL

```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup.sql.gz s3://aitbc-backups-default/postgresql/
```

### Redis

```bash
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)

# Force background save before backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
```

### Ledger Storage

```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```

## Restore Procedures

### PostgreSQL Restore

```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```

### Redis Restore

```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

# Clear existing data
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof

# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```

### Ledger Restore

```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

# Copy ledger data
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```

## Disaster Recovery

### Recovery Time Objective (RTO)

| Component  | RTO Target | Notes                        |
|------------|------------|------------------------------|
| PostgreSQL | 1 hour     | Database restore from backup |
| Redis      | 15 minutes | Cache rebuild from backup    |
| Ledger     | 2 hours    | Full chain synchronization   |

### Recovery Point Objective (RPO)

| Component  | RPO Target | Notes                             |
|------------|------------|-----------------------------------|
| PostgreSQL | 24 hours   | Daily backups                     |
| Redis      | 24 hours   | Daily backups                     |
| Ledger     | 24 hours   | Daily full + incremental backups  |

### Disaster Recovery Steps

1. **Assess Impact**

   ```bash
   # Check component status
   kubectl get pods -n default
   kubectl get events --sort-by=.metadata.creationTimestamp
   ```

2. **Restore Critical Services**

   ```bash
   # Restore PostgreSQL first (critical for operations)
   ./infra/scripts/restore_postgresql.sh default [latest-backup]

   # Restore Redis cache
   ./infra/scripts/restore_redis.sh default [latest-backup]

   # Restore ledger data
   ./infra/scripts/restore_ledger.sh default [latest-backup]
   ```

3. **Verify System Health**

   ```bash
   # Check all services
   kubectl get pods -n default

   # Verify API endpoints
   curl -s http://coordinator-api:8011/v1/health
   curl -s http://blockchain-node:8080/v1/health
   ```

## Monitoring and Alerting

### Backup Monitoring

Prometheus metrics track backup success/failure:

```yaml
# AlertManager rules for backups
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```
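
How `backup_success` reaches Prometheus is not specified here; one common pattern is for the backup job to push it through a Pushgateway, sketched below (the Pushgateway address and label names are assumptions):

```bash
# Hypothetical: publish a success/failure gauge at the end of a backup run.
COMPONENT=postgresql
STATUS=1   # 1 = success, 0 = failure
cat <<EOF | curl --data-binary @- "http://pushgateway:9091/metrics/job/aitbc-backup/component/${COMPONENT}"
backup_success ${STATUS}
EOF
```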

### Log Monitoring

```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```

## Best Practices

### Backup Security

1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing (see the quick checks below)
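
A lightweight complement to the monthly restore tests described further below, assuming the backup artifacts shown earlier (file names are illustrative):

```bash
# Quick integrity checks that can run on every backup, before a full restore test.
gunzip -t /tmp/postgresql-backups/my-backup.sql.gz               # gzip archive is intact
zcat /tmp/postgresql-backups/my-backup.sql.gz | head -n 5        # dump header looks like SQL
tar -tzf /tmp/ledger-backup-20231222_020000.tar.gz > /dev/null   # ledger archive lists cleanly
```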

### Performance Considerations

1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up one after another to avoid contention
3. **Compression**: All backups compressed to save storage
4. **Incremental Backups**: Ledger supports incremental backups to reduce size

### Testing

1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current

## Troubleshooting

### Common Issues

#### Backup Fails with "Permission Denied"

```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```

#### Restore Fails with "Database in Use"

```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3
```

#### Ledger Restore Incomplete

```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
cat metadata.json | jq '.latest_block_height'
```

### Getting Help

1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`

## Configuration

### Environment Variables

| Variable               | Default       | Description                |
|------------------------|---------------|----------------------------|
| BACKUP_RETENTION_DAYS  | 30            | Days to keep backups       |
| BACKUP_SCHEDULE        | 0 2 * * *     | Cron schedule for backups  |
| S3_BUCKET_PREFIX       | aitbc-backups | S3 bucket name prefix      |
| COMPRESSION_LEVEL      | 6             | gzip compression level     |

### Customizing Backup Schedule

Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:

```yaml
spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC
```

### Adjusting Retention

Modify retention in each backup script:

```bash
# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
```

docs/operator/beta-release-plan.md (new file)
@@ -0,0 +1,273 @@

# AITBC Beta Release Plan

## Executive Summary

This document outlines the beta release plan for AITBC (AI Trusted Blockchain Computing), a blockchain platform designed for AI workloads. The release follows a phased approach: Alpha → Beta → Release Candidate (RC) → General Availability (GA).

## Release Phases

### Phase 1: Alpha Release (Completed)

- **Duration**: 2 weeks
- **Participants**: Internal team (10 members)
- **Focus**: Core functionality validation
- **Status**: ✅ Completed

### Phase 2: Beta Release (Current)

- **Duration**: 6 weeks
- **Participants**: 50-100 external testers
- **Focus**: User acceptance testing, performance validation, security assessment
- **Start Date**: 2025-01-15
- **End Date**: 2025-02-26

### Phase 3: Release Candidate

- **Duration**: 2 weeks
- **Participants**: 20 selected beta testers
- **Focus**: Final bug fixes, performance optimization
- **Start Date**: 2025-03-04
- **End Date**: 2025-03-18

### Phase 4: General Availability

- **Date**: 2025-03-25
- **Target**: Public launch

## Beta Release Timeline

### Week 1-2: Onboarding & Basic Flows

- **Jan 15-19**: Tester onboarding and environment setup
- **Jan 22-26**: Basic job submission and completion flows
- **Milestone**: 80% of testers successfully submit and complete jobs

### Week 3-4: Marketplace & Explorer Testing

- **Jan 29 - Feb 2**: Marketplace functionality testing
- **Feb 5-9**: Explorer UI validation and transaction tracking
- **Milestone**: 100 marketplace transactions completed

### Week 5-6: Stress Testing & Feedback

- **Feb 12-16**: Performance stress testing (1000+ concurrent jobs)
- **Feb 19-23**: Security testing and final feedback collection
- **Milestone**: All critical bugs resolved

## User Acceptance Testing (UAT) Scenarios

### 1. Core Job Lifecycle

- **Scenario**: Submit AI inference job → Miner picks up → Execution → Results delivery → Payment
- **Test Cases**:
  - Job submission with various model types
  - Job monitoring and status tracking
  - Result retrieval and verification
  - Payment processing and wallet updates
- **Success Criteria**: 95% success rate across 1000 test jobs

### 2. Marketplace Operations

- **Scenario**: Create offer → Accept offer → Execute job → Complete transaction
- **Test Cases**:
  - Offer creation and management
  - Bid acceptance and matching
  - Price discovery mechanisms
  - Dispute resolution
- **Success Criteria**: 50 successful marketplace transactions

### 3. Explorer Functionality

- **Scenario**: Transaction lookup → Job tracking → Address analysis
- **Test Cases**:
  - Real-time transaction monitoring
  - Job history and status visualization
  - Wallet balance tracking
  - Block explorer features
- **Success Criteria**: All transactions visible within 5 seconds

### 4. Wallet Management

- **Scenario**: Wallet creation → Funding → Transactions → Backup/Restore
- **Test Cases**:
  - Multi-signature wallet creation
  - Cross-chain transfers
  - Backup and recovery procedures
  - Staking and unstaking operations
- **Success Criteria**: 100% wallet recovery success rate

### 5. Mining Operations

- **Scenario**: Miner setup → Job acceptance → Mining rewards → Pool participation
- **Test Cases**:
  - Miner registration and setup
  - Job bidding and execution
  - Reward distribution
  - Pool mining operations
- **Success Criteria**: 90% of submitted jobs accepted by miners

### 6. Community Management

#### Discord Community Structure

- **#announcements**: Official updates and milestones
- **#beta-testers**: Private channel for testers only
- **#bug-reports**: Structured bug reporting format
- **#feature-feedback**: Feature requests and discussions
- **#technical-support**: 24/7 support from the team

#### Regulatory Considerations

- **KYC/AML**: Basic identity verification for testers
- **Securities Law**: Beta tokens have no monetary value
- **Tax Reporting**: Testnet transactions not taxable
- **Export Controls**: Compliance with technology export laws

#### Geographic Restrictions

Beta testing is not available in:

- North Korea, Iran, Cuba, Syria, Crimea
- Countries under US sanctions
- Jurisdictions with unclear crypto regulations

### 7. Token Economics Validation

- **Scenario**: Token issuance → Reward distribution → Staking yields → Fee mechanisms
- **Test Cases**:
  - Mining reward calculations match whitepaper specs
  - Staking yields and unstaking penalties
  - Transaction fee burning and distribution
  - Marketplace fee structures
  - Token inflation/deflation mechanics
- **Success Criteria**: All token operations within 1% of theoretical values

## Performance Benchmarks (Go/No-Go Criteria)

### Must-Have Metrics

- **Transaction Throughput**: ≥ 100 TPS (Transactions Per Second)
- **Job Completion Time**: ≤ 5 minutes for standard inference jobs
- **API Response Time**: ≤ 200ms (95th percentile; see the spot-check sketch below)
- **System Uptime**: ≥ 99.9% during beta period
- **MTTR (Mean Time To Recovery)**: ≤ 2 minutes (from chaos tests)
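
For an ad-hoc spot check of the response-time target against a devnet coordinator (endpoint and sample size are assumptions; formal results should come from the stress-testing weeks above):

```bash
# Rough p95 latency estimate from 100 sequential health checks; not a load test.
for i in $(seq 1 100); do
  curl -s -o /dev/null -w "%{time_total}\n" http://127.0.0.2:8011/v1/health
done | sort -n | awk 'NR==95 {printf "p95 ~= %.0f ms\n", $1*1000}'
```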

### Nice-to-Have Metrics

- **Transaction Throughput**: ≥ 500 TPS
- **Job Completion Time**: ≤ 2 minutes
- **API Response Time**: ≤ 100ms (95th percentile)
- **Concurrent Users**: ≥ 1000 simultaneous users

## Security Testing

### Automated Security Scans

- **Smart Contract Audits**: Completed by [Security Firm]
- **Penetration Testing**: OWASP Top 10 validation
- **Dependency Scanning**: CVE scan of all dependencies
- **Chaos Testing**: Network partition and coordinator outage scenarios

### Manual Security Reviews

- **Authorization Testing**: API key validation and permissions
- **Data Privacy**: GDPR compliance validation
- **Cryptography**: Proof verification and signature validation
- **Infrastructure Security**: Kubernetes and cloud security review

## Test Environment Setup

### Beta Environment

- **Network**: Separate testnet with faucet for test tokens
- **Infrastructure**: Production-like setup with monitoring
- **Data**: Reset weekly to ensure clean testing
- **Support**: 24/7 Discord support channel

### Access Credentials

- **Testnet Faucet**: 1000 AITBC tokens per tester
- **API Keys**: Unique keys per tester with rate limits
- **Wallet Seeds**: Generated per tester with backup instructions
- **Mining Accounts**: Pre-configured mining pools for testing

## Feedback Collection Mechanisms

### Automated Collection

- **Error Reporting**: Automatic crash reports and error logs
- **Performance Metrics**: Client-side performance data
- **Usage Analytics**: Feature usage tracking (anonymized)
- **Survey System**: In-app feedback prompts

### Manual Collection

- **Weekly Surveys**: Structured feedback on specific features
- **Discord Channels**: Real-time feedback and discussions
- **Office Hours**: Weekly Q&A sessions with the team
- **Bug Bounty**: Program for critical issue discovery

## Success Criteria

### Go/No-Go Decision Points

#### Week 2 Checkpoint (Jan 26)

- **Go Criteria**: 80% of testers onboarded, basic flows working
- **Blockers**: Critical bugs in job submission/completion

#### Week 4 Checkpoint (Feb 9)

- **Go Criteria**: 50 marketplace transactions, explorer functional
- **Blockers**: Security vulnerabilities, performance < 50 TPS

#### Week 6 Final Decision (Feb 23)

- **Go Criteria**: All UAT scenarios passed, benchmarks met
- **Blockers**: Any critical security issue, MTTR > 5 minutes

### Overall Success Metrics

- **User Satisfaction**: ≥ 4.0/5.0 average rating
- **Bug Resolution**: 90% of reported bugs fixed
- **Performance**: All benchmarks met
- **Security**: No critical vulnerabilities

## Risk Management

### Technical Risks

- **Consensus Issues**: Rollback to previous version
- **Performance Degradation**: Auto-scaling and optimization
- **Security Breaches**: Immediate patch and notification

### Operational Risks

- **Test Environment Downtime**: Backup environment ready
- **Low Tester Participation**: Incentive program adjustments
- **Feature Scope Creep**: Strict feature freeze after Week 4

### Mitigation Strategies

- **Daily Health Checks**: Automated monitoring and alerts
- **Rollback Plan**: Documented procedures for quick rollback
- **Communication Plan**: Regular updates to all stakeholders

## Communication Plan

### Internal Updates

- **Daily Standups**: Development team sync
- **Weekly Reports**: Progress to leadership
- **Bi-weekly Demos**: Feature demonstrations

### External Updates

- **Beta Newsletter**: Weekly updates to testers
- **Blog Posts**: Public progress updates
- **Social Media**: Regular platform updates

## Post-Beta Activities

### RC Phase Preparation

- **Bug Triage**: Prioritize and assign all reported issues
- **Performance Tuning**: Optimize based on beta metrics
- **Documentation Updates**: Incorporate beta feedback

### GA Preparation

- **Final Security Review**: Complete audit and penetration test
- **Infrastructure Scaling**: Prepare for production load
- **Support Team Training**: Enable customer support team

## Appendix

### A. Test Case Matrix

[Detailed test case spreadsheet link]

### B. Performance Benchmark Results

[Benchmark data and graphs]

### C. Security Audit Reports

[Audit firm reports and findings]

### D. Feedback Analysis

[Summary of all user feedback and actions taken]

## Contact Information

- **Beta Program Manager**: beta@aitbc.io
- **Technical Support**: support@aitbc.io
- **Security Issues**: security@aitbc.io
- **Discord Community**: https://discord.gg/aitbc

---

*Last Updated: 2025-01-10*
*Version: 1.0*
*Next Review: 2025-01-17*

docs/operator/deployment/ports.md (new file)
@@ -0,0 +1,30 @@

# Port Allocation Plan

This document tracks current and planned TCP port assignments across the AITBC devnet stack. Update it whenever new services are introduced or defaults change.

## Current Usage

| Port | Service | Location | Notes |
| --- | --- | --- | --- |
| 8011 | Coordinator API (dev) | `apps/coordinator-api/` | Development coordinator API with job and marketplace endpoints. |
| 8071 | Wallet Daemon API | `apps/wallet-daemon/` | REST and JSON-RPC wallet service with receipt verification. |
| 8080 | Blockchain RPC API (FastAPI) | `apps/blockchain-node/scripts/devnet_up.sh` → `python -m uvicorn aitbc_chain.app:app` | Exposes REST/WebSocket RPC endpoints for blocks, transactions, receipts. |
| 8090 | Mock Coordinator API | `apps/blockchain-node/scripts/devnet_up.sh` → `uvicorn mock_coordinator:app` | Generates synthetic coordinator/miner telemetry consumed by Grafana dashboards. |
| 8100 | Pool Hub API (planned) | `apps/pool-hub/` | FastAPI service for miner registry and matching. |
| 8900 | Coordinator API (production) | `apps/coordinator-api/` | Production-style deployment port. |
| 9090 | Prometheus | `apps/blockchain-node/observability/` | Scrapes blockchain node + mock coordinator metrics. |
| 3000 | Grafana | `apps/blockchain-node/observability/` | Visualizes metrics dashboards for blockchain and coordinator. |
| 4173 | Explorer Web (dev) | `apps/explorer-web/` | Vite dev server for blockchain explorer interface. |
| 5173 | Marketplace Web (dev) | `apps/marketplace-web/` | Vite dev server for marketplace interface. |

## Reserved / Planned Ports

- **Miner Node** – No default port (connects to coordinator via HTTP).
- **JavaScript/Python SDKs** – Client libraries, no dedicated ports.

## Guidance

- Avoid reusing the same port across services in devnet scripts to prevent binding conflicts (recent issues occurred when `8080`/`8090` were already in use).
- For production-grade environments, place HTTP services behind a reverse proxy (nginx/Traefik) and update this table with the external vs. internal port mapping (see the example below).
- When adding new dashboards or exporters, note both the scrape port (Prometheus) and any UI port (Grafana/others).
- If a port is deprecated, strike it through in this table and add a note describing the migration path.
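
A minimal reverse-proxy sketch for the production-style coordinator port, assuming nginx with TLS termination (hostname and certificate setup are illustrative, not a committed config):

```nginx
# Illustrative only: expose the coordinator (internal 8900) behind nginx.
server {
    listen 443 ssl;
    server_name coordinator.example.com;

    location / {
        proxy_pass http://127.0.0.2:8900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```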

docs/operator/deployment/run.md (new file)
@@ -0,0 +1,281 @@

# Service Run Instructions

These instructions cover the newly scaffolded services. Install dependencies using Poetry (preferred) or `pip` inside a virtual environment.

## Prerequisites

- Python 3.11+
- Poetry 1.7+ (or virtualenv + pip)
- Optional: GPU drivers for miner node workloads

## Coordinator API (`apps/coordinator-api/`)

1. Navigate to the service directory:
   ```bash
   cd apps/coordinator-api
   ```
2. Install dependencies:
   ```bash
   poetry install
   ```
3. Copy environment template and adjust values:
   ```bash
   cp .env.example .env
   ```
   Add coordinator API keys and, if you want signed receipts, set `RECEIPT_SIGNING_KEY_HEX` to a 32-byte Ed25519 private key encoded as 64 hex characters.
4. Configure the database (shared Postgres): ensure `.env` contains `DATABASE_URL=postgresql://aitbc:248218d8b7657aef@localhost:5432/aitbc` or export it in the shell before running commands.
5. Run the API locally (development):
   ```bash
   poetry run uvicorn app.main:app --host 127.0.0.2 --port 8011 --reload
   ```
6. Production-style launch using Gunicorn (ports start at 8900):
   ```bash
   poetry run gunicorn app.main:app -k uvicorn.workers.UvicornWorker -b 127.0.0.2:8900
   ```
7. Generate a signing key (optional):
   ```bash
   python - <<'PY'
   from nacl.signing import SigningKey

   sk = SigningKey.generate()
   print(sk.encode().hex())
   PY
   ```
   Store the printed hex string in `RECEIPT_SIGNING_KEY_HEX` to enable signed receipts in responses.
   To add coordinator attestations, set `RECEIPT_ATTESTATION_KEY_HEX` to a separate Ed25519 private key; responses include an `attestations` array that can be verified with the corresponding public key.
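
   Verifiers need the matching public key rather than the private key; a small sketch for deriving the hex-encoded verify key to distribute (assumes PyNaCl, as in the generation snippet above):
   ```bash
   python - <<'PY'
   from nacl.signing import SigningKey

   # Paste the value of RECEIPT_SIGNING_KEY_HEX (or RECEIPT_ATTESTATION_KEY_HEX) here.
   sk = SigningKey(bytes.fromhex("<RECEIPT_SIGNING_KEY_HEX>"))
   print(sk.verify_key.encode().hex())  # share this public key with verifiers
   PY
   ```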

8. Retrieve receipts:
   - Latest receipt for a job: `GET /v1/jobs/{job_id}/receipt`
   - Entire receipt history: `GET /v1/jobs/{job_id}/receipts`

Ensure the client request includes the appropriate API key; responses embed signed payloads compatible with the `packages/py/aitbc-crypto` verification helpers.

Example verification snippet using the Python helpers:

```bash
export PYTHONPATH=packages/py/aitbc-crypto/src
python - <<'PY'
from aitbc_crypto.signing import ReceiptVerifier
from aitbc_crypto.receipt import canonical_json
import json

receipt = json.load(open("receipt.json", "r"))
verifier = ReceiptVerifier(receipt["signature"]["public_key"])
verifier.verify(receipt)
print("receipt verified", receipt["receipt_id"])
PY
```

Alternatively, install the Python SDK helpers:

```bash
cd packages/py/aitbc-sdk
poetry install
export PYTHONPATH=packages/py/aitbc-sdk/src:packages/py/aitbc-crypto/src
python - <<'PY'
from aitbc_sdk import CoordinatorReceiptClient, verify_receipt

client = CoordinatorReceiptClient("http://localhost:8011", "client_dev_key_1")
receipt = client.fetch_latest("<job_id>")
verification = verify_receipt(receipt)
print("miner signature valid:", verification.miner_signature.valid)
print("coordinator attestations:", [att.valid for att in verification.coordinator_attestations])
PY
```

For receipts containing `attestations`, iterate the list and verify each entry with the corresponding public key.

A JavaScript helper will ship with the Stage 2 SDK under `packages/js/`; until then, receipts can be verified with Node.js by loading the canonical JSON and invoking an Ed25519 verify function from `tweetnacl` (the payload is `canonical_json(receipt)` and the public key is `receipt.signature.public_key`).

Example Node.js snippet:

```bash
node --input-type=module <<'JS'
import fs from "fs";
import nacl from "tweetnacl";
import canonical from "json-canonicalize";

const receipt = JSON.parse(fs.readFileSync("receipt.json", "utf-8"));
const message = canonical(receipt).trim();
const sig = receipt.signature.sig;
const key = receipt.signature.key_id;

const signature = Buffer.from(sig.replace(/-/g, "+").replace(/_/g, "/"), "base64");
const publicKey = Buffer.from(key.replace(/-/g, "+").replace(/_/g, "/"), "base64");

const ok = nacl.sign.detached.verify(Buffer.from(message, "utf-8"), signature, publicKey);
console.log("verified:", ok);
JS
```

## Solidity Token (`packages/solidity/aitbc-token/`)

1. Navigate to the token project:
   ```bash
   cd packages/solidity/aitbc-token
   npm install
   ```
2. Run the contract unit tests:
   ```bash
   npx hardhat test
   ```
3. Deploy `AIToken` to the configured Hardhat network. Provide the coordinator (required) and attestor (optional) role recipients via environment variables:
   ```bash
   COORDINATOR_ADDRESS=0xCoordinator \
   ATTESTOR_ADDRESS=0xAttestor \
   npx hardhat run scripts/deploy.ts --network localhost
   ```
   The script prints the deployed address and automatically grants the coordinator and attestor roles if they are not already assigned. Export the printed address for follow-on steps:
   ```bash
   export AITOKEN_ADDRESS=0xDeployedAddress
   ```
4. Mint tokens against an attested receipt by calling the contract from Hardhat’s console or a script. The helper below loads the deployed contract and invokes `mintWithReceipt` with an attestor signature:
   ```ts
   // scripts/mintWithReceipt.ts
   import { ethers } from "hardhat";
   import { AIToken__factory } from "../typechain-types";

   async function main() {
     const [coordinator] = await ethers.getSigners();
     const token = AIToken__factory.connect(process.env.AITOKEN_ADDRESS!, coordinator);

     const provider = "0xProvider";
     const units = 100n;
     const receiptHash = "0x...";
     const signature = "0xSignedStructHash";

     const tx = await token.mintWithReceipt(provider, units, receiptHash, signature);
     await tx.wait();
     console.log("Mint complete", await token.balanceOf(provider));
   }

   main().catch((err) => {
     console.error(err);
     process.exitCode = 1;
   });
   ```
   Execute the helper with `AITOKEN_ADDRESS` exported and the signature produced by the attestor key used in your tests or integration flow:
   ```bash
   AITOKEN_ADDRESS=$AITOKEN_ADDRESS npx hardhat run scripts/mintWithReceipt.ts --network localhost
   ```
5. To derive the signature payload, reuse the `buildSignature` helper from `test/aitoken.test.ts` or recreate it in a script (a sketch follows below). The struct hash encodes `(chainId, contractAddress, provider, units, receiptHash)` and must be signed by an authorized attestor account.
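
   A hypothetical recreation of that helper with ethers v6 is sketched below, purely to illustrate the field order; the authoritative encoding is whatever `test/aitoken.test.ts` implements:
   ```ts
   // Hypothetical sketch — mirror test/aitoken.test.ts rather than trusting this layout.
   import { ethers } from "hardhat";
   import type { Signer } from "ethers";

   async function buildSignature(
     attestor: Signer,
     token: string,         // deployed AIToken address
     providerAddr: string,  // recipient of the minted units
     units: bigint,
     receiptHash: string,   // 0x-prefixed 32-byte hash
   ): Promise<string> {
     const { chainId } = await ethers.provider.getNetwork();
     // Struct hash over (chainId, contractAddress, provider, units, receiptHash).
     const structHash = ethers.solidityPackedKeccak256(
       ["uint256", "address", "address", "uint256", "bytes32"],
       [chainId, token, providerAddr, units, receiptHash],
     );
     // signMessage applies the EIP-191 prefix; whether AIToken expects the prefixed
     // or raw hash is an assumption to confirm against the contract tests.
     return attestor.signMessage(ethers.getBytes(structHash));
   }
   ```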

## Wallet Daemon (`apps/wallet-daemon/`)

1. Navigate to the service directory:
   ```bash
   cd apps/wallet-daemon
   ```
2. Install dependencies:
   ```bash
   poetry install
   ```
3. Copy or create `.env` with coordinator access:
   ```bash
   cp .env.example .env  # create if missing
   ```
   Populate `COORDINATOR_BASE_URL` and `COORDINATOR_API_KEY` to reuse the coordinator API when verifying receipts.
4. Run the API locally:
   ```bash
   poetry run uvicorn app.main:app --host 127.0.0.2 --port 8071 --reload
   ```
5. REST endpoints:
   - `GET /v1/receipts/{job_id}` – fetch + verify latest coordinator receipt.
   - `GET /v1/receipts/{job_id}/history` – fetch + verify entire receipt history.
6. JSON-RPC endpoint:
   - `POST /rpc` with methods `receipts.verify_latest` and `receipts.verify_history` returning signature validation metadata identical to REST responses.
7. Example REST usage:
   ```bash
   curl -s "http://localhost:8071/v1/receipts/<job_id>" | jq
   ```
8. Example JSON-RPC call:
   ```bash
   curl -s http://localhost:8071/rpc \
     -H 'Content-Type: application/json' \
     -d '{"jsonrpc":"2.0","id":1,"method":"receipts.verify_latest","params":{"job_id":"<job_id>"}}' | jq
   ```
9. Keystore scaffold:
   - `KeystoreService` currently stores wallets in-memory using Argon2id key derivation + XChaCha20-Poly1305 encryption.
   - Subsequent milestones will back this with persistence and CLI/REST routes for wallet creation/import.

## Miner Node (`apps/miner-node/`)

1. Navigate to the directory:
   ```bash
   cd apps/miner-node
   ```
2. Install dependencies:
   ```bash
   poetry install
   ```
3. Configure environment:
   ```bash
   cp .env.example .env
   ```
   Adjust `COORDINATOR_BASE_URL`, `MINER_AUTH_TOKEN`, and workspace paths.
4. Run the miner control loop:
   ```bash
   poetry run python -m aitbc_miner.main
   ```
   The miner now registers and heartbeats against the coordinator, polling for work and executing CLI/Python runners. Ensure the coordinator service is running first.
5. Deploy as a systemd service (optional):
   ```bash
   sudo scripts/ops/install_miner_systemd.sh
   ```
   Add or update `/opt/aitbc/apps/miner-node/.env`, then use `sudo systemctl status aitbc-miner` to monitor the service.

## Blockchain Node (`apps/blockchain-node/`)

1. Navigate to the directory:
   ```bash
   cd apps/blockchain-node
   ```
2. Install dependencies:
   ```bash
   poetry install
   ```
3. Configure environment:
   ```bash
   cp .env.example .env
   ```
   Update database path, proposer key, and bind host/port as needed.
4. Run the node placeholder:
   ```bash
   poetry run python -m aitbc_chain.main
   ```
   (RPC, consensus, and P2P logic still to be implemented.)

### Observability Dashboards & Alerts

1. Generate the starter Grafana dashboards (if not already present):
   ```bash
   cd apps/blockchain-node
   PYTHONPATH=src python - <<'PY'
   from pathlib import Path
   from aitbc_chain.observability.dashboards import generate_default_dashboards

   output_dir = Path("observability/generated_dashboards")
   output_dir.mkdir(parents=True, exist_ok=True)
   generate_default_dashboards(output_dir)
   print("Dashboards written to", output_dir)
   PY
   ```
2. Import each JSON file into Grafana (**Dashboards → Import**):
   - `apps/blockchain-node/observability/generated_dashboards/coordinator-overview.json`
   - `apps/blockchain-node/observability/generated_dashboards/blockchain-node-overview.json`

   Select your Prometheus datasource (pointing at `127.0.0.1:8080` and `127.0.0.1:8090`) during import.
3. Ensure Prometheus scrapes both services. Example snippet from `apps/blockchain-node/observability/prometheus.yml`:
   ```yaml
   scrape_configs:
     - job_name: "blockchain-node"
       static_configs:
         - targets: ["127.0.0.1:8080"]

     - job_name: "mock-coordinator"
       static_configs:
         - targets: ["127.0.0.1:8090"]
   ```
4. Deploy the Alertmanager rules in `apps/blockchain-node/observability/alerts.yml` (proposer stalls, miner errors, receipt drop-offs, RPC error spikes). After modifying rule files, reload Prometheus/Alertmanager:
   ```bash
   systemctl restart prometheus
   systemctl restart alertmanager
   ```
5. Validate by briefly stopping `aitbc-coordinator.service`, confirming Grafana panels pause and the new alerts fire, then restart the service.

## Next Steps

- Flesh out remaining logic per task breakdowns in `docs/*.md` (e.g., capability-aware scheduling, artifact uploads).
- Run the growing test suites regularly:
  - `pytest apps/coordinator-api/tests/test_jobs.py`
  - `pytest apps/coordinator-api/tests/test_miner_service.py`
  - `pytest apps/miner-node/tests/test_runners.py`
- Create systemd and Nginx configs once services are runnable in production mode.

docs/operator/incident-runbooks.md (new file)
@@ -0,0 +1,485 @@

# AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.

## Runbook: Coordinator API Outage

### Based on Chaos Test: `chaos_test_coordinator.py`

### Symptoms

- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```

### Investigation (2-10 minutes)

1. **Review Logs**
   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**
   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**
   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```

### Recovery Actions

1. **Scale Up if Resource Starved**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**
   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**
   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```

### Verification

```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```

## Runbook: Network Partition

### Based on Chaos Test: `chaos_test_network.py`

### Symptoms

- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```

### Investigation (5-15 minutes)

1. **Identify Partitioned Nodes**
   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**
   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**
   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```

### Recovery Actions

1. **Remove Problematic Network Rules**
   ```bash
   # Flush iptables on affected nodes
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default $pod -- iptables -F
   done
   ```

2. **Restart Network Components**
   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**
   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```

### Verification

```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```

## Runbook: Database Failure

### Based on Chaos Test: `chaos_test_database.py`

### Symptoms

- Database connection errors
- Service degradation
- Failed transactions
- High error rates

### MTTR Target: 3 minutes

### Immediate Actions (0-3 minutes)

```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```

### Investigation (3-10 minutes)

1. **Review Database Logs**
   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```

### Recovery Actions

1. **Kill Idle Connections**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**
   ```bash
   # Promote replica if primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```

### Verification

```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```

## Runbook: Redis Failure

### Symptoms

- Caching failures
- Session loss
- Increased database load
- Slow response times

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)

```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```

### Investigation (2-5 minutes)

1. **Review Redis Logs**
   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```

### Recovery Actions

1. **Clear Expired Keys**
   ```bash
   kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
   ```

2. **Restart Redis**
   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**
   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

### Verification

```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
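
`curl-format.txt` is referenced here and in later runbooks but is not included in this commit; a minimal example of such a format file:

```bash
# Create curl-format.txt next to where the runbook commands are executed.
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s
time_connect:       %{time_connect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
```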

## Runbook: High CPU/Memory Usage

### Symptoms

- Slow response times
- Pod evictions
- OOM errors
- System degradation

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)

```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry pods
kubectl exec -n default deployment/coordinator -- top

# 3. Check for OOM kills
dmesg | grep -i "killed process"
```

### Investigation (5-15 minutes)

1. **Analyze Resource Usage**
   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**
   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**
   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```

### Recovery Actions

1. **Scale Services**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**
   ```bash
   kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**
   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```

### Verification

```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```

## Runbook: Storage Issues

### Symptoms

- Disk space warnings
- Write failures
- Database errors
- Pod crashes

### MTTR Target: 10 minutes

### Immediate Actions (0-10 minutes)

```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```

### Investigation (10-20 minutes)

1. **Analyze Storage Usage**
   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**
   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**
   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```

### Recovery Actions

1. **Expand Storage**
   ```bash
   kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**
   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete

   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

### Verification

```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```

## Emergency Contact Procedures

### Escalation Matrix

1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)

### War Room Activation

```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```

### Customer Communication

1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)

## Post-Incident Checklist

### Immediate (0-1 hour)

- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified

### Short-term (1-24 hours)

- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled

### Long-term (1-7 days)

- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated

## Runbook Maintenance

### Review Schedule

- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision

### Update Process

1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation

---

*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*

docs/operator/index.md (new file)
@@ -0,0 +1,40 @@

# AITBC Operator Documentation

Welcome to the AITBC operator documentation. This section contains resources for deploying, operating, and maintaining AITBC infrastructure.

## Deployment

- [Deployment Guide](deployment/run.md) - How to deploy AITBC components
- [Installation](deployment/installation.md) - System requirements and installation
- [Configuration](deployment/configuration.md) - Configuration options
- [Ports](deployment/ports.md) - Network ports and requirements

## Operations

- [Backup & Restore](backup_restore.md) - Data backup and recovery procedures
- [Security](security.md) - Security best practices and hardening
- [Monitoring](monitoring/monitoring-playbook.md) - System monitoring and observability
- [Incident Response](incident-runbooks.md) - Incident handling procedures

## Architecture

- [System Architecture](../reference/architecture/) - Understanding AITBC architecture
- [Components](../reference/architecture/) - Component documentation
- [Multi-tenancy](../reference/architecture/) - Multi-tenant infrastructure

## Scaling

- [Scaling Guide](scaling.md) - How to scale AITBC infrastructure
- [Performance Tuning](performance.md) - Performance optimization
- [Capacity Planning](capacity.md) - Resource planning

## Reference

- [Glossary](../reference/glossary.md) - Terms and definitions
- [Troubleshooting](../user-guide/troubleshooting.md) - Common issues and solutions
- [FAQ](../user-guide/faq.md) - Frequently asked questions

## Support

- [Getting Help](../user-guide/support.md) - How to get support
- [Contact](../user-guide/support.md) - Contact information

docs/operator/monitoring/monitoring-playbook.md (new file)
@@ -0,0 +1,449 @@

# AITBC Monitoring Playbook & On-Call Guide

## Overview

This document provides comprehensive monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. It ensures reliable operation of all services and quick resolution of issues.

## Service Overview

### Core Services

- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster

### Critical Metrics

- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile); see the example alert rules below
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 minutes for critical incidents
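
The latency and availability targets above can be encoded as Prometheus alert rules, in the same style as the backup alert in `backup_restore.md`; the metric names (`http_request_duration_seconds_bucket`, `up`) and the job label are assumptions about what the services expose:

```yaml
- alert: CoordinatorLatencyHigh
  # p95 request latency over 5 minutes exceeds the 200ms target.
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="coordinator-api"}[5m])) by (le)) > 0.2
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Coordinator API p95 latency above 200ms"

- alert: CoordinatorDown
  expr: up{job="coordinator-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Coordinator API target is down"
```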
|
||||
|
||||
## On-Call Rotation
|
||||
|
||||
### Rotation Schedule
|
||||
- **Primary On-Call**: 1 week rotation, Monday 00:00 UTC to Monday 00:00 UTC
|
||||
- **Secondary On-Call**: Shadow primary, handles escalations
|
||||
- **Tertiary**: Backup for both primary and secondary
|
||||
- **Rotation Handoff**: Every Monday at 08:00 UTC
|
||||
|
||||
### Team Structure
|
||||
```
|
||||
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
|
||||
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
|
||||
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
|
||||
```
|
||||
|
||||
### Handoff Procedures
|
||||
1. **Pre-handoff Check** (Sunday 22:00 UTC):
|
||||
- Review active incidents
|
||||
- Check scheduled maintenance
|
||||
- Verify monitoring systems health
|
||||
|
||||
2. **Handoff Meeting** (Monday 08:00 UTC):
|
||||
- 15-minute video call
|
||||
- Discuss current issues
|
||||
- Transfer knowledge
|
||||
- Confirm contact information
|
||||
|
||||
3. **Post-handoff** (Monday 09:00 UTC):
|
||||
- Primary acknowledges receipt
|
||||
- Update on-call calendar
|
||||
- Test alerting systems
|
||||
|
||||
### Contact Information
|
||||
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
|
||||
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
|
||||
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
|
||||
- **Escalation Manager**: +1-555-ESCALATE
|
||||
- **Emergency**: +1-555-EMERGENCY (Critical infrastructure only)
|
||||
|
||||
## Alerting & Escalation
|
||||
|
||||
### Alert Severity Levels
|
||||
|
||||
#### Critical (P0)
|
||||
- Service completely down
|
||||
- Data loss or corruption
|
||||
- Security breach
|
||||
- SLA violation in progress
|
||||
- **Response Time**: 5 minutes
|
||||
- **Escalation**: 15 minutes if no response
|
||||
|
||||
#### High (P1)
|
||||
- Significant degradation
|
||||
- Partial service outage
|
||||
- High error rates (>10%)
|
||||
- **Response Time**: 15 minutes
|
||||
- **Escalation**: 1 hour if no response
|
||||
|
||||
#### Medium (P2)
|
||||
- Minor degradation
|
||||
- Elevated error rates (5-10%)
|
||||
- Performance issues
|
||||
- **Response Time**: 1 hour
|
||||
- **Escalation**: 4 hours if no response
|
||||
|
||||
#### Low (P3)
|
||||
- Informational alerts
|
||||
- Non-critical issues
|
||||
- **Response Time**: 4 hours
|
||||
- **Escalation**: 24 hours if no response
|
||||
|
||||
### Escalation Policy
|
||||
1. **Level 1**: Primary On-Call (5-60 minutes)
|
||||
2. **Level 2**: Secondary On-Call (15 minutes - 4 hours)
|
||||
3. **Level 3**: Tertiary On-Call (1 hour - 24 hours)
|
||||
4. **Level 4**: Engineering Manager (4 hours)
|
||||
5. **Level 5**: CTO (Critical incidents only)
|
||||
|
||||
### Alert Channels
|
||||
- **PagerDuty**: Primary alerting system
|
||||
- **Slack**: #on-call-aitbc channel
|
||||
- **Email**: oncall@aitbc.io
|
||||
- **SMS**: Critical alerts only
|
||||
- **Phone**: Critical incidents only
|
||||
|
||||
## Incident Response
|
||||
|
||||
### Incident Classification
|
||||
|
||||
#### SEV-0 (Critical)
|
||||
- Complete service outage
|
||||
- Data loss or security breach
|
||||
- Financial impact >$10,000/hour
|
||||
- Customer impact >50%
|
||||
|
||||
#### SEV-1 (High)
|
||||
- Significant service degradation
|
||||
- Feature unavailable
|
||||
- Financial impact $1,000-$10,000/hour
|
||||
- Customer impact 10-50%
|
||||
|
||||
#### SEV-2 (Medium)
|
||||
- Minor service degradation
|
||||
- Performance issues
|
||||
- Financial impact <$1,000/hour
|
||||
- Customer impact <10%
|
||||
|
||||
#### SEV-3 (Low)
|
||||
- Informational
|
||||
- No customer impact
|
||||
|
||||
### Incident Response Process
|
||||
|
||||
#### 1. Detection & Triage (0-5 minutes)
|
||||
```bash
|
||||
# Check alert severity
|
||||
# Verify impact
|
||||
# Create incident channel
|
||||
# Notify stakeholders
|
||||
```
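
The triage checklist above can be backed by a few quick commands. A minimal sketch, assuming the public API exposes a `/health` endpoint and the coordinator runs as `deployment/coordinator` (both assumptions):

```bash
# Pods that are not healthy anywhere in the cluster
kubectl get pods -A | grep -Ev 'Running|Completed'

# Quick probe of the public API (the /health path is an assumption)
curl -s -o /dev/null -w '%{http_code}\n' https://api.aitbc.io/health

# Recent coordinator errors to gauge blast radius
kubectl logs deployment/coordinator --since=10m | grep -iE 'error|exception' | tail -n 20
```
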
#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level

#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room

#### 4. Resolution (Varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions

#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure

## Service-Specific Runbooks

### Coordinator API

#### High Error Rate
**Symptoms**: 5xx errors >5%, response time >500ms
**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity (see the connectivity sketch below)
4. Verify Redis connection
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
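
For steps 3 and 4, a minimal connectivity check; the Redis deployment name matches the backup procedures, while the PostgreSQL service name, user, and database (`postgres`, `aitbc`) are assumptions:

```bash
# PostgreSQL reachability via a one-off client pod
kubectl run pg-check --rm -it --restart=Never --image=postgres:16 -- \
  pg_isready -h postgres.default.svc.cluster.local -p 5432

# Redis reachability and client pressure
kubectl exec -n default deployment/redis -- redis-cli PING
kubectl exec -n default deployment/redis -- redis-cli INFO clients | head -n 5
```
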
#### Service Unavailable
**Symptoms**: 503 errors, health check failures
**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Rollback if necessary (see the rollback sketch below)
4. Check resource limits
5. Verify ingress configuration
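
A rollback sketch for steps 1-4, assuming the API runs as `deployment/coordinator` with an ingress of the same name (both assumptions):

```bash
# Deployment status and rollout history
kubectl rollout status deployment/coordinator
kubectl rollout history deployment/coordinator

# Roll back to the previous revision if the latest deploy caused the outage
kubectl rollout undo deployment/coordinator

# Resource limits and ingress wiring
kubectl describe deployment coordinator | grep -A4 -i limits
kubectl describe ingress coordinator
```
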
### Blockchain Nodes

#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency
**Runbook**:
1. Check node sync status (see the sketch below)
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
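
A sketch of step 1, assuming node pods carry the label `app=blockchain-node` and expose a status RPC on port 8080; the label, port, and `/status` path are all assumptions, so adjust them to the actual node deployment:

```bash
# Node pods and recent block-production log lines
kubectl get pods -l app=blockchain-node -o wide
kubectl logs -l app=blockchain-node --tail=200 | grep -i block | tail -n 20

# Sync/height check via the node RPC (endpoint is an assumption)
kubectl port-forward svc/blockchain-node 8080:8080 &
curl -s http://localhost:8080/status | jq '.'
kill %1
```
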
#### High Peer Drop Rate
**Symptoms**: Connected peers <50%, network partition
**Runbook**:
1. Check network policies (see the sketch below)
2. Verify DNS resolution
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
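
A sketch for steps 1-4, assuming the node service is labelled `app=blockchain-node` and lives in the `default` namespace (both assumptions):

```bash
# Network policies that could be dropping peer traffic
kubectl get networkpolicy -A

# DNS resolution from inside the cluster
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup blockchain-node.default.svc.cluster.local

# Service and endpoint health behind the load balancer
kubectl get svc,endpoints -l app=blockchain-node
```
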
### Database (PostgreSQL)

#### Connection Exhaustion
**Symptoms**: "Too many connections" errors
**Runbook**:
1. Check active connections (see the queries below)
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
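
Example queries for steps 1-3; the deployment, user, and database names (`postgres`, `aitbc`) are assumptions:

```bash
# Active connections grouped by state
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"

# Queries running longer than 5 minutes
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c \
  "SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
     FROM pg_stat_activity
    WHERE state = 'active' AND now() - query_start > interval '5 minutes';"

# Terminate connections idle in transaction for more than 10 minutes
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
    WHERE state = 'idle in transaction' AND now() - state_change > interval '10 minutes';"
```
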
#### Replica Lag
**Symptoms**: Read replica lag >10 seconds
**Runbook**:
1. Check replica status (see the queries below)
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Failover if necessary
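
Lag checks for steps 1 and 3, with the same naming assumptions as above plus a `postgres-replica` deployment (an assumption):

```bash
# Byte lag per replica, as seen from the primary
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c \
  "SELECT client_addr, state,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
     FROM pg_stat_replication;"

# Time-based lag, as seen from a replica
kubectl exec -n default deployment/postgres-replica -- psql -U aitbc -d aitbc -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"

# Disk headroom on the replica data volume
kubectl exec -n default deployment/postgres-replica -- df -h /var/lib/postgresql/data
```
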
### Redis

#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate
**Runbook**:
1. Check memory usage (see the commands below)
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
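
Commands for steps 1-3; the Redis deployment name matches the backup procedures, and the key name in the last line is a placeholder:

```bash
# Memory usage, eviction policy, and eviction counters
kubectl exec -n default deployment/redis -- redis-cli INFO memory | \
  grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'
kubectl exec -n default deployment/redis -- redis-cli INFO stats | grep evicted_keys

# Sample the keyspace for the largest keys without blocking the server
kubectl exec -n default deployment/redis -- redis-cli --bigkeys

# Spot-check the TTL of a suspect key (placeholder key name)
kubectl exec -n default deployment/redis -- redis-cli TTL session:example
```
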
#### Connection Issues
**Symptoms**: Connection timeouts, errors
**Runbook**:
1. Check max connections
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally

## Monitoring Dashboards

### Primary Dashboards

#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization

#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity

#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics

#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance

### Alert Rules

#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk usage >90%
- Memory usage >95%

#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s

## SLOs & SLIs

### Service Level Objectives

| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
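
These SLOs can be tracked directly from Prometheus. A query sketch for the Coordinator API rows, assuming standard `http_requests_total` / `http_request_duration_seconds` metrics with a `job="coordinator-api"` label (metric names, labels, and the in-cluster Prometheus URL are assumptions):

```bash
PROM=http://prometheus.monitoring.svc:9090   # in-cluster Prometheus URL (assumption)

# 30-day rolling availability: 1 minus the 5xx error ratio
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=1 - sum(rate(http_requests_total{job="coordinator-api",code=~"5.."}[30d]))
           / sum(rate(http_requests_total{job="coordinator-api"}[30d]))' \
  | jq '.data.result[0].value[1]'

# 95th percentile latency over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95,
           sum(rate(http_request_duration_seconds_bucket{job="coordinator-api"}[5m])) by (le))' \
  | jq '.data.result[0].value[1]'
```
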
### Service Level Indicators

#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status

#### Latency
- Request duration histogram
- Database query times
- External API calls

#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour

#### Quality
- Error rates
- Success rates
- Customer satisfaction

## Post-Incident Process

### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders

### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned

### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring

### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes

## Maintenance Windows

### Scheduled Maintenance
- **Frequency**: Weekly maintenance window
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours advance

### Emergency Maintenance
- **Approval**: Engineering Manager required
- **Notification**: 4 hours advance (if possible)
- **Duration**: As needed
- **Rollback plan**: Always required

## Tools & Systems

### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation

### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **Kube-state-metrics**: Kubernetes metrics

### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications

## Training & Onboarding

### New On-Call Engineer
1. Shadow primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. Solo on-call with mentor

### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing

## Emergency Procedures

### Major Outage
1. Declare incident (SEV-0)
2. Activate war room
3. Customer communication
4. Executive updates
5. Recovery coordination

### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify security team
4. Customer notification
5. Regulatory compliance

### Data Loss
1. Stop affected services
2. Assess impact
3. Initiate recovery
4. Customer communication
5. Prevent recurrence

## Appendix

### A. Contact List
[Detailed contact information]

### B. Runbook Checklist
[Quick reference checklists]

### C. Alert Configuration
[Prometheus rules and thresholds]

### D. Dashboard Links
[Grafana dashboard URLs]

---

*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*

340
docs/operator/security.md
Normal file
@@ -0,0 +1,340 @@

# AITBC Security Documentation

This document outlines the security architecture, threat model, and implementation details for the AITBC platform.

## Overview

AITBC implements defense-in-depth security across multiple layers:
- Network security with TLS termination
- API authentication and authorization
- Secrets management and encryption
- Infrastructure security best practices
- Monitoring and incident response

## Threat Model

### Threat Actors

| Actor | Motivation | Capabilities | Impact |
|-------|-----------|--------------|--------|
| External attacker | Financial gain, disruption | Network access, exploits | High |
| Malicious insider | Data theft, sabotage | Internal access | Critical |
| Competitor | IP theft, market manipulation | Sophisticated attacks | High |
| Casual user | Accidental misuse | Limited knowledge | Low |

### Attack Vectors

1. **Network Attacks**
   - Man-in-the-middle (MITM) attacks
   - DDoS attacks
   - Network reconnaissance

2. **API Attacks**
   - Unauthorized access to marketplace
   - API key leakage
   - Rate limiting bypass
   - Injection attacks

3. **Infrastructure Attacks**
   - Container escape
   - Pod-to-pod attacks
   - Secrets exfiltration
   - Supply chain attacks

4. **Blockchain-Specific Attacks**
   - 51% attacks on consensus
   - Transaction replay attacks
   - Smart contract exploits
   - Miner collusion

### Security Controls

| Control | Implementation | Mitigates |
|---------|----------------|-----------|
| TLS 1.3 | cert-manager + ingress | MITM, eavesdropping |
| API Keys | X-API-Key header | Unauthorized access |
| Rate Limiting | slowapi middleware | DDoS, abuse |
| Network Policies | Kubernetes NetworkPolicy | Pod-to-pod attacks |
| Secrets Mgmt | Kubernetes Secrets + SealedSecrets | Secrets exfiltration |
| RBAC | Kubernetes RBAC | Privilege escalation |
| Monitoring | Prometheus + AlertManager | Incident detection |

## Security Architecture

### Network Security

#### TLS Termination
```yaml
# Ingress configuration with TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
spec:
  tls:
    - hosts:
        - api.aitbc.io
      secretName: api-tls
```

#### Certificate Management
- Uses cert-manager for automatic certificate provisioning
- Supports Let's Encrypt for production
- Internal CA for development environments
- Automatic renewal 30 days before expiry (see the verification sketch below)
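
Renewal can be verified from the Certificate resources and the certificate actually being served. A quick check, assuming the secret name `api-tls` from the ingress example above:

```bash
# Certificate resources and their readiness
kubectl get certificates -A
kubectl describe certificate api-tls | grep -A5 Status

# Expiry date of the certificate stored in the TLS secret
kubectl get secret api-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -enddate
```
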
### API Security

#### Authentication
- API key-based authentication for all services
- Keys stored in Kubernetes Secrets
- Per-service key rotation policies
- Audit logging for all authenticated requests

#### Authorization
- Role-based access control (RBAC)
- Resource-level permissions
- Rate limiting per API key
- IP whitelisting for sensitive operations

#### API Key Format
```
Header: X-API-Key: aitbc_prod_ak_1a2b3c4d5e6f7g8h9i0j
```
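
A sketch of minting a key in this format and storing it where the coordinator can read it; the secret name `coordinator-api-keys` reuses the example later in this document, while the random-suffix length and the `/v1/health` path are assumptions:

```bash
# Generate a key with the documented prefix scheme
API_KEY="aitbc_prod_ak_$(openssl rand -hex 10)"

# Store (or update) it in the coordinator's key secret
kubectl create secret generic coordinator-api-keys \
  --from-literal=api-key-prod="$API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -

# Call the API with the key
curl -H "X-API-Key: $API_KEY" https://api.aitbc.io/v1/health
```
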
### Secrets Management

#### Kubernetes Secrets
- Secrets are only base64-encoded by default (not encrypted)
- Encrypted at rest when etcd encryption is enabled
- Access controlled via RBAC

#### SealedSecrets (Recommended for Production)
- Client-side encryption of secrets
- GitOps friendly
- Decryption possible only by the in-cluster controller

#### Secret Rotation
- Automated rotation every 90 days
- Zero-downtime rotation for services (see the sketch below)
- Audit trail of all rotations
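
A zero-downtime rotation sketch, assuming the service accepts both the old and the new key during the overlap window and reads them from the `coordinator-api-keys` secret (secret, key, and deployment names are assumptions):

```bash
NEW_KEY="aitbc_prod_ak_$(openssl rand -hex 10)"

# 1. Add the new key alongside the old one
kubectl patch secret coordinator-api-keys --type merge \
  -p '{"stringData":{"api-key-prod-next":"'"$NEW_KEY"'"}}'

# 2. Restart so pods pick up both keys, then migrate clients
kubectl rollout restart deployment/coordinator
kubectl rollout status deployment/coordinator

# 3. Remove the old key once clients have switched
kubectl patch secret coordinator-api-keys --type json \
  -p '[{"op":"remove","path":"/data/api-key-prod"}]'
```
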
## Implementation Details

### 1. TLS Configuration

#### Coordinator API
```yaml
# Helm values for coordinator
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
  tls:
    - secretName: coordinator-tls
      hosts:
        - api.aitbc.io
```

#### Blockchain Node RPC
```
# WebSocket with TLS
wss://api.aitbc.io:8080/ws
```

### 2. API Authentication Middleware

#### Coordinator API Implementation
```python
from fastapi import FastAPI, Request, Security, HTTPException
from fastapi.responses import JSONResponse
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=True)


def verify_key(api_key) -> bool:
    """Check the key against the configured key store (e.g. a Kubernetes Secret)."""
    ...


async def verify_api_key(api_key: str = Security(api_key_header)):
    # Dependency for route-level protection
    if not verify_key(api_key):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key


@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    # Require an API key on all versioned API routes
    if request.url.path.startswith("/v1/"):
        api_key = request.headers.get("X-API-Key")
        if not verify_key(api_key):
            # Exceptions raised inside middleware bypass FastAPI's handlers,
            # so return the error response directly
            return JSONResponse(status_code=403, content={"detail": "Invalid or missing API key"})
    return await call_next(request)
```

### 3. Secrets Management Setup

#### SealedSecrets Installation
```bash
# Install sealed-secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system

# Create a sealed secret
kubeseal --format yaml < secret.yaml > sealed-secret.yaml
```

#### Example Secret Structure
```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: coordinator-api-keys
spec:
  encryptedData:
    api-key-prod: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
    api-key-dev: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
```

### 4. Network Policies

#### Default Deny Policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

#### Service-Specific Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: coordinator-api-netpol
spec:
  podSelector:
    matchLabels:
      app: coordinator-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-nginx
      ports:
        - protocol: TCP
          port: 8011
```

## Security Best Practices

### Development Environment
- Use 127.0.0.2 for local development (not 0.0.0.0)
- Separate API keys for dev/staging/prod
- Enable debug logging only in development
- Use self-signed certificates for local TLS

### Production Environment
- Enable all security headers
- Implement comprehensive logging
- Use external secret management
- Regular security audits
- Penetration testing quarterly

### Monitoring and Alerting

#### Security Metrics
- Failed authentication attempts
- Unusual API usage patterns
- Certificate expiry warnings
- Secret access audits

#### Alert Rules
```yaml
- alert: HighAuthFailureRate
  expr: rate(auth_failures_total[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High authentication failure rate detected"

- alert: CertificateExpiringSoon
  expr: cert_certificate_expiry_time < time() + 86400 * 7
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Certificate expires in less than 7 days"
```

## Incident Response

### Security Incident Categories
1. **Critical**: Data breach, system compromise
2. **High**: Service disruption, privilege escalation
3. **Medium**: Suspicious activity, policy violation
4. **Low**: Misconfiguration, minor issue

### Response Procedures
1. **Detection**: Automated alerts, manual monitoring
2. **Assessment**: Impact analysis, containment
3. **Remediation**: Patch, rotate credentials, restore
4. **Post-mortem**: Document, improve controls

### Emergency Contacts
- Security Team: security@aitbc.io
- On-call Engineer: +1-555-SECURITY
- Incident Commander: incident@aitbc.io

## Compliance

### Data Protection
- GDPR compliance for EU users
- CCPA compliance for California users
- Data retention policies
- Right to deletion implementation

### Auditing
- Quarterly security audits
- Annual penetration testing
- Continuous vulnerability scanning
- Third-party security assessments

## Security Checklist

### Pre-deployment
- [ ] All API endpoints require authentication
- [ ] TLS certificates valid and properly configured
- [ ] Secrets encrypted and access-controlled
- [ ] Network policies implemented
- [ ] RBAC configured correctly
- [ ] Monitoring and alerting active
- [ ] Backup encryption enabled
- [ ] Security headers configured

### Post-deployment
- [ ] Security testing completed
- [ ] Documentation updated
- [ ] Team trained on procedures
- [ ] Incident response tested
- [ ] Compliance verified

## References

- [OWASP API Security Top 10](https://owasp.org/www-project-api-security/)
- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)
- [NIST Cybersecurity Framework](https://www.nist.gov/cyberframework)
- [CERT Coordination Center](https://www.cert.org/)

## Security Updates

This document is updated regularly. Last updated: 2024-12-22

For questions or concerns, contact the security team at security@aitbc.io