feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instrumentation
oib · 2025-12-22 10:33:23 +01:00
commit c8be9d7414 · parent d98b2c7772
260 changed files with 59033 additions and 351 deletions

@@ -0,0 +1,316 @@
# AITBC Backup and Restore Procedures
This document outlines the backup and restore procedures for all AITBC system components including PostgreSQL, Redis, and blockchain ledger storage.
## Overview
The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently
## Components
### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)
### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)
### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)
## Automated Backups
### Kubernetes CronJob
The automated backup system runs daily at 2:00 AM UTC:
```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml
# Check CronJob status
kubectl get cronjob aitbc-backup
# View backup jobs
kubectl get jobs -l app=aitbc-backup
# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```
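The manifest lives at `infra/k8s/backup-cronjob.yaml`; as a rough sketch of what such a CronJob could contain (container image and script paths are assumptions, the schedule and service account match values referenced elsewhere in this document):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aitbc-backup
spec:
  schedule: "0 2 * * *"              # 02:00 UTC daily
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account   # see Troubleshooting below
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: aitbc/backup-tools:latest         # placeholder image
              command: ["/bin/sh", "-c"]
              args:
                - >
                  /scripts/backup_postgresql.sh default &&
                  /scripts/backup_redis.sh default &&
                  /scripts/backup_ledger.sh default
              env:
                - name: BACKUP_RETENTION_DAYS
                  value: "30"
```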
### Backup Schedule
| Time (UTC) | Component | Type | Retention |
|------------|----------------|------------|-----------|
| 02:00 | PostgreSQL | Full | 30 days |
| 02:01 | Redis | Full | 30 days |
| 02:02 | Ledger | Full | 30 days |
## Manual Backups
### PostgreSQL
```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)
# View available backups
ls -la /tmp/postgresql-backups/
# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup.sql.gz s3://aitbc-backups-default/postgresql/
```
### Redis
```bash
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)
# Force background save before backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
```
### Ledger Storage
```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)
# Create incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```
## Restore Procedures
### PostgreSQL Restore
```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/
# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/
# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz
# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```
### Redis Restore
```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default
# Clear existing data
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb
# Start Redis service
kubectl scale deployment redis --replicas=1 -n default
# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```
### Ledger Restore
```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default
# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/
# Copy ledger data
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/
# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default
# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```
## Disaster Recovery
### Recovery Time Objective (RTO)
| Component | RTO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 1 hour | Database restore from backup |
| Redis | 15 minutes | Cache rebuild from backup |
| Ledger | 2 hours | Full chain synchronization |
### Recovery Point Objective (RPO)
| Component | RPO Target | Notes |
|----------------|------------|---------------------------------|
| PostgreSQL | 24 hours | Daily backups |
| Redis | 24 hours | Daily backups |
| Ledger | 24 hours | Daily full + incremental backups|
### Disaster Recovery Steps
1. **Assess Impact**
```bash
# Check component status
kubectl get pods -n default
kubectl get events --sort-by=.metadata.creationTimestamp
```
2. **Restore Critical Services**
```bash
# Restore PostgreSQL first (critical for operations)
./infra/scripts/restore_postgresql.sh default [latest-backup]
# Restore Redis cache
./infra/scripts/restore_redis.sh default [latest-backup]
# Restore ledger data
./infra/scripts/restore_ledger.sh default [latest-backup]
```
3. **Verify System Health**
```bash
# Check all services
kubectl get pods -n default
# Verify API endpoints
curl -s http://coordinator-api:8011/v1/health
curl -s http://blockchain-node:8080/v1/health
```
## Monitoring and Alerting
### Backup Monitoring
Prometheus metrics track backup success/failure:
```yaml
# AlertManager rules for backups
- alert: BackupFailed
expr: backup_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup failed for {{ $labels.component }}"
description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```
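How `backup_success` is populated depends on the backup scripts; one common pattern (an assumption here, not necessarily what the scripts do) is to push a gauge to a Prometheus Pushgateway at the end of each run:

```bash
# Illustrative: push the result from the end of a backup script (assumes a Pushgateway at pushgateway:9091)
STATUS=1   # 1 = success, 0 = failure
cat <<EOF | curl --silent --data-binary @- http://pushgateway:9091/metrics/job/aitbc-backup/component/postgresql
# TYPE backup_success gauge
backup_success ${STATUS}
EOF
```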
### Log Monitoring
```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100
# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```
## Best Practices
### Backup Security
1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing
### Performance Considerations
1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up one at a time to limit load on shared storage
3. **Compression**: All backups compressed to save storage
4. **Incremental Backups**: Ledger supports incremental to reduce size
### Testing
1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full scenario testing
3. **Documentation Updates**: Keep procedures current
## Troubleshooting
### Common Issues
#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```
#### Restore Fails with "Database in Use"
```bash
# Scale down application before restore
kubectl scale deployment coordinator-api --replicas=0
# Perform restore
# Scale up after restore
kubectl scale deployment coordinator-api --replicas=3
```
#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
cat metadata.json | jq '.latest_block_height'
```
### Getting Help
1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: Test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
## Configuration
### Environment Variables
| Variable | Default | Description |
|------------------------|------------------|---------------------------------|
| BACKUP_RETENTION_DAYS | 30 | Days to keep backups |
| BACKUP_SCHEDULE | 0 2 * * * | Cron schedule for backups |
| S3_BUCKET_PREFIX | aitbc-backups | S3 bucket name prefix |
| COMPRESSION_LEVEL | 6 | gzip compression level |
### Customizing Backup Schedule
Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:
```yaml
spec:
schedule: "0 3 * * *" # Change to 3 AM UTC
```
### Adjusting Retention
Modify retention in each backup script:
```bash
# In backup_*.sh scripts
RETENTION_DAYS=60 # Keep for 60 days instead of 30
```
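The cleanup mechanism itself is internal to the scripts; a typical approach (an illustrative sketch, not the exact script contents) deletes local dumps older than the retention window and lets an S3 lifecycle rule expire the off-site copies:

```bash
# Illustrative retention cleanup (local path and bucket follow the examples above)
RETENTION_DAYS=${BACKUP_RETENTION_DAYS:-30}
find /tmp/postgresql-backups -name '*.sql.gz' -mtime +"${RETENTION_DAYS}" -delete

# Expire off-site copies with an S3 lifecycle rule
aws s3api put-bucket-lifecycle-configuration \
  --bucket aitbc-backups-default \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-backups","Status":"Enabled","Filter":{"Prefix":"postgresql/"},"Expiration":{"Days":30}}]}'
```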

@@ -0,0 +1,273 @@
# AITBC Beta Release Plan
## Executive Summary
This document outlines the beta release plan for AITBC (AI Trusted Blockchain Computing), a blockchain platform designed for AI workloads. The release follows a phased approach: Alpha → Beta → Release Candidate (RC) → General Availability (GA).
## Release Phases
### Phase 1: Alpha Release (Completed)
- **Duration**: 2 weeks
- **Participants**: Internal team (10 members)
- **Focus**: Core functionality validation
- **Status**: ✅ Completed
### Phase 2: Beta Release (Current)
- **Duration**: 6 weeks
- **Participants**: 50-100 external testers
- **Focus**: User acceptance testing, performance validation, security assessment
- **Start Date**: 2025-01-15
- **End Date**: 2025-02-26
### Phase 3: Release Candidate
- **Duration**: 2 weeks
- **Participants**: 20 selected beta testers
- **Focus**: Final bug fixes, performance optimization
- **Start Date**: 2025-03-04
- **End Date**: 2025-03-18
### Phase 4: General Availability
- **Date**: 2025-03-25
- **Target**: Public launch
## Beta Release Timeline
### Week 1-2: Onboarding & Basic Flows
- **Jan 15-19**: Tester onboarding and environment setup
- **Jan 22-26**: Basic job submission and completion flows
- **Milestone**: 80% of testers successfully submit and complete jobs
### Week 3-4: Marketplace & Explorer Testing
- **Jan 29 - Feb 2**: Marketplace functionality testing
- **Feb 5-9**: Explorer UI validation and transaction tracking
- **Milestone**: 100 marketplace transactions completed
### Week 5-6: Stress Testing & Feedback
- **Feb 12-16**: Performance stress testing (1000+ concurrent jobs)
- **Feb 19-23**: Security testing and final feedback collection
- **Milestone**: All critical bugs resolved
## User Acceptance Testing (UAT) Scenarios
### 1. Core Job Lifecycle
- **Scenario**: Submit AI inference job → Miner picks up → Execution → Results delivery → Payment
- **Test Cases**:
- Job submission with various model types
- Job monitoring and status tracking
- Result retrieval and verification
- Payment processing and wallet updates
- **Success Criteria**: 95% success rate across 1000 test jobs
### 2. Marketplace Operations
- **Scenario**: Create offer → Accept offer → Execute job → Complete transaction
- **Test Cases**:
- Offer creation and management
- Bid acceptance and matching
- Price discovery mechanisms
- Dispute resolution
- **Success Criteria**: 50 successful marketplace transactions
### 3. Explorer Functionality
- **Scenario**: Transaction lookup → Job tracking → Address analysis
- **Test Cases**:
- Real-time transaction monitoring
- Job history and status visualization
- Wallet balance tracking
- Block explorer features
- **Success Criteria**: All transactions visible within 5 seconds
### 4. Wallet Management
- **Scenario**: Wallet creation → Funding → Transactions → Backup/Restore
- **Test Cases**:
- Multi-signature wallet creation
- Cross-chain transfers
- Backup and recovery procedures
- Staking and unstaking operations
- **Success Criteria**: 100% wallet recovery success rate
### 5. Mining Operations
- **Scenario**: Miner setup → Job acceptance → Mining rewards → Pool participation
- **Test Cases**:
- Miner registration and setup
- Job bidding and execution
- Reward distribution
- Pool mining operations
- **Success Criteria**: 90% of submitted jobs accepted by miners
### 6. Community Management
#### Discord Community Structure
- **#announcements**: Official updates and milestones
- **#beta-testers**: Private channel for testers only
- **#bug-reports**: Structured bug reporting format
- **#feature-feedback**: Feature requests and discussions
- **#technical-support**: 24/7 support from the team
#### Regulatory Considerations
- **KYC/AML**: Basic identity verification for testers
- **Securities Law**: Beta tokens have no monetary value
- **Tax Reporting**: Testnet transactions not taxable
- **Export Controls**: Compliance with technology export laws
#### Geographic Restrictions
Beta testing is not available in:
- North Korea, Iran, Cuba, Syria, Crimea
- Countries under US sanctions
- Jurisdictions with unclear crypto regulations
### 7. Token Economics Validation
- **Scenario**: Token issuance → Reward distribution → Staking yields → Fee mechanisms
- **Test Cases**:
- Mining reward calculations match whitepaper specs
- Staking yields and unstaking penalties
- Transaction fee burning and distribution
- Marketplace fee structures
- Token inflation/deflation mechanics
- **Success Criteria**: All token operations within 1% of theoretical values
## Performance Benchmarks (Go/No-Go Criteria)
### Must-Have Metrics
- **Transaction Throughput**: ≥ 100 TPS (Transactions Per Second)
- **Job Completion Time**: ≤ 5 minutes for standard inference jobs
- **API Response Time**: ≤ 200ms (95th percentile)
- **System Uptime**: ≥ 99.9% during beta period
- **MTTR (Mean Time To Recovery)**: ≤ 2 minutes (from chaos tests)
### Nice-to-Have Metrics
- **Transaction Throughput**: ≥ 500 TPS
- **Job Completion Time**: ≤ 2 minutes
- **API Response Time**: ≤ 100ms (95th percentile)
- **Concurrent Users**: ≥ 1000 simultaneous users
## Security Testing
### Automated Security Scans
- **Smart Contract Audits**: Completed by [Security Firm]
- **Penetration Testing**: OWASP Top 10 validation
- **Dependency Scanning**: CVE scan of all dependencies
- **Chaos Testing**: Network partition and coordinator outage scenarios
### Manual Security Reviews
- **Authorization Testing**: API key validation and permissions
- **Data Privacy**: GDPR compliance validation
- **Cryptography**: Proof verification and signature validation
- **Infrastructure Security**: Kubernetes and cloud security review
## Test Environment Setup
### Beta Environment
- **Network**: Separate testnet with faucet for test tokens
- **Infrastructure**: Production-like setup with monitoring
- **Data**: Reset weekly to ensure clean testing
- **Support**: 24/7 Discord support channel
### Access Credentials
- **Testnet Faucet**: 1000 AITBC tokens per tester
- **API Keys**: Unique keys per tester with rate limits
- **Wallet Seeds**: Generated per tester with backup instructions
- **Mining Accounts**: Pre-configured mining pools for testing
## Feedback Collection Mechanisms
### Automated Collection
- **Error Reporting**: Automatic crash reports and error logs
- **Performance Metrics**: Client-side performance data
- **Usage Analytics**: Feature usage tracking (anonymized)
- **Survey System**: In-app feedback prompts
### Manual Collection
- **Weekly Surveys**: Structured feedback on specific features
- **Discord Channels**: Real-time feedback and discussions
- **Office Hours**: Weekly Q&A sessions with the team
- **Bug Bounty**: Program for critical issue discovery
## Success Criteria
### Go/No-Go Decision Points
#### Week 2 Checkpoint (Jan 26)
- **Go Criteria**: 80% of testers onboarded, basic flows working
- **Blockers**: Critical bugs in job submission/completion
#### Week 4 Checkpoint (Feb 9)
- **Go Criteria**: 50 marketplace transactions, explorer functional
- **Blockers**: Security vulnerabilities, performance < 50 TPS
#### Week 6 Final Decision (Feb 23)
- **Go Criteria**: All UAT scenarios passed, benchmarks met
- **Blockers**: Any critical security issue, MTTR > 5 minutes
### Overall Success Metrics
- **User Satisfaction**: ≥ 4.0/5.0 average rating
- **Bug Resolution**: 90% of reported bugs fixed
- **Performance**: All benchmarks met
- **Security**: No critical vulnerabilities
## Risk Management
### Technical Risks
- **Consensus Issues**: Rollback to previous version
- **Performance Degradation**: Auto-scaling and optimization
- **Security Breaches**: Immediate patch and notification
### Operational Risks
- **Test Environment Downtime**: Backup environment ready
- **Low Tester Participation**: Incentive program adjustments
- **Feature Scope Creep**: Strict feature freeze after Week 4
### Mitigation Strategies
- **Daily Health Checks**: Automated monitoring and alerts
- **Rollback Plan**: Documented procedures for quick rollback
- **Communication Plan**: Regular updates to all stakeholders
## Communication Plan
### Internal Updates
- **Daily Standups**: Development team sync
- **Weekly Reports**: Progress to leadership
- **Bi-weekly Demos**: Feature demonstrations
### External Updates
- **Beta Newsletter**: Weekly updates to testers
- **Blog Posts**: Public progress updates
- **Social Media**: Regular platform updates
## Post-Beta Activities
### RC Phase Preparation
- **Bug Triage**: Prioritize and assign all reported issues
- **Performance Tuning**: Optimize based on beta metrics
- **Documentation Updates**: Incorporate beta feedback
### GA Preparation
- **Final Security Review**: Complete audit and penetration test
- **Infrastructure Scaling**: Prepare for production load
- **Support Team Training**: Enable customer support team
## Appendix
### A. Test Case Matrix
[Detailed test case spreadsheet link]
### B. Performance Benchmark Results
[Benchmark data and graphs]
### C. Security Audit Reports
[Audit firm reports and findings]
### D. Feedback Analysis
[Summary of all user feedback and actions taken]
## Contact Information
- **Beta Program Manager**: beta@aitbc.io
- **Technical Support**: support@aitbc.io
- **Security Issues**: security@aitbc.io
- **Discord Community**: https://discord.gg/aitbc
---
*Last Updated: 2025-01-10*
*Version: 1.0*
*Next Review: 2025-01-17*

@@ -0,0 +1,30 @@
# Port Allocation Plan
This document tracks current and planned TCP port assignments across the AITBC devnet stack. Update it whenever new services are introduced or defaults change.
## Current Usage
| Port | Service | Location | Notes |
| --- | --- | --- | --- |
| 8011 | Coordinator API (dev) | `apps/coordinator-api/` | Development coordinator API with job and marketplace endpoints. |
| 8071 | Wallet Daemon API | `apps/wallet-daemon/` | REST and JSON-RPC wallet service with receipt verification. |
| 8080 | Blockchain RPC API (FastAPI) | `apps/blockchain-node/scripts/devnet_up.sh` (`python -m uvicorn aitbc_chain.app:app`) | Exposes REST/WebSocket RPC endpoints for blocks, transactions, receipts. |
| 8090 | Mock Coordinator API | `apps/blockchain-node/scripts/devnet_up.sh` (`uvicorn mock_coordinator:app`) | Generates synthetic coordinator/miner telemetry consumed by Grafana dashboards. |
| 8100 | Pool Hub API (planned) | `apps/pool-hub/` | FastAPI service for miner registry and matching. |
| 8900 | Coordinator API (production) | `apps/coordinator-api/` | Production-style deployment port. |
| 9090 | Prometheus | `apps/blockchain-node/observability/` | Scrapes blockchain node + mock coordinator metrics. |
| 3000 | Grafana | `apps/blockchain-node/observability/` | Visualizes metrics dashboards for blockchain and coordinator. |
| 4173 | Explorer Web (dev) | `apps/explorer-web/` | Vite dev server for blockchain explorer interface. |
| 5173 | Marketplace Web (dev) | `apps/marketplace-web/` | Vite dev server for marketplace interface. |
## Reserved / Planned Ports
- **Miner Node**: no default port (connects to the coordinator via HTTP).
- **JavaScript/Python SDKs**: client libraries, no dedicated ports.
## Guidance
- Avoid reusing the same port across services in devnet scripts to prevent binding conflicts (recent issues occurred when `8080`/`8090` were already in use); a quick check is shown after this list.
- For production-grade environments, place HTTP services behind a reverse proxy (nginx/Traefik) and update this table with the external vs. internal port mapping.
- When adding new dashboards or exporters, note both the scrape port (Prometheus) and any UI port (Grafana/others).
- If a port is deprecated, strike it through in this table and add a note describing the migration path.
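To check for conflicts before launching devnet services (standard Linux tooling, not an AITBC script):

```bash
# List listeners on the devnet ports; empty output means the ports are free
ss -ltn | awk '{print $4}' | grep -E ':(8011|8071|8080|8090|8100|8900|9090|3000|4173|5173)$'
```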

@@ -0,0 +1,281 @@
# Service Run Instructions
These instructions cover the newly scaffolded services. Install dependencies using Poetry (preferred) or `pip` inside a virtual environment.
## Prerequisites
- Python 3.11+
- Poetry 1.7+ (or virtualenv + pip)
- Optional: GPU drivers for miner node workloads
## Coordinator API (`apps/coordinator-api/`)
1. Navigate to the service directory:
```bash
cd apps/coordinator-api
```
2. Install dependencies:
```bash
poetry install
```
3. Copy environment template and adjust values:
```bash
cp .env.example .env
```
Add coordinator API keys and, if you want signed receipts, set `RECEIPT_SIGNING_KEY_HEX` to a 32-byte Ed25519 private key encoded as 64 hex characters (see step 7).
4. Configure database (shared Postgres): ensure `.env` contains `DATABASE_URL=postgresql://aitbc:248218d8b7657aef@localhost:5432/aitbc` or export it in the shell before running commands.
5. Run the API locally (development):
```bash
poetry run uvicorn app.main:app --host 127.0.0.2 --port 8011 --reload
```
6. Production-style launch using Gunicorn (ports start at 8900):
```bash
poetry run gunicorn app.main:app -k uvicorn.workers.UvicornWorker -b 127.0.0.2:8900
```
7. Generate a signing key (optional):
```bash
python - <<'PY'
from nacl.signing import SigningKey
sk = SigningKey.generate()
print(sk.encode().hex())
PY
```
Store the printed hex string in `RECEIPT_SIGNING_KEY_HEX` to enable signed receipts in responses.
To add coordinator attestations, set `RECEIPT_ATTESTATION_KEY_HEX` to a separate Ed25519 private key; responses include an `attestations` array that can be verified with the corresponding public key.
8. Retrieve receipts:
- Latest receipt for a job: `GET /v1/jobs/{job_id}/receipt`
- Entire receipt history: `GET /v1/jobs/{job_id}/receipts`
Ensure the client request includes the appropriate API key; responses embed signed payloads compatible with `packages/py/aitbc-crypto` verification helpers.
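For example, the latest receipt can be fetched with `curl` (the header name and dev API key follow the examples used elsewhere in these docs; substitute your own values):

```bash
curl -s -H "X-API-Key: client_dev_key_1" \
  "http://127.0.0.2:8011/v1/jobs/<job_id>/receipt" | jq
```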
Example verification snippet using the Python helpers:
```bash
export PYTHONPATH=packages/py/aitbc-crypto/src
python - <<'PY'
from aitbc_crypto.signing import ReceiptVerifier
from aitbc_crypto.receipt import canonical_json
import json
receipt = json.load(open("receipt.json", "r"))
verifier = ReceiptVerifier(receipt["signature"]["public_key"])
verifier.verify(receipt)
print("receipt verified", receipt["receipt_id"])
PY
```
Alternatively, install the Python SDK helpers:
```bash
cd packages/py/aitbc-sdk
poetry install
export PYTHONPATH=packages/py/aitbc-sdk/src:packages/py/aitbc-crypto/src
python - <<'PY'
from aitbc_sdk import CoordinatorReceiptClient, verify_receipt
client = CoordinatorReceiptClient("http://localhost:8011", "client_dev_key_1")
receipt = client.fetch_latest("<job_id>")
verification = verify_receipt(receipt)
print("miner signature valid:", verification.miner_signature.valid)
print("coordinator attestations:", [att.valid for att in verification.coordinator_attestations])
PY
```
For receipts containing `attestations`, iterate the list and verify each entry with the corresponding public key.
A JavaScript helper will ship with the Stage 2 SDK under `packages/js/`; until then, receipts can be verified with Node.js by loading the canonical JSON and invoking an Ed25519 verify function from `tweetnacl` (the payload is `canonical_json(receipt)` and the public key is `receipt.signature.public_key`).
Example Node.js snippet:
```bash
# requires: npm install tweetnacl json-canonicalize
node --input-type=module <<'JS'
import fs from "fs";
import nacl from "tweetnacl";
import canonical from "json-canonicalize";
const receipt = JSON.parse(fs.readFileSync("receipt.json", "utf-8"));
const message = canonical(receipt).trim();
const sig = receipt.signature.sig;
const key = receipt.signature.key_id;
const signature = Buffer.from(sig.replace(/-/g, "+").replace(/_/g, "/"), "base64");
const publicKey = Buffer.from(key.replace(/-/g, "+").replace(/_/g, "/"), "base64");
const ok = nacl.sign.detached.verify(Buffer.from(message, "utf-8"), signature, publicKey);
console.log("verified:", ok);
JS
```
## Solidity Token (`packages/solidity/aitbc-token/`)
1. Navigate to the token project:
```bash
cd packages/solidity/aitbc-token
npm install
```
2. Run the contract unit tests:
```bash
npx hardhat test
```
3. Deploy `AIToken` to the configured Hardhat network. Provide the coordinator (required) and attestor (optional) role recipients via environment variables:
```bash
COORDINATOR_ADDRESS=0xCoordinator \
ATTESTOR_ADDRESS=0xAttestor \
npx hardhat run scripts/deploy.ts --network localhost
```
The script prints the deployed address and automatically grants the coordinator and attestor roles if they are not already assigned. Export the printed address for follow-on steps:
```bash
export AITOKEN_ADDRESS=0xDeployedAddress
```
4. Mint tokens against an attested receipt by calling the contract from Hardhat's console or a script. The helper below loads the deployed contract and invokes `mintWithReceipt` with an attestor signature:
```ts
// scripts/mintWithReceipt.ts
import { ethers } from "hardhat";
import { AIToken__factory } from "../typechain-types";
async function main() {
const [coordinator] = await ethers.getSigners();
const token = AIToken__factory.connect(process.env.AITOKEN_ADDRESS!, coordinator);
const provider = "0xProvider";
const units = 100n;
const receiptHash = "0x...";
const signature = "0xSignedStructHash";
const tx = await token.mintWithReceipt(provider, units, receiptHash, signature);
await tx.wait();
console.log("Mint complete", await token.balanceOf(provider));
}
main().catch((err) => {
console.error(err);
process.exitCode = 1;
});
```
Execute the helper with `AITOKEN_ADDRESS` exported and the signature produced by the attestor key used in your tests or integration flow:
```bash
AITOKEN_ADDRESS=$AITOKEN_ADDRESS npx hardhat run scripts/mintWithReceipt.ts --network localhost
```
5. To derive the signature payload, reuse the `buildSignature` helper from `test/aitoken.test.ts` or recreate it in a script. The struct hash encodes `(chainId, contractAddress, provider, units, receiptHash)` and must be signed by an authorized attestor account.
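If you recreate the helper, the sketch below shows the general shape with ethers v6; treat it as illustrative only, since the authoritative field order and signing scheme are defined by `buildSignature` in `test/aitoken.test.ts` (this version assumes `keccak256(abi.encodePacked(chainId, contractAddress, provider, units, receiptHash))` signed as an EIP-191 personal message):

```ts
// scripts/buildSignature.ts (illustrative sketch; verify the encoding against test/aitoken.test.ts)
import { ethers } from "hardhat";

async function main() {
  const [attestor] = await ethers.getSigners(); // must hold the attestor role
  const { chainId } = await ethers.provider.getNetwork();

  const structHash = ethers.solidityPackedKeccak256(
    ["uint256", "address", "address", "uint256", "bytes32"],
    [chainId, process.env.AITOKEN_ADDRESS!, "0xProvider", 100n, "0x..."]
  );
  // Sign the 32-byte hash as an EIP-191 personal message
  const signature = await attestor.signMessage(ethers.getBytes(structHash));
  console.log("signature:", signature);
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```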
## Wallet Daemon (`apps/wallet-daemon/`)
1. Navigate to the service directory:
```bash
cd apps/wallet-daemon
```
2. Install dependencies:
```bash
poetry install
```
3. Copy or create `.env` with coordinator access:
```bash
cp .env.example .env # create if missing
```
Populate `COORDINATOR_BASE_URL` and `COORDINATOR_API_KEY` to reuse the coordinator API when verifying receipts.
4. Run the API locally:
```bash
poetry run uvicorn app.main:app --host 127.0.0.2 --port 8071 --reload
```
5. REST endpoints:
- `GET /v1/receipts/{job_id}` fetch + verify latest coordinator receipt.
- `GET /v1/receipts/{job_id}/history` fetch + verify entire receipt history.
6. JSON-RPC endpoint:
- `POST /rpc` with methods `receipts.verify_latest` and `receipts.verify_history` returning signature validation metadata identical to REST responses.
7. Example REST usage:
```bash
curl -s "http://localhost:8071/v1/receipts/<job_id>" | jq
```
8. Example JSON-RPC call:
```bash
curl -s http://localhost:8071/rpc \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"receipts.verify_latest","params":{"job_id":"<job_id>"}}' | jq
```
9. Keystore scaffold:
- `KeystoreService` currently stores wallets in-memory using Argon2id key derivation + XChaCha20-Poly1305 encryption.
- Subsequent milestones will back this with persistence and CLI/REST routes for wallet creation/import.
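The real implementation lives in the wallet daemon; the snippet below is only an illustrative sketch of the Argon2id + XChaCha20-Poly1305 combination described above, using PyNaCl (structure and parameters are assumptions, not the actual `KeystoreService`):

```python
# Illustrative sketch only -- not the wallet daemon's KeystoreService.
import os
from nacl import pwhash
from nacl.bindings import (
    crypto_aead_xchacha20poly1305_ietf_decrypt,
    crypto_aead_xchacha20poly1305_ietf_encrypt,
)

def encrypt_wallet(secret: bytes, passphrase: bytes) -> dict:
    salt = os.urandom(pwhash.argon2id.SALTBYTES)
    key = pwhash.argon2id.kdf(32, passphrase, salt)   # Argon2id key derivation
    nonce = os.urandom(24)                            # XChaCha20-Poly1305 uses a 24-byte nonce
    ciphertext = crypto_aead_xchacha20poly1305_ietf_encrypt(secret, None, nonce, key)
    return {"salt": salt, "nonce": nonce, "ciphertext": ciphertext}

def decrypt_wallet(blob: dict, passphrase: bytes) -> bytes:
    key = pwhash.argon2id.kdf(32, passphrase, blob["salt"])
    return crypto_aead_xchacha20poly1305_ietf_decrypt(blob["ciphertext"], None, blob["nonce"], key)
```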
## Miner Node (`apps/miner-node/`)
1. Navigate to the directory:
```bash
cd apps/miner-node
```
2. Install dependencies:
```bash
poetry install
```
3. Configure environment:
```bash
cp .env.example .env
```
Adjust `COORDINATOR_BASE_URL`, `MINER_AUTH_TOKEN`, and workspace paths.
4. Run the miner control loop:
```bash
poetry run python -m aitbc_miner.main
```
The miner now registers and heartbeats against the coordinator, polling for work and executing CLI/Python runners. Ensure the coordinator service is running first.
5. Deploy as a systemd service (optional):
```bash
sudo scripts/ops/install_miner_systemd.sh
```
Add or update `/opt/aitbc/apps/miner-node/.env`, then use `sudo systemctl status aitbc-miner` to monitor the service.
## Blockchain Node (`apps/blockchain-node/`)
1. Navigate to the directory:
```bash
cd apps/blockchain-node
```
2. Install dependencies:
```bash
poetry install
```
3. Configure environment:
```bash
cp .env.example .env
```
Update database path, proposer key, and bind host/port as needed.
4. Run the node placeholder:
```bash
poetry run python -m aitbc_chain.main
```
(RPC, consensus, and P2P logic still to be implemented.)
### Observability Dashboards & Alerts
1. Generate the starter Grafana dashboards (if not already present):
```bash
cd apps/blockchain-node
PYTHONPATH=src python - <<'PY'
from pathlib import Path
from aitbc_chain.observability.dashboards import generate_default_dashboards
output_dir = Path("observability/generated_dashboards")
output_dir.mkdir(parents=True, exist_ok=True)
generate_default_dashboards(output_dir)
print("Dashboards written to", output_dir)
PY
```
2. Import each JSON file into Grafana (**Dashboards → Import**):
- `apps/blockchain-node/observability/generated_dashboards/coordinator-overview.json`
- `apps/blockchain-node/observability/generated_dashboards/blockchain-node-overview.json`
Select your Prometheus datasource during import; it should point at the Prometheus instance (default `http://127.0.0.1:9090`) that scrapes `127.0.0.1:8080` and `127.0.0.1:8090`.
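If you provision Grafana from files instead of the UI, a minimal datasource definition (illustrative; the file location depends on your Grafana installation) pointing at that Prometheus instance could look like:
```yaml
# e.g. grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
```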
3. Ensure Prometheus scrapes both services. Example snippet from `apps/blockchain-node/observability/prometheus.yml`:
```yaml
scrape_configs:
- job_name: "blockchain-node"
static_configs:
- targets: ["127.0.0.1:8080"]
- job_name: "mock-coordinator"
static_configs:
- targets: ["127.0.0.1:8090"]
```
4. Deploy the Alertmanager rules in `apps/blockchain-node/observability/alerts.yml` (proposer stalls, miner errors, receipt drop-offs, RPC error spikes). After modifying rule files, reload Prometheus/Alertmanager:
```bash
systemctl restart prometheus
systemctl restart alertmanager
```
5. Validate by briefly stopping `aitbc-coordinator.service`, confirming Grafana panels pause and the new alerts fire, then restart the service.
## Next Steps
- Flesh out remaining logic per task breakdowns in `docs/*.md` (e.g., capability-aware scheduling, artifact uploads).
- Run the growing test suites regularly:
- `pytest apps/coordinator-api/tests/test_jobs.py`
- `pytest apps/coordinator-api/tests/test_miner_service.py`
- `pytest apps/miner-node/tests/test_runners.py`
- Create systemd and Nginx configs once services are runnable in production mode.

@@ -0,0 +1,485 @@
# AITBC Incident Runbooks
This document contains specific runbooks for common incident scenarios, based on our chaos testing validation.
## Runbook: Coordinator API Outage
### Based on Chaos Test: `chaos_test_coordinator.py`
### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator
# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20
# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator
# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```
### Investigation (2-10 minutes)
1. **Review Logs**
```bash
kubectl logs -n default deployment/coordinator --tail=100
```
2. **Check Resource Limits**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=coordinator
```
3. **Verify Database Connectivity**
```bash
kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
```
4. **Check Redis Connection**
```bash
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
```
### Recovery Actions
1. **Scale Up if Resource Starved**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
```
2. **Manual Pod Deletion if Stuck**
```bash
kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
```
3. **Rollback Deployment**
```bash
kubectl rollout undo deployment/coordinator -n default
```
### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health
# Test API with sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
## Runbook: Network Partition
### Based on Chaos Test: `chaos_test_network.py`
### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq
# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq
# 3. Check network policies
kubectl get networkpolicies -n default
```
### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
```bash
# Check each node's peer count
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
echo "Pod: $pod"
kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
done
```
2. **Check Network Policies**
```bash
kubectl describe networkpolicy default-deny-all-ingress -n default
kubectl describe networkpolicy blockchain-node-netpol -n default
```
3. **Verify DNS Resolution**
```bash
kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
```
### Recovery Actions
1. **Remove Problematic Network Rules**
```bash
# Flush iptables on affected nodes
for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
kubectl exec -n default $pod -- iptables -F
done
```
2. **Restart Network Components**
```bash
kubectl rollout restart deployment/blockchain-node -n default
```
3. **Force Re-peering**
```bash
# Delete and recreate pods to force re-peering
kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
```
### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'
# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure
### Based on Chaos Test: `chaos_test_database.py`
### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates
### MTTR Target: 3 minutes
### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready
# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"
# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```
### Investigation (3-10 minutes)
1. **Review Database Logs**
```bash
kubectl logs -n default deployment/postgresql --tail=100
```
2. **Check Resource Usage**
```bash
kubectl top pods -n default -l app.kubernetes.io/name=postgresql
kubectl exec -n default deployment/postgresql -- df -h /var/lib/postgresql/data
```
3. **Identify Long-running Queries**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
### Recovery Actions
1. **Kill Idle Connections**
```bash
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
```
2. **Restart PostgreSQL**
```bash
kubectl rollout restart deployment/postgresql -n default
```
3. **Failover to Replica**
```bash
# Promote replica if primary fails
kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
```
### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"
# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure
### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times
### MTTR Target: 2 minutes
### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping
# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human
# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```
### Investigation (2-5 minutes)
1. **Review Redis Logs**
```bash
kubectl logs -n default deployment/redis --tail=100
```
2. **Check for Eviction**
```bash
kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
```
3. **Identify Large Keys**
```bash
kubectl exec -n default deployment/redis -- redis-cli --bigkeys
```
### Recovery Actions
1. **Clear Expired Keys**
```bash
# Deletes every key matching the pattern inside the Redis pod; narrow the pattern before running in production
kubectl exec -n default deployment/redis -- sh -c 'redis-cli --scan --pattern "*:*" | xargs -r redis-cli del'
```
2. **Restart Redis**
```bash
kubectl rollout restart deployment/redis -n default
```
3. **Scale Redis Cluster**
```bash
kubectl scale deployment/redis --replicas=3 -n default
```
### Verification
```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: High CPU/Memory Usage
### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation
### MTTR Target: 5 minutes
### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes
# 2. Identify resource-hungry pods
kubectl exec -n default deployment/coordinator -- top
# 3. Check for OOM kills
dmesg | grep -i "killed process"
```
### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
```bash
# Detailed pod metrics
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
```
2. **Check Resource Limits**
```bash
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
```
3. **Review Application Metrics**
```bash
# Check Prometheus metrics
curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
```
### Recovery Actions
1. **Scale Services**
```bash
kubectl scale deployment/coordinator --replicas=5 -n default
kubectl scale deployment/blockchain-node --replicas=3 -n default
```
2. **Increase Resource Limits**
```bash
kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
```
3. **Restart Affected Services**
```bash
kubectl rollout restart deployment/coordinator -n default
```
### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'
# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues
### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes
### MTTR Target: 10 minutes
### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h
# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G
# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```
### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
```bash
du -sh /var/log/*
du -sh /var/lib/docker/*
```
2. **Check PVC Usage**
```bash
kubectl get pvc -n default
kubectl describe pvc postgresql-data -n default
```
3. **Review Retention Policies**
```bash
kubectl get cronjobs -n default
kubectl describe cronjob log-cleanup -n default
```
### Recovery Actions
1. **Expand Storage**
```bash
kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
2. **Force Cleanup**
```bash
# Clean old logs
find /var/log -name "*.log" -mtime +7 -delete
# Clean Docker images
docker system prune -a
```
3. **Restart Services**
```bash
kubectl rollout restart deployment/postgresql -n default
```
### Verification
```bash
# Check disk space
df -h
# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures
### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)
### War Room Activation
```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)
# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto
# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```
### Customer Communication
1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)
## Post-Incident Checklist
### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified
### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled
### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated
## Runbook Maintenance
### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision
### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation
---
*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*

docs/operator/index.md
@@ -0,0 +1,40 @@
# AITBC Operator Documentation
Welcome to the AITBC operator documentation. This section contains resources for deploying, operating, and maintaining AITBC infrastructure.
## Deployment
- [Deployment Guide](deployment/run.md) - How to deploy AITBC components
- [Installation](deployment/installation.md) - System requirements and installation
- [Configuration](deployment/configuration.md) - Configuration options
- [Ports](deployment/ports.md) - Network ports and requirements
## Operations
- [Backup & Restore](backup_restore.md) - Data backup and recovery procedures
- [Security](security.md) - Security best practices and hardening
- [Monitoring](monitoring/monitoring-playbook.md) - System monitoring and observability
- [Incident Response](incident-runbooks.md) - Incident handling procedures
## Architecture
- [System Architecture](../reference/architecture/) - Understanding AITBC architecture
- [Components](../reference/architecture/) - Component documentation
- [Multi-tenancy](../reference/architecture/) - Multi-tenant infrastructure
## Scaling
- [Scaling Guide](scaling.md) - How to scale AITBC infrastructure
- [Performance Tuning](performance.md) - Performance optimization
- [Capacity Planning](capacity.md) - Resource planning
## Reference
- [Glossary](../reference/glossary.md) - Terms and definitions
- [Troubleshooting](../user-guide/troubleshooting.md) - Common issues and solutions
- [FAQ](../user-guide/faq.md) - Frequently asked questions
## Support
- [Getting Help](../user-guide/support.md) - How to get support
- [Contact](../user-guide/support.md) - Contact information

@@ -0,0 +1,449 @@
# AITBC Monitoring Playbook & On-Call Guide
## Overview
This document provides comprehensive monitoring procedures, on-call rotations, and incident response playbooks for the AITBC platform. It ensures reliable operation of all services and quick resolution of issues.
## Service Overview
### Core Services
- **Coordinator API**: Job management and marketplace coordination
- **Blockchain Nodes**: Consensus and transaction processing
- **Explorer UI**: Block explorer and transaction visualization
- **Marketplace UI**: User interface for marketplace operations
- **Wallet Daemon**: Cryptographic key management
- **Infrastructure**: PostgreSQL, Redis, Kubernetes cluster
### Critical Metrics
- **Availability**: 99.9% uptime SLA
- **Performance**: <200ms API response time (95th percentile)
- **Throughput**: 100+ TPS sustained
- **MTTR**: <2 minutes for critical incidents
## On-Call Rotation
### Rotation Schedule
- **Primary On-Call**: 1 week rotation, Monday 00:00 UTC to Monday 00:00 UTC
- **Secondary On-Call**: Shadow primary, handles escalations
- **Tertiary**: Backup for both primary and secondary
- **Rotation Handoff**: Every Monday at 08:00 UTC
### Team Structure
```
Week 1: Alice (Primary), Bob (Secondary), Carol (Tertiary)
Week 2: Bob (Primary), Carol (Secondary), Alice (Tertiary)
Week 3: Carol (Primary), Alice (Secondary), Bob (Tertiary)
```
### Handoff Procedures
1. **Pre-handoff Check** (Sunday 22:00 UTC):
- Review active incidents
- Check scheduled maintenance
- Verify monitoring systems health
2. **Handoff Meeting** (Monday 08:00 UTC):
- 15-minute video call
- Discuss current issues
- Transfer knowledge
- Confirm contact information
3. **Post-handoff** (Monday 09:00 UTC):
- Primary acknowledges receipt
- Update on-call calendar
- Test alerting systems
### Contact Information
- **Primary**: +1-555-ONCALL-1 (PagerDuty)
- **Secondary**: +1-555-ONCALL-2 (PagerDuty)
- **Tertiary**: +1-555-ONCALL-3 (PagerDuty)
- **Escalation Manager**: +1-555-ESCALATE
- **Emergency**: +1-555-EMERGENCY (Critical infrastructure only)
## Alerting & Escalation
### Alert Severity Levels
#### Critical (P0)
- Service completely down
- Data loss or corruption
- Security breach
- SLA violation in progress
- **Response Time**: 5 minutes
- **Escalation**: 15 minutes if no response
#### High (P1)
- Significant degradation
- Partial service outage
- High error rates (>10%)
- **Response Time**: 15 minutes
- **Escalation**: 1 hour if no response
#### Medium (P2)
- Minor degradation
- Elevated error rates (5-10%)
- Performance issues
- **Response Time**: 1 hour
- **Escalation**: 4 hours if no response
#### Low (P3)
- Informational alerts
- Non-critical issues
- **Response Time**: 4 hours
- **Escalation**: 24 hours if no response
### Escalation Policy
1. **Level 1**: Primary On-Call (5-60 minutes)
2. **Level 2**: Secondary On-Call (15 minutes - 4 hours)
3. **Level 3**: Tertiary On-Call (1 hour - 24 hours)
4. **Level 4**: Engineering Manager (4 hours)
5. **Level 5**: CTO (Critical incidents only)
### Alert Channels
- **PagerDuty**: Primary alerting system
- **Slack**: #on-call-aitbc channel
- **Email**: oncall@aitbc.io
- **SMS**: Critical alerts only
- **Phone**: Critical incidents only
## Incident Response
### Incident Classification
#### SEV-0 (Critical)
- Complete service outage
- Data loss or security breach
- Financial impact >$10,000/hour
- Customer impact >50%
#### SEV-1 (High)
- Significant service degradation
- Feature unavailable
- Financial impact $1,000-$10,000/hour
- Customer impact 10-50%
#### SEV-2 (Medium)
- Minor service degradation
- Performance issues
- Financial impact <$1,000/hour
- Customer impact <10%
#### SEV-3 (Low)
- Informational
- No customer impact
### Incident Response Process
#### 1. Detection & Triage (0-5 minutes)
- Check alert severity
- Verify impact
- Create incident channel
- Notify stakeholders
#### 2. Assessment (5-15 minutes)
- Determine scope
- Identify root cause area
- Estimate resolution time
- Declare severity level
#### 3. Communication (15-30 minutes)
- Update status page
- Notify customers (if needed)
- Internal stakeholder updates
- Set up war room
#### 4. Resolution (Varies)
- Implement fix
- Verify resolution
- Monitor for recurrence
- Document actions
#### 5. Recovery (30-60 minutes)
- Full service restoration
- Performance validation
- Customer communication
- Incident closure
## Service-Specific Runbooks
### Coordinator API
#### High Error Rate
**Symptoms**: 5xx errors >5%, response time >500ms
**Runbook**:
1. Check pod health: `kubectl get pods -l app=coordinator`
2. Review logs: `kubectl logs -f deployment/coordinator`
3. Check database connectivity
4. Verify Redis connection
5. Scale if needed: `kubectl scale deployment coordinator --replicas=5`
#### Service Unavailable
**Symptoms**: 503 errors, health check failures
**Runbook**:
1. Check deployment status
2. Review recent deployments
3. Rollback if necessary
4. Check resource limits
5. Verify ingress configuration
### Blockchain Nodes
#### Consensus Stalled
**Symptoms**: No new blocks, high finality latency
**Runbook**:
1. Check node sync status
2. Verify network connectivity
3. Review validator set
4. Check governance proposals
5. Restart if needed (with caution)
#### High Peer Drop Rate
**Symptoms**: Connected peers <50%, network partition
**Runbook**:
1. Check network policies
2. Verify DNS resolution
3. Review firewall rules
4. Check load balancer health
5. Restart networking components
### Database (PostgreSQL)
#### Connection Exhaustion
**Symptoms**: "Too many connections" errors
**Runbook**:
1. Check active connections
2. Identify long-running queries
3. Kill idle connections
4. Increase pool size if needed
5. Scale database
#### Replica Lag
**Symptoms**: Read replica lag >10 seconds
**Runbook**:
1. Check replica status
2. Review network latency
3. Verify disk space
4. Restart replication if needed
5. Failover if necessary
### Redis
#### Memory Pressure
**Symptoms**: OOM errors, high eviction rate
**Runbook**:
1. Check memory usage
2. Review key expiration
3. Clean up unused keys
4. Scale Redis cluster
5. Optimize data structures
#### Connection Issues
**Symptoms**: Connection timeouts, errors
**Runbook**:
1. Check max connections
2. Review connection pool
3. Verify network policies
4. Restart Redis if needed
5. Scale horizontally
## Monitoring Dashboards
### Primary Dashboards
#### 1. System Overview
- Service health status
- Error rates (4xx/5xx)
- Response times
- Throughput metrics
- Resource utilization
#### 2. Infrastructure
- Kubernetes cluster health
- Node resource usage
- Pod status and restarts
- Network traffic
- Storage capacity
#### 3. Application Metrics
- Job submission rates
- Transaction processing
- Marketplace activity
- Wallet operations
- Mining statistics
#### 4. Business KPIs
- Active users
- Transaction volume
- Revenue metrics
- Customer satisfaction
- SLA compliance
### Alert Rules
#### Critical Alerts
- Service down >1 minute
- Error rate >10%
- Response time >1 second
- Disk space >90%
- Memory usage >95%
#### Warning Alerts
- Error rate >5%
- Response time >500ms
- CPU usage >80%
- Queue depth >1000
- Replica lag >5s
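As a sketch of how two of these thresholds could be expressed as Prometheus rules (the metric names are assumptions; adapt them to the exporters actually deployed):

```yaml
groups:
  - name: aitbc-availability
    rules:
      - alert: ServiceDown
        expr: up{job=~"coordinator-api|blockchain-node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} has been down for more than 1 minute"
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job) > 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} 5xx error rate is above 10%"
```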
## SLOs & SLIs
### Service Level Objectives
| Service | Metric | Target | Measurement |
|---------|--------|--------|-------------|
| Coordinator API | Availability | 99.9% | 30-day rolling |
| Coordinator API | Latency | <200ms | 95th percentile |
| Blockchain | Block Time | <2s | 24-hour average |
| Marketplace | Success Rate | 99.5% | Daily |
| Explorer | Response Time | <500ms | 95th percentile |
### Service Level Indicators
#### Availability
- HTTP status codes
- Health check responses
- Pod readiness status
#### Latency
- Request duration histogram
- Database query times
- External API calls
#### Throughput
- Requests per second
- Transactions per block
- Jobs completed per hour
#### Quality
- Error rates
- Success rates
- Customer satisfaction
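A hedged example of how the availability and latency SLIs could be computed in PromQL (metric names are assumptions based on standard HTTP instrumentation):

```
# Availability: share of non-5xx responses
sum(rate(http_requests_total{job="coordinator-api",status!~"5.."}[30d]))
  / sum(rate(http_requests_total{job="coordinator-api"}[30d]))

# Latency: 95th percentile request duration
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="coordinator-api"}[5m])) by (le))
```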
## Post-Incident Process
### Immediate Actions (0-1 hour)
1. Verify full resolution
2. Monitor for recurrence
3. Update status page
4. Notify stakeholders
### Post-Mortem (1-24 hours)
1. Create incident document
2. Gather timeline and logs
3. Identify root cause
4. Document lessons learned
### Follow-up (1-7 days)
1. Schedule post-mortem meeting
2. Assign action items
3. Update runbooks
4. Improve monitoring
### Review (Weekly)
1. Review incident trends
2. Update SLOs if needed
3. Adjust alerting thresholds
4. Improve processes
## Maintenance Windows
### Scheduled Maintenance
- **Frequency**: Weekly maintenance window
- **Time**: Sunday 02:00-04:00 UTC
- **Duration**: Maximum 2 hours
- **Notification**: 72 hours advance
### Emergency Maintenance
- **Approval**: Engineering Manager required
- **Notification**: 4 hours advance (if possible)
- **Duration**: As needed
- **Rollback**: Always required
## Tools & Systems
### Monitoring Stack
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Alertmanager**: Alert routing and management
- **PagerDuty**: On-call scheduling and escalation
### Observability
- **Jaeger**: Distributed tracing
- **Loki**: Log aggregation
- **Kiali**: Service mesh visualization
- **Kube-state-metrics**: Kubernetes metrics
### Communication
- **Slack**: Primary communication
- **Zoom**: War room meetings
- **Status Page**: Customer notifications
- **Email**: Formal communications
## Training & Onboarding
### New On-Call Engineer
1. Shadow primary for 1 week
2. Review all runbooks
3. Test alerting systems
4. Handle low-severity incidents
5. Solo on-call with mentor
### Ongoing Training
- Monthly incident drills
- Quarterly runbook updates
- Annual training refreshers
- Cross-team knowledge sharing
## Emergency Procedures
### Major Outage
1. Declare incident (SEV-0)
2. Activate war room
3. Customer communication
4. Executive updates
5. Recovery coordination
### Security Incident
1. Isolate affected systems
2. Preserve evidence
3. Notify security team
4. Customer notification
5. Regulatory compliance
### Data Loss
1. Stop affected services
2. Assess impact
3. Initiate recovery
4. Customer communication
5. Prevent recurrence
## Appendix
### A. Contact List
[Detailed contact information]
### B. Runbook Checklist
[Quick reference checklists]
### C. Alert Configuration
[Prometheus rules and thresholds]
### D. Dashboard Links
[Grafana dashboard URLs]
---
*Document Version: 1.0*
*Last Updated: 2024-12-22*
*Next Review: 2025-01-22*
*Owner: SRE Team*

docs/operator/security.md
@@ -0,0 +1,340 @@
# AITBC Security Documentation
This document outlines the security architecture, threat model, and implementation details for the AITBC platform.
## Overview
AITBC implements defense-in-depth security across multiple layers:
- Network security with TLS termination
- API authentication and authorization
- Secrets management and encryption
- Infrastructure security best practices
- Monitoring and incident response
## Threat Model
### Threat Actors
| Actor | Motivation | Capabilities | Impact |
|-------|-----------|--------------|--------|
| External attacker | Financial gain, disruption | Network access, exploits | High |
| Malicious insider | Data theft, sabotage | Internal access | Critical |
| Competitor | IP theft, market manipulation | Sophisticated attacks | High |
| Casual user | Accidental misuse | Limited knowledge | Low |
### Attack Vectors
1. **Network Attacks**
- Man-in-the-middle (MITM) attacks
- DDoS attacks
- Network reconnaissance
2. **API Attacks**
- Unauthorized access to marketplace
- API key leakage
- Rate limiting bypass
- Injection attacks
3. **Infrastructure Attacks**
- Container escape
- Pod-to-pod attacks
- Secrets exfiltration
- Supply chain attacks
4. **Blockchain-Specific Attacks**
- 51% attacks on consensus
- Transaction replay attacks
- Smart contract exploits
- Miner collusion
### Security Controls
| Control | Implementation | Mitigates |
|---------|----------------|-----------|
| TLS 1.3 | cert-manager + ingress | MITM, eavesdropping |
| API Keys | X-API-Key header | Unauthorized access |
| Rate Limiting | slowapi middleware | DDoS, abuse |
| Network Policies | Kubernetes NetworkPolicy | Pod-to-pod attacks |
| Secrets Mgmt | Kubernetes Secrets + SealedSecrets | Secrets exfiltration |
| RBAC | Kubernetes RBAC | Privilege escalation |
| Monitoring | Prometheus + AlertManager | Incident detection |
## Security Architecture
### Network Security
#### TLS Termination
```yaml
# Ingress configuration with TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
spec:
tls:
- hosts:
- api.aitbc.io
secretName: api-tls
```
#### Certificate Management
- Uses cert-manager for automatic certificate provisioning
- Supports Let's Encrypt for production
- Internal CA for development environments
- Automatic renewal 30 days before expiry
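For reference, a minimal `letsencrypt-prod` ClusterIssuer matching the ingress annotation above (the contact e-mail is a placeholder):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@aitbc.io          # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```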
### API Security
#### Authentication
- API key-based authentication for all services
- Keys stored in Kubernetes Secrets
- Per-service key rotation policies
- Audit logging for all authenticated requests
#### Authorization
- Role-based access control (RBAC)
- Resource-level permissions
- Rate limiting per API key
- IP whitelisting for sensitive operations
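Rate limiting is provided by the slowapi middleware listed in the controls table; a minimal sketch of limiting per API key (the limit value and helper names are illustrative, not the production configuration):

```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

def api_key_or_ip(request: Request) -> str:
    # Limit per API key when one is supplied, otherwise per client IP
    return request.headers.get("X-API-Key") or get_remote_address(request)

limiter = Limiter(key_func=api_key_or_ip)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/v1/jobs")
@limiter.limit("60/minute")
async def list_jobs(request: Request):
    return {"jobs": []}
```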
#### API Key Format
```
Header: X-API-Key: aitbc_prod_ak_1a2b3c4d5e6f7g8h9i0j
```
### Secrets Management
#### Kubernetes Secrets
- Base64 encoded secrets (not encrypted by default)
- Encrypted at rest with etcd encryption
- Access controlled via RBAC
#### SealedSecrets (Recommended for Production)
- Client-side encryption of secrets
- GitOps friendly
- Zero-knowledge encryption
#### Secret Rotation
- Automated rotation every 90 days
- Zero-downtime rotation for services
- Audit trail of all rotations
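A minimal rotation, assuming the secret and deployment names used elsewhere in this repository (adjust to the consumers in your cluster):

```bash
# Generate and apply a new key value in place
kubectl create secret generic coordinator-api-keys \
  --from-literal=api-key-prod="$(openssl rand -hex 32)" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart consumers so they pick up the rotated key
kubectl rollout restart deployment/coordinator-api -n default
```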
## Implementation Details
### 1. TLS Configuration
#### Coordinator API
```yaml
# Helm values for coordinator
ingress:
enabled: true
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
tls:
- secretName: coordinator-tls
hosts:
- api.aitbc.io
```
#### Blockchain Node RPC
```yaml
# WebSocket with TLS
wss://api.aitbc.io:8080/ws
```
### 2. API Authentication Middleware
#### Coordinator API Implementation
```python
from fastapi import FastAPI, HTTPException, Request, Security
from fastapi.responses import JSONResponse
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=True)

async def verify_api_key(api_key: str = Security(api_key_header)):
    if not verify_key(api_key):  # verify_key: application-specific key lookup
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    if request.url.path.startswith("/v1/"):
        api_key = request.headers.get("X-API-Key")
        if not verify_key(api_key):
            # Exceptions raised inside middleware bypass FastAPI's handlers, so return the response directly
            return JSONResponse(status_code=403, content={"detail": "API key required"})
    return await call_next(request)
```
### 3. Secrets Management Setup
#### SealedSecrets Installation
```bash
# Install sealed-secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system
# Create a sealed secret
kubeseal --format yaml < secret.yaml > sealed-secret.yaml
```
#### Example Secret Structure
```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: coordinator-api-keys
spec:
encryptedData:
api-key-prod: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
api-key-dev: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
```
### 4. Network Policies
#### Default Deny Policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
```
#### Service-Specific Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: coordinator-api-netpol
spec:
podSelector:
matchLabels:
app: coordinator-api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: ingress-nginx
ports:
- protocol: TCP
port: 8011
```
## Security Best Practices
### Development Environment
- Use 127.0.0.2 for local development (not 0.0.0.0)
- Separate API keys for dev/staging/prod
- Enable debug logging only in development
- Use self-signed certificates for local TLS
### Production Environment
- Enable all security headers
- Implement comprehensive logging
- Use external secret management
- Regular security audits
- Penetration testing quarterly
### Monitoring and Alerting
#### Security Metrics
- Failed authentication attempts
- Unusual API usage patterns
- Certificate expiry warnings
- Secret access audits
#### Alert Rules
```yaml
- alert: HighAuthFailureRate
expr: rate(auth_failures_total[5m]) > 10
for: 2m
labels:
severity: warning
annotations:
summary: "High authentication failure rate detected"
- alert: CertificateExpiringSoon
expr: cert_certificate_expiry_time < time() + 86400 * 7
for: 1h
labels:
severity: critical
annotations:
summary: "Certificate expires in less than 7 days"
```
## Incident Response
### Security Incident Categories
1. **Critical**: Data breach, system compromise
2. **High**: Service disruption, privilege escalation
3. **Medium**: Suspicious activity, policy violation
4. **Low**: Misconfiguration, minor issue
### Response Procedures
1. **Detection**: Automated alerts, manual monitoring
2. **Assessment**: Impact analysis, containment
3. **Remediation**: Patch, rotate credentials, restore
4. **Post-mortem**: Document, improve controls
### Emergency Contacts
- Security Team: security@aitbc.io
- On-call Engineer: +1-555-SECURITY
- Incident Commander: incident@aitbc.io
## Compliance
### Data Protection
- GDPR compliance for EU users
- CCPA compliance for California users
- Data retention policies
- Right to deletion implementation
### Auditing
- Quarterly security audits
- Annual penetration testing
- Continuous vulnerability scanning
- Third-party security assessments
## Security Checklist
### Pre-deployment
- [ ] All API endpoints require authentication
- [ ] TLS certificates valid and properly configured
- [ ] Secrets encrypted and access-controlled
- [ ] Network policies implemented
- [ ] RBAC configured correctly
- [ ] Monitoring and alerting active
- [ ] Backup encryption enabled
- [ ] Security headers configured
### Post-deployment
- [ ] Security testing completed
- [ ] Documentation updated
- [ ] Team trained on procedures
- [ ] Incident response tested
- [ ] Compliance verified
## References
- [OWASP API Security Top 10](https://owasp.org/www-project-api-security/)
- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)
- [NIST Cybersecurity Framework](https://www.nist.gov/cyberframework)
- [CERT Coordination Center](https://www.cert.org/)
## Security Updates
This document is updated regularly. Last updated: 2024-12-22
For questions or concerns, contact the security team at security@aitbc.io