chore: standardize configuration, logging, and error handling across blockchain node and coordinator API
- Add infrastructure.md and workflow files to .gitignore to prevent sensitive info leaks
- Change blockchain node mempool backend default from memory to database for persistence
- Refactor blockchain node logger with StructuredLogFormatter and AuditLogger (consistent with coordinator)
- Add structured logging fields: service, module, function, line number
- Unify coordinator config with Database
docs/7_deployment/0_index.md (Normal file, 20 lines)
@@ -0,0 +1,20 @@
# Deployment & Operations

Deploy, operate, and maintain AITBC infrastructure.

## Reading Order

| # | File | What you learn |
|---|------|----------------|
| 1 | [1_remote-deployment-guide.md](./1_remote-deployment-guide.md) | Deploy to remote servers |
| 2 | [2_service-naming-convention.md](./2_service-naming-convention.md) | Systemd service names and standards |
| 3 | [3_backup-restore.md](./3_backup-restore.md) | Backup PostgreSQL, Redis, ledger data |
| 4 | [4_incident-runbooks.md](./4_incident-runbooks.md) | Handle outages and incidents |
| 5 | [5_marketplace-deployment.md](./5_marketplace-deployment.md) | Deploy GPU marketplace endpoints |
| 6 | [6_beta-release-plan.md](./6_beta-release-plan.md) | Beta release checklist and timeline |

## Related

- [Installation](../0_getting_started/2_installation.md) — Initial setup
- [Security](../9_security/) — Security architecture and hardening
- [Architecture](../6_architecture/) — System design docs
docs/7_deployment/1_remote-deployment-guide.md (Normal file, 138 lines)
@@ -0,0 +1,138 @@
# AITBC Remote Deployment Guide

## Overview
This deployment strategy builds the blockchain node directly on the ns3 server to utilize its gigabit connection, avoiding slow uploads from localhost.

## Quick Start

### 1. Deploy Everything
```bash
./scripts/deploy/deploy-all-remote.sh
```

This will:
- Copy deployment scripts to ns3
- Copy blockchain source code from localhost
- Build the blockchain node directly on the server
- Deploy a lightweight HTML-based explorer
- Configure port forwarding

### 2. Access Services

**Blockchain Node RPC:**
- Internal: http://localhost:8082
- External: http://aitbc.keisanki.net:8082

**Blockchain Explorer:**
- Internal: http://localhost:3000
- External: http://aitbc.keisanki.net:3000

## Architecture

```
ns3-root (95.216.198.140)
├── Blockchain Node (port 8082)
│   ├── Auto-syncs on startup
│   └── Serves RPC API
└── Explorer (port 3000)
    ├── Static HTML/CSS/JS
    ├── Served by nginx
    └── Connects to local node
```

## Key Features

### Blockchain Node
- Built directly on the server from source code
- Source copied from localhost via scp
- Auto-sync on startup
- No large file uploads needed
- Uses the server's gigabit connection

### Explorer
- Pure HTML/CSS/JS (no build step)
- Served by nginx
- Real-time block viewing
- Transaction details
- Auto-refresh every 30 seconds

## Manual Deployment

If you need to deploy components separately:

### Blockchain Node Only
```bash
ssh ns3-root
cd /opt
./deploy-blockchain-remote.sh
```

### Explorer Only
```bash
ssh ns3-root
cd /opt
./deploy-explorer-remote.sh
```

## Troubleshooting

### Check Services
```bash
# On ns3 server
systemctl status blockchain-node blockchain-rpc nginx

# Check logs
journalctl -u blockchain-node -f
journalctl -u blockchain-rpc -f
journalctl -u nginx -f
```

### Test RPC
```bash
# From ns3
curl http://localhost:8082/rpc/head

# From external
curl http://aitbc.keisanki.net:8082/rpc/head
```

### Port Forwarding
If port forwarding doesn't work:
```bash
# Check iptables rules
iptables -t nat -L -n

# Re-add rules
iptables -t nat -A PREROUTING -p tcp --dport 8082 -j DNAT --to-destination 192.168.100.10:8082
iptables -t nat -A POSTROUTING -p tcp -d 192.168.100.10 --dport 8082 -j MASQUERADE
```

## Configuration

### Blockchain Node
Location: `/opt/blockchain-node/.env`
- Chain ID: ait-devnet
- RPC Port: 8082
- P2P Port: 7070
- Auto-sync: enabled

### Explorer
Location: `/opt/blockchain-explorer/index.html`
- Served by nginx on port 3000
- Connects to localhost:8082
- No configuration needed

## Security Notes

- Services run as root (simplified for dev)
- No authentication on RPC (dev only)
- Port forwarding exposes services externally
- Consider firewall rules for production

## Next Steps

1. Set up proper authentication
2. Configure HTTPS with SSL certificates
3. Add multiple peers for network resilience
4. Implement proper backup procedures
5. Set up monitoring and alerting
docs/7_deployment/2_service-naming-convention.md (Normal file, 85 lines)
@@ -0,0 +1,85 @@
# AITBC Service Naming Convention

## Updated Service Names (2026-02-13)

All AITBC systemd services now follow the `aitbc-` prefix convention for consistency and easier management.

### Site A (aitbc.bubuit.net) - Production Services

| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-1.service | 8081 | Blockchain Node 1 |
| blockchain-node-2.service | aitbc-blockchain-node-2.service | 8082 | Blockchain Node 2 |
| blockchain-rpc.service | aitbc-blockchain-rpc-1.service | - | RPC API for Node 1 |
| blockchain-rpc-2.service | aitbc-blockchain-rpc-2.service | - | RPC API for Node 2 |
| coordinator-api.service | aitbc-coordinator-api.service | 8000 | Coordinator API |
| exchange-mock-api.service | aitbc-exchange-mock-api.service | - | Exchange Mock API |

### Site B (ns3 container) - Remote Node

| Old Name | New Name | Port | Description |
|----------|----------|------|-------------|
| blockchain-node.service | aitbc-blockchain-node-3.service | 8082 | Blockchain Node 3 |
| blockchain-rpc.service | aitbc-blockchain-rpc-3.service | - | RPC API for Node 3 |

### Already Compliant Services
These services already had the `aitbc-` prefix:
- aitbc-exchange-api.service (port 3003)
- aitbc-exchange.service (port 3002)
- aitbc-miner-dashboard.service

### Removed Services
- aitbc-blockchain.service (legacy, was on port 9080)

## Management Commands

### Check Service Status
```bash
# Site A (via SSH)
ssh aitbc-cascade "systemctl status aitbc-blockchain-node-1.service"

# Site B (via SSH)
ssh ns3-root "incus exec aitbc -- systemctl status aitbc-blockchain-node-3.service"
```

### Restart Services
```bash
# Site A
ssh aitbc-cascade "sudo systemctl restart aitbc-blockchain-node-1.service"

# Site B
ssh ns3-root "incus exec aitbc -- sudo systemctl restart aitbc-blockchain-node-3.service"
```

### View Logs
```bash
# Site A
ssh aitbc-cascade "journalctl -u aitbc-blockchain-node-1.service -f"

# Site B
ssh ns3-root "incus exec aitbc -- journalctl -u aitbc-blockchain-node-3.service -f"
```

## Service Dependencies

### Blockchain Nodes
- Node 1: `/opt/blockchain-node` → port 8081
- Node 2: `/opt/blockchain-node-2` → port 8082
- Node 3: `/opt/blockchain-node` → port 8082 (Site B)

### RPC Services
- RPC services are companion services to the main nodes
- They provide HTTP API endpoints for blockchain operations

### Coordinator API
- Main API for job submission, miner management, and receipts
- Runs on localhost:8000 inside the container
- Proxied via nginx at https://aitbc.bubuit.net/api/

## Benefits of Standardized Naming

1. **Clarity**: Easy to identify AITBC services among system services
2. **Management**: Simpler to filter and manage with wildcards (`systemctl status 'aitbc-*'`, quoted so the shell does not expand the glob)
3. **Documentation**: Consistent naming across all documentation
4. **Automation**: Easier scripting and automation with predictable names
5. **Debugging**: Faster identification of service-related issues
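The same prefix filtering works in plain shell scripts that consume service-name lists (from inventories, monitoring exports, and so on). A minimal sketch; `filter_aitbc` is a hypothetical helper, not part of the AITBC tooling:

```shell
# Hypothetical helper: pick out AITBC units from a list of service names,
# mirroring what `systemctl list-units 'aitbc-*'` does on a live host.
filter_aitbc() {
  grep '^aitbc-'
}

printf '%s\n' nginx.service aitbc-coordinator-api.service aitbc-blockchain-node-1.service \
  | filter_aitbc
# → aitbc-coordinator-api.service
# → aitbc-blockchain-node-1.service
```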
docs/7_deployment/3_backup-restore.md (Normal file, 316 lines)
@@ -0,0 +1,316 @@
# AITBC Backup and Restore Procedures

This document outlines the backup and restore procedures for all AITBC system components, including PostgreSQL, Redis, and blockchain ledger storage.

## Overview

The AITBC platform implements a comprehensive backup strategy with:
- **Automated daily backups** via Kubernetes CronJobs
- **Manual backup capabilities** for on-demand operations
- **Incremental and full backup options** for ledger data
- **Cloud storage integration** for off-site backups
- **Retention policies** to manage storage efficiently

## Components

### 1. PostgreSQL Database
- **Location**: Coordinator API persistent storage
- **Data**: Jobs, marketplace offers/bids, user sessions, configuration
- **Backup Format**: Custom PostgreSQL dump with compression
- **Retention**: 30 days (configurable)

### 2. Redis Cache
- **Location**: In-memory cache with persistence
- **Data**: Session cache, temporary data, rate limiting
- **Backup Format**: RDB snapshot + AOF (if enabled)
- **Retention**: 30 days (configurable)

### 3. Ledger Storage
- **Location**: Blockchain node persistent storage
- **Data**: Blocks, transactions, receipts, wallet states
- **Backup Format**: Compressed tar archives
- **Retention**: 30 days (configurable)

## Automated Backups

### Kubernetes CronJob

The automated backup system runs daily at 2:00 AM UTC:

```bash
# Deploy the backup CronJob
kubectl apply -f infra/k8s/backup-cronjob.yaml

# Check CronJob status
kubectl get cronjob aitbc-backup

# View backup jobs
kubectl get jobs -l app=aitbc-backup

# View backup logs
kubectl logs job/aitbc-backup-<timestamp>
```

### Backup Schedule

| Time (UTC) | Component  | Type | Retention |
|------------|------------|------|-----------|
| 02:00      | PostgreSQL | Full | 30 days   |
| 02:01      | Redis      | Full | 30 days   |
| 02:02      | Ledger     | Full | 30 days   |
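The times in the table come from the CronJob's `0 2 * * *` schedule. A quick way to sanity-check a cron expression's five fields from the shell (`parse_cron` is an illustrative helper, not part of the backup tooling):

```shell
# Split a cron expression into its five fields
# (minute, hour, day-of-month, month, day-of-week).
# "0 2 * * *" therefore fires at 02:00 UTC every day.
parse_cron() {
  echo "$1" | awk '{printf "minute=%s hour=%s dom=%s month=%s dow=%s\n", $1, $2, $3, $4, $5}'
}

parse_cron "0 2 * * *"
# → minute=0 hour=2 dom=* month=* dow=*
```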
## Manual Backups

### PostgreSQL

```bash
# Create a manual backup
./infra/scripts/backup_postgresql.sh default my-backup-$(date +%Y%m%d)

# View available backups
ls -la /tmp/postgresql-backups/

# Upload to S3 manually
aws s3 cp /tmp/postgresql-backups/my-backup.sql.gz s3://aitbc-backups-default/postgresql/
```

### Redis

```bash
# Create a manual backup
./infra/scripts/backup_redis.sh default my-redis-backup-$(date +%Y%m%d)

# Force background save before backup
kubectl exec -n default deployment/redis -- redis-cli BGSAVE
```

### Ledger Storage

```bash
# Create a full backup
./infra/scripts/backup_ledger.sh default my-ledger-backup-$(date +%Y%m%d)

# Create an incremental backup
./infra/scripts/backup_ledger.sh default incremental-backup-$(date +%Y%m%d) true
```
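Incremental tar archives of this kind are typically built with GNU tar snapshot files; the sketch below shows the mechanism under that assumption (the actual internals of `backup_ledger.sh` may differ, and the paths here are illustrative):

```shell
# First run against an empty snapshot file produces a full archive;
# later runs against the same snapshot archive only files changed since.
DATA_DIR=$(mktemp -d)          # stand-in for the ledger data directory
SNAP=$DATA_DIR.snar            # tar's incremental state file

echo "block-1" > "$DATA_DIR/block1"
tar -cz -g "$SNAP" -f /tmp/ledger-full.tar.gz -C "$DATA_DIR" .

echo "block-2" > "$DATA_DIR/block2"
tar -cz -g "$SNAP" -f /tmp/ledger-incr.tar.gz -C "$DATA_DIR" .

# The incremental archive lists only the new file (plus directory metadata)
tar -tzf /tmp/ledger-incr.tar.gz
```

Restoring requires extracting the full archive first, then each incremental in order.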
## Restore Procedures

### PostgreSQL Restore

```bash
# List available backups
aws s3 ls s3://aitbc-backups-default/postgresql/

# Download backup from S3
aws s3 cp s3://aitbc-backups-default/postgresql/postgresql-backup-20231222_020000.sql.gz /tmp/

# Restore database
./infra/scripts/restore_postgresql.sh default /tmp/postgresql-backup-20231222_020000.sql.gz

# Verify restore
kubectl exec -n default deployment/coordinator-api -- curl -s http://localhost:8011/v1/health
```

### Redis Restore

```bash
# Stop Redis service
kubectl scale deployment redis --replicas=0 -n default

# Clear existing data
# (exec/cp below need a running pod; with replicas=0, operate on the
# persistent volume from a temporary helper pod instead)
kubectl exec -n default deployment/redis -- rm -f /data/dump.rdb /data/appendonly.aof

# Copy backup file
kubectl cp /tmp/redis-backup.rdb default/redis-0:/data/dump.rdb

# Start Redis service
kubectl scale deployment redis --replicas=1 -n default

# Verify restore
kubectl exec -n default deployment/redis -- redis-cli DBSIZE
```

### Ledger Restore

```bash
# Stop blockchain nodes
kubectl scale deployment blockchain-node --replicas=0 -n default

# Extract backup
tar -xzf /tmp/ledger-backup-20231222_020000.tar.gz -C /tmp/

# Copy ledger data
kubectl cp /tmp/chain/ default/blockchain-node-0:/app/data/chain/
kubectl cp /tmp/wallets/ default/blockchain-node-0:/app/data/wallets/
kubectl cp /tmp/receipts/ default/blockchain-node-0:/app/data/receipts/

# Start blockchain nodes
kubectl scale deployment blockchain-node --replicas=3 -n default

# Verify restore
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/blocks/head
```
## Disaster Recovery

### Recovery Time Objective (RTO)

| Component  | RTO Target | Notes                        |
|------------|------------|------------------------------|
| PostgreSQL | 1 hour     | Database restore from backup |
| Redis      | 15 minutes | Cache rebuild from backup    |
| Ledger     | 2 hours    | Full chain synchronization   |

### Recovery Point Objective (RPO)

| Component  | RPO Target | Notes                            |
|------------|------------|----------------------------------|
| PostgreSQL | 24 hours   | Daily backups                    |
| Redis      | 24 hours   | Daily backups                    |
| Ledger     | 24 hours   | Daily full + incremental backups |

### Disaster Recovery Steps

1. **Assess Impact**
   ```bash
   # Check component status
   kubectl get pods -n default
   kubectl get events --sort-by=.metadata.creationTimestamp
   ```

2. **Restore Critical Services**
   ```bash
   # Restore PostgreSQL first (critical for operations)
   ./infra/scripts/restore_postgresql.sh default [latest-backup]

   # Restore Redis cache
   ./restore_redis.sh default [latest-backup]

   # Restore ledger data
   ./restore_ledger.sh default [latest-backup]
   ```

3. **Verify System Health**
   ```bash
   # Check all services
   kubectl get pods -n default

   # Verify API endpoints
   curl -s http://coordinator-api:8011/v1/health
   curl -s http://blockchain-node:8080/v1/health
   ```
## Monitoring and Alerting

### Backup Monitoring

Prometheus metrics track backup success/failure:

```yaml
# AlertManager rules for backups
- alert: BackupFailed
  expr: backup_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Backup failed for {{ $labels.component }}"
    description: "Backup for {{ $labels.component }} has failed for 5 minutes"
```

### Log Monitoring

```bash
# View backup logs
kubectl logs -l app=aitbc-backup -n default --tail=100

# Monitor backup CronJob
kubectl get cronjob aitbc-backup -w
```
## Best Practices

### Backup Security

1. **Encryption**: Backups uploaded to S3 use server-side encryption
2. **Access Control**: IAM policies restrict backup access
3. **Retention**: Automatic cleanup of old backups
4. **Validation**: Regular restore testing

### Performance Considerations

1. **Off-Peak Backups**: Scheduled during low traffic (2 AM UTC)
2. **Sequential Processing**: Components are backed up sequentially to avoid I/O contention
3. **Compression**: All backups are compressed to save storage
4. **Incremental Backups**: Ledger supports incremental backups to reduce size

### Testing

1. **Monthly Restore Tests**: Validate backup integrity
2. **Disaster Recovery Drills**: Quarterly full-scenario testing
3. **Documentation Updates**: Keep procedures current
## Troubleshooting

### Common Issues

#### Backup Fails with "Permission Denied"
```bash
# Check service account permissions
kubectl describe serviceaccount backup-service-account
kubectl describe role backup-role
```

#### Restore Fails with "Database in Use"
```bash
# Scale down the application before restoring
kubectl scale deployment coordinator-api --replicas=0
# Perform the restore
# Scale up after the restore
kubectl scale deployment coordinator-api --replicas=3
```

#### Ledger Restore Incomplete
```bash
# Verify backup integrity
tar -tzf ledger-backup.tar.gz
# Check metadata.json for block height
jq '.latest_block_height' metadata.json
```

### Getting Help

1. Check logs: `kubectl logs -l app=aitbc-backup`
2. Verify storage: `df -h` on backup nodes
3. Check network: test S3 connectivity
4. Review events: `kubectl get events --sort-by=.metadata.creationTimestamp`
## Configuration

### Environment Variables

| Variable              | Default       | Description               |
|-----------------------|---------------|---------------------------|
| BACKUP_RETENTION_DAYS | 30            | Days to keep backups      |
| BACKUP_SCHEDULE       | 0 2 * * *     | Cron schedule for backups |
| S3_BUCKET_PREFIX      | aitbc-backups | S3 bucket name prefix     |
| COMPRESSION_LEVEL     | 6             | gzip compression level    |
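A minimal sketch of how a `COMPRESSION_LEVEL` setting typically maps onto gzip; the variable name matches the table above, but the pipeline itself is illustrative rather than the scripts' actual code:

```shell
# Compress a payload at the configured level and verify the archive.
# Levels run 1 (fastest) to 9 (smallest); 6 is gzip's default trade-off.
COMPRESSION_LEVEL=6
printf 'sample backup payload\n' | gzip -"$COMPRESSION_LEVEL" > /tmp/sample.sql.gz
gzip -t /tmp/sample.sql.gz && echo "archive OK"
# → archive OK
```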
### Customizing Backup Schedule

Edit the CronJob schedule in `infra/k8s/backup-cronjob.yaml`:

```yaml
spec:
  schedule: "0 3 * * *"  # Change to 3 AM UTC
```

### Adjusting Retention

Modify retention in each backup script:

```bash
# In backup_*.sh scripts
RETENTION_DAYS=60  # Keep for 60 days instead of 30
```
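Retention of this kind usually reduces to an age-based `find` sweep over the backup directory. A hedged sketch; the directory and filename pattern are illustrative and not necessarily what the `backup_*.sh` scripts do internally:

```shell
# Delete archives older than RETENTION_DAYS from the local staging directory
RETENTION_DAYS=30
BACKUP_DIR=${BACKUP_DIR:-/tmp/postgresql-backups}
mkdir -p "$BACKUP_DIR"
find "$BACKUP_DIR" -name '*.sql.gz' -mtime +"$RETENTION_DAYS" -delete
```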
docs/7_deployment/4_incident-runbooks.md (Normal file, 498 lines)
@@ -0,0 +1,498 @@
# AITBC Incident Runbooks

This document contains specific runbooks for common incident scenarios, based on our chaos testing validation and integration test suite.

## Integration Test Status (Updated 2026-01-26)

### Current Test Coverage
- ✅ 6 integration tests passing
- ✅ Security tests using real ZK proof features
- ✅ Marketplace tests connecting to live service
- ⏸️ 1 test skipped (wallet payment flow)

### Test Environment
- Tests run against both real and mock clients
- CI/CD pipeline runs the full test suite
- Local development: `python -m pytest tests/integration/ -v`
## Runbook: Coordinator API Outage

### Based on Chaos Test: `chaos_test_coordinator.py`

### Symptoms
- 503/504 errors on all endpoints
- Health check failures
- Job submission failures
- Marketplace unresponsive

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check pod status
kubectl get pods -n default -l app.kubernetes.io/name=coordinator

# 2. Check recent events
kubectl get events -n default --sort-by=.metadata.creationTimestamp | tail -20

# 3. Check if pods are crashlooping
kubectl describe pod -n default -l app.kubernetes.io/name=coordinator

# 4. Quick restart if needed
kubectl rollout restart deployment/coordinator -n default
```

### Investigation (2-10 minutes)
1. **Review Logs**
   ```bash
   kubectl logs -n default deployment/coordinator --tail=100
   ```

2. **Check Resource Limits**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=coordinator
   ```

3. **Verify Database Connectivity**
   ```bash
   kubectl exec -n default deployment/coordinator -- nc -z postgresql 5432
   ```

4. **Check Redis Connection**
   ```bash
   kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping
   ```

### Recovery Actions
1. **Scale Up if Resource Starved**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   ```

2. **Manual Pod Deletion if Stuck**
   ```bash
   kubectl delete pods -n default -l app.kubernetes.io/name=coordinator --force --grace-period=0
   ```

3. **Rollback Deployment**
   ```bash
   kubectl rollout undo deployment/coordinator -n default
   ```

### Verification
```bash
# Test health endpoint
curl -f http://127.0.0.2:8011/v1/health

# Test API with a sample request
curl -X GET http://127.0.0.2:8011/v1/jobs -H "X-API-Key: test-key"
```
## Runbook: Network Partition

### Based on Chaos Test: `chaos_test_network.py`

### Symptoms
- Blockchain nodes not communicating
- Consensus stalled
- High finality latency
- Transaction processing delays

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq

# 2. Check consensus status
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq

# 3. Check network policies
kubectl get networkpolicies -n default
```

### Investigation (5-15 minutes)
1. **Identify Partitioned Nodes**
   ```bash
   # Check each node's peer count
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     echo "Pod: $pod"
     kubectl exec -n default $pod -- curl -s http://localhost:8080/v1/peers | jq '. | length'
   done
   ```

2. **Check Network Policies**
   ```bash
   kubectl describe networkpolicy default-deny-all-ingress -n default
   kubectl describe networkpolicy blockchain-node-netpol -n default
   ```

3. **Verify DNS Resolution**
   ```bash
   kubectl exec -n default deployment/blockchain-node -- nslookup blockchain-node
   ```

### Recovery Actions
1. **Remove Problematic Network Rules**
   ```bash
   # Flush iptables rules on affected nodes
   for pod in $(kubectl get pods -n default -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}'); do
     kubectl exec -n default $pod -- iptables -F
   done
   ```

2. **Restart Network Components**
   ```bash
   kubectl rollout restart deployment/blockchain-node -n default
   ```

3. **Force Re-peering**
   ```bash
   # Delete and recreate pods to force re-peering
   kubectl delete pods -n default -l app.kubernetes.io/name=blockchain-node
   ```

### Verification
```bash
# Wait for consensus to resume
watch -n 5 'kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/consensus | jq .height'

# Verify peer connectivity
kubectl exec -n default deployment/blockchain-node -- curl -s http://localhost:8080/v1/peers | jq '. | length'
```
## Runbook: Database Failure

### Based on Chaos Test: `chaos_test_database.py`

### Symptoms
- Database connection errors
- Service degradation
- Failed transactions
- High error rates

### MTTR Target: 3 minutes

### Immediate Actions (0-3 minutes)
```bash
# 1. Check PostgreSQL status
kubectl exec -n default deployment/postgresql -- pg_isready

# 2. Check connection count
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT count(*) FROM pg_stat_activity;"

# 3. Check replica lag
kubectl exec -n default deployment/postgresql-replica -- psql -U aitbc -c "SELECT pg_last_xact_replay_timestamp();"
```

### Investigation (3-10 minutes)
1. **Review Database Logs**
   ```bash
   kubectl logs -n default deployment/postgresql --tail=100
   ```

2. **Check Resource Usage**
   ```bash
   kubectl top pods -n default -l app.kubernetes.io/name=postgresql
   df -h /var/lib/postgresql/data
   ```

3. **Identify Long-running Queries**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
   ```

### Recovery Actions
1. **Kill Idle Connections**
   ```bash
   kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '1 hour';"
   ```

2. **Restart PostgreSQL**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

3. **Failover to Replica**
   ```bash
   # Promote the replica if the primary fails
   kubectl exec -n default deployment/postgresql-replica -- pg_ctl promote -D /var/lib/postgresql/data
   ```

### Verification
```bash
# Test database connectivity
kubectl exec -n default deployment/coordinator -- python -c "import psycopg2; conn = psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('Connected')"

# Check application health
curl -f http://127.0.0.2:8011/v1/health
```
## Runbook: Redis Failure

### Symptoms
- Caching failures
- Session loss
- Increased database load
- Slow response times

### MTTR Target: 2 minutes

### Immediate Actions (0-2 minutes)
```bash
# 1. Check Redis status
kubectl exec -n default deployment/redis -- redis-cli ping

# 2. Check memory usage
kubectl exec -n default deployment/redis -- redis-cli info memory | grep used_memory_human

# 3. Check connection count
kubectl exec -n default deployment/redis -- redis-cli info clients | grep connected_clients
```

### Investigation (2-5 minutes)
1. **Review Redis Logs**
   ```bash
   kubectl logs -n default deployment/redis --tail=100
   ```

2. **Check for Eviction**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli info stats | grep evicted_keys
   ```

3. **Identify Large Keys**
   ```bash
   kubectl exec -n default deployment/redis -- redis-cli --bigkeys
   ```

### Recovery Actions
1. **Clear Stale Keys**
   ```bash
   # Caution: this deletes every key matching the pattern, not only expired ones.
   # Run the pipeline inside the pod so both redis-cli calls hit the same instance.
   kubectl exec -n default deployment/redis -- sh -c "redis-cli --scan --pattern '*:*' | xargs -r redis-cli del"
   ```

2. **Restart Redis**
   ```bash
   kubectl rollout restart deployment/redis -n default
   ```

3. **Scale Redis Cluster**
   ```bash
   kubectl scale deployment/redis --replicas=3 -n default
   ```

### Verification
```bash
# Test Redis connectivity
kubectl exec -n default deployment/coordinator -- redis-cli -h redis ping

# Check application performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
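The timing check above reads its output format from `curl-format.txt`, which is not shown in these docs. A hypothetical minimal version of that file (the `%{...}` variables are curl's own write-out variables):

```shell
# Write a minimal curl timing format file for use with -w "@curl-format.txt"
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}s\n
time_connect:    %{time_connect}s\n
time_total:      %{time_total}s\n
EOF
```

Then `curl -w "@curl-format.txt" -o /dev/null -s <url>` prints the three timings after the request completes.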
## Runbook: High CPU/Memory Usage

### Symptoms
- Slow response times
- Pod evictions
- OOM errors
- System degradation

### MTTR Target: 5 minutes

### Immediate Actions (0-5 minutes)
```bash
# 1. Check resource usage
kubectl top pods -n default
kubectl top nodes

# 2. Identify resource-hungry processes
kubectl exec -n default deployment/coordinator -- top

# 3. Check for OOM kills
dmesg | grep -i "killed process"
```

### Investigation (5-15 minutes)
1. **Analyze Resource Usage**
   ```bash
   # Detailed pod metrics
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%cpu | head -10
   kubectl exec -n default deployment/coordinator -- ps aux --sort=-%mem | head -10
   ```

2. **Check Resource Limits**
   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=coordinator | grep -A 10 Limits
   ```

3. **Review Application Metrics**
   ```bash
   # Check Prometheus metrics
   curl http://127.0.0.2:8011/metrics | grep -E "(cpu|memory)"
   ```

### Recovery Actions
1. **Scale Services**
   ```bash
   kubectl scale deployment/coordinator --replicas=5 -n default
   kubectl scale deployment/blockchain-node --replicas=3 -n default
   ```

2. **Increase Resource Limits**
   ```bash
   kubectl patch deployment coordinator -p '{"spec":{"template":{"spec":{"containers":[{"name":"coordinator","resources":{"limits":{"cpu":"2000m","memory":"4Gi"}}}]}}}}'
   ```

3. **Restart Affected Services**
   ```bash
   kubectl rollout restart deployment/coordinator -n default
   ```

### Verification
```bash
# Monitor resource usage
watch -n 5 'kubectl top pods -n default'

# Test service performance
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.2:8011/v1/health
```
## Runbook: Storage Issues

### Symptoms
- Disk space warnings
- Write failures
- Database errors
- Pod crashes

### MTTR Target: 10 minutes

### Immediate Actions (0-10 minutes)
```bash
# 1. Check disk usage
df -h
kubectl exec -n default deployment/postgresql -- df -h

# 2. Identify large files
find /var/log -name "*.log" -size +100M
kubectl exec -n default deployment/postgresql -- find /var/lib/postgresql -type f -size +1G

# 3. Clean up logs
kubectl logs -n default deployment/coordinator --tail=1000 > /tmp/coordinator.log && truncate -s 0 /var/log/containers/coordinator*.log
```

### Investigation (10-20 minutes)
1. **Analyze Storage Usage**
   ```bash
   du -sh /var/log/*
   du -sh /var/lib/docker/*
   ```

2. **Check PVC Usage**
   ```bash
   kubectl get pvc -n default
   kubectl describe pvc postgresql-data -n default
   ```

3. **Review Retention Policies**
   ```bash
   kubectl get cronjobs -n default
   kubectl describe cronjob log-cleanup -n default
   ```

### Recovery Actions
1. **Expand Storage**
   ```bash
   kubectl patch pvc postgresql-data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
   ```

2. **Force Cleanup**
   ```bash
   # Clean old logs
   find /var/log -name "*.log" -mtime +7 -delete

   # Clean Docker images
   docker system prune -a
   ```

3. **Restart Services**
   ```bash
   kubectl rollout restart deployment/postgresql -n default
   ```

### Verification
```bash
# Check disk space
df -h

# Verify database operations
kubectl exec -n default deployment/postgresql -- psql -U aitbc -c "SELECT 1;"
```
## Emergency Contact Procedures

### Escalation Matrix
1. **Level 1**: On-call engineer (5 minutes)
2. **Level 2**: On-call secondary (15 minutes)
3. **Level 3**: Engineering manager (30 minutes)
4. **Level 4**: CTO (1 hour, critical only)

### War Room Activation
```bash
# Create Slack channel
/slack create-channel #incident-$(date +%Y%m%d-%H%M%S)

# Invite stakeholders
/slack invite @sre-team @engineering-manager @cto

# Start Zoom meeting
/zoom start "AITBC Incident War Room"
```

### Customer Communication
1. **Status Page Update** (5 minutes)
2. **Email Notification** (15 minutes)
3. **Twitter Update** (30 minutes, critical only)
## Post-Incident Checklist

### Immediate (0-1 hour)
- [ ] Service fully restored
- [ ] Monitoring normal
- [ ] Status page updated
- [ ] Stakeholders notified

### Short-term (1-24 hours)
- [ ] Incident document created
- [ ] Root cause identified
- [ ] Runbooks updated
- [ ] Post-mortem scheduled

### Long-term (1-7 days)
- [ ] Post-mortem completed
- [ ] Action items assigned
- [ ] Monitoring improved
- [ ] Process updated
## Runbook Maintenance

### Review Schedule
- **Monthly**: Review and update runbooks
- **Quarterly**: Full review and testing
- **Annually**: Major revision

### Update Process
1. Test runbook procedures
2. Document lessons learned
3. Update procedures
4. Train team members
5. Update documentation

---

*Version: 1.0*
*Last Updated: 2024-12-22*
*Owner: SRE Team*
69
docs/7_deployment/5_marketplace-deployment.md
Normal file
@@ -0,0 +1,69 @@
# Marketplace GPU Endpoints Deployment Summary

## ✅ Successfully Deployed to Remote Server (aitbc-cascade)

### What was deployed:
1. **New router file**: `/opt/coordinator-api/src/app/routers/marketplace_gpu.py`
   - 9 GPU-specific endpoints implemented
   - In-memory storage for quick testing
   - Mock data with 3 initial GPUs

2. **Updated router configuration**:
   - Added `marketplace_gpu` import to `__init__.py`
   - Added router to main app with `/v1` prefix
   - Service restarted successfully
### Available Endpoints:
- `POST /v1/marketplace/gpu/register` - Register GPU
- `GET /v1/marketplace/gpu/list` - List GPUs
- `GET /v1/marketplace/gpu/{gpu_id}` - Get GPU details
- `POST /v1/marketplace/gpu/{gpu_id}/book` - Book GPU
- `POST /v1/marketplace/gpu/{gpu_id}/release` - Release GPU
- `GET /v1/marketplace/gpu/{gpu_id}/reviews` - Get reviews
- `POST /v1/marketplace/gpu/{gpu_id}/reviews` - Add review
- `GET /v1/marketplace/orders` - List orders
- `GET /v1/marketplace/pricing/{model}` - Get pricing
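A quick register-and-book smoke test can be driven with curl. The payload field names below (`model`, `vram_gb`, `price_per_hour`, `hours`) are assumptions for illustration, not the confirmed schema — check `marketplace_gpu.py` for the actual request models:

```shell
# Hypothetical register payload; field names are assumed, not the real schema.
cat > /tmp/gpu-register.json <<'EOF'
{"model": "RTX 4060 Ti", "vram_gb": 16, "price_per_hour": 0.30}
EOF
echo "payload: $(cat /tmp/gpu-register.json)"

# Against the deployed coordinator (uncomment to run for real):
# curl -s -X POST http://aitbc-cascade:8000/v1/marketplace/gpu/register \
#      -H "Content-Type: application/json" -d @/tmp/gpu-register.json
# curl -s -X POST http://aitbc-cascade:8000/v1/marketplace/gpu/gpu_001/book \
#      -H "Content-Type: application/json" -d '{"hours": 2}'
```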
### Test Results:
1. **GPU Registration**: ✅
   - Successfully registered RTX 4060 Ti (16GB)
   - GPU ID: gpu_001
   - Price: $0.30/hour

2. **GPU Booking**: ✅
   - Booked for 2 hours
   - Total cost: $1.0
   - Booking ID generated

3. **Review System**: ✅
   - Added 5-star review
   - Average rating updated to 5.0

4. **Order Management**: ✅
   - Orders tracked
   - Status: active
### Current GPU Inventory:
1. RTX 4090 (24GB) - $0.50/hr - Available
2. RTX 3080 (16GB) - $0.35/hr - Available
3. A100 (40GB) - $1.20/hr - Booked
4. **RTX 4060 Ti (16GB) - $0.30/hr - Available** (newly registered)

### Service Status:
- Coordinator API: Running on port 8000
- Service: active (running)
- Last restart: Feb 12, 2026 at 16:14:11 UTC

### Next Steps:
1. Update CLI to use remote server URL (http://aitbc-cascade:8000)
2. Test full CLI workflow against remote server
3. Consider persistent storage implementation
4. Add authentication/authorization for production

### Notes:
- Current implementation uses in-memory storage
- Data resets on service restart
- No authentication required (test API key works)
- All endpoints return proper HTTP status codes (201 for creation)

The marketplace GPU functionality is now fully operational on the remote server! 🚀
273
docs/7_deployment/6_beta-release-plan.md
Normal file
@@ -0,0 +1,273 @@
# AITBC Beta Release Plan

## Executive Summary

This document outlines the beta release plan for AITBC (AI Trusted Blockchain Computing), a blockchain platform designed for AI workloads. The release follows a phased approach: Alpha → Beta → Release Candidate (RC) → General Availability (GA).

## Release Phases

### Phase 1: Alpha Release (Completed)
- **Duration**: 2 weeks
- **Participants**: Internal team (10 members)
- **Focus**: Core functionality validation
- **Status**: ✅ Completed

### Phase 2: Beta Release (Current)
- **Duration**: 6 weeks
- **Participants**: 50-100 external testers
- **Focus**: User acceptance testing, performance validation, security assessment
- **Start Date**: 2025-01-15
- **End Date**: 2025-02-26

### Phase 3: Release Candidate
- **Duration**: 2 weeks
- **Participants**: 20 selected beta testers
- **Focus**: Final bug fixes, performance optimization
- **Start Date**: 2025-03-04
- **End Date**: 2025-03-18

### Phase 4: General Availability
- **Date**: 2025-03-25
- **Target**: Public launch

## Beta Release Timeline

### Week 1-2: Onboarding & Basic Flows
- **Jan 15-19**: Tester onboarding and environment setup
- **Jan 22-26**: Basic job submission and completion flows
- **Milestone**: 80% of testers successfully submit and complete jobs

### Week 3-4: Marketplace & Explorer Testing
- **Jan 29 - Feb 2**: Marketplace functionality testing
- **Feb 5-9**: Explorer UI validation and transaction tracking
- **Milestone**: 100 marketplace transactions completed

### Week 5-6: Stress Testing & Feedback
- **Feb 12-16**: Performance stress testing (1000+ concurrent jobs)
- **Feb 19-23**: Security testing and final feedback collection
- **Milestone**: All critical bugs resolved
## User Acceptance Testing (UAT) Scenarios

### 1. Core Job Lifecycle
- **Scenario**: Submit AI inference job → Miner picks up → Execution → Results delivery → Payment
- **Test Cases**:
  - Job submission with various model types
  - Job monitoring and status tracking
  - Result retrieval and verification
  - Payment processing and wallet updates
- **Success Criteria**: 95% success rate across 1000 test jobs

### 2. Marketplace Operations
- **Scenario**: Create offer → Accept offer → Execute job → Complete transaction
- **Test Cases**:
  - Offer creation and management
  - Bid acceptance and matching
  - Price discovery mechanisms
  - Dispute resolution
- **Success Criteria**: 50 successful marketplace transactions

### 3. Explorer Functionality
- **Scenario**: Transaction lookup → Job tracking → Address analysis
- **Test Cases**:
  - Real-time transaction monitoring
  - Job history and status visualization
  - Wallet balance tracking
  - Block explorer features
- **Success Criteria**: All transactions visible within 5 seconds
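The 5-second visibility criterion lends itself to a mechanical check: poll the explorer until the transaction appears or the budget runs out. A sketch, with the actual lookup stubbed out (`true` stands in for a real explorer API call, which is an assumption to be filled in):

```shell
# Poll up to 5 times, 1s apart (≈ the 5-second budget). Replace `true`
# with a real lookup, e.g.: curl -sf "$EXPLORER/api/tx/$TX_HASH" >/dev/null
found=no
for attempt in 1 2 3 4 5; do
  if true; then found=yes; break; fi
  sleep 1
done
echo "tx visible: $found"
```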
### 4. Wallet Management
- **Scenario**: Wallet creation → Funding → Transactions → Backup/Restore
- **Test Cases**:
  - Multi-signature wallet creation
  - Cross-chain transfers
  - Backup and recovery procedures
  - Staking and unstaking operations
- **Success Criteria**: 100% wallet recovery success rate

### 5. Mining Operations
- **Scenario**: Miner setup → Job acceptance → Mining rewards → Pool participation
- **Test Cases**:
  - Miner registration and setup
  - Job bidding and execution
  - Reward distribution
  - Pool mining operations
- **Success Criteria**: 90% of submitted jobs accepted by miners

### 6. Community Management

#### Discord Community Structure
- **#announcements**: Official updates and milestones
- **#beta-testers**: Private channel for testers only
- **#bug-reports**: Structured bug reporting format
- **#feature-feedback**: Feature requests and discussions
- **#technical-support**: 24/7 support from the team

#### Regulatory Considerations
- **KYC/AML**: Basic identity verification for testers
- **Securities Law**: Beta tokens have no monetary value
- **Tax Reporting**: Testnet transactions not taxable
- **Export Controls**: Compliance with technology export laws

#### Geographic Restrictions
Beta testing is not available in:
- North Korea, Iran, Cuba, Syria, Crimea
- Countries under US sanctions
- Jurisdictions with unclear crypto regulations
### 7. Token Economics Validation
- **Scenario**: Token issuance → Reward distribution → Staking yields → Fee mechanisms
- **Test Cases**:
  - Mining reward calculations match whitepaper specs
  - Staking yields and unstaking penalties
  - Transaction fee burning and distribution
  - Marketplace fee structures
  - Token inflation/deflation mechanics
- **Success Criteria**: All token operations within 1% of theoretical values
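The 1% criterion reduces to a relative-error check between an observed value and its theoretical counterpart. A minimal sketch with illustrative numbers (the values below are made up, not real measurements):

```shell
# PASS iff |observed - theoretical| / theoretical <= 1%
theoretical=50.0   # e.g. whitepaper block reward
observed=49.8      # e.g. value read back from the chain
result=$(awk -v t="$theoretical" -v o="$observed" 'BEGIN {
  rel = (o - t) / t
  if (rel < 0) rel = -rel
  if (rel <= 0.01) print "PASS"; else print "FAIL"
}')
echo "reward check: $result"
```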
## Performance Benchmarks (Go/No-Go Criteria)

### Must-Have Metrics
- **Transaction Throughput**: ≥ 100 TPS (Transactions Per Second)
- **Job Completion Time**: ≤ 5 minutes for standard inference jobs
- **API Response Time**: ≤ 200ms (95th percentile)
- **System Uptime**: ≥ 99.9% during beta period
- **MTTR (Mean Time To Recovery)**: ≤ 2 minutes (from chaos tests)

### Nice-to-Have Metrics
- **Transaction Throughput**: ≥ 500 TPS
- **Job Completion Time**: ≤ 2 minutes
- **API Response Time**: ≤ 100ms (95th percentile)
- **Concurrent Users**: ≥ 1000 simultaneous users
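The 95th-percentile gates above can be computed from a load-test latency log with standard tools. A sketch using the nearest-rank method on synthetic data (the file path and the numbers are illustrative, not real benchmark results):

```shell
# One latency per line (ms), as a load-test harness might record them.
printf '%s\n' 120 95 210 80 150 99 300 110 105 90 > /tmp/latencies.txt

# Nearest-rank p95: the value at index ceil(0.95 * N) of the sorted sample.
p95=$(sort -n /tmp/latencies.txt | awk '{ a[NR] = $1 } END {
  idx = int(0.95 * NR + 0.999)   # ceil for non-integer ranks
  if (idx > NR) idx = NR
  print a[idx]
}')
echo "p95=${p95}ms"
```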
## Security Testing

### Automated Security Scans
- **Smart Contract Audits**: Completed by [Security Firm]
- **Penetration Testing**: OWASP Top 10 validation
- **Dependency Scanning**: CVE scan of all dependencies
- **Chaos Testing**: Network partition and coordinator outage scenarios

### Manual Security Reviews
- **Authorization Testing**: API key validation and permissions
- **Data Privacy**: GDPR compliance validation
- **Cryptography**: Proof verification and signature validation
- **Infrastructure Security**: Kubernetes and cloud security review

## Test Environment Setup

### Beta Environment
- **Network**: Separate testnet with faucet for test tokens
- **Infrastructure**: Production-like setup with monitoring
- **Data**: Reset weekly to ensure clean testing
- **Support**: 24/7 Discord support channel

### Access Credentials
- **Testnet Faucet**: 1000 AITBC tokens per tester
- **API Keys**: Unique keys per tester with rate limits
- **Wallet Seeds**: Generated per tester with backup instructions
- **Mining Accounts**: Pre-configured mining pools for testing
## Feedback Collection Mechanisms

### Automated Collection
- **Error Reporting**: Automatic crash reports and error logs
- **Performance Metrics**: Client-side performance data
- **Usage Analytics**: Feature usage tracking (anonymized)
- **Survey System**: In-app feedback prompts

### Manual Collection
- **Weekly Surveys**: Structured feedback on specific features
- **Discord Channels**: Real-time feedback and discussions
- **Office Hours**: Weekly Q&A sessions with the team
- **Bug Bounty**: Program for critical issue discovery

## Success Criteria

### Go/No-Go Decision Points

#### Week 2 Checkpoint (Jan 26)
- **Go Criteria**: 80% of testers onboarded, basic flows working
- **Blockers**: Critical bugs in job submission/completion

#### Week 4 Checkpoint (Feb 9)
- **Go Criteria**: 50 marketplace transactions, explorer functional
- **Blockers**: Security vulnerabilities, performance < 50 TPS

#### Week 6 Final Decision (Feb 23)
- **Go Criteria**: All UAT scenarios passed, benchmarks met
- **Blockers**: Any critical security issue, MTTR > 5 minutes

### Overall Success Metrics
- **User Satisfaction**: ≥ 4.0/5.0 average rating
- **Bug Resolution**: 90% of reported bugs fixed
- **Performance**: All benchmarks met
- **Security**: No critical vulnerabilities
## Risk Management

### Technical Risks
- **Consensus Issues**: Rollback to previous version
- **Performance Degradation**: Auto-scaling and optimization
- **Security Breaches**: Immediate patch and notification

### Operational Risks
- **Test Environment Downtime**: Backup environment ready
- **Low Tester Participation**: Incentive program adjustments
- **Feature Scope Creep**: Strict feature freeze after Week 4

### Mitigation Strategies
- **Daily Health Checks**: Automated monitoring and alerts
- **Rollback Plan**: Documented procedures for quick rollback
- **Communication Plan**: Regular updates to all stakeholders

## Communication Plan

### Internal Updates
- **Daily Standups**: Development team sync
- **Weekly Reports**: Progress to leadership
- **Bi-weekly Demos**: Feature demonstrations

### External Updates
- **Beta Newsletter**: Weekly updates to testers
- **Blog Posts**: Public progress updates
- **Social Media**: Regular platform updates
## Post-Beta Activities

### RC Phase Preparation
- **Bug Triage**: Prioritize and assign all reported issues
- **Performance Tuning**: Optimize based on beta metrics
- **Documentation Updates**: Incorporate beta feedback

### GA Preparation
- **Final Security Review**: Complete audit and penetration test
- **Infrastructure Scaling**: Prepare for production load
- **Support Team Training**: Enable customer support team

## Appendix

### A. Test Case Matrix
[Detailed test case spreadsheet link]

### B. Performance Benchmark Results
[Benchmark data and graphs]

### C. Security Audit Reports
[Audit firm reports and findings]

### D. Feedback Analysis
[Summary of all user feedback and actions taken]

## Contact Information

- **Beta Program Manager**: beta@aitbc.io
- **Technical Support**: support@aitbc.io
- **Security Issues**: security@aitbc.io
- **Discord Community**: https://discord.gg/aitbc

---

*Last Updated: 2025-01-10*
*Version: 1.0*
*Next Review: 2025-01-17*