ci: standardize pytest invocation and add security scanning
Some checks failed
Blockchain Synchronization Verification / sync-verification (push) Failing after 8s
CLI Tests / test-cli (push) Successful in 10s
Contract Performance Benchmarks / benchmark-gas-usage (push) Successful in 1m22s
Contract Performance Benchmarks / benchmark-execution-time (push) Successful in 1m11s
Contract Performance Benchmarks / benchmark-throughput (push) Successful in 1m13s
Cross-Chain Functionality Tests / test-cross-chain-sync (push) Failing after 5s
Cross-Chain Functionality Tests / test-cross-chain-transactions (push) Successful in 5s
Cross-Chain Functionality Tests / test-cross-chain-bridge (push) Has been skipped
Cross-Chain Functionality Tests / test-multi-chain-consensus (push) Failing after 3s
Cross-Chain Functionality Tests / aggregate-results (push) Has been skipped
Cross-Node Transaction Testing / transaction-test (push) Successful in 5s
Deploy to Testnet / deploy-testnet (push) Successful in 1m14s
Contract Performance Benchmarks / compare-benchmarks (push) Has been cancelled
Documentation Validation / validate-docs (push) Failing after 10s
Multi-Node Stress Testing / stress-test (push) Has been cancelled
Node Failover Simulation / failover-test (push) Has been cancelled
Security Scanning / security-scan (push) Has been cancelled
Smart Contract Tests / test-solidity (map[name:aitbc-contracts path:contracts]) (push) Has been cancelled
Smart Contract Tests / test-solidity (map[name:aitbc-token path:packages/solidity/aitbc-token]) (push) Has been cancelled
Smart Contract Tests / test-foundry (push) Has been cancelled
Smart Contract Tests / lint-solidity (push) Has been cancelled
Smart Contract Tests / deploy-contracts (push) Has been cancelled
Documentation Validation / validate-policies-strict (push) Successful in 3s
Integration Tests / test-service-integration (push) Failing after 45s
Multi-Chain Island Architecture Tests / test-multi-chain-island (push) Failing after 2s
Multi-Node Blockchain Health Monitoring / health-check (push) Successful in 5s
P2P Network Verification / p2p-verification (push) Successful in 3s
Production Tests / Production Integration Tests (push) Failing after 7s
Python Tests / test-python (push) Failing after 46s
Staking Tests / test-staking-service (push) Failing after 2s
Staking Tests / test-staking-integration (push) Has been skipped
Staking Tests / test-staking-contract (push) Has been skipped
Staking Tests / run-staking-test-runner (push) Has been skipped
Systemd Sync / sync-systemd (push) Successful in 21s
API Endpoint Tests / test-api-endpoints (push) Failing after 12m19s
- Changed pytest calls to use `venv/bin/python -m pytest` with explicit config
- Added `--rootdir "$PWD"` and `--import-mode=importlib` for consistent imports
- Fixed PYTHONPATH to use absolute paths with `$PWD` prefix
- Added smart contract security scanning for Solidity files
- Added Circom circuit security checks for ZK proof circuits
- Added ZK proof implementation security validation
- Added `contracts/**` to security scanning workflow
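
A minimal sketch of the standardized invocation, written as a Python helper that assembles the argument list. Only `venv/bin/python -m pytest`, `--rootdir`, and `--import-mode=importlib` come from the commit message; the helper name `build_pytest_command`, the `pytest.ini` config file name, and the `tests` target are illustrative assumptions.

```python
from pathlib import Path


def build_pytest_command(repo_root, config="pytest.ini", targets=("tests",)):
    """Assemble the pytest argv using absolute paths under repo_root."""
    root = Path(repo_root).resolve()
    return [
        str(root / "venv" / "bin" / "python"), "-m", "pytest",
        "-c", str(root / config),      # explicit config (file name is an assumption)
        "--rootdir", str(root),        # equivalent of --rootdir "$PWD"
        "--import-mode=importlib",
        *[str(root / target) for target in targets],
    ]
```

Passing the returned list to `subprocess.run` avoids shell quoting issues and keeps every path absolute, which is the point of the `$PWD` prefix in the workflow change.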
507
docs/operations/disaster-recovery-drill-plan.md
Normal file
@@ -0,0 +1,507 @@
# AITBC Disaster Recovery Drill Plan

**Version:** 1.0
**Date:** 2026-05-11
**Status:** Active
**Next Review:** 2026-08-11

## Overview

This document outlines the disaster recovery drill schedule, procedures, and reporting for the AITBC platform. Regular drills ensure the disaster recovery plan is effective, team members are trained, and recovery procedures are validated.

## Drill Schedule

### 2026 Drill Calendar

| Month | Drill Type | Duration | Target Date | Status |
|-------|------------|----------|-------------|--------|
| February | Tabletop Exercise | 2 hours | 2026-02-15 | Scheduled |
| March | Service Failover | 1 hour | 2026-03-15 | Scheduled |
| April | Database Restore | 1 hour | 2026-04-15 | Scheduled |
| May | Full System Recovery | 4 hours | 2026-05-15 | Scheduled |
| June | Tabletop Exercise | 2 hours | 2026-06-15 | Scheduled |
| July | Service Failover | 1 hour | 2026-07-15 | Scheduled |
| August | Database Restore | 1 hour | 2026-08-15 | Scheduled |
| September | Full System Recovery | 4 hours | 2026-09-15 | Scheduled |
| October | Tabletop Exercise | 2 hours | 2026-10-15 | Scheduled |
| November | Service Failover | 1 hour | 2026-11-15 | Scheduled |
| December | Data Center Failover | 8 hours | 2026-12-15 | Scheduled |

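As a rough illustration, the calendar can be tallied per drill type to see how often each kind of drill actually runs in 2026. The rows are copied from the table above; the names `CALENDAR_2026` and `drills_per_type` are hypothetical.

```python
from collections import Counter

# (month, drill type) pairs copied from the 2026 drill calendar.
CALENDAR_2026 = [
    ("February", "Tabletop Exercise"),
    ("March", "Service Failover"),
    ("April", "Database Restore"),
    ("May", "Full System Recovery"),
    ("June", "Tabletop Exercise"),
    ("July", "Service Failover"),
    ("August", "Database Restore"),
    ("September", "Full System Recovery"),
    ("October", "Tabletop Exercise"),
    ("November", "Service Failover"),
    ("December", "Data Center Failover"),
]


def drills_per_type(calendar):
    """Count how many scheduled drills exist for each drill type."""
    return Counter(drill_type for _month, drill_type in calendar)
```

The resulting counts can be compared with the frequency stated for each drill type when the schedule is reviewed.
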
### Drill Types

#### 1. Tabletop Exercise
- **Frequency:** Quarterly
- **Duration:** 2 hours
- **Participants:** Engineering, DevOps, Security, Product
- **Format:** Discussion-based scenario walkthrough
- **Objective:** Validate decision-making processes and communication

#### 2. Service Failover
- **Frequency:** Monthly
- **Duration:** 1 hour
- **Participants:** DevOps, Engineering
- **Format:** Actual service restart/failover
- **Objective:** Validate automated failover mechanisms

#### 3. Database Restore
- **Frequency:** Monthly
- **Duration:** 1 hour
- **Participants:** DBA, DevOps
- **Format:** Actual database restore from backup
- **Objective:** Validate backup integrity and restore procedures

#### 4. Full System Recovery
- **Frequency:** Quarterly
- **Duration:** 4 hours
- **Participants:** All teams
- **Format:** Complete system recovery simulation
- **Objective:** Validate end-to-end recovery procedures

#### 5. Data Center Failover
- **Frequency:** Annually
- **Duration:** 8 hours
- **Participants:** All teams
- **Format:** Geographic failover simulation
- **Objective:** Validate multi-region recovery capabilities

## Drill Procedures

### Pre-Drill Preparation (2 Weeks Before)

1. **Define Drill Scenario**
   - Select disaster scenario from DR plan
   - Define specific objectives and success criteria
   - Identify affected components and services
   - Determine scope and limitations

2. **Prepare Test Environment**
   - Set up isolated test environment (if needed)
   - Prepare test data and backups
   - Configure monitoring and logging
   - Verify tooling and access

3. **Notify Participants**
   - Send drill invitation with details
   - Confirm participant availability
   - Share drill scenario and objectives
   - Provide pre-reading materials

4. **Prepare Monitoring**
   - Set up additional monitoring for drill
   - Configure alerting for drill events
   - Prepare metrics collection
   - Set up logging capture

5. **Establish Success Criteria**
   - Define measurable objectives
   - Set RTO/RPO targets for drill
   - Define pass/fail criteria
   - Document expected outcomes

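One way to make pass/fail criteria measurable is to record each criterion as a target and an observed value. The sketch below is illustrative only: the `Criterion` class, the choice of minutes as the unit, and the rule that a drill passes only when every criterion is met are assumptions, not part of the plan.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    target_minutes: float   # upper bound that still counts as "met"
    actual_minutes: float   # value observed during the drill

    @property
    def met(self) -> bool:
        return self.actual_minutes <= self.target_minutes


def drill_passed(criteria):
    """Assumed rule: the drill passes only if every criterion is met."""
    return all(criterion.met for criterion in criteria)
```

For example, `Criterion("Database restored within RTO", 60, 45)` is met, while `Criterion("Automatic failover", 5, 7)` is not, so a drill containing both would fail under this rule.
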
### During Drill Execution

#### 1. Drill Kickoff (15 minutes)
- Call to order and attendance check
- Review drill scenario and objectives
- Review roles and responsibilities
- Review communication channels
- Start timer and begin drill

#### 2. Drill Execution (Variable)
- Execute according to scenario
- Document all actions and timestamps
- Record issues and blockers
- Monitor system behavior
- Communicate progress per plan

#### 3. Drill Completion (15 minutes)
- Stop timer and conclude drill
- Collect initial observations
- Verify system state
- Begin preliminary debrief

### Post-Drill Activities

#### Immediate Post-Drill (1 Hour)
1. **Collect Metrics**
   - RTO achieved
   - RPO achieved
   - Success criteria met
   - Issues encountered

2. **Initial Debrief**
   - Participant feedback
   - Observations and findings
   - Immediate issues identified
   - Preliminary recommendations

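The "RTO achieved" and "RPO achieved" figures can be derived from three timestamps recorded during the drill. This is a sketch under the usual definitions (RTO as time from failure to restored service, RPO as the gap between the last usable backup and the failure); the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def rto_achieved(incident_start: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time from the simulated failure to restored service."""
    return service_restored - incident_start


def rpo_achieved(last_good_backup: datetime, incident_start: datetime) -> timedelta:
    """Window of data that could be lost: failure time minus last usable backup."""
    return incident_start - last_good_backup
```

Comparing the results with the targets is then a plain `timedelta` comparison, for example `rto_achieved(start, restored) <= timedelta(hours=1)`.
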
#### Post-Drill Review (1 Week)
1. **Analyze Results**
   - Compare results to objectives
   - Identify gaps and weaknesses
   - Analyze root causes of issues
   - Document lessons learned

2. **Update Documentation**
   - Update DR procedures
   - Update runbooks
   - Update monitoring/alerting
   - Update contact information

3. **Create Action Items**
   - Assign owners and due dates
   - Prioritize improvements
   - Track completion
   - Schedule follow-up

## Drill Scenarios

### Scenario 1: Database Corruption
- **Type:** Database Restore
- **Severity:** P1
- **Components:** PostgreSQL
- **Steps:**
  1. Simulate database corruption
  2. Stop affected services
  3. Restore from latest backup
  4. Verify data integrity
  5. Restart services
  6. Verify system health

**Success Criteria:**
- Database restored within RTO (1 hour)
- Data integrity verified
- Services operational within 30 minutes post-restore
- Data loss within RPO (24 hours)

### Scenario 2: Service Failure
- **Type:** Service Failover
- **Severity:** P2
- **Components:** Coordinator API, Marketplace, Exchange
- **Steps:**
  1. Simulate service crash
  2. Monitor automatic failover
  3. Verify pod restart
  4. Test service health
  5. Verify data consistency

**Success Criteria:**
- Automatic failover within 5 minutes
- Service health restored
- Zero data loss
- Error rate returns to normal

### Scenario 3: Network Partition
- **Type:** Tabletop Exercise
- **Severity:** P2
- **Components:** All services
- **Steps:**
  1. Discuss network partition scenario
  2. Walk through response procedures
  3. Identify decision points
  4. Validate communication plan
  5. Document gaps

**Success Criteria:**
- Response procedures validated
- Communication plan confirmed
- Decision points identified
- Gaps documented

### Scenario 4: Data Center Outage
- **Type:** Data Center Failover
- **Severity:** P1
- **Components:** All services
- **Steps:**
  1. Simulate data center failure
  2. Activate alternate data center
  3. Restore from backup (if needed)
  4. Update DNS
  5. Verify service availability
  6. Monitor system performance

**Success Criteria:**
- Alternate data center activated within 4 hours
- Services operational
- DNS propagation complete
- Performance acceptable

### Scenario 5: Security Breach
- **Type:** Tabletop Exercise
- **Severity:** P1
- **Components:** All services
- **Steps:**
  1. Discuss breach scenario
  2. Walk through containment procedures
  3. Validate forensic preservation
  4. Review communication plan
  5. Document legal/compliance requirements

**Success Criteria:**
- Containment procedures validated
- Forensic procedures confirmed
- Communication plan tested
- Compliance requirements identified

## Drill Reporting

### Drill Report Template

```markdown
# Disaster Recovery Drill Report

## Basic Information
- **Drill ID:** DRILL-YYYYMMDD-001
- **Date:** [Date]
- **Type:** [Drill Type]
- **Scenario:** [Description]
- **Duration:** [Actual Duration]
- **Participants:** [Names]

## Objectives
- [Objective 1]
- [Objective 2]
- [Objective 3]

## Success Criteria
| Criterion | Target | Actual | Status |
|-----------|--------|--------|--------|
| [Criterion 1] | [Target] | [Actual] | [Met/Not Met] |
| [Criterion 2] | [Target] | [Actual] | [Met/Not Met] |
| [Criterion 3] | [Target] | [Actual] | [Met/Not Met] |

## Metrics
- **RTO Target:** [Target]
- **RTO Achieved:** [Actual]
- **RPO Target:** [Target]
- **RPO Achieved:** [Actual]
- **Backup Restore Time:** [Time]
- **Service Recovery Time:** [Time]

## Timeline
| Time | Action | Owner | Status |
|------|--------|-------|--------|
| [Time] | [Action] | [Owner] | [Status] |
| [Time] | [Action] | [Owner] | [Status] |

## Issues Encountered
### Issue 1
- **Description:** [Description]
- **Impact:** [Impact]
- **Resolution:** [Resolution]
- **Prevention:** [Prevention]

### Issue 2
- **Description:** [Description]
- **Impact:** [Impact]
- **Resolution:** [Resolution]
- **Prevention:** [Prevention]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]
- [Lesson 3]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Owner] | [Date] | [Status] |
| [Action 2] | [Owner] | [Date] | [Status] |
| [Action 3] | [Owner] | [Date] | [Status] |

## Recommendations
- [Recommendation 1]
- [Recommendation 2]
- [Recommendation 3]

## Next Steps
- [Next Step 1]
- [Next Step 2]

## Sign-off
- **Drill Lead:** [Name] - [Date]
- **Observer:** [Name] - [Date]
```

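The `DRILL-YYYYMMDD-001` pattern in the template can be generated rather than typed by hand. The sketch below assumes the trailing `001` is a per-day sequence number, which the template does not state; the function name `make_drill_id` is hypothetical.

```python
from datetime import date


def make_drill_id(drill_date: date, sequence: int = 1) -> str:
    """Format an ID following the template's DRILL-YYYYMMDD-001 pattern."""
    if not 1 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"DRILL-{drill_date:%Y%m%d}-{sequence:03d}"
```

For the drill scheduled on 2026-05-15, `make_drill_id(date(2026, 5, 15))` produces `DRILL-20260515-001`.
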
### Report Distribution

- **Primary:** CTO, Engineering Lead, DevOps Lead
- **Secondary:** All participants
- **Archive:** Confluence/wiki
- **Retention:** 3 years

## Drill Metrics Tracking

### Quarterly Metrics Report

| Metric | Q1 Target | Q1 Actual | Q2 Target | Q2 Actual | Q3 Target | Q3 Actual | Q4 Target | Q4 Actual |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Drill Completion Rate | 100% | | 100% | | 100% | | 100% | |
| Success Criteria Met | 90% | | 90% | | 90% | | 90% | |
| RTO Achievement | 90% | | 90% | | 90% | | 90% | |
| RPO Achievement | 95% | | 95% | | 95% | | 95% | |
| Participant Satisfaction | 80% | | 80% | | 80% | | 80% | |

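The "Actual" columns could be filled from per-drill records. The following sketch computes two of the metrics; the record layout (`completed`, `criteria_met`, `criteria_total`) and the rounding to one decimal place are assumptions made for illustration.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return 0.0 if whole == 0 else round(100.0 * part / whole, 1)


def quarterly_metrics(drills):
    """drills: list of dicts with 'completed', 'criteria_met', 'criteria_total'."""
    completed = [d for d in drills if d["completed"]]
    met = sum(d["criteria_met"] for d in completed)
    total = sum(d["criteria_total"] for d in completed)
    return {
        "Drill Completion Rate": percentage(len(completed), len(drills)),
        "Success Criteria Met": percentage(met, total),
    }
```

Only completed drills contribute to "Success Criteria Met" here, so a cancelled drill lowers the completion rate without also distorting the criteria figure.
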
### Action Item Tracking

| Action Item | Drill ID | Owner | Due Date | Status | Closed Date |
|-------------|----------|-------|----------|--------|-------------|
| [Action] | [ID] | [Owner] | [Date] | [Status] | [Date] |

## Continuous Improvement

### Drill Feedback Process

1. **Immediate Feedback**
   - Collect participant feedback during drill
   - Note issues in real-time
   - Adjust drill if needed

2. **Post-Drill Survey**
   - Send survey within 24 hours
   - Ask about drill effectiveness
   - Collect suggestions for improvement
   - Rate drill difficulty and realism

3. **Quarterly Review**
   - Review drill metrics
   - Identify trends
   - Adjust drill schedule
   - Update drill scenarios

### Drill Improvement Cycle

```
Plan → Execute → Review → Improve → Plan
```

1. **Plan:** Design drill scenario and objectives
2. **Execute:** Run drill according to procedures
3. **Review:** Analyze results and collect feedback
4. **Improve:** Update procedures and plan next drill

## Roles and Responsibilities

### Drill Coordinator
- Plan and schedule drills
- Coordinate participants
- Lead drill execution
- Document results
- Track action items

### Drill Observer
- Observe drill execution
- Take detailed notes
- Provide unbiased feedback
- Identify improvement areas

### Drill Participants
- Participate in drill execution
- Follow drill procedures
- Provide feedback
- Complete action items

### Management
- Approve drill schedule
- Review drill results
- Allocate resources
- Support improvement initiatives

## Training

### New Hire Training
- **Content:** DR plan overview, drill procedures
- **Frequency:** Onboarding
- **Duration:** 1 hour
- **Format:** Presentation + walkthrough

### Annual Refresher Training
- **Content:** Full DR plan, recent drill results
- **Frequency:** Annually
- **Duration:** 2 hours
- **Format:** Workshop

### Role-Specific Training
- **DBA:** Database restore procedures
- **DevOps:** Service failover procedures
- **Security:** Incident response procedures
- **Engineering:** Service recovery procedures

## Compliance

### Regulatory Requirements
- **SOC 2:** Annual DR testing
- **ISO 27001:** Annual DR testing
- **GDPR:** Data breach response testing
- **PCI DSS:** Annual DR testing

### Audit Trail
- Drill schedules
- Drill reports
- Action items
- Training records
- Metrics and trends

## Appendix

### A. Drill Checklist

#### Pre-Drill
- [ ] Scenario defined
- [ ] Objectives set
- [ ] Participants notified
- [ ] Environment prepared
- [ ] Monitoring configured
- [ ] Success criteria defined

#### During Drill
- [ ] Kickoff completed
- [ ] Timeline tracked
- [ ] Actions documented
- [ ] Issues recorded
- [ ] Communication maintained
- [ ] Metrics collected

#### Post-Drill
- [ ] Metrics analyzed
- [ ] Report completed
- [ ] Action items assigned
- [ ] Documentation updated
- [ ] Feedback collected
- [ ] Next drill scheduled

### B. Contact Information for Drills

| Role | Name | Email | Phone |
|------|------|-------|-------|
| Drill Coordinator | | | |
| DevOps Lead | | | |
| DBA | | | |
| Security Lead | | | |

### C. Quick Reference

#### Emergency Drill Termination
```bash
# If the drill causes an actual incident, terminate immediately.
# --replicas takes a single integer, so restore each deployment to its own pre-drill count.
kubectl scale deployment [deployment-name] --replicas=[original-count]
# Notify drill coordinator
# Document termination reason
# Schedule follow-up review
```

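Because `kubectl scale --replicas` accepts one integer, restoring several deployments to different sizes needs one command per deployment. A small sketch that generates those command lines from a recorded mapping is shown below; the replica counts are the ones used in the DR plan's database recovery procedure, while `ORIGINAL_REPLICAS` and `restore_commands` are hypothetical names.

```python
import shlex

# Replica counts from the restart step of the database recovery procedure.
ORIGINAL_REPLICAS = {"coordinator-api": 3, "marketplace": 2, "exchange": 2}


def restore_commands(replicas, namespace="default"):
    """Return one `kubectl scale` command line per deployment."""
    return [
        shlex.join(["kubectl", "scale", "deployment", name,
                    f"--replicas={count}", "-n", namespace])
        for name, count in replicas.items()
    ]
```

Recording the mapping before a drill starts means the termination step does not depend on anyone remembering the original sizes.
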
#### Drill Status Check
```bash
# Check current drill status
kubectl get pods -n default

# View drill metrics

# Monitor system health
curl -s http://coordinator-api:8011/v1/health
```

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| CTO | | | |
| Engineering Lead | | | |
| DevOps Lead | | | |
| Operations Manager | | | |

687
docs/operations/disaster-recovery.md
Normal file
@@ -0,0 +1,687 @@
# AITBC Disaster Recovery Plan

**Version:** 1.0
**Date:** 2026-05-11
**Status:** Active
**Last Updated:** 2026-05-11

## Executive Summary

This document outlines the comprehensive disaster recovery procedures for the AITBC platform. It defines disaster scenarios, recovery procedures, contact information, escalation paths, and communication protocols to ensure business continuity in the event of system failures or disasters.

## Disaster Scenarios

### 1. Database Corruption
- **Description:** PostgreSQL database corruption due to hardware failure, software bug, or malicious attack
- **Impact:** Loss of job data, marketplace offers/bids, user sessions, configuration
- **RTO:** 1 hour
- **RPO:** 24 hours
- **Recovery Strategy:** Restore from latest PostgreSQL backup

### 2. Service Failure
- **Description:** Critical service failure (coordinator-api, blockchain-node, marketplace, exchange)
- **Impact:** Service unavailability, transaction processing halt
- **RTO:** 30 minutes
- **RPO:** 0 minutes (stateless services)
- **Recovery Strategy:** Restart services, failover to standby instances

### 3. Network Partition
- **Description:** Network connectivity loss between components or regions
- **Impact:** Distributed system inconsistency, service degradation
- **RTO:** 2 hours
- **RPO:** 0 minutes
- **Recovery Strategy:** Restore network connectivity, resynchronize state

### 4. Data Center Outage
- **Description:** Complete data center failure (power, cooling, network)
- **Impact:** Complete system unavailability
- **RTO:** 4 hours
- **RPO:** 24 hours
- **Recovery Strategy:** Failover to alternate data center

### 5. Security Breach
- **Description:** Unauthorized access, data breach, ransomware attack
- **Impact:** Data compromise, service disruption, reputational damage
- **RTO:** Variable (depends on breach severity)
- **RPO:** 24 hours
- **Recovery Strategy:** Contain breach, restore from pre-breach backup, patch vulnerabilities

### 6. Ransomware Attack
- **Description:** Malicious encryption of data/systems
- **Impact:** Data unavailability, service disruption
- **RTO:** 8-24 hours
- **RPO:** 24 hours
- **Recovery Strategy:** Restore from clean backups, rebuild systems

## Contact Information

### Primary Contacts

| Role | Name | Email | Phone | Timezone |
|------|------|-------|-------|----------|
| CTO | | | | UTC |
| Engineering Lead | | | | UTC |
| DevOps Lead | | | | UTC |
| Security Lead | | | | UTC |
| Operations Manager | | | | UTC |

### Secondary Contacts

| Role | Name | Email | Phone | Timezone |
|------|------|-------|-------|----------|
| Database Administrator | | | | UTC |
| Network Engineer | | | | UTC |
| Security Analyst | | | | UTC |

### External Contacts

| Service | Contact | Email | Phone |
|---------|---------|-------|-------|
| Cloud Provider (AWS) | | | |
| DNS Provider | | | |
| Security Incident Response | | | |
| Legal Counsel | | | |
| Public Relations | | | |

## Escalation Procedures

### Severity Levels

#### P1 - Critical (System Down)
- **Definition:** Complete system outage affecting all users
- **Response Time:** 15 minutes
- **Escalation Path:** On-call Engineer → Engineering Lead → CTO
- **Communication:** Immediate stakeholder notification

#### P2 - Major (Service Degradation)
- **Definition:** Critical functionality impaired, partial outage
- **Response Time:** 30 minutes
- **Escalation Path:** On-call Engineer → Engineering Lead
- **Communication:** Stakeholder notification within 1 hour

#### P3 - Minor (Limited Impact)
- **Definition:** Non-critical functionality impaired, limited users affected
- **Response Time:** 1 hour
- **Escalation Path:** On-call Engineer
- **Communication:** Stakeholder notification within 4 hours

#### P4 - Low (Minimal Impact)
- **Definition:** Cosmetic issues, documentation errors
- **Response Time:** 4 hours
- **Escalation Path:** Team Lead
- **Communication:** Next business day

### Escalation Flowchart

```
Incident Detected
↓
On-call Engineer (15 min)
↓ (if unresolved)
Engineering Lead (30 min)
↓ (if unresolved)
CTO (1 hour)
↓ (if unresolved)
Executive Team (2 hours)
```

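The flowchart can be read as a lookup from elapsed time to the tiers that should be engaged. The sketch below assumes each time in the flowchart is the number of minutes since detection at which that tier is brought in, and that the on-call engineer is always the first responder; the flowchart itself does not spell out either point, and the names used are hypothetical.

```python
# (minutes since detection, tier) pairs read off the escalation flowchart.
ESCALATION_TIERS = [
    (15, "On-call Engineer"),
    (30, "Engineering Lead"),
    (60, "CTO"),
    (120, "Executive Team"),
]


def tiers_engaged(minutes_unresolved):
    """Return every tier that should be engaged after the given elapsed minutes."""
    engaged = [tier for deadline, tier in ESCALATION_TIERS
               if minutes_unresolved >= deadline]
    # Assumption: the on-call engineer owns the incident from the moment of detection.
    return engaged or ["On-call Engineer"]
```

An incident still open after 45 minutes would, under this reading, involve the on-call engineer and the Engineering Lead.
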
## Recovery Procedures

### Pre-Recovery Steps

1. **Assess Impact**
   - Determine scope and severity of incident
   - Identify affected components and users
   - Estimate recovery time
   - Classify incident severity (P1-P4)

2. **Declare Incident**
   - Notify on-call engineer
   - Create incident ticket
   - Initiate escalation based on severity
   - Activate incident response team

3. **Contain Incident**
   - Isolate affected systems
   - Prevent further damage
   - Preserve forensic evidence (if security incident)
   - Implement temporary workarounds

### Recovery by Scenario

#### Database Corruption Recovery

```bash
# 1. Stop affected services
kubectl scale deployment coordinator-api --replicas=0
kubectl scale deployment marketplace --replicas=0
kubectl scale deployment exchange --replicas=0

# 2. Identify latest clean backup
aws s3 ls s3://aitbc-backups-default/postgresql/ | tail -1

# 3. Download backup
aws s3 cp s3://aitbc-backups-default/postgresql/[latest-backup].sql.gz /tmp/

# 4. Restore database
./infra/scripts/restore_postgresql.sh default /tmp/[latest-backup].sql.gz

# 5. Verify data integrity
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c "SELECT COUNT(*) FROM jobs;"

# 6. Restart services
kubectl scale deployment coordinator-api --replicas=3
kubectl scale deployment marketplace --replicas=2
kubectl scale deployment exchange --replicas=2

# 7. Verify system health
curl -s http://coordinator-api:8011/v1/health
```

**Verification Steps:**
1. Check database connectivity
2. Verify job data integrity
3. Test API endpoints
4. Monitor error rates
5. Validate user access

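Step 2 uses `tail -1`, which depends on the order of the listing. As an alternative sketch, the backup can be chosen by the timestamp column of the `aws s3 ls` output. The function name `latest_backup` and the backup file names in the usage example are hypothetical; the parsing assumes the usual `date time size name` layout of that command's output.

```python
from datetime import datetime


def latest_backup(listing: str) -> str:
    """Return the .sql.gz object with the newest timestamp in `aws s3 ls` output."""
    newest_time, newest_name = None, None
    for line in listing.splitlines():
        parts = line.split(None, 3)              # date, time, size, name
        if len(parts) != 4 or not parts[3].endswith(".sql.gz"):
            continue                              # skip "PRE dir/" rows and other files
        stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
        if newest_time is None or stamp > newest_time:
            newest_time, newest_name = stamp, parts[3]
    if newest_name is None:
        raise ValueError("no .sql.gz backup found in listing")
    return newest_name
```

Feeding it the captured output of the listing command returns the object name to pass to the download step. Note that the newest backup is not necessarily a clean one if corruption began before it was taken.
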
#### Service Failure Recovery

```bash
# 1. Check service status
kubectl get pods -n default
kubectl describe deployment [service-name]

# 2. Check service logs
kubectl logs -l app=[service-name] --tail=100

# 3. Restart affected service
kubectl rollout restart deployment [service-name]

# 4. If restart fails, scale down and up
kubectl scale deployment [service-name] --replicas=0
kubectl scale deployment [service-name] --replicas=[original-count]

# 5. Verify service health
kubectl exec -n default deployment/[service-name] -- curl -s http://localhost:[port]/v1/health
```

**Verification Steps:**
1. Check pod status
2. Verify service endpoints
3. Test critical functionality
4. Monitor service metrics

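A restarted service usually needs a short time before its health endpoint responds, so the final health check is often repeated rather than run once. The sketch below shows a generic polling loop; `wait_until_healthy`, its defaults, and the injectable `check` and `sleep` callables are illustrative assumptions.

```python
import time


def wait_until_healthy(check, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)   # wait before the next probe
    return 0
```

In practice `check` could wrap an HTTP request to the service's `/v1/health` endpoint and return whether the response status was 200.
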
#### Network Partition Recovery

```bash
# 1. Diagnose network issue
kubectl get pods -n default -o wide
kubectl exec -n default [pod-name] -- ping [target-host]
kubectl exec -n default [pod-name] -- traceroute [target-host]

# 2. Check network policies
kubectl get networkpolicies -n default

# 3. Check DNS resolution
kubectl exec -n default [pod-name] -- nslookup [service-name]

# 4. Restart affected services if needed
kubectl rollout restart deployment [service-name]

# 5. Verify connectivity
kubectl exec -n default [pod-name] -- curl -s http://[service-name]:[port]/v1/health
```

**Verification Steps:**
1. Verify network connectivity
2. Test DNS resolution
3. Check service communication
4. Verify data synchronization

#### Data Center Outage Recovery
|
||||
|
||||
```bash
|
||||
# 1. Activate alternate data center
|
||||
kubectl config use-context [alt-cluster-context]
|
||||
|
||||
# 2. Verify alternate cluster health
|
||||
kubectl get nodes
|
||||
kubectl get pods -A
|
||||
|
||||
# 3. Restore from backup if needed
|
||||
aws s3 cp s3://aitbc-backups-alt/[latest-backup].sql.gz /tmp/
|
||||
./infra/scripts/restore_postgresql.sh alt /tmp/[latest-backup].sql.gz
|
||||
|
||||
# 4. Update DNS to point to alternate data center
|
||||
aws route53 change-resource-record-sets --hosted-zone-id [zone-id] --change-batch [change-batch]
|
||||
|
||||
# 5. Verify service availability
|
||||
curl -s https://api.aitbc.io/v1/health
|
||||
```
|
||||
|
||||
**Verification Steps:**
|
||||
1. Verify alternate cluster health
|
||||
2. Test DNS propagation
|
||||
3. Verify service availability
|
||||
4. Monitor system performance
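
The `[change-batch]` placeholder in step 4 is a JSON document. As a rough sketch, a helper like the one below could generate it; the function name `build_change_batch`, the `CNAME` record type, the 60-second default TTL, and the comment text are all assumptions that would need to match the actual hosted zone.

```bash
#!/usr/bin/env bash
# Print a Route 53 change batch that UPSERTs a single CNAME record.
# Usage: build_change_batch <record-name> <target-hostname> [ttl-seconds]
# (Hypothetical helper; record type and TTL are assumptions.)
build_change_batch() {
  local name=$1 target=$2 ttl=${3:-60}
  cat <<EOF
{
  "Comment": "DR failover to alternate data center",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target}" }]
      }
    }
  ]
}
EOF
}
```

One way to use it is to write the output to a file, for example `build_change_batch api.aitbc.io [alt-endpoint] > /tmp/change-batch.json`, review it, and then pass `--change-batch file:///tmp/change-batch.json` to the `aws route53` command. Keeping the file also leaves a record of the change for the incident report.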

#### Security Breach Recovery

```bash
# 1. Contain breach
# Note: scaling to 0 terminates the pods, so run step 2 first if pod-level evidence is needed
kubectl scale deployment [affected-service] --replicas=0
iptables -A INPUT -s [attacker-ip] -j DROP

# 2. Preserve forensic evidence
kubectl cp [pod-name]:/var/log /tmp/forensic-logs
docker commit [container-id] forensic-image

# 3. Identify compromise scope
grep -r "malicious" /var/log/
# Check system logs for suspicious activity

# 4. Patch vulnerabilities
./infra/scripts/apply-security-patches.sh

# 5. Restore from pre-breach backup
aws s3 cp s3://aitbc-backups/[pre-breach-backup].sql.gz /tmp/
./infra/scripts/restore_postgresql.sh default /tmp/[pre-breach-backup].sql.gz

# 6. Restart services
kubectl scale deployment [affected-service] --replicas=[original-count]

# 7. Monitor for re-infection
./scripts/monitoring/security-monitor.sh
```

**Verification Steps:**
1. Verify breach containment
2. Validate patch application
3. Verify data integrity
4. Monitor for suspicious activity
5. Conduct security audit

#### Ransomware Attack Recovery

```bash
# 1. Isolate infected systems
kubectl scale deployment --all --replicas=0
kubectl cordon [node-name]

# 2. Identify infection scope
find /app/data -name "*.encrypted"
grep -r "ransomware" /var/log/

# 3. Wipe and rebuild systems
./infra/scripts/rebuild-systems.sh

# 4. Restore from clean backup
aws s3 cp s3://aitbc-backups/[clean-backup].tar.gz /tmp/
tar -xzf /tmp/[clean-backup].tar.gz -C /app/data/

# 5. Verify no ransomware remains
./scripts/security/ransomware-scan.sh

# 6. Restart services
kubectl scale deployment --all --replicas=[original-counts]

# 7. Implement additional security measures
./infra/scripts/harden-security.sh
```

**Verification Steps:**
1. Verify system cleanliness
2. Validate data integrity
3. Test all services
4. Monitor for re-infection
5. Conduct security audit
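
Step 6 restores `[original-counts]`, but `kubectl scale --all` applies one replica count to every deployment, so per-deployment counts have to be recorded before step 1 scales everything to zero. The sketch below assumes a snapshot file of `<deployment> <replicas>` lines; the function name `print_restore_commands` and the file name `replicas.txt` are hypothetical.

```bash
#!/usr/bin/env bash
# Turn saved "<deployment> <replicas>" lines into kubectl scale commands.
# The snapshot would be taken BEFORE scaling down, for example with:
#   kubectl get deployments -n default --no-headers \
#     -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas > replicas.txt
# Usage: print_restore_commands <snapshot-file> [namespace]
# (Hypothetical helper; it only prints commands and never runs them.)
print_restore_commands() {
  local snapshot=$1 namespace=${2:-default}
  local name replicas
  while read -r name replicas; do
    # Skip blank lines and rows without a numeric replica count.
    [[ -n $name && $replicas =~ ^[0-9]+$ ]] || continue
    echo "kubectl scale deployment ${name} -n ${namespace} --replicas=${replicas}"
  done < "$snapshot"
}
```

Printing the commands instead of executing them lets the operator review the list during an incident and restart services in a deliberate order.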

### Post-Recovery Steps

1. **Verify System Health**
   - Check all services are running
   - Verify data integrity
   - Test critical functionality
   - Monitor error rates

2. **Document Incident**
   - Create incident report
   - Document root cause
   - Record recovery actions
   - Identify lessons learned

3. **Update Procedures**
   - Update disaster recovery plan
   - Improve monitoring/alerting
   - Add new prevention measures
   - Update runbooks

4. **Communicate Resolution**
   - Notify stakeholders
   - Update status page
   - Send post-mortem to team
   - Close incident ticket

## Communication Plan

### Internal Communication

#### During Incident
- **Primary Channel:** Slack #incidents
- **Backup Channel:** Phone call
- **Frequency:** Every 15-30 minutes
- **Content:** Status updates, ETA, blockers

#### After Incident
- **Primary Channel:** Email + Slack
- **Timing:** Within 24 hours
- **Content:** Post-mortem, lessons learned, action items

### External Communication

#### Customers
- **Channel:** Status page, email
- **Timing:** P1/P2: Immediate; P3/P4: Within 4 hours
- **Content:** Incident description, impact, ETA, resolution

#### Stakeholders
- **Channel:** Email, phone
- **Timing:** P1/P2: Within 1 hour; P3/P4: Within 4 hours
- **Content:** Business impact, recovery status, financial impact

#### Public
- **Channel:** Status page, social media (if major incident)
- **Timing:** Only for major incidents (P1)
- **Content:** High-level status, no technical details

### Communication Templates

#### Initial Incident Notification (Internal)
```
INCIDENT DECLARED - [Severity] - [Service]

Summary: [Brief description]
Impact: [Affected users/services]
Started: [Timestamp]
Owner: [On-call engineer]
Slack: #incidents-[ticket-number]
```

#### Customer Notification
```
Service Incident - [Service Name]

We are currently experiencing an issue affecting [service].
Our team is actively working to resolve this.
We will provide updates every 30 minutes.

Status: [Current status]
Started: [Timestamp]
```

#### Resolution Notification
```
Incident Resolved - [Service Name]

The incident affecting [service] has been resolved.
Normal service has been restored.

Started: [Timestamp]
Resolved: [Timestamp]
Duration: [Duration]
Root Cause: [Brief description]
Prevention: [What we're doing to prevent recurrence]
```
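
The `Duration` field in the resolution template can be derived from the two timestamps instead of being worked out by hand. This is a small sketch that assumes GNU `date` (the `-d` option differs on BSD/macOS); the function name `incident_duration` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the elapsed time between two timestamps as "<hours>h <minutes>m".
# Usage: incident_duration "<started>" "<resolved>"
# Relies on GNU date's -d option; both timestamps should name their time zone.
incident_duration() {
  local start_epoch end_epoch elapsed
  start_epoch=$(date -u -d "$1" +%s) || return 1
  end_epoch=$(date -u -d "$2" +%s) || return 1
  elapsed=$((end_epoch - start_epoch))
  if ((elapsed < 0)); then
    echo "resolved time is before start time" >&2
    return 1
  fi
  echo "$((elapsed / 3600))h $(((elapsed % 3600) / 60))m"
}
```

For example, `incident_duration "2026-05-11 10:00:00 UTC" "2026-05-11 12:30:00 UTC"` prints `2h 30m`.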

## Failover Mechanisms

### Service Failover

#### Kubernetes Pod Failover
- **Mechanism:** Kubernetes restarts failed containers and replaces failed pods automatically
- **Configuration:** Pod replicas set to 3+ for critical services
- **Health Checks:** Liveness and readiness probes configured
- **Failover Time:** <5 minutes

#### Database Failover
- **Mechanism:** PostgreSQL streaming replication
- **Configuration:** Primary + 2 standby replicas
- **Failover Trigger:** Automated via Patroni
- **Failover Time:** <2 minutes

#### Redis Failover
- **Mechanism:** Redis Sentinel
- **Configuration:** 1 master + 2 replicas + 3 sentinels
- **Failover Trigger:** Automatic via Sentinel
- **Failover Time:** <30 seconds

### Geographic Failover

#### Data Center Failover
- **Mechanism:** Multi-region deployment
- **Configuration:** Active-active or active-passive
- **Failover Trigger:** Manual or automated (based on health checks)
- **Failover Time:** <4 hours

#### DNS Failover
- **Mechanism:** Route53 health checks + DNS failover
- **Configuration:** Multi-region DNS records
- **Failover Trigger:** Automatic health checks
- **Failover Time:** <5 minutes (DNS propagation; depends on record TTL)

### Data Failover

#### Blockchain State Synchronization
- **Mechanism:** Peer-to-peer blockchain sync
- **Configuration:** Multiple nodes in different regions
- **Failover Trigger:** Automatic via consensus
- **Failover Time:** Depends on chain height (typically <1 hour)

## Backup Procedures

### Backup Schedule

| Component | Frequency | Time (UTC) | Type | Retention |
|-----------|-----------|------------|------|-----------|
| PostgreSQL | Daily | 02:00 | Full | 30 days |
| PostgreSQL | Weekly | 02:00 Sunday | Full | 90 days |
| Redis | Daily | 02:01 | Full | 30 days |
| Ledger | Daily | 02:02 | Full + Incremental | 30 days |
| Configuration | On change | - | Full | 90 days |

### Backup Verification

1. **Automated Verification**
   - Check backup completion via monitoring
   - Validate backup integrity via checksums
   - Test restore monthly (automated)

2. **Manual Verification**
   - Quarterly full restore test
   - Annual disaster recovery drill
   - Document verification results
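
The checksum validation listed under automated verification could look roughly like the sketch below. It assumes a SHA-256 checksum file was written next to each backup at backup time (in `sha256sum` output format); the function name `verify_backup` and that file layout are assumptions, not a description of the existing backup scripts.

```bash
#!/usr/bin/env bash
# Verify a backup file against a SHA-256 checksum recorded at backup time.
# Usage: verify_backup <backup-file> <checksum-file>
# The checksum file is assumed to hold "sha256sum <backup-file>" output.
verify_backup() {
  local backup=$1 checksum_file=$2
  local expected actual
  if [[ ! -r $backup || ! -r $checksum_file ]]; then
    echo "missing backup or checksum file" >&2
    return 2
  fi
  # The first field of the checksum file is the hex digest.
  read -r expected _ < "$checksum_file"
  actual=$(sha256sum "$backup" | cut -d' ' -f1)
  if [[ $expected == "$actual" ]]; then
    echo "OK: ${backup}"
  else
    echo "MISMATCH: ${backup}" >&2
    return 1
  fi
}
```

The distinct exit codes (1 for a mismatch, 2 for a missing file) let a monitoring job separate a corrupted backup from a backup that never ran.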

### Backup Locations

- **Primary:** S3 (us-east-1)
- **Secondary:** S3 (us-west-2)
- **Tertiary:** On-premise backup server
- **Encryption:** Server-side encryption (AES-256)
- **Access:** IAM-restricted, audit-logged

## Disaster Recovery Drills

### Drill Schedule

| Drill Type | Frequency | Duration | Participants |
|------------|-----------|----------|--------------|
| Tabletop Exercise | Quarterly | 2 hours | Engineering, Ops, Security |
| Service Failover | Monthly | 1 hour | DevOps |
| Database Restore | Monthly | 1 hour | DBA, DevOps |
| Full System Recovery | Quarterly | 4 hours | All teams |
| Data Center Failover | Annually | 8 hours | All teams |

### Drill Procedures

#### Pre-Drill Preparation
1. Define drill scenario and objectives
2. Notify participants in advance
3. Prepare test environment (if needed)
4. Set up monitoring and logging
5. Establish success criteria

#### During Drill
1. Execute drill according to scenario
2. Document actions and timing
3. Record issues and blockers
4. Monitor system behavior
5. Communicate progress

#### Post-Drill Review
1. Collect metrics and observations
2. Identify gaps and improvements
3. Update procedures and documentation
4. Share lessons learned
5. Schedule follow-up actions

### Drill Report Template

```
Disaster Recovery Drill Report

Date: [Date]
Type: [Drill Type]
Scenario: [Description]
Participants: [Names]

Objectives:
- [Objective 1]
- [Objective 2]

Results:
- Success Criteria: [Met/Not Met]
- RTO Achieved: [Time]
- RPO Achieved: [Time]

Issues Encountered:
- [Issue 1]
- [Issue 2]

Lessons Learned:
- [Lesson 1]
- [Lesson 2]

Action Items:
- [Action 1] - [Owner] - [Due Date]
- [Action 2] - [Owner] - [Due Date]

Next Drill: [Date]
```

## Metrics and Monitoring

### Key Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Backup Success Rate | >99% | Daily |
| Restore Success Rate | 100% | Monthly test |
| RTO Achievement | Within RTO target | Per incident |
| RPO Achievement | Within RPO target | Per incident |
| Drill Participation | 100% | Per drill |
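
As an illustration of how the backup success rate target could be checked, the sketch below computes the percentage from job counts and compares it with the >99% threshold. The function names `success_rate` and `meets_target` are hypothetical, and where the counts come from (monitoring export, job logs) is left open.

```bash
#!/usr/bin/env bash
# Print the success rate as a percentage with two decimals.
# Usage: success_rate <succeeded> <total>
success_rate() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { exit 1 }
    printf "%.2f\n", (ok / total) * 100
  }'
}

# Succeed only when the rate is strictly above the target (default 99).
# Usage: meets_target <succeeded> <total> [target-percent]
meets_target() {
  local rate
  rate=$(success_rate "$1" "$2") || return 1
  awk -v rate="$rate" -v target="${3:-99}" 'BEGIN { exit !(rate > target) }'
}
```

With one failed backup in a 30-day month, `success_rate 29 30` prints `96.67` and `meets_target 29 30` fails, so the >99% target effectively tolerates at most one failure in roughly every hundred runs.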

### Monitoring

#### Backup Monitoring
- Backup completion status
- Backup size and duration
- Backup integrity checks
- Storage capacity

#### Recovery Monitoring
- Recovery time tracking
- Recovery success rate
- System health post-recovery
- Error rates post-recovery

#### Drill Monitoring
- Drill completion rate
- Drill success rate
- Participant feedback
- Action item completion

## Maintenance

### Plan Review
- **Frequency:** Quarterly
- **Owner:** Operations Manager
- **Participants:** Engineering, DevOps, Security
- **Output:** Updated plan version

### Contact Updates
- **Frequency:** Monthly
- **Owner:** HR/Operations
- **Process:** Verify all contacts are current

### Procedure Updates
- **Frequency:** As needed
- **Trigger:** System changes, incident lessons learned
- **Process:** Update documentation, notify team

## Appendix

### A. Quick Reference Card

```
EMERGENCY CONTACTS:
CTO: [Phone]
Engineering Lead: [Phone]
On-call: [Phone]

CRITICAL COMMANDS:
Check status: kubectl get pods -A
Check logs: kubectl logs -l app=[service]
Restart: kubectl rollout restart deployment [service]
Scale: kubectl scale deployment [service] --replicas=N

BACKUP RESTORE:
PostgreSQL: ./infra/scripts/restore_postgresql.sh default [backup]
Redis: kubectl cp [backup] default/redis-0:/data/dump.rdb
Ledger: tar -xzf [backup] -C /tmp/ && kubectl cp /tmp/chain/ default/node:/app/data/

DNS FAILOVER:
aws route53 change-resource-record-sets --hosted-zone-id [id] --change-batch [batch]
```

### B. Incident Response Checklist

- [ ] Assess impact and severity
- [ ] Declare incident and notify on-call
- [ ] Create incident ticket
- [ ] Activate incident response team
- [ ] Contain incident
- [ ] Begin recovery procedures
- [ ] Communicate with stakeholders
- [ ] Monitor recovery progress
- [ ] Verify system health
- [ ] Document incident
- [ ] Conduct post-incident review
- [ ] Update procedures
- [ ] Close incident ticket

### C. Change History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2026-05-11 | Initial creation | |

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| CTO | | | |
| Engineering Lead | | | |
| DevOps Lead | | | |
| Security Lead | | | |
| Operations Manager | | | |