AITBC Disaster Recovery Plan

Version: 1.0
Date: 2026-05-11
Status: Active
Last Updated: 2026-05-11

Executive Summary

This document defines the disaster recovery procedures for the AITBC platform: disaster scenarios, recovery procedures, contact information, escalation paths, and communication protocols. Its goal is business continuity in the event of system failures or disasters.

Disaster Scenarios

1. Database Corruption

Description: PostgreSQL database corruption due to hardware failure, software bug, or malicious attack
Impact: Loss of job data, marketplace offers/bids, user sessions, configuration
RTO: 1 hour
RPO: 24 hours
Recovery Strategy: Restore from latest PostgreSQL backup

2. Service Failure

Description: Critical service failure (coordinator-api, blockchain-node, marketplace, exchange)
Impact: Service unavailability, transaction processing halt
RTO: 30 minutes
RPO: 0 minutes (stateless services)
Recovery Strategy: Restart services, failover to standby instances

3. Network Partition

Description: Network connectivity loss between components or regions
Impact: Distributed system inconsistency, service degradation
RTO: 2 hours
RPO: 0 minutes
Recovery Strategy: Restore network connectivity, resynchronize state

4. Data Center Outage

Description: Complete data center failure (power, cooling, network)
Impact: Complete system unavailability
RTO: 4 hours
RPO: 24 hours
Recovery Strategy: Failover to alternate data center

5. Security Breach

Description: Unauthorized access, data breach, ransomware attack
Impact: Data compromise, service disruption, reputational damage
RTO: Variable (depends on breach severity)
RPO: 24 hours
Recovery Strategy: Contain breach, restore from pre-breach backup, patch vulnerabilities

6. Ransomware Attack

Description: Malicious encryption of data/systems
Impact: Data unavailability, service disruption
RTO: 8-24 hours
RPO: 24 hours
Recovery Strategy: Restore from clean backups, rebuild systems

Contact Information

Primary Contacts

Role	Name	Email	Phone	Timezone
CTO				UTC
Engineering Lead				UTC
DevOps Lead				UTC
Security Lead				UTC
Operations Manager				UTC

Secondary Contacts

Role	Name	Email	Phone	Timezone
Database Administrator				UTC
Network Engineer				UTC
Security Analyst				UTC

External Contacts

Service	Contact	Email	Phone
Cloud Provider (AWS)
DNS Provider
Security Incident Response
Legal Counsel
Public Relations

Escalation Procedures

Severity Levels

P1 - Critical (System Down)

Definition: Complete system outage affecting all users
Response Time: 15 minutes
Escalation Path: On-call Engineer → Engineering Lead → CTO
Communication: Immediate stakeholder notification

P2 - Major (Service Degradation)

Definition: Critical functionality impaired, partial outage
Response Time: 30 minutes
Escalation Path: On-call Engineer → Engineering Lead
Communication: Stakeholder notification within 1 hour

P3 - Minor (Limited Impact)

Definition: Non-critical functionality impaired, limited users affected
Response Time: 1 hour
Escalation Path: On-call Engineer
Communication: Stakeholder notification within 4 hours

P4 - Low (Minimal Impact)

Definition: Cosmetic issues, documentation errors
Response Time: 4 hours
Escalation Path: Team Lead
Communication: Next business day

Escalation Flowchart

Incident Detected
    ↓
On-call Engineer (15 min)
    ↓ (if unresolved)
Engineering Lead (30 min)
    ↓ (if unresolved)
CTO (1 hour)
    ↓ (if unresolved)
Executive Team (2 hours)
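
The flowchart can be read as a lookup from elapsed time to the highest role that should be engaged. The sketch below assumes the figure next to each role is the elapsed time since detection by which that role must be engaged, and that the on-call engineer owns the incident until the 30-minute mark. The function name `escalation_owner` is illustrative and not part of any AITBC tooling.

```shell
# Map minutes elapsed since detection to the role that should be engaged.
# Thresholds follow the flowchart: 30 min, 1 hour, 2 hours.
escalation_owner() {
  # $1 = minutes elapsed since the incident was detected (non-negative integer)
  if [ "$1" -ge 120 ]; then
    echo "Executive Team"
  elif [ "$1" -ge 60 ]; then
    echo "CTO"
  elif [ "$1" -ge 30 ]; then
    echo "Engineering Lead"
  else
    echo "On-call Engineer"
  fi
}
```

For example, `escalation_owner 45` prints `Engineering Lead`, which an alerting script could use to decide whom to page next.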

Recovery Procedures

Pre-Recovery Steps

1. Assess Impact
   - Determine scope and severity of incident
   - Identify affected components and users
   - Estimate recovery time
   - Classify incident severity (P1-P4)
2. Declare Incident
   - Notify on-call engineer
   - Create incident ticket
   - Initiate escalation based on severity
   - Activate incident response team
3. Contain Incident
   - Isolate affected systems
   - Prevent further damage
   - Preserve forensic evidence (if security incident)
   - Implement temporary workarounds

Recovery by Scenario

Database Corruption Recovery

# 1. Stop affected services
kubectl scale deployment coordinator-api --replicas=0
kubectl scale deployment marketplace --replicas=0
kubectl scale deployment exchange --replicas=0

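# 2. Identify latest clean backup
#    (aws s3 ls lists keys in name order, so sort by timestamp first;
#    confirm the chosen backup predates the corruption)
aws s3 ls s3://aitbc-backups-default/postgresql/ | sort | tail -n 1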

# 3. Download backup
aws s3 cp s3://aitbc-backups-default/postgresql/[latest-backup].sql.gz /tmp/

# 4. Restore database
./infra/scripts/restore_postgresql.sh default /tmp/[latest-backup].sql.gz

# 5. Verify data integrity
kubectl exec -n default deployment/postgres -- psql -U aitbc -d aitbc -c "SELECT COUNT(*) FROM jobs;"

# 6. Restart services
kubectl scale deployment coordinator-api --replicas=3
kubectl scale deployment marketplace --replicas=2
kubectl scale deployment exchange --replicas=2

# 7. Verify system health
curl -s http://coordinator-api:8011/v1/health

Verification Steps:

Check database connectivity
Verify job data integrity
Test API endpoints
Monitor error rates
Validate user access
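
Step 2 of the procedure depends on picking the newest object out of an `aws s3 ls` listing. As a minimal sketch, the helper below reads such a listing on stdin and prints only the key of the most recently modified object. The name `latest_backup` is hypothetical, and the sketch assumes backup file names contain no spaces.

```shell
# Print the key of the most recently modified object from `aws s3 ls` output.
# Each object line looks like: "2026-05-10 02:00:14    1048576 name.sql.gz"
# Prefix lines ("   PRE dir/") are skipped.
latest_backup() {
  grep -v ' PRE ' | sort -k1,1 -k2,2 | tail -n 1 | awk '{print $4}'
}
```

Usage would look like `aws s3 ls s3://aitbc-backups-default/postgresql/ | latest_backup`. The result is only the newest backup, not necessarily a clean one, so it still has to be checked against the time the corruption began.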

Service Failure Recovery

# 1. Check service status
kubectl get pods -n default
kubectl describe deployment [service-name]

# 2. Check service logs
kubectl logs -l app=[service-name] --tail=100

# 3. Restart affected service
kubectl rollout restart deployment [service-name]

# 4. If restart fails, scale down and up
kubectl scale deployment [service-name] --replicas=0
kubectl scale deployment [service-name] --replicas=[original-count]

# 5. Verify service health
kubectl exec -n default deployment/[service-name] -- curl -s http://localhost:[port]/v1/health

Verification Steps:

Check pod status
Verify service endpoints
Test critical functionality
Monitor service metrics
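
A restarted service often takes a few seconds to pass its health check, so a single `curl` right after the restart can report a false failure. The sketch below is a generic retry wrapper under that assumption. `retry` is a hypothetical helper, not an existing AITBC script.

```shell
# Run a command until it succeeds or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0            # command succeeded
    fi
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"      # wait before the next attempt
    fi
    n=$((n + 1))
  done
  return 1                # every attempt failed
}
```

For instance, `retry 10 6 curl -sf http://localhost:[port]/v1/health` would poll the health endpoint for up to about a minute. The `-f` flag makes `curl` return a non-zero exit code on HTTP error responses.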

Network Partition Recovery

# 1. Diagnose network issue
kubectl get pods -n default -o wide
kubectl exec -n default [pod-name] -- ping [target-host]
kubectl exec -n default [pod-name] -- traceroute [target-host]

# 2. Check network policies
kubectl get networkpolicies -n default

# 3. Check DNS resolution
kubectl exec -n default [pod-name] -- nslookup [service-name]

# 4. Restart affected services if needed
kubectl rollout restart deployment [service-name]

# 5. Verify connectivity
kubectl exec -n default [pod-name] -- curl -s http://[service-name]:[port]/v1/health

Verification Steps:

Verify network connectivity
Test DNS resolution
Check service communication
Verify data synchronization

Data Center Outage Recovery

# 1. Activate alternate data center
kubectl config use-context [alt-cluster-context]

# 2. Verify alternate cluster health
kubectl get nodes
kubectl get pods -A

# 3. Restore from backup if needed
aws s3 cp s3://aitbc-backups-alt/[latest-backup].sql.gz /tmp/
./infra/scripts/restore_postgresql.sh alt /tmp/[latest-backup].sql.gz

# 4. Update DNS to point to alternate data center
aws route53 change-resource-record-sets --hosted-zone-id [zone-id] --change-batch [change-batch]

# 5. Verify service availability
curl -s https://api.aitbc.io/v1/health

Verification Steps:

Verify alternate cluster health
Test DNS propagation
Verify service availability
Monitor system performance
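
Step 4 passes a change batch to Route 53 without showing its shape. The sketch below generates a minimal one. The record type (`CNAME`), the TTL of 60 seconds, the function name `make_change_batch`, and the target hostname in the usage note are all assumptions to adapt to the real zone.

```shell
# Print a Route 53 change batch that repoints one record at the alternate site.
# $1 = record name, $2 = target hostname in the alternate data center
make_change_batch() {
  cat <<EOF
{
  "Comment": "DR failover to alternate data center",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}
```

One possible use is `make_change_batch api.aitbc.io alt-lb.example.net > /tmp/change-batch.json`, followed by passing `--change-batch file:///tmp/change-batch.json` to the `aws route53 change-resource-record-sets` command from step 4.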

Security Breach Recovery

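# 1. Contain breach: block the attacker at the top of the chain
#    (keep pods running until evidence has been collected)
iptables -I INPUT 1 -s [attacker-ip] -j DROP

# 2. Preserve forensic evidence, then stop the affected service
#    (scaling to 0 deletes the pods, so copy logs first)
kubectl cp [pod-name]:/var/log /tmp/forensic-logs
docker commit [container-id] forensic-image
kubectl scale deployment [affected-service] --replicas=0

# 3. Identify compromise scope
grep -r "malicious" /var/log/
# Also review system logs for other suspicious activity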

# 4. Patch vulnerabilities
./infra/scripts/apply-security-patches.sh

# 5. Restore from pre-breach backup
aws s3 cp s3://aitbc-backups/[pre-breach-backup].sql.gz /tmp/
./infra/scripts/restore_postgresql.sh default /tmp/[pre-breach-backup].sql.gz

# 6. Restart services
kubectl scale deployment [affected-service] --replicas=[original-count]

# 7. Monitor for re-infection
./scripts/monitoring/security-monitor.sh

Verification Steps:

Verify breach containment
Validate patch application
Verify data integrity
Monitor for suspicious activity
Conduct security audit

Ransomware Attack Recovery

# 1. Isolate infected systems
kubectl scale deployment --all --replicas=0
kubectl cordon [node-name]

# 2. Identify infection scope
find /app/data -name "*.encrypted"
grep -r "ransomware" /var/log/

# 3. Wipe and rebuild systems
./infra/scripts/rebuild-systems.sh

# 4. Restore from clean backup
aws s3 cp s3://aitbc-backups/[clean-backup].tar.gz /tmp/
tar -xzf /tmp/[clean-backup].tar.gz -C /app/data/

# 5. Verify no ransomware remains
./scripts/security/ransomware-scan.sh

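# 6. Restart services
#    (scale each deployment back to its own original replica count;
#    --all would give every deployment the same count)
kubectl scale deployment [service-name] --replicas=[original-count]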

# 7. Implement additional security measures
./infra/scripts/harden-security.sh

Verification Steps:

Verify system cleanliness
Validate data integrity
Test all services
Monitor for re-infection
Conduct security audit
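
Scaling everything to zero in step 1 discards the per-deployment replica counts, which differ between services (for example 3 for coordinator-api and 2 for marketplace in the database procedure). The sketch below assumes the counts were saved beforehand as `name count` lines and turns them back into scale commands. `scale_commands` and the file path in the usage note are hypothetical.

```shell
# Read "<deployment-name> <replica-count>" lines on stdin and print
# one kubectl scale command per deployment, for review before running.
scale_commands() {
  while read -r name count; do
    [ -n "$name" ] || continue   # skip blank lines
    echo "kubectl scale deployment $name --replicas=$count"
  done
}
```

The counts could be captured before isolation with `kubectl get deployments --no-headers -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas > /tmp/replicas.txt`. After recovery, `scale_commands < /tmp/replicas.txt` prints the commands so they can be inspected before being executed.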

Post-Recovery Steps

1. Verify System Health
   - Check all services are running
   - Verify data integrity
   - Test critical functionality
   - Monitor error rates
2. Document Incident
   - Create incident report
   - Document root cause
   - Record recovery actions
   - Identify lessons learned
3. Update Procedures
   - Update disaster recovery plan
   - Improve monitoring/alerting
   - Add new prevention measures
   - Update runbooks
4. Communicate Resolution
   - Notify stakeholders
   - Update status page
   - Send post-mortem to team
   - Close incident ticket

Communication Plan

Internal Communication

During Incident

Primary Channel: Slack #incidents
Backup Channel: Phone call
Frequency: Every 15-30 minutes
Content: Status updates, ETA, blockers

After Incident

Primary Channel: Email + Slack
Timing: Within 24 hours
Content: Post-mortem, lessons learned, action items

External Communication

Customers

Channel: Status page, email
Timing: P1/P2: Immediate; P3/P4: Within 4 hours
Content: Incident description, impact, ETA, resolution

Stakeholders

Channel: Email, phone
Timing: P1/P2: Within 1 hour; P3/P4: Within 4 hours
Content: Business impact, recovery status, financial impact

Public

Channel: Status page, social media (if major incident)
Timing: Only for major incidents (P1)
Content: High-level status, no technical details

Communication Templates

Initial Incident Notification (Internal)

INCIDENT DECLARED - [Severity] - [Service]

Summary: [Brief description]
Impact: [Affected users/services]
Started: [Timestamp]
Owner: [On-call engineer]
Slack: #incidents-[ticket-number]
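
Filling in the template by hand during a P1 is error-prone, so it can help to generate it. The following is a small sketch that renders the internal notification from positional arguments. `incident_notice` is a hypothetical helper, and the argument order is an arbitrary choice.

```shell
# Render the internal incident notification.
# Args: severity, service, summary, impact, start timestamp, owner, ticket number
incident_notice() {
  printf 'INCIDENT DECLARED - %s - %s\n\n' "$1" "$2"
  printf 'Summary: %s\nImpact: %s\nStarted: %s\nOwner: %s\nSlack: #incidents-%s\n' \
    "$3" "$4" "$5" "$6" "$7"
}
```

The output can be pasted into Slack or piped to whatever notification tool the team uses.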

Customer Notification

Service Incident - [Service Name]

We are currently experiencing an issue affecting [service].
Our team is actively working to resolve this.
We will provide updates every 30 minutes.

Status: [Current status]
Started: [Timestamp]

Resolution Notification

Incident Resolved - [Service Name]

The incident affecting [service] has been resolved.
Normal service has been restored.

Started: [Timestamp]
Resolved: [Timestamp]
Duration: [Duration]
Root Cause: [Brief description]
Prevention: [What we're doing to prevent recurrence]

Failover Mechanisms

Service Failover

Kubernetes Pod Failover

Mechanism: Kubernetes automatically restarts failed pods
Configuration: Pod replicas set to 3+ for critical services
Health Checks: Liveness and readiness probes configured
Failover Time: <5 minutes

Database Failover

Mechanism: PostgreSQL streaming replication
Configuration: Primary + 2 standby replicas
Failover Trigger: Automated via Patroni
Failover Time: <2 minutes

Redis Failover

Mechanism: Redis Sentinel
Configuration: Master + 2 replicas + 3 sentinels
Failover Trigger: Automatic via Sentinel
Failover Time: <30 seconds

Geographic Failover

Data Center Failover

Mechanism: Multi-region deployment
Configuration: Active-active or active-passive
Failover Trigger: Manual or automated (based on health checks)
Failover Time: <4 hours

DNS Failover

Mechanism: Route53 health checks + DNS failover
Configuration: Multi-region DNS records
Failover Trigger: Automatic health checks
Failover Time: <5 minutes (DNS propagation)

Data Failover

Blockchain State Synchronization

Mechanism: Peer-to-peer blockchain sync
Configuration: Multiple nodes in different regions
Failover Trigger: Automatic via consensus
Failover Time: Depends on chain height (typically <1 hour)

Backup Procedures

Backup Schedule

Component	Frequency	Time (UTC)	Type	Retention
PostgreSQL	Daily	02:00	Full	30 days
PostgreSQL	Weekly	02:00 Sunday	Full	90 days
Redis	Daily	02:01	Full	30 days
Ledger	Daily	02:02	Full + Incremental	30 days
Configuration	On change	-	Full	90 days

Backup Verification

1. Automated Verification
   - Check backup completion via monitoring
   - Validate backup integrity via checksums
   - Test restore monthly (automated)
2. Manual Verification
   - Quarterly full restore test
   - Annual disaster recovery drill
   - Document verification results
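
The checksum validation can be sketched as follows, assuming a SHA-256 digest is recorded when each backup is written and that `sha256sum` from GNU coreutils is available. `verify_checksum` is a hypothetical helper, and how the expected digest is stored is left open.

```shell
# Succeed (exit 0) only when the file's SHA-256 digest matches the expected value.
# $1 = path to the backup file, $2 = expected SHA-256 hex digest
verify_checksum() {
  actual=$(sha256sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}
```

A restore script could call `verify_checksum /tmp/[latest-backup].sql.gz "$expected" || exit 1` before touching the database, so that a corrupted download is never restored.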

Backup Locations

Primary: S3 (us-east-1)
Secondary: S3 (us-west-2)
Tertiary: On-premise backup server
Encryption: Server-side encryption (AES-256)
Access: IAM-restricted, audit-logged

Disaster Recovery Drills

Drill Schedule

Drill Type	Frequency	Duration	Participants
Tabletop Exercise	Quarterly	2 hours	Engineering, Ops, Security
Service Failover	Monthly	1 hour	DevOps
Database Restore	Monthly	1 hour	DBA, DevOps
Full System Recovery	Quarterly	4 hours	All teams
Data Center Failover	Annually	8 hours	All teams

Drill Procedures

Pre-Drill Preparation

Define drill scenario and objectives
Notify participants in advance
Prepare test environment (if needed)
Set up monitoring and logging
Establish success criteria

During Drill

Execute drill according to scenario
Document actions and timing
Record issues and blockers
Monitor system behavior
Communicate progress

Post-Drill Review

Collect metrics and observations
Identify gaps and improvements
Update procedures and documentation
Share lessons learned
Schedule follow-up actions

Drill Report Template

Disaster Recovery Drill Report

Date: [Date]
Type: [Drill Type]
Scenario: [Description]
Participants: [Names]

Objectives:
- [Objective 1]
- [Objective 2]

Results:
- Success Criteria: [Met/Not Met]
- RTO Achieved: [Time]
- RPO Achieved: [Time]

Issues Encountered:
- [Issue 1]
- [Issue 2]

Lessons Learned:
- [Lesson 1]
- [Lesson 2]

Action Items:
- [Action 1] - [Owner] - [Due Date]
- [Action 2] - [Owner] - [Due Date]

Next Drill: [Date]

Metrics and Monitoring

Key Metrics

Metric	Target	Measurement
Backup Success Rate	>99%	Daily
Restore Success Rate	100%	Monthly test
RTO Achievement	<Target	Per incident
RPO Achievement	<Target	Per incident
Drill Participation	100%	Per drill
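
RTO achievement can be checked mechanically after an incident or drill. The sketch below encodes the RTO targets from the Disaster Scenarios section in minutes. The scenario keys and function names are made up for illustration, ransomware uses the upper bound of its 8-24 hour range, and treating a recovery that exactly hits the target as a pass is an assumption.

```shell
# Print the RTO target in minutes for a scenario key; fail for unknown keys.
rto_minutes() {
  case "$1" in
    database-corruption) echo 60 ;;
    service-failure)     echo 30 ;;
    network-partition)   echo 120 ;;
    datacenter-outage)   echo 240 ;;
    ransomware)          echo 1440 ;;  # upper bound of the 8-24 hour range
    *) return 1 ;;                     # e.g. security breach: RTO is variable
  esac
}

# Exit 0 if the measured recovery time met the target, 1 if it did not,
# 2 if the scenario has no fixed target.
# $1 = scenario key, $2 = measured recovery time in minutes
rto_met() {
  target=$(rto_minutes "$1") || return 2
  [ "$2" -le "$target" ]
}
```

For example, `rto_met service-failure 25` succeeds, while `rto_met service-failure 45` fails because the service-failure RTO is 30 minutes.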

Monitoring

Backup Monitoring

Backup completion status
Backup size and duration
Backup integrity checks
Storage capacity

Recovery Monitoring

Recovery time tracking
Recovery success rate
System health post-recovery
Error rates post-recovery

Drill Monitoring

Drill completion rate
Drill success rate
Participant feedback
Action item completion

Maintenance

Plan Review

Frequency: Quarterly
Owner: Operations Manager
Participants: Engineering, DevOps, Security
Output: Updated plan version

Contact Updates

Frequency: Monthly
Owner: HR/Operations
Process: Verify all contacts are current

Procedure Updates

Frequency: As needed
Trigger: System changes, incident lessons learned
Process: Update documentation, notify team

Appendix

A. Quick Reference Card

EMERGENCY CONTACTS:
CTO: [Phone]
Engineering Lead: [Phone]
On-call: [Phone]

CRITICAL COMMANDS:
Check status: kubectl get pods -A
Check logs: kubectl logs -l app=[service]
Restart: kubectl rollout restart deployment [service]
Scale: kubectl scale deployment [service] --replicas=N

BACKUP RESTORE:
PostgreSQL: ./infra/scripts/restore_postgresql.sh default [backup]
Redis: kubectl cp [backup] default/redis-0:/data/dump.rdb
Ledger: tar -xzf [backup] -C /tmp/ && kubectl cp /tmp/chain/ default/node:/app/data/

DNS FAILOVER:
aws route53 change-resource-record-sets --hosted-zone-id [id] --change-batch [batch]

B. Incident Response Checklist

Assess impact and severity
Declare incident and notify on-call
Create incident ticket
Activate incident response team
Contain incident
Begin recovery procedures
Communicate with stakeholders
Monitor recovery progress
Verify system health
Document incident
Post-incident review
Update procedures
Close incident ticket

C. Change History

Version	Date	Changes	Author
1.0	2026-05-11	Initial creation

Approval

Role	Name	Date	Signature
CTO
Engineering Lead
DevOps Lead
Security Lead
Operations Manager
