feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instrumentation
Author: oib
Date: 2025-12-22 10:33:23 +01:00
Parent: d98b2c7772
Commit: c8be9d7414
260 changed files with 59033 additions and 351 deletions

158
infra/README.md Normal file
View File

@ -0,0 +1,158 @@
# AITBC Infrastructure Templates
This directory contains Terraform and Helm templates for deploying AITBC services across dev, staging, and production environments.
## Directory Structure
```
infra/
├── terraform/ # Infrastructure as Code
│ ├── modules/ # Reusable Terraform modules
│ │ └── kubernetes/ # EKS cluster module
│ └── environments/ # Environment-specific configurations
│ ├── dev/
│ ├── staging/
│ └── prod/
└── helm/ # Helm Charts
├── charts/ # Application charts
│ ├── coordinator/ # Coordinator API chart
│ ├── blockchain-node/ # Blockchain node chart
│ └── monitoring/ # Monitoring stack (Prometheus, Grafana)
└── values/ # Environment-specific values
├── dev.yaml
├── staging.yaml
└── prod.yaml
```
## Quick Start
### Prerequisites
- Terraform >= 1.0
- Helm >= 3.0
- kubectl configured for your cluster
- AWS CLI configured (for EKS)
### Deploy Development Environment
1. **Provision Infrastructure with Terraform:**
```bash
cd infra/terraform/environments/dev
terraform init
terraform apply
```
2. **Configure kubectl:**
```bash
aws eks update-kubeconfig --name aitbc-dev --region us-west-2
```
3. **Deploy Applications with Helm:**
```bash
# Add required Helm repositories
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy monitoring stack
helm install monitoring ../../helm/charts/monitoring -f ../../helm/values/dev.yaml
# Deploy coordinator API
helm install coordinator ../../helm/charts/coordinator -f ../../helm/values/dev.yaml
```
### Environment Configurations
#### Development
- 1 replica per service
- Minimal resource allocation
- Public EKS endpoint enabled
- 7-day metrics retention
#### Staging
- 2-3 replicas per service
- Moderate resource allocation
- Autoscaling enabled
- 30-day metrics retention
- TLS with staging certificates
#### Production
- 3+ replicas per service
- High resource allocation
- Full autoscaling configuration
- 90-day metrics retention
- TLS with production certificates
- Network policies enabled
- Backup configuration enabled
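The same Quick Start commands apply to the other environments by swapping the values file; the sketch below assumes the repository layout shown above and illustrative per-environment namespaces:
```bash
# Staging: replica counts, autoscaling, and retention come from staging.yaml
helm upgrade --install coordinator infra/helm/charts/coordinator \
  -f infra/helm/values/staging.yaml --namespace staging --create-namespace

# Production: pin released chart and image versions via prod.yaml
helm upgrade --install coordinator infra/helm/charts/coordinator \
  -f infra/helm/values/prod.yaml --namespace prod --create-namespace
```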
## Monitoring
The monitoring stack includes:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization dashboards
- **AlertManager**: Alert routing and notification
Access Grafana:
```bash
kubectl port-forward svc/monitoring-grafana 3000:3000
# Open http://localhost:3000
# Default credentials: admin/admin (check values files for environment-specific passwords)
```
## Scaling Guidelines
Based on benchmark results (`apps/blockchain-node/scripts/benchmark_throughput.py`):
- **Coordinator API**: add replicas when sustained load approaches ~500 TPS per node
- **Blockchain Node**: add replicas when sustained load approaches ~1000 TPS per node
- **Wallet Daemon**: scale with the number of concurrent users
The autoscaling bounds can be tuned per environment without editing the charts, as shown below.
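A minimal sketch, assuming the coordinator chart was installed directly (as in the Quick Start) under the release name `coordinator` in a `prod` namespace:
```bash
# Raise the HPA ceiling when sustained load approaches the per-node limits above
helm upgrade coordinator infra/helm/charts/coordinator \
  -f infra/helm/values/prod.yaml \
  --namespace prod \
  --set autoscaling.enabled=true \
  --set autoscaling.maxReplicas=30

# Confirm the HorizontalPodAutoscaler picked up the new bounds
kubectl get hpa -n prod
```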
## Security Considerations
- Private subnets for all application workloads
- Network policies restrict traffic between services
- Secrets managed via Kubernetes Secrets
- TLS termination at ingress level
- Pod Security Policies enforced in production
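For example, database credentials can be created out of band and only referenced by the charts; the secret name below matches `existingSecret` in `prod.yaml`, while the key name is an assumption to verify against the PostgreSQL sub-chart in use:
```bash
# Create the database credential secret referenced by the production values
kubectl create secret generic coordinator-db-secret \
  --namespace prod \
  --from-literal=postgres-password='<strong-password>'

# Optionally encrypt it for GitOps with SealedSecrets (controller manifests ship with these templates)
kubectl create secret generic coordinator-db-secret \
  --namespace prod \
  --from-literal=postgres-password='<strong-password>' \
  --dry-run=client -o yaml | kubeseal --format yaml > coordinator-db-sealedsecret.yaml
```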
## Backup and Recovery
- Automated daily backups of PostgreSQL databases
- EBS snapshots for persistent volumes
- Cross-region replication for production data
- Restore procedures documented in runbooks
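To take a backup outside the nightly schedule, the provided backup CronJob can be triggered manually (assuming it is installed as `aitbc-backup` in the `default` namespace, as in the shipped manifests):
```bash
# Trigger the nightly backup CronJob on demand
JOB="aitbc-backup-manual-$(date +%Y%m%d%H%M)"
kubectl create job --from=cronjob/aitbc-backup "$JOB" -n default

# Follow the PostgreSQL backup container while it runs
kubectl logs -n default "job/$JOB" -c postgresql-backup -f
```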
## Cost Optimization
- Use Spot instances for non-critical workloads
- Implement cluster autoscaling
- Right-size resources based on metrics
- Schedule non-production environments to run only during business hours
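A minimal sketch for the last point, assuming the dev coordinator keeps the default deployment name generated by the chart (`<release>-aitbc-coordinator`) and runs in a `dev` namespace:
```bash
# Stop the dev coordinator outside business hours...
kubectl scale deployment coordinator-aitbc-coordinator --replicas=0 -n dev
# ...and restore it in the morning (both commands can be wrapped in CronJobs)
kubectl scale deployment coordinator-aitbc-coordinator --replicas=1 -n dev
```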
## Troubleshooting
Common issues and solutions (example commands follow the list):
1. **Helm chart fails to install:**
- Check if all dependencies are added
- Verify kubectl context is correct
- Review values files for syntax errors
2. **Prometheus not scraping metrics:**
- Verify ServiceMonitor CRDs are installed
- Check service annotations
- Review network policies
3. **High memory usage:**
- Review resource limits in values files
- Check for memory leaks in applications
- Consider increasing node size
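The following commands cover the most common checks from the list above:
```bash
# 1. Validate a chart and its values before installing
helm dependency update infra/helm/charts/coordinator
helm lint infra/helm/charts/coordinator -f infra/helm/values/dev.yaml

# 2. Confirm the ServiceMonitor CRD exists and the service carries scrape annotations
kubectl get crd servicemonitors.monitoring.coreos.com
kubectl get svc -l app.kubernetes.io/name=aitbc-coordinator -o yaml | grep prometheus.io

# 3. Compare actual usage against the configured limits
kubectl top pods -n default
```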
## Contributing
When adding new services (see the scaffold example after this list):
1. Create a new Helm chart in `helm/charts/`
2. Add environment-specific values in `helm/values/`
3. Update monitoring configuration to include new service metrics
4. Document any special requirements in this README
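A minimal scaffold for step 1 (the service name is only an example):
```bash
# Scaffold a chart skeleton for the new service
helm create infra/helm/charts/my-service

# Render it against the dev values to verify it templates cleanly
helm template my-service infra/helm/charts/my-service -f infra/helm/values/dev.yaml
```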

View File

@ -0,0 +1,64 @@
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: {{ include "aitbc-blockchain-node.fullname" . }}
labels:
{{- include "aitbc-blockchain-node.labels" . | nindent 4 }}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: {{ include "aitbc-blockchain-node.fullname" . }}
minReplicas: {{ .Values.autoscaling.minReplicas }}
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
metrics:
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
# Custom metrics for blockchain-specific scaling
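# NOTE: external metrics assume an external-metrics provider (e.g. prometheus-adapter) exposes these series; this chart does not install one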
- type: External
external:
metric:
name: blockchain_transaction_queue_depth
target:
type: AverageValue
averageValue: "100"
- type: External
external:
metric:
name: blockchain_pending_transactions
target:
type: AverageValue
averageValue: "500"
behavior:
scaleDown:
stabilizationWindowSeconds: 600 # Longer stabilization for blockchain
policies:
- type: Percent
value: 5
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Max
{{- end }}

View File

@ -0,0 +1,11 @@
apiVersion: v2
name: aitbc-coordinator
description: AITBC Coordinator API Helm Chart
type: application
version: 0.1.0
appVersion: "0.1.0"
dependencies:
- name: postgresql
version: 12.x.x
repository: https://charts.bitnami.com/bitnami
condition: postgresql.enabled

View File

@ -0,0 +1,62 @@
{{/*
Expand the name of the chart.
*/}}
{{- define "aitbc-coordinator.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "aitbc-coordinator.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "aitbc-coordinator.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "aitbc-coordinator.labels" -}}
helm.sh/chart: {{ include "aitbc-coordinator.chart" . }}
{{ include "aitbc-coordinator.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels
*/}}
{{- define "aitbc-coordinator.selectorLabels" -}}
app.kubernetes.io/name: {{ include "aitbc-coordinator.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
Create the name of the service account to use
*/}}
{{- define "aitbc-coordinator.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "aitbc-coordinator.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}

View File

@ -0,0 +1,90 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
{{- with .Values.podAnnotations }}
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "aitbc-coordinator.selectorLabels" . | nindent 8 }}
spec:
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
serviceAccountName: {{ include "aitbc-coordinator.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.targetPort }}
protocol: TCP
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 12 }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
env:
- name: APP_ENV
value: {{ .Values.config.appEnv }}
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: {{ include "aitbc-coordinator.fullname" . }}
key: database-url
- name: ALLOW_ORIGINS
value: {{ .Values.config.allowOrigins | quote }}
{{- if .Values.config.receiptSigningKeyHex }}
- name: RECEIPT_SIGNING_KEY_HEX
valueFrom:
secretKeyRef:
name: {{ include "aitbc-coordinator.fullname" . }}
key: receipt-signing-key
{{- end }}
{{- if .Values.config.receiptAttestationKeyHex }}
- name: RECEIPT_ATTESTATION_KEY_HEX
valueFrom:
secretKeyRef:
name: {{ include "aitbc-coordinator.fullname" . }}
key: receipt-attestation-key
{{- end }}
volumeMounts:
- name: config
mountPath: /app/.env
subPath: .env
volumes:
- name: config
configMap:
name: {{ include "aitbc-coordinator.fullname" . }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}

View File

@ -0,0 +1,60 @@
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: {{ include "aitbc-coordinator.fullname" . }}
minReplicas: {{ .Values.autoscaling.minReplicas }}
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
metrics:
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
{{- if .Values.autoscaling.customMetrics }}
{{- range .Values.autoscaling.customMetrics }}
- type: External
external:
metric:
name: {{ .name }}
target:
type: AverageValue
averageValue: {{ .targetValue }}
{{- end }}
{{- end }}
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
{{- end }}

View File

@ -0,0 +1,70 @@
{{- if .Values.ingress.enabled -}}
{{- $fullName := include "aitbc-coordinator.fullname" . -}}
{{- $svcPort := .Values.service.port -}}
{{- if and .Values.ingress.className (not (hasKey .Values.ingress.annotations "kubernetes.io/ingress.class")) }}
{{- $_ := set .Values.ingress.annotations "kubernetes.io/ingress.class" .Values.ingress.className}}
{{- end }}
{{- if semverCompare ">=1.19-0" .Capabilities.KubeVersion.GitVersion -}}
apiVersion: networking.k8s.io/v1
{{- else -}}
apiVersion: networking.k8s.io/v1beta1
{{- end }}
kind: Ingress
metadata:
name: {{ $fullName }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
annotations:
# Security annotations (always applied)
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
nginx.ingress.kubernetes.io/ssl-ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
nginx.ingress.kubernetes.io/configuration-snippet: |
more_set_headers "X-Frame-Options: DENY";
more_set_headers "X-Content-Type-Options: nosniff";
more_set_headers "X-XSS-Protection: 1; mode=block";
more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
more_set_headers "Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'";
more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains; preload";
cert-manager.io/cluster-issuer: {{ .Values.ingress.certManager.issuer | default "letsencrypt-prod" }}
# User-provided annotations
{{- with .Values.ingress.annotations }}
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
{{- if and .Values.ingress.className (semverCompare ">=1.18-0" .Capabilities.KubeVersion.GitVersion) }}
ingressClassName: {{ .Values.ingress.className }}
{{- end }}
{{- if .Values.ingress.tls }}
tls:
{{- range .Values.ingress.tls }}
- hosts:
{{- range .hosts }}
- {{ . | quote }}
{{- end }}
secretName: {{ .secretName }}
{{- end }}
{{- end }}
rules:
{{- range .Values.ingress.hosts }}
- host: {{ .host | quote }}
http:
paths:
{{- range .paths }}
- path: {{ .path }}
{{- if and .pathType (semverCompare ">=1.18-0" $.Capabilities.KubeVersion.GitVersion) }}
pathType: {{ .pathType }}
{{- end }}
backend:
{{- if semverCompare ">=1.19-0" $.Capabilities.KubeVersion.GitVersion }}
service:
name: {{ $fullName }}
port:
number: {{ $svcPort }}
{{- else }}
serviceName: {{ $fullName }}
servicePort: {{ $svcPort }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}

View File

@ -0,0 +1,73 @@
{{- if .Values.networkPolicy.enabled }}
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
spec:
podSelector:
matchLabels:
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
policyTypes:
- Ingress
- Egress
ingress:
# Allow traffic from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
ports:
- protocol: TCP
port: http
# Allow traffic from monitoring
- from:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app.kubernetes.io/name: prometheus
ports:
- protocol: TCP
port: http
# Allow traffic from wallet-daemon
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: wallet-daemon
ports:
- protocol: TCP
port: http
# Allow traffic from same namespace for internal communication
- from:
- podSelector: {}
ports:
- protocol: TCP
port: http
egress:
# Allow DNS resolution
- to: []
ports:
- protocol: UDP
port: 53
# Allow PostgreSQL access
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: postgresql
ports:
- protocol: TCP
port: 5432
# Allow external API calls (if needed)
- to: []
ports:
- protocol: TCP
port: 443
- protocol: TCP
port: 80
{{- end }}

View File

@ -0,0 +1,59 @@
{{- if .Values.podSecurityPolicy.enabled }}
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'
readOnlyRootFilesystem: false
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}-psp
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames:
- {{ include "aitbc-coordinator.fullname" . }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}-psp
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
roleRef:
kind: Role
name: {{ include "aitbc-coordinator.fullname" . }}-psp
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
name: {{ include "aitbc-coordinator.serviceAccountName" . }}
namespace: {{ .Release.Namespace }}
{{- end }}

View File

@ -0,0 +1,21 @@
apiVersion: v1
kind: Service
metadata:
name: {{ include "aitbc-coordinator.fullname" . }}
labels:
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
{{- if .Values.monitoring.enabled }}
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "{{ .Values.service.port }}"
prometheus.io/path: "{{ .Values.monitoring.serviceMonitor.path }}"
{{- end }}
spec:
type: {{ .Values.service.type }}
ports:
- port: {{ .Values.service.port }}
targetPort: {{ .Values.service.targetPort }}
protocol: TCP
name: http
selector:
{{- include "aitbc-coordinator.selectorLabels" . | nindent 4 }}

View File

@ -0,0 +1,162 @@
# Default values for aitbc-coordinator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
replicaCount: 1
image:
repository: aitbc/coordinator-api
pullPolicy: IfNotPresent
tag: "0.1.0"
nameOverride: ""
fullnameOverride: ""
serviceAccount:
# Specifies whether a service account should be created
create: true
# Annotations to add to the service account
annotations: {}
# The name of the service account to use.
# If not set and create is true, a name is generated using the fullname template
name: ""
podAnnotations: {}
podSecurityContext:
fsGroup: 1000
securityContext:
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop:
- ALL
service:
type: ClusterIP
port: 8011
targetPort: 8011
ingress:
enabled: false
className: nginx
annotations: {}
# cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: coordinator.local
paths:
- path: /
pathType: Prefix
tls: []
# - secretName: coordinator-tls
# hosts:
# - coordinator.local
# Pod Security Policy
podSecurityPolicy:
enabled: true
# Network policies
networkPolicy:
enabled: true
security:
auth:
enabled: true
requireApiKey: true
apiKeyHeader: "X-API-Key"
tls:
version: "TLSv1.3"
ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
headers:
frameOptions: "DENY"
contentTypeOptions: "nosniff"
xssProtection: "1; mode=block"
referrerPolicy: "strict-origin-when-cross-origin"
hsts:
enabled: true
maxAge: 31536000
includeSubDomains: true
preload: true
rateLimit:
enabled: true
requestsPerMinute: 60
burst: 10
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity: {}
# Configuration
config:
appEnv: production
databaseUrl: "postgresql://aitbc:password@postgresql:5432/aitbc"
receiptSigningKeyHex: ""
receiptAttestationKeyHex: ""
allowOrigins: "*"
# PostgreSQL sub-chart configuration
postgresql:
enabled: true
auth:
postgresPassword: "password"
username: aitbc
database: aitbc
primary:
persistence:
enabled: true
size: 20Gi
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
# Monitoring
monitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
path: /metrics
port: http
# Health checks
livenessProbe:
httpGet:
path: /v1/health
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /v1/health
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3

View File

@ -0,0 +1,19 @@
apiVersion: v2
name: aitbc-monitoring
description: AITBC Monitoring Stack (Prometheus, Grafana, AlertManager)
type: application
version: 0.1.0
appVersion: "0.1.0"
dependencies:
- name: prometheus
version: 23.1.0
repository: https://prometheus-community.github.io/helm-charts
condition: prometheus.enabled
- name: grafana
version: 6.58.9
repository: https://grafana.github.io/helm-charts
condition: grafana.enabled
- name: alertmanager
version: 1.6.1
repository: https://prometheus-community.github.io/helm-charts
condition: alertmanager.enabled

View File

@ -0,0 +1,13 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ include "aitbc-monitoring.fullname" . }}-dashboards
labels:
{{- include "aitbc-monitoring.labels" . | nindent 4 }}
annotations:
grafana.io/dashboard: "1"
data:
blockchain-node-overview.json: |
{{ .Files.Get "dashboards/blockchain-node-overview.json" | indent 4 }}
coordinator-overview.json: |
{{ .Files.Get "dashboards/coordinator-overview.json" | indent 4 }}

View File

@ -0,0 +1,124 @@
# Default values for aitbc-monitoring.
# Prometheus configuration
prometheus:
enabled: true
server:
enabled: true
global:
scrape_interval: 15s
evaluation_interval: 15s
retention: 30d
persistentVolume:
enabled: true
size: 100Gi
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
service:
type: ClusterIP
port: 9090
serviceMonitors:
enabled: true
selector:
release: monitoring
alertmanager:
enabled: false
config:
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://127.0.0.1:5001/'
# Grafana configuration
grafana:
enabled: true
adminPassword: admin
persistence:
enabled: true
size: 20Gi
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
service:
type: ClusterIP
port: 3000
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://prometheus-server:9090
access: proxy
isDefault: true
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
# Service monitors for AITBC services
serviceMonitors:
coordinator:
enabled: true
interval: 30s
path: /metrics
port: http
blockchainNode:
enabled: true
interval: 30s
path: /metrics
port: http
walletDaemon:
enabled: true
interval: 30s
path: /metrics
port: http
# Alert rules
alertRules:
enabled: true
groups:
- name: aitbc.rules
rules:
- alert: HighErrorRate
expr: rate(marketplace_errors_total[5m]) / rate(marketplace_requests_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Error rate is above 10% for 5 minutes"
- alert: CoordinatorDown
expr: up{job="coordinator"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Coordinator is down"
description: "Coordinator API has been down for more than 1 minute"

View File

@ -0,0 +1,77 @@
# Development environment values
global:
environment: dev
coordinator:
replicaCount: 1
image:
tag: "dev-latest"
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
config:
appEnv: development
allowOrigins: "*"
postgresql:
auth:
postgresPassword: "dev-password"
primary:
persistence:
size: 10Gi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 512Mi
monitoring:
prometheus:
server:
retention: 7d
persistentVolume:
size: 20Gi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 512Mi
grafana:
adminPassword: "dev-admin"
persistence:
size: 5Gi
resources:
limits:
cpu: 250m
memory: 512Mi
requests:
cpu: 125m
memory: 256Mi
# Additional services
blockchainNode:
replicaCount: 1
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
walletDaemon:
replicaCount: 1
resources:
limits:
cpu: 250m
memory: 256Mi
requests:
cpu: 125m
memory: 128Mi

140
infra/helm/values/prod.yaml Normal file
View File

@ -0,0 +1,140 @@
# Production environment values
global:
environment: production
coordinator:
replicaCount: 3
image:
tag: "v0.1.0"
resources:
limits:
cpu: 2000m
memory: 2Gi
requests:
cpu: 1000m
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 75
targetMemoryUtilizationPercentage: 80
config:
appEnv: production
allowOrigins: "https://app.aitbc.io"
postgresql:
auth:
existingSecret: "coordinator-db-secret"
primary:
persistence:
size: 200Gi
storageClass: fast-ssd
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
readReplicas:
replicaCount: 2
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
monitoring:
prometheus:
server:
retention: 90d
persistentVolume:
size: 500Gi
storageClass: fast-ssd
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 1000m
memory: 2Gi
grafana:
adminPassword: "prod-admin-secure-2024"
persistence:
size: 50Gi
storageClass: fast-ssd
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
ingress:
enabled: true
hosts:
- grafana.aitbc.io
# Additional services
blockchainNode:
replicaCount: 5
resources:
limits:
cpu: 2000m
memory: 2Gi
requests:
cpu: 1000m
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 5
maxReplicas: 50
targetCPUUtilizationPercentage: 70
walletDaemon:
replicaCount: 3
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 75
# Ingress configuration
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/rate-limit: "100"
nginx.ingress.kubernetes.io/rate-limit-window: "1m"
hosts:
- host: api.aitbc.io
paths:
- path: /
pathType: Prefix
tls:
- secretName: prod-tls
hosts:
- api.aitbc.io
# Security
podSecurityPolicy:
enabled: true
networkPolicy:
enabled: true
# Backup configuration
backup:
enabled: true
schedule: "0 2 * * *"
retention: "30d"

View File

@ -0,0 +1,98 @@
# Staging environment values
global:
environment: staging
coordinator:
replicaCount: 2
image:
tag: "staging-latest"
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 5
targetCPUUtilizationPercentage: 70
config:
appEnv: staging
allowOrigins: "https://staging.aitbc.io"
postgresql:
auth:
postgresPassword: "staging-password"
primary:
persistence:
size: 50Gi
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
monitoring:
prometheus:
server:
retention: 30d
persistentVolume:
size: 100Gi
resources:
limits:
cpu: 1000m
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
grafana:
adminPassword: "staging-admin-2024"
persistence:
size: 10Gi
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 512Mi
# Additional services
blockchainNode:
replicaCount: 2
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
walletDaemon:
replicaCount: 2
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
# Ingress configuration
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: api.staging.aitbc.io
paths:
- path: /
pathType: Prefix
tls:
- secretName: staging-tls
hosts:
- api.staging.aitbc.io

View File

@ -0,0 +1,570 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: backup-scripts
namespace: default
labels:
app: aitbc-backup
component: backup
data:
backup_postgresql.sh: |
#!/bin/bash
# PostgreSQL Backup Script for AITBC
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/postgresql-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v pg_dump &> /dev/null; then
error "pg_dump is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get PostgreSQL pod name
get_postgresql_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for PostgreSQL to be ready
wait_for_postgresql() {
local pod=$1
log "Waiting for PostgreSQL pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if PostgreSQL is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
log "PostgreSQL is ready"
return 0
fi
sleep 2
((retries--))
done
error "PostgreSQL did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
log "Starting PostgreSQL backup to $backup_file"
# Get database credentials from secret
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Perform the backup (PGPASSWORD must be set inside the pod, not only in the local shell)
kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" pg_dump -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists --create --format=custom \
--file="/tmp/${BACKUP_NAME}.dump"
# Copy backup from pod
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
# Clean up remote backup file
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
# Compress backup
gzip "$backup_file"
backup_file="${backup_file}.gz"
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="postgresql/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting PostgreSQL backup process"
check_dependencies
create_backup_dir
local pod=$(get_postgresql_pod)
wait_for_postgresql "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
upload_to_cloud "$backup_file"
log "PostgreSQL backup process completed successfully"
}
# Run main function
main "$@"
backup_redis.sh: |
#!/bin/bash
# Redis Backup Script for AITBC
# Usage: ./backup_redis.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/redis-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get Redis pod name
get_redis_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find Redis pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for Redis to be ready
wait_for_redis() {
local pod=$1
log "Waiting for Redis pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if Redis is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "Redis is ready"
return 0
fi
sleep 2
((retries--))
done
error "Redis did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
log "Starting Redis backup to $backup_file"
# Create Redis backup: record LASTSAVE, trigger BGSAVE, then poll until LASTSAVE advances
local lastsave_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
# Wait for background save to complete
log "Waiting for background save to complete..."
local retries=60
while [[ $retries -gt 0 ]]; do
local lastsave_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$lastsave_now" -gt "$lastsave_before" ]]; then
log "Background save completed"
break
fi
sleep 2
((retries--))
done
if [[ $retries -eq 0 ]]; then
error "Background save did not complete within timeout"
exit 1
fi
# Copy RDB file from pod
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
# Also create an append-only file backup if enabled
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
if [[ "$aof_enabled" == "yes" ]]; then
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
log "AOF backup created: $aof_backup"
fi
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="redis/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
# Upload AOF file if exists
local aof_file="${backup_file%.rdb}.aof"
if [[ -f "$aof_file" ]]; then
local aof_key="redis/$(basename "$aof_file")"
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
fi
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting Redis backup process"
check_dependencies
create_backup_dir
local pod=$(get_redis_pod)
wait_for_redis "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
upload_to_cloud "$backup_file"
log "Redis backup process completed successfully"
}
# Run main function
main "$@"
backup_ledger.sh: |
#!/bin/bash
# Ledger Storage Backup Script for AITBC
# Usage: ./backup_ledger.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/ledger-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get blockchain node pods
get_blockchain_pods() {
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pods" ]]; then
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pods" ]]; then
error "Could not find blockchain node pods in namespace $NAMESPACE"
exit 1
fi
echo $pods
}
# Wait for blockchain node to be ready
wait_for_blockchain_node() {
local pod=$1
log "Waiting for blockchain node pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if node is responding
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
log "Blockchain node is ready"
return 0
fi
sleep 2
((retries--))
done
error "Blockchain node did not become ready within timeout"
exit 1
}
# Backup ledger data
backup_ledger_data() {
local pod=$1
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
mkdir -p "$ledger_backup_dir"
log "Starting ledger backup from pod $pod"
# Get the latest block height before backup
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
log "Latest block height: $latest_block"
# Backup blockchain data directory
local blockchain_data_dir="/app/data/chain"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
log "Backing up blockchain data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
fi
# Backup wallet data
local wallet_data_dir="/app/data/wallets"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
log "Backing up wallet data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
fi
# Backup receipts
local receipts_data_dir="/app/data/receipts"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
log "Backing up receipts directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
fi
# Create metadata file
cat > "$ledger_backup_dir/metadata.json" << EOF
{
"backup_name": "$BACKUP_NAME",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"namespace": "$NAMESPACE",
"source_pod": "$pod",
"latest_block_height": $latest_block,
"backup_type": "full"
}
EOF
log "Ledger backup completed: $ledger_backup_dir"
# Verify backup
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
log "Total backup size: $total_size"
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_dir="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
# Upload entire backup directory
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting ledger backup process"
check_dependencies
create_backup_dir
local pods=($(get_blockchain_pods))
# Use the first ready pod for backup
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
wait_for_blockchain_node "$pod"
backup_ledger_data "$pod"
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
upload_to_cloud "$backup_dir"
break
fi
done
cleanup_old_backups
log "Ledger backup process completed successfully"
}
# Run main function
main "$@"

View File

@ -0,0 +1,156 @@
apiVersion: batch/v1
kind: CronJob
metadata:
name: aitbc-backup
namespace: default
labels:
app: aitbc-backup
component: backup
spec:
schedule: "0 2 * * *" # Run daily at 2 AM
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: postgresql-backup
image: postgres:15-alpine
command:
- /bin/bash
- -c
- |
echo "Starting PostgreSQL backup..."
/scripts/backup_postgresql.sh default postgresql-backup-$(date +%Y%m%d_%H%M%S)
echo "PostgreSQL backup completed"
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: coordinator-postgresql
key: password
volumeMounts:
- name: backup-scripts
mountPath: /scripts
readOnly: true
- name: backup-storage
mountPath: /backups
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
- name: redis-backup
image: redis:7-alpine
command:
- /bin/sh
- -c
- |
echo "Waiting for PostgreSQL backup to complete..."
sleep 60
echo "Starting Redis backup..."
/scripts/backup_redis.sh default redis-backup-$(date +%Y%m%d_%H%M%S)
echo "Redis backup completed"
volumeMounts:
- name: backup-scripts
mountPath: /scripts
readOnly: true
- name: backup-storage
mountPath: /backups
resources:
requests:
memory: "128Mi"
cpu: "50m"
limits:
memory: "256Mi"
cpu: "200m"
- name: ledger-backup
image: alpine:3.18
command:
- /bin/sh
- -c
- |
echo "Waiting for previous backups to complete..."
sleep 120
echo "Starting Ledger backup..."
/scripts/backup_ledger.sh default ledger-backup-$(date +%Y%m%d_%H%M%S)
echo "Ledger backup completed"
volumeMounts:
- name: backup-scripts
mountPath: /scripts
readOnly: true
- name: backup-storage
mountPath: /backups
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
volumes:
- name: backup-scripts
configMap:
name: backup-scripts
defaultMode: 0755
- name: backup-storage
persistentVolumeClaim:
claimName: backup-storage-pvc
# Add service account for cloud storage access
serviceAccountName: backup-service-account
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: backup-service-account
namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: backup-role
namespace: default
rules:
- apiGroups: [""]
resources: ["pods", "secrets"]
verbs: ["get", "list", "watch"]
# kubectl exec/cp used by the backup scripts require create on pods/exec
- apiGroups: [""]
resources: ["pods/exec"]
verbs: ["create"]
- apiGroups: ["batch"]
resources: ["jobs", "cronjobs"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: backup-role-binding
namespace: default
subjects:
- kind: ServiceAccount
name: backup-service-account
namespace: default
roleRef:
kind: Role
name: backup-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: backup-storage-pvc
namespace: default
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 500Gi

View File

@ -0,0 +1,99 @@
# Cert-Manager Installation
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: cert-manager
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: https://charts.jetstack.io
chart: cert-manager
targetRevision: v1.14.0
helm:
releaseName: cert-manager
parameters:
- name: installCRDs
value: "true"
- name: namespace
value: cert-manager
destination:
server: https://kubernetes.default.svc
namespace: cert-manager
syncPolicy:
automated:
prune: true
selfHeal: true
---
# Let's Encrypt Production ClusterIssuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: admin@aitbc.io
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
---
# Let's Encrypt Staging ClusterIssuer (for testing)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-staging
spec:
acme:
server: https://acme-staging-v02.api.letsencrypt.org/directory
email: admin@aitbc.io
privateKeySecretRef:
name: letsencrypt-staging
solvers:
- http01:
ingress:
class: nginx
---
# Self-Signed Issuer for Development
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: selfsigned-issuer
namespace: default
spec:
selfSigned: {}
---
# Development Certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: coordinator-dev-tls
namespace: default
spec:
secretName: coordinator-dev-tls
dnsNames:
- coordinator.local
- coordinator.127.0.0.2.nip.io
issuerRef:
name: selfsigned-issuer
kind: Issuer
---
# Production Certificate Template
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: coordinator-prod-tls
namespace: default
spec:
secretName: coordinator-prod-tls
dnsNames:
- api.aitbc.io
- www.api.aitbc.io
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer

View File

@ -0,0 +1,56 @@
# Default Deny All Network Policy
# This policy denies all ingress and egress traffic by default
# Individual services must have their own network policies to allow traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all-ingress
namespace: default
spec:
podSelector: {}
policyTypes:
- Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all-egress
namespace: default
spec:
podSelector: {}
policyTypes:
- Egress
---
# Allow DNS resolution for all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: default
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
---
# Allow traffic to Kubernetes API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-k8s-api
namespace: default
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to: []
ports:
- protocol: TCP
port: 443

View File

@ -0,0 +1,81 @@
# SealedSecrets Controller Installation
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: sealed-secrets
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: https://bitnami-labs.github.io/sealed-secrets
chart: sealed-secrets
targetRevision: 2.15.0
helm:
releaseName: sealed-secrets
parameters:
- name: namespace
value: kube-system
destination:
server: https://kubernetes.default.svc
namespace: kube-system
syncPolicy:
automated:
prune: true
selfHeal: true
---
# Example SealedSecret for Coordinator API Keys
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: coordinator-api-keys
namespace: default
annotations:
sealedsecrets.bitnami.com/cluster-wide: "true"
spec:
encryptedData:
# Production API key (encrypted)
api-key-prod: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
# Staging API key (encrypted)
api-key-staging: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
# Development API key (encrypted)
api-key-dev: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
template:
metadata:
name: coordinator-api-keys
namespace: default
type: Opaque
---
# Example SealedSecret for Database Credentials
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: coordinator-db-credentials
namespace: default
spec:
encryptedData:
username: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
database: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
template:
metadata:
name: coordinator-db-credentials
namespace: default
type: Opaque
---
# Example SealedSecret for JWT Signing Keys (if needed in future)
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: coordinator-jwt-keys
namespace: default
spec:
encryptedData:
private-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
public-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
template:
metadata:
name: coordinator-jwt-keys
namespace: default
type: Opaque

View File

@ -0,0 +1,330 @@
# AITBC Chaos Testing Framework
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
## Overview
The chaos testing framework simulates real-world failure scenarios to:
- Test system resilience under adverse conditions
- Measure Mean-Time-To-Recovery (MTTR) metrics
- Identify single points of failure
- Validate recovery procedures
- Ensure SLO compliance
## Components
### Test Scripts
1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
- Deletes coordinator pods to simulate complete service outage
- Measures recovery time and service availability
- Tests load handling during and after recovery
2. **`chaos_test_network.py`** - Network partition simulation
- Creates network partitions between blockchain nodes
- Tests consensus resilience during partition
- Measures network recovery time
3. **`chaos_test_database.py`** - Database failure simulation
- Simulates PostgreSQL connection failures
- Tests high latency scenarios
- Validates application error handling
4. **`chaos_orchestrator.py`** - Test orchestration and reporting
- Runs multiple chaos test scenarios
- Aggregates MTTR metrics across tests
- Generates comprehensive reports
- Supports continuous chaos testing
## Prerequisites
- Python 3.8+
- kubectl configured with cluster access
- Helm charts deployed in target namespace
- Administrative privileges for network manipulation
## Installation
```bash
# Clone the repository
git clone <repository-url>
cd aitbc/infra/scripts
# Install dependencies
pip install aiohttp
# Make scripts executable
chmod +x chaos_*.py
```
## Usage
### Running Individual Tests
#### Coordinator Outage Test
```bash
# Basic test
python3 chaos_test_coordinator.py --namespace default
# Custom outage duration
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
# Dry run (no actual chaos)
python3 chaos_test_coordinator.py --dry-run
```
#### Network Partition Test
```bash
# Partition 50% of nodes for 60 seconds
python3 chaos_test_network.py --namespace default
# Partition 30% of nodes for 90 seconds
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
```
#### Database Failure Test
```bash
# Simulate connection failure
python3 chaos_test_database.py --namespace default --failure-type connection
# Simulate high latency (5000ms)
python3 chaos_test_database.py --namespace default --failure-type latency
```
### Running All Tests
```bash
# Run all scenarios with default parameters
python3 chaos_orchestrator.py --namespace default
# Run specific scenarios
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
# Continuous chaos testing (24 hours, every 60 minutes)
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
```
## Test Scenarios
### 1. Coordinator API Outage
**Objective**: Test system resilience when the coordinator service becomes unavailable.
**Steps**:
1. Generate baseline load on coordinator API
2. Delete all coordinator pods
3. Wait for specified outage duration
4. Monitor service recovery
5. Generate post-recovery load
**Metrics Collected**:
- MTTR (Mean-Time-To-Recovery)
- Success/error request counts
- Recovery time distribution
### 2. Network Partition
**Objective**: Test blockchain consensus during network partitions.
**Steps**:
1. Identify blockchain node pods
2. Apply iptables rules to partition nodes
3. Monitor consensus during partition
4. Remove network partition
5. Verify network recovery
**Metrics Collected**:
- Network recovery time
- Consensus health during partition
- Node connectivity status
### 3. Database Failure
**Objective**: Test application behavior when database is unavailable.
**Steps**:
1. Simulate database connection failure or high latency
2. Monitor API behavior during failure
3. Restore database connectivity
4. Verify application recovery
**Metrics Collected**:
- Database recovery time
- API error rates during failure
- Application resilience metrics
## Results and Reporting
### Test Results Format
Each test generates a JSON results file with the following structure:
```json
{
"test_start": "2024-12-22T10:00:00.000Z",
"test_end": "2024-12-22T10:05:00.000Z",
"scenario": "coordinator_outage",
"mttr": 45.2,
"error_count": 156,
"success_count": 844,
"recovery_time": 45.2
}
```
### Orchestrator Report
The orchestrator generates a comprehensive report including:
- Summary metrics across all scenarios
- SLO compliance analysis
- Recommendations for improvements
- MTTR trends and statistics
Example report snippet:
```json
{
"summary": {
"total_scenarios": 3,
"successful_scenarios": 3,
"average_mttr": 67.8,
"max_mttr": 120.5,
"min_mttr": 45.2
},
"recommendations": [
"Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
"Coordinator recovery is slow. Consider reducing pod startup time."
]
}
```
## SLO Targets
| Metric | Target | Current |
|--------|--------|---------|
| MTTR (Average) | ≤ 120 seconds | TBD |
| MTTR (Maximum) | ≤ 300 seconds | TBD |
| Success Rate | ≥ 99.9% | TBD |
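Once result files exist, the `Current` column can be filled from the aggregated metrics. A quick sketch (the result file glob is an assumption; field names come from the results format above):
```bash
# Aggregate MTTR and success rate across all saved result files
jq -s '{
  average_mttr: (map(.mttr) | add / length),
  max_mttr:     (map(.mttr) | max),
  success_rate: ((map(.success_count) | add)
                 / ((map(.success_count) | add) + (map(.error_count) | add)))
}' chaos_results_*.json
```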
## Best Practices
### Before Running Tests
1. **Backup Critical Data**: Ensure recent backups are available
2. **Notify Team**: Inform stakeholders about chaos testing
3. **Check Cluster Health**: Verify all components are healthy
4. **Schedule Appropriately**: Run during low-traffic periods
### During Tests
1. **Monitor Logs**: Watch for unexpected errors
2. **Have Rollback Plan**: Be ready to manually intervene
3. **Document Observations**: Note any unusual behavior
4. **Stop if Critical**: Abort tests if production is impacted
### After Tests
1. **Review Results**: Analyze MTTR and error rates
2. **Update Documentation**: Record findings and improvements
3. **Address Issues**: Fix any discovered problems
4. **Schedule Follow-up**: Plan regular chaos testing
## Integration with CI/CD
### GitHub Actions Example
```yaml
name: Chaos Testing

on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install aiohttp
      - name: Run chaos tests
        run: |
          cd infra/scripts
          python3 chaos_orchestrator.py --namespace staging
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results
          path: "infra/scripts/*.json"
```
## Troubleshooting
### Common Issues
1. **kubectl not found**
```bash
# Ensure kubectl is installed and configured
which kubectl
kubectl version
```
2. **Permission denied errors**
```bash
# Check RBAC permissions
kubectl auth can-i create pods --namespace default
kubectl auth can-i exec pods --namespace default
```
3. **Network rules not applying**
```bash
# Check if iptables is available in pods
kubectl exec -it <pod> -- iptables -L
```
4. **Tests hanging**
```bash
# Check pod status
kubectl get pods --namespace default
kubectl describe pod <pod-name> --namespace default
```
### Debug Mode
Enable debug logging:
```bash
export PYTHONPATH=.
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
```
## Contributing
To add new chaos test scenarios (a minimal skeleton is sketched after this list):
1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
2. Implement the required methods: `run_test()`, `save_results()`
3. Add the scenario to `chaos_orchestrator.py`
4. Update documentation
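A minimal skeleton for a new scenario script follows. Only the `run_test()`/`save_results()` pattern, the metrics keys, and the `chaos_test_<scenario>_<timestamp>.json` result naming are taken from the existing scripts; the class name and scenario name are placeholders:
```python
#!/usr/bin/env python3
"""Chaos Testing Script - <scenario> (template)"""
import argparse
import asyncio
import json
from datetime import datetime


class ChaosTestExample:
    """Template for a new chaos scenario (placeholder name)."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {
            "test_start": None,
            "test_end": None,
            "recovery_time": None,
            "mttr": None,
            "error_count": 0,
            "success_count": 0,
            "scenario": "example",
        }

    async def run_test(self) -> bool:
        self.metrics["test_start"] = datetime.utcnow().isoformat()
        # 1. establish a baseline, 2. inject the failure, 3. measure recovery
        self.metrics["test_end"] = datetime.utcnow().isoformat()
        self.metrics["mttr"] = self.metrics["recovery_time"]
        self.save_results()
        return True

    def save_results(self):
        # The orchestrator discovers results via the chaos_test_<scenario>_*.json pattern
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        with open(f"chaos_test_example_{timestamp}.json", "w") as f:
            json.dump(self.metrics, f, indent=2)


async def main():
    parser = argparse.ArgumentParser(description="Template chaos scenario")
    parser.add_argument("--namespace", default="default")
    args = parser.parse_args()
    await ChaosTestExample(args.namespace).run_test()


if __name__ == "__main__":
    asyncio.run(main())
```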
## Security Considerations
- Chaos tests require elevated privileges
- Only run in authorized environments
- Ensure test isolation from production data
- Review network rules before deployment
- Monitor for security violations during tests
## Support
For issues or questions:
- Check the troubleshooting section
- Review test logs for error details
- Contact the DevOps team at devops@aitbc.io
## License
This chaos testing framework is part of the AITBC project and follows the same license terms.

infra/scripts/backup_ledger.sh Executable file

@ -0,0 +1,233 @@
#!/bin/bash
# Ledger Storage Backup Script for AITBC
# Usage: ./backup_ledger.sh [namespace] [backup_name] [incremental]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/ledger-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get blockchain node pods
get_blockchain_pods() {
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pods" ]]; then
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pods" ]]; then
error "Could not find blockchain node pods in namespace $NAMESPACE"
exit 1
fi
echo $pods
}
# Wait for blockchain node to be ready
wait_for_blockchain_node() {
local pod=$1
log "Waiting for blockchain node pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if node is responding
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
log "Blockchain node is ready"
return 0
fi
sleep 2
((retries--))
done
error "Blockchain node did not become ready within timeout"
exit 1
}
# Backup ledger data
backup_ledger_data() {
local pod=$1
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
mkdir -p "$ledger_backup_dir"
log "Starting ledger backup from pod $pod"
# Get the latest block height before backup
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
log "Latest block height: $latest_block"
# Backup blockchain data directory
local blockchain_data_dir="/app/data/chain"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
log "Backing up blockchain data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
fi
# Backup wallet data
local wallet_data_dir="/app/data/wallets"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
log "Backing up wallet data directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
fi
# Backup receipts
local receipts_data_dir="/app/data/receipts"
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
log "Backing up receipts directory..."
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
fi
# Create metadata file
cat > "$ledger_backup_dir/metadata.json" << EOF
{
"backup_name": "$BACKUP_NAME",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"namespace": "$NAMESPACE",
"source_pod": "$pod",
"latest_block_height": $latest_block,
"backup_type": "full"
}
EOF
log "Ledger backup completed: $ledger_backup_dir"
# Verify backup
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
log "Total backup size: $total_size"
}
# Create incremental backup
create_incremental_backup() {
local pod=$1
local last_backup_file="$BACKUP_DIR/.last_backup_height"
# Get last backup height
local last_backup_height=0
if [[ -f "$last_backup_file" ]]; then
last_backup_height=$(cat "$last_backup_file")
fi
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
if [[ $current_height -le $last_backup_height ]]; then
log "No new blocks since last backup (height: $current_height)"
return 0
fi
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
# Export blocks since last backup
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
# Update last backup height
echo "$current_height" > "$last_backup_file"
log "Incremental backup created: $incremental_file"
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_dir="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
# Upload entire backup directory
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
local incremental=${3:-false}
log "Starting ledger backup process (incremental=$incremental)"
check_dependencies
create_backup_dir
local pods=($(get_blockchain_pods))
# Use the first ready pod for backup
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
wait_for_blockchain_node "$pod"
if [[ "$incremental" == "true" ]]; then
create_incremental_backup "$pod"
else
backup_ledger_data "$pod"
fi
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
upload_to_cloud "$backup_dir"
break
fi
done
cleanup_old_backups
log "Ledger backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/backup_postgresql.sh Executable file

@ -0,0 +1,172 @@
#!/bin/bash
# PostgreSQL Backup Script for AITBC
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/postgresql-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v pg_dump &> /dev/null; then
error "pg_dump is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get PostgreSQL pod name
get_postgresql_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for PostgreSQL to be ready
wait_for_postgresql() {
local pod=$1
log "Waiting for PostgreSQL pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if PostgreSQL is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
log "PostgreSQL is ready"
return 0
fi
sleep 2
((retries--))
done
error "PostgreSQL did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
log "Starting PostgreSQL backup to $backup_file"
# Get database credentials from secret
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
    # Perform the backup (set PGPASSWORD inside the pod; kubectl exec does not forward local env vars)
    kubectl exec -n "$NAMESPACE" "$pod" -- \
        env PGPASSWORD="$db_password" \
        pg_dump -U "$db_user" -h localhost -d "$db_name" \
        --verbose --clean --if-exists --create --format=custom \
        --file="/tmp/${BACKUP_NAME}.dump"
# Copy backup from pod
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
# Clean up remote backup file
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
# Compress backup
gzip "$backup_file"
backup_file="${backup_file}.gz"
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="postgresql/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting PostgreSQL backup process"
check_dependencies
create_backup_dir
local pod=$(get_postgresql_pod)
wait_for_postgresql "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
upload_to_cloud "$backup_file"
log "PostgreSQL backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/backup_redis.sh Executable file

@ -0,0 +1,189 @@
#!/bin/bash
# Redis Backup Script for AITBC
# Usage: ./backup_redis.sh [namespace] [backup_name]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
BACKUP_DIR="/tmp/redis-backups"
RETENTION_DAYS=30
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Create backup directory
create_backup_dir() {
mkdir -p "$BACKUP_DIR"
log "Created backup directory: $BACKUP_DIR"
}
# Get Redis pod name
get_redis_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find Redis pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for Redis to be ready
wait_for_redis() {
local pod=$1
log "Waiting for Redis pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if Redis is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "Redis is ready"
return 0
fi
sleep 2
((retries--))
done
error "Redis did not become ready within timeout"
exit 1
}
# Perform backup
perform_backup() {
local pod=$1
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
log "Starting Redis backup to $backup_file"
    # Record the timestamp of the last completed save, then trigger a background save
    local lastsave_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
    kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
    # Wait for background save to complete (LASTSAVE advances once BGSAVE finishes)
    log "Waiting for background save to complete..."
    local retries=60
    while [[ $retries -gt 0 ]]; do
        local lastsave_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
        if [[ "$lastsave_now" -gt "$lastsave_before" ]]; then
            log "Background save completed"
            break
        fi
        sleep 2
        ((retries--))
    done
if [[ $retries -eq 0 ]]; then
error "Background save did not complete within timeout"
exit 1
fi
# Copy RDB file from pod
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
# Also create an append-only file backup if enabled
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
if [[ "$aof_enabled" == "yes" ]]; then
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
log "AOF backup created: $aof_backup"
fi
log "Backup completed: $backup_file"
# Verify backup
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
local size=$(du -h "$backup_file" | cut -f1)
log "Backup size: $size"
else
error "Backup file is empty or missing"
exit 1
fi
}
# Clean old backups
cleanup_old_backups() {
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
log "Cleanup completed"
}
# Upload to cloud storage (optional)
upload_to_cloud() {
local backup_file="$1"
# Check if AWS CLI is configured
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
log "Uploading backup to S3"
local s3_bucket="aitbc-backups-${NAMESPACE}"
local s3_key="redis/$(basename "$backup_file")"
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
log "Backup uploaded to s3://$s3_bucket/$s3_key"
# Upload AOF file if exists
local aof_file="${backup_file%.rdb}.aof"
if [[ -f "$aof_file" ]]; then
local aof_key="redis/$(basename "$aof_file")"
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
fi
else
warn "AWS CLI not configured, skipping cloud upload"
fi
}
# Main execution
main() {
log "Starting Redis backup process"
check_dependencies
create_backup_dir
local pod=$(get_redis_pod)
wait_for_redis "$pod"
perform_backup "$pod"
cleanup_old_backups
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
upload_to_cloud "$backup_file"
log "Redis backup process completed successfully"
}
# Run main function
main "$@"

infra/scripts/chaos_orchestrator.py

@ -0,0 +1,342 @@
#!/usr/bin/env python3
"""
Chaos Testing Orchestrator
Runs multiple chaos test scenarios and aggregates MTTR metrics
"""
import asyncio
import argparse
import json
import logging
import subprocess
import sys
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosOrchestrator:
"""Orchestrates multiple chaos test scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.results = {
"orchestration_start": None,
"orchestration_end": None,
"scenarios": [],
"summary": {
"total_scenarios": 0,
"successful_scenarios": 0,
"failed_scenarios": 0,
"average_mttr": 0,
"max_mttr": 0,
"min_mttr": float('inf')
}
}
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
"""Run a single chaos test scenario"""
scenario_name = Path(script).stem.replace("chaos_test_", "")
logger.info(f"Running scenario: {scenario_name}")
cmd = ["python3", script] + args
start_time = time.time()
try:
# Run the chaos test script
process = await asyncio.create_subprocess_exec(
*cmd,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE
)
stdout, stderr = await process.communicate()
if process.returncode != 0:
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
logger.error(f"Error: {stderr.decode()}")
return None
# Find the results file
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
if not result_files:
logger.error(f"No results file found for scenario {scenario_name}")
return None
# Load the most recent result file
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
with open(result_file, 'r') as f:
results = json.load(f)
# Add execution metadata
results["execution_time"] = time.time() - start_time
results["scenario_name"] = scenario_name
logger.info(f"Scenario {scenario_name} completed successfully")
return results
except Exception as e:
logger.error(f"Failed to run scenario {scenario_name}: {e}")
return None
def calculate_summary_metrics(self):
"""Calculate summary metrics across all scenarios"""
mttr_values = []
for scenario in self.results["scenarios"]:
if scenario.get("mttr"):
mttr_values.append(scenario["mttr"])
        if mttr_values:
            self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
            self.results["summary"]["max_mttr"] = max(mttr_values)
            self.results["summary"]["min_mttr"] = min(mttr_values)
        else:
            # Keep the report JSON-serializable when no scenario produced an MTTR
            self.results["summary"]["min_mttr"] = 0
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
self.results["summary"]["successful_scenarios"] = sum(
1 for s in self.results["scenarios"] if s.get("mttr") is not None
)
self.results["summary"]["failed_scenarios"] = (
self.results["summary"]["total_scenarios"] -
self.results["summary"]["successful_scenarios"]
)
def generate_report(self, output_file: Optional[str] = None):
"""Generate a comprehensive chaos test report"""
report = {
"report_generated": datetime.utcnow().isoformat(),
"namespace": self.namespace,
"orchestration": self.results,
"recommendations": []
}
# Add recommendations based on results
if self.results["summary"]["average_mttr"] > 120:
report["recommendations"].append(
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
)
if self.results["summary"]["max_mttr"] > 300:
report["recommendations"].append(
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
)
if self.results["summary"]["failed_scenarios"] > 0:
report["recommendations"].append(
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
)
# Check for specific scenario issues
for scenario in self.results["scenarios"]:
if scenario.get("scenario_name") == "coordinator_outage":
if scenario.get("mttr", 0) > 180:
report["recommendations"].append(
"Coordinator recovery is slow. Consider reducing pod startup time."
)
elif scenario.get("scenario_name") == "network_partition":
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
report["recommendations"].append(
"High error rate during network partition. Improve error handling."
)
elif scenario.get("scenario_name") == "database_failure":
if scenario.get("failure_type") == "connection":
report["recommendations"].append(
"Consider implementing database connection pooling and retry logic."
)
# Save report
if output_file:
with open(output_file, 'w') as f:
json.dump(report, f, indent=2)
logger.info(f"Chaos test report saved to: {output_file}")
# Print summary
self.print_summary()
return report
def print_summary(self):
"""Print a summary of all chaos test results"""
print("\n" + "="*60)
print("CHAOS TESTING SUMMARY REPORT")
print("="*60)
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
print(f"Namespace: {self.namespace}")
print(f"\nScenario Results:")
print("-" * 40)
for scenario in self.results["scenarios"]:
name = scenario.get("scenario_name", "Unknown")
mttr = scenario.get("mttr", "N/A")
if mttr != "N/A":
mttr = f"{mttr:.2f}s"
print(f" {name:20} MTTR: {mttr}")
print(f"\nSummary Metrics:")
print("-" * 40)
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
print(f" Successful: {self.results['summary']['successful_scenarios']}")
print(f" Failed: {self.results['summary']['failed_scenarios']}")
if self.results["summary"]["average_mttr"] > 0:
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
# SLO compliance
print(f"\nSLO Compliance:")
print("-" * 40)
slo_target = 120 # 2 minutes
if self.results["summary"]["average_mttr"] <= slo_target:
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
else:
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
print("\n" + "="*60)
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
"""Run all specified chaos test scenarios"""
logger.info("Starting chaos testing orchestration")
self.results["orchestration_start"] = datetime.utcnow().isoformat()
for scenario in scenarios:
args = scenario_args.get(scenario, [])
# Add namespace to all scenarios
args.extend(["--namespace", self.namespace])
result = await self.run_scenario(scenario, args)
if result:
self.results["scenarios"].append(result)
self.results["orchestration_end"] = datetime.utcnow().isoformat()
# Calculate summary metrics
self.calculate_summary_metrics()
# Generate report
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_file = f"chaos_test_report_{timestamp}.json"
self.generate_report(report_file)
logger.info("Chaos testing orchestration completed")
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
"""Run chaos tests continuously over time"""
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
end_time = datetime.now() + timedelta(hours=duration_hours)
interval_seconds = interval_minutes * 60
all_results = []
while datetime.now() < end_time:
cycle_start = datetime.now()
logger.info(f"Starting chaos test cycle at {cycle_start}")
# Run a random scenario
scenarios = [
"chaos_test_coordinator.py",
"chaos_test_network.py",
"chaos_test_database.py"
]
import random
selected_scenario = random.choice(scenarios)
# Run scenario with reduced duration for continuous testing
args = ["--namespace", self.namespace]
if "coordinator" in selected_scenario:
args.extend(["--outage-duration", "30", "--load-duration", "60"])
elif "network" in selected_scenario:
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
elif "database" in selected_scenario:
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
result = await self.run_scenario(selected_scenario, args)
if result:
result["cycle_time"] = cycle_start.isoformat()
all_results.append(result)
# Wait for next cycle
elapsed = (datetime.now() - cycle_start).total_seconds()
if elapsed < interval_seconds:
wait_time = interval_seconds - elapsed
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
await asyncio.sleep(wait_time)
# Generate continuous testing report
continuous_report = {
"continuous_testing": True,
"duration_hours": duration_hours,
"interval_minutes": interval_minutes,
"total_cycles": len(all_results),
"cycles": all_results
}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_file = f"continuous_chaos_report_{timestamp}.json"
with open(report_file, 'w') as f:
json.dump(continuous_report, f, indent=2)
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
async def main():
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--scenarios", nargs="+",
choices=["coordinator", "network", "database"],
default=["coordinator", "network", "database"],
help="Scenarios to run")
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
orchestrator = ChaosOrchestrator(args.namespace)
if args.dry_run:
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
return
if args.continuous:
await orchestrator.run_continuous_chaos(args.duration, args.interval)
else:
# Map scenario names to script files
scenario_map = {
"coordinator": "chaos_test_coordinator.py",
"network": "chaos_test_network.py",
"database": "chaos_test_database.py"
}
# Get script files
scripts = [scenario_map[s] for s in args.scenarios]
# Default arguments for each scenario
scenario_args = {
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
}
await orchestrator.run_all_scenarios(scripts, scenario_args)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_coordinator.py

@ -0,0 +1,287 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Coordinator API Outage
Tests system resilience when coordinator API becomes unavailable
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestCoordinator:
"""Chaos testing for coordinator API outage scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"outage_start": None,
"outage_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "coordinator_outage"
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_coordinator_pods(self) -> List[str]:
"""Get list of coordinator pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get coordinator pods: {e}")
return []
def delete_coordinator_pods(self) -> bool:
"""Delete all coordinator pods to simulate outage"""
try:
cmd = [
"kubectl", "delete", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"--force", "--grace-period=0"
]
subprocess.run(cmd, check=True)
logger.info("Coordinator pods deleted successfully")
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to delete coordinator pods: {e}")
return False
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
"""Wait for all coordinator pods to terminate"""
start_time = time.time()
while time.time() - start_time < timeout:
pods = self.get_coordinator_pods()
if not pods:
logger.info("All coordinator pods terminated")
return True
await asyncio.sleep(2)
logger.error("Timeout waiting for pods to terminate")
return False
async def wait_for_recovery(self, timeout: int = 300) -> bool:
"""Wait for coordinator service to recover"""
start_time = time.time()
while time.time() - start_time < timeout:
try:
# Check if pods are running
pods = self.get_coordinator_pods()
if not pods:
await asyncio.sleep(5)
continue
# Check if at least one pod is ready
ready_cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
]
result = subprocess.run(ready_cmd, capture_output=True, text=True)
if result.stdout.strip():
# Test API health
if self.test_health_endpoint():
recovery_time = time.time() - start_time
self.metrics["recovery_time"] = recovery_time
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
return True
except Exception as e:
logger.debug(f"Recovery check failed: {e}")
await asyncio.sleep(5)
logger.error("Service did not recover within timeout")
return False
def test_health_endpoint(self) -> bool:
"""Test if coordinator health endpoint is responding"""
try:
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
service_url = f"http://{result.stdout.strip()}/v1/health"
# Test health endpoint
response = subprocess.run(
["curl", "-s", "--max-time", "5", service_url],
capture_output=True, text=True
)
return response.returncode == 0 and "ok" in response.stdout
except Exception:
return False
async def generate_load(self, duration: int, concurrent: int = 10):
"""Generate synthetic load on coordinator API"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
tasks = []
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
tasks.extend(batch)
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
"""Run the complete chaos test"""
logger.info("Starting coordinator outage chaos test")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Phase 1: Generate initial load
logger.info("Phase 1: Generating initial load")
await self.generate_load(30)
# Phase 2: Induce outage
logger.info("Phase 2: Inducing coordinator outage")
self.metrics["outage_start"] = datetime.utcnow().isoformat()
if not self.delete_coordinator_pods():
logger.error("Failed to induce outage")
return False
if not await self.wait_for_pods_termination():
logger.error("Pods did not terminate")
return False
# Wait for specified outage duration
logger.info(f"Waiting for {outage_duration} seconds outage duration")
await asyncio.sleep(outage_duration)
# Phase 3: Monitor recovery
logger.info("Phase 3: Monitoring service recovery")
self.metrics["outage_end"] = datetime.utcnow().isoformat()
if not await self.wait_for_recovery():
logger.error("Service did not recover")
return False
# Phase 4: Post-recovery load test
logger.info("Phase 4: Post-recovery load test")
await self.generate_load(load_duration)
# Calculate metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_coordinator_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestCoordinator(args.namespace) as test:
success = await test.run_test(args.outage_duration, args.load_duration)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_database.py

@ -0,0 +1,387 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Database Failure
Tests system resilience when PostgreSQL database becomes unavailable
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestDatabase:
"""Chaos testing for database failure scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"failure_start": None,
"failure_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "database_failure",
"failure_type": None
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_postgresql_pod(self) -> Optional[str]:
"""Get PostgreSQL pod name"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=postgresql",
"-o", "jsonpath={.items[0].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pod = result.stdout.strip()
return pod if pod else None
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get PostgreSQL pod: {e}")
return None
def simulate_database_connection_failure(self) -> bool:
"""Simulate database connection failure by blocking port 5432"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Block incoming connections to PostgreSQL
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
]
subprocess.run(cmd, check=True)
# Block outgoing connections from PostgreSQL
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
]
subprocess.run(cmd, check=True)
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
self.metrics["failure_type"] = "connection_blocked"
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to block PostgreSQL connections: {e}")
return False
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
"""Simulate high database latency using netem"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Add latency to PostgreSQL traffic
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
]
subprocess.run(cmd, check=True)
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
self.metrics["failure_type"] = "high_latency"
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to add latency to PostgreSQL: {e}")
return False
def restore_database(self) -> bool:
"""Restore database connections"""
pod = self.get_postgresql_pod()
if not pod:
return False
try:
# Remove iptables rules
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "INPUT"
]
subprocess.run(cmd, check=False) # May fail if rules don't exist
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "OUTPUT"
]
subprocess.run(cmd, check=False)
# Remove netem qdisc
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"tc", "qdisc", "del", "dev", "eth0", "root"
]
subprocess.run(cmd, check=False)
logger.info(f"Restored PostgreSQL connections on pod {pod}")
return True
except subprocess.CalledProcessError as e:
logger.error(f"Failed to restore PostgreSQL: {e}")
return False
async def test_database_connectivity(self) -> bool:
"""Test if coordinator can connect to database"""
try:
# Get coordinator pod
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[0].metadata.name}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
coordinator_pod = result.stdout.strip()
if not coordinator_pod:
return False
# Test database connection from coordinator
cmd = [
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
]
result = subprocess.run(cmd, capture_output=True, text=True)
return result.returncode == 0 and "OK" in result.stdout
except Exception:
return False
async def test_api_health(self) -> bool:
"""Test if coordinator API is healthy"""
try:
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
service_url = f"http://{result.stdout.strip()}/v1/health"
# Test health endpoint
response = subprocess.run(
["curl", "-s", "--max-time", "5", service_url],
capture_output=True, text=True
)
return response.returncode == 0 and "ok" in response.stdout
except Exception:
return False
async def generate_load(self, duration: int, concurrent: int = 10):
"""Generate synthetic load on coordinator API"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "coordinator",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
tasks = []
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
tasks.extend(batch)
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def wait_for_recovery(self, timeout: int = 300) -> bool:
"""Wait for database and API to recover"""
start_time = time.time()
while time.time() - start_time < timeout:
# Test database connectivity
db_connected = await self.test_database_connectivity()
# Test API health
api_healthy = await self.test_api_health()
if db_connected and api_healthy:
recovery_time = time.time() - start_time
self.metrics["recovery_time"] = recovery_time
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
return True
await asyncio.sleep(5)
logger.error("Database and API did not recover within timeout")
return False
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
"""Run the complete database chaos test"""
logger.info(f"Starting database chaos test - failure type: {failure_type}")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Phase 1: Baseline test
logger.info("Phase 1: Baseline connectivity test")
db_connected = await self.test_database_connectivity()
api_healthy = await self.test_api_health()
if not db_connected or not api_healthy:
logger.error("Baseline test failed - database or API not healthy")
return False
logger.info("Baseline: Database and API are healthy")
# Phase 2: Generate initial load
logger.info("Phase 2: Generating initial load")
await self.generate_load(30)
# Phase 3: Induce database failure
logger.info("Phase 3: Inducing database failure")
self.metrics["failure_start"] = datetime.utcnow().isoformat()
if failure_type == "connection":
if not self.simulate_database_connection_failure():
logger.error("Failed to induce database connection failure")
return False
elif failure_type == "latency":
if not self.simulate_database_high_latency():
logger.error("Failed to induce database latency")
return False
else:
logger.error(f"Unknown failure type: {failure_type}")
return False
# Verify failure is effective
await asyncio.sleep(5)
db_connected = await self.test_database_connectivity()
api_healthy = await self.test_api_health()
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
# Phase 4: Monitor during failure
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
# Generate load during failure
await self.generate_load(failure_duration)
# Phase 5: Restore database and monitor recovery
logger.info("Phase 5: Restoring database")
self.metrics["failure_end"] = datetime.utcnow().isoformat()
if not self.restore_database():
logger.error("Failed to restore database")
return False
# Wait for recovery
if not await self.wait_for_recovery():
logger.error("System did not recover after database restoration")
return False
# Phase 6: Post-recovery load test
logger.info("Phase 6: Post-recovery load test")
await self.generate_load(60)
# Final metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Database chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_database_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Failure Type: {self.metrics['failure_type']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for database failure")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestDatabase(args.namespace) as test:
success = await test.run_test(args.failure_type, args.failure_duration)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())

infra/scripts/chaos_test_network.py

@ -0,0 +1,372 @@
#!/usr/bin/env python3
"""
Chaos Testing Script - Network Partition
Tests system resilience when blockchain nodes experience network partitions
"""
import asyncio
import aiohttp
import argparse
import json
import time
import logging
import subprocess
import sys
from datetime import datetime
from typing import Dict, List, Optional
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ChaosTestNetwork:
"""Chaos testing for network partition scenarios"""
def __init__(self, namespace: str = "default"):
self.namespace = namespace
self.session = None
self.metrics = {
"test_start": None,
"test_end": None,
"partition_start": None,
"partition_end": None,
"recovery_time": None,
"mttr": None,
"error_count": 0,
"success_count": 0,
"scenario": "network_partition",
"affected_nodes": []
}
async def __aenter__(self):
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.session:
await self.session.close()
def get_blockchain_pods(self) -> List[str]:
"""Get list of blockchain node pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=blockchain-node",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get blockchain pods: {e}")
return []
def get_coordinator_pods(self) -> List[str]:
"""Get list of coordinator pods"""
cmd = [
"kubectl", "get", "pods",
"-n", self.namespace,
"-l", "app.kubernetes.io/name=coordinator",
"-o", "jsonpath={.items[*].metadata.name}"
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
pods = result.stdout.strip().split()
return pods
except subprocess.CalledProcessError as e:
logger.error(f"Failed to get coordinator pods: {e}")
return []
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
"""Apply network partition using iptables"""
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
for pod in pods:
if pod in target_pods:
continue
# Block traffic from this pod to target pods
for target_pod in target_pods:
try:
# Get target pod IP
cmd = [
"kubectl", "get", "pod", target_pod,
"-n", self.namespace,
"-o", "jsonpath={.status.podIP}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
target_ip = result.stdout.strip()
if not target_ip:
continue
# Apply iptables rule to block traffic
iptables_cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
]
subprocess.run(iptables_cmd, check=True)
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
except subprocess.CalledProcessError as e:
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
return False
self.metrics["affected_nodes"] = pods + target_pods
return True
def remove_network_partition(self, pods: List[str]) -> bool:
"""Remove network partition rules"""
logger.info("Removing network partition rules")
for pod in pods:
try:
# Flush OUTPUT chain (remove all rules)
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"iptables", "-F", "OUTPUT"
]
subprocess.run(cmd, check=True)
logger.info(f"Removed network rules from {pod}")
except subprocess.CalledProcessError as e:
logger.error(f"Failed to remove network rules from {pod}: {e}")
return False
return True
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
"""Test connectivity between pods"""
results = {}
for pod in pods:
try:
# Test if pod can reach coordinator
cmd = [
"kubectl", "exec", "-n", self.namespace, pod, "--",
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
]
result = subprocess.run(cmd, capture_output=True, text=True)
results[pod] = result.returncode == 0 and "ok" in result.stdout
except Exception:
results[pod] = False
return results
async def monitor_consensus(self, duration: int = 60) -> bool:
"""Monitor blockchain consensus health"""
logger.info(f"Monitoring consensus for {duration} seconds")
start_time = time.time()
last_height = 0
while time.time() - start_time < duration:
try:
# Get block height from a random pod
pods = self.get_blockchain_pods()
if not pods:
await asyncio.sleep(5)
continue
# Use first pod to check height
cmd = [
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
"curl", "-s", "http://localhost:8080/v1/blocks/head"
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
try:
data = json.loads(result.stdout)
current_height = data.get("height", 0)
# Check if blockchain is progressing
if current_height > last_height:
last_height = current_height
logger.info(f"Blockchain progressing, height: {current_height}")
elif time.time() - start_time > 30: # Allow 30s for initial sync
logger.warning(f"Blockchain stuck at height {current_height}")
except json.JSONDecodeError:
pass
except Exception as e:
logger.debug(f"Consensus check failed: {e}")
await asyncio.sleep(5)
return last_height > 0
async def generate_load(self, duration: int, concurrent: int = 5):
"""Generate synthetic load on blockchain nodes"""
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
# Get service URL
cmd = [
"kubectl", "get", "svc", "blockchain-node",
"-n", self.namespace,
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
base_url = f"http://{result.stdout.strip()}"
start_time = time.time()
tasks = []
async def make_request():
try:
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
if response.status == 200:
self.metrics["success_count"] += 1
else:
self.metrics["error_count"] += 1
except Exception:
self.metrics["error_count"] += 1
while time.time() - start_time < duration:
# Create batch of requests
batch = [make_request() for _ in range(concurrent)]
tasks.extend(batch)
# Wait for batch to complete
await asyncio.gather(*batch, return_exceptions=True)
# Brief pause
await asyncio.sleep(1)
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
"""Run the complete network partition chaos test"""
logger.info("Starting network partition chaos test")
self.metrics["test_start"] = datetime.utcnow().isoformat()
# Get all blockchain pods
all_pods = self.get_blockchain_pods()
if not all_pods:
logger.error("No blockchain pods found")
return False
# Determine which pods to partition
num_partition = int(len(all_pods) * partition_ratio)
partition_pods = all_pods[:num_partition]
remaining_pods = all_pods[num_partition:]
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
# Phase 1: Baseline test
logger.info("Phase 1: Baseline connectivity test")
baseline_connectivity = await self.test_connectivity(all_pods)
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
# Phase 2: Generate initial load
logger.info("Phase 2: Generating initial load")
await self.generate_load(30)
# Phase 3: Apply network partition
logger.info("Phase 3: Applying network partition")
self.metrics["partition_start"] = datetime.utcnow().isoformat()
if not self.apply_network_partition(remaining_pods, partition_pods):
logger.error("Failed to apply network partition")
return False
# Verify partition is effective
await asyncio.sleep(5)
partitioned_connectivity = await self.test_connectivity(all_pods)
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
# Phase 4: Monitor during partition
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
consensus_healthy = await self.monitor_consensus(partition_duration)
# Phase 5: Remove partition and monitor recovery
logger.info("Phase 5: Removing network partition")
self.metrics["partition_end"] = datetime.utcnow().isoformat()
if not self.remove_network_partition(all_pods):
logger.error("Failed to remove network partition")
return False
# Wait for recovery
logger.info("Waiting for network recovery...")
await asyncio.sleep(10)
# Test connectivity after recovery
recovery_connectivity = await self.test_connectivity(all_pods)
# Calculate recovery metrics (partition_end was recorded as a naive UTC timestamp)
all_connected = all(recovery_connectivity.values())
if all_connected:
    partition_end_dt = datetime.fromisoformat(self.metrics["partition_end"])
    self.metrics["recovery_time"] = (datetime.utcnow() - partition_end_dt).total_seconds()
    logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
# Phase 6: Post-recovery load test
logger.info("Phase 6: Post-recovery load test")
await self.generate_load(60)
# Final metrics
self.metrics["test_end"] = datetime.utcnow().isoformat()
self.metrics["mttr"] = self.metrics["recovery_time"]
# Save results
self.save_results()
logger.info("Network partition chaos test completed successfully")
return True
def save_results(self):
"""Save test results to file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"chaos_test_network_{timestamp}.json"
with open(filename, "w") as f:
json.dump(self.metrics, f, indent=2)
logger.info(f"Test results saved to: {filename}")
# Print summary
print("\n=== Chaos Test Summary ===")
print(f"Scenario: {self.metrics['scenario']}")
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
print(f"Success Requests: {self.metrics['success_count']}")
print(f"Error Requests: {self.metrics['error_count']}")
async def main():
parser = argparse.ArgumentParser(description="Chaos test for network partition")
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
args = parser.parse_args()
if args.dry_run:
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
return
# Verify kubectl is available
try:
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
logger.error("kubectl is not available or not configured")
sys.exit(1)
# Run test
async with ChaosTestNetwork(args.namespace) as test:
success = await test.run_test(args.partition_duration, args.partition_ratio)
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,279 @@
#!/bin/bash
# Ledger Storage Restore Script for AITBC
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_DIR=${2:-}
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v jq &> /dev/null; then
error "jq is not installed or not in PATH"
exit 1
fi
}
# Validate backup directory
validate_backup_dir() {
if [[ -z "$BACKUP_DIR" ]]; then
error "Backup directory must be specified"
echo "Usage: $0 [namespace] [backup_directory]"
exit 1
fi
if [[ ! -d "$BACKUP_DIR" ]]; then
error "Backup directory not found: $BACKUP_DIR"
exit 1
fi
# Check for required files
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
error "metadata.json not found in backup directory"
exit 1
fi
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
error "chain.tar.gz not found in backup directory"
exit 1
fi
log "Using backup directory: $BACKUP_DIR"
}
# Get blockchain node pods
get_blockchain_pods() {
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pods" ]]; then
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pods" ]]; then
error "Could not find blockchain node pods in namespace $NAMESPACE"
exit 1
fi
echo $pods
}
# Create backup of current ledger before restore
create_pre_restore_backup() {
local pods=($1)
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
warn "Creating backup of current ledger before restore..."
mkdir -p "$pre_restore_dir"
# Use the first ready pod
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
# Create metadata
cat > "$pre_restore_dir/metadata.json" << EOF
{
"backup_name": "$pre_restore_backup",
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"namespace": "$NAMESPACE",
"source_pod": "$pod",
"latest_block_height": $current_height,
"backup_type": "pre-restore"
}
EOF
# Backup data directories
local data_dirs=("chain" "wallets" "receipts")
for dir in "${data_dirs[@]}"; do
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
fi
done
log "Pre-restore backup created: $pre_restore_dir"
break
fi
done
}
# Perform restore
perform_restore() {
local pods=($1)
warn "This will replace all current ledger data. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Scale down blockchain nodes
info "Scaling down blockchain node deployment..."
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
# Wait for pods to terminate
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
# Scale up blockchain nodes
info "Scaling up blockchain node deployment..."
kubectl scale deployment blockchain-node --replicas=3 -n "$NAMESPACE"
# Wait for pods to be ready
local ready_pods=()
local retries=30
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
local all_pods=$(get_blockchain_pods)
for pod in $all_pods; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
ready_pods+=("$pod")
fi
done
if [[ ${#ready_pods[@]} -eq 0 ]]; then
sleep 5
((retries--))
fi
done
if [[ ${#ready_pods[@]} -eq 0 ]]; then
error "No blockchain nodes became ready after restore"
exit 1
fi
# Restore data to all ready pods
for pod in "${ready_pods[@]}"; do
info "Restoring ledger data to pod $pod..."
# Create temp directory on pod
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
# Extract and copy chain data
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
fi
# Extract and copy wallet data
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
fi
# Extract and copy receipt data
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
fi
# Set correct permissions
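# NOTE: assumes the blockchain-node image defines an "app" user/group; adjust to match the chart's securityContext.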
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
# Clean up temp directory
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
log "Ledger data restored to pod $pod"
done
log "Ledger restore completed successfully"
}
# Verify restore
verify_restore() {
local pods=($1)
log "Verifying ledger restore..."
# Read backup metadata
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
log "Backup contains blocks up to height: $backup_height"
# Verify on each pod
for pod in "${pods[@]}"; do
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
# Check if node is responding
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
# Get current block height
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
if [[ "$current_height" -eq "$backup_height" ]]; then
log "✓ Pod $pod: Block height matches backup ($current_height)"
else
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
fi
# Check data directories
local dirs=("chain" "wallets" "receipts")
for dir in "${dirs[@]}"; do
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
log "✓ Pod $pod: $dir directory contains $file_count files"
else
warn "⚠ Pod $pod: $dir directory not found"
fi
done
else
error "✗ Pod $pod: Not responding to health checks"
fi
fi
done
}
# Main execution
main() {
log "Starting ledger restore process"
check_dependencies
validate_backup_dir
local pods=($(get_blockchain_pods))
create_pre_restore_backup "${pods[*]}"
perform_restore "${pods[*]}"
# Get updated pod list after restore
pods=($(get_blockchain_pods))
verify_restore "${pods[*]}"
log "Ledger restore process completed successfully"
warn "Please verify blockchain synchronization and application functionality"
}
# Run main function
main "$@"


@@ -0,0 +1,228 @@
#!/bin/bash
# PostgreSQL Restore Script for AITBC
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_FILE=${2:-}
BACKUP_DIR="/tmp/postgresql-backups"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
if ! command -v pg_restore &> /dev/null; then
error "pg_restore is not installed or not in PATH"
exit 1
fi
}
# Validate backup file
validate_backup_file() {
if [[ -z "$BACKUP_FILE" ]]; then
error "Backup file must be specified"
echo "Usage: $0 [namespace] [backup_file]"
exit 1
fi
# If file doesn't exist locally, try to find it in backup dir
if [[ ! -f "$BACKUP_FILE" ]]; then
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
if [[ -f "$potential_file" ]]; then
BACKUP_FILE="$potential_file"
else
error "Backup file not found: $BACKUP_FILE"
exit 1
fi
fi
# Check if file is gzipped and decompress if needed
if [[ "$BACKUP_FILE" == *.gz ]]; then
info "Decompressing backup file..."
local decompressed="/tmp/restore_$(date +%s).dump"
gunzip -c "$BACKUP_FILE" > "$decompressed"
BACKUP_FILE="$decompressed"
fi
log "Using backup file: $BACKUP_FILE"
}
# Get PostgreSQL pod name
get_postgresql_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Wait for PostgreSQL to be ready
wait_for_postgresql() {
local pod=$1
log "Waiting for PostgreSQL pod $pod to be ready..."
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
# Check if PostgreSQL is accepting connections
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
log "PostgreSQL is ready"
return 0
fi
sleep 2
((retries--))
done
error "PostgreSQL did not become ready within timeout"
exit 1
}
# Create backup of current database before restore
create_pre_restore_backup() {
local pod=$1
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
warn "Creating backup of current database before restore..."
# Get database credentials
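# NOTE: assumes the coordinator chart creates a "coordinator-postgresql" secret with username/password/database keys; defaults are used if the secret is absent.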
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Create backup
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
pg_dump -U "$db_user" -h localhost -d "$db_name" \
--format=custom --file="/tmp/${pre_restore_backup}.dump"
# Copy backup locally
mkdir -p "$BACKUP_DIR"
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
}
# Perform restore
perform_restore() {
local pod=$1
warn "This will replace the current database. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Get database credentials
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Copy backup file to pod
local remote_backup="/tmp/restore_$(date +%s).dump"
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
# Drop existing database and recreate
log "Dropping existing database..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"
log "Creating new database..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"
# Restore database
log "Restoring database from backup..."
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
pg_restore -U "$db_user" -h localhost -d "$db_name" \
--verbose --clean --if-exists "$remote_backup"
# Clean up remote file
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
log "Database restore completed successfully"
}
# Verify restore
verify_restore() {
local pod=$1
log "Verifying database restore..."
# Get database credentials
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
# Check table count
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
log "Database contains $table_count tables"
# Check if key tables exist
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
for table in "${key_tables[@]}"; do
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
if [[ "$exists" == "t" ]]; then
log "✓ Table $table exists"
else
warn "⚠ Table $table not found"
fi
done
}
# Main execution
main() {
log "Starting PostgreSQL restore process"
check_dependencies
validate_backup_file
local pod=$(get_postgresql_pod)
wait_for_postgresql "$pod"
create_pre_restore_backup "$pod"
perform_restore "$pod"
verify_restore "$pod"
log "PostgreSQL restore process completed successfully"
warn "Please verify application functionality after restore"
}
# Run main function
main "$@"


@@ -0,0 +1,223 @@
#!/bin/bash
# Redis Restore Script for AITBC
# Usage: ./restore_redis.sh [namespace] [backup_file]
set -euo pipefail
# Configuration
NAMESPACE=${1:-default}
BACKUP_FILE=${2:-}
BACKUP_DIR="/tmp/redis-backups"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
}
info() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
}
# Check dependencies
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi
}
# Validate backup file
validate_backup_file() {
if [[ -z "$BACKUP_FILE" ]]; then
error "Backup file must be specified"
echo "Usage: $0 [namespace] [backup_file]"
exit 1
fi
# If file doesn't exist locally, try to find it in backup dir
if [[ ! -f "$BACKUP_FILE" ]]; then
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
if [[ -f "$potential_file" ]]; then
BACKUP_FILE="$potential_file"
else
error "Backup file not found: $BACKUP_FILE"
exit 1
fi
fi
log "Using backup file: $BACKUP_FILE"
}
# Get Redis pod name
get_redis_pod() {
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [[ -z "$pod" ]]; then
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
fi
if [[ -z "$pod" ]]; then
error "Could not find Redis pod in namespace $NAMESPACE"
exit 1
fi
echo "$pod"
}
# Create backup of current Redis data before restore
create_pre_restore_backup() {
local pod=$1
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
warn "Creating backup of current Redis data before restore..."
# Record the last successful save time, then trigger a background save
local last_save_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
# Wait for the background save to complete (LASTSAVE advances once BGSAVE finishes)
local retries=60
while [[ $retries -gt 0 ]]; do
local last_save_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
if [[ "$last_save_now" -gt "$last_save_before" ]]; then
break
fi
sleep 2
((retries--))
done
# Copy backup locally
mkdir -p "$BACKUP_DIR"
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"
# Also backup AOF if exists
if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
fi
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
}
# Perform restore
perform_restore() {
local pod=$1
warn "This will replace all current Redis data. Are you sure? (y/N)"
read -r response
if [[ ! "$response" =~ ^[Yy]$ ]]; then
log "Restore cancelled by user"
exit 0
fi
# Scale down Redis to ensure clean restore
info "Scaling down Redis deployment..."
kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"
# Wait for pod to terminate
kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s
# Scale up Redis
info "Scaling up Redis deployment..."
kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"
# Wait for new pod to be ready
local new_pod=$(get_redis_pod)
kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s
# Stop Redis server
info "Stopping Redis server..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE
# Clear existing data
info "Clearing existing Redis data..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof
# Copy backup file
info "Copying backup file..."
local remote_file="/data/restore.rdb"
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"
# Set correct permissions
kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"
# Start Redis server
info "Starting Redis server..."
kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes
# Wait for Redis to be ready
local retries=30
while [[ $retries -gt 0 ]]; do
if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "Redis is ready"
break
fi
sleep 2
((retries--))
done
if [[ $retries -eq 0 ]]; then
error "Redis did not start properly after restore"
exit 1
fi
log "Redis restore completed successfully"
}
# Verify restore
verify_restore() {
local pod=$1
log "Verifying Redis restore..."
# Check database size
local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
log "Database contains $db_size keys"
# Check memory usage
local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
log "Memory usage: $memory"
# Check if Redis is responding to commands
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
log "✓ Redis is responding normally"
else
error "✗ Redis is not responding"
exit 1
fi
}
# Main execution
main() {
log "Starting Redis restore process"
check_dependencies
validate_backup_file
local pod=$(get_redis_pod)
create_pre_restore_backup "$pod"
perform_restore "$pod"
# Get new pod name after restore
pod=$(get_redis_pod)
verify_restore "$pod"
log "Redis restore process completed successfully"
warn "Please verify application functionality after restore"
}
# Run main function
main "$@"


@@ -0,0 +1,25 @@
# Development environment configuration (Terragrunt; sources the shared kubernetes module)
terraform {
source = "../../modules/kubernetes"
}
include "root" {
path = find_in_parent_folders()
}
inputs = {
cluster_name = "aitbc-dev"
environment = "dev"
aws_region = "us-west-2"
vpc_cidr = "10.0.0.0/16"
private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24"]
public_subnet_cidrs = ["10.0.101.0/24", "10.0.102.0/24"]
availability_zones = ["us-west-2a", "us-west-2b"]
kubernetes_version = "1.28"
enable_public_endpoint = true
desired_node_count = 2
min_node_count = 1
max_node_count = 3
instance_types = ["t3.medium"]
}


@@ -0,0 +1,199 @@
# Kubernetes cluster module for AITBC infrastructure
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.20"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.10"
}
}
}
provider "aws" {
region = var.aws_region
}
# VPC for the cluster
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.cluster_name}-vpc"
Environment = var.environment
Project = "aitbc"
}
}
# Subnets
resource "aws_subnet" "private" {
count = length(var.private_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.cluster_name}-private-${count.index}"
Environment = var.environment
"kubernetes.io/cluster/${var.cluster_name}" = "shared"
"kubernetes.io/role/internal-elb" = "1"
}
}
resource "aws_subnet" "public" {
count = length(var.public_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.public_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.cluster_name}-public-${count.index}"
Environment = var.environment
"kubernetes.io/cluster/${var.cluster_name}" = "shared"
"kubernetes.io/role/elb" = "1"
}
}
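# NOTE: this module defines only the VPC and subnets. An internet gateway, NAT gateway, and
# route tables (aws_internet_gateway, aws_nat_gateway, aws_route_table) are not created here and
# must be added or supplied by the surrounding environment so nodes in the private subnets can
# reach the EKS API and pull container images.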
# EKS Cluster
resource "aws_eks_cluster" "main" {
name = var.cluster_name
role_arn = aws_iam_role.cluster.arn
version = var.kubernetes_version
vpc_config {
subnet_ids = concat(
aws_subnet.private[*].id,
aws_subnet.public[*].id
)
endpoint_private_access = true
endpoint_public_access = var.enable_public_endpoint
}
depends_on = [
aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy
]
tags = {
Name = var.cluster_name
Environment = var.environment
Project = "aitbc"
}
}
# Node groups
resource "aws_eks_node_group" "main" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.cluster_name}-main"
node_role_arn = aws_iam_role.node.arn
subnet_ids = aws_subnet.private[*].id
scaling_config {
desired_size = var.desired_node_count
max_size = var.max_node_count
min_size = var.min_node_count
}
instance_types = var.instance_types
depends_on = [
aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly
]
tags = {
Name = "${var.cluster_name}-main"
Environment = var.environment
Project = "aitbc"
}
}
# IAM roles
resource "aws_iam_role" "cluster" {
name = "${var.cluster_name}-cluster"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "eks.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role" "node" {
name = "${var.cluster_name}-node"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
}
# IAM policy attachments
resource "aws_iam_role_policy_attachment" "cluster_AmazonEKSClusterPolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.cluster.name
}
resource "aws_iam_role_policy_attachment" "node_AmazonEKSWorkerNodePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.node.name
}
resource "aws_iam_role_policy_attachment" "node_AmazonEKS_CNI_Policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.node.name
}
resource "aws_iam_role_policy_attachment" "node_AmazonEC2ContainerRegistryReadOnly" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.node.name
}
# Outputs
output "cluster_name" {
description = "The name of the EKS cluster"
value = aws_eks_cluster.main.name
}
output "cluster_endpoint" {
description = "The endpoint for the EKS cluster"
value = aws_eks_cluster.main.endpoint
}
output "cluster_certificate_authority_data" {
description = "The certificate authority data for the EKS cluster"
value = aws_eks_cluster.main.certificate_authority[0].data
}
output "cluster_security_group_id" {
description = "The security group ID of the EKS cluster"
value = aws_eks_cluster.main.vpc_config[0].cluster_security_group_id
}


@@ -0,0 +1,75 @@
variable "cluster_name" {
description = "Name of the EKS cluster"
type = string
}
variable "environment" {
description = "Environment name (dev, staging, prod)"
type = string
}
variable "aws_region" {
description = "AWS region"
type = string
default = "us-west-2"
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "private_subnet_cidrs" {
description = "CIDR blocks for private subnets"
type = list(string)
default = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}
variable "public_subnet_cidrs" {
description = "CIDR blocks for public subnets"
type = list(string)
default = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}
variable "availability_zones" {
description = "Availability zones"
type = list(string)
default = ["us-west-2a", "us-west-2b", "us-west-2c"]
}
variable "kubernetes_version" {
description = "Kubernetes version"
type = string
default = "1.28"
}
variable "enable_public_endpoint" {
description = "Enable public EKS endpoint"
type = bool
default = false
}
variable "desired_node_count" {
description = "Desired number of worker nodes"
type = number
default = 3
}
variable "min_node_count" {
description = "Minimum number of worker nodes"
type = number
default = 1
}
variable "max_node_count" {
description = "Maximum number of worker nodes"
type = number
default = 10
}
variable "instance_types" {
description = "EC2 instance types for worker nodes"
type = list(string)
default = ["m5.large", "m5.xlarge"]
}