feat: add marketplace metrics, privacy features, and service registry endpoints
- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels - Implement confidential transaction models with encryption support and access control - Add key management system with registration, rotation, and audit logging - Create services and registry routers for service discovery and management - Integrate ZK proof generation for privacy-preserving receipts - Add metrics instru
This commit is contained in:
158
infra/README.md
Normal file
158
infra/README.md
Normal file
@ -0,0 +1,158 @@
|
||||
# AITBC Infrastructure Templates
|
||||
|
||||
This directory contains Terraform and Helm templates for deploying AITBC services across dev, staging, and production environments.
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
infra/
|
||||
├── terraform/ # Infrastructure as Code
|
||||
│ ├── modules/ # Reusable Terraform modules
|
||||
│ │ └── kubernetes/ # EKS cluster module
|
||||
│ └── environments/ # Environment-specific configurations
|
||||
│ ├── dev/
|
||||
│ ├── staging/
|
||||
│ └── prod/
|
||||
└── helm/ # Helm Charts
|
||||
├── charts/ # Application charts
|
||||
│ ├── coordinator/ # Coordinator API chart
|
||||
│ ├── blockchain-node/ # Blockchain node chart
|
||||
│ └── monitoring/ # Monitoring stack (Prometheus, Grafana)
|
||||
└── values/ # Environment-specific values
|
||||
├── dev.yaml
|
||||
├── staging.yaml
|
||||
└── prod.yaml
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Terraform >= 1.0
|
||||
- Helm >= 3.0
|
||||
- kubectl configured for your cluster
|
||||
- AWS CLI configured (for EKS)
|
||||
|
||||
### Deploy Development Environment
|
||||
|
||||
1. **Provision Infrastructure with Terraform:**
|
||||
```bash
|
||||
cd infra/terraform/environments/dev
|
||||
terraform init
|
||||
terraform apply
|
||||
```
|
||||
|
||||
2. **Configure kubectl:**
|
||||
```bash
|
||||
aws eks update-kubeconfig --name aitbc-dev --region us-west-2
|
||||
```
|
||||
|
||||
3. **Deploy Applications with Helm:**
|
||||
```bash
|
||||
# Add required Helm repositories
|
||||
helm repo add bitnami https://charts.bitnami.com/bitnami
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm repo add grafana https://grafana.github.io/helm-charts
|
||||
helm repo update
|
||||
|
||||
# Deploy monitoring stack
|
||||
helm install monitoring ../../helm/charts/monitoring -f ../../helm/values/dev.yaml
|
||||
|
||||
# Deploy coordinator API
|
||||
helm install coordinator ../../helm/charts/coordinator -f ../../helm/values/dev.yaml
|
||||
```
|
||||
|
||||
### Environment Configurations
|
||||
|
||||
#### Development
|
||||
- 1 replica per service
|
||||
- Minimal resource allocation
|
||||
- Public EKS endpoint enabled
|
||||
- 7-day metrics retention
|
||||
|
||||
#### Staging
|
||||
- 2-3 replicas per service
|
||||
- Moderate resource allocation
|
||||
- Autoscaling enabled
|
||||
- 30-day metrics retention
|
||||
- TLS with staging certificates
|
||||
|
||||
#### Production
|
||||
- 3+ replicas per service
|
||||
- High resource allocation
|
||||
- Full autoscaling configuration
|
||||
- 90-day metrics retention
|
||||
- TLS with production certificates
|
||||
- Network policies enabled
|
||||
- Backup configuration enabled
|
||||
|
||||
## Monitoring
|
||||
|
||||
The monitoring stack includes:
|
||||
- **Prometheus**: Metrics collection and storage
|
||||
- **Grafana**: Visualization dashboards
|
||||
- **AlertManager**: Alert routing and notification
|
||||
|
||||
Access Grafana:
|
||||
```bash
|
||||
kubectl port-forward svc/monitoring-grafana 3000:3000
|
||||
# Open http://localhost:3000
|
||||
# Default credentials: admin/admin (check values files for environment-specific passwords)
|
||||
```
|
||||
|
||||
## Scaling Guidelines
|
||||
|
||||
Based on benchmark results (`apps/blockchain-node/scripts/benchmark_throughput.py`):
|
||||
|
||||
- **Coordinator API**: Scale horizontally at ~500 TPS per node
|
||||
- **Blockchain Node**: Scale horizontally at ~1000 TPS per node
|
||||
- **Wallet Daemon**: Scale based on concurrent users
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Private subnets for all application workloads
|
||||
- Network policies restrict traffic between services
|
||||
- Secrets managed via Kubernetes Secrets
|
||||
- TLS termination at ingress level
|
||||
- Pod Security Policies enforced in production
|
||||
|
||||
## Backup and Recovery
|
||||
|
||||
- Automated daily backups of PostgreSQL databases
|
||||
- EBS snapshots for persistent volumes
|
||||
- Cross-region replication for production data
|
||||
- Restore procedures documented in runbooks
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
- Use Spot instances for non-critical workloads
|
||||
- Implement cluster autoscaling
|
||||
- Right-size resources based on metrics
|
||||
- Schedule non-production environments to run only during business hours
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issues and solutions:
|
||||
|
||||
1. **Helm chart fails to install:**
|
||||
- Check if all dependencies are added
|
||||
- Verify kubectl context is correct
|
||||
- Review values files for syntax errors
|
||||
|
||||
2. **Prometheus not scraping metrics:**
|
||||
- Verify ServiceMonitor CRDs are installed
|
||||
- Check service annotations
|
||||
- Review network policies
|
||||
|
||||
3. **High memory usage:**
|
||||
- Review resource limits in values files
|
||||
- Check for memory leaks in applications
|
||||
- Consider increasing node size
|
||||
|
||||
## Contributing
|
||||
|
||||
When adding new services:
|
||||
1. Create a new Helm chart in `helm/charts/`
|
||||
2. Add environment-specific values in `helm/values/`
|
||||
3. Update monitoring configuration to include new service metrics
|
||||
4. Document any special requirements in this README
|
||||
64
infra/helm/charts/blockchain-node/hpa.yaml
Normal file
64
infra/helm/charts/blockchain-node/hpa.yaml
Normal file
@ -0,0 +1,64 @@
|
||||
{{- if .Values.autoscaling.enabled }}
|
||||
apiVersion: autoscaling/v2
|
||||
kind: HorizontalPodAutoscaler
|
||||
metadata:
|
||||
name: {{ include "aitbc-blockchain-node.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-blockchain-node.labels" . | nindent 4 }}
|
||||
spec:
|
||||
scaleTargetRef:
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
name: {{ include "aitbc-blockchain-node.fullname" . }}
|
||||
minReplicas: {{ .Values.autoscaling.minReplicas }}
|
||||
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
|
||||
metrics:
|
||||
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: cpu
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: memory
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
{{- end }}
|
||||
# Custom metrics for blockchain-specific scaling
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: blockchain_transaction_queue_depth
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "100"
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: blockchain_pending_transactions
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "500"
|
||||
behavior:
|
||||
scaleDown:
|
||||
stabilizationWindowSeconds: 600 # Longer stabilization for blockchain
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 5
|
||||
periodSeconds: 60
|
||||
scaleUp:
|
||||
stabilizationWindowSeconds: 60
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 50
|
||||
periodSeconds: 60
|
||||
- type: Pods
|
||||
value: 2
|
||||
periodSeconds: 60
|
||||
selectPolicy: Max
|
||||
{{- end }}
|
||||
11
infra/helm/charts/coordinator/Chart.yaml
Normal file
11
infra/helm/charts/coordinator/Chart.yaml
Normal file
@ -0,0 +1,11 @@
|
||||
apiVersion: v2
|
||||
name: aitbc-coordinator
|
||||
description: AITBC Coordinator API Helm Chart
|
||||
type: application
|
||||
version: 0.1.0
|
||||
appVersion: "0.1.0"
|
||||
dependencies:
|
||||
- name: postgresql
|
||||
version: 12.x.x
|
||||
repository: https://charts.bitnami.com/bitnami
|
||||
condition: postgresql.enabled
|
||||
62
infra/helm/charts/coordinator/templates/_helpers.tpl
Normal file
62
infra/helm/charts/coordinator/templates/_helpers.tpl
Normal file
@ -0,0 +1,62 @@
|
||||
{{/*
|
||||
Expand the name of the chart.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.name" -}}
|
||||
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create a default fully qualified app name.
|
||||
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
|
||||
If release name contains chart name it will be used as a full name.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.fullname" -}}
|
||||
{{- if .Values.fullnameOverride }}
|
||||
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
|
||||
{{- else }}
|
||||
{{- $name := default .Chart.Name .Values.nameOverride }}
|
||||
{{- if contains $name .Release.Name }}
|
||||
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
|
||||
{{- else }}
|
||||
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create chart name and version as used by the chart label.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.chart" -}}
|
||||
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Common labels
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.labels" -}}
|
||||
helm.sh/chart: {{ include "aitbc-coordinator.chart" . }}
|
||||
{{ include "aitbc-coordinator.selectorLabels" . }}
|
||||
{{- if .Chart.AppVersion }}
|
||||
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
|
||||
{{- end }}
|
||||
app.kubernetes.io/managed-by: {{ .Release.Service }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Selector labels
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.selectorLabels" -}}
|
||||
app.kubernetes.io/name: {{ include "aitbc-coordinator.name" . }}
|
||||
app.kubernetes.io/instance: {{ .Release.Name }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create the name of the service account to use
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.serviceAccountName" -}}
|
||||
{{- if .Values.serviceAccount.create }}
|
||||
{{- default (include "aitbc-coordinator.fullname" .) .Values.serviceAccount.name }}
|
||||
{{- else }}
|
||||
{{- default "default" .Values.serviceAccount.name }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
90
infra/helm/charts/coordinator/templates/deployment.yaml
Normal file
90
infra/helm/charts/coordinator/templates/deployment.yaml
Normal file
@ -0,0 +1,90 @@
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
{{- if not .Values.autoscaling.enabled }}
|
||||
replicas: {{ .Values.replicaCount }}
|
||||
{{- end }}
|
||||
selector:
|
||||
matchLabels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
|
||||
template:
|
||||
metadata:
|
||||
annotations:
|
||||
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
|
||||
{{- with .Values.podAnnotations }}
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 8 }}
|
||||
spec:
|
||||
{{- with .Values.imagePullSecrets }}
|
||||
imagePullSecrets:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
serviceAccountName: {{ include "aitbc-coordinator.serviceAccountName" . }}
|
||||
securityContext:
|
||||
{{- toYaml .Values.podSecurityContext | nindent 8 }}
|
||||
containers:
|
||||
- name: {{ .Chart.Name }}
|
||||
securityContext:
|
||||
{{- toYaml .Values.securityContext | nindent 12 }}
|
||||
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
|
||||
imagePullPolicy: {{ .Values.image.pullPolicy }}
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: {{ .Values.service.targetPort }}
|
||||
protocol: TCP
|
||||
livenessProbe:
|
||||
{{- toYaml .Values.livenessProbe | nindent 12 }}
|
||||
readinessProbe:
|
||||
{{- toYaml .Values.readinessProbe | nindent 12 }}
|
||||
resources:
|
||||
{{- toYaml .Values.resources | nindent 12 }}
|
||||
env:
|
||||
- name: APP_ENV
|
||||
value: {{ .Values.config.appEnv }}
|
||||
- name: DATABASE_URL
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: database-url
|
||||
- name: ALLOW_ORIGINS
|
||||
value: {{ .Values.config.allowOrigins | quote }}
|
||||
{{- if .Values.config.receiptSigningKeyHex }}
|
||||
- name: RECEIPT_SIGNING_KEY_HEX
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: receipt-signing-key
|
||||
{{- end }}
|
||||
{{- if .Values.config.receiptAttestationKeyHex }}
|
||||
- name: RECEIPT_ATTESTATION_KEY_HEX
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: receipt-attestation-key
|
||||
{{- end }}
|
||||
volumeMounts:
|
||||
- name: config
|
||||
mountPath: /app/.env
|
||||
subPath: .env
|
||||
volumes:
|
||||
- name: config
|
||||
configMap:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
{{- with .Values.nodeSelector }}
|
||||
nodeSelector:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
{{- with .Values.affinity }}
|
||||
affinity:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
{{- with .Values.tolerations }}
|
||||
tolerations:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
60
infra/helm/charts/coordinator/templates/hpa.yaml
Normal file
60
infra/helm/charts/coordinator/templates/hpa.yaml
Normal file
@ -0,0 +1,60 @@
|
||||
{{- if .Values.autoscaling.enabled }}
|
||||
apiVersion: autoscaling/v2
|
||||
kind: HorizontalPodAutoscaler
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
scaleTargetRef:
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
minReplicas: {{ .Values.autoscaling.minReplicas }}
|
||||
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
|
||||
metrics:
|
||||
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: cpu
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: memory
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.customMetrics }}
|
||||
{{- range .Values.autoscaling.customMetrics }}
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: {{ .name }}
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: {{ .targetValue }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
behavior:
|
||||
scaleDown:
|
||||
stabilizationWindowSeconds: 300
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 10
|
||||
periodSeconds: 60
|
||||
scaleUp:
|
||||
stabilizationWindowSeconds: 0
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 100
|
||||
periodSeconds: 15
|
||||
- type: Pods
|
||||
value: 4
|
||||
periodSeconds: 15
|
||||
selectPolicy: Max
|
||||
{{- end }}
|
||||
70
infra/helm/charts/coordinator/templates/ingress.yaml
Normal file
70
infra/helm/charts/coordinator/templates/ingress.yaml
Normal file
@ -0,0 +1,70 @@
|
||||
{{- if .Values.ingress.enabled -}}
|
||||
{{- $fullName := include "aitbc-coordinator.fullname" . -}}
|
||||
{{- $svcPort := .Values.service.port -}}
|
||||
{{- if and .Values.ingress.className (not (hasKey .Values.ingress.annotations "kubernetes.io/ingress.class")) }}
|
||||
{{- $_ := set .Values.ingress.annotations "kubernetes.io/ingress.class" .Values.ingress.className}}
|
||||
{{- end }}
|
||||
{{- if semverCompare ">=1.19-0" .Capabilities.KubeVersion.GitVersion -}}
|
||||
apiVersion: networking.k8s.io/v1
|
||||
{{- else -}}
|
||||
apiVersion: networking.k8s.io/v1beta1
|
||||
{{- end }}
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: {{ $fullName }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
annotations:
|
||||
# Security annotations (always applied)
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
|
||||
nginx.ingress.kubernetes.io/ssl-ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
|
||||
nginx.ingress.kubernetes.io/configuration-snippet: |
|
||||
more_set_headers "X-Frame-Options: DENY";
|
||||
more_set_headers "X-Content-Type-Options: nosniff";
|
||||
more_set_headers "X-XSS-Protection: 1; mode=block";
|
||||
more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
|
||||
more_set_headers "Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'";
|
||||
more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains; preload";
|
||||
cert-manager.io/cluster-issuer: {{ .Values.ingress.certManager.issuer | default "letsencrypt-prod" }}
|
||||
# User-provided annotations
|
||||
{{- with .Values.ingress.annotations }}
|
||||
{{- toYaml . | nindent 4 }}
|
||||
{{- end }}
|
||||
spec:
|
||||
{{- if and .Values.ingress.className (semverCompare ">=1.18-0" .Capabilities.KubeVersion.GitVersion) }}
|
||||
ingressClassName: {{ .Values.ingress.className }}
|
||||
{{- end }}
|
||||
{{- if .Values.ingress.tls }}
|
||||
tls:
|
||||
{{- range .Values.ingress.tls }}
|
||||
- hosts:
|
||||
{{- range .hosts }}
|
||||
- {{ . | quote }}
|
||||
{{- end }}
|
||||
secretName: {{ .secretName }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
rules:
|
||||
{{- range .Values.ingress.hosts }}
|
||||
- host: {{ .host | quote }}
|
||||
http:
|
||||
paths:
|
||||
{{- range .paths }}
|
||||
- path: {{ .path }}
|
||||
{{- if and .pathType (semverCompare ">=1.18-0" $.Capabilities.KubeVersion.GitVersion) }}
|
||||
pathType: {{ .pathType }}
|
||||
{{- end }}
|
||||
backend:
|
||||
{{- if semverCompare ">=1.19-0" $.Capabilities.KubeVersion.GitVersion }}
|
||||
service:
|
||||
name: {{ $fullName }}
|
||||
port:
|
||||
number: {{ $svcPort }}
|
||||
{{- else }}
|
||||
serviceName: {{ $fullName }}
|
||||
servicePort: {{ $svcPort }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
73
infra/helm/charts/coordinator/templates/networkpolicy.yaml
Normal file
73
infra/helm/charts/coordinator/templates/networkpolicy.yaml
Normal file
@ -0,0 +1,73 @@
|
||||
{{- if .Values.networkPolicy.enabled }}
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
podSelector:
|
||||
matchLabels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
|
||||
policyTypes:
|
||||
- Ingress
|
||||
- Egress
|
||||
ingress:
|
||||
# Allow traffic from ingress controller
|
||||
- from:
|
||||
- namespaceSelector:
|
||||
matchLabels:
|
||||
name: ingress-nginx
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: ingress-nginx
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from monitoring
|
||||
- from:
|
||||
- namespaceSelector:
|
||||
matchLabels:
|
||||
name: monitoring
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: prometheus
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from wallet-daemon
|
||||
- from:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: wallet-daemon
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from same namespace for internal communication
|
||||
- from:
|
||||
- podSelector: {}
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
egress:
|
||||
# Allow DNS resolution
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: UDP
|
||||
port: 53
|
||||
# Allow PostgreSQL access
|
||||
- to:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: postgresql
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 5432
|
||||
# Allow external API calls (if needed)
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 443
|
||||
- protocol: TCP
|
||||
port: 80
|
||||
{{- end }}
|
||||
@ -0,0 +1,59 @@
|
||||
{{- if .Values.podSecurityPolicy.enabled }}
|
||||
apiVersion: policy/v1beta1
|
||||
kind: PodSecurityPolicy
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
privileged: false
|
||||
allowPrivilegeEscalation: false
|
||||
requiredDropCapabilities:
|
||||
- ALL
|
||||
volumes:
|
||||
- 'configMap'
|
||||
- 'emptyDir'
|
||||
- 'projected'
|
||||
- 'secret'
|
||||
- 'downwardAPI'
|
||||
- 'persistentVolumeClaim'
|
||||
runAsUser:
|
||||
rule: 'MustRunAsNonRoot'
|
||||
seLinux:
|
||||
rule: 'RunAsAny'
|
||||
fsGroup:
|
||||
rule: 'RunAsAny'
|
||||
readOnlyRootFilesystem: false
|
||||
securityContext:
|
||||
runAsNonRoot: true
|
||||
runAsUser: 1000
|
||||
fsGroup: 1000
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: Role
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
rules:
|
||||
- apiGroups: ['policy']
|
||||
resources: ['podsecuritypolicies']
|
||||
verbs: ['use']
|
||||
resourceNames:
|
||||
- {{ include "aitbc-coordinator.fullname" . }}
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: RoleBinding
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
roleRef:
|
||||
kind: Role
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: {{ include "aitbc-coordinator.serviceAccountName" . }}
|
||||
namespace: {{ .Release.Namespace }}
|
||||
{{- end }}
|
||||
21
infra/helm/charts/coordinator/templates/service.yaml
Normal file
21
infra/helm/charts/coordinator/templates/service.yaml
Normal file
@ -0,0 +1,21 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
{{- if .Values.monitoring.enabled }}
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "{{ .Values.service.port }}"
|
||||
prometheus.io/path: "{{ .Values.monitoring.serviceMonitor.path }}"
|
||||
{{- end }}
|
||||
spec:
|
||||
type: {{ .Values.service.type }}
|
||||
ports:
|
||||
- port: {{ .Values.service.port }}
|
||||
targetPort: {{ .Values.service.targetPort }}
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 4 }}
|
||||
162
infra/helm/charts/coordinator/values.yaml
Normal file
162
infra/helm/charts/coordinator/values.yaml
Normal file
@ -0,0 +1,162 @@
|
||||
# Default values for aitbc-coordinator.
|
||||
# This is a YAML-formatted file.
|
||||
# Declare variables to be passed into your templates.
|
||||
|
||||
replicaCount: 1
|
||||
|
||||
image:
|
||||
repository: aitbc/coordinator-api
|
||||
pullPolicy: IfNotPresent
|
||||
tag: "0.1.0"
|
||||
|
||||
nameOverride: ""
|
||||
fullnameOverride: ""
|
||||
|
||||
serviceAccount:
|
||||
# Specifies whether a service account should be created
|
||||
create: true
|
||||
# Annotations to add to the service account
|
||||
annotations: {}
|
||||
# The name of the service account to use.
|
||||
# If not set and create is true, a name is generated using the fullname template
|
||||
name: ""
|
||||
|
||||
podAnnotations: {}
|
||||
|
||||
podSecurityContext:
|
||||
fsGroup: 1000
|
||||
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: false
|
||||
runAsNonRoot: true
|
||||
runAsUser: 1000
|
||||
capabilities:
|
||||
drop:
|
||||
- ALL
|
||||
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 8011
|
||||
targetPort: 8011
|
||||
|
||||
ingress:
|
||||
enabled: false
|
||||
className: nginx
|
||||
annotations: {}
|
||||
# cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
hosts:
|
||||
- host: coordinator.local
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls: []
|
||||
# - secretName: coordinator-tls
|
||||
# hosts:
|
||||
# - coordinator.local
|
||||
|
||||
# Pod Security Policy
|
||||
podSecurityPolicy:
|
||||
enabled: true
|
||||
|
||||
# Network policies
|
||||
networkPolicy:
|
||||
enabled: true
|
||||
|
||||
security:
|
||||
auth:
|
||||
enabled: true
|
||||
requireApiKey: true
|
||||
apiKeyHeader: "X-API-Key"
|
||||
tls:
|
||||
version: "TLSv1.3"
|
||||
ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
|
||||
headers:
|
||||
frameOptions: "DENY"
|
||||
contentTypeOptions: "nosniff"
|
||||
xssProtection: "1; mode=block"
|
||||
referrerPolicy: "strict-origin-when-cross-origin"
|
||||
hsts:
|
||||
enabled: true
|
||||
maxAge: 31536000
|
||||
includeSubDomains: true
|
||||
preload: true
|
||||
rateLimit:
|
||||
enabled: true
|
||||
requestsPerMinute: 60
|
||||
burst: 10
|
||||
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
|
||||
autoscaling:
|
||||
enabled: false
|
||||
minReplicas: 1
|
||||
maxReplicas: 10
|
||||
targetCPUUtilizationPercentage: 80
|
||||
# targetMemoryUtilizationPercentage: 80
|
||||
|
||||
nodeSelector: {}
|
||||
|
||||
tolerations: []
|
||||
|
||||
affinity: {}
|
||||
|
||||
# Configuration
|
||||
config:
|
||||
appEnv: production
|
||||
databaseUrl: "postgresql://aitbc:password@postgresql:5432/aitbc"
|
||||
receiptSigningKeyHex: ""
|
||||
receiptAttestationKeyHex: ""
|
||||
allowOrigins: "*"
|
||||
|
||||
# PostgreSQL sub-chart configuration
|
||||
postgresql:
|
||||
enabled: true
|
||||
auth:
|
||||
postgresPassword: "password"
|
||||
username: aitbc
|
||||
database: aitbc
|
||||
primary:
|
||||
persistence:
|
||||
enabled: true
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
# Monitoring
|
||||
monitoring:
|
||||
enabled: true
|
||||
serviceMonitor:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
|
||||
# Health checks
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /v1/health
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 3
|
||||
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /v1/health
|
||||
port: http
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 5
|
||||
timeoutSeconds: 3
|
||||
failureThreshold: 3
|
||||
19
infra/helm/charts/monitoring/Chart.yaml
Normal file
19
infra/helm/charts/monitoring/Chart.yaml
Normal file
@ -0,0 +1,19 @@
|
||||
apiVersion: v2
|
||||
name: aitbc-monitoring
|
||||
description: AITBC Monitoring Stack (Prometheus, Grafana, AlertManager)
|
||||
type: application
|
||||
version: 0.1.0
|
||||
appVersion: "0.1.0"
|
||||
dependencies:
|
||||
- name: prometheus
|
||||
version: 23.1.0
|
||||
repository: https://prometheus-community.github.io/helm-charts
|
||||
condition: prometheus.enabled
|
||||
- name: grafana
|
||||
version: 6.58.9
|
||||
repository: https://grafana.github.io/helm-charts
|
||||
condition: grafana.enabled
|
||||
- name: alertmanager
|
||||
version: 1.6.1
|
||||
repository: https://prometheus-community.github.io/helm-charts
|
||||
condition: alertmanager.enabled
|
||||
13
infra/helm/charts/monitoring/templates/dashboards.yaml
Normal file
13
infra/helm/charts/monitoring/templates/dashboards.yaml
Normal file
@ -0,0 +1,13 @@
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: {{ include "aitbc-monitoring.fullname" . }}-dashboards
|
||||
labels:
|
||||
{{- include "aitbc-monitoring.labels" . | nindent 4 }}
|
||||
annotations:
|
||||
grafana.io/dashboard: "1"
|
||||
data:
|
||||
blockchain-node-overview.json: |
|
||||
{{ .Files.Get "dashboards/blockchain-node-overview.json" | indent 4 }}
|
||||
coordinator-overview.json: |
|
||||
{{ .Files.Get "dashboards/coordinator-overview.json" | indent 4 }}
|
||||
124
infra/helm/charts/monitoring/values.yaml
Normal file
124
infra/helm/charts/monitoring/values.yaml
Normal file
@ -0,0 +1,124 @@
|
||||
# Default values for aitbc-monitoring.
|
||||
|
||||
# Prometheus configuration
|
||||
prometheus:
|
||||
enabled: true
|
||||
server:
|
||||
enabled: true
|
||||
global:
|
||||
scrape_interval: 15s
|
||||
evaluation_interval: 15s
|
||||
retention: 30d
|
||||
persistentVolume:
|
||||
enabled: true
|
||||
size: 100Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 9090
|
||||
serviceMonitors:
|
||||
enabled: true
|
||||
selector:
|
||||
release: monitoring
|
||||
alertmanager:
|
||||
enabled: false
|
||||
config:
|
||||
global:
|
||||
resolve_timeout: 5m
|
||||
route:
|
||||
group_by: ['alertname']
|
||||
group_wait: 10s
|
||||
group_interval: 10s
|
||||
repeat_interval: 1h
|
||||
receiver: 'web.hook'
|
||||
receivers:
|
||||
- name: 'web.hook'
|
||||
webhook_configs:
|
||||
- url: 'http://127.0.0.1:5001/'
|
||||
|
||||
# Grafana configuration
|
||||
grafana:
|
||||
enabled: true
|
||||
adminPassword: admin
|
||||
persistence:
|
||||
enabled: true
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 3000
|
||||
datasources:
|
||||
datasources.yaml:
|
||||
apiVersion: 1
|
||||
datasources:
|
||||
- name: Prometheus
|
||||
type: prometheus
|
||||
url: http://prometheus-server:9090
|
||||
access: proxy
|
||||
isDefault: true
|
||||
dashboardProviders:
|
||||
dashboardproviders.yaml:
|
||||
apiVersion: 1
|
||||
providers:
|
||||
- name: 'default'
|
||||
orgId: 1
|
||||
folder: ''
|
||||
type: file
|
||||
disableDeletion: false
|
||||
editable: true
|
||||
options:
|
||||
path: /var/lib/grafana/dashboards/default
|
||||
|
||||
# Service monitors for AITBC services
|
||||
serviceMonitors:
|
||||
coordinator:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
blockchainNode:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
walletDaemon:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
|
||||
# Alert rules
|
||||
alertRules:
|
||||
enabled: true
|
||||
groups:
|
||||
- name: aitbc.rules
|
||||
rules:
|
||||
- alert: HighErrorRate
|
||||
expr: rate(marketplace_errors_total[5m]) / rate(marketplace_requests_total[5m]) > 0.1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High error rate detected"
|
||||
description: "Error rate is above 10% for 5 minutes"
|
||||
|
||||
- alert: CoordinatorDown
|
||||
expr: up{job="coordinator"} == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Coordinator is down"
|
||||
description: "Coordinator API has been down for more than 1 minute"
|
||||
77
infra/helm/values/dev.yaml
Normal file
77
infra/helm/values/dev.yaml
Normal file
@ -0,0 +1,77 @@
|
||||
# Development environment values
|
||||
global:
|
||||
environment: dev
|
||||
|
||||
coordinator:
|
||||
replicaCount: 1
|
||||
image:
|
||||
tag: "dev-latest"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
config:
|
||||
appEnv: development
|
||||
allowOrigins: "*"
|
||||
postgresql:
|
||||
auth:
|
||||
postgresPassword: "dev-password"
|
||||
primary:
|
||||
persistence:
|
||||
size: 10Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 7d
|
||||
persistentVolume:
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
grafana:
|
||||
adminPassword: "dev-admin"
|
||||
persistence:
|
||||
size: 5Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 125m
|
||||
memory: 256Mi
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 1
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 1
|
||||
resources:
|
||||
limits:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
requests:
|
||||
cpu: 125m
|
||||
memory: 128Mi
|
||||
140
infra/helm/values/prod.yaml
Normal file
140
infra/helm/values/prod.yaml
Normal file
@ -0,0 +1,140 @@
|
||||
# Production environment values
|
||||
global:
|
||||
environment: production
|
||||
|
||||
coordinator:
|
||||
replicaCount: 3
|
||||
image:
|
||||
tag: "v0.1.0"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 3
|
||||
maxReplicas: 20
|
||||
targetCPUUtilizationPercentage: 75
|
||||
targetMemoryUtilizationPercentage: 80
|
||||
config:
|
||||
appEnv: production
|
||||
allowOrigins: "https://app.aitbc.io"
|
||||
postgresql:
|
||||
auth:
|
||||
existingSecret: "coordinator-db-secret"
|
||||
primary:
|
||||
persistence:
|
||||
size: 200Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
readReplicas:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 90d
|
||||
persistentVolume:
|
||||
size: 500Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
grafana:
|
||||
adminPassword: "prod-admin-secure-2024"
|
||||
persistence:
|
||||
size: 50Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
ingress:
|
||||
enabled: true
|
||||
hosts:
|
||||
- grafana.aitbc.io
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 5
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 5
|
||||
maxReplicas: 50
|
||||
targetCPUUtilizationPercentage: 70
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 3
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 3
|
||||
maxReplicas: 10
|
||||
targetCPUUtilizationPercentage: 75
|
||||
|
||||
# Ingress configuration
|
||||
ingress:
|
||||
enabled: true
|
||||
className: nginx
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
nginx.ingress.kubernetes.io/rate-limit: "100"
|
||||
nginx.ingress.kubernetes.io/rate-limit-window: "1m"
|
||||
hosts:
|
||||
- host: api.aitbc.io
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls:
|
||||
- secretName: prod-tls
|
||||
hosts:
|
||||
- api.aitbc.io
|
||||
|
||||
# Security
|
||||
podSecurityPolicy:
|
||||
enabled: true
|
||||
|
||||
networkPolicy:
|
||||
enabled: true
|
||||
|
||||
# Backup configuration
|
||||
backup:
|
||||
enabled: true
|
||||
schedule: "0 2 * * *"
|
||||
retention: "30d"
|
||||
98
infra/helm/values/staging.yaml
Normal file
98
infra/helm/values/staging.yaml
Normal file
@ -0,0 +1,98 @@
|
||||
# Staging environment values
|
||||
global:
|
||||
environment: staging
|
||||
|
||||
coordinator:
|
||||
replicaCount: 2
|
||||
image:
|
||||
tag: "staging-latest"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 2
|
||||
maxReplicas: 5
|
||||
targetCPUUtilizationPercentage: 70
|
||||
config:
|
||||
appEnv: staging
|
||||
allowOrigins: "https://staging.aitbc.io"
|
||||
postgresql:
|
||||
auth:
|
||||
postgresPassword: "staging-password"
|
||||
primary:
|
||||
persistence:
|
||||
size: 50Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 30d
|
||||
persistentVolume:
|
||||
size: 100Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
grafana:
|
||||
adminPassword: "staging-admin-2024"
|
||||
persistence:
|
||||
size: 10Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
|
||||
# Ingress configuration
|
||||
ingress:
|
||||
enabled: true
|
||||
className: nginx
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
hosts:
|
||||
- host: api.staging.aitbc.io
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls:
|
||||
- secretName: staging-tls
|
||||
hosts:
|
||||
- api.staging.aitbc.io
|
||||
570
infra/k8s/backup-configmap.yaml
Normal file
570
infra/k8s/backup-configmap.yaml
Normal file
@ -0,0 +1,570 @@
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: backup-scripts
|
||||
namespace: default
|
||||
labels:
|
||||
app: aitbc-backup
|
||||
component: backup
|
||||
data:
|
||||
backup_postgresql.sh: |
|
||||
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup
|
||||
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
|
||||
pg_dump -U "$db_user" -h localhost -d "$db_name" \
|
||||
--verbose --clean --if-exists --create --format=custom \
|
||||
--file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
||||
backup_redis.sh: |
|
||||
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
# Create Redis backup
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
|
||||
|
||||
# Wait for background save to complete
|
||||
log "Waiting for background save to complete..."
|
||||
local retries=60
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
|
||||
local lastbgsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
|
||||
|
||||
if [[ "$lastsave" -gt "$lastbgsave" ]]; then
|
||||
log "Background save completed"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
||||
backup_ledger.sh: |
|
||||
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
backup_ledger_data "$pod"
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
156
infra/k8s/backup-cronjob.yaml
Normal file
@ -0,0 +1,156 @@
|
||||
apiVersion: batch/v1
|
||||
kind: CronJob
|
||||
metadata:
|
||||
name: aitbc-backup
|
||||
namespace: default
|
||||
labels:
|
||||
app: aitbc-backup
|
||||
component: backup
|
||||
spec:
|
||||
schedule: "0 2 * * *" # Run daily at 2 AM
|
||||
concurrencyPolicy: Forbid
|
||||
successfulJobsHistoryLimit: 7
|
||||
failedJobsHistoryLimit: 3
|
||||
jobTemplate:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
restartPolicy: OnFailure
|
||||
containers:
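# The containers below run the backup scripts mounted from the backup-scripts
# ConfigMap; those scripts shell out to kubectl (and the ledger script to jq),
# so the chosen images are assumed to provide these tools in addition to the
# database clients.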
|
||||
- name: postgresql-backup
|
||||
image: postgres:15-alpine
|
||||
command:
|
||||
- /bin/bash
|
||||
- -c
|
||||
- |
|
||||
echo "Starting PostgreSQL backup..."
|
||||
/scripts/backup_postgresql.sh default postgresql-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "PostgreSQL backup completed"
|
||||
env:
|
||||
- name: PGPASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: coordinator-postgresql
|
||||
key: password
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "256Mi"
|
||||
cpu: "100m"
|
||||
limits:
|
||||
memory: "512Mi"
|
||||
cpu: "500m"
|
||||
|
||||
- name: redis-backup
|
||||
image: redis:7-alpine
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- |
|
||||
echo "Waiting for PostgreSQL backup to complete..."
|
||||
sleep 60
|
||||
echo "Starting Redis backup..."
|
||||
/scripts/backup_redis.sh default redis-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "Redis backup completed"
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "128Mi"
|
||||
cpu: "50m"
|
||||
limits:
|
||||
memory: "256Mi"
|
||||
cpu: "200m"
|
||||
|
||||
- name: ledger-backup
|
||||
image: alpine:3.18
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- |
|
||||
echo "Waiting for previous backups to complete..."
|
||||
sleep 120
|
||||
echo "Starting Ledger backup..."
|
||||
/scripts/backup_ledger.sh default ledger-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "Ledger backup completed"
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "256Mi"
|
||||
cpu: "100m"
|
||||
limits:
|
||||
memory: "512Mi"
|
||||
cpu: "500m"
|
||||
|
||||
volumes:
|
||||
- name: backup-scripts
|
||||
configMap:
|
||||
name: backup-scripts
|
||||
defaultMode: 0755
|
||||
|
||||
- name: backup-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: backup-storage-pvc
|
||||
|
||||
# Add service account for cloud storage access
|
||||
serviceAccountName: backup-service-account
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: backup-service-account
|
||||
namespace: default
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: Role
|
||||
metadata:
|
||||
name: backup-role
|
||||
namespace: default
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources: ["pods", "pods/exec", "secrets"]
|
||||
verbs: ["get", "list"]
|
||||
- apiGroups: ["batch"]
|
||||
resources: ["jobs", "cronjobs"]
|
||||
verbs: ["get", "list", "watch"]
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: RoleBinding
|
||||
metadata:
|
||||
name: backup-role-binding
|
||||
namespace: default
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: backup-service-account
|
||||
namespace: default
|
||||
roleRef:
|
||||
kind: Role
|
||||
name: backup-role
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: backup-storage-pvc
|
||||
namespace: default
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: fast-ssd
|
||||
resources:
|
||||
requests:
|
||||
storage: 500Gi
|
||||
99
infra/k8s/cert-manager.yaml
Normal file
@ -0,0 +1,99 @@
|
||||
# Cert-Manager Installation
|
||||
apiVersion: argoproj.io/v1alpha1
|
||||
kind: Application
|
||||
metadata:
|
||||
name: cert-manager
|
||||
namespace: argocd
|
||||
finalizers:
|
||||
- resources-finalizer.argocd.argoproj.io
|
||||
spec:
|
||||
project: default
|
||||
source:
|
||||
repoURL: https://charts.jetstack.io
|
||||
chart: cert-manager
|
||||
targetRevision: v1.14.0
|
||||
helm:
|
||||
releaseName: cert-manager
|
||||
parameters:
|
||||
- name: installCRDs
|
||||
value: "true"
|
||||
- name: namespace
|
||||
value: cert-manager
|
||||
destination:
|
||||
server: https://kubernetes.default.svc
|
||||
namespace: cert-manager
|
||||
syncPolicy:
|
||||
automated:
|
||||
prune: true
|
||||
selfHeal: true
|
||||
---
|
||||
# Let's Encrypt Production ClusterIssuer
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata:
|
||||
name: letsencrypt-prod
|
||||
spec:
|
||||
acme:
|
||||
server: https://acme-v02.api.letsencrypt.org/directory
|
||||
email: admin@aitbc.io
|
||||
privateKeySecretRef:
|
||||
name: letsencrypt-prod
|
||||
solvers:
|
||||
- http01:
|
||||
ingress:
|
||||
class: nginx
|
||||
---
|
||||
# Let's Encrypt Staging ClusterIssuer (for testing)
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata:
|
||||
name: letsencrypt-staging
|
||||
spec:
|
||||
acme:
|
||||
server: https://acme-staging-v02.api.letsencrypt.org/directory
|
||||
email: admin@aitbc.io
|
||||
privateKeySecretRef:
|
||||
name: letsencrypt-staging
|
||||
solvers:
|
||||
- http01:
|
||||
ingress:
|
||||
class: nginx
|
||||
---
|
||||
# Self-Signed Issuer for Development
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Issuer
|
||||
metadata:
|
||||
name: selfsigned-issuer
|
||||
namespace: default
|
||||
spec:
|
||||
selfSigned: {}
|
||||
---
|
||||
# Development Certificate
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Certificate
|
||||
metadata:
|
||||
name: coordinator-dev-tls
|
||||
namespace: default
|
||||
spec:
|
||||
secretName: coordinator-dev-tls
|
||||
dnsNames:
|
||||
- coordinator.local
|
||||
- coordinator.127.0.0.2.nip.io
|
||||
issuerRef:
|
||||
name: selfsigned-issuer
|
||||
kind: Issuer
|
||||
---
|
||||
# Production Certificate Template
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Certificate
|
||||
metadata:
|
||||
name: coordinator-prod-tls
|
||||
namespace: default
|
||||
spec:
|
||||
secretName: coordinator-prod-tls
|
||||
dnsNames:
|
||||
- api.aitbc.io
|
||||
- www.api.aitbc.io
|
||||
issuerRef:
|
||||
name: letsencrypt-prod
|
||||
kind: ClusterIssuer
|
||||
56
infra/k8s/default-deny-netpol.yaml
Normal file
@ -0,0 +1,56 @@
|
||||
# Default Deny All Network Policy
|
||||
# This policy denies all ingress and egress traffic by default
|
||||
# Individual services must have their own network policies to allow traffic
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: default-deny-all-ingress
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Ingress
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: default-deny-all-egress
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
---
|
||||
# Allow DNS resolution for all pods
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: allow-dns
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
egress:
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: UDP
|
||||
port: 53
|
||||
- protocol: TCP
|
||||
port: 53
|
||||
---
|
||||
# Allow traffic to Kubernetes API
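# Note: `to: []` with port 443 permits egress to any destination on TCP 443,
# which covers the API server but is broader than API-only access; restrict it
# with an ipBlock for the API server CIDR if tighter scoping is needed.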
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: allow-k8s-api
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
egress:
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 443
|
||||
81
infra/k8s/sealed-secrets.yaml
Normal file
@ -0,0 +1,81 @@
|
||||
# SealedSecrets Controller Installation
|
||||
apiVersion: argoproj.io/v1alpha1
|
||||
kind: Application
|
||||
metadata:
|
||||
name: sealed-secrets
|
||||
namespace: argocd
|
||||
finalizers:
|
||||
- resources-finalizer.argocd.argoproj.io
|
||||
spec:
|
||||
project: default
|
||||
source:
|
||||
repoURL: https://bitnami-labs.github.io/sealed-secrets
|
||||
chart: sealed-secrets
|
||||
targetRevision: 2.15.0
|
||||
helm:
|
||||
releaseName: sealed-secrets
|
||||
parameters:
|
||||
- name: namespace
|
||||
value: kube-system
|
||||
destination:
|
||||
server: https://kubernetes.default.svc
|
||||
namespace: kube-system
|
||||
syncPolicy:
|
||||
automated:
|
||||
prune: true
|
||||
selfHeal: true
|
||||
---
|
||||
# Example SealedSecret for Coordinator API Keys
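# The encryptedData values below are truncated placeholders; real values are
# generated with kubeseal against the cluster's sealed-secrets controller, e.g.
#   kubeseal --format yaml < coordinator-api-keys-secret.yaml > sealed.yaml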
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-api-keys
|
||||
namespace: default
|
||||
annotations:
|
||||
sealedsecrets.bitnami.com/cluster-wide: "true"
|
||||
spec:
|
||||
encryptedData:
|
||||
# Production API key (encrypted)
|
||||
api-key-prod: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
# Staging API key (encrypted)
|
||||
api-key-staging: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
# Development API key (encrypted)
|
||||
api-key-dev: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-api-keys
|
||||
namespace: default
|
||||
type: Opaque
|
||||
---
|
||||
# Example SealedSecret for Database Credentials
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-db-credentials
|
||||
namespace: default
|
||||
spec:
|
||||
encryptedData:
|
||||
username: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
database: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-db-credentials
|
||||
namespace: default
|
||||
type: Opaque
|
||||
---
|
||||
# Example SealedSecret for JWT Signing Keys (if needed in future)
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-jwt-keys
|
||||
namespace: default
|
||||
spec:
|
||||
encryptedData:
|
||||
private-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
public-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-jwt-keys
|
||||
namespace: default
|
||||
type: Opaque
|
||||
330
infra/scripts/README_chaos.md
Normal file
@ -0,0 +1,330 @@
|
||||
# AITBC Chaos Testing Framework
|
||||
|
||||
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
|
||||
|
||||
## Overview
|
||||
|
||||
The chaos testing framework simulates real-world failure scenarios to:
|
||||
- Test system resilience under adverse conditions
|
||||
- Measure Mean-Time-To-Recovery (MTTR) metrics
|
||||
- Identify single points of failure
|
||||
- Validate recovery procedures
|
||||
- Ensure SLO compliance
|
||||
|
||||
## Components
|
||||
|
||||
### Test Scripts
|
||||
|
||||
1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
|
||||
- Deletes coordinator pods to simulate complete service outage
|
||||
- Measures recovery time and service availability
|
||||
- Tests load handling during and after recovery
|
||||
|
||||
2. **`chaos_test_network.py`** - Network partition simulation
|
||||
- Creates network partitions between blockchain nodes
|
||||
- Tests consensus resilience during partition
|
||||
- Measures network recovery time
|
||||
|
||||
3. **`chaos_test_database.py`** - Database failure simulation
|
||||
- Simulates PostgreSQL connection failures
|
||||
- Tests high latency scenarios
|
||||
- Validates application error handling
|
||||
|
||||
4. **`chaos_orchestrator.py`** - Test orchestration and reporting
|
||||
- Runs multiple chaos test scenarios
|
||||
- Aggregates MTTR metrics across tests
|
||||
- Generates comprehensive reports
|
||||
- Supports continuous chaos testing
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.8+
|
||||
- kubectl configured with cluster access
|
||||
- Helm charts deployed in target namespace
|
||||
- Administrative privileges for network manipulation
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone <repository-url>
|
||||
cd aitbc/infra/scripts
|
||||
|
||||
# Install dependencies
|
||||
pip install aiohttp
|
||||
|
||||
# Make scripts executable
|
||||
chmod +x chaos_*.py
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Running Individual Tests
|
||||
|
||||
#### Coordinator Outage Test
|
||||
```bash
|
||||
# Basic test
|
||||
python3 chaos_test_coordinator.py --namespace default
|
||||
|
||||
# Custom outage duration
|
||||
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
|
||||
|
||||
# Dry run (no actual chaos)
|
||||
python3 chaos_test_coordinator.py --dry-run
|
||||
```
|
||||
|
||||
#### Network Partition Test
|
||||
```bash
|
||||
# Partition 50% of nodes for 60 seconds
|
||||
python3 chaos_test_network.py --namespace default
|
||||
|
||||
# Partition 30% of nodes for 90 seconds
|
||||
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
|
||||
```
|
||||
|
||||
#### Database Failure Test
|
||||
```bash
|
||||
# Simulate connection failure
|
||||
python3 chaos_test_database.py --namespace default --failure-type connection
|
||||
|
||||
# Simulate high latency (5000ms)
|
||||
python3 chaos_test_database.py --namespace default --failure-type latency
|
||||
```
|
||||
|
||||
### Running All Tests
|
||||
|
||||
```bash
|
||||
# Run all scenarios with default parameters
|
||||
python3 chaos_orchestrator.py --namespace default
|
||||
|
||||
# Run specific scenarios
|
||||
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
|
||||
|
||||
# Continuous chaos testing (24 hours, every 60 minutes)
|
||||
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
|
||||
```
|
||||
|
||||
## Test Scenarios
|
||||
|
||||
### 1. Coordinator API Outage
|
||||
|
||||
**Objective**: Test system resilience when the coordinator service becomes unavailable.
|
||||
|
||||
**Steps**:
|
||||
1. Generate baseline load on coordinator API
|
||||
2. Delete all coordinator pods
|
||||
3. Wait for specified outage duration
|
||||
4. Monitor service recovery
|
||||
5. Generate post-recovery load
|
||||
|
||||
**Metrics Collected**:
|
||||
- MTTR (Mean-Time-To-Recovery)
|
||||
- Success/error request counts
|
||||
- Recovery time distribution
|
||||
|
||||
### 2. Network Partition
|
||||
|
||||
**Objective**: Test blockchain consensus during network partitions.
|
||||
|
||||
**Steps**:
|
||||
1. Identify blockchain node pods
|
||||
2. Apply iptables rules to partition nodes (a sketch of the approach follows the metrics list below)
|
||||
3. Monitor consensus during partition
|
||||
4. Remove network partition
|
||||
5. Verify network recovery
|
||||
|
||||
**Metrics Collected**:
|
||||
- Network recovery time
|
||||
- Consensus health during partition
|
||||
- Node connectivity status
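
The exact rules are internal to `chaos_test_network.py`; the sketch below only illustrates the general approach implied by the steps above (drop traffic toward a peer from inside a node pod, then remove the rule to heal). The pod name and peer IP are placeholders, not values the script uses.

```bash
# Induce a partition: drop outbound traffic from one node to a peer
kubectl exec -n default blockchain-node-0 -- \
  iptables -A OUTPUT -d <peer-pod-ip> -j DROP

# Heal the partition: delete the same rule
kubectl exec -n default blockchain-node-0 -- \
  iptables -D OUTPUT -d <peer-pod-ip> -j DROP
```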
|
||||
|
||||
### 3. Database Failure
|
||||
|
||||
**Objective**: Test application behavior when the database is unavailable.
|
||||
|
||||
**Steps**:
|
||||
1. Simulate database connection failure or high latency
|
||||
2. Monitor API behavior during failure
|
||||
3. Restore database connectivity
|
||||
4. Verify application recovery
|
||||
|
||||
**Metrics Collected**:
|
||||
- Database recovery time
|
||||
- API error rates during failure
|
||||
- Application resilience metrics
|
||||
|
||||
## Results and Reporting
|
||||
|
||||
### Test Results Format
|
||||
|
||||
Each test generates a JSON results file with the following structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"test_start": "2024-12-22T10:00:00.000Z",
|
||||
"test_end": "2024-12-22T10:05:00.000Z",
|
||||
"scenario": "coordinator_outage",
|
||||
"mttr": 45.2,
|
||||
"error_count": 156,
|
||||
"success_count": 844,
|
||||
"recovery_time": 45.2
|
||||
}
|
||||
```
|
||||
|
||||
### Orchestrator Report
|
||||
|
||||
The orchestrator generates a comprehensive report including:
|
||||
|
||||
- Summary metrics across all scenarios
|
||||
- SLO compliance analysis
|
||||
- Recommendations for improvements
|
||||
- MTTR trends and statistics
|
||||
|
||||
Example report snippet:
|
||||
```json
|
||||
{
|
||||
"summary": {
|
||||
"total_scenarios": 3,
|
||||
"successful_scenarios": 3,
|
||||
"average_mttr": 67.8,
|
||||
"max_mttr": 120.5,
|
||||
"min_mttr": 45.2
|
||||
},
|
||||
"recommendations": [
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## SLO Targets
|
||||
|
||||
| Metric | Target | Current |
|
||||
|--------|--------|---------|
|
||||
| MTTR (Average) | ≤ 120 seconds | TBD |
|
||||
| MTTR (Maximum) | ≤ 300 seconds | TBD |
|
||||
| Success Rate | ≥ 99.9% | TBD |
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Before Running Tests
|
||||
|
||||
1. **Backup Critical Data**: Ensure recent backups are available
|
||||
2. **Notify Team**: Inform stakeholders about chaos testing
|
||||
3. **Check Cluster Health**: Verify all components are healthy
|
||||
4. **Schedule Appropriately**: Run during low-traffic periods
|
||||
|
||||
### During Tests
|
||||
|
||||
1. **Monitor Logs**: Watch for unexpected errors
|
||||
2. **Have Rollback Plan**: Be ready to manually intervene
|
||||
3. **Document Observations**: Note any unusual behavior
|
||||
4. **Stop if Critical**: Abort tests if production is impacted
|
||||
|
||||
### After Tests
|
||||
|
||||
1. **Review Results**: Analyze MTTR and error rates
|
||||
2. **Update Documentation**: Record findings and improvements
|
||||
3. **Address Issues**: Fix any discovered problems
|
||||
4. **Schedule Follow-up**: Plan regular chaos testing
|
||||
|
||||
## Integration with CI/CD
|
||||
|
||||
### GitHub Actions Example
|
||||
|
||||
```yaml
|
||||
name: Chaos Testing
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 2 * * 0' # Weekly at 2 AM Sunday
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
chaos-test:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
- name: Setup Python
|
||||
uses: actions/setup-python@v2
|
||||
with:
|
||||
python-version: '3.9'
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
pip install aiohttp
|
||||
- name: Run chaos tests
|
||||
run: |
|
||||
cd infra/scripts
|
||||
python3 chaos_orchestrator.py --namespace staging
|
||||
- name: Upload results
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: chaos-results
|
||||
path: "*.json"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **kubectl not found**
|
||||
```bash
|
||||
# Ensure kubectl is installed and configured
|
||||
which kubectl
|
||||
kubectl version
|
||||
```
|
||||
|
||||
2. **Permission denied errors**
|
||||
```bash
|
||||
# Check RBAC permissions
|
||||
kubectl auth can-i create pods --namespace default
|
||||
kubectl auth can-i exec pods --namespace default
|
||||
```
|
||||
|
||||
3. **Network rules not applying**
|
||||
```bash
|
||||
# Check if iptables is available in pods
|
||||
kubectl exec -it <pod> -- iptables -L
|
||||
```
|
||||
|
||||
4. **Tests hanging**
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods --namespace default
|
||||
kubectl describe pod <pod-name> --namespace default
|
||||
```
|
||||
|
||||
### Debug Mode
|
||||
|
||||
Enable debug logging:
|
||||
```bash
|
||||
export PYTHONPATH=.
|
||||
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
|
||||
```
|
||||
|
||||
## Contributing
|
||||
|
||||
To add new chaos test scenarios:
|
||||
|
||||
1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
|
||||
2. Implement the required methods: `run_test()`, `save_results()` (see the skeleton after this list)
|
||||
3. Add the scenario to `chaos_orchestrator.py`
|
||||
4. Update documentation
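
A minimal skeleton for a new scenario is sketched below. It assumes the async structure used by the existing tests; the class name and metric keys (beyond `mttr`) are illustrative, and the only hard requirement from the orchestrator is that results land in a file matching `chaos_test_<scenario>_*.json`.

```python
#!/usr/bin/env python3
"""Chaos Testing Script - <scenario> (skeleton)"""

import argparse
import asyncio
import json
from datetime import datetime


class ChaosTestExample:
    """Skeleton chaos test; fill in inject/observe/recover logic."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {"scenario": "example", "mttr": None,
                        "test_start": None, "test_end": None}

    async def run_test(self):
        self.metrics["test_start"] = datetime.utcnow().isoformat()
        # 1. inject the failure, 2. observe impact, 3. record recovery time
        #    into self.metrics["mttr"]
        self.metrics["test_end"] = datetime.utcnow().isoformat()

    def save_results(self):
        # File name must match the orchestrator's chaos_test_<scenario>_*.json glob
        path = f"chaos_test_example_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(path, "w") as f:
            json.dump(self.metrics, f, indent=2)


async def main():
    parser = argparse.ArgumentParser(description="Example chaos scenario")
    parser.add_argument("--namespace", default="default")
    args = parser.parse_args()

    test = ChaosTestExample(args.namespace)
    await test.run_test()
    test.save_results()


if __name__ == "__main__":
    asyncio.run(main())
```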
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Chaos tests require elevated privileges
|
||||
- Only run in authorized environments
|
||||
- Ensure test isolation from production data
|
||||
- Review network rules before deployment
|
||||
- Monitor for security violations during tests
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
- Check the troubleshooting section
|
||||
- Review test logs for error details
|
||||
- Contact the DevOps team at devops@aitbc.io
|
||||
|
||||
## License
|
||||
|
||||
This chaos testing framework is part of the AITBC project and follows the same license terms.
|
||||
233
infra/scripts/backup_ledger.sh
Executable file
@ -0,0 +1,233 @@
|
||||
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi

if ! command -v jq &> /dev/null; then
error "jq is not installed or not in PATH"
exit 1
fi
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Create incremental backup
|
||||
create_incremental_backup() {
|
||||
local pod=$1
|
||||
local last_backup_file="$BACKUP_DIR/.last_backup_height"
|
||||
|
||||
# Get last backup height
|
||||
local last_backup_height=0
|
||||
if [[ -f "$last_backup_file" ]]; then
|
||||
last_backup_height=$(cat "$last_backup_file")
|
||||
fi
|
||||
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ $current_height -le $last_backup_height ]]; then
|
||||
log "No new blocks since last backup (height: $current_height)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
|
||||
|
||||
# Export blocks since last backup
|
||||
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
|
||||
|
||||
# Update last backup height
|
||||
echo "$current_height" > "$last_backup_file"
|
||||
|
||||
log "Incremental backup created: $incremental_file"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
local incremental=${3:-false}
|
||||
|
||||
log "Starting ledger backup process (incremental=$incremental)"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
|
||||
if [[ "$incremental" == "true" ]]; then
|
||||
create_incremental_backup "$pod"
|
||||
else
|
||||
backup_ledger_data "$pod"
|
||||
fi
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
172
infra/scripts/backup_postgresql.sh
Executable file
@ -0,0 +1,172 @@
|
||||
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup (PGPASSWORD must be set inside the pod, not on the local kubectl process)
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" \
    pg_dump -U "$db_user" -h localhost -d "$db_name" \
    --verbose --clean --if-exists --create --format=custom \
    --file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
189
infra/scripts/backup_redis.sh
Executable file
@ -0,0 +1,189 @@
|
||||
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
# Record the last successful save time, then trigger a background save
local previous_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

# Wait for background save to complete (LASTSAVE advances once BGSAVE finishes)
log "Waiting for background save to complete..."
local retries=60
while [[ $retries -gt 0 ]]; do
local current_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)

if [[ "$current_lastsave" -gt "$previous_lastsave" ]]; then
log "Background save completed"
break
fi
sleep 2
((retries--))
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
342
infra/scripts/chaos_orchestrator.py
Executable file
@ -0,0 +1,342 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Orchestrator
|
||||
Runs multiple chaos test scenarios and aggregates MTTR metrics
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosOrchestrator:
|
||||
"""Orchestrates multiple chaos test scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.results = {
|
||||
"orchestration_start": None,
|
||||
"orchestration_end": None,
|
||||
"scenarios": [],
|
||||
"summary": {
|
||||
"total_scenarios": 0,
|
||||
"successful_scenarios": 0,
|
||||
"failed_scenarios": 0,
|
||||
"average_mttr": 0,
|
||||
"max_mttr": 0,
|
||||
"min_mttr": float('inf')
|
||||
}
|
||||
}
|
||||
|
||||
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
|
||||
"""Run a single chaos test scenario"""
|
||||
scenario_name = Path(script).stem.replace("chaos_test_", "")
|
||||
logger.info(f"Running scenario: {scenario_name}")
|
||||
|
||||
cmd = ["python3", script] + args
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Run the chaos test script
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE
|
||||
)
|
||||
|
||||
stdout, stderr = await process.communicate()
|
||||
|
||||
if process.returncode != 0:
|
||||
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
|
||||
logger.error(f"Error: {stderr.decode()}")
|
||||
return None
|
||||
|
||||
# Find the results file
|
||||
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
|
||||
if not result_files:
|
||||
logger.error(f"No results file found for scenario {scenario_name}")
|
||||
return None
|
||||
|
||||
# Load the most recent result file
|
||||
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
|
||||
with open(result_file, 'r') as f:
|
||||
results = json.load(f)
|
||||
|
||||
# Add execution metadata
|
||||
results["execution_time"] = time.time() - start_time
|
||||
results["scenario_name"] = scenario_name
|
||||
|
||||
logger.info(f"Scenario {scenario_name} completed successfully")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to run scenario {scenario_name}: {e}")
|
||||
return None
|
||||
|
||||
def calculate_summary_metrics(self):
|
||||
"""Calculate summary metrics across all scenarios"""
|
||||
mttr_values = []
|
||||
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("mttr"):
|
||||
mttr_values.append(scenario["mttr"])
|
||||
|
||||
if mttr_values:
|
||||
self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
|
||||
self.results["summary"]["max_mttr"] = max(mttr_values)
|
||||
self.results["summary"]["min_mttr"] = min(mttr_values)
|
||||
|
||||
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
|
||||
self.results["summary"]["successful_scenarios"] = sum(
|
||||
1 for s in self.results["scenarios"] if s.get("mttr") is not None
|
||||
)
|
||||
self.results["summary"]["failed_scenarios"] = (
|
||||
self.results["summary"]["total_scenarios"] -
|
||||
self.results["summary"]["successful_scenarios"]
|
||||
)
|
||||
|
||||
def generate_report(self, output_file: Optional[str] = None):
|
||||
"""Generate a comprehensive chaos test report"""
|
||||
report = {
|
||||
"report_generated": datetime.utcnow().isoformat(),
|
||||
"namespace": self.namespace,
|
||||
"orchestration": self.results,
|
||||
"recommendations": []
|
||||
}
|
||||
|
||||
# Add recommendations based on results
|
||||
if self.results["summary"]["average_mttr"] > 120:
|
||||
report["recommendations"].append(
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
|
||||
)
|
||||
|
||||
if self.results["summary"]["max_mttr"] > 300:
|
||||
report["recommendations"].append(
|
||||
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
|
||||
)
|
||||
|
||||
if self.results["summary"]["failed_scenarios"] > 0:
|
||||
report["recommendations"].append(
|
||||
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
|
||||
)
|
||||
|
||||
# Check for specific scenario issues
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("scenario_name") == "coordinator_outage":
|
||||
if scenario.get("mttr", 0) > 180:
|
||||
report["recommendations"].append(
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "network_partition":
|
||||
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
|
||||
report["recommendations"].append(
|
||||
"High error rate during network partition. Improve error handling."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "database_failure":
|
||||
if scenario.get("failure_type") == "connection":
|
||||
report["recommendations"].append(
|
||||
"Consider implementing database connection pooling and retry logic."
|
||||
)
|
||||
|
||||
# Save report
|
||||
if output_file:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(report, f, indent=2)
|
||||
logger.info(f"Chaos test report saved to: {output_file}")
|
||||
|
||||
# Print summary
|
||||
self.print_summary()
|
||||
|
||||
return report
|
||||
|
||||
def print_summary(self):
|
||||
"""Print a summary of all chaos test results"""
|
||||
print("\n" + "="*60)
|
||||
print("CHAOS TESTING SUMMARY REPORT")
|
||||
print("="*60)
|
||||
|
||||
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
|
||||
print(f"Namespace: {self.namespace}")
|
||||
|
||||
print(f"\nScenario Results:")
|
||||
print("-" * 40)
|
||||
for scenario in self.results["scenarios"]:
|
||||
name = scenario.get("scenario_name", "Unknown")
|
||||
mttr = scenario.get("mttr", "N/A")
|
||||
if mttr != "N/A":
|
||||
mttr = f"{mttr:.2f}s"
|
||||
print(f" {name:20} MTTR: {mttr}")
|
||||
|
||||
print(f"\nSummary Metrics:")
|
||||
print("-" * 40)
|
||||
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
|
||||
print(f" Successful: {self.results['summary']['successful_scenarios']}")
|
||||
print(f" Failed: {self.results['summary']['failed_scenarios']}")
|
||||
|
||||
if self.results["summary"]["average_mttr"] > 0:
|
||||
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
|
||||
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
|
||||
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
|
||||
|
||||
# SLO compliance
|
||||
print(f"\nSLO Compliance:")
|
||||
print("-" * 40)
|
||||
slo_target = 120 # 2 minutes
|
||||
if self.results["summary"]["average_mttr"] <= slo_target:
|
||||
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
|
||||
else:
|
||||
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
|
||||
"""Run all specified chaos test scenarios"""
|
||||
logger.info("Starting chaos testing orchestration")
|
||||
self.results["orchestration_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
for scenario in scenarios:
|
||||
args = scenario_args.get(scenario, [])
|
||||
# Add namespace to all scenarios
|
||||
args.extend(["--namespace", self.namespace])
|
||||
|
||||
result = await self.run_scenario(scenario, args)
|
||||
if result:
|
||||
self.results["scenarios"].append(result)
|
||||
|
||||
self.results["orchestration_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Calculate summary metrics
|
||||
self.calculate_summary_metrics()
|
||||
|
||||
# Generate report
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"chaos_test_report_{timestamp}.json"
|
||||
self.generate_report(report_file)
|
||||
|
||||
logger.info("Chaos testing orchestration completed")
|
||||
|
||||
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
|
||||
"""Run chaos tests continuously over time"""
|
||||
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
|
||||
|
||||
end_time = datetime.now() + timedelta(hours=duration_hours)
|
||||
interval_seconds = interval_minutes * 60
|
||||
|
||||
all_results = []
|
||||
|
||||
while datetime.now() < end_time:
|
||||
cycle_start = datetime.now()
|
||||
logger.info(f"Starting chaos test cycle at {cycle_start}")
|
||||
|
||||
# Run a random scenario
|
||||
scenarios = [
|
||||
"chaos_test_coordinator.py",
|
||||
"chaos_test_network.py",
|
||||
"chaos_test_database.py"
|
||||
]
|
||||
|
||||
import random
|
||||
selected_scenario = random.choice(scenarios)
|
||||
|
||||
# Run scenario with reduced duration for continuous testing
|
||||
args = ["--namespace", self.namespace]
|
||||
if "coordinator" in selected_scenario:
|
||||
args.extend(["--outage-duration", "30", "--load-duration", "60"])
|
||||
elif "network" in selected_scenario:
|
||||
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
|
||||
elif "database" in selected_scenario:
|
||||
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
|
||||
|
||||
result = await self.run_scenario(selected_scenario, args)
|
||||
if result:
|
||||
result["cycle_time"] = cycle_start.isoformat()
|
||||
all_results.append(result)
|
||||
|
||||
# Wait for next cycle
|
||||
elapsed = (datetime.now() - cycle_start).total_seconds()
|
||||
if elapsed < interval_seconds:
|
||||
wait_time = interval_seconds - elapsed
|
||||
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
|
||||
await asyncio.sleep(wait_time)
|
||||
|
||||
# Generate continuous testing report
|
||||
continuous_report = {
|
||||
"continuous_testing": True,
|
||||
"duration_hours": duration_hours,
|
||||
"interval_minutes": interval_minutes,
|
||||
"total_cycles": len(all_results),
|
||||
"cycles": all_results
|
||||
}
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"continuous_chaos_report_{timestamp}.json"
|
||||
with open(report_file, 'w') as f:
|
||||
json.dump(continuous_report, f, indent=2)
|
||||
|
||||
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--scenarios", nargs="+",
|
||||
choices=["coordinator", "network", "database"],
|
||||
default=["coordinator", "network", "database"],
|
||||
help="Scenarios to run")
|
||||
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
|
||||
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
|
||||
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
orchestrator = ChaosOrchestrator(args.namespace)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
|
||||
return
|
||||
|
||||
if args.continuous:
|
||||
await orchestrator.run_continuous_chaos(args.duration, args.interval)
|
||||
else:
|
||||
# Map scenario names to script files
|
||||
scenario_map = {
|
||||
"coordinator": "chaos_test_coordinator.py",
|
||||
"network": "chaos_test_network.py",
|
||||
"database": "chaos_test_database.py"
|
||||
}
|
||||
|
||||
# Get script files
|
||||
scripts = [scenario_map[s] for s in args.scenarios]
|
||||
|
||||
# Default arguments for each scenario
|
||||
scenario_args = {
|
||||
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
|
||||
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
|
||||
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
|
||||
}
|
||||
|
||||
await orchestrator.run_all_scenarios(scripts, scenario_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
287
infra/scripts/chaos_test_coordinator.py
Executable file
@ -0,0 +1,287 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Coordinator API Outage
|
||||
Tests system resilience when coordinator API becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestCoordinator:
|
||||
"""Chaos testing for coordinator API outage scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"outage_start": None,
|
||||
"outage_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "coordinator_outage"
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def delete_coordinator_pods(self) -> bool:
|
||||
"""Delete all coordinator pods to simulate outage"""
|
||||
try:
|
||||
cmd = [
|
||||
"kubectl", "delete", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"--force", "--grace-period=0"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info("Coordinator pods deleted successfully")
|
||||
return True
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to delete coordinator pods: {e}")
|
||||
return False
|
||||
|
||||
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
|
||||
"""Wait for all coordinator pods to terminate"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
logger.info("All coordinator pods terminated")
|
||||
return True
|
||||
await asyncio.sleep(2)
|
||||
|
||||
logger.error("Timeout waiting for pods to terminate")
|
||||
return False
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for coordinator service to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
try:
|
||||
# Check if pods are running
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Check if at least one pod is ready
|
||||
ready_cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(ready_cmd, capture_output=True, text=True)
|
||||
if result.stdout.strip():
|
||||
# Test API health
|
||||
if self.test_health_endpoint():
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Recovery check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Service did not recover within timeout")
|
||||
return False
|
||||
|
||||
def test_health_endpoint(self) -> bool:
|
||||
"""Test if coordinator health endpoint is responding"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
|
||||
"""Run the complete chaos test"""
|
||||
logger.info("Starting coordinator outage chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Generate initial load
|
||||
logger.info("Phase 1: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 2: Induce outage
|
||||
logger.info("Phase 2: Inducing coordinator outage")
|
||||
self.metrics["outage_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.delete_coordinator_pods():
|
||||
logger.error("Failed to induce outage")
|
||||
return False
|
||||
|
||||
if not await self.wait_for_pods_termination():
|
||||
logger.error("Pods did not terminate")
|
||||
return False
|
||||
|
||||
# Wait for specified outage duration
|
||||
logger.info(f"Waiting for {outage_duration} seconds outage duration")
|
||||
await asyncio.sleep(outage_duration)
|
||||
|
||||
# Phase 3: Monitor recovery
|
||||
logger.info("Phase 3: Monitoring service recovery")
|
||||
self.metrics["outage_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("Service did not recover")
|
||||
return False
|
||||
|
||||
# Phase 4: Post-recovery load test
|
||||
logger.info("Phase 4: Post-recovery load test")
|
||||
await self.generate_load(load_duration)
|
||||
|
||||
# Calculate metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_coordinator_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
|
||||
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestCoordinator(args.namespace) as test:
|
||||
success = await test.run_test(args.outage_duration, args.load_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
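A usage sketch for this script, based on the flags defined in `main()` above; it assumes kubectl access to the target namespace and network reachability to the coordinator ClusterIP from wherever the script runs.

```bash
./chaos_test_coordinator.py --namespace default --dry-run                 # preview without deleting pods
./chaos_test_coordinator.py --namespace default --outage-duration 60 --load-duration 120
```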
387
infra/scripts/chaos_test_database.py
Executable file
@@ -0,0 +1,387 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Database Failure
|
||||
Tests system resilience when PostgreSQL database becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestDatabase:
|
||||
"""Chaos testing for database failure scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"failure_start": None,
|
||||
"failure_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "database_failure",
|
||||
"failure_type": None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_postgresql_pod(self) -> Optional[str]:
|
||||
"""Get PostgreSQL pod name"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=postgresql",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pod = result.stdout.strip()
|
||||
return pod if pod else None
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get PostgreSQL pod: {e}")
|
||||
return None
|
||||
|
||||
def simulate_database_connection_failure(self) -> bool:
|
||||
"""Simulate database connection failure by blocking port 5432"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Block incoming connections to PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
# Block outgoing connections from PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
|
||||
self.metrics["failure_type"] = "connection_blocked"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block PostgreSQL connections: {e}")
|
||||
return False
|
||||
|
||||
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
|
||||
"""Simulate high database latency using netem"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Add latency to PostgreSQL traffic
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
|
||||
self.metrics["failure_type"] = "high_latency"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to add latency to PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
def restore_database(self) -> bool:
|
||||
"""Restore database connections"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Remove iptables rules
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "INPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False) # May fail if rules don't exist
|
||||
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
# Remove netem qdisc
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "del", "dev", "eth0", "root"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
logger.info(f"Restored PostgreSQL connections on pod {pod}")
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to restore PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
async def test_database_connectivity(self) -> bool:
|
||||
"""Test if coordinator can connect to database"""
|
||||
try:
|
||||
# Get coordinator pod
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
coordinator_pod = result.stdout.strip()
|
||||
|
||||
if not coordinator_pod:
|
||||
return False
|
||||
|
||||
# Test database connection from coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
|
||||
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
return result.returncode == 0 and "OK" in result.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def test_api_health(self) -> bool:
|
||||
"""Test if coordinator API is healthy"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for database and API to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
# Test database connectivity
|
||||
db_connected = await self.test_database_connectivity()
|
||||
|
||||
# Test API health
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if db_connected and api_healthy:
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Database and API did not recover within timeout")
|
||||
return False
|
||||
|
||||
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
|
||||
"""Run the complete database chaos test"""
|
||||
logger.info(f"Starting database chaos test - failure type: {failure_type}")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if not db_connected or not api_healthy:
|
||||
logger.error("Baseline test failed - database or API not healthy")
|
||||
return False
|
||||
|
||||
logger.info("Baseline: Database and API are healthy")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Induce database failure
|
||||
logger.info("Phase 3: Inducing database failure")
|
||||
self.metrics["failure_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if failure_type == "connection":
|
||||
if not self.simulate_database_connection_failure():
|
||||
logger.error("Failed to induce database connection failure")
|
||||
return False
|
||||
elif failure_type == "latency":
|
||||
if not self.simulate_database_high_latency():
|
||||
logger.error("Failed to induce database latency")
|
||||
return False
|
||||
else:
|
||||
logger.error(f"Unknown failure type: {failure_type}")
|
||||
return False
|
||||
|
||||
# Verify failure is effective
|
||||
await asyncio.sleep(5)
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
|
||||
|
||||
# Phase 4: Monitor during failure
|
||||
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
|
||||
|
||||
# Generate load during failure
|
||||
await self.generate_load(failure_duration)
|
||||
|
||||
# Phase 5: Restore database and monitor recovery
|
||||
logger.info("Phase 5: Restoring database")
|
||||
self.metrics["failure_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.restore_database():
|
||||
logger.error("Failed to restore database")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("System did not recover after database restoration")
|
||||
return False
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Database chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_database_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Failure Type: {self.metrics['failure_type']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for database failure")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
|
||||
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestDatabase(args.namespace) as test:
|
||||
success = await test.run_test(args.failure_type, args.failure_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
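A usage sketch based on the flags above; `--failure-type` selects between the iptables connection block and the netem latency injection (both assume the corresponding tools are available inside the PostgreSQL pod).

```bash
./chaos_test_database.py --namespace default --failure-type connection --failure-duration 90
./chaos_test_database.py --namespace default --failure-type latency --failure-duration 60
```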
372
infra/scripts/chaos_test_network.py
Executable file
@@ -0,0 +1,372 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Network Partition
|
||||
Tests system resilience when blockchain nodes experience network partitions
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestNetwork:
|
||||
"""Chaos testing for network partition scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"partition_start": None,
|
||||
"partition_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "network_partition",
|
||||
"affected_nodes": []
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_blockchain_pods(self) -> List[str]:
|
||||
"""Get list of blockchain node pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=blockchain-node",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get blockchain pods: {e}")
|
||||
return []
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
|
||||
"""Apply network partition using iptables"""
|
||||
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
|
||||
|
||||
for pod in pods:
|
||||
if pod in target_pods:
|
||||
continue
|
||||
|
||||
# Block traffic from this pod to target pods
|
||||
for target_pod in target_pods:
|
||||
try:
|
||||
# Get target pod IP
|
||||
cmd = [
|
||||
"kubectl", "get", "pod", target_pod,
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.status.podIP}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
target_ip = result.stdout.strip()
|
||||
|
||||
if not target_ip:
|
||||
continue
|
||||
|
||||
# Apply iptables rule to block traffic
|
||||
iptables_cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
|
||||
]
|
||||
subprocess.run(iptables_cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
|
||||
return False
|
||||
|
||||
self.metrics["affected_nodes"] = pods + target_pods
|
||||
return True
|
||||
|
||||
def remove_network_partition(self, pods: List[str]) -> bool:
|
||||
"""Remove network partition rules"""
|
||||
logger.info("Removing network partition rules")
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Flush OUTPUT chain (remove all rules)
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info(f"Removed network rules from {pod}")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to remove network rules from {pod}: {e}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
|
||||
"""Test connectivity between pods"""
|
||||
results = {}
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Test if pod can reach coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
results[pod] = result.returncode == 0 and "ok" in result.stdout
|
||||
|
||||
except Exception:
|
||||
results[pod] = False
|
||||
|
||||
return results
|
||||
|
||||
async def monitor_consensus(self, duration: int = 60) -> bool:
|
||||
"""Monitor blockchain consensus health"""
|
||||
logger.info(f"Monitoring consensus for {duration} seconds")
|
||||
|
||||
start_time = time.time()
|
||||
last_height = 0
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
try:
|
||||
# Get block height from a random pod
|
||||
pods = self.get_blockchain_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Use first pod to check height
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
|
||||
"curl", "-s", "http://localhost:8080/v1/blocks/head"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
current_height = data.get("height", 0)
|
||||
|
||||
# Check if blockchain is progressing
|
||||
if current_height > last_height:
|
||||
last_height = current_height
|
||||
logger.info(f"Blockchain progressing, height: {current_height}")
|
||||
elif time.time() - start_time > 30: # Allow 30s for initial sync
|
||||
logger.warning(f"Blockchain stuck at height {current_height}")
|
||||
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Consensus check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
return last_height > 0
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 5):
|
||||
"""Generate synthetic load on blockchain nodes"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "blockchain-node",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
|
||||
"""Run the complete network partition chaos test"""
|
||||
logger.info("Starting network partition chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Get all blockchain pods
|
||||
all_pods = self.get_blockchain_pods()
|
||||
if not all_pods:
|
||||
logger.error("No blockchain pods found")
|
||||
return False
|
||||
|
||||
# Determine which pods to partition
|
||||
num_partition = int(len(all_pods) * partition_ratio)
|
||||
partition_pods = all_pods[:num_partition]
|
||||
remaining_pods = all_pods[num_partition:]
|
||||
|
||||
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
baseline_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Apply network partition
|
||||
logger.info("Phase 3: Applying network partition")
|
||||
self.metrics["partition_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.apply_network_partition(remaining_pods, partition_pods):
|
||||
logger.error("Failed to apply network partition")
|
||||
return False
|
||||
|
||||
# Verify partition is effective
|
||||
await asyncio.sleep(5)
|
||||
partitioned_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 4: Monitor during partition
|
||||
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
|
||||
consensus_healthy = await self.monitor_consensus(partition_duration)
|
||||
|
||||
# Phase 5: Remove partition and monitor recovery
|
||||
logger.info("Phase 5: Removing network partition")
|
||||
self.metrics["partition_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.remove_network_partition(all_pods):
|
||||
logger.error("Failed to remove network partition")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
logger.info("Waiting for network recovery...")
|
||||
await asyncio.sleep(10)
|
||||
|
||||
# Test connectivity after recovery
|
||||
recovery_connectivity = await self.test_connectivity(all_pods)
|
||||
recovery_time = time.time()
|
||||
|
||||
# Calculate recovery metrics
|
||||
all_connected = all(recovery_connectivity.values())
|
||||
if all_connected:
|
||||
self.metrics["recovery_time"] = recovery_time - (datetime.fromisoformat(self.metrics["partition_end"]).timestamp())
|
||||
logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Network partition chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_network_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for network partition")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
|
||||
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestNetwork(args.namespace) as test:
|
||||
success = await test.run_test(args.partition_duration, args.partition_ratio)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
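A usage sketch based on the flags above; `--partition-ratio` is the fraction of blockchain pods to isolate.

```bash
./chaos_test_network.py --namespace default --partition-duration 60 --partition-ratio 0.5
./chaos_test_network.py --namespace default --partition-duration 120 --partition-ratio 0.25
```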
279
infra/scripts/restore_ledger.sh
Normal file
@@ -0,0 +1,279 @@
#!/bin/bash
|
||||
# Ledger Storage Restore Script for AITBC
|
||||
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_DIR=${2:-}
|
||||
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v jq &> /dev/null; then
|
||||
error "jq is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup directory
|
||||
validate_backup_dir() {
|
||||
if [[ -z "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_directory]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -d "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory not found: $BACKUP_DIR"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for required files
|
||||
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
|
||||
error "metadata.json not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
error "chain.tar.gz not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Using backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Create backup of current ledger before restore
|
||||
create_pre_restore_backup() {
|
||||
local pods=($1)
|
||||
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
|
||||
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
|
||||
|
||||
warn "Creating backup of current ledger before restore..."
|
||||
mkdir -p "$pre_restore_dir"
|
||||
|
||||
# Use the first ready pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
# Create metadata
|
||||
cat > "$pre_restore_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$pre_restore_backup",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $current_height,
|
||||
"backup_type": "pre-restore"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Backup data directories
|
||||
local data_dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${data_dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
|
||||
fi
|
||||
done
|
||||
|
||||
log "Pre-restore backup created: $pre_restore_dir"
|
||||
break
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pods=($1)
|
||||
|
||||
warn "This will replace all current ledger data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down blockchain nodes
|
||||
info "Scaling down blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up blockchain nodes
|
||||
info "Scaling up blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=3 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to be ready
|
||||
local ready_pods=()
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
|
||||
local all_pods=$(get_blockchain_pods)
|
||||
for pod in $all_pods; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
ready_pods+=("$pod")
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
sleep 5
|
||||
((retries--))
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
error "No blockchain nodes became ready after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Restore data to all ready pods
|
||||
for pod in "${ready_pods[@]}"; do
|
||||
info "Restoring ledger data to pod $pod..."
|
||||
|
||||
# Create temp directory on pod
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
|
||||
|
||||
# Extract and copy chain data
|
||||
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy wallet data
|
||||
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy receipt data
|
||||
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
|
||||
|
||||
# Clean up temp directory
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
|
||||
|
||||
log "Ledger data restored to pod $pod"
|
||||
done
|
||||
|
||||
log "Ledger restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pods=($1)
|
||||
|
||||
log "Verifying ledger restore..."
|
||||
|
||||
# Read backup metadata
|
||||
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
|
||||
log "Backup contains blocks up to height: $backup_height"
|
||||
|
||||
# Verify on each pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Check if node is responding
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ "$current_height" -eq "$backup_height" ]]; then
|
||||
log "✓ Pod $pod: Block height matches backup ($current_height)"
|
||||
else
|
||||
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
|
||||
fi
|
||||
|
||||
# Check data directories
|
||||
local dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
|
||||
log "✓ Pod $pod: $dir directory contains $file_count files"
|
||||
else
|
||||
warn "⚠ Pod $pod: $dir directory not found"
|
||||
fi
|
||||
done
|
||||
else
|
||||
error "✗ Pod $pod: Not responding to health checks"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
create_pre_restore_backup "${pods[*]}"
|
||||
perform_restore "${pods[*]}"
|
||||
|
||||
# Get updated pod list after restore
|
||||
pods=($(get_blockchain_pods))
|
||||
verify_restore "${pods[*]}"
|
||||
|
||||
log "Ledger restore process completed successfully"
|
||||
warn "Please verify blockchain synchronization and application functionality"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
228
infra/scripts/restore_postgresql.sh
Executable file
@@ -0,0 +1,228 @@
#!/bin/bash
|
||||
# PostgreSQL Restore Script for AITBC
|
||||
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_restore &> /dev/null; then
|
||||
error "pg_restore is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check whether the backup is gzipped and decompress it if needed
if [[ "$BACKUP_FILE" == *.gz ]]; then
    info "Decompressing backup file..."
    local decompressed_file="/tmp/restore_$(date +%s).dump"
    gunzip -c "$BACKUP_FILE" > "$decompressed_file"
    BACKUP_FILE="$decompressed_file"
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Create backup of current database before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current database before restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Create backup (set PGPASSWORD inside the pod; kubectl exec does not forward the local environment)
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" pg_dump -U "$db_user" -h localhost -d "$db_name" \
    --format=custom --file="/tmp/${pre_restore_backup}.dump"
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace the current database. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Copy backup file to pod
|
||||
local remote_backup="/tmp/restore_$(date +%s).dump"
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
|
||||
|
||||
# Drop the existing database and recreate it (PGPASSWORD is injected into the pod;
# kubectl exec does not forward local environment variables)
log "Dropping existing database..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"

log "Creating new database..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"

# Restore database
log "Restoring database from backup..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" pg_restore -U "$db_user" -h localhost -d "$db_name" \
    --verbose --clean --if-exists "$remote_backup"
|
||||
|
||||
# Clean up remote file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
|
||||
|
||||
log "Database restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying database restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Check table count (PGPASSWORD injected into the pod)
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
|
||||
|
||||
log "Database contains $table_count tables"
|
||||
|
||||
# Check if key tables exist
|
||||
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
|
||||
for table in "${key_tables[@]}"; do
|
||||
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
|
||||
if [[ "$exists" == "t" ]]; then
|
||||
log "✓ Table $table exists"
|
||||
else
|
||||
warn "⚠ Table $table not found"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
verify_restore "$pod"
|
||||
|
||||
log "PostgreSQL restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
223
infra/scripts/restore_redis.sh
Normal file
@@ -0,0 +1,223 @@
#!/bin/bash
|
||||
# Redis Restore Script for AITBC
|
||||
# Usage: ./restore_redis.sh [namespace] [backup_file]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Create backup of current Redis data before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current Redis data before restore..."
|
||||
|
||||
    # Record the last successful save time, then trigger a background save
    local last_save_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
    kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

    # Wait for the background save to complete (LASTSAVE advances when BGSAVE finishes)
    local retries=60
    while [[ $retries -gt 0 ]]; do
        local last_save_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
        if [[ "$last_save_now" -gt "$last_save_before" ]]; then
            break
        fi
        sleep 2
        ((retries--))
    done

    # Copy backup locally
    mkdir -p "$BACKUP_DIR"
    kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"

    # Also backup AOF if it exists
    if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
        kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
    fi

    log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
}

# Perform restore
perform_restore() {
    local pod=$1

    warn "This will replace all current Redis data. Are you sure? (y/N)"
    read -r response
    if [[ ! "$response" =~ ^[Yy]$ ]]; then
        log "Restore cancelled by user"
        exit 0
    fi

    # Scale down Redis to ensure clean restore
    info "Scaling down Redis deployment..."
    kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"

    # Wait for pod to terminate
    kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s

    # Scale up Redis
    info "Scaling up Redis deployment..."
    kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"

    # Wait for new pod to be ready
    local new_pod=$(get_redis_pod)
    kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s

    # Stop Redis server (assumes the container entrypoint tolerates the server
    # being stopped and restarted in place; the connection drop caused by
    # SHUTDOWN is expected, so it must not abort the script)
    info "Stopping Redis server..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE || true

    # Clear existing data
    info "Clearing existing Redis data..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof

    # Copy backup file into place as dump.rdb so Redis loads it on startup
    info "Copying backup file..."
    local remote_file="/data/dump.rdb"
    kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"

    # Set correct permissions
    kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"

    # Start Redis server
    info "Starting Redis server..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes

    # Wait for Redis to be ready
    local retries=30
    while [[ $retries -gt 0 ]]; do
        if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
            log "Redis is ready"
            break
        fi
        sleep 2
        ((retries--))
    done

    if [[ $retries -eq 0 ]]; then
        error "Redis did not start properly after restore"
        exit 1
    fi

    log "Redis restore completed successfully"
}

# Verify restore
verify_restore() {
    local pod=$1

    log "Verifying Redis restore..."

    # Check database size
    local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
    log "Database contains $db_size keys"

    # Check memory usage
    local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
    log "Memory usage: $memory"

    # Check if Redis is responding to commands
    if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
        log "✓ Redis is responding normally"
    else
        error "✗ Redis is not responding"
        exit 1
    fi
}

# Main execution
main() {
    log "Starting Redis restore process"

    check_dependencies
    validate_backup_file

    local pod=$(get_redis_pod)
    create_pre_restore_backup "$pod"
    perform_restore "$pod"

    # Get new pod name after restore
    pod=$(get_redis_pod)
    verify_restore "$pod"

    log "Redis restore process completed successfully"
    warn "Please verify application functionality after restore"
}

# Run main function
main "$@"
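# Example invocation (illustrative): restore the Redis instance in the "aitbc"
# namespace from a dump taken earlier. A bare filename is also accepted and is
# resolved against /tmp/redis-backups by validate_backup_file.
#
#   ./restore_redis.sh aitbc /tmp/redis-backups/pre-restore-redis-20240101_020000.rdb
#   ./restore_redis.sh aitbc pre-restore-redis-20240101_020000.rdb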
25
infra/terraform/environments/dev/main.tf
Normal file
@ -0,0 +1,25 @@
# Development environment configuration
#
# Plain Terraform (run with `terraform init` / `terraform apply`): the shared
# EKS module is instantiated with dev-sized settings via a module block.

module "kubernetes" {
  source = "../../modules/kubernetes"

  cluster_name           = "aitbc-dev"
  environment            = "dev"
  aws_region             = "us-west-2"
  vpc_cidr               = "10.0.0.0/16"
  private_subnet_cidrs   = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnet_cidrs    = ["10.0.101.0/24", "10.0.102.0/24"]
  availability_zones     = ["us-west-2a", "us-west-2b"]
  kubernetes_version     = "1.28"
  enable_public_endpoint = true
  desired_node_count     = 2
  min_node_count         = 1
  max_node_count         = 3
  instance_types         = ["t3.medium"]
}
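# Optional addition (illustrative, not part of this file as committed):
# re-export the module's outputs at the environment level so `terraform output`
# can feed kubeconfig setup.

output "cluster_name" {
  value = module.kubernetes.cluster_name
}

output "cluster_endpoint" {
  value = module.kubernetes.cluster_endpoint
}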
199
infra/terraform/modules/kubernetes/main.tf
Normal file
@ -0,0 +1,199 @@
# Kubernetes cluster module for AITBC infrastructure

terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.20"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.10"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC for the cluster
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.cluster_name}-vpc"
    Environment = var.environment
    Project     = "aitbc"
  }
}

# Subnets
resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name                                        = "${var.cluster_name}-private-${count.index}"
    Environment                                 = var.environment
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}

resource "aws_subnet" "public" {
  count = length(var.public_subnet_cidrs)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name                                        = "${var.cluster_name}-public-${count.index}"
    Environment                                 = var.environment
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }
}

# EKS Cluster
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids = concat(
      aws_subnet.private[*].id,
      aws_subnet.public[*].id
    )
    endpoint_private_access = true
    endpoint_public_access  = var.enable_public_endpoint
  }

  depends_on = [
    aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy
  ]

  tags = {
    Name        = var.cluster_name
    Environment = var.environment
    Project     = "aitbc"
  }
}

# Node groups
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.cluster_name}-main"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  scaling_config {
    desired_size = var.desired_node_count
    max_size     = var.max_node_count
    min_size     = var.min_node_count
  }

  instance_types = var.instance_types

  depends_on = [
    aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy,
    aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy,
    aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly
  ]

  tags = {
    Name        = "${var.cluster_name}-main"
    Environment = var.environment
    Project     = "aitbc"
  }
}

# IAM roles
resource "aws_iam_role" "cluster" {
  name = "${var.cluster_name}-cluster"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "eks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role" "node" {
  name = "${var.cluster_name}-node"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

# IAM policy attachments
resource "aws_iam_role_policy_attachment" "cluster_AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.cluster.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.node.name
}

# Outputs
output "cluster_name" {
  description = "The name of the EKS cluster"
  value       = aws_eks_cluster.main.name
}

output "cluster_endpoint" {
  description = "The endpoint for the EKS cluster"
  value       = aws_eks_cluster.main.endpoint
}

output "cluster_certificate_authority_data" {
  description = "The certificate authority data for the EKS cluster"
  value       = aws_eks_cluster.main.certificate_authority[0].data
}

output "cluster_security_group_id" {
  description = "The security group ID of the EKS cluster"
  value       = aws_eks_cluster.main.vpc_config[0].cluster_security_group_id
}
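# NOTE (illustrative, not part of the committed module): as written, the module
# defines no internet gateway, NAT gateway, or route tables, so nodes in the
# private subnets have no outbound path for image pulls or EKS bootstrap. A
# minimal routing sketch is shown below, assuming a single NAT gateway in the
# first public subnet; resource names here are assumptions, not from this commit.

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
  depends_on    = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}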
75
infra/terraform/modules/kubernetes/variables.tf
Normal file
@ -0,0 +1,75 @@
variable "cluster_name" {
  description = "Name of the EKS cluster"
  type        = string
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-west-2"
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
  default     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
  default     = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

variable "kubernetes_version" {
  description = "Kubernetes version"
  type        = string
  default     = "1.28"
}

variable "enable_public_endpoint" {
  description = "Enable public EKS endpoint"
  type        = bool
  default     = false
}

variable "desired_node_count" {
  description = "Desired number of worker nodes"
  type        = number
  default     = 3
}

variable "min_node_count" {
  description = "Minimum number of worker nodes"
  type        = number
  default     = 1
}

variable "max_node_count" {
  description = "Maximum number of worker nodes"
  type        = number
  default     = 10
}

variable "instance_types" {
  description = "EC2 instance types for worker nodes"
  type        = list(string)
  default     = ["m5.large", "m5.xlarge"]
}