feat: add marketplace metrics, privacy features, and service registry endpoints
- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels - Implement confidential transaction models with encryption support and access control - Add key management system with registration, rotation, and audit logging - Create services and registry routers for service discovery and management - Integrate ZK proof generation for privacy-preserving receipts - Add metrics instru
This commit is contained in:
158
infra/README.md
Normal file
158
infra/README.md
Normal file
@ -0,0 +1,158 @@
|
||||
# AITBC Infrastructure Templates
|
||||
|
||||
This directory contains Terraform and Helm templates for deploying AITBC services across dev, staging, and production environments.
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
infra/
|
||||
├── terraform/ # Infrastructure as Code
|
||||
│ ├── modules/ # Reusable Terraform modules
|
||||
│ │ └── kubernetes/ # EKS cluster module
|
||||
│ └── environments/ # Environment-specific configurations
|
||||
│ ├── dev/
|
||||
│ ├── staging/
|
||||
│ └── prod/
|
||||
└── helm/ # Helm Charts
|
||||
├── charts/ # Application charts
|
||||
│ ├── coordinator/ # Coordinator API chart
|
||||
│ ├── blockchain-node/ # Blockchain node chart
|
||||
│ └── monitoring/ # Monitoring stack (Prometheus, Grafana)
|
||||
└── values/ # Environment-specific values
|
||||
├── dev.yaml
|
||||
├── staging.yaml
|
||||
└── prod.yaml
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Terraform >= 1.0
|
||||
- Helm >= 3.0
|
||||
- kubectl configured for your cluster
|
||||
- AWS CLI configured (for EKS)
|
||||
|
||||
### Deploy Development Environment
|
||||
|
||||
1. **Provision Infrastructure with Terraform:**
|
||||
```bash
|
||||
cd infra/terraform/environments/dev
|
||||
terraform init
|
||||
terraform apply
|
||||
```
|
||||
|
||||
2. **Configure kubectl:**
|
||||
```bash
|
||||
aws eks update-kubeconfig --name aitbc-dev --region us-west-2
|
||||
```
|
||||
|
||||
3. **Deploy Applications with Helm:**
|
||||
```bash
|
||||
# Add required Helm repositories
|
||||
helm repo add bitnami https://charts.bitnami.com/bitnami
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm repo add grafana https://grafana.github.io/helm-charts
|
||||
helm repo update
|
||||
|
||||
# Deploy monitoring stack
|
||||
helm install monitoring ../../helm/charts/monitoring -f ../../helm/values/dev.yaml
|
||||
|
||||
# Deploy coordinator API
|
||||
helm install coordinator ../../helm/charts/coordinator -f ../../helm/values/dev.yaml
|
||||
```
|
||||
|
||||
### Environment Configurations
|
||||
|
||||
#### Development
|
||||
- 1 replica per service
|
||||
- Minimal resource allocation
|
||||
- Public EKS endpoint enabled
|
||||
- 7-day metrics retention
|
||||
|
||||
#### Staging
|
||||
- 2-3 replicas per service
|
||||
- Moderate resource allocation
|
||||
- Autoscaling enabled
|
||||
- 30-day metrics retention
|
||||
- TLS with staging certificates
|
||||
|
||||
#### Production
|
||||
- 3+ replicas per service
|
||||
- High resource allocation
|
||||
- Full autoscaling configuration
|
||||
- 90-day metrics retention
|
||||
- TLS with production certificates
|
||||
- Network policies enabled
|
||||
- Backup configuration enabled
|
||||
|
||||
## Monitoring
|
||||
|
||||
The monitoring stack includes:
|
||||
- **Prometheus**: Metrics collection and storage
|
||||
- **Grafana**: Visualization dashboards
|
||||
- **AlertManager**: Alert routing and notification
|
||||
|
||||
Access Grafana:
|
||||
```bash
|
||||
kubectl port-forward svc/monitoring-grafana 3000:3000
|
||||
# Open http://localhost:3000
|
||||
# Default credentials: admin/admin (check values files for environment-specific passwords)
|
||||
```
|
||||
|
||||
## Scaling Guidelines
|
||||
|
||||
Based on benchmark results (`apps/blockchain-node/scripts/benchmark_throughput.py`):
|
||||
|
||||
- **Coordinator API**: Scale horizontally at ~500 TPS per node
|
||||
- **Blockchain Node**: Scale horizontally at ~1000 TPS per node
|
||||
- **Wallet Daemon**: Scale based on concurrent users
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Private subnets for all application workloads
|
||||
- Network policies restrict traffic between services
|
||||
- Secrets managed via Kubernetes Secrets
|
||||
- TLS termination at ingress level
|
||||
- Pod Security Policies enforced in production
|
||||
|
||||
## Backup and Recovery
|
||||
|
||||
- Automated daily backups of PostgreSQL databases
|
||||
- EBS snapshots for persistent volumes
|
||||
- Cross-region replication for production data
|
||||
- Restore procedures documented in runbooks
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
- Use Spot instances for non-critical workloads
|
||||
- Implement cluster autoscaling
|
||||
- Right-size resources based on metrics
|
||||
- Schedule non-production environments to run only during business hours
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issues and solutions:
|
||||
|
||||
1. **Helm chart fails to install:**
|
||||
- Check if all dependencies are added
|
||||
- Verify kubectl context is correct
|
||||
- Review values files for syntax errors
|
||||
|
||||
2. **Prometheus not scraping metrics:**
|
||||
- Verify ServiceMonitor CRDs are installed
|
||||
- Check service annotations
|
||||
- Review network policies
|
||||
|
||||
3. **High memory usage:**
|
||||
- Review resource limits in values files
|
||||
- Check for memory leaks in applications
|
||||
- Consider increasing node size
|
||||
|
||||
## Contributing
|
||||
|
||||
When adding new services:
|
||||
1. Create a new Helm chart in `helm/charts/`
|
||||
2. Add environment-specific values in `helm/values/`
|
||||
3. Update monitoring configuration to include new service metrics
|
||||
4. Document any special requirements in this README
|
||||
64
infra/helm/charts/blockchain-node/hpa.yaml
Normal file
64
infra/helm/charts/blockchain-node/hpa.yaml
Normal file
@ -0,0 +1,64 @@
|
||||
{{- if .Values.autoscaling.enabled }}
|
||||
apiVersion: autoscaling/v2
|
||||
kind: HorizontalPodAutoscaler
|
||||
metadata:
|
||||
name: {{ include "aitbc-blockchain-node.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-blockchain-node.labels" . | nindent 4 }}
|
||||
spec:
|
||||
scaleTargetRef:
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
name: {{ include "aitbc-blockchain-node.fullname" . }}
|
||||
minReplicas: {{ .Values.autoscaling.minReplicas }}
|
||||
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
|
||||
metrics:
|
||||
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: cpu
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: memory
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
{{- end }}
|
||||
# Custom metrics for blockchain-specific scaling
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: blockchain_transaction_queue_depth
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "100"
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: blockchain_pending_transactions
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "500"
|
||||
behavior:
|
||||
scaleDown:
|
||||
stabilizationWindowSeconds: 600 # Longer stabilization for blockchain
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 5
|
||||
periodSeconds: 60
|
||||
scaleUp:
|
||||
stabilizationWindowSeconds: 60
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 50
|
||||
periodSeconds: 60
|
||||
- type: Pods
|
||||
value: 2
|
||||
periodSeconds: 60
|
||||
selectPolicy: Max
|
||||
{{- end }}
|
||||
11
infra/helm/charts/coordinator/Chart.yaml
Normal file
11
infra/helm/charts/coordinator/Chart.yaml
Normal file
@ -0,0 +1,11 @@
|
||||
apiVersion: v2
|
||||
name: aitbc-coordinator
|
||||
description: AITBC Coordinator API Helm Chart
|
||||
type: application
|
||||
version: 0.1.0
|
||||
appVersion: "0.1.0"
|
||||
dependencies:
|
||||
- name: postgresql
|
||||
version: 12.x.x
|
||||
repository: https://charts.bitnami.com/bitnami
|
||||
condition: postgresql.enabled
|
||||
62
infra/helm/charts/coordinator/templates/_helpers.tpl
Normal file
62
infra/helm/charts/coordinator/templates/_helpers.tpl
Normal file
@ -0,0 +1,62 @@
|
||||
{{/*
|
||||
Expand the name of the chart.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.name" -}}
|
||||
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create a default fully qualified app name.
|
||||
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
|
||||
If release name contains chart name it will be used as a full name.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.fullname" -}}
|
||||
{{- if .Values.fullnameOverride }}
|
||||
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
|
||||
{{- else }}
|
||||
{{- $name := default .Chart.Name .Values.nameOverride }}
|
||||
{{- if contains $name .Release.Name }}
|
||||
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
|
||||
{{- else }}
|
||||
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create chart name and version as used by the chart label.
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.chart" -}}
|
||||
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Common labels
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.labels" -}}
|
||||
helm.sh/chart: {{ include "aitbc-coordinator.chart" . }}
|
||||
{{ include "aitbc-coordinator.selectorLabels" . }}
|
||||
{{- if .Chart.AppVersion }}
|
||||
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
|
||||
{{- end }}
|
||||
app.kubernetes.io/managed-by: {{ .Release.Service }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Selector labels
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.selectorLabels" -}}
|
||||
app.kubernetes.io/name: {{ include "aitbc-coordinator.name" . }}
|
||||
app.kubernetes.io/instance: {{ .Release.Name }}
|
||||
{{- end }}
|
||||
|
||||
{{/*
|
||||
Create the name of the service account to use
|
||||
*/}}
|
||||
{{- define "aitbc-coordinator.serviceAccountName" -}}
|
||||
{{- if .Values.serviceAccount.create }}
|
||||
{{- default (include "aitbc-coordinator.fullname" .) .Values.serviceAccount.name }}
|
||||
{{- else }}
|
||||
{{- default "default" .Values.serviceAccount.name }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
90
infra/helm/charts/coordinator/templates/deployment.yaml
Normal file
90
infra/helm/charts/coordinator/templates/deployment.yaml
Normal file
@ -0,0 +1,90 @@
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
{{- if not .Values.autoscaling.enabled }}
|
||||
replicas: {{ .Values.replicaCount }}
|
||||
{{- end }}
|
||||
selector:
|
||||
matchLabels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
|
||||
template:
|
||||
metadata:
|
||||
annotations:
|
||||
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
|
||||
{{- with .Values.podAnnotations }}
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 8 }}
|
||||
spec:
|
||||
{{- with .Values.imagePullSecrets }}
|
||||
imagePullSecrets:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
serviceAccountName: {{ include "aitbc-coordinator.serviceAccountName" . }}
|
||||
securityContext:
|
||||
{{- toYaml .Values.podSecurityContext | nindent 8 }}
|
||||
containers:
|
||||
- name: {{ .Chart.Name }}
|
||||
securityContext:
|
||||
{{- toYaml .Values.securityContext | nindent 12 }}
|
||||
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
|
||||
imagePullPolicy: {{ .Values.image.pullPolicy }}
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: {{ .Values.service.targetPort }}
|
||||
protocol: TCP
|
||||
livenessProbe:
|
||||
{{- toYaml .Values.livenessProbe | nindent 12 }}
|
||||
readinessProbe:
|
||||
{{- toYaml .Values.readinessProbe | nindent 12 }}
|
||||
resources:
|
||||
{{- toYaml .Values.resources | nindent 12 }}
|
||||
env:
|
||||
- name: APP_ENV
|
||||
value: {{ .Values.config.appEnv }}
|
||||
- name: DATABASE_URL
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: database-url
|
||||
- name: ALLOW_ORIGINS
|
||||
value: {{ .Values.config.allowOrigins | quote }}
|
||||
{{- if .Values.config.receiptSigningKeyHex }}
|
||||
- name: RECEIPT_SIGNING_KEY_HEX
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: receipt-signing-key
|
||||
{{- end }}
|
||||
{{- if .Values.config.receiptAttestationKeyHex }}
|
||||
- name: RECEIPT_ATTESTATION_KEY_HEX
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
key: receipt-attestation-key
|
||||
{{- end }}
|
||||
volumeMounts:
|
||||
- name: config
|
||||
mountPath: /app/.env
|
||||
subPath: .env
|
||||
volumes:
|
||||
- name: config
|
||||
configMap:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
{{- with .Values.nodeSelector }}
|
||||
nodeSelector:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
{{- with .Values.affinity }}
|
||||
affinity:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
{{- with .Values.tolerations }}
|
||||
tolerations:
|
||||
{{- toYaml . | nindent 8 }}
|
||||
{{- end }}
|
||||
60
infra/helm/charts/coordinator/templates/hpa.yaml
Normal file
60
infra/helm/charts/coordinator/templates/hpa.yaml
Normal file
@ -0,0 +1,60 @@
|
||||
{{- if .Values.autoscaling.enabled }}
|
||||
apiVersion: autoscaling/v2
|
||||
kind: HorizontalPodAutoscaler
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
scaleTargetRef:
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
minReplicas: {{ .Values.autoscaling.minReplicas }}
|
||||
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
|
||||
metrics:
|
||||
{{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: cpu
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
- type: Resource
|
||||
resource:
|
||||
name: memory
|
||||
target:
|
||||
type: Utilization
|
||||
averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
|
||||
{{- end }}
|
||||
{{- if .Values.autoscaling.customMetrics }}
|
||||
{{- range .Values.autoscaling.customMetrics }}
|
||||
- type: External
|
||||
external:
|
||||
metric:
|
||||
name: {{ .name }}
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: {{ .targetValue }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
behavior:
|
||||
scaleDown:
|
||||
stabilizationWindowSeconds: 300
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 10
|
||||
periodSeconds: 60
|
||||
scaleUp:
|
||||
stabilizationWindowSeconds: 0
|
||||
policies:
|
||||
- type: Percent
|
||||
value: 100
|
||||
periodSeconds: 15
|
||||
- type: Pods
|
||||
value: 4
|
||||
periodSeconds: 15
|
||||
selectPolicy: Max
|
||||
{{- end }}
|
||||
70
infra/helm/charts/coordinator/templates/ingress.yaml
Normal file
70
infra/helm/charts/coordinator/templates/ingress.yaml
Normal file
@ -0,0 +1,70 @@
|
||||
{{- if .Values.ingress.enabled -}}
|
||||
{{- $fullName := include "aitbc-coordinator.fullname" . -}}
|
||||
{{- $svcPort := .Values.service.port -}}
|
||||
{{- if and .Values.ingress.className (not (hasKey .Values.ingress.annotations "kubernetes.io/ingress.class")) }}
|
||||
{{- $_ := set .Values.ingress.annotations "kubernetes.io/ingress.class" .Values.ingress.className}}
|
||||
{{- end }}
|
||||
{{- if semverCompare ">=1.19-0" .Capabilities.KubeVersion.GitVersion -}}
|
||||
apiVersion: networking.k8s.io/v1
|
||||
{{- else -}}
|
||||
apiVersion: networking.k8s.io/v1beta1
|
||||
{{- end }}
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: {{ $fullName }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
annotations:
|
||||
# Security annotations (always applied)
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.3"
|
||||
nginx.ingress.kubernetes.io/ssl-ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
|
||||
nginx.ingress.kubernetes.io/configuration-snippet: |
|
||||
more_set_headers "X-Frame-Options: DENY";
|
||||
more_set_headers "X-Content-Type-Options: nosniff";
|
||||
more_set_headers "X-XSS-Protection: 1; mode=block";
|
||||
more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
|
||||
more_set_headers "Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'";
|
||||
more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains; preload";
|
||||
cert-manager.io/cluster-issuer: {{ .Values.ingress.certManager.issuer | default "letsencrypt-prod" }}
|
||||
# User-provided annotations
|
||||
{{- with .Values.ingress.annotations }}
|
||||
{{- toYaml . | nindent 4 }}
|
||||
{{- end }}
|
||||
spec:
|
||||
{{- if and .Values.ingress.className (semverCompare ">=1.18-0" .Capabilities.KubeVersion.GitVersion) }}
|
||||
ingressClassName: {{ .Values.ingress.className }}
|
||||
{{- end }}
|
||||
{{- if .Values.ingress.tls }}
|
||||
tls:
|
||||
{{- range .Values.ingress.tls }}
|
||||
- hosts:
|
||||
{{- range .hosts }}
|
||||
- {{ . | quote }}
|
||||
{{- end }}
|
||||
secretName: {{ .secretName }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
rules:
|
||||
{{- range .Values.ingress.hosts }}
|
||||
- host: {{ .host | quote }}
|
||||
http:
|
||||
paths:
|
||||
{{- range .paths }}
|
||||
- path: {{ .path }}
|
||||
{{- if and .pathType (semverCompare ">=1.18-0" $.Capabilities.KubeVersion.GitVersion) }}
|
||||
pathType: {{ .pathType }}
|
||||
{{- end }}
|
||||
backend:
|
||||
{{- if semverCompare ">=1.19-0" $.Capabilities.KubeVersion.GitVersion }}
|
||||
service:
|
||||
name: {{ $fullName }}
|
||||
port:
|
||||
number: {{ $svcPort }}
|
||||
{{- else }}
|
||||
serviceName: {{ $fullName }}
|
||||
servicePort: {{ $svcPort }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
{{- end }}
|
||||
73
infra/helm/charts/coordinator/templates/networkpolicy.yaml
Normal file
73
infra/helm/charts/coordinator/templates/networkpolicy.yaml
Normal file
@ -0,0 +1,73 @@
|
||||
{{- if .Values.networkPolicy.enabled }}
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
podSelector:
|
||||
matchLabels:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 6 }}
|
||||
policyTypes:
|
||||
- Ingress
|
||||
- Egress
|
||||
ingress:
|
||||
# Allow traffic from ingress controller
|
||||
- from:
|
||||
- namespaceSelector:
|
||||
matchLabels:
|
||||
name: ingress-nginx
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: ingress-nginx
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from monitoring
|
||||
- from:
|
||||
- namespaceSelector:
|
||||
matchLabels:
|
||||
name: monitoring
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: prometheus
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from wallet-daemon
|
||||
- from:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: wallet-daemon
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
# Allow traffic from same namespace for internal communication
|
||||
- from:
|
||||
- podSelector: {}
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: http
|
||||
egress:
|
||||
# Allow DNS resolution
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: UDP
|
||||
port: 53
|
||||
# Allow PostgreSQL access
|
||||
- to:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: postgresql
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 5432
|
||||
# Allow external API calls (if needed)
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 443
|
||||
- protocol: TCP
|
||||
port: 80
|
||||
{{- end }}
|
||||
@ -0,0 +1,59 @@
|
||||
{{- if .Values.podSecurityPolicy.enabled }}
|
||||
apiVersion: policy/v1beta1
|
||||
kind: PodSecurityPolicy
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
spec:
|
||||
privileged: false
|
||||
allowPrivilegeEscalation: false
|
||||
requiredDropCapabilities:
|
||||
- ALL
|
||||
volumes:
|
||||
- 'configMap'
|
||||
- 'emptyDir'
|
||||
- 'projected'
|
||||
- 'secret'
|
||||
- 'downwardAPI'
|
||||
- 'persistentVolumeClaim'
|
||||
runAsUser:
|
||||
rule: 'MustRunAsNonRoot'
|
||||
seLinux:
|
||||
rule: 'RunAsAny'
|
||||
fsGroup:
|
||||
rule: 'RunAsAny'
|
||||
readOnlyRootFilesystem: false
|
||||
securityContext:
|
||||
runAsNonRoot: true
|
||||
runAsUser: 1000
|
||||
fsGroup: 1000
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: Role
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
rules:
|
||||
- apiGroups: ['policy']
|
||||
resources: ['podsecuritypolicies']
|
||||
verbs: ['use']
|
||||
resourceNames:
|
||||
- {{ include "aitbc-coordinator.fullname" . }}
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: RoleBinding
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
roleRef:
|
||||
kind: Role
|
||||
name: {{ include "aitbc-coordinator.fullname" }}-psp
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: {{ include "aitbc-coordinator.serviceAccountName" . }}
|
||||
namespace: {{ .Release.Namespace }}
|
||||
{{- end }}
|
||||
21
infra/helm/charts/coordinator/templates/service.yaml
Normal file
21
infra/helm/charts/coordinator/templates/service.yaml
Normal file
@ -0,0 +1,21 @@
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: {{ include "aitbc-coordinator.fullname" . }}
|
||||
labels:
|
||||
{{- include "aitbc-coordinator.labels" . | nindent 4 }}
|
||||
{{- if .Values.monitoring.enabled }}
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "{{ .Values.service.port }}"
|
||||
prometheus.io/path: "{{ .Values.monitoring.serviceMonitor.path }}"
|
||||
{{- end }}
|
||||
spec:
|
||||
type: {{ .Values.service.type }}
|
||||
ports:
|
||||
- port: {{ .Values.service.port }}
|
||||
targetPort: {{ .Values.service.targetPort }}
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
{{- include "aitbc-coordinator.selectorLabels" . | nindent 4 }}
|
||||
162
infra/helm/charts/coordinator/values.yaml
Normal file
162
infra/helm/charts/coordinator/values.yaml
Normal file
@ -0,0 +1,162 @@
|
||||
# Default values for aitbc-coordinator.
|
||||
# This is a YAML-formatted file.
|
||||
# Declare variables to be passed into your templates.
|
||||
|
||||
replicaCount: 1
|
||||
|
||||
image:
|
||||
repository: aitbc/coordinator-api
|
||||
pullPolicy: IfNotPresent
|
||||
tag: "0.1.0"
|
||||
|
||||
nameOverride: ""
|
||||
fullnameOverride: ""
|
||||
|
||||
serviceAccount:
|
||||
# Specifies whether a service account should be created
|
||||
create: true
|
||||
# Annotations to add to the service account
|
||||
annotations: {}
|
||||
# The name of the service account to use.
|
||||
# If not set and create is true, a name is generated using the fullname template
|
||||
name: ""
|
||||
|
||||
podAnnotations: {}
|
||||
|
||||
podSecurityContext:
|
||||
fsGroup: 1000
|
||||
|
||||
securityContext:
|
||||
allowPrivilegeEscalation: false
|
||||
runAsNonRoot: true
|
||||
runAsUser: 1000
|
||||
capabilities:
|
||||
drop:
|
||||
- ALL
|
||||
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 8011
|
||||
targetPort: 8011
|
||||
|
||||
ingress:
|
||||
enabled: false
|
||||
className: nginx
|
||||
annotations: {}
|
||||
# cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
hosts:
|
||||
- host: coordinator.local
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls: []
|
||||
# - secretName: coordinator-tls
|
||||
# hosts:
|
||||
# - coordinator.local
|
||||
|
||||
# Pod Security Policy
|
||||
podSecurityPolicy:
|
||||
enabled: true
|
||||
|
||||
# Network policies
|
||||
networkPolicy:
|
||||
enabled: true
|
||||
|
||||
security:
|
||||
auth:
|
||||
enabled: true
|
||||
requireApiKey: true
|
||||
apiKeyHeader: "X-API-Key"
|
||||
tls:
|
||||
version: "TLSv1.3"
|
||||
ciphers: "TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256"
|
||||
headers:
|
||||
frameOptions: "DENY"
|
||||
contentTypeOptions: "nosniff"
|
||||
xssProtection: "1; mode=block"
|
||||
referrerPolicy: "strict-origin-when-cross-origin"
|
||||
hsts:
|
||||
enabled: true
|
||||
maxAge: 31536000
|
||||
includeSubDomains: true
|
||||
preload: true
|
||||
rateLimit:
|
||||
enabled: true
|
||||
requestsPerMinute: 60
|
||||
burst: 10
|
||||
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
|
||||
autoscaling:
|
||||
enabled: false
|
||||
minReplicas: 1
|
||||
maxReplicas: 10
|
||||
targetCPUUtilizationPercentage: 80
|
||||
# targetMemoryUtilizationPercentage: 80
|
||||
|
||||
nodeSelector: {}
|
||||
|
||||
tolerations: []
|
||||
|
||||
affinity: {}
|
||||
|
||||
# Configuration
|
||||
config:
|
||||
appEnv: production
|
||||
databaseUrl: "postgresql://aitbc:password@postgresql:5432/aitbc"
|
||||
receiptSigningKeyHex: ""
|
||||
receiptAttestationKeyHex: ""
|
||||
allowOrigins: "*"
|
||||
|
||||
# PostgreSQL sub-chart configuration
|
||||
postgresql:
|
||||
enabled: true
|
||||
auth:
|
||||
postgresPassword: "password"
|
||||
username: aitbc
|
||||
database: aitbc
|
||||
primary:
|
||||
persistence:
|
||||
enabled: true
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
# Monitoring
|
||||
monitoring:
|
||||
enabled: true
|
||||
serviceMonitor:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
|
||||
# Health checks
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /v1/health
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 3
|
||||
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /v1/health
|
||||
port: http
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 5
|
||||
timeoutSeconds: 3
|
||||
failureThreshold: 3
|
||||
19
infra/helm/charts/monitoring/Chart.yaml
Normal file
19
infra/helm/charts/monitoring/Chart.yaml
Normal file
@ -0,0 +1,19 @@
|
||||
apiVersion: v2
|
||||
name: aitbc-monitoring
|
||||
description: AITBC Monitoring Stack (Prometheus, Grafana, AlertManager)
|
||||
type: application
|
||||
version: 0.1.0
|
||||
appVersion: "0.1.0"
|
||||
dependencies:
|
||||
- name: prometheus
|
||||
version: 23.1.0
|
||||
repository: https://prometheus-community.github.io/helm-charts
|
||||
condition: prometheus.enabled
|
||||
- name: grafana
|
||||
version: 6.58.9
|
||||
repository: https://grafana.github.io/helm-charts
|
||||
condition: grafana.enabled
|
||||
- name: alertmanager
|
||||
version: 1.6.1
|
||||
repository: https://prometheus-community.github.io/helm-charts
|
||||
condition: alertmanager.enabled
|
||||
13
infra/helm/charts/monitoring/templates/dashboards.yaml
Normal file
13
infra/helm/charts/monitoring/templates/dashboards.yaml
Normal file
@ -0,0 +1,13 @@
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: {{ include "aitbc-monitoring.fullname" . }}-dashboards
|
||||
labels:
|
||||
{{- include "aitbc-monitoring.labels" . | nindent 4 }}
|
||||
annotations:
|
||||
grafana.io/dashboard: "1"
|
||||
data:
|
||||
blockchain-node-overview.json: |
|
||||
{{ .Files.Get "dashboards/blockchain-node-overview.json" | indent 4 }}
|
||||
coordinator-overview.json: |
|
||||
{{ .Files.Get "dashboards/coordinator-overview.json" | indent 4 }}
|
||||
124
infra/helm/charts/monitoring/values.yaml
Normal file
124
infra/helm/charts/monitoring/values.yaml
Normal file
@ -0,0 +1,124 @@
|
||||
# Default values for aitbc-monitoring.
|
||||
|
||||
# Prometheus configuration
|
||||
prometheus:
|
||||
enabled: true
|
||||
server:
|
||||
enabled: true
|
||||
global:
|
||||
scrape_interval: 15s
|
||||
evaluation_interval: 15s
|
||||
retention: 30d
|
||||
persistentVolume:
|
||||
enabled: true
|
||||
size: 100Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 9090
|
||||
serviceMonitors:
|
||||
enabled: true
|
||||
selector:
|
||||
release: monitoring
|
||||
alertmanager:
|
||||
enabled: false
|
||||
config:
|
||||
global:
|
||||
resolve_timeout: 5m
|
||||
route:
|
||||
group_by: ['alertname']
|
||||
group_wait: 10s
|
||||
group_interval: 10s
|
||||
repeat_interval: 1h
|
||||
receiver: 'web.hook'
|
||||
receivers:
|
||||
- name: 'web.hook'
|
||||
webhook_configs:
|
||||
- url: 'http://127.0.0.1:5001/'
|
||||
|
||||
# Grafana configuration
|
||||
grafana:
|
||||
enabled: true
|
||||
adminPassword: admin
|
||||
persistence:
|
||||
enabled: true
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
service:
|
||||
type: ClusterIP
|
||||
port: 3000
|
||||
datasources:
|
||||
datasources.yaml:
|
||||
apiVersion: 1
|
||||
datasources:
|
||||
- name: Prometheus
|
||||
type: prometheus
|
||||
url: http://prometheus-server:9090
|
||||
access: proxy
|
||||
isDefault: true
|
||||
dashboardProviders:
|
||||
dashboardproviders.yaml:
|
||||
apiVersion: 1
|
||||
providers:
|
||||
- name: 'default'
|
||||
orgId: 1
|
||||
folder: ''
|
||||
type: file
|
||||
disableDeletion: false
|
||||
editable: true
|
||||
options:
|
||||
path: /var/lib/grafana/dashboards/default
|
||||
|
||||
# Service monitors for AITBC services
|
||||
serviceMonitors:
|
||||
coordinator:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
blockchainNode:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
walletDaemon:
|
||||
enabled: true
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
port: http
|
||||
|
||||
# Alert rules
|
||||
alertRules:
|
||||
enabled: true
|
||||
groups:
|
||||
- name: aitbc.rules
|
||||
rules:
|
||||
- alert: HighErrorRate
|
||||
expr: rate(marketplace_errors_total[5m]) / rate(marketplace_requests_total[5m]) > 0.1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High error rate detected"
|
||||
description: "Error rate is above 10% for 5 minutes"
|
||||
|
||||
- alert: CoordinatorDown
|
||||
expr: up{job="coordinator"} == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Coordinator is down"
|
||||
description: "Coordinator API has been down for more than 1 minute"
|
||||
77
infra/helm/values/dev.yaml
Normal file
77
infra/helm/values/dev.yaml
Normal file
@ -0,0 +1,77 @@
|
||||
# Development environment values
|
||||
global:
|
||||
environment: dev
|
||||
|
||||
coordinator:
|
||||
replicaCount: 1
|
||||
image:
|
||||
tag: "dev-latest"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
config:
|
||||
appEnv: development
|
||||
allowOrigins: "*"
|
||||
postgresql:
|
||||
auth:
|
||||
postgresPassword: "dev-password"
|
||||
primary:
|
||||
persistence:
|
||||
size: 10Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 7d
|
||||
persistentVolume:
|
||||
size: 20Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
grafana:
|
||||
adminPassword: "dev-admin"
|
||||
persistence:
|
||||
size: 5Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 125m
|
||||
memory: 256Mi
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 1
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 1
|
||||
resources:
|
||||
limits:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
requests:
|
||||
cpu: 125m
|
||||
memory: 128Mi
|
||||
140
infra/helm/values/prod.yaml
Normal file
140
infra/helm/values/prod.yaml
Normal file
@ -0,0 +1,140 @@
|
||||
# Production environment values
|
||||
global:
|
||||
environment: production
|
||||
|
||||
coordinator:
|
||||
replicaCount: 3
|
||||
image:
|
||||
tag: "v0.1.0"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 3
|
||||
maxReplicas: 20
|
||||
targetCPUUtilizationPercentage: 75
|
||||
targetMemoryUtilizationPercentage: 80
|
||||
config:
|
||||
appEnv: production
|
||||
allowOrigins: "https://app.aitbc.io"
|
||||
postgresql:
|
||||
auth:
|
||||
existingSecret: "coordinator-db-secret"
|
||||
primary:
|
||||
persistence:
|
||||
size: 200Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
readReplicas:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 90d
|
||||
persistentVolume:
|
||||
size: 500Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
grafana:
|
||||
adminPassword: "prod-admin-secure-2024"
|
||||
persistence:
|
||||
size: 50Gi
|
||||
storageClass: fast-ssd
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
ingress:
|
||||
enabled: true
|
||||
hosts:
|
||||
- grafana.aitbc.io
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 5
|
||||
resources:
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 5
|
||||
maxReplicas: 50
|
||||
targetCPUUtilizationPercentage: 70
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 3
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 3
|
||||
maxReplicas: 10
|
||||
targetCPUUtilizationPercentage: 75
|
||||
|
||||
# Ingress configuration
|
||||
ingress:
|
||||
enabled: true
|
||||
className: nginx
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
nginx.ingress.kubernetes.io/rate-limit: "100"
|
||||
nginx.ingress.kubernetes.io/rate-limit-window: "1m"
|
||||
hosts:
|
||||
- host: api.aitbc.io
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls:
|
||||
- secretName: prod-tls
|
||||
hosts:
|
||||
- api.aitbc.io
|
||||
|
||||
# Security
|
||||
podSecurityPolicy:
|
||||
enabled: true
|
||||
|
||||
networkPolicy:
|
||||
enabled: true
|
||||
|
||||
# Backup configuration
|
||||
backup:
|
||||
enabled: true
|
||||
schedule: "0 2 * * *"
|
||||
retention: "30d"
|
||||
98
infra/helm/values/staging.yaml
Normal file
98
infra/helm/values/staging.yaml
Normal file
@ -0,0 +1,98 @@
|
||||
# Staging environment values
|
||||
global:
|
||||
environment: staging
|
||||
|
||||
coordinator:
|
||||
replicaCount: 2
|
||||
image:
|
||||
tag: "staging-latest"
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
autoscaling:
|
||||
enabled: true
|
||||
minReplicas: 2
|
||||
maxReplicas: 5
|
||||
targetCPUUtilizationPercentage: 70
|
||||
config:
|
||||
appEnv: staging
|
||||
allowOrigins: "https://staging.aitbc.io"
|
||||
postgresql:
|
||||
auth:
|
||||
postgresPassword: "staging-password"
|
||||
primary:
|
||||
persistence:
|
||||
size: 50Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
|
||||
monitoring:
|
||||
prometheus:
|
||||
server:
|
||||
retention: 30d
|
||||
persistentVolume:
|
||||
size: 100Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 2Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
grafana:
|
||||
adminPassword: "staging-admin-2024"
|
||||
persistence:
|
||||
size: 10Gi
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
|
||||
# Additional services
|
||||
blockchainNode:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
|
||||
walletDaemon:
|
||||
replicaCount: 2
|
||||
resources:
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 256Mi
|
||||
|
||||
# Ingress configuration
|
||||
ingress:
|
||||
enabled: true
|
||||
className: nginx
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
hosts:
|
||||
- host: api.staging.aitbc.io
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
tls:
|
||||
- secretName: staging-tls
|
||||
hosts:
|
||||
- api.staging.aitbc.io
|
||||
570
infra/k8s/backup-configmap.yaml
Normal file
570
infra/k8s/backup-configmap.yaml
Normal file
@ -0,0 +1,570 @@
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: backup-scripts
|
||||
namespace: default
|
||||
labels:
|
||||
app: aitbc-backup
|
||||
component: backup
|
||||
data:
|
||||
backup_postgresql.sh: |
|
||||
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup
|
||||
PGPASSWORD="$db_password" kubectl exec -n "$NAMESPACE" "$pod" -- \
|
||||
pg_dump -U "$db_user" -h localhost -d "$db_name" \
|
||||
--verbose --clean --if-exists --create --format=custom \
|
||||
--file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
||||
backup_redis.sh: |
|
||||
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
# Create Redis backup
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE
|
||||
|
||||
# Wait for background save to complete
|
||||
log "Waiting for background save to complete..."
|
||||
local retries=60
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
local lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
|
||||
local lastbgsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
|
||||
|
||||
if [[ "$lastsave" -gt "$lastbgsave" ]]; then
|
||||
log "Background save completed"
|
||||
break
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
|
||||
backup_ledger.sh: |
|
||||
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
backup_ledger_data "$pod"
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
156
infra/k8s/backup-cronjob.yaml
Normal file
@ -0,0 +1,156 @@
|
||||
apiVersion: batch/v1
|
||||
kind: CronJob
|
||||
metadata:
|
||||
name: aitbc-backup
|
||||
namespace: default
|
||||
labels:
|
||||
app: aitbc-backup
|
||||
component: backup
|
||||
spec:
|
||||
schedule: "0 2 * * *" # Run daily at 2 AM
|
||||
concurrencyPolicy: Forbid
|
||||
successfulJobsHistoryLimit: 7
|
||||
failedJobsHistoryLimit: 3
|
||||
jobTemplate:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
restartPolicy: OnFailure
|
||||
containers:
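# The containers below run the backup scripts mounted from the backup-scripts
# ConfigMap; those scripts shell out to kubectl (and the ledger script to jq),
# so the chosen images are assumed to provide these tools in addition to the
# database clients.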
|
||||
- name: postgresql-backup
|
||||
image: postgres:15-alpine
|
||||
command:
|
||||
- /bin/bash
|
||||
- -c
|
||||
- |
|
||||
echo "Starting PostgreSQL backup..."
|
||||
/scripts/backup_postgresql.sh default postgresql-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "PostgreSQL backup completed"
|
||||
env:
|
||||
- name: PGPASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: coordinator-postgresql
|
||||
key: password
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "256Mi"
|
||||
cpu: "100m"
|
||||
limits:
|
||||
memory: "512Mi"
|
||||
cpu: "500m"
|
||||
|
||||
- name: redis-backup
|
||||
image: redis:7-alpine
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- |
|
||||
echo "Waiting for PostgreSQL backup to complete..."
|
||||
sleep 60
|
||||
echo "Starting Redis backup..."
|
||||
/scripts/backup_redis.sh default redis-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "Redis backup completed"
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "128Mi"
|
||||
cpu: "50m"
|
||||
limits:
|
||||
memory: "256Mi"
|
||||
cpu: "200m"
|
||||
|
||||
- name: ledger-backup
|
||||
image: alpine:3.18
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- |
|
||||
echo "Waiting for previous backups to complete..."
|
||||
sleep 120
|
||||
echo "Starting Ledger backup..."
|
||||
/scripts/backup_ledger.sh default ledger-backup-$(date +%Y%m%d_%H%M%S)
|
||||
echo "Ledger backup completed"
|
||||
volumeMounts:
|
||||
- name: backup-scripts
|
||||
mountPath: /scripts
|
||||
readOnly: true
|
||||
- name: backup-storage
|
||||
mountPath: /backups
|
||||
resources:
|
||||
requests:
|
||||
memory: "256Mi"
|
||||
cpu: "100m"
|
||||
limits:
|
||||
memory: "512Mi"
|
||||
cpu: "500m"
|
||||
|
||||
volumes:
|
||||
- name: backup-scripts
|
||||
configMap:
|
||||
name: backup-scripts
|
||||
defaultMode: 0755
|
||||
|
||||
- name: backup-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: backup-storage-pvc
|
||||
|
||||
# Add service account for cloud storage access
|
||||
serviceAccountName: backup-service-account
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: backup-service-account
|
||||
namespace: default
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: Role
|
||||
metadata:
|
||||
name: backup-role
|
||||
namespace: default
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources: ["pods", "pods/exec", "secrets"]
|
||||
verbs: ["get", "list"]
|
||||
- apiGroups: ["batch"]
|
||||
resources: ["jobs", "cronjobs"]
|
||||
verbs: ["get", "list", "watch"]
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: RoleBinding
|
||||
metadata:
|
||||
name: backup-role-binding
|
||||
namespace: default
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: backup-service-account
|
||||
namespace: default
|
||||
roleRef:
|
||||
kind: Role
|
||||
name: backup-role
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: backup-storage-pvc
|
||||
namespace: default
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: fast-ssd
|
||||
resources:
|
||||
requests:
|
||||
storage: 500Gi
|
||||
99
infra/k8s/cert-manager.yaml
Normal file
@ -0,0 +1,99 @@
|
||||
# Cert-Manager Installation
|
||||
apiVersion: argoproj.io/v1alpha1
|
||||
kind: Application
|
||||
metadata:
|
||||
name: cert-manager
|
||||
namespace: argocd
|
||||
finalizers:
|
||||
- resources-finalizer.argocd.argoproj.io
|
||||
spec:
|
||||
project: default
|
||||
source:
|
||||
repoURL: https://charts.jetstack.io
|
||||
chart: cert-manager
|
||||
targetRevision: v1.14.0
|
||||
helm:
|
||||
releaseName: cert-manager
|
||||
parameters:
|
||||
- name: installCRDs
|
||||
value: "true"
|
||||
- name: namespace
|
||||
value: cert-manager
|
||||
destination:
|
||||
server: https://kubernetes.default.svc
|
||||
namespace: cert-manager
|
||||
syncPolicy:
|
||||
automated:
|
||||
prune: true
|
||||
selfHeal: true
|
||||
---
|
||||
# Let's Encrypt Production ClusterIssuer
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata:
|
||||
name: letsencrypt-prod
|
||||
spec:
|
||||
acme:
|
||||
server: https://acme-v02.api.letsencrypt.org/directory
|
||||
email: admin@aitbc.io
|
||||
privateKeySecretRef:
|
||||
name: letsencrypt-prod
|
||||
solvers:
|
||||
- http01:
|
||||
ingress:
|
||||
class: nginx
|
||||
---
|
||||
# Let's Encrypt Staging ClusterIssuer (for testing)
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata:
|
||||
name: letsencrypt-staging
|
||||
spec:
|
||||
acme:
|
||||
server: https://acme-staging-v02.api.letsencrypt.org/directory
|
||||
email: admin@aitbc.io
|
||||
privateKeySecretRef:
|
||||
name: letsencrypt-staging
|
||||
solvers:
|
||||
- http01:
|
||||
ingress:
|
||||
class: nginx
|
||||
---
|
||||
# Self-Signed Issuer for Development
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Issuer
|
||||
metadata:
|
||||
name: selfsigned-issuer
|
||||
namespace: default
|
||||
spec:
|
||||
selfSigned: {}
|
||||
---
|
||||
# Development Certificate
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Certificate
|
||||
metadata:
|
||||
name: coordinator-dev-tls
|
||||
namespace: default
|
||||
spec:
|
||||
secretName: coordinator-dev-tls
|
||||
dnsNames:
|
||||
- coordinator.local
|
||||
- coordinator.127.0.0.2.nip.io
|
||||
issuerRef:
|
||||
name: selfsigned-issuer
|
||||
kind: Issuer
|
||||
---
|
||||
# Production Certificate Template
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Certificate
|
||||
metadata:
|
||||
name: coordinator-prod-tls
|
||||
namespace: default
|
||||
spec:
|
||||
secretName: coordinator-prod-tls
|
||||
dnsNames:
|
||||
- api.aitbc.io
|
||||
- www.api.aitbc.io
|
||||
issuerRef:
|
||||
name: letsencrypt-prod
|
||||
kind: ClusterIssuer
|
||||
56
infra/k8s/default-deny-netpol.yaml
Normal file
@ -0,0 +1,56 @@
|
||||
# Default Deny All Network Policy
|
||||
# This policy denies all ingress and egress traffic by default
|
||||
# Individual services must have their own network policies to allow traffic
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: default-deny-all-ingress
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Ingress
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: default-deny-all-egress
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
---
|
||||
# Allow DNS resolution for all pods
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: allow-dns
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
egress:
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: UDP
|
||||
port: 53
|
||||
- protocol: TCP
|
||||
port: 53
|
||||
---
|
||||
# Allow traffic to Kubernetes API
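# Note: `to: []` with port 443 permits egress to any destination on TCP 443,
# which covers the API server but is broader than API-only access; restrict it
# with an ipBlock for the API server CIDR if tighter scoping is needed.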
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: allow-k8s-api
|
||||
namespace: default
|
||||
spec:
|
||||
podSelector: {}
|
||||
policyTypes:
|
||||
- Egress
|
||||
egress:
|
||||
- to: []
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 443
|
||||
81
infra/k8s/sealed-secrets.yaml
Normal file
@ -0,0 +1,81 @@
|
||||
# SealedSecrets Controller Installation
|
||||
apiVersion: argoproj.io/v1alpha1
|
||||
kind: Application
|
||||
metadata:
|
||||
name: sealed-secrets
|
||||
namespace: argocd
|
||||
finalizers:
|
||||
- resources-finalizer.argocd.argoproj.io
|
||||
spec:
|
||||
project: default
|
||||
source:
|
||||
repoURL: https://bitnami-labs.github.io/sealed-secrets
|
||||
chart: sealed-secrets
|
||||
targetRevision: 2.15.0
|
||||
helm:
|
||||
releaseName: sealed-secrets
|
||||
parameters:
|
||||
- name: namespace
|
||||
value: kube-system
|
||||
destination:
|
||||
server: https://kubernetes.default.svc
|
||||
namespace: kube-system
|
||||
syncPolicy:
|
||||
automated:
|
||||
prune: true
|
||||
selfHeal: true
|
||||
---
|
||||
# Example SealedSecret for Coordinator API Keys
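# The encryptedData values below are truncated placeholders; real values are
# generated with kubeseal against the cluster's sealed-secrets controller, e.g.
#   kubeseal --format yaml < coordinator-api-keys-secret.yaml > sealed.yaml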
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-api-keys
|
||||
namespace: default
|
||||
annotations:
|
||||
sealedsecrets.bitnami.com/cluster-wide: "true"
|
||||
spec:
|
||||
encryptedData:
|
||||
# Production API key (encrypted)
|
||||
api-key-prod: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
# Staging API key (encrypted)
|
||||
api-key-staging: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
# Development API key (encrypted)
|
||||
api-key-dev: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-api-keys
|
||||
namespace: default
|
||||
type: Opaque
|
||||
---
|
||||
# Example SealedSecret for Database Credentials
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-db-credentials
|
||||
namespace: default
|
||||
spec:
|
||||
encryptedData:
|
||||
username: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
database: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-db-credentials
|
||||
namespace: default
|
||||
type: Opaque
|
||||
---
|
||||
# Example SealedSecret for JWT Signing Keys (if needed in future)
|
||||
apiVersion: bitnami.com/v1alpha1
|
||||
kind: SealedSecret
|
||||
metadata:
|
||||
name: coordinator-jwt-keys
|
||||
namespace: default
|
||||
spec:
|
||||
encryptedData:
|
||||
private-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
public-key: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEQAx...
|
||||
template:
|
||||
metadata:
|
||||
name: coordinator-jwt-keys
|
||||
namespace: default
|
||||
type: Opaque
|
||||
330
infra/scripts/README_chaos.md
Normal file
@ -0,0 +1,330 @@
|
||||
# AITBC Chaos Testing Framework
|
||||
|
||||
This framework implements chaos engineering tests to validate the resilience and recovery capabilities of the AITBC platform.
|
||||
|
||||
## Overview
|
||||
|
||||
The chaos testing framework simulates real-world failure scenarios to:
|
||||
- Test system resilience under adverse conditions
|
||||
- Measure Mean-Time-To-Recovery (MTTR) metrics
|
||||
- Identify single points of failure
|
||||
- Validate recovery procedures
|
||||
- Ensure SLO compliance
|
||||
|
||||
## Components
|
||||
|
||||
### Test Scripts
|
||||
|
||||
1. **`chaos_test_coordinator.py`** - Coordinator API outage simulation
|
||||
- Deletes coordinator pods to simulate complete service outage
|
||||
- Measures recovery time and service availability
|
||||
- Tests load handling during and after recovery
|
||||
|
||||
2. **`chaos_test_network.py`** - Network partition simulation
|
||||
- Creates network partitions between blockchain nodes
|
||||
- Tests consensus resilience during partition
|
||||
- Measures network recovery time
|
||||
|
||||
3. **`chaos_test_database.py`** - Database failure simulation
|
||||
- Simulates PostgreSQL connection failures
|
||||
- Tests high latency scenarios
|
||||
- Validates application error handling
|
||||
|
||||
4. **`chaos_orchestrator.py`** - Test orchestration and reporting
|
||||
- Runs multiple chaos test scenarios
|
||||
- Aggregates MTTR metrics across tests
|
||||
- Generates comprehensive reports
|
||||
- Supports continuous chaos testing
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.8+
|
||||
- kubectl configured with cluster access
|
||||
- Helm charts deployed in target namespace
|
||||
- Administrative privileges for network manipulation
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone <repository-url>
|
||||
cd aitbc/infra/scripts
|
||||
|
||||
# Install dependencies
|
||||
pip install aiohttp
|
||||
|
||||
# Make scripts executable
|
||||
chmod +x chaos_*.py
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Running Individual Tests
|
||||
|
||||
#### Coordinator Outage Test
|
||||
```bash
|
||||
# Basic test
|
||||
python3 chaos_test_coordinator.py --namespace default
|
||||
|
||||
# Custom outage duration
|
||||
python3 chaos_test_coordinator.py --namespace default --outage-duration 120
|
||||
|
||||
# Dry run (no actual chaos)
|
||||
python3 chaos_test_coordinator.py --dry-run
|
||||
```
|
||||
|
||||
#### Network Partition Test
|
||||
```bash
|
||||
# Partition 50% of nodes for 60 seconds
|
||||
python3 chaos_test_network.py --namespace default
|
||||
|
||||
# Partition 30% of nodes for 90 seconds
|
||||
python3 chaos_test_network.py --namespace default --partition-duration 90 --partition-ratio 0.3
|
||||
```
|
||||
|
||||
#### Database Failure Test
|
||||
```bash
|
||||
# Simulate connection failure
|
||||
python3 chaos_test_database.py --namespace default --failure-type connection
|
||||
|
||||
# Simulate high latency (5000ms)
|
||||
python3 chaos_test_database.py --namespace default --failure-type latency
|
||||
```
|
||||
|
||||
### Running All Tests
|
||||
|
||||
```bash
|
||||
# Run all scenarios with default parameters
|
||||
python3 chaos_orchestrator.py --namespace default
|
||||
|
||||
# Run specific scenarios
|
||||
python3 chaos_orchestrator.py --namespace default --scenarios coordinator network
|
||||
|
||||
# Continuous chaos testing (24 hours, every 60 minutes)
|
||||
python3 chaos_orchestrator.py --namespace default --continuous --duration 24 --interval 60
|
||||
```
|
||||
|
||||
## Test Scenarios
|
||||
|
||||
### 1. Coordinator API Outage
|
||||
|
||||
**Objective**: Test system resilience when the coordinator service becomes unavailable.
|
||||
|
||||
**Steps**:
|
||||
1. Generate baseline load on coordinator API
|
||||
2. Delete all coordinator pods
|
||||
3. Wait for specified outage duration
|
||||
4. Monitor service recovery
|
||||
5. Generate post-recovery load
|
||||
|
||||
**Metrics Collected**:
|
||||
- MTTR (Mean-Time-To-Recovery)
|
||||
- Success/error request counts
|
||||
- Recovery time distribution
|
||||
|
||||
### 2. Network Partition
|
||||
|
||||
**Objective**: Test blockchain consensus during network partitions.
|
||||
|
||||
**Steps**:
|
||||
1. Identify blockchain node pods
|
||||
2. Apply iptables rules to partition nodes (a sketch of the approach follows the metrics list below)
|
||||
3. Monitor consensus during partition
|
||||
4. Remove network partition
|
||||
5. Verify network recovery
|
||||
|
||||
**Metrics Collected**:
|
||||
- Network recovery time
|
||||
- Consensus health during partition
|
||||
- Node connectivity status
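
The exact rules are internal to `chaos_test_network.py`; the sketch below only illustrates the general approach implied by the steps above (drop traffic toward a peer from inside a node pod, then remove the rule to heal). The pod name and peer IP are placeholders, not values the script uses.

```bash
# Induce a partition: drop outbound traffic from one node to a peer
kubectl exec -n default blockchain-node-0 -- \
  iptables -A OUTPUT -d <peer-pod-ip> -j DROP

# Heal the partition: delete the same rule
kubectl exec -n default blockchain-node-0 -- \
  iptables -D OUTPUT -d <peer-pod-ip> -j DROP
```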
|
||||
|
||||
### 3. Database Failure
|
||||
|
||||
**Objective**: Test application behavior when the database is unavailable.
|
||||
|
||||
**Steps**:
|
||||
1. Simulate database connection failure or high latency
|
||||
2. Monitor API behavior during failure
|
||||
3. Restore database connectivity
|
||||
4. Verify application recovery
|
||||
|
||||
**Metrics Collected**:
|
||||
- Database recovery time
|
||||
- API error rates during failure
|
||||
- Application resilience metrics
|
||||
|
||||
## Results and Reporting
|
||||
|
||||
### Test Results Format
|
||||
|
||||
Each test generates a JSON results file with the following structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"test_start": "2024-12-22T10:00:00.000Z",
|
||||
"test_end": "2024-12-22T10:05:00.000Z",
|
||||
"scenario": "coordinator_outage",
|
||||
"mttr": 45.2,
|
||||
"error_count": 156,
|
||||
"success_count": 844,
|
||||
"recovery_time": 45.2
|
||||
}
|
||||
```
|
||||
|
||||
### Orchestrator Report
|
||||
|
||||
The orchestrator generates a comprehensive report including:
|
||||
|
||||
- Summary metrics across all scenarios
|
||||
- SLO compliance analysis
|
||||
- Recommendations for improvements
|
||||
- MTTR trends and statistics
|
||||
|
||||
Example report snippet:
|
||||
```json
|
||||
{
|
||||
"summary": {
|
||||
"total_scenarios": 3,
|
||||
"successful_scenarios": 3,
|
||||
"average_mttr": 67.8,
|
||||
"max_mttr": 120.5,
|
||||
"min_mttr": 45.2
|
||||
},
|
||||
"recommendations": [
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation.",
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## SLO Targets
|
||||
|
||||
| Metric | Target | Current |
|
||||
|--------|--------|---------|
|
||||
| MTTR (Average) | ≤ 120 seconds | TBD |
|
||||
| MTTR (Maximum) | ≤ 300 seconds | TBD |
|
||||
| Success Rate | ≥ 99.9% | TBD |
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Before Running Tests
|
||||
|
||||
1. **Backup Critical Data**: Ensure recent backups are available
|
||||
2. **Notify Team**: Inform stakeholders about chaos testing
|
||||
3. **Check Cluster Health**: Verify all components are healthy
|
||||
4. **Schedule Appropriately**: Run during low-traffic periods
|
||||
|
||||
### During Tests
|
||||
|
||||
1. **Monitor Logs**: Watch for unexpected errors
|
||||
2. **Have Rollback Plan**: Be ready to manually intervene
|
||||
3. **Document Observations**: Note any unusual behavior
|
||||
4. **Stop if Critical**: Abort tests if production is impacted
|
||||
|
||||
### After Tests
|
||||
|
||||
1. **Review Results**: Analyze MTTR and error rates
|
||||
2. **Update Documentation**: Record findings and improvements
|
||||
3. **Address Issues**: Fix any discovered problems
|
||||
4. **Schedule Follow-up**: Plan regular chaos testing
|
||||
|
||||
## Integration with CI/CD
|
||||
|
||||
### GitHub Actions Example
|
||||
|
||||
```yaml
|
||||
name: Chaos Testing
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 2 * * 0' # Weekly at 2 AM Sunday
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
chaos-test:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
- name: Setup Python
|
||||
uses: actions/setup-python@v2
|
||||
with:
|
||||
python-version: '3.9'
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
pip install aiohttp
|
||||
- name: Run chaos tests
|
||||
run: |
|
||||
cd infra/scripts
|
||||
python3 chaos_orchestrator.py --namespace staging
|
||||
- name: Upload results
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: chaos-results
|
||||
path: "*.json"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **kubectl not found**
|
||||
```bash
|
||||
# Ensure kubectl is installed and configured
|
||||
which kubectl
|
||||
kubectl version
|
||||
```
|
||||
|
||||
2. **Permission denied errors**
|
||||
```bash
|
||||
# Check RBAC permissions
|
||||
kubectl auth can-i create pods --namespace default
|
||||
kubectl auth can-i exec pods --namespace default
|
||||
```
|
||||
|
||||
3. **Network rules not applying**
|
||||
```bash
|
||||
# Check if iptables is available in pods
|
||||
kubectl exec -it <pod> -- iptables -L
|
||||
```
|
||||
|
||||
4. **Tests hanging**
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods --namespace default
|
||||
kubectl describe pod <pod-name> --namespace default
|
||||
```
|
||||
|
||||
### Debug Mode
|
||||
|
||||
Enable debug logging:
|
||||
```bash
|
||||
export PYTHONPATH=.
|
||||
python3 -u chaos_test_coordinator.py --namespace default 2>&1 | tee debug.log
|
||||
```
|
||||
|
||||
## Contributing
|
||||
|
||||
To add new chaos test scenarios:
|
||||
|
||||
1. Create a new script following the naming pattern `chaos_test_<scenario>.py`
|
||||
2. Implement the required methods: `run_test()`, `save_results()` (see the skeleton after this list)
|
||||
3. Add the scenario to `chaos_orchestrator.py`
|
||||
4. Update documentation
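
A minimal skeleton for a new scenario is sketched below. It assumes the async structure used by the existing tests; the class name and metric keys (beyond `mttr`) are illustrative, and the only hard requirement from the orchestrator is that results land in a file matching `chaos_test_<scenario>_*.json`.

```python
#!/usr/bin/env python3
"""Chaos Testing Script - <scenario> (skeleton)"""

import argparse
import asyncio
import json
from datetime import datetime


class ChaosTestExample:
    """Skeleton chaos test; fill in inject/observe/recover logic."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.metrics = {"scenario": "example", "mttr": None,
                        "test_start": None, "test_end": None}

    async def run_test(self):
        self.metrics["test_start"] = datetime.utcnow().isoformat()
        # 1. inject the failure, 2. observe impact, 3. record recovery time
        #    into self.metrics["mttr"]
        self.metrics["test_end"] = datetime.utcnow().isoformat()

    def save_results(self):
        # File name must match the orchestrator's chaos_test_<scenario>_*.json glob
        path = f"chaos_test_example_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(path, "w") as f:
            json.dump(self.metrics, f, indent=2)


async def main():
    parser = argparse.ArgumentParser(description="Example chaos scenario")
    parser.add_argument("--namespace", default="default")
    args = parser.parse_args()

    test = ChaosTestExample(args.namespace)
    await test.run_test()
    test.save_results()


if __name__ == "__main__":
    asyncio.run(main())
```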
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Chaos tests require elevated privileges
|
||||
- Only run in authorized environments
|
||||
- Ensure test isolation from production data
|
||||
- Review network rules before deployment
|
||||
- Monitor for security violations during tests
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
- Check the troubleshooting section
|
||||
- Review test logs for error details
|
||||
- Contact the DevOps team at devops@aitbc.io
|
||||
|
||||
## License
|
||||
|
||||
This chaos testing framework is part of the AITBC project and follows the same license terms.
|
||||
233
infra/scripts/backup_ledger.sh
Executable file
@ -0,0 +1,233 @@
|
||||
#!/bin/bash
|
||||
# Ledger Storage Backup Script for AITBC
|
||||
# Usage: ./backup_ledger.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-ledger-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/ledger-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed or not in PATH"
exit 1
fi

if ! command -v jq &> /dev/null; then
error "jq is not installed or not in PATH"
exit 1
fi
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Wait for blockchain node to be ready
|
||||
wait_for_blockchain_node() {
|
||||
local pod=$1
|
||||
log "Waiting for blockchain node pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if node is responding
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
log "Blockchain node is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Blockchain node did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Backup ledger data
|
||||
backup_ledger_data() {
|
||||
local pod=$1
|
||||
local ledger_backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
mkdir -p "$ledger_backup_dir"
|
||||
|
||||
log "Starting ledger backup from pod $pod"
|
||||
|
||||
# Get the latest block height before backup
|
||||
local latest_block=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
log "Latest block height: $latest_block"
|
||||
|
||||
# Backup blockchain data directory
|
||||
local blockchain_data_dir="/app/data/chain"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$blockchain_data_dir"; then
|
||||
log "Backing up blockchain data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-chain.tar.gz" -C "$blockchain_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-chain.tar.gz" "$ledger_backup_dir/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-chain.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup wallet data
|
||||
local wallet_data_dir="/app/data/wallets"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$wallet_data_dir"; then
|
||||
log "Backing up wallet data directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-wallets.tar.gz" -C "$wallet_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-wallets.tar.gz" "$ledger_backup_dir/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-wallets.tar.gz"
|
||||
fi
|
||||
|
||||
# Backup receipts
|
||||
local receipts_data_dir="/app/data/receipts"
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "$receipts_data_dir"; then
|
||||
log "Backing up receipts directory..."
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${BACKUP_NAME}-receipts.tar.gz" -C "$receipts_data_dir" .
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}-receipts.tar.gz" "$ledger_backup_dir/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}-receipts.tar.gz"
|
||||
fi
|
||||
|
||||
# Create metadata file
|
||||
cat > "$ledger_backup_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$BACKUP_NAME",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $latest_block,
|
||||
"backup_type": "full"
|
||||
}
|
||||
EOF
|
||||
|
||||
log "Ledger backup completed: $ledger_backup_dir"
|
||||
|
||||
# Verify backup
|
||||
local total_size=$(du -sh "$ledger_backup_dir" | cut -f1)
|
||||
log "Total backup size: $total_size"
|
||||
}
|
||||
|
||||
# Create incremental backup
|
||||
create_incremental_backup() {
|
||||
local pod=$1
|
||||
local last_backup_file="$BACKUP_DIR/.last_backup_height"
|
||||
|
||||
# Get last backup height
|
||||
local last_backup_height=0
|
||||
if [[ -f "$last_backup_file" ]]; then
|
||||
last_backup_height=$(cat "$last_backup_file")
|
||||
fi
|
||||
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ $current_height -le $last_backup_height ]]; then
|
||||
log "No new blocks since last backup (height: $current_height)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "Creating incremental backup from block $((last_backup_height + 1)) to $current_height"
|
||||
|
||||
# Export blocks since last backup
|
||||
local incremental_file="$BACKUP_DIR/${BACKUP_NAME}-incremental.json"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- curl -s "http://localhost:8080/v1/blocks?from=$((last_backup_height + 1))&to=$current_height" > "$incremental_file"
|
||||
|
||||
# Update last backup height
|
||||
echo "$current_height" > "$last_backup_file"
|
||||
|
||||
log "Incremental backup created: $incremental_file"
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type d -name "ledger-backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
find "$BACKUP_DIR" -name "*-incremental.json" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_dir="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
|
||||
# Upload entire backup directory
|
||||
aws s3 cp "$backup_dir" "s3://$s3_bucket/ledger/$(basename "$backup_dir")/" --recursive --storage-class GLACIER_IR
|
||||
|
||||
log "Backup uploaded to s3://$s3_bucket/ledger/$(basename "$backup_dir")/"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
local incremental=${3:-false}
|
||||
|
||||
log "Starting ledger backup process (incremental=$incremental)"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
|
||||
# Use the first ready pod for backup
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
wait_for_blockchain_node "$pod"
|
||||
|
||||
if [[ "$incremental" == "true" ]]; then
|
||||
create_incremental_backup "$pod"
|
||||
else
|
||||
backup_ledger_data "$pod"
|
||||
fi
|
||||
|
||||
local backup_dir="$BACKUP_DIR/${BACKUP_NAME}"
|
||||
upload_to_cloud "$backup_dir"
|
||||
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
cleanup_old_backups
|
||||
|
||||
log "Ledger backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
172
infra/scripts/backup_postgresql.sh
Executable file
@ -0,0 +1,172 @@
|
||||
#!/bin/bash
|
||||
# PostgreSQL Backup Script for AITBC
|
||||
# Usage: ./backup_postgresql.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-postgresql-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_dump &> /dev/null; then
|
||||
error "pg_dump is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql"
|
||||
|
||||
log "Starting PostgreSQL backup to $backup_file"
|
||||
|
||||
# Get database credentials from secret
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Perform the backup (PGPASSWORD must be set inside the pod, not on the local kubectl process)
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" \
    pg_dump -U "$db_user" -h localhost -d "$db_name" \
    --verbose --clean --if-exists --create --format=custom \
    --file="/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Copy backup from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${BACKUP_NAME}.dump" "$backup_file"
|
||||
|
||||
# Clean up remote backup file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${BACKUP_NAME}.dump"
|
||||
|
||||
# Compress backup
|
||||
gzip "$backup_file"
|
||||
backup_file="${backup_file}.gz"
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="postgresql/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.sql.gz"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "PostgreSQL backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
189
infra/scripts/backup_redis.sh
Executable file
@ -0,0 +1,189 @@
|
||||
#!/bin/bash
|
||||
# Redis Backup Script for AITBC
|
||||
# Usage: ./backup_redis.sh [namespace] [backup_name]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_NAME=${2:-redis-backup-$(date +%Y%m%d_%H%M%S)}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Create backup directory
|
||||
create_backup_dir() {
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
log "Created backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for Redis to be ready
|
||||
wait_for_redis() {
|
||||
local pod=$1
|
||||
log "Waiting for Redis pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if Redis is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
log "Redis is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "Redis did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Perform backup
|
||||
perform_backup() {
|
||||
local pod=$1
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
|
||||
log "Starting Redis backup to $backup_file"
|
||||
|
||||
# Record the last successful save time, then trigger a background save
local previous_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

# Wait for background save to complete (LASTSAVE advances once BGSAVE finishes)
log "Waiting for background save to complete..."
local retries=60
while [[ $retries -gt 0 ]]; do
local current_lastsave=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)

if [[ "$current_lastsave" -gt "$previous_lastsave" ]]; then
log "Background save completed"
break
fi
sleep 2
((retries--))
done
|
||||
|
||||
if [[ $retries -eq 0 ]]; then
|
||||
error "Background save did not complete within timeout"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Copy RDB file from pod
|
||||
kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$backup_file"
|
||||
|
||||
# Also create an append-only file backup if enabled
|
||||
local aof_enabled=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli CONFIG GET appendonly | tail -1)
|
||||
if [[ "$aof_enabled" == "yes" ]]; then
|
||||
local aof_backup="$BACKUP_DIR/${BACKUP_NAME}.aof"
|
||||
kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$aof_backup"
|
||||
log "AOF backup created: $aof_backup"
|
||||
fi
|
||||
|
||||
log "Backup completed: $backup_file"
|
||||
|
||||
# Verify backup
|
||||
if [[ -f "$backup_file" ]] && [[ -s "$backup_file" ]]; then
|
||||
local size=$(du -h "$backup_file" | cut -f1)
|
||||
log "Backup size: $size"
|
||||
else
|
||||
error "Backup file is empty or missing"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Clean old backups
|
||||
cleanup_old_backups() {
|
||||
log "Cleaning up backups older than $RETENTION_DAYS days"
|
||||
find "$BACKUP_DIR" -name "*.rdb" -type f -mtime +$RETENTION_DAYS -delete
|
||||
find "$BACKUP_DIR" -name "*.aof" -type f -mtime +$RETENTION_DAYS -delete
|
||||
log "Cleanup completed"
|
||||
}
|
||||
|
||||
# Upload to cloud storage (optional)
|
||||
upload_to_cloud() {
|
||||
local backup_file="$1"
|
||||
|
||||
# Check if AWS CLI is configured
|
||||
if command -v aws &> /dev/null && aws sts get-caller-identity &>/dev/null; then
|
||||
log "Uploading backup to S3"
|
||||
local s3_bucket="aitbc-backups-${NAMESPACE}"
|
||||
local s3_key="redis/$(basename "$backup_file")"
|
||||
|
||||
aws s3 cp "$backup_file" "s3://$s3_bucket/$s3_key" --storage-class GLACIER_IR
|
||||
log "Backup uploaded to s3://$s3_bucket/$s3_key"
|
||||
|
||||
# Upload AOF file if exists
|
||||
local aof_file="${backup_file%.rdb}.aof"
|
||||
if [[ -f "$aof_file" ]]; then
|
||||
local aof_key="redis/$(basename "$aof_file")"
|
||||
aws s3 cp "$aof_file" "s3://$s3_bucket/$aof_key" --storage-class GLACIER_IR
|
||||
log "AOF backup uploaded to s3://$s3_bucket/$aof_key"
|
||||
fi
|
||||
else
|
||||
warn "AWS CLI not configured, skipping cloud upload"
|
||||
fi
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting Redis backup process"
|
||||
|
||||
check_dependencies
|
||||
create_backup_dir
|
||||
|
||||
local pod=$(get_redis_pod)
|
||||
wait_for_redis "$pod"
|
||||
|
||||
perform_backup "$pod"
|
||||
cleanup_old_backups
|
||||
|
||||
local backup_file="$BACKUP_DIR/${BACKUP_NAME}.rdb"
|
||||
upload_to_cloud "$backup_file"
|
||||
|
||||
log "Redis backup process completed successfully"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
|
||||
342
infra/scripts/chaos_orchestrator.py
Executable file
@ -0,0 +1,342 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Orchestrator
|
||||
Runs multiple chaos test scenarios and aggregates MTTR metrics
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosOrchestrator:
|
||||
"""Orchestrates multiple chaos test scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.results = {
|
||||
"orchestration_start": None,
|
||||
"orchestration_end": None,
|
||||
"scenarios": [],
|
||||
"summary": {
|
||||
"total_scenarios": 0,
|
||||
"successful_scenarios": 0,
|
||||
"failed_scenarios": 0,
|
||||
"average_mttr": 0,
|
||||
"max_mttr": 0,
|
||||
"min_mttr": float('inf')
|
||||
}
|
||||
}
|
||||
|
||||
async def run_scenario(self, script: str, args: List[str]) -> Optional[Dict]:
|
||||
"""Run a single chaos test scenario"""
|
||||
scenario_name = Path(script).stem.replace("chaos_test_", "")
|
||||
logger.info(f"Running scenario: {scenario_name}")
|
||||
|
||||
cmd = ["python3", script] + args
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Run the chaos test script
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
*cmd,
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE
|
||||
)
|
||||
|
||||
stdout, stderr = await process.communicate()
|
||||
|
||||
if process.returncode != 0:
|
||||
logger.error(f"Scenario {scenario_name} failed with exit code {process.returncode}")
|
||||
logger.error(f"Error: {stderr.decode()}")
|
||||
return None
|
||||
|
||||
# Find the results file
|
||||
result_files = list(Path(".").glob(f"chaos_test_{scenario_name}_*.json"))
|
||||
if not result_files:
|
||||
logger.error(f"No results file found for scenario {scenario_name}")
|
||||
return None
|
||||
|
||||
# Load the most recent result file
|
||||
result_file = max(result_files, key=lambda p: p.stat().st_mtime)
|
||||
with open(result_file, 'r') as f:
|
||||
results = json.load(f)
|
||||
|
||||
# Add execution metadata
|
||||
results["execution_time"] = time.time() - start_time
|
||||
results["scenario_name"] = scenario_name
|
||||
|
||||
logger.info(f"Scenario {scenario_name} completed successfully")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to run scenario {scenario_name}: {e}")
|
||||
return None
|
||||
|
||||
def calculate_summary_metrics(self):
|
||||
"""Calculate summary metrics across all scenarios"""
|
||||
mttr_values = []
|
||||
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("mttr"):
|
||||
mttr_values.append(scenario["mttr"])
|
||||
|
||||
if mttr_values:
|
||||
self.results["summary"]["average_mttr"] = sum(mttr_values) / len(mttr_values)
|
||||
self.results["summary"]["max_mttr"] = max(mttr_values)
|
||||
self.results["summary"]["min_mttr"] = min(mttr_values)
|
||||
|
||||
self.results["summary"]["total_scenarios"] = len(self.results["scenarios"])
|
||||
self.results["summary"]["successful_scenarios"] = sum(
|
||||
1 for s in self.results["scenarios"] if s.get("mttr") is not None
|
||||
)
|
||||
self.results["summary"]["failed_scenarios"] = (
|
||||
self.results["summary"]["total_scenarios"] -
|
||||
self.results["summary"]["successful_scenarios"]
|
||||
)
|
||||
|
||||
def generate_report(self, output_file: Optional[str] = None):
|
||||
"""Generate a comprehensive chaos test report"""
|
||||
report = {
|
||||
"report_generated": datetime.utcnow().isoformat(),
|
||||
"namespace": self.namespace,
|
||||
"orchestration": self.results,
|
||||
"recommendations": []
|
||||
}
|
||||
|
||||
# Add recommendations based on results
|
||||
if self.results["summary"]["average_mttr"] > 120:
|
||||
report["recommendations"].append(
|
||||
"Average MTTR exceeds 2 minutes. Consider improving recovery automation."
|
||||
)
|
||||
|
||||
if self.results["summary"]["max_mttr"] > 300:
|
||||
report["recommendations"].append(
|
||||
"Maximum MTTR exceeds 5 minutes. Review slowest recovery scenario."
|
||||
)
|
||||
|
||||
if self.results["summary"]["failed_scenarios"] > 0:
|
||||
report["recommendations"].append(
|
||||
f"{self.results['summary']['failed_scenarios']} scenario(s) failed. Review test configuration."
|
||||
)
|
||||
|
||||
# Check for specific scenario issues
|
||||
for scenario in self.results["scenarios"]:
|
||||
if scenario.get("scenario_name") == "coordinator_outage":
|
||||
if scenario.get("mttr", 0) > 180:
|
||||
report["recommendations"].append(
|
||||
"Coordinator recovery is slow. Consider reducing pod startup time."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "network_partition":
|
||||
if scenario.get("error_count", 0) > scenario.get("success_count", 0):
|
||||
report["recommendations"].append(
|
||||
"High error rate during network partition. Improve error handling."
|
||||
)
|
||||
|
||||
elif scenario.get("scenario_name") == "database_failure":
|
||||
if scenario.get("failure_type") == "connection":
|
||||
report["recommendations"].append(
|
||||
"Consider implementing database connection pooling and retry logic."
|
||||
)
|
||||
|
||||
# Save report
|
||||
if output_file:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(report, f, indent=2)
|
||||
logger.info(f"Chaos test report saved to: {output_file}")
|
||||
|
||||
# Print summary
|
||||
self.print_summary()
|
||||
|
||||
return report
|
||||
|
||||
def print_summary(self):
|
||||
"""Print a summary of all chaos test results"""
|
||||
print("\n" + "="*60)
|
||||
print("CHAOS TESTING SUMMARY REPORT")
|
||||
print("="*60)
|
||||
|
||||
print(f"\nTest Execution: {self.results['orchestration_start']} to {self.results['orchestration_end']}")
|
||||
print(f"Namespace: {self.namespace}")
|
||||
|
||||
print(f"\nScenario Results:")
|
||||
print("-" * 40)
|
||||
for scenario in self.results["scenarios"]:
|
||||
name = scenario.get("scenario_name", "Unknown")
|
||||
mttr = scenario.get("mttr", "N/A")
|
||||
if mttr != "N/A":
|
||||
mttr = f"{mttr:.2f}s"
|
||||
print(f" {name:20} MTTR: {mttr}")
|
||||
|
||||
print(f"\nSummary Metrics:")
|
||||
print("-" * 40)
|
||||
print(f" Total Scenarios: {self.results['summary']['total_scenarios']}")
|
||||
print(f" Successful: {self.results['summary']['successful_scenarios']}")
|
||||
print(f" Failed: {self.results['summary']['failed_scenarios']}")
|
||||
|
||||
if self.results["summary"]["average_mttr"] > 0:
|
||||
print(f" Average MTTR: {self.results['summary']['average_mttr']:.2f}s")
|
||||
print(f" Maximum MTTR: {self.results['summary']['max_mttr']:.2f}s")
|
||||
print(f" Minimum MTTR: {self.results['summary']['min_mttr']:.2f}s")
|
||||
|
||||
# SLO compliance
|
||||
print(f"\nSLO Compliance:")
|
||||
print("-" * 40)
|
||||
slo_target = 120 # 2 minutes
|
||||
if self.results["summary"]["average_mttr"] <= slo_target:
|
||||
print(f" ✓ Average MTTR within SLO ({slo_target}s)")
|
||||
else:
|
||||
print(f" ✗ Average MTTR exceeds SLO ({slo_target}s)")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
async def run_all_scenarios(self, scenarios: List[str], scenario_args: Dict[str, List[str]]):
|
||||
"""Run all specified chaos test scenarios"""
|
||||
logger.info("Starting chaos testing orchestration")
|
||||
self.results["orchestration_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
for scenario in scenarios:
|
||||
args = scenario_args.get(scenario, [])
|
||||
# Add namespace to all scenarios
|
||||
args.extend(["--namespace", self.namespace])
|
||||
|
||||
result = await self.run_scenario(scenario, args)
|
||||
if result:
|
||||
self.results["scenarios"].append(result)
|
||||
|
||||
self.results["orchestration_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Calculate summary metrics
|
||||
self.calculate_summary_metrics()
|
||||
|
||||
# Generate report
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"chaos_test_report_{timestamp}.json"
|
||||
self.generate_report(report_file)
|
||||
|
||||
logger.info("Chaos testing orchestration completed")
|
||||
|
||||
async def run_continuous_chaos(self, duration_hours: int = 24, interval_minutes: int = 60):
|
||||
"""Run chaos tests continuously over time"""
|
||||
logger.info(f"Starting continuous chaos testing for {duration_hours} hours")
|
||||
|
||||
end_time = datetime.now() + timedelta(hours=duration_hours)
|
||||
interval_seconds = interval_minutes * 60
|
||||
|
||||
all_results = []
|
||||
|
||||
while datetime.now() < end_time:
|
||||
cycle_start = datetime.now()
|
||||
logger.info(f"Starting chaos test cycle at {cycle_start}")
|
||||
|
||||
# Run a random scenario
|
||||
scenarios = [
|
||||
"chaos_test_coordinator.py",
|
||||
"chaos_test_network.py",
|
||||
"chaos_test_database.py"
|
||||
]
|
||||
|
||||
import random
|
||||
selected_scenario = random.choice(scenarios)
|
||||
|
||||
# Run scenario with reduced duration for continuous testing
|
||||
args = ["--namespace", self.namespace]
|
||||
if "coordinator" in selected_scenario:
|
||||
args.extend(["--outage-duration", "30", "--load-duration", "60"])
|
||||
elif "network" in selected_scenario:
|
||||
args.extend(["--partition-duration", "30", "--partition-ratio", "0.3"])
|
||||
elif "database" in selected_scenario:
|
||||
args.extend(["--failure-duration", "30", "--failure-type", "connection"])
|
||||
|
||||
result = await self.run_scenario(selected_scenario, args)
|
||||
if result:
|
||||
result["cycle_time"] = cycle_start.isoformat()
|
||||
all_results.append(result)
|
||||
|
||||
# Wait for next cycle
|
||||
elapsed = (datetime.now() - cycle_start).total_seconds()
|
||||
if elapsed < interval_seconds:
|
||||
wait_time = interval_seconds - elapsed
|
||||
logger.info(f"Waiting {wait_time:.0f}s for next cycle")
|
||||
await asyncio.sleep(wait_time)
|
||||
|
||||
# Generate continuous testing report
|
||||
continuous_report = {
|
||||
"continuous_testing": True,
|
||||
"duration_hours": duration_hours,
|
||||
"interval_minutes": interval_minutes,
|
||||
"total_cycles": len(all_results),
|
||||
"cycles": all_results
|
||||
}
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
report_file = f"continuous_chaos_report_{timestamp}.json"
|
||||
with open(report_file, 'w') as f:
|
||||
json.dump(continuous_report, f, indent=2)
|
||||
|
||||
logger.info(f"Continuous chaos testing completed. Report saved to: {report_file}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos testing orchestrator")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--scenarios", nargs="+",
|
||||
choices=["coordinator", "network", "database"],
|
||||
default=["coordinator", "network", "database"],
|
||||
help="Scenarios to run")
|
||||
parser.add_argument("--continuous", action="store_true", help="Run continuous chaos testing")
|
||||
parser.add_argument("--duration", type=int, default=24, help="Duration in hours for continuous testing")
|
||||
parser.add_argument("--interval", type=int, default=60, help="Interval in minutes for continuous testing")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
orchestrator = ChaosOrchestrator(args.namespace)
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would run scenarios: {', '.join(args.scenarios)}")
|
||||
return
|
||||
|
||||
if args.continuous:
|
||||
await orchestrator.run_continuous_chaos(args.duration, args.interval)
|
||||
else:
|
||||
# Map scenario names to script files
|
||||
scenario_map = {
|
||||
"coordinator": "chaos_test_coordinator.py",
|
||||
"network": "chaos_test_network.py",
|
||||
"database": "chaos_test_database.py"
|
||||
}
|
||||
|
||||
# Get script files
|
||||
scripts = [scenario_map[s] for s in args.scenarios]
|
||||
|
||||
# Default arguments for each scenario
|
||||
scenario_args = {
|
||||
"chaos_test_coordinator.py": ["--outage-duration", "60", "--load-duration", "120"],
|
||||
"chaos_test_network.py": ["--partition-duration", "60", "--partition-ratio", "0.5"],
|
||||
"chaos_test_database.py": ["--failure-duration", "60", "--failure-type", "connection"]
|
||||
}
|
||||
|
||||
await orchestrator.run_all_scenarios(scripts, scenario_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
287
infra/scripts/chaos_test_coordinator.py
Executable file
@ -0,0 +1,287 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Coordinator API Outage
|
||||
Tests system resilience when coordinator API becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestCoordinator:
|
||||
"""Chaos testing for coordinator API outage scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"outage_start": None,
|
||||
"outage_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "coordinator_outage"
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def delete_coordinator_pods(self) -> bool:
|
||||
"""Delete all coordinator pods to simulate outage"""
|
||||
try:
|
||||
cmd = [
|
||||
"kubectl", "delete", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"--force", "--grace-period=0"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info("Coordinator pods deleted successfully")
|
||||
return True
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to delete coordinator pods: {e}")
|
||||
return False
|
||||
|
||||
async def wait_for_pods_termination(self, timeout: int = 60) -> bool:
|
||||
"""Wait for all coordinator pods to terminate"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
logger.info("All coordinator pods terminated")
|
||||
return True
|
||||
await asyncio.sleep(2)
|
||||
|
||||
logger.error("Timeout waiting for pods to terminate")
|
||||
return False
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for coordinator service to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
try:
|
||||
# Check if pods are running
|
||||
pods = self.get_coordinator_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Check if at least one pod is ready
|
||||
ready_cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[?(@.status.phase=='Running')].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(ready_cmd, capture_output=True, text=True)
|
||||
if result.stdout.strip():
|
||||
# Test API health
|
||||
if self.test_health_endpoint():
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Service recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Recovery check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Service did not recover within timeout")
|
||||
return False
|
||||
|
||||
def test_health_endpoint(self) -> bool:
|
||||
"""Test if coordinator health endpoint is responding"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/stats") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, outage_duration: int = 60, load_duration: int = 120):
|
||||
"""Run the complete chaos test"""
|
||||
logger.info("Starting coordinator outage chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Generate initial load
|
||||
logger.info("Phase 1: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 2: Induce outage
|
||||
logger.info("Phase 2: Inducing coordinator outage")
|
||||
self.metrics["outage_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.delete_coordinator_pods():
|
||||
logger.error("Failed to induce outage")
|
||||
return False
|
||||
|
||||
if not await self.wait_for_pods_termination():
|
||||
logger.error("Pods did not terminate")
|
||||
return False
|
||||
|
||||
# Wait for specified outage duration
|
||||
logger.info(f"Waiting for {outage_duration} seconds outage duration")
|
||||
await asyncio.sleep(outage_duration)
|
||||
|
||||
# Phase 3: Monitor recovery
|
||||
logger.info("Phase 3: Monitoring service recovery")
|
||||
self.metrics["outage_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("Service did not recover")
|
||||
return False
|
||||
|
||||
# Phase 4: Post-recovery load test
|
||||
logger.info("Phase 4: Post-recovery load test")
|
||||
await self.generate_load(load_duration)
|
||||
|
||||
# Calculate metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_coordinator_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Outage Duration: {self.metrics['outage_start']} to {self.metrics['outage_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
print(f"Error Rate: {(self.metrics['error_count'] / (self.metrics['success_count'] + self.metrics['error_count']) * 100):.2f}%")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for coordinator API outage")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--outage-duration", type=int, default=60, help="Outage duration in seconds")
|
||||
parser.add_argument("--load-duration", type=int, default=120, help="Post-recovery load test duration")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info("DRY RUN: Would test coordinator outage without actual deletion")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestCoordinator(args.namespace) as test:
|
||||
success = await test.run_test(args.outage_duration, args.load_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
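A usage sketch for this script, based on the flags defined in `main()` above; it assumes kubectl access to the target namespace and network reachability to the coordinator ClusterIP from wherever the script runs.

```bash
./chaos_test_coordinator.py --namespace default --dry-run                 # preview without deleting pods
./chaos_test_coordinator.py --namespace default --outage-duration 60 --load-duration 120
```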
387
infra/scripts/chaos_test_database.py
Executable file
@@ -0,0 +1,387 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Database Failure
|
||||
Tests system resilience when PostgreSQL database becomes unavailable
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestDatabase:
|
||||
"""Chaos testing for database failure scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"failure_start": None,
|
||||
"failure_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "database_failure",
|
||||
"failure_type": None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_postgresql_pod(self) -> Optional[str]:
|
||||
"""Get PostgreSQL pod name"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=postgresql",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pod = result.stdout.strip()
|
||||
return pod if pod else None
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get PostgreSQL pod: {e}")
|
||||
return None
|
||||
|
||||
def simulate_database_connection_failure(self) -> bool:
|
||||
"""Simulate database connection failure by blocking port 5432"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Block incoming connections to PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "INPUT", "-p", "tcp", "--dport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
# Block outgoing connections from PostgreSQL
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-p", "tcp", "--sport", "5432", "-j", "DROP"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked PostgreSQL connections on pod {pod}")
|
||||
self.metrics["failure_type"] = "connection_blocked"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block PostgreSQL connections: {e}")
|
||||
return False
|
||||
|
||||
def simulate_database_high_latency(self, latency_ms: int = 5000) -> bool:
|
||||
"""Simulate high database latency using netem"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Add latency to PostgreSQL traffic
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", f"{latency_ms}ms"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
|
||||
logger.info(f"Added {latency_ms}ms latency to PostgreSQL on pod {pod}")
|
||||
self.metrics["failure_type"] = "high_latency"
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to add latency to PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
def restore_database(self) -> bool:
|
||||
"""Restore database connections"""
|
||||
pod = self.get_postgresql_pod()
|
||||
if not pod:
|
||||
return False
|
||||
|
||||
try:
|
||||
# Remove iptables rules
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "INPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False) # May fail if rules don't exist
|
||||
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
# Remove netem qdisc
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"tc", "qdisc", "del", "dev", "eth0", "root"
|
||||
]
|
||||
subprocess.run(cmd, check=False)
|
||||
|
||||
logger.info(f"Restored PostgreSQL connections on pod {pod}")
|
||||
return True
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to restore PostgreSQL: {e}")
|
||||
return False
|
||||
|
||||
async def test_database_connectivity(self) -> bool:
|
||||
"""Test if coordinator can connect to database"""
|
||||
try:
|
||||
# Get coordinator pod
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[0].metadata.name}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
coordinator_pod = result.stdout.strip()
|
||||
|
||||
if not coordinator_pod:
|
||||
return False
|
||||
|
||||
# Test database connection from coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, coordinator_pod, "--",
|
||||
"python", "-c", "import psycopg2; psycopg2.connect('postgresql://aitbc:password@postgresql:5432/aitbc'); print('OK')"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
return result.returncode == 0 and "OK" in result.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def test_api_health(self) -> bool:
|
||||
"""Test if coordinator API is healthy"""
|
||||
try:
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
service_url = f"http://{result.stdout.strip()}/v1/health"
|
||||
|
||||
# Test health endpoint
|
||||
response = subprocess.run(
|
||||
["curl", "-s", "--max-time", "5", service_url],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
|
||||
return response.returncode == 0 and "ok" in response.stdout
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 10):
|
||||
"""Generate synthetic load on coordinator API"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "coordinator",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/marketplace/offers") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def wait_for_recovery(self, timeout: int = 300) -> bool:
|
||||
"""Wait for database and API to recover"""
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
# Test database connectivity
|
||||
db_connected = await self.test_database_connectivity()
|
||||
|
||||
# Test API health
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if db_connected and api_healthy:
|
||||
recovery_time = time.time() - start_time
|
||||
self.metrics["recovery_time"] = recovery_time
|
||||
logger.info(f"Database and API recovered in {recovery_time:.2f} seconds")
|
||||
return True
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
logger.error("Database and API did not recover within timeout")
|
||||
return False
|
||||
|
||||
async def run_test(self, failure_type: str = "connection", failure_duration: int = 60):
|
||||
"""Run the complete database chaos test"""
|
||||
logger.info(f"Starting database chaos test - failure type: {failure_type}")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
if not db_connected or not api_healthy:
|
||||
logger.error("Baseline test failed - database or API not healthy")
|
||||
return False
|
||||
|
||||
logger.info("Baseline: Database and API are healthy")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Induce database failure
|
||||
logger.info("Phase 3: Inducing database failure")
|
||||
self.metrics["failure_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if failure_type == "connection":
|
||||
if not self.simulate_database_connection_failure():
|
||||
logger.error("Failed to induce database connection failure")
|
||||
return False
|
||||
elif failure_type == "latency":
|
||||
if not self.simulate_database_high_latency():
|
||||
logger.error("Failed to induce database latency")
|
||||
return False
|
||||
else:
|
||||
logger.error(f"Unknown failure type: {failure_type}")
|
||||
return False
|
||||
|
||||
# Verify failure is effective
|
||||
await asyncio.sleep(5)
|
||||
db_connected = await self.test_database_connectivity()
|
||||
api_healthy = await self.test_api_health()
|
||||
|
||||
logger.info(f"During failure - DB connected: {db_connected}, API healthy: {api_healthy}")
|
||||
|
||||
# Phase 4: Monitor during failure
|
||||
logger.info(f"Phase 4: Monitoring system during {failure_duration}s failure")
|
||||
|
||||
# Generate load during failure
|
||||
await self.generate_load(failure_duration)
|
||||
|
||||
# Phase 5: Restore database and monitor recovery
|
||||
logger.info("Phase 5: Restoring database")
|
||||
self.metrics["failure_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.restore_database():
|
||||
logger.error("Failed to restore database")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
if not await self.wait_for_recovery():
|
||||
logger.error("System did not recover after database restoration")
|
||||
return False
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Database chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_database_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Failure Type: {self.metrics['failure_type']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Failure Duration: {self.metrics['failure_start']} to {self.metrics['failure_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for database failure")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--failure-type", choices=["connection", "latency"], default="connection", help="Type of failure to simulate")
|
||||
parser.add_argument("--failure-duration", type=int, default=60, help="Failure duration in seconds")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would simulate {args.failure_type} database failure for {args.failure_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestDatabase(args.namespace) as test:
|
||||
success = await test.run_test(args.failure_type, args.failure_duration)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
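A usage sketch based on the flags above; `--failure-type` selects between the iptables connection block and the netem latency injection (both assume the corresponding tools are available inside the PostgreSQL pod).

```bash
./chaos_test_database.py --namespace default --failure-type connection --failure-duration 90
./chaos_test_database.py --namespace default --failure-type latency --failure-duration 60
```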
372
infra/scripts/chaos_test_network.py
Executable file
@@ -0,0 +1,372 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Chaos Testing Script - Network Partition
|
||||
Tests system resilience when blockchain nodes experience network partitions
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import aiohttp
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ChaosTestNetwork:
|
||||
"""Chaos testing for network partition scenarios"""
|
||||
|
||||
def __init__(self, namespace: str = "default"):
|
||||
self.namespace = namespace
|
||||
self.session = None
|
||||
self.metrics = {
|
||||
"test_start": None,
|
||||
"test_end": None,
|
||||
"partition_start": None,
|
||||
"partition_end": None,
|
||||
"recovery_time": None,
|
||||
"mttr": None,
|
||||
"error_count": 0,
|
||||
"success_count": 0,
|
||||
"scenario": "network_partition",
|
||||
"affected_nodes": []
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
self.session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10))
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
if self.session:
|
||||
await self.session.close()
|
||||
|
||||
def get_blockchain_pods(self) -> List[str]:
|
||||
"""Get list of blockchain node pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=blockchain-node",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get blockchain pods: {e}")
|
||||
return []
|
||||
|
||||
def get_coordinator_pods(self) -> List[str]:
|
||||
"""Get list of coordinator pods"""
|
||||
cmd = [
|
||||
"kubectl", "get", "pods",
|
||||
"-n", self.namespace,
|
||||
"-l", "app.kubernetes.io/name=coordinator",
|
||||
"-o", "jsonpath={.items[*].metadata.name}"
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
pods = result.stdout.strip().split()
|
||||
return pods
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to get coordinator pods: {e}")
|
||||
return []
|
||||
|
||||
def apply_network_partition(self, pods: List[str], target_pods: List[str]) -> bool:
|
||||
"""Apply network partition using iptables"""
|
||||
logger.info(f"Applying network partition: blocking traffic between {len(pods)} and {len(target_pods)} pods")
|
||||
|
||||
for pod in pods:
|
||||
if pod in target_pods:
|
||||
continue
|
||||
|
||||
# Block traffic from this pod to target pods
|
||||
for target_pod in target_pods:
|
||||
try:
|
||||
# Get target pod IP
|
||||
cmd = [
|
||||
"kubectl", "get", "pod", target_pod,
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.status.podIP}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
target_ip = result.stdout.strip()
|
||||
|
||||
if not target_ip:
|
||||
continue
|
||||
|
||||
# Apply iptables rule to block traffic
|
||||
iptables_cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-A", "OUTPUT", "-d", target_ip, "-j", "DROP"
|
||||
]
|
||||
subprocess.run(iptables_cmd, check=True)
|
||||
|
||||
logger.info(f"Blocked traffic from {pod} to {target_pod} ({target_ip})")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to block traffic from {pod} to {target_pod}: {e}")
|
||||
return False
|
||||
|
||||
self.metrics["affected_nodes"] = pods + target_pods
|
||||
return True
|
||||
|
||||
def remove_network_partition(self, pods: List[str]) -> bool:
|
||||
"""Remove network partition rules"""
|
||||
logger.info("Removing network partition rules")
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Flush OUTPUT chain (remove all rules)
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"iptables", "-F", "OUTPUT"
|
||||
]
|
||||
subprocess.run(cmd, check=True)
|
||||
logger.info(f"Removed network rules from {pod}")
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Failed to remove network rules from {pod}: {e}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
async def test_connectivity(self, pods: List[str]) -> Dict[str, bool]:
|
||||
"""Test connectivity between pods"""
|
||||
results = {}
|
||||
|
||||
for pod in pods:
|
||||
try:
|
||||
# Test if pod can reach coordinator
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pod, "--",
|
||||
"curl", "-s", "--max-time", "5", "http://coordinator:8011/v1/health"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
results[pod] = result.returncode == 0 and "ok" in result.stdout
|
||||
|
||||
except Exception:
|
||||
results[pod] = False
|
||||
|
||||
return results
|
||||
|
||||
async def monitor_consensus(self, duration: int = 60) -> bool:
|
||||
"""Monitor blockchain consensus health"""
|
||||
logger.info(f"Monitoring consensus for {duration} seconds")
|
||||
|
||||
start_time = time.time()
|
||||
last_height = 0
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
try:
|
||||
# Get block height from a random pod
|
||||
pods = self.get_blockchain_pods()
|
||||
if not pods:
|
||||
await asyncio.sleep(5)
|
||||
continue
|
||||
|
||||
# Use first pod to check height
|
||||
cmd = [
|
||||
"kubectl", "exec", "-n", self.namespace, pods[0], "--",
|
||||
"curl", "-s", "http://localhost:8080/v1/blocks/head"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
current_height = data.get("height", 0)
|
||||
|
||||
# Check if blockchain is progressing
|
||||
if current_height > last_height:
|
||||
last_height = current_height
|
||||
logger.info(f"Blockchain progressing, height: {current_height}")
|
||||
elif time.time() - start_time > 30: # Allow 30s for initial sync
|
||||
logger.warning(f"Blockchain stuck at height {current_height}")
|
||||
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Consensus check failed: {e}")
|
||||
|
||||
await asyncio.sleep(5)
|
||||
|
||||
return last_height > 0
|
||||
|
||||
async def generate_load(self, duration: int, concurrent: int = 5):
|
||||
"""Generate synthetic load on blockchain nodes"""
|
||||
logger.info(f"Generating load for {duration} seconds with {concurrent} concurrent requests")
|
||||
|
||||
# Get service URL
|
||||
cmd = [
|
||||
"kubectl", "get", "svc", "blockchain-node",
|
||||
"-n", self.namespace,
|
||||
"-o", "jsonpath={.spec.clusterIP}:{.spec.ports[0].port}"
|
||||
]
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
||||
base_url = f"http://{result.stdout.strip()}"
|
||||
|
||||
start_time = time.time()
|
||||
tasks = []
|
||||
|
||||
async def make_request():
|
||||
try:
|
||||
async with self.session.get(f"{base_url}/v1/blocks/head") as response:
|
||||
if response.status == 200:
|
||||
self.metrics["success_count"] += 1
|
||||
else:
|
||||
self.metrics["error_count"] += 1
|
||||
except Exception:
|
||||
self.metrics["error_count"] += 1
|
||||
|
||||
while time.time() - start_time < duration:
|
||||
# Create batch of requests
|
||||
batch = [make_request() for _ in range(concurrent)]
|
||||
tasks.extend(batch)
|
||||
|
||||
# Wait for batch to complete
|
||||
await asyncio.gather(*batch, return_exceptions=True)
|
||||
|
||||
# Brief pause
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.info(f"Load generation completed. Success: {self.metrics['success_count']}, Errors: {self.metrics['error_count']}")
|
||||
|
||||
async def run_test(self, partition_duration: int = 60, partition_ratio: float = 0.5):
|
||||
"""Run the complete network partition chaos test"""
|
||||
logger.info("Starting network partition chaos test")
|
||||
self.metrics["test_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
# Get all blockchain pods
|
||||
all_pods = self.get_blockchain_pods()
|
||||
if not all_pods:
|
||||
logger.error("No blockchain pods found")
|
||||
return False
|
||||
|
||||
# Determine which pods to partition
|
||||
num_partition = int(len(all_pods) * partition_ratio)
|
||||
partition_pods = all_pods[:num_partition]
|
||||
remaining_pods = all_pods[num_partition:]
|
||||
|
||||
logger.info(f"Partitioning {len(partition_pods)} pods out of {len(all_pods)} total")
|
||||
|
||||
# Phase 1: Baseline test
|
||||
logger.info("Phase 1: Baseline connectivity test")
|
||||
baseline_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Baseline connectivity: {sum(baseline_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 2: Generate initial load
|
||||
logger.info("Phase 2: Generating initial load")
|
||||
await self.generate_load(30)
|
||||
|
||||
# Phase 3: Apply network partition
|
||||
logger.info("Phase 3: Applying network partition")
|
||||
self.metrics["partition_start"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.apply_network_partition(remaining_pods, partition_pods):
|
||||
logger.error("Failed to apply network partition")
|
||||
return False
|
||||
|
||||
# Verify partition is effective
|
||||
await asyncio.sleep(5)
|
||||
partitioned_connectivity = await self.test_connectivity(all_pods)
|
||||
logger.info(f"Partitioned connectivity: {sum(partitioned_connectivity.values())}/{len(all_pods)} pods connected")
|
||||
|
||||
# Phase 4: Monitor during partition
|
||||
logger.info(f"Phase 4: Monitoring system during {partition_duration}s partition")
|
||||
consensus_healthy = await self.monitor_consensus(partition_duration)
|
||||
|
||||
# Phase 5: Remove partition and monitor recovery
|
||||
logger.info("Phase 5: Removing network partition")
|
||||
self.metrics["partition_end"] = datetime.utcnow().isoformat()
|
||||
|
||||
if not self.remove_network_partition(all_pods):
|
||||
logger.error("Failed to remove network partition")
|
||||
return False
|
||||
|
||||
# Wait for recovery
|
||||
logger.info("Waiting for network recovery...")
|
||||
await asyncio.sleep(10)
|
||||
|
||||
# Test connectivity after recovery
|
||||
recovery_connectivity = await self.test_connectivity(all_pods)
|
||||
recovery_time = time.time()
|
||||
|
||||
# Calculate recovery metrics
|
||||
all_connected = all(recovery_connectivity.values())
|
||||
if all_connected:
|
||||
self.metrics["recovery_time"] = recovery_time - (datetime.fromisoformat(self.metrics["partition_end"]).timestamp())
|
||||
logger.info(f"Network recovered in {self.metrics['recovery_time']:.2f} seconds")
|
||||
|
||||
# Phase 6: Post-recovery load test
|
||||
logger.info("Phase 6: Post-recovery load test")
|
||||
await self.generate_load(60)
|
||||
|
||||
# Final metrics
|
||||
self.metrics["test_end"] = datetime.utcnow().isoformat()
|
||||
self.metrics["mttr"] = self.metrics["recovery_time"]
|
||||
|
||||
# Save results
|
||||
self.save_results()
|
||||
|
||||
logger.info("Network partition chaos test completed successfully")
|
||||
return True
|
||||
|
||||
def save_results(self):
|
||||
"""Save test results to file"""
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"chaos_test_network_{timestamp}.json"
|
||||
|
||||
with open(filename, "w") as f:
|
||||
json.dump(self.metrics, f, indent=2)
|
||||
|
||||
logger.info(f"Test results saved to: {filename}")
|
||||
|
||||
# Print summary
|
||||
print("\n=== Chaos Test Summary ===")
|
||||
print(f"Scenario: {self.metrics['scenario']}")
|
||||
print(f"Test Duration: {self.metrics['test_start']} to {self.metrics['test_end']}")
|
||||
print(f"Partition Duration: {self.metrics['partition_start']} to {self.metrics['partition_end']}")
|
||||
print(f"MTTR: {self.metrics['mttr']:.2f} seconds" if self.metrics['mttr'] else "MTTR: N/A")
|
||||
print(f"Affected Nodes: {len(self.metrics['affected_nodes'])}")
|
||||
print(f"Success Requests: {self.metrics['success_count']}")
|
||||
print(f"Error Requests: {self.metrics['error_count']}")
|
||||
|
||||
|
||||
async def main():
|
||||
parser = argparse.ArgumentParser(description="Chaos test for network partition")
|
||||
parser.add_argument("--namespace", default="default", help="Kubernetes namespace")
|
||||
parser.add_argument("--partition-duration", type=int, default=60, help="Partition duration in seconds")
|
||||
parser.add_argument("--partition-ratio", type=float, default=0.5, help="Fraction of nodes to partition (0.0-1.0)")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Dry run without actual chaos")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.dry_run:
|
||||
logger.info(f"DRY RUN: Would partition {args.partition_ratio * 100}% of nodes for {args.partition_duration} seconds")
|
||||
return
|
||||
|
||||
# Verify kubectl is available
|
||||
try:
|
||||
subprocess.run(["kubectl", "version"], capture_output=True, check=True)
|
||||
except (subprocess.CalledProcessError, FileNotFoundError):
|
||||
logger.error("kubectl is not available or not configured")
|
||||
sys.exit(1)
|
||||
|
||||
# Run test
|
||||
async with ChaosTestNetwork(args.namespace) as test:
|
||||
success = await test.run_test(args.partition_duration, args.partition_ratio)
|
||||
sys.exit(0 if success else 1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
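A usage sketch based on the flags above; `--partition-ratio` is the fraction of blockchain pods to isolate.

```bash
./chaos_test_network.py --namespace default --partition-duration 60 --partition-ratio 0.5
./chaos_test_network.py --namespace default --partition-duration 120 --partition-ratio 0.25
```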
279
infra/scripts/restore_ledger.sh
Normal file
@@ -0,0 +1,279 @@
#!/bin/bash
|
||||
# Ledger Storage Restore Script for AITBC
|
||||
# Usage: ./restore_ledger.sh [namespace] [backup_directory]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_DIR=${2:-}
|
||||
TEMP_DIR="/tmp/ledger-restore-$(date +%s)"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v jq &> /dev/null; then
|
||||
error "jq is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup directory
|
||||
validate_backup_dir() {
|
||||
if [[ -z "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_directory]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -d "$BACKUP_DIR" ]]; then
|
||||
error "Backup directory not found: $BACKUP_DIR"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check for required files
|
||||
if [[ ! -f "$BACKUP_DIR/metadata.json" ]]; then
|
||||
error "metadata.json not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
error "chain.tar.gz not found in backup directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "Using backup directory: $BACKUP_DIR"
|
||||
}
|
||||
|
||||
# Get blockchain node pods
|
||||
get_blockchain_pods() {
|
||||
local pods=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pods" ]]; then
|
||||
pods=$(kubectl get pods -n "$NAMESPACE" -l app=blockchain-node -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pods" ]]; then
|
||||
error "Could not find blockchain node pods in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo $pods
|
||||
}
|
||||
|
||||
# Create backup of current ledger before restore
|
||||
create_pre_restore_backup() {
|
||||
local pods=($1)
|
||||
local pre_restore_backup="pre-restore-ledger-$(date +%Y%m%d_%H%M%S)"
|
||||
local pre_restore_dir="/tmp/ledger-backups/$pre_restore_backup"
|
||||
|
||||
warn "Creating backup of current ledger before restore..."
|
||||
mkdir -p "$pre_restore_dir"
|
||||
|
||||
# Use the first ready pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
# Create metadata
|
||||
cat > "$pre_restore_dir/metadata.json" << EOF
|
||||
{
|
||||
"backup_name": "$pre_restore_backup",
|
||||
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||
"namespace": "$NAMESPACE",
|
||||
"source_pod": "$pod",
|
||||
"latest_block_height": $current_height,
|
||||
"backup_type": "pre-restore"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Backup data directories
|
||||
local data_dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${data_dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -czf "/tmp/${pre_restore_backup}-${dir}.tar.gz" -C "/app/data" "$dir"
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}-${dir}.tar.gz" "$pre_restore_dir/${dir}.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "/tmp/${pre_restore_backup}-${dir}.tar.gz"
|
||||
fi
|
||||
done
|
||||
|
||||
log "Pre-restore backup created: $pre_restore_dir"
|
||||
break
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pods=($1)
|
||||
|
||||
warn "This will replace all current ledger data. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Scale down blockchain nodes
|
||||
info "Scaling down blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=0 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=blockchain-node -n "$NAMESPACE" --timeout=120s
|
||||
|
||||
# Scale up blockchain nodes
|
||||
info "Scaling up blockchain node deployment..."
|
||||
kubectl scale deployment blockchain-node --replicas=3 -n "$NAMESPACE"
|
||||
|
||||
# Wait for pods to be ready
|
||||
local ready_pods=()
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 && ${#ready_pods[@]} -eq 0 ]]; do
|
||||
local all_pods=$(get_blockchain_pods)
|
||||
for pod in $all_pods; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
ready_pods+=("$pod")
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
sleep 5
|
||||
((retries--))
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ ${#ready_pods[@]} -eq 0 ]]; then
|
||||
error "No blockchain nodes became ready after restore"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Restore data to all ready pods
|
||||
for pod in "${ready_pods[@]}"; do
|
||||
info "Restoring ledger data to pod $pod..."
|
||||
|
||||
# Create temp directory on pod
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p "$TEMP_DIR"
|
||||
|
||||
# Extract and copy chain data
|
||||
if [[ -f "$BACKUP_DIR/chain.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/chain.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/chain.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/chain
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/chain.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy wallet data
|
||||
if [[ -f "$BACKUP_DIR/wallets.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/wallets.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/wallets.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/wallets
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/wallets.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Extract and copy receipt data
|
||||
if [[ -f "$BACKUP_DIR/receipts.tar.gz" ]]; then
|
||||
kubectl cp "$BACKUP_DIR/receipts.tar.gz" "$NAMESPACE/$pod:$TEMP_DIR/receipts.tar.gz"
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- mkdir -p /app/data/receipts
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- tar -xzf "$TEMP_DIR/receipts.tar.gz" -C /app/data/
|
||||
fi
|
||||
|
||||
# Set correct permissions
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- chown -R app:app /app/data/
|
||||
|
||||
# Clean up temp directory
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -rf "$TEMP_DIR"
|
||||
|
||||
log "Ledger data restored to pod $pod"
|
||||
done
|
||||
|
||||
log "Ledger restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pods=($1)
|
||||
|
||||
log "Verifying ledger restore..."
|
||||
|
||||
# Read backup metadata
|
||||
local backup_height=$(jq -r '.latest_block_height' "$BACKUP_DIR/metadata.json")
|
||||
log "Backup contains blocks up to height: $backup_height"
|
||||
|
||||
# Verify on each pod
|
||||
for pod in "${pods[@]}"; do
|
||||
if kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=10s >/dev/null 2>&1; then
|
||||
# Check if node is responding
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/health >/dev/null 2>&1; then
|
||||
# Get current block height
|
||||
local current_height=$(kubectl exec -n "$NAMESPACE" "$pod" -- curl -s http://localhost:8080/v1/blocks/head | jq -r '.height // 0')
|
||||
|
||||
if [[ "$current_height" -eq "$backup_height" ]]; then
|
||||
log "✓ Pod $pod: Block height matches backup ($current_height)"
|
||||
else
|
||||
warn "⚠ Pod $pod: Block height mismatch (expected: $backup_height, actual: $current_height)"
|
||||
fi
|
||||
|
||||
# Check data directories
|
||||
local dirs=("chain" "wallets" "receipts")
|
||||
for dir in "${dirs[@]}"; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- test -d "/app/data/$dir"; then
|
||||
local file_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- find "/app/data/$dir" -type f | wc -l)
|
||||
log "✓ Pod $pod: $dir directory contains $file_count files"
|
||||
else
|
||||
warn "⚠ Pod $pod: $dir directory not found"
|
||||
fi
|
||||
done
|
||||
else
|
||||
error "✗ Pod $pod: Not responding to health checks"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting ledger restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_dir
|
||||
|
||||
local pods=($(get_blockchain_pods))
|
||||
create_pre_restore_backup "${pods[*]}"
|
||||
perform_restore "${pods[*]}"
|
||||
|
||||
# Get updated pod list after restore
|
||||
pods=($(get_blockchain_pods))
|
||||
verify_restore "${pods[*]}"
|
||||
|
||||
log "Ledger restore process completed successfully"
|
||||
warn "Please verify blockchain synchronization and application functionality"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
228
infra/scripts/restore_postgresql.sh
Executable file
@@ -0,0 +1,228 @@
#!/bin/bash
|
||||
# PostgreSQL Restore Script for AITBC
|
||||
# Usage: ./restore_postgresql.sh [namespace] [backup_file]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/postgresql-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! command -v pg_restore &> /dev/null; then
|
||||
error "pg_restore is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check whether the backup is gzipped and decompress it if needed
if [[ "$BACKUP_FILE" == *.gz ]]; then
    info "Decompressing backup file..."
    local decompressed_file="/tmp/restore_$(date +%s).dump"
    gunzip -c "$BACKUP_FILE" > "$decompressed_file"
    BACKUP_FILE="$decompressed_file"
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get PostgreSQL pod name
|
||||
get_postgresql_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=postgresql -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find PostgreSQL pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Wait for PostgreSQL to be ready
|
||||
wait_for_postgresql() {
|
||||
local pod=$1
|
||||
log "Waiting for PostgreSQL pod $pod to be ready..."
|
||||
|
||||
kubectl wait --for=condition=ready pod "$pod" -n "$NAMESPACE" --timeout=300s
|
||||
|
||||
# Check if PostgreSQL is accepting connections
|
||||
local retries=30
|
||||
while [[ $retries -gt 0 ]]; do
|
||||
if kubectl exec -n "$NAMESPACE" "$pod" -- pg_isready -U postgres >/dev/null 2>&1; then
|
||||
log "PostgreSQL is ready"
|
||||
return 0
|
||||
fi
|
||||
sleep 2
|
||||
((retries--))
|
||||
done
|
||||
|
||||
error "PostgreSQL did not become ready within timeout"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Create backup of current database before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current database before restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Create backup (set PGPASSWORD inside the pod; kubectl exec does not forward the local environment)
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" pg_dump -U "$db_user" -h localhost -d "$db_name" \
    --format=custom --file="/tmp/${pre_restore_backup}.dump"
|
||||
|
||||
# Copy backup locally
|
||||
kubectl cp "$NAMESPACE/$pod:/tmp/${pre_restore_backup}.dump" "$BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
|
||||
log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.dump"
|
||||
}
|
||||
|
||||
# Perform restore
|
||||
perform_restore() {
|
||||
local pod=$1
|
||||
|
||||
warn "This will replace the current database. Are you sure? (y/N)"
|
||||
read -r response
|
||||
if [[ ! "$response" =~ ^[Yy]$ ]]; then
|
||||
log "Restore cancelled by user"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Copy backup file to pod
|
||||
local remote_backup="/tmp/restore_$(date +%s).dump"
|
||||
kubectl cp "$BACKUP_FILE" "$NAMESPACE/$pod:$remote_backup"
|
||||
|
||||
# Drop the existing database and recreate it (PGPASSWORD is injected into the pod;
# kubectl exec does not forward local environment variables)
log "Dropping existing database..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "DROP DATABASE IF EXISTS $db_name;"

log "Creating new database..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d postgres -c "CREATE DATABASE $db_name;"

# Restore database
log "Restoring database from backup..."
kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" pg_restore -U "$db_user" -h localhost -d "$db_name" \
    --verbose --clean --if-exists "$remote_backup"
|
||||
|
||||
# Clean up remote file
|
||||
kubectl exec -n "$NAMESPACE" "$pod" -- rm -f "$remote_backup"
|
||||
|
||||
log "Database restore completed successfully"
|
||||
}
|
||||
|
||||
# Verify restore
|
||||
verify_restore() {
|
||||
local pod=$1
|
||||
|
||||
log "Verifying database restore..."
|
||||
|
||||
# Get database credentials
|
||||
local db_user=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.username}' 2>/dev/null | base64 -d || echo "postgres")
|
||||
local db_password=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.password}' 2>/dev/null | base64 -d || echo "")
|
||||
local db_name=$(kubectl get secret -n "$NAMESPACE" coordinator-postgresql -o jsonpath='{.data.database}' 2>/dev/null | base64 -d || echo "aitbc")
|
||||
|
||||
# Check table count (PGPASSWORD injected into the pod)
local table_count=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d ' ')
|
||||
|
||||
log "Database contains $table_count tables"
|
||||
|
||||
# Check if key tables exist
|
||||
local key_tables=("jobs" "marketplace_offers" "marketplace_bids" "blocks" "transactions")
|
||||
for table in "${key_tables[@]}"; do
|
||||
local exists=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
    env PGPASSWORD="$db_password" psql -U "$db_user" -h localhost -d "$db_name" -t -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_name = '$table');" | tr -d ' ')
|
||||
if [[ "$exists" == "t" ]]; then
|
||||
log "✓ Table $table exists"
|
||||
else
|
||||
warn "⚠ Table $table not found"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# Main execution
|
||||
main() {
|
||||
log "Starting PostgreSQL restore process"
|
||||
|
||||
check_dependencies
|
||||
validate_backup_file
|
||||
|
||||
local pod=$(get_postgresql_pod)
|
||||
wait_for_postgresql "$pod"
|
||||
|
||||
create_pre_restore_backup "$pod"
|
||||
perform_restore "$pod"
|
||||
verify_restore "$pod"
|
||||
|
||||
log "PostgreSQL restore process completed successfully"
|
||||
warn "Please verify application functionality after restore"
|
||||
}
|
||||
|
||||
# Run main function
|
||||
main "$@"
223
infra/scripts/restore_redis.sh
Normal file
@@ -0,0 +1,223 @@
#!/bin/bash
|
||||
# Redis Restore Script for AITBC
|
||||
# Usage: ./restore_redis.sh [namespace] [backup_file]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Configuration
|
||||
NAMESPACE=${1:-default}
|
||||
BACKUP_FILE=${2:-}
|
||||
BACKUP_DIR="/tmp/redis-backups"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Logging function
|
||||
log() {
|
||||
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
|
||||
}
|
||||
|
||||
error() {
|
||||
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR:${NC} $1" >&2
|
||||
}
|
||||
|
||||
warn() {
|
||||
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING:${NC} $1"
|
||||
}
|
||||
|
||||
info() {
|
||||
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')] INFO:${NC} $1"
|
||||
}
|
||||
|
||||
# Check dependencies
|
||||
check_dependencies() {
|
||||
if ! command -v kubectl &> /dev/null; then
|
||||
error "kubectl is not installed or not in PATH"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
# Validate backup file
|
||||
validate_backup_file() {
|
||||
if [[ -z "$BACKUP_FILE" ]]; then
|
||||
error "Backup file must be specified"
|
||||
echo "Usage: $0 [namespace] [backup_file]"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If file doesn't exist locally, try to find it in backup dir
|
||||
if [[ ! -f "$BACKUP_FILE" ]]; then
|
||||
local potential_file="$BACKUP_DIR/$(basename "$BACKUP_FILE")"
|
||||
if [[ -f "$potential_file" ]]; then
|
||||
BACKUP_FILE="$potential_file"
|
||||
else
|
||||
error "Backup file not found: $BACKUP_FILE"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
log "Using backup file: $BACKUP_FILE"
|
||||
}
|
||||
|
||||
# Get Redis pod name
|
||||
get_redis_pod() {
|
||||
local pod=$(kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
if [[ -z "$pod" ]]; then
|
||||
pod=$(kubectl get pods -n "$NAMESPACE" -l app=redis -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
|
||||
fi
|
||||
|
||||
if [[ -z "$pod" ]]; then
|
||||
error "Could not find Redis pod in namespace $NAMESPACE"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "$pod"
|
||||
}
|
||||
|
||||
# Create backup of current Redis data before restore
|
||||
create_pre_restore_backup() {
|
||||
local pod=$1
|
||||
local pre_restore_backup="pre-restore-redis-$(date +%Y%m%d_%H%M%S)"
|
||||
|
||||
warn "Creating backup of current Redis data before restore..."
|
||||
|
||||
    # Record the last successful save time, then trigger a background save
    local last_save_before=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
    kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli BGSAVE

    # Wait for the background save to complete (LASTSAVE advances when BGSAVE finishes)
    local retries=60
    while [[ $retries -gt 0 ]]; do
        local last_save_now=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli LASTSAVE)
        if [[ "$last_save_now" -gt "$last_save_before" ]]; then
            break
        fi
        sleep 2
        ((retries--))
    done

    # Copy backup locally
    mkdir -p "$BACKUP_DIR"
    kubectl cp "$NAMESPACE/$pod:/data/dump.rdb" "$BACKUP_DIR/${pre_restore_backup}.rdb"

    # Also backup AOF if it exists
    if kubectl exec -n "$NAMESPACE" "$pod" -- test -f /data/appendonly.aof; then
        kubectl cp "$NAMESPACE/$pod:/data/appendonly.aof" "$BACKUP_DIR/${pre_restore_backup}.aof"
    fi

    log "Pre-restore backup created: $BACKUP_DIR/${pre_restore_backup}.rdb"
}

# Perform restore
perform_restore() {
    local pod=$1

    warn "This will replace all current Redis data. Are you sure? (y/N)"
    read -r response
    if [[ ! "$response" =~ ^[Yy]$ ]]; then
        log "Restore cancelled by user"
        exit 0
    fi

    # Scale down Redis to ensure clean restore
    info "Scaling down Redis deployment..."
    kubectl scale deployment redis --replicas=0 -n "$NAMESPACE"

    # Wait for pod to terminate
    kubectl wait --for=delete pod -l app=redis -n "$NAMESPACE" --timeout=120s

    # Scale up Redis
    info "Scaling up Redis deployment..."
    kubectl scale deployment redis --replicas=1 -n "$NAMESPACE"

    # Wait for new pod to be ready
    local new_pod=$(get_redis_pod)
    kubectl wait --for=condition=ready pod "$new_pod" -n "$NAMESPACE" --timeout=300s

    # Stop Redis server (assumes the container entrypoint tolerates the server
    # being stopped and restarted in place; the connection drop caused by
    # SHUTDOWN is expected, so it must not abort the script)
    info "Stopping Redis server..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli SHUTDOWN NOSAVE || true

    # Clear existing data
    info "Clearing existing Redis data..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- rm -f /data/dump.rdb /data/appendonly.aof

    # Copy backup file into place as dump.rdb so Redis loads it on startup
    info "Copying backup file..."
    local remote_file="/data/dump.rdb"
    kubectl cp "$BACKUP_FILE" "$NAMESPACE/$new_pod:$remote_file"

    # Set correct permissions
    kubectl exec -n "$NAMESPACE" "$new_pod" -- chown redis:redis "$remote_file"

    # Start Redis server
    info "Starting Redis server..."
    kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-server --daemonize yes

    # Wait for Redis to be ready
    local retries=30
    while [[ $retries -gt 0 ]]; do
        if kubectl exec -n "$NAMESPACE" "$new_pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
            log "Redis is ready"
            break
        fi
        sleep 2
        ((retries--))
    done

    if [[ $retries -eq 0 ]]; then
        error "Redis did not start properly after restore"
        exit 1
    fi

    log "Redis restore completed successfully"
}

# Verify restore
verify_restore() {
    local pod=$1

    log "Verifying Redis restore..."

    # Check database size
    local db_size=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli DBSIZE)
    log "Database contains $db_size keys"

    # Check memory usage
    local memory=$(kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli INFO memory | grep used_memory_human | cut -d: -f2 | tr -d '\r')
    log "Memory usage: $memory"

    # Check if Redis is responding to commands
    if kubectl exec -n "$NAMESPACE" "$pod" -- redis-cli ping 2>/dev/null | grep -q PONG; then
        log "✓ Redis is responding normally"
    else
        error "✗ Redis is not responding"
        exit 1
    fi
}

# Main execution
main() {
    log "Starting Redis restore process"

    check_dependencies
    validate_backup_file

    local pod=$(get_redis_pod)
    create_pre_restore_backup "$pod"
    perform_restore "$pod"

    # Get new pod name after restore
    pod=$(get_redis_pod)
    verify_restore "$pod"

    log "Redis restore process completed successfully"
    warn "Please verify application functionality after restore"
}

# Run main function
main "$@"
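# Example invocation (illustrative): restore the Redis instance in the "aitbc"
# namespace from a dump taken earlier. A bare filename is also accepted and is
# resolved against /tmp/redis-backups by validate_backup_file.
#
#   ./restore_redis.sh aitbc /tmp/redis-backups/pre-restore-redis-20240101_020000.rdb
#   ./restore_redis.sh aitbc pre-restore-redis-20240101_020000.rdb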
25
infra/terraform/environments/dev/main.tf
Normal file
@ -0,0 +1,25 @@
# Development environment configuration
#
# Plain Terraform (run with `terraform init` / `terraform apply`): the shared
# EKS module is instantiated with dev-sized settings via a module block.

module "kubernetes" {
  source = "../../modules/kubernetes"

  cluster_name           = "aitbc-dev"
  environment            = "dev"
  aws_region             = "us-west-2"
  vpc_cidr               = "10.0.0.0/16"
  private_subnet_cidrs   = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnet_cidrs    = ["10.0.101.0/24", "10.0.102.0/24"]
  availability_zones     = ["us-west-2a", "us-west-2b"]
  kubernetes_version     = "1.28"
  enable_public_endpoint = true
  desired_node_count     = 2
  min_node_count         = 1
  max_node_count         = 3
  instance_types         = ["t3.medium"]
}
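# Optional addition (illustrative, not part of this file as committed):
# re-export the module's outputs at the environment level so `terraform output`
# can feed kubeconfig setup.

output "cluster_name" {
  value = module.kubernetes.cluster_name
}

output "cluster_endpoint" {
  value = module.kubernetes.cluster_endpoint
}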
199
infra/terraform/modules/kubernetes/main.tf
Normal file
@ -0,0 +1,199 @@
# Kubernetes cluster module for AITBC infrastructure

terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.20"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.10"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC for the cluster
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.cluster_name}-vpc"
    Environment = var.environment
    Project     = "aitbc"
  }
}

# Subnets
resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name                                        = "${var.cluster_name}-private-${count.index}"
    Environment                                 = var.environment
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}

resource "aws_subnet" "public" {
  count = length(var.public_subnet_cidrs)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name                                        = "${var.cluster_name}-public-${count.index}"
    Environment                                 = var.environment
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }
}

# EKS Cluster
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids = concat(
      aws_subnet.private[*].id,
      aws_subnet.public[*].id
    )
    endpoint_private_access = true
    endpoint_public_access  = var.enable_public_endpoint
  }

  depends_on = [
    aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy
  ]

  tags = {
    Name        = var.cluster_name
    Environment = var.environment
    Project     = "aitbc"
  }
}

# Node groups
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.cluster_name}-main"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  scaling_config {
    desired_size = var.desired_node_count
    max_size     = var.max_node_count
    min_size     = var.min_node_count
  }

  instance_types = var.instance_types

  depends_on = [
    aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy,
    aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy,
    aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly
  ]

  tags = {
    Name        = "${var.cluster_name}-main"
    Environment = var.environment
    Project     = "aitbc"
  }
}

# IAM roles
resource "aws_iam_role" "cluster" {
  name = "${var.cluster_name}-cluster"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "eks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role" "node" {
  name = "${var.cluster_name}-node"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

# IAM policy attachments
resource "aws_iam_role_policy_attachment" "cluster_AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.cluster.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.node.name
}

# Outputs
output "cluster_name" {
  description = "The name of the EKS cluster"
  value       = aws_eks_cluster.main.name
}

output "cluster_endpoint" {
  description = "The endpoint for the EKS cluster"
  value       = aws_eks_cluster.main.endpoint
}

output "cluster_certificate_authority_data" {
  description = "The certificate authority data for the EKS cluster"
  value       = aws_eks_cluster.main.certificate_authority[0].data
}

output "cluster_security_group_id" {
  description = "The security group ID of the EKS cluster"
  value       = aws_eks_cluster.main.vpc_config[0].cluster_security_group_id
}
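# NOTE (illustrative, not part of the committed module): as written, the module
# defines no internet gateway, NAT gateway, or route tables, so nodes in the
# private subnets have no outbound path for image pulls or EKS bootstrap. A
# minimal routing sketch is shown below, assuming a single NAT gateway in the
# first public subnet; resource names here are assumptions, not from this commit.

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
  depends_on    = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}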
75
infra/terraform/modules/kubernetes/variables.tf
Normal file
@ -0,0 +1,75 @@
variable "cluster_name" {
  description = "Name of the EKS cluster"
  type        = string
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-west-2"
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
  default     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
  default     = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

variable "kubernetes_version" {
  description = "Kubernetes version"
  type        = string
  default     = "1.28"
}

variable "enable_public_endpoint" {
  description = "Enable public EKS endpoint"
  type        = bool
  default     = false
}

variable "desired_node_count" {
  description = "Desired number of worker nodes"
  type        = number
  default     = 3
}

variable "min_node_count" {
  description = "Minimum number of worker nodes"
  type        = number
  default     = 1
}

variable "max_node_count" {
  description = "Maximum number of worker nodes"
  type        = number
  default     = 10
}

variable "instance_types" {
  description = "EC2 instance types for worker nodes"
  type        = list(string)
  default     = ["m5.large", "m5.xlarge"]
}