miner-node/ — Worker Node Daemon for GPU/CPU Tasks

Goal: Implement a Docker-free worker daemon that connects to the Coordinator API, advertises capabilities (CPU/GPU), fetches jobs, executes them in a sandboxed workspace, and streams results/metrics back.


1) Scope & MVP

MVP Features

  • Node registration with Coordinator (auth token + capability descriptor).
  • Heartbeat & liveness (interval ± jitter, backoff on failure).
  • Job fetch → ack → execute → upload result → finalize.
  • Two runner types:
    • CLI runner: executes a provided command with arguments (allowlist-based).
    • Python runner: executes a trusted task module with parameters.
  • CPU/GPU capability detection (CUDA, VRAM, driver info) without Docker.
  • Sandboxed working dir per job under /var/lib/aitbc/miner/jobs/<job-id>.
  • Resource controls (nice/ionice/ulimit; optional cgroup v2 if present).
  • Structured JSON logging and minimal metrics.

Post-MVP

  • Chunked artifact upload; resumable transfers.
  • Prometheus /metrics endpoint (pull).
  • GPU multi-card scheduling & fractional allocation policy.
  • On-node model cache management (size, eviction, pinning).
  • Signed task manifests & attestation of execution.
  • Secure TMPFS for secrets; hardware key support (YubiKey).

2) High-Level Architecture

client → coordinator-api → miner-node(s) → results store → coordinator-api → client

Miner components:

  • Agent (control loop): registration, heartbeat, fetch/dispatch, result reporting.
  • Capability Probe: CPU/GPU inventory (CUDA, VRAM), free RAM/disk, load.
  • Schedulers: simple FIFO for MVP; one job per GPU or CPU slot.
  • Runners: CLI runner & Python runner.
  • Sandbox: working dirs, resource limits, network I/O gating (optional), file allowlist.
  • Telemetry: JSON logs, minimal metrics; per-job timeline.

3) Directory Layout (on node)

/var/lib/aitbc/miner/
  ├─ jobs/
  │   ├─ <job-id>/
  │   │   ├─ input/
  │   │   ├─ work/
  │   │   ├─ output/
  │   │   └─ logs/
  ├─ cache/            # model/assets cache (optional)
  └─ tmp/
/etc/aitbc/miner/
  ├─ config.yaml
  └─ allowlist.d/      # allowed CLI programs & argument schema snippets
/var/log/aitbc/miner/
/usr/local/lib/aitbc/miner/  # python package venv install target

4) Config (YAML)

node_id: "node-<shortid>"
coordinator:
  base_url: "https://coordinator.local/api/v1"
  auth_token: "env:MINER_AUTH"        # read from env at runtime
  tls_verify: true
  timeout_s: 20

heartbeat:
  interval_s: 15
  jitter_pct: 10
  backoff:
    min_s: 5
    max_s: 120

runners:
  cli:
    enable: true
    allowlist_files:
      - "/etc/aitbc/miner/allowlist.d/ffmpeg.yaml"
      - "/etc/aitbc/miner/allowlist.d/whisper.yaml"
  python:
    enable: true
    task_paths:
      - "/usr/local/lib/aitbc/miner/tasks"
    venv: "/usr/local/lib/aitbc/miner/.venv"

resources:
  max_concurrent_cpu: 2
  max_concurrent_gpu: 1
  cpu_nice: 10
  io_class: "best-effort"
  io_level: 6
  mem_soft_mb: 16384

workspace:
  root: "/var/lib/aitbc/miner/jobs"
  keep_success: 24h
  keep_failed: 7d

logging:
  level: "info"
  json: true
  path: "/var/log/aitbc/miner/miner.jsonl"

5) Environment & Dependencies

  • OS: Debian 12/13 (systemd).
  • Python: 3.11+ in venv under /usr/local/lib/aitbc/miner/.venv.
  • Libraries: httpx, pydantic, uvloop (optional), pyyaml, psutil.
  • GPU (optional): NVIDIA driver installed; nvidia-smi available; CUDA 12.x runtime on path for GPU tasks.

Install skeleton

python3 -m venv /usr/local/lib/aitbc/miner/.venv
/usr/local/lib/aitbc/miner/.venv/bin/pip install --upgrade pip
/usr/local/lib/aitbc/miner/.venv/bin/pip install httpx pydantic pyyaml psutil uvloop
install -d /etc/aitbc/miner /var/lib/aitbc/miner/{jobs,cache,tmp} /var/log/aitbc/miner

6) Systemd Service

/etc/systemd/system/aitbc-miner.service

[Unit]
Description=AITBC Miner Node
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Environment=MINER_AUTH=***REDACTED***
ExecStart=/usr/local/lib/aitbc/miner/.venv/bin/python -m aitbc_miner --config /etc/aitbc/miner/config.yaml
User=games
Group=games
# Lower CPU/IO priority by default
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=6
Restart=always
RestartSec=5
# Hardening
NoNewPrivileges=true
ProtectSystem=full
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/aitbc/miner /var/log/aitbc/miner

[Install]
WantedBy=multi-user.target

7) Capability Probe (sent to Coordinator)

Example payload:

{
  "node_id": "node-abc123",
  "version": "0.1.0",
  "cpu": {"cores": 16, "arch": "x86_64"},
  "memory_mb": 64000,
  "disk_free_mb": 250000,
  "gpu": [
    {
      "vendor": "nvidia", "name": "RTX 4060 Ti 16GB",
      "vram_mb": 16384,
      "cuda": {"version": "12.3", "driver": "545.23.06"}
    }
  ],
  "runners": ["cli", "python"],
  "tags": ["debian", "cuda", "cpu"]
}
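
A probe sketch that produces this shape: psutil covers CPU/RAM/disk, and GPU fields come from nvidia-smi when present. The CUDA runtime version needs a separate check and is omitted here.

# probe.py — capability probe sketch
import platform
import shutil
import subprocess
import psutil

def probe_gpus() -> list[dict]:
    if shutil.which("nvidia-smi") is None:
        return []                                  # CPU-only node
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    gpus = []
    for line in out.strip().splitlines():
        name, vram_mb, driver = (f.strip() for f in line.split(","))
        gpus.append({"vendor": "nvidia", "name": name,
                     "vram_mb": int(vram_mb), "cuda": {"driver": driver}})
    return gpus

def probe(node_id: str, version: str = "0.1.0") -> dict:
    disk = shutil.disk_usage("/var/lib/aitbc/miner")
    return {
        "node_id": node_id, "version": version,
        "cpu": {"cores": psutil.cpu_count(logical=True), "arch": platform.machine()},
        "memory_mb": psutil.virtual_memory().total // 2**20,
        "disk_free_mb": disk.free // 2**20,
        "gpu": probe_gpus(),
        "runners": ["cli", "python"],
    }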

8) Coordinator API Contract (MVP)

Endpoints (HTTPS, JSON):

  • POST /nodes/register → returns signed node_token (or 401)
  • POST /nodes/heartbeat {node_id, load, free_mb, gpu_free} → 200
  • POST /jobs/pull {node_id, filters} → {job | none}
  • POST /jobs/ack {job_id, node_id} → 200
  • POST /jobs/progress {job_id, pct, note} → 200
  • POST /jobs/result → multipart (metadata.json + artifacts/*) → 200
  • POST /jobs/fail {job_id, error_code, error_msg, logs_ref} → 200

Auth

  • Bearer token in header (Node → Coordinator): Authorization: Bearer <node_token>
  • Coordinator signs job.manifest with HMAC(sha256) or Ed25519 (post-MVP).

Job manifest (subset)

{
  "job_id": "j-20250926-001",
  "runner": "cli",
  "requirements": {"gpu": true, "vram_mb": 12000, "cpu_threads": 4},
  "timeout_s": 3600,
  "input": {"urls": ["https://.../input1"], "inline": {"text": "..."}},
  "command": "ffmpeg",
  "args": ["-y", "-i", "input1.mp4", "-c:v", "libx264", "output.mp4"],
  "artifacts": [{"path": "output.mp4", "type": "video/mp4", "max_mb": 5000}]
}
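
A minimal pull → ack round-trip against this contract (httpx; retries, backoff, and error mapping omitted):

# agent.py fragment — pull → ack sketch
import httpx

def pull_and_ack(cfg: dict, node_token: str) -> dict | None:
    co = cfg["coordinator"]
    headers = {"Authorization": f"Bearer {node_token}"}
    with httpx.Client(base_url=co["base_url"], headers=headers,
                      verify=co["tls_verify"], timeout=co["timeout_s"]) as client:
        r = client.post("/jobs/pull", json={"node_id": cfg["node_id"], "filters": None})
        r.raise_for_status()
        job = r.json().get("job")
        if job is None:
            return None                            # nothing queued for this node
        client.post("/jobs/ack", json={"job_id": job["job_id"],
                                       "node_id": cfg["node_id"]}).raise_for_status()
        return job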

9) Runner Design

CLI Runner

  • Validate command against allowlist (/etc/aitbc/miner/allowlist.d/*.yaml).
  • Validate args against per-tool schema (regex & size caps).
  • Materialize inputs in job workspace; set PATH, CUDA_VISIBLE_DEVICES.
  • Launch via subprocess.Popen with preexec_fn applying nice, ionice, setrlimit (see the launch sketch below).
  • Live-tail stdout/stderr to logs/exec.log; throttle progress pings.
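
A minimal launch sketch, assuming the allowlist check (§15) has already passed; helper names are illustrative, not the final module API:

# runners/cli.py — launch sketch (illustrative names)
import os
import resource
import subprocess

def _child_limits():
    # Runs in the child between fork and exec (via preexec_fn).
    os.nice(10)                                   # lower CPU priority
    os.setsid()                                   # own process group → clean kill on timeout/OOM
    soft = 16384 * 2**20                          # mem_soft_mb from config
    resource.setrlimit(resource.RLIMIT_AS, (soft, resource.RLIM_INFINITY))

def launch(argv: list[str], workdir: str, gpu_index: int | None, log_path: str) -> subprocess.Popen:
    env = dict(os.environ)
    if gpu_index is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    log = open(log_path, "ab")                    # live-tailed by the agent
    p = subprocess.Popen(argv, cwd=workdir, env=env,
                         stdout=log, stderr=subprocess.STDOUT,
                         preexec_fn=_child_limits)
    # ionice is easiest applied from the parent after spawn, e.g.
    # psutil.Process(p.pid).ionice(psutil.IOPRIO_CLASS_BE, value=6)
    return p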

Python Runner

  • Import trusted module tasks.<name>:run(**params) from configured paths.
  • Run in same venv; optional venv per task later.
  • Enforce timeouts; capture logs; write artifacts to output/ (see the sketch below).
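
A sketch of that call; note a thread-based timeout cannot hard-kill a stuck task, which is one reason a per-task subprocess is a post-MVP candidate:

# runners/python.py — trusted task runner sketch
import importlib
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as TaskTimeout

def run_task(name: str, params: dict, timeout_s: int):
    # task_paths from config must already be on sys.path
    mod = importlib.import_module(f"tasks.{name}")
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(mod.run, **params)
    try:
        return future.result(timeout=timeout_s)
    except TaskTimeout:
        raise RuntimeError(f"E_TIMEOUT: tasks.{name} exceeded {timeout_s}s")
    finally:
        pool.shutdown(wait=False)                 # don't block the agent; see caveat above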

10) Resource Controls (No Docker)

  • CPU: nice(10); optional cgroup v2 cpu.max if available (see the sketch below).
  • IO: ionice -c 2 -n 6 (best-effort) for heavy disk ops.
  • Memory: setrlimit(RLIMIT_AS) soft cap; kill on OOM.
  • GPU: select by policy (least used VRAM). No hard memory partitioning in MVP.
  • Network: allowlist outbound hosts; deny by default (optional phase 2).
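
Where cgroup v2 is mounted and a subtree is delegated to the miner, a per-job CPU cap can be applied by writing cpu.max; the paths and delegation here are assumptions about the host:

# util/limits.py — optional cgroup v2 CPU cap sketch
from pathlib import Path

CG_ROOT = Path("/sys/fs/cgroup/aitbc-miner")          # assumes a delegated subtree

def cap_cpu(job_id: str, pid: int, cpu_pct: int) -> None:
    cg = CG_ROOT / job_id
    cg.mkdir(parents=True, exist_ok=True)
    period_us = 100_000
    quota_us = max(1_000, period_us * cpu_pct // 100)
    (cg / "cpu.max").write_text(f"{quota_us} {period_us}\n")  # "50000 100000" = half a core
    (cg / "cgroup.procs").write_text(f"{pid}\n")              # move the job process in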

11) Job Lifecycle (State Machine)

IDLE → PULLING → ACKED → PREP → RUNNING → UPLOADING → DONE | FAILED | RETRY_WAIT

  • Retries: exponential backoff, max N; idempotent uploads.
  • On crash: on-start recovery scans jobs/*/state.json and reconciles with Coordinator (atomic-write sketch below).
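
state.json should be written atomically so a crash mid-write never leaves a half-written file for the recovery scan; a sketch:

# util/fs.py — atomic per-job state write
import json
import os

STATES = ("IDLE", "PULLING", "ACKED", "PREP", "RUNNING",
          "UPLOADING", "DONE", "FAILED", "RETRY_WAIT")

def write_state(job_dir: str, state: str, **attrs) -> None:
    assert state in STATES
    tmp = os.path.join(job_dir, "state.json.tmp")
    with open(tmp, "w") as fh:
        json.dump({"state": state, **attrs}, fh)
    os.replace(tmp, os.path.join(job_dir, "state.json"))  # atomic rename on POSIX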

12) Logging & Metrics

  • JSON lines in /var/log/aitbc/miner/miner.jsonl with fields: ts, level, node_id, job_id, event, attrs{} (formatter sketch below).
  • Optional /healthz (HTTP) returning 200 + brief status.
  • Future: Prometheus /metrics with gauges (queue, running, VRAM free, CPU load).
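
A formatter sketch producing those fields; job_id and attrs ride along via logging's standard extra= mechanism:

# util/log.py — JSON-lines formatter sketch
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def __init__(self, node_id: str):
        super().__init__()
        self.node_id = node_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z", time.localtime(record.created)),
            "level": record.levelname.lower(),
            "node_id": self.node_id,
            "job_id": getattr(record, "job_id", None),
            "event": record.getMessage(),
            "attrs": getattr(record, "attrs", {}),
        })

# usage: log.info("job_started", extra={"job_id": jid, "attrs": {"runner": "cli"}})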

13) Security Model

  • TLS required; pin CA or enable cert validation per env.
  • Node bootstrap token (MINER_AUTH) exchanged for node_token at registration.
  • Strict allowlist for CLI tools + args; size/time caps.
  • Secrets never written to disk unencrypted; pass via env vars or in-memory.
  • Wipe workdirs on success (per policy); keep failed for triage.

14) Windsurf Implementation Plan

Milestone 1 — Skeleton

  1. aitbc_miner/ package: main.py, config.py, agent.py, probe.py, runners/{cli.py, python.py}, util/{limits.py, fs.py, log.py}.
  2. Load YAML config, bootstrap logs, print probe JSON.
  3. Implement /healthz (optional FastAPI or bare aiohttp) for local checks.

Milestone 2 — Control Loop

  1. Register → store node_token (in memory only).
  2. Heartbeat task (async), backoff on network errors.
  3. Pull/ack & single-slot executor; write state.json.

Milestone 3 — Runners

  1. CLI allowlist loader + validator; subprocess with limits.
  2. Python runner calling tasks.example:run.
  3. Upload artifacts via multipart; handle large files with chunking stub.

Milestone 4 — Hardening & Ops

  1. Crash recovery; cleanup policy; TTL sweeper.
  2. Metrics counters; structured logging fields.
  3. Systemd unit; install scripts; doc.

15) Minimal Allowlist Example (ffmpeg)

# /etc/aitbc/miner/allowlist.d/ffmpeg.yaml
command:
  path: "/usr/bin/ffmpeg"
args:
  - ["-y"]
  - ["-i", ".+\\.(mp4|wav|mkv)$"]
  - ["-c:v", "(libx264|copy)"]
  - ["-c:a", "(aac|copy)"]
  - ["-b:v", "[1-9][0-9]{2,5}k"]
  - ["output\\.(mp4|mkv)"]
max_total_args_len: 4096
max_runtime_s: 7200
max_output_mb: 5000
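
One plausible validator for this schema (the semantics are an assumption: groups are optional, matched in order, and every element is a full-match regex, so literal flags like "-y" match themselves):

# runners/cli.py fragment — allowlist validation sketch
import re
import yaml

def validate_args(manifest: dict, allowlist_path: str) -> list[str]:
    with open(allowlist_path) as fh:
        spec = yaml.safe_load(fh)
    args = list(manifest["args"])
    if sum(len(a) for a in args) > spec["max_total_args_len"]:
        raise ValueError("E_DENY: args too long")
    argv = [spec["command"]["path"]]
    for group in spec["args"]:
        if args and re.fullmatch(group[0], args[0]):      # group applies → consume it whole
            chunk, args = args[:len(group)], args[len(group):]
            for pattern, actual in zip(group, chunk):
                if not re.fullmatch(pattern, actual):
                    raise ValueError(f"E_DENY: {actual!r} fails {pattern!r}")
            argv += chunk
    if args:
        raise ValueError(f"E_DENY: unexpected args {args!r}")
    return argv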

16) Mock Coordinator (for local testing)

Run a tiny dev server to hand out a single job and accept results.

# mock_coordinator.py (FastAPI)
from fastapi import FastAPI, UploadFile, File, Form
from pydantic import BaseModel

app = FastAPI()
JOB = {
  "job_id": "j-local-1",
  "runner": "cli",
  "requirements": {"gpu": False},
  "timeout_s": 120,
  "command": "echo",
  "args": ["hello", "world"],
  "artifacts": [{"path": "output.txt", "type": "text/plain", "max_mb": 1}]
}

class PullReq(BaseModel):
    node_id: str
    filters: dict | None = None

@app.post("/api/v1/jobs/pull")
def pull(req: PullReq):
    return {"job": JOB}

@app.post("/api/v1/jobs/ack")
def ack(job_id: str, node_id: str):
    return {"ok": True}

@app.post("/api/v1/jobs/result")
def result(job_id: str = Form(...), metadata: str = Form(...), artifact: UploadFile = File(...)):
    return {"ok": True}

17) Developer UX (Make Targets)

make venv        # create venv + install deps
make run         # run miner with local config
make fmt         # ruff/black (optional)
make test        # unit tests

18) Operational Runbook

  • Start/Stop: systemctl enable --now aitbc-miner
  • Logs: journalctl -u aitbc-miner -f and /var/log/aitbc/miner/miner.jsonl
  • Rotate: logrotate config (size 50M, keep 7)
  • Upgrade: drain → stop → replace venv → start → verify heartbeat
  • Health: /healthz 200 + JSON {running, queued, cpu_load, vram_free}

19) Failure Modes & Recovery

  • Network errors: exponential backoff; keep heartbeat local status.
  • Job invalid: fail fast with reason; do not retry.
  • Runner denied: allowlist miss → fail with E_DENY.
  • OOM: kill process group; mark E_OOM.
  • GPU unavailable: requeue with reason E_NOGPU.

20) Roadmap Notes

  • Binary task bundles with signed SBOM.
  • Remote cache warming via Coordinator hints.
  • Multi-queue scheduling (latency vs throughput).
  • MIG/compute-instance support if hardware allows.

21) Checklist for Windsurf

  1. Create aitbc_miner/ package skeleton with modules listed in §14.
  2. Implement config loader + capability probe output.
  3. Implement async agent loop: register → heartbeat → pull/ack.
  4. Implement CLI runner with allowlist (§15) and exec log.
  5. Implement Python runner stub (tasks/example.py).
  6. Write result uploader (multipart) and finalize call.
  7. Add systemd unit (§6) and basic install script.
  8. Test end-to-end against mock_coordinator.py (§16).
  9. Document log fields + troubleshooting card.
  10. Add optional /healthz endpoint.