feat: add marketplace metrics, privacy features, and service registry endpoints

- Add Prometheus metrics for marketplace API throughput and error rates with new dashboard panels
- Implement confidential transaction models with encryption support and access control
- Add key management system with registration, rotation, and audit logging
- Create services and registry routers for service discovery and management
- Integrate ZK proof generation for privacy-preserving receipts
- Add metrics instru
# coordinator-api.md
Central API that orchestrates **jobs** from clients to **miners**, tracks lifecycle, validates results, and (later) settles AITokens.
**Stage 1 (MVP):** no blockchain, no pool hub — just client ⇄ coordinator ⇄ miner.
## 1) Goals & Non-Goals
**Goals (MVP)**
- Accept computation jobs from clients.
- Match jobs to eligible miners.
- Track job state machine (QUEUED → RUNNING → COMPLETED/FAILED/CANCELED/EXPIRED).
- Stream results back to clients; store minimal metadata.
- Provide a clean, typed API (OpenAPI/Swagger).
- Simple auth (API keys) + idempotency + rate limits.
- Minimal persistence (SQLite/Postgres) with straightforward SQL (no migrations tooling).
**Non-Goals (MVP)**
- Token minting/settlement (stub hooks only).
- Miner marketplace, staking, slashing, reputation (placeholders).
- Pool hub coordination (future stage).
---
## 2) Tech Stack
- **Python 3.12**, **FastAPI**, **Uvicorn**
- **Pydantic** for schemas
- **SQL** via `sqlite3` or Postgres (user can switch later)
- **Redis (optional)** for queueing; MVP can start with in-DB FIFO
- **HTTP + WebSocket** (for miner heartbeats / job streaming)
> Debian 12 target. Run under **systemd** later.
---
## 3) Directory Layout (WindSurf Workspace)
```
coordinator-api/
├─ app/
│ ├─ main.py # FastAPI init, lifespan, routers
│ ├─ config.py # env parsing
│ ├─ deps.py # auth, rate-limit deps
│ ├─ db.py # simple DB layer (sqlite/postgres)
│ ├─ matching.py # job→miner selection
│ ├─ queue.py # enqueue/dequeue logic
│ ├─ settlement.py # stubs for token accounting
│ ├─ models.py # Pydantic request/response schemas
│ ├─ states.py # state machine + transitions
│ ├─ routers/
│ │ ├─ client.py # /v1/jobs (submit/status/result/cancel)
│ │ ├─ miner.py # /v1/miners (register/heartbeat/poll/submit/fail)
│ │ └─ admin.py # /v1/admin (stats)
│ └─ ws/
│ ├─ miner.py # WS for miner heartbeats / job stream (optional)
│ └─ client.py # WS for client result stream (optional)
├─ tests/
│ ├─ test_client_flow.http # REST client flow (HTTP file)
│ └─ test_miner_flow.http # REST miner flow
├─ .env.example
├─ pyproject.toml
└─ README.md
```
---
## 4) Environment (.env)
```
APP_ENV=dev
APP_HOST=127.0.0.1
APP_PORT=8011
DATABASE_URL=sqlite:///./coordinator.db
# or: DATABASE_URL=postgresql://user:pass@localhost:5432/aitbc
# Auth
CLIENT_API_KEYS=client_dev_key_1,client_dev_key_2
MINER_API_KEYS=miner_dev_key_1,miner_dev_key_2
ADMIN_API_KEYS=admin_dev_key_1
# Security
HMAC_SECRET=change_me
ALLOW_ORIGINS=*
# Queue
JOB_TTL_SECONDS=900
HEARTBEAT_INTERVAL_SECONDS=10
HEARTBEAT_TIMEOUT_SECONDS=30
```
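A minimal `config.py` sketch that parses these keys (it assumes the variables are already exported into the process environment, e.g. via `python-dotenv` or a systemd `EnvironmentFile`; the constant names mirror the `.env` above and are otherwise an assumption):
```python
# app/config.py (sketch) -- env parsing for the keys listed above
import os


def _keys(name: str) -> set[str]:
    """Split a comma-separated env var into a set of non-empty keys."""
    return {k.strip() for k in os.getenv(name, "").split(",") if k.strip()}


APP_ENV = os.getenv("APP_ENV", "dev")
APP_HOST = os.getenv("APP_HOST", "127.0.0.1")
APP_PORT = int(os.getenv("APP_PORT", "8011"))
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./coordinator.db")

CLIENT_API_KEYS = _keys("CLIENT_API_KEYS")
MINER_API_KEYS = _keys("MINER_API_KEYS")
ADMIN_API_KEYS = _keys("ADMIN_API_KEYS")

HMAC_SECRET = os.getenv("HMAC_SECRET", "change_me")
ALLOW_ORIGINS = os.getenv("ALLOW_ORIGINS", "*").split(",")

JOB_TTL_SECONDS = int(os.getenv("JOB_TTL_SECONDS", "900"))
HEARTBEAT_INTERVAL_SECONDS = int(os.getenv("HEARTBEAT_INTERVAL_SECONDS", "10"))
HEARTBEAT_TIMEOUT_SECONDS = int(os.getenv("HEARTBEAT_TIMEOUT_SECONDS", "30"))
```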
---
## 5) Core Data Model (conceptual)
**Job**
- `job_id` (uuid)
- `client_id` (from API key)
- `requested_at`, `expires_at`
- `payload` (opaque JSON / bytes ref)
- `constraints` (gpu, cuda, mem, model, max_price, region)
- `state` (QUEUED|RUNNING|COMPLETED|FAILED|CANCELED|EXPIRED)
- `assigned_miner_id` (nullable)
- `result_ref` (blob path / inline json)
- `error` (nullable)
- `cost_estimate` (optional)
**Miner**
- `miner_id` (from API key)
- `capabilities` (gpu, cuda, vram, models[], region)
- `heartbeat_at`
- `status` (ONLINE|OFFLINE|DRAINING)
- `concurrency` (int), `inflight` (int)
**WorkerSession**
- `session_id`, `miner_id`, `job_id`, `started_at`, `ended_at`, `exit_reason`
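One way to map this model onto the `db.py` layer, sketched for SQLite (column names mostly mirror the fields above; types, defaults, and the JSON-as-TEXT choice are illustrative, not fixed):
```python
# app/db.py (sketch) -- minimal SQLite layer for the conceptual model above
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id            TEXT PRIMARY KEY,
    client_id         TEXT NOT NULL,
    requested_at      TEXT NOT NULL,
    expires_at        TEXT NOT NULL,
    payload           TEXT NOT NULL,              -- opaque JSON
    constraints_json  TEXT NOT NULL DEFAULT '{}',
    state             TEXT NOT NULL DEFAULT 'QUEUED',
    assigned_miner_id TEXT,
    result_ref        TEXT,
    error             TEXT,
    cost_estimate     REAL
);
CREATE TABLE IF NOT EXISTS miners (
    miner_id     TEXT PRIMARY KEY,
    capabilities TEXT NOT NULL DEFAULT '{}',      -- JSON blob as reported
    heartbeat_at TEXT,
    status       TEXT NOT NULL DEFAULT 'ONLINE',
    concurrency  INTEGER NOT NULL DEFAULT 1,
    inflight     INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS worker_sessions (
    session_id  TEXT PRIMARY KEY,
    miner_id    TEXT NOT NULL,
    job_id      TEXT NOT NULL,
    started_at  TEXT NOT NULL,
    ended_at    TEXT,
    exit_reason TEXT
);
"""


def connect(path: str = "coordinator.db") -> sqlite3.Connection:
    """Open a connection and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn
```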
---
## 6) State Machine
```
QUEUED
-> RUNNING (assigned to miner)
-> CANCELED (client)
-> EXPIRED (ttl)
RUNNING
-> COMPLETED (miner submit_result)
-> FAILED (miner fail / timeout)
-> CANCELED (client)
```
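A `states.py` sketch that encodes exactly these edges (the `ALLOWED` table, `InvalidTransition`, and `transition` names are assumptions):
```python
# app/states.py (sketch) -- legal transitions from the diagram above
ALLOWED = {
    "QUEUED": {"RUNNING", "CANCELED", "EXPIRED"},
    "RUNNING": {"COMPLETED", "FAILED", "CANCELED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELED": set(),
    "EXPIRED": set(),
}


class InvalidTransition(Exception):
    """Raised when a job is asked to move along an edge the diagram forbids."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the edge is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```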
---
## 7) Matching (MVP)
- Filter ONLINE miners by **capabilities** & **region**
- Prefer lowest `inflight` (simple load)
- Tiebreak by earliest `heartbeat_at` or random
- Lock job row → assign → return to miner
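The capability filter can be a plain dictionary check in `matching.py`. The sketch below matches against the `Constraints` fields from section 13; the `satisfies` name and the exact comparison rules (equality for gpu/cuda/region, subset for models) are one simple interpretation, not a spec:
```python
# app/matching.py (sketch) -- does a miner's reported capabilities meet a job's constraints?
from typing import Any, Dict


def satisfies(constraints: Dict[str, Any], capabilities: Dict[str, Any]) -> bool:
    """Return True if every requested constraint is met by the capabilities dict."""
    if constraints.get("gpu") and capabilities.get("gpu") != constraints["gpu"]:
        return False
    if constraints.get("cuda") and capabilities.get("cuda") != constraints["cuda"]:
        return False
    if constraints.get("min_vram_gb") and capabilities.get("vram_gb", 0) < constraints["min_vram_gb"]:
        return False
    if constraints.get("models"):
        offered = set(capabilities.get("models", []))
        if not set(constraints["models"]) <= offered:
            return False
    if constraints.get("region") and capabilities.get("region") != constraints["region"]:
        return False
    return True
```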
---
## 8) Auth & Rate Limits
- **API keys** via `X-Api-Key` header for `client`, `miner`, `admin`.
- Optional **HMAC** (`X-Signature`) over body with `HMAC_SECRET`.
- **Idempotency**: clients send `Idempotency-Key` on **POST /jobs**.
- **Rate limiting**: naive per-key window (e.g., 60 req / 60 s).
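A `deps.py` sketch covering the API-key check and the naive per-key window (in-memory, single-process only; it reuses the `config.py` sketch from section 4, and `require_admin_key` would follow the same pattern):
```python
# app/deps.py (sketch) -- API-key auth plus a naive in-memory rate limiter
import time
from collections import defaultdict, deque

from fastapi import Header, HTTPException

from app import config

_WINDOW_SECONDS = 60
_MAX_REQUESTS = 60
_hits: dict[str, deque] = defaultdict(deque)


def _rate_limit(key: str) -> None:
    """Sliding window: at most _MAX_REQUESTS per key per _WINDOW_SECONDS."""
    now = time.monotonic()
    hits = _hits[key]
    while hits and now - hits[0] > _WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= _MAX_REQUESTS:
        raise HTTPException(429, detail={"code": "RATE_LIMITED", "message": "slow down"})
    hits.append(now)


def require_client_key(x_api_key: str = Header(...)) -> str:
    if x_api_key not in config.CLIENT_API_KEYS:
        raise HTTPException(401, detail={"code": "UNAUTHORIZED_KEY", "message": "bad client key"})
    _rate_limit(x_api_key)
    return x_api_key  # doubles as client_id in the MVP


def require_miner_key(x_api_key: str = Header(...)) -> str:
    if x_api_key not in config.MINER_API_KEYS:
        raise HTTPException(401, detail={"code": "UNAUTHORIZED_KEY", "message": "bad miner key"})
    _rate_limit(x_api_key)
    return x_api_key  # doubles as miner_id in the MVP
```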
---
## 9) REST API
### Client
- `POST /v1/jobs`
- Create job. Returns `job_id`.
- `GET /v1/jobs/{job_id}`
- Job status & metadata.
- `GET /v1/jobs/{job_id}/result`
- Result (200 when ready, 425 if not ready).
- `POST /v1/jobs/{job_id}/cancel`
- Cancel if QUEUED or RUNNING (best effort).
### Miner
- `POST /v1/miners/register`
- Upsert miner capabilities; set ONLINE.
- `POST /v1/miners/heartbeat`
- Touch `heartbeat_at`, report `inflight`.
- `POST /v1/miners/poll`
- Long-poll for next job → returns a job or 204.
- `POST /v1/miners/{job_id}/start`
- Confirm start (optional if `poll` implies start).
- `POST /v1/miners/{job_id}/result`
- Submit result; transitions to COMPLETED.
- `POST /v1/miners/{job_id}/fail`
- Submit failure; transitions to FAILED.
- `POST /v1/miners/drain`
- Graceful stop accepting new jobs.
### Admin
- `GET /v1/admin/stats`
- Queue depth, miners online, success rates, avg latency.
- `GET /v1/admin/jobs?state=&limit=...`
- `GET /v1/admin/miners`
**Error Shape**
```json
{ "error": { "code": "STRING_CODE", "message": "human readable", "details": {} } }
```
Common codes: `UNAUTHORIZED_KEY`, `RATE_LIMITED`, `INVALID_PAYLOAD`, `NO_ELIGIBLE_MINER`, `JOB_NOT_FOUND`, `JOB_NOT_READY`, `CONFLICT_STATE`.
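One way to emit this shape consistently is a FastAPI exception handler that wraps `HTTPException.detail`; the sketch below assumes handlers raise either a plain string or a `{"code", "message", "details"}` dict, and `install_error_shape` is an assumed helper called from `create_app`:
```python
# app/main.py addition (sketch) -- normalize HTTPException into the error shape above
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from starlette.exceptions import HTTPException as StarletteHTTPException


def install_error_shape(app: FastAPI) -> None:
    @app.exception_handler(StarletteHTTPException)
    async def http_error(_req: Request, exc: StarletteHTTPException) -> JSONResponse:
        if isinstance(exc.detail, dict):
            body = {
                "code": exc.detail.get("code", "ERROR"),
                "message": exc.detail.get("message", ""),
                "details": exc.detail.get("details", {}),
            }
        else:
            body = {"code": "ERROR", "message": str(exc.detail), "details": {}}
        return JSONResponse(status_code=exc.status_code, content={"error": body})
```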
---
## 10) WebSockets (optional MVP+)
- `WS /v1/ws/miner?api_key=...`
- Server → miner: `job.assigned`
- Miner → server: `heartbeat`, `result`, `fail`
- `WS /v1/ws/client?job_id=...&api_key=...`
- Server → client: `state.changed`, `result.ready`
Fallback remains HTTP long-polling.
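A minimal sketch of the miner channel (mounted under `/v1` like the REST routers; query-string auth as described above, with only the heartbeat message handled and the rest outlined as comments):
```python
# app/ws/miner.py (sketch) -- optional WS channel for miners
from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from app import config

router = APIRouter()


@router.websocket("/ws/miner")
async def miner_channel(ws: WebSocket, api_key: str):
    if api_key not in config.MINER_API_KEYS:
        await ws.close(code=1008)  # policy violation: unknown key
        return
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_json()
            if msg.get("type") == "heartbeat":
                # touch heartbeat_at, record inflight ...
                await ws.send_json({"type": "ack"})
            # "result" / "fail" messages would mirror the REST endpoints;
            # "job.assigned" is pushed from the matching loop via send_json
    except WebSocketDisconnect:
        pass  # miner goes OFFLINE after HEARTBEAT_TIMEOUT_SECONDS (see section 16)
```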
---
## 11) Result Storage
- **Inline JSON** if ≤ 1 MB.
- For larger payloads: store to disk path (e.g., `/var/lib/coordinator/results/{job_id}`) and return `result_ref`.
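A `store_result` sketch implementing that rule (the module name and the exact 1,000,000-byte threshold constant are assumptions):
```python
# app/results.py (sketch) -- inline small results, spill larger ones to disk
import json
from pathlib import Path

INLINE_LIMIT_BYTES = 1_000_000                      # "inline JSON if <= 1 MB"
RESULTS_DIR = Path("/var/lib/coordinator/results")  # path from this section


def store_result(job_id: str, result: dict) -> dict:
    """Return {"result": ...} inline, or {"result_ref": "<path>"} for large payloads."""
    blob = json.dumps(result).encode("utf-8")
    if len(blob) <= INLINE_LIMIT_BYTES:
        return {"result": result}
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    path = RESULTS_DIR / job_id
    path.write_bytes(blob)
    return {"result_ref": str(path)}
```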
---
## 12) Settlement Hooks (stub)
`settlement.py` exposes:
- `record_usage(job, miner)`
- `quote_cost(job)`
Later wired to **AIToken** mint/transfer when blockchain lands.
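The stubs can start as no-ops so routers can call them from day one (a sketch; taking plain dicts is an assumption until the DB layer settles):
```python
# app/settlement.py (sketch) -- no-op hooks until the blockchain layer lands
from typing import Any, Dict


def quote_cost(job: Dict[str, Any]) -> float:
    """Placeholder pricing: flat zero cost in the MVP."""
    return 0.0


def record_usage(job: Dict[str, Any], miner: Dict[str, Any]) -> None:
    """Append a usage record; later this is where AIToken mint/transfer hooks in."""
    # MVP: write to an append-only log or a `usage` table; nothing is settled yet.
    pass
```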
---
## 13) Minimal FastAPI Skeleton
```python
# app/main.py
from fastapi import FastAPI
from app.routers import client, miner, admin
def create_app():
    app = FastAPI(title="AITBC Coordinator API", version="0.1.0")
    app.include_router(client.router, prefix="/v1")
    app.include_router(miner.router, prefix="/v1")
    app.include_router(admin.router, prefix="/v1")
    return app

app = create_app()
```
```python
# app/models.py
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional
class Constraints(BaseModel):
    gpu: Optional[str] = None
    cuda: Optional[str] = None
    min_vram_gb: Optional[int] = None
    models: Optional[List[str]] = None
    region: Optional[str] = None
    max_price: Optional[float] = None

class JobCreate(BaseModel):
    payload: Dict[str, Any]
    constraints: Constraints = Field(default_factory=Constraints)
    ttl_seconds: int = 900

class JobView(BaseModel):
    job_id: str
    state: str
    assigned_miner_id: Optional[str] = None
    requested_at: str
    expires_at: str
    error: Optional[str] = None

class MinerRegister(BaseModel):
    capabilities: Dict[str, Any]
    concurrency: int = 1
    region: Optional[str] = None

class PollRequest(BaseModel):
    max_wait_seconds: int = 15

class AssignedJob(BaseModel):
    job_id: str
    payload: Dict[str, Any]
```
```python
# app/routers/client.py
from fastapi import APIRouter, Depends, HTTPException
from app.models import JobCreate, JobView
from app.deps import require_client_key
router = APIRouter(tags=["client"])
@router.post("/jobs", response_model=JobView)
def submit_job(req: JobCreate, client_id: str = Depends(require_client_key)):
    # enqueue + return JobView
    ...

@router.get("/jobs/{job_id}", response_model=JobView)
def get_job(job_id: str, client_id: str = Depends(require_client_key)):
    ...
```
```python
# app/routers/miner.py
from fastapi import APIRouter, Depends
from app.models import MinerRegister, PollRequest, AssignedJob
from app.deps import require_miner_key
router = APIRouter(tags=["miner"])
@router.post("/miners/register")
def register(req: MinerRegister, miner_id: str = Depends(require_miner_key)):
    ...

@router.post("/miners/poll", response_model=AssignedJob, status_code=200)
def poll(req: PollRequest, miner_id: str = Depends(require_miner_key)):
    # try dequeue, else 204
    ...
```
Run:
```bash
uvicorn app.main:app --host 127.0.0.1 --port 8011 --reload
```
OpenAPI: `http://127.0.0.1:8011/docs`
---
## 14) Matching & Queue Pseudocode
```python
def match_next_job(miner):
    eligible = db.jobs.filter(
        state="QUEUED",
        constraints.satisfied_by(miner.capabilities)
    ).order_by("requested_at").first()
    if not eligible:
        return None
    db.txn(lambda:
        db.jobs.assign(eligible.job_id, miner.id) and
        db.states.transition(eligible.job_id, "RUNNING")
    )
    return eligible
```
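Against SQLite this becomes a single transaction. The sketch below is an assumption-laden concretization: it reuses the `satisfies` helper from section 7 and a connection from the section 5 sketch (with `row_factory = sqlite3.Row`), and the `AND state = 'QUEUED'` guard on the UPDATE keeps assignment safe if two requests race:
```python
# app/queue.py (sketch) -- transactional dequeue for SQLite
import json
import sqlite3

from app.matching import satisfies  # capability check from section 7's sketch


def dequeue_for(conn: sqlite3.Connection, miner_id: str, capabilities: dict):
    """Pick the oldest eligible QUEUED job and atomically move it to RUNNING."""
    with conn:  # one transaction: commits on success, rolls back on error
        rows = conn.execute(
            "SELECT job_id, payload, constraints_json FROM jobs "
            "WHERE state = 'QUEUED' ORDER BY requested_at"
        ).fetchall()
        for row in rows:
            if not satisfies(json.loads(row["constraints_json"]), capabilities):
                continue
            conn.execute(
                "UPDATE jobs SET state = 'RUNNING', assigned_miner_id = ? "
                "WHERE job_id = ? AND state = 'QUEUED'",
                (miner_id, row["job_id"]),
            )
            return {"job_id": row["job_id"], "payload": json.loads(row["payload"])}
    return None  # nothing eligible; the poll endpoint answers 204
```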
---
## 15) CURL Examples
**Client creates a job**
```bash
curl -sX POST http://127.0.0.1:8011/v1/jobs \
-H 'X-Api-Key: client_dev_key_1' \
-H 'Idempotency-Key: 7d4a...' \
-H 'Content-Type: application/json' \
-d '{
"payload": {"task":"sum","a":2,"b":3},
"constraints": {"gpu": null, "region": "eu-central"}
}'
```
**Miner registers + polls**
```bash
curl -sX POST http://127.0.0.1:8011/v1/miners/register \
-H 'X-Api-Key: miner_dev_key_1' \
-H 'Content-Type: application/json' \
-d '{"capabilities":{"gpu":"RTX4060Ti","cuda":"12.3","vram_gb":16},"concurrency":2,"region":"eu-central"}'
curl -i -sX POST http://127.0.0.1:8011/v1/miners/poll \
-H 'X-Api-Key: miner_dev_key_1' \
-H 'Content-Type: application/json' \
-d '{"max_wait_seconds":10}'
```
**Miner submits result**
```bash
curl -sX POST http://127.0.0.1:8011/v1/miners/<JOB_ID>/result \
-H 'X-Api-Key: miner_dev_key_1' \
-H 'Content-Type: application/json' \
-d '{"result":{"sum":5},"metrics":{"latency_ms":42}}'
```
**Client fetches result**
```bash
curl -s http://127.0.0.1:8011/v1/jobs/<JOB_ID>/result \
-H 'X-Api-Key: client_dev_key_1'
```
---
## 16) Timeouts & Health
- **Job TTL**: auto-expire QUEUED after `JOB_TTL_SECONDS`.
- **Heartbeat**: miners post every `HEARTBEAT_INTERVAL_SECONDS`.
- **Miner OFFLINE** if no heartbeat for `HEARTBEAT_TIMEOUT_SECONDS`.
- **Requeue**: RUNNING jobs from OFFLINE miners → back to QUEUED.
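A background sweeper sketch applying these rules, meant to be started from the FastAPI lifespan (it assumes timestamps are stored as ISO-8601 UTC strings so they compare lexically, and follows the section 5 table names and the section 4 config sketch):
```python
# app/sweeper.py (sketch) -- periodic TTL / heartbeat enforcement
import asyncio
import sqlite3
from datetime import datetime, timedelta, timezone

from app import config


def sweep_once(conn: sqlite3.Connection) -> None:
    now = datetime.now(timezone.utc).isoformat()
    cutoff = (datetime.now(timezone.utc)
              - timedelta(seconds=config.HEARTBEAT_TIMEOUT_SECONDS)).isoformat()
    with conn:
        # 1) expire stale QUEUED jobs past their TTL
        conn.execute("UPDATE jobs SET state='EXPIRED' "
                     "WHERE state='QUEUED' AND expires_at < ?", (now,))
        # 2) mark silent miners OFFLINE
        conn.execute("UPDATE miners SET status='OFFLINE' WHERE heartbeat_at < ?", (cutoff,))
        # 3) requeue RUNNING jobs whose miner went OFFLINE
        conn.execute(
            "UPDATE jobs SET state='QUEUED', assigned_miner_id=NULL "
            "WHERE state='RUNNING' AND assigned_miner_id IN "
            "(SELECT miner_id FROM miners WHERE status='OFFLINE')"
        )


async def sweeper(conn: sqlite3.Connection) -> None:
    """Run the sweep forever at the heartbeat interval."""
    while True:
        sweep_once(conn)
        await asyncio.sleep(config.HEARTBEAT_INTERVAL_SECONDS)
```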
---
## 17) Security Notes
- Validate `payload` size & type; enforce max 1 MB inline.
- Optional **HMAC** signature for tamper detection.
- Sanitize/validate miner-reported capabilities.
- Log every state transition (append-only).
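A sketch of the HMAC check; the signing convention assumed here (hex SHA-256 HMAC of the raw request body in `X-Signature`) is one reasonable choice, not a fixed spec:
```python
# app/security.py (sketch) -- verify X-Signature = hex(HMAC-SHA256(HMAC_SECRET, raw body))
import hashlib
import hmac

from app import config


def verify_signature(raw_body: bytes, signature: str) -> bool:
    """Constant-time comparison against the hex digest of the request body."""
    expected = hmac.new(config.HMAC_SECRET.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```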
---
## 18) Admin Metrics (MVP)
- Queue depth, running count
- Miner online/offline, inflight
- P50/P95 job latency
- Success/fail/cancel rates (windowed)
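Most of these reduce to a few aggregates over the section 5 tables (a sketch; latency percentiles and windowed rates would need per-job timing columns not modeled above):
```python
# app/routers/admin.py helper (sketch) -- basic counts for /v1/admin/stats
import sqlite3


def basic_stats(conn: sqlite3.Connection) -> dict:
    by_state = dict(conn.execute("SELECT state, COUNT(*) FROM jobs GROUP BY state").fetchall())
    miners_online = conn.execute(
        "SELECT COUNT(*) FROM miners WHERE status = 'ONLINE'"
    ).fetchone()[0]
    inflight = conn.execute("SELECT COALESCE(SUM(inflight), 0) FROM miners").fetchone()[0]
    return {
        "queue_depth": by_state.get("QUEUED", 0),
        "running": by_state.get("RUNNING", 0),
        "miners_online": miners_online,
        "inflight": inflight,
        "jobs_by_state": by_state,
    }
```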
---
## 19) Future Stages
- **Blockchain layer**: mint on verified compute; tie to `record_usage`.
- **Pool hub**: multi-coordinator balancing; marketplace.
- **Reputation**: miner scoring, penalty, slashing.
- **Bidding**: price discovery; client max price.
---
## 20) Checklist (WindSurf)
1. Create repo structure from section **3**.
2. Implement `.env` & `config.py` keys from **4**.
3. Add `models.py`, `states.py`, `deps.py` (auth, rate limit).
4. Implement DB tables for Job, Miner, WorkerSession.
5. Implement `queue.py` and `matching.py`.
6. Wire **client** and **miner** routers (MVP endpoints).
7. Add admin stats (basic counts).
8. Add OpenAPI tags, descriptions.
9. Add curl `.http` test files.
10. Systemd unit + Nginx proxy (later).