aitbc/docs/reference/bootstrap/coordinator_api.md
2025-12-22 10:33:23 +01:00


coordinator-api.md

Central API that orchestrates jobs from clients to miners, tracks lifecycle, validates results, and (later) settles AITokens.
Stage 1 (MVP): no blockchain, no pool hub — just client ⇄ coordinator ⇄ miner.

1) Goals & Non-Goals

Goals (MVP)

  • Accept computation jobs from clients.
  • Match jobs to eligible miners.
  • Track job state machine (QUEUED → RUNNING → COMPLETED/FAILED/CANCELED/EXPIRED).
  • Stream results back to clients; store minimal metadata.
  • Provide a clean, typed API (OpenAPI/Swagger).
  • Simple auth (API keys) + idempotency + rate limits.
  • Minimal persistence (SQLite/Postgres) with straightforward SQL (no migrations tooling).

Non-Goals (MVP)

  • Token minting/settlement (stub hooks only).
  • Miner marketplace, staking, slashing, reputation (placeholders).
  • Pool hub coordination (future stage).

2) Tech Stack

  • Python 3.12, FastAPI, Uvicorn
  • Pydantic for schemas
  • SQL via sqlite3 or Postgres (user can switch later)
  • Redis (optional) for queueing; MVP can start with in-DB FIFO
  • HTTP + WebSocket (for miner heartbeats / job streaming)

Debian 12 target. Run under systemd later.


3) Directory Layout (WindSurf Workspace)

coordinator-api/
├─ app/
│  ├─ main.py                  # FastAPI init, lifespan, routers
│  ├─ config.py                # env parsing
│  ├─ deps.py                  # auth, rate-limit deps
│  ├─ db.py                    # simple DB layer (sqlite/postgres)
│  ├─ matching.py              # job→miner selection
│  ├─ queue.py                 # enqueue/dequeue logic
│  ├─ settlement.py            # stubs for token accounting
│  ├─ models.py                # Pydantic request/response schemas
│  ├─ states.py                # state machine + transitions
│  ├─ routers/
│  │  ├─ client.py             # /v1/jobs (submit/status/result/cancel)
│  │  ├─ miner.py              # /v1/miners (register/heartbeat/poll/submit/fail)
│  │  └─ admin.py              # /v1/admin (stats)
│  └─ ws/
│     ├─ miner.py              # WS for miner heartbeats / job stream (optional)
│     └─ client.py             # WS for client result stream (optional)
├─ tests/
│  ├─ test_client_flow.http    # REST client flow (HTTP file)
│  └─ test_miner_flow.http     # REST miner flow
├─ .env.example
├─ pyproject.toml
└─ README.md

4) Environment (.env)

APP_ENV=dev
APP_HOST=127.0.0.1
APP_PORT=8011
DATABASE_URL=sqlite:///./coordinator.db
# or: DATABASE_URL=postgresql://user:pass@localhost:5432/aitbc

# Auth
CLIENT_API_KEYS=client_dev_key_1,client_dev_key_2
MINER_API_KEYS=miner_dev_key_1,miner_dev_key_2
ADMIN_API_KEYS=admin_dev_key_1

# Security
HMAC_SECRET=change_me
ALLOW_ORIGINS=*

# Queue
JOB_TTL_SECONDS=900
HEARTBEAT_INTERVAL_SECONDS=10
HEARTBEAT_TIMEOUT_SECONDS=30
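
A minimal sketch of how config.py might parse these variables. The Settings dataclass, load_settings and the _csv helper are hypothetical names, not part of the original design; the non-secret defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _csv(value: str) -> frozenset:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return frozenset(item.strip() for item in value.split(",") if item.strip())


@dataclass(frozen=True)
class Settings:
    app_env: str
    app_host: str
    app_port: int
    database_url: str
    client_api_keys: frozenset
    miner_api_keys: frozenset
    admin_api_keys: frozenset
    hmac_secret: str
    allow_origins: frozenset
    job_ttl_seconds: int
    heartbeat_interval_seconds: int
    heartbeat_timeout_seconds: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    return Settings(
        app_env=env.get("APP_ENV", "dev"),
        app_host=env.get("APP_HOST", "127.0.0.1"),
        app_port=int(env.get("APP_PORT", "8011")),
        database_url=env.get("DATABASE_URL", "sqlite:///./coordinator.db"),
        # Key lists default to empty: no key is accepted unless configured.
        client_api_keys=_csv(env.get("CLIENT_API_KEYS", "")),
        miner_api_keys=_csv(env.get("MINER_API_KEYS", "")),
        admin_api_keys=_csv(env.get("ADMIN_API_KEYS", "")),
        hmac_secret=env.get("HMAC_SECRET", ""),
        allow_origins=_csv(env.get("ALLOW_ORIGINS", "*")),
        job_ttl_seconds=int(env.get("JOB_TTL_SECONDS", "900")),
        heartbeat_interval_seconds=int(env.get("HEARTBEAT_INTERVAL_SECONDS", "10")),
        heartbeat_timeout_seconds=int(env.get("HEARTBEAT_TIMEOUT_SECONDS", "30")),
    )
```

Passing a plain dict instead of os.environ keeps the parser easy to test.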

5) Core Data Model (conceptual)

Job

  • job_id (uuid)
  • client_id (from API key)
  • requested_at, expires_at
  • payload (opaque JSON / bytes ref)
  • constraints (gpu, cuda, mem, model, max_price, region)
  • state (QUEUED|RUNNING|COMPLETED|FAILED|CANCELED|EXPIRED)
  • assigned_miner_id (nullable)
  • result_ref (blob path / inline json)
  • error (nullable)
  • cost_estimate (optional)

Miner

  • miner_id (from API key)
  • capabilities (gpu, cuda, vram, models[], region)
  • heartbeat_at
  • status (ONLINE|OFFLINE|DRAINING)
  • concurrency (int), inflight (int)

WorkerSession

  • session_id, miner_id, job_id, started_at, ended_at, exit_reason
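
One possible SQLite mapping of the three entities above, as a sketch only. The table names, column types, the index and init_db are assumptions; JSON fields are stored as TEXT and timestamps as ISO 8601 strings.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id            TEXT PRIMARY KEY,
    client_id         TEXT NOT NULL,
    requested_at      TEXT NOT NULL,
    expires_at        TEXT NOT NULL,
    payload           TEXT NOT NULL,              -- JSON
    constraints       TEXT NOT NULL DEFAULT '{}', -- JSON
    state             TEXT NOT NULL DEFAULT 'QUEUED'
        CHECK (state IN ('QUEUED','RUNNING','COMPLETED','FAILED','CANCELED','EXPIRED')),
    assigned_miner_id TEXT,
    result_ref        TEXT,
    error             TEXT,
    cost_estimate     REAL
);
CREATE TABLE IF NOT EXISTS miners (
    miner_id     TEXT PRIMARY KEY,
    capabilities TEXT NOT NULL DEFAULT '{}',      -- JSON
    heartbeat_at TEXT,
    status       TEXT NOT NULL DEFAULT 'OFFLINE'
        CHECK (status IN ('ONLINE','OFFLINE','DRAINING')),
    concurrency  INTEGER NOT NULL DEFAULT 1,
    inflight     INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS worker_sessions (
    session_id  TEXT PRIMARY KEY,
    miner_id    TEXT NOT NULL REFERENCES miners(miner_id),
    job_id      TEXT NOT NULL REFERENCES jobs(job_id),
    started_at  TEXT NOT NULL,
    ended_at    TEXT,
    exit_reason TEXT
);
-- Supports the FIFO scan over QUEUED jobs.
CREATE INDEX IF NOT EXISTS idx_jobs_state_requested ON jobs(state, requested_at);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The CHECK constraints reject unknown states at the database level, which backs up the state machine in application code.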

6) State Machine

QUEUED
  -> RUNNING (assigned to miner)
  -> CANCELED (client)
  -> EXPIRED (ttl)

RUNNING
  -> COMPLETED (miner submit_result)
  -> FAILED    (miner fail / timeout)
  -> CANCELED  (client)
  -> QUEUED    (requeue when the assigned miner goes OFFLINE; see section 16)
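
A sketch of how states.py could encode these transitions as a lookup table. JobState, TRANSITIONS, InvalidTransition and transition are hypothetical names. The table also includes the RUNNING -> QUEUED requeue path described in section 16.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    EXPIRED = "EXPIRED"


# Allowed targets per source state. States without an entry are terminal.
TRANSITIONS = {
    JobState.QUEUED: frozenset({JobState.RUNNING, JobState.CANCELED, JobState.EXPIRED}),
    JobState.RUNNING: frozenset({
        JobState.COMPLETED,
        JobState.FAILED,
        JobState.CANCELED,
        JobState.QUEUED,  # requeue when the miner goes OFFLINE (section 16)
    }),
}


class InvalidTransition(Exception):
    """Raised when a job is asked to move to a state it cannot reach."""


def transition(current: str, target: str) -> JobState:
    """Return the new state, or raise InvalidTransition if the move is not allowed."""
    src, dst = JobState(current), JobState(target)
    if dst not in TRANSITIONS.get(src, frozenset()):
        raise InvalidTransition(f"{src.value} -> {dst.value}")
    return dst
```

A rejected transition maps naturally to the CONFLICT_STATE error code.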

7) Matching (MVP)

  • Filter ONLINE miners by capabilities & region
  • Prefer lowest inflight (simple load)
  • Tiebreak by earliest heartbeat_at or random
  • Lock job row → assign → return to miner
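
The selection rules above, sketched over in-memory objects. The Miner dataclass, satisfies and pick_miner are hypothetical; the capability comparison (exact match on gpu and region, minimum vram_gb, required models as a subset) and the inflight < concurrency capacity check are assumptions, and CUDA version comparison is left out because the document does not define it. Row locking is not shown.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Miner:
    miner_id: str
    status: str                   # ONLINE | OFFLINE | DRAINING
    capabilities: Dict[str, Any]  # gpu, cuda, vram_gb, models, region
    heartbeat_at: float           # epoch seconds
    concurrency: int = 1
    inflight: int = 0


def satisfies(constraints: Dict[str, Any], caps: Dict[str, Any]) -> bool:
    """Assumed semantics: a missing or null constraint matches anything."""
    if constraints.get("gpu") and constraints["gpu"] != caps.get("gpu"):
        return False
    if constraints.get("region") and constraints["region"] != caps.get("region"):
        return False
    if constraints.get("min_vram_gb") and caps.get("vram_gb", 0) < constraints["min_vram_gb"]:
        return False
    wanted = constraints.get("models")
    if wanted and not set(wanted) <= set(caps.get("models", [])):
        return False
    return True


def pick_miner(constraints: Dict[str, Any], miners: List[Miner]) -> Optional[Miner]:
    """Filter ONLINE miners with spare capacity, then prefer the least loaded."""
    eligible = [
        m for m in miners
        if m.status == "ONLINE"
        and m.inflight < m.concurrency
        and satisfies(constraints, m.capabilities)
    ]
    if not eligible:
        return None
    # Lowest inflight first; ties broken by earliest heartbeat_at.
    return min(eligible, key=lambda m: (m.inflight, m.heartbeat_at))
```

Returning None corresponds to the NO_ELIGIBLE_MINER case; the job simply stays QUEUED.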

8) Auth & Rate Limits

  • API keys via X-Api-Key header for client, miner, admin.
  • Optional HMAC (X-Signature) over body with HMAC_SECRET.
  • Idempotency: clients send Idempotency-Key on POST /jobs.
  • Rate limiting: naive per-key window (e.g., 60 req / 60 s).
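
A sketch of the naive per-key window, kept in process memory. FixedWindowLimiter is a hypothetical name; the injectable clock exists only to make the class testable. State is lost on restart and not shared between workers, which seems acceptable for the MVP.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per API key in each fixed time window."""

    def __init__(
        self,
        limit: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        # api_key -> (window index, requests seen in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, api_key: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._windows.get(api_key, (window, 0))
        if seen_window != window:
            # A new window started: reset the counter.
            seen_window, count = window, 0
        if count >= self.limit:
            self._windows[api_key] = (seen_window, count)
            return False
        self._windows[api_key] = (seen_window, count + 1)
        return True
```

A dependency in deps.py could call allow() with the caller's API key and answer with RATE_LIMITED when it returns False.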

9) REST API

Client

  • POST /v1/jobs
    • Create job. Returns job_id.
  • GET /v1/jobs/{job_id}
    • Job status & metadata.
  • GET /v1/jobs/{job_id}/result
    • Result (200 when ready, 425 if not ready).
  • POST /v1/jobs/{job_id}/cancel
    • Cancel if QUEUED or RUNNING (best effort).

Miner

  • POST /v1/miners/register
    • Upsert miner capabilities; set ONLINE.
  • POST /v1/miners/heartbeat
    • Touch heartbeat_at, report inflight.
  • POST /v1/miners/poll
    • Long-poll for next job → returns a job or 204.
  • POST /v1/miners/{job_id}/start
    • Confirm start (optional if poll implies start).
  • POST /v1/miners/{job_id}/result
    • Submit result; transitions to COMPLETED.
  • POST /v1/miners/{job_id}/fail
    • Submit failure; transitions to FAILED.
  • POST /v1/miners/drain
    • Graceful stop accepting new jobs.

Admin

  • GET /v1/admin/stats
    • Queue depth, miners online, success rates, avg latency.
  • GET /v1/admin/jobs?state=&limit=...
  • GET /v1/admin/miners

Error Shape

{ "error": { "code": "STRING_CODE", "message": "human readable", "details": {} } }

Common codes: UNAUTHORIZED_KEY, RATE_LIMITED, INVALID_PAYLOAD, NO_ELIGIBLE_MINER, JOB_NOT_FOUND, JOB_NOT_READY, CONFLICT_STATE.
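
A small helper that builds this envelope, as a sketch. error_body and STATUS_BY_CODE are hypothetical names. Only the 425 for JOB_NOT_READY comes from this document; the other HTTP status codes in the mapping are assumptions.

```python
from typing import Any, Dict, Optional

# HTTP status per error code. Only JOB_NOT_READY -> 425 is stated in this
# document; every other value here is an assumption.
STATUS_BY_CODE = {
    "UNAUTHORIZED_KEY": 401,
    "RATE_LIMITED": 429,
    "INVALID_PAYLOAD": 422,
    "NO_ELIGIBLE_MINER": 503,
    "JOB_NOT_FOUND": 404,
    "JOB_NOT_READY": 425,
    "CONFLICT_STATE": 409,
}


def error_body(
    code: str, message: str, details: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return the JSON error envelope used by every endpoint."""
    return {"error": {"code": code, "message": message, "details": details or {}}}
```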


10) WebSockets (optional MVP+)

  • WS /v1/ws/miner?api_key=...
    • Server → miner: job.assigned
    • Miner → server: heartbeat, result, fail
  • WS /v1/ws/client?job_id=...&api_key=...
    • Server → client: state.changed, result.ready

Fallback remains HTTP long-polling.


11) Result Storage

  • Inline JSON if ≤ 1 MB.
  • For larger payloads: store to disk path (e.g., /var/lib/coordinator/results/{job_id}) and return result_ref.
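
The inline-versus-disk rule could look like the following sketch. store_result and INLINE_LIMIT_BYTES are hypothetical names, 1 MB is interpreted as 1 MiB of encoded JSON, and the returned dict shape is an assumption. The job_id is parsed as a UUID before it is used as a file name so that it cannot escape the results directory.

```python
import json
import uuid
from pathlib import Path
from typing import Any, Dict

INLINE_LIMIT_BYTES = 1024 * 1024  # assumption: "1 MB" read as 1 MiB of encoded JSON


def store_result(job_id: str, result: Any, base_dir: str) -> Dict[str, Any]:
    """Keep small results inline; write larger ones to base_dir/<job_id>."""
    safe_id = str(uuid.UUID(job_id))  # raises ValueError for anything but a UUID
    encoded = json.dumps(result, separators=(",", ":")).encode("utf-8")
    if len(encoded) <= INLINE_LIMIT_BYTES:
        return {"inline": result, "result_ref": None}
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    path = base / safe_id
    path.write_bytes(encoded)
    return {"inline": None, "result_ref": str(path)}
```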

12) Settlement Hooks (stub)

settlement.py exposes:

  • record_usage(job, miner)
  • quote_cost(job)

Both are later wired to AIToken mint/transfer when the blockchain lands.

13) Minimal FastAPI Skeleton

# app/main.py
from fastapi import FastAPI
from app.routers import client, miner, admin

def create_app():
    app = FastAPI(title="AITBC Coordinator API", version="0.1.0")
    app.include_router(client.router, prefix="/v1")
    app.include_router(miner.router, prefix="/v1")
    app.include_router(admin.router, prefix="/v1")
    return app

app = create_app()

# app/models.py
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional

class Constraints(BaseModel):
    gpu: Optional[str] = None
    cuda: Optional[str] = None
    min_vram_gb: Optional[int] = None
    models: Optional[List[str]] = None
    region: Optional[str] = None
    max_price: Optional[float] = None

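class JobCreate(BaseModel):
    payload: Dict[str, Any]
    # default_factory builds a fresh Constraints for each request
    constraints: Constraints = Field(default_factory=Constraints)
    ttl_seconds: int = Field(default=900, gt=0)  # mirrors JOB_TTL_SECONDS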

class JobView(BaseModel):
    job_id: str
    state: str
    assigned_miner_id: Optional[str] = None
    requested_at: str
    expires_at: str
    error: Optional[str] = None

class MinerRegister(BaseModel):
    capabilities: Dict[str, Any]
    concurrency: int = 1
    region: Optional[str] = None

class PollRequest(BaseModel):
    max_wait_seconds: int = 15

class AssignedJob(BaseModel):
    job_id: str
    payload: Dict[str, Any]

# app/routers/client.py
from fastapi import APIRouter, Depends, HTTPException
from app.models import JobCreate, JobView
from app.deps import require_client_key

router = APIRouter(tags=["client"])

@router.post("/jobs", response_model=JobView)
def submit_job(req: JobCreate, client_id: str = Depends(require_client_key)):
    # enqueue + return JobView
    ...

@router.get("/jobs/{job_id}", response_model=JobView)
def get_job(job_id: str, client_id: str = Depends(require_client_key)):
    ...

# app/routers/miner.py
from fastapi import APIRouter, Depends, Response
from app.models import MinerRegister, PollRequest, AssignedJob
from app.deps import require_miner_key

router = APIRouter(tags=["miner"])

@router.post("/miners/register")
def register(req: MinerRegister, miner_id: str = Depends(require_miner_key)):
    # upsert capabilities, mark miner ONLINE
    ...

@router.post(
    "/miners/poll",
    response_model=AssignedJob,
    responses={204: {"description": "No job available"}},
)
def poll(req: PollRequest, miner_id: str = Depends(require_miner_key)):
    # Try to dequeue a job and return it as AssignedJob (200).
    # If none arrives within req.max_wait_seconds: return Response(status_code=204)
    ...

Run:

uvicorn app.main:app --host 127.0.0.1 --port 8011 --reload

OpenAPI: http://127.0.0.1:8011/docs


14) Matching & Queue Pseudocode

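def match_next_job(miner):
    # Pseudocode: `db` stands for the data layer in app/db.py.
    # Select and assign inside one transaction so two miners cannot claim the same job.
    with db.txn():
        queued = db.jobs.filter(state="QUEUED").order_by("requested_at")
        job = next(
            (j for j in queued if j.constraints.satisfied_by(miner.capabilities)),
            None,
        )
        if job is None:
            return None
        db.jobs.assign(job.job_id, miner.miner_id)
        db.states.transition(job.job_id, "RUNNING")
    return job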

15) CURL Examples

Client creates a job

curl -sX POST http://127.0.0.1:8011/v1/jobs \
  -H 'X-Api-Key: client_dev_key_1' \
  -H 'Idempotency-Key: 7d4a...' \
  -H 'Content-Type: application/json' \
  -d '{
        "payload": {"task":"sum","a":2,"b":3},
        "constraints": {"gpu": null, "region": "eu-central"}
      }'

Miner registers + polls

curl -sX POST http://127.0.0.1:8011/v1/miners/register \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"capabilities":{"gpu":"RTX4060Ti","cuda":"12.3","vram_gb":16},"concurrency":2,"region":"eu-central"}'

curl -i -sX POST http://127.0.0.1:8011/v1/miners/poll \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"max_wait_seconds":10}'

Miner submits result

curl -sX POST http://127.0.0.1:8011/v1/miners/<JOB_ID>/result \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"result":{"sum":5},"metrics":{"latency_ms":42}}'

Client fetches result

curl -s http://127.0.0.1:8011/v1/jobs/<JOB_ID>/result \
  -H 'X-Api-Key: client_dev_key_1'

16) Timeouts & Health

  • Job TTL: auto-expire QUEUED after JOB_TTL_SECONDS.
  • Heartbeat: miners post every HEARTBEAT_INTERVAL_SECONDS.
  • Miner OFFLINE if no heartbeat for HEARTBEAT_TIMEOUT_SECONDS.
  • Requeue: RUNNING jobs from OFFLINE miners → back to QUEUED.
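
The four rules above can be sketched as one periodic sweep over in-memory records. The sweep function and the dict shapes are hypothetical, timestamps are epoch seconds, and resetting inflight to 0 for an OFFLINE miner is an assumption. A job that is requeued after its expires_at is expired in the same pass, which is also an assumption.

```python
from typing import Any, Dict, List


def sweep(
    jobs: List[Dict[str, Any]],
    miners: List[Dict[str, Any]],
    now: float,
    heartbeat_timeout: float = 30.0,
) -> Dict[str, List[str]]:
    """Mark silent miners OFFLINE, requeue their jobs, expire stale QUEUED jobs.

    Mutates the records in place and returns the ids that changed.
    """
    offline: List[str] = []
    requeued: List[str] = []
    expired: List[str] = []

    for miner in miners:
        silent_for = now - miner["heartbeat_at"]
        if miner["status"] != "OFFLINE" and silent_for > heartbeat_timeout:
            miner["status"] = "OFFLINE"
            miner["inflight"] = 0
            offline.append(miner["miner_id"])

    offline_ids = {m["miner_id"] for m in miners if m["status"] == "OFFLINE"}

    for job in jobs:
        if job["state"] == "RUNNING" and job["assigned_miner_id"] in offline_ids:
            job["state"] = "QUEUED"
            job["assigned_miner_id"] = None
            requeued.append(job["job_id"])
        if job["state"] == "QUEUED" and now >= job["expires_at"]:
            job["state"] = "EXPIRED"
            expired.append(job["job_id"])

    return {"offline": offline, "requeued": requeued, "expired": expired}
```

Running this on a timer shorter than HEARTBEAT_TIMEOUT_SECONDS keeps detection latency bounded.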

17) Security Notes

  • Validate payload size & type; enforce max 1 MB inline.
  • Optional HMAC signature for tamper detection.
  • Sanitize/validate miner-reported capabilities.
  • Log every state transition (append-only).
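
A sketch of the optional body signature using the standard library. The document does not name a hash function or encoding, so HMAC-SHA256 with a lowercase hex digest is an assumption here; sign_body and verify_signature are hypothetical helpers.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Compare the X-Signature value against the expected digest in constant time."""
    expected = sign_body(secret, body)
    # Compare as bytes so a non-ASCII header value fails cleanly instead of raising.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The signature must be computed over the exact bytes received, before any JSON parsing or re-serialization.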

18) Admin Metrics (MVP)

  • Queue depth, running count
  • Miner online/offline, inflight
  • P50/P95 job latency
  • Success/fail/cancel rates (windowed)
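
For the latency figures, a nearest-rank percentile is probably enough at MVP scale. The percentile and latency_summary helpers below are hypothetical, and the choice of the nearest-rank method is an assumption since the document does not specify one.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """P50/P95 over a window of completed-job latencies in milliseconds."""
    return {"p50": percentile(latencies_ms, 50), "p95": percentile(latencies_ms, 95)}
```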

19) Future Stages

  • Blockchain layer: mint on verified compute; tie to record_usage.
  • Pool hub: multi-coordinator balancing; marketplace.
  • Reputation: miner scoring, penalty, slashing.
  • Bidding: price discovery; client max price.

20) Checklist (WindSurf)

  1. Create repo structure from section 3.
  2. Implement .env & config.py keys from section 4.
  3. Add models.py, states.py, deps.py (auth, rate limit).
  4. Implement DB tables for Job, Miner, WorkerSession.
  5. Implement queue.py and matching.py.
  6. Wire client and miner routers (MVP endpoints).
  7. Add admin stats (basic counts).
  8. Add OpenAPI tags, descriptions.
  9. Add .http test files under tests/ that mirror the curl examples in section 15.
  10. Systemd unit + Nginx proxy (later).