# coordinator-api.md

Central API that orchestrates **jobs** from clients to **miners**, tracks lifecycle, validates results, and (later) settles AITokens.

**Stage 1 (MVP):** no blockchain, no pool hub; just client ⇄ coordinator ⇄ miner.

## 1) Goals & Non-Goals

**Goals (MVP)**
- Accept computation jobs from clients.
- Match jobs to eligible miners.
- Track job state machine (QUEUED → RUNNING → COMPLETED/FAILED/CANCELED/EXPIRED).
- Stream results back to clients; store minimal metadata.
- Provide a clean, typed API (OpenAPI/Swagger).
- Simple auth (API keys) + idempotency + rate limits.
- Minimal persistence (SQLite/Postgres) with straightforward SQL (no migrations tooling).

**Non-Goals (MVP)**
- Token minting/settlement (stub hooks only).
- Miner marketplace, staking, slashing, reputation (placeholders).
- Pool hub coordination (future stage).

---

## 2) Tech Stack

- **Python 3.12**, **FastAPI**, **Uvicorn**
- **Pydantic** for schemas
- **SQL** via `sqlite3` or Postgres (user can switch later)
- **Redis (optional)** for queueing; MVP can start with in-DB FIFO
- **HTTP + WebSocket** (for miner heartbeats / job streaming)

> Debian 12 target. Run under **systemd** later.

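The "in-DB FIFO" option from the tech stack can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's actual `queue.py`: the `jobs` table layout, the `seq` ordering column, and the helper names `connect`, `enqueue`, and `dequeue` are all assumptions made for the example.

```python
import sqlite3
import uuid
from typing import Optional

# Hypothetical table layout; only the columns needed for FIFO claiming are shown.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    seq               INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order = FIFO order
    job_id            TEXT NOT NULL UNIQUE,
    state             TEXT NOT NULL DEFAULT 'QUEUED',
    payload           TEXT NOT NULL,
    assigned_miner_id TEXT
)
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode,
    # so BEGIN/COMMIT/ROLLBACK below are issued explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> str:
    """Insert a new QUEUED job and return its job_id."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs (job_id, payload) VALUES (?, ?)", (job_id, payload)
    )
    return job_id


def dequeue(conn: sqlite3.Connection, miner_id: str) -> Optional[str]:
    """Claim the oldest QUEUED job for miner_id; return None if the queue is empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE state = 'QUEUED' ORDER BY seq LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE jobs SET state = 'RUNNING', assigned_miner_id = ? WHERE job_id = ?",
            (miner_id, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Selecting and updating inside one `BEGIN IMMEDIATE` transaction is what keeps two pollers from claiming the same row; a Postgres variant would likely use row locking instead.
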
---

## 3) Directory Layout (WindSurf Workspace)

```
coordinator-api/
├─ app/
│  ├─ main.py          # FastAPI init, lifespan, routers
│  ├─ config.py        # env parsing
│  ├─ deps.py          # auth, rate-limit deps
│  ├─ db.py            # simple DB layer (sqlite/postgres)
│  ├─ matching.py      # job→miner selection
│  ├─ queue.py         # enqueue/dequeue logic
│  ├─ settlement.py    # stubs for token accounting
│  ├─ models.py        # Pydantic request/response schemas
│  ├─ states.py        # state machine + transitions
│  ├─ routers/
│  │  ├─ client.py     # /v1/jobs (submit/status/result/cancel)
│  │  ├─ miner.py      # /v1/miners (register/heartbeat/poll/submit/fail)
│  │  └─ admin.py      # /v1/admin (stats)
│  └─ ws/
│     ├─ miner.py      # WS for miner heartbeats / job stream (optional)
│     └─ client.py     # WS for client result stream (optional)
├─ tests/
│  ├─ test_client_flow.http   # REST client flow (HTTP file)
│  └─ test_miner_flow.http    # REST miner flow
├─ .env.example
├─ pyproject.toml
└─ README.md
```

---

## 4) Environment (.env)

```
APP_ENV=dev
APP_HOST=127.0.0.1
APP_PORT=8011
DATABASE_URL=sqlite:///./coordinator.db
# or: DATABASE_URL=postgresql://user:pass@localhost:5432/aitbc

# Auth
CLIENT_API_KEYS=client_dev_key_1,client_dev_key_2
MINER_API_KEYS=miner_dev_key_1,miner_dev_key_2
ADMIN_API_KEYS=admin_dev_key_1

# Security
HMAC_SECRET=change_me
ALLOW_ORIGINS=*

# Queue
JOB_TTL_SECONDS=900
HEARTBEAT_INTERVAL_SECONDS=10
HEARTBEAT_TIMEOUT_SECONDS=30
```

---

## 5) Core Data Model (conceptual)

**Job**
- `job_id` (uuid)
- `client_id` (from API key)
- `requested_at`, `expires_at`
- `payload` (opaque JSON / bytes ref)
- `constraints` (gpu, cuda, mem, model, max_price, region)
- `state` (QUEUED|RUNNING|COMPLETED|FAILED|CANCELED|EXPIRED)
- `assigned_miner_id` (nullable)
- `result_ref` (blob path / inline json)
- `error` (nullable)
- `cost_estimate` (optional)

**Miner**
- `miner_id` (from API key)
- `capabilities` (gpu, cuda, vram, models[], region)
- `heartbeat_at`
- `status` (ONLINE|OFFLINE|DRAINING)
- `concurrency` (int), `inflight` (int)

**WorkerSession**
- `session_id`, `miner_id`, `job_id`, `started_at`, `ended_at`, `exit_reason`

---

## 6) State Machine

```
QUEUED  -> RUNNING   (assigned to miner)
        -> CANCELED  (client)
        -> EXPIRED   (ttl)

RUNNING -> COMPLETED (miner submit_result)
        -> FAILED    (miner fail / timeout)
        -> CANCELED  (client)
```

---

## 7) Matching (MVP)

- Filter ONLINE miners by **capabilities** & **region**
- Prefer lowest `inflight` (simple load)
- Tiebreak by earliest `heartbeat_at` or random
- Lock job row → assign → return to miner

---

## 8) Auth & Rate Limits

- **API keys** via `X-Api-Key` header for `client`, `miner`, `admin`.
- Optional **HMAC** (`X-Signature`) over body with `HMAC_SECRET`.
- **Idempotency**: clients send `Idempotency-Key` on **POST /jobs**.
- **Rate limiting**: naive per-key window (e.g., 60 req / 60 s).

---

## 9) REST API

### Client

- `POST /v1/jobs`
  - Create job. Returns `job_id`.
- `GET /v1/jobs/{job_id}`
  - Job status & metadata.
- `GET /v1/jobs/{job_id}/result`
  - Result (200 when ready, 425 if not ready).
- `POST /v1/jobs/{job_id}/cancel`
  - Cancel if QUEUED or RUNNING (best effort).

### Miner

- `POST /v1/miners/register`
  - Upsert miner capabilities; set ONLINE.
- `POST /v1/miners/heartbeat`
  - Touch `heartbeat_at`, report `inflight`.
- `POST /v1/miners/poll`
  - Long-poll for next job → returns a job or 204.
- `POST /v1/miners/{job_id}/start`
  - Confirm start (optional if `poll` implies start).
- `POST /v1/miners/{job_id}/result`
  - Submit result; transitions to COMPLETED.
- `POST /v1/miners/{job_id}/fail`
  - Submit failure; transitions to FAILED.
- `POST /v1/miners/drain`
  - Graceful stop accepting new jobs.

### Admin

- `GET /v1/admin/stats`
  - Queue depth, miners online, success rates, avg latency.
- `GET /v1/admin/jobs?state=&limit=...`
- `GET /v1/admin/miners`

**Error Shape**

```json
{ "error": { "code": "STRING_CODE", "message": "human readable", "details": {} } }
```

Common codes: `UNAUTHORIZED_KEY`, `RATE_LIMITED`, `INVALID_PAYLOAD`, `NO_ELIGIBLE_MINER`, `JOB_NOT_FOUND`, `JOB_NOT_READY`, `CONFLICT_STATE`.

---

## 10) WebSockets (optional MVP+)

- `WS /v1/ws/miner?api_key=...`
  - Server → miner: `job.assigned`
  - Miner → server: `heartbeat`, `result`, `fail`
- `WS /v1/ws/client?job_id=...&api_key=...`
  - Server → client: `state.changed`, `result.ready`

Fallback remains HTTP long-polling.

---

## 11) Result Storage

- **Inline JSON** if ≤ 1 MB.
- For larger payloads: store to disk path (e.g., `/var/lib/coordinator/results/{job_id}`) and return `result_ref`.

---

## 12) Settlement Hooks (stub)

`settlement.py` exposes:

- `record_usage(job, miner)`
- `quote_cost(job)`

Later wired to **AIToken** mint/transfer when blockchain lands.

---

## 13) Minimal FastAPI Skeleton

```python
# app/main.py
from fastapi import FastAPI
from app.routers import client, miner, admin


def create_app() -> FastAPI:
    app = FastAPI(title="AITBC Coordinator API", version="0.1.0")
    app.include_router(client.router, prefix="/v1")
    app.include_router(miner.router, prefix="/v1")
    app.include_router(admin.router, prefix="/v1")
    return app


app = create_app()
```

```python
# app/models.py
from pydantic import BaseModel, Field
from typing import Any, Dict, List, Optional


class Constraints(BaseModel):
    gpu: Optional[str] = None
    cuda: Optional[str] = None
    min_vram_gb: Optional[int] = None
    models: Optional[List[str]] = None
    region: Optional[str] = None
    max_price: Optional[float] = None


class JobCreate(BaseModel):
    payload: Dict[str, Any]
    # default_factory builds a fresh Constraints per request
    constraints: Constraints = Field(default_factory=Constraints)
    ttl_seconds: int = 900


class JobView(BaseModel):
    job_id: str
    state: str
    assigned_miner_id: Optional[str] = None
    requested_at: str
    expires_at: str
    error: Optional[str] = None


class MinerRegister(BaseModel):
    capabilities: Dict[str, Any]
    concurrency: int = 1
    region: Optional[str] = None


class PollRequest(BaseModel):
    max_wait_seconds: int = 15


class AssignedJob(BaseModel):
    job_id: str
    payload: Dict[str, Any]
```

```python
# app/routers/client.py
from fastapi import APIRouter, Depends, HTTPException
from app.models import JobCreate, JobView
from app.deps import require_client_key

router = APIRouter(tags=["client"])


@router.post("/jobs", response_model=JobView)
def submit_job(req: JobCreate, client_id: str = Depends(require_client_key)):
    # enqueue + return JobView
    ...


@router.get("/jobs/{job_id}", response_model=JobView)
def get_job(job_id: str, client_id: str = Depends(require_client_key)):
    ...
```

```python
# app/routers/miner.py
from fastapi import APIRouter, Depends, Response
from app.models import MinerRegister, PollRequest, AssignedJob
from app.deps import require_miner_key

router = APIRouter(tags=["miner"])


@router.post("/miners/register")
def register(req: MinerRegister, miner_id: str = Depends(require_miner_key)):
    ...


@router.post(
    "/miners/poll",
    response_model=AssignedJob,
    status_code=200,
    responses={204: {"description": "No job available"}},
)
def poll(req: PollRequest, miner_id: str = Depends(require_miner_key)):
    # try dequeue; when nothing is available, return Response(status_code=204)
    ...
```

Run:

```bash
uvicorn app.main:app --host 127.0.0.1 --port 8011 --reload
```

OpenAPI: `http://127.0.0.1:8011/docs`

---

## 14) Matching & Queue Pseudocode

```python
def match_next_job(miner):
    # `db` stands for the DB layer in app/db.py; its API here is illustrative.
    # Oldest QUEUED job whose constraints this miner satisfies (FIFO).
    candidates = db.jobs.filter(state="QUEUED").order_by("requested_at")
    eligible = next(
        (job for job in candidates
         if job.constraints.satisfied_by(miner.capabilities)),
        None,
    )
    if eligible is None:
        return None

    def assign():
        # Both steps run in one transaction so a job is never half-assigned.
        db.jobs.assign(eligible.job_id, miner.miner_id)
        db.states.transition(eligible.job_id, "RUNNING")

    db.txn(assign)
    return eligible
```

---

## 15) CURL Examples

**Client creates a job**

```bash
curl -sX POST http://127.0.0.1:8011/v1/jobs \
  -H 'X-Api-Key: client_dev_key_1' \
  -H 'Idempotency-Key: 7d4a...' \
  -H 'Content-Type: application/json' \
  -d '{
    "payload": {"task":"sum","a":2,"b":3},
    "constraints": {"gpu": null, "region": "eu-central"}
  }'
```

**Miner registers + polls**

```bash
curl -sX POST http://127.0.0.1:8011/v1/miners/register \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"capabilities":{"gpu":"RTX4060Ti","cuda":"12.3","vram_gb":16},"concurrency":2,"region":"eu-central"}'

curl -i -sX POST http://127.0.0.1:8011/v1/miners/poll \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"max_wait_seconds":10}'
```

**Miner submits result**

```bash
# JOB_ID holds the job_id returned by the poll call
curl -sX POST "http://127.0.0.1:8011/v1/miners/$JOB_ID/result" \
  -H 'X-Api-Key: miner_dev_key_1' \
  -H 'Content-Type: application/json' \
  -d '{"result":{"sum":5},"metrics":{"latency_ms":42}}'
```

**Client fetches result**

```bash
# JOB_ID holds the job_id returned by POST /v1/jobs
curl -s "http://127.0.0.1:8011/v1/jobs/$JOB_ID/result" \
  -H 'X-Api-Key: client_dev_key_1'
```

---

## 16) Timeouts & Health

- **Job TTL**: auto-expire QUEUED after `JOB_TTL_SECONDS`.
- **Heartbeat**: miners post every `HEARTBEAT_INTERVAL_SECONDS`.
- **Miner OFFLINE** if no heartbeat for `HEARTBEAT_TIMEOUT_SECONDS`.
- **Requeue**: RUNNING jobs from OFFLINE miners → back to QUEUED.

---

## 17) Security Notes

- Validate `payload` size & type; enforce max 1 MB inline.
- Optional **HMAC** signature for tamper detection.
- Sanitize/validate miner-reported capabilities.
- Log every state transition (append-only).

---

## 18) Admin Metrics (MVP)

- Queue depth, running count
- Miner online/offline, inflight
- P50/P95 job latency
- Success/fail/cancel rates (windowed)

---

## 19) Future Stages

- **Blockchain layer**: mint on verified compute; tie to `record_usage`.
- **Pool hub**: multi-coordinator balancing; marketplace.
- **Reputation**: miner scoring, penalty, slashing.
- **Bidding**: price discovery; client max price.

---

## 20) Checklist (WindSurf)

1. Create repo structure from section **3**.
2. Implement `.env` & `config.py` keys from **4**.
3. Add `models.py`, `states.py`, `deps.py` (auth, rate limit).
4. Implement DB tables for Job, Miner, WorkerSession.
5. Implement `queue.py` and `matching.py`.
6. Wire **client** and **miner** routers (MVP endpoints).
7. Add admin stats (basic counts).
8. Add OpenAPI tags, descriptions.
9. Add curl `.http` test files.
10. Systemd unit + Nginx proxy (later).
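
As a possible starting point for checklist step 3, the transitions from section **6** could be encoded in `states.py` as a lookup table. This is only a sketch: the names `JobState`, `TRANSITIONS`, `InvalidTransition`, and `transition` are assumptions, not an existing API. The requeue rule from section **16** (RUNNING back to QUEUED) is not in the section 6 diagram, so it is left out here; supporting it would mean adding that edge.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    EXPIRED = "EXPIRED"


# Allowed edges, copied from the section 6 diagram.
# Terminal states map to an empty set.
TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED, JobState.EXPIRED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.CANCELED: set(),
    JobState.EXPIRED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in TRANSITIONS."""


def transition(current: str, target: str) -> JobState:
    """Return the new state if current -> target is allowed, else raise."""
    cur, tgt = JobState(current), JobState(target)
    if tgt not in TRANSITIONS[cur]:
        raise InvalidTransition(f"{cur.value} -> {tgt.value}")
    return tgt
```

A router could catch `InvalidTransition` and answer with the `CONFLICT_STATE` error code, for example when a miner submits a result for a job that was already canceled.
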