From 3df724d9fc18fc20b53b0519e503882fa24b6d62 Mon Sep 17 00:00:00 2001
From: aitbc
Date: Mon, 20 Apr 2026 12:05:31 +0200
Subject: [PATCH] docs: add gitea-runner SSH-based CI log debugging skill and
 workflow

Added comprehensive documentation for autonomous investigation of failed
Gitea Actions runs via SSH access to the gitea-runner host. Includes log
location mapping, classification heuristics for distinguishing
workflow/dependency/application/service/infrastructure failures, and
evidence-based debug suggestion templates. Provides read-only investigation
sequences with safety constraints to prevent conflating application failures
with runner instability.
---
 .windsurf/skills/gitea-runner-log-debugger.md | 211 ++++++++++++++++
 .windsurf/workflows/gitea-runner-ci-debug.md  | 228 ++++++++++++++++++
 2 files changed, 439 insertions(+)
 create mode 100644 .windsurf/skills/gitea-runner-log-debugger.md
 create mode 100644 .windsurf/workflows/gitea-runner-ci-debug.md

diff --git a/.windsurf/skills/gitea-runner-log-debugger.md b/.windsurf/skills/gitea-runner-log-debugger.md
new file mode 100644
index 00000000..d9f32439
--- /dev/null
+++ b/.windsurf/skills/gitea-runner-log-debugger.md
@@ -0,0 +1,211 @@
+---
+description: Autonomous skill for SSH-based investigation of gitea-runner CI logs, runner health, and root-cause-oriented debug guidance
+title: Gitea Runner Log Debugger
+version: 1.0
+---
+
+# Gitea Runner Log Debugger Skill
+
+## Purpose
+Use this skill to diagnose failed Gitea Actions runs by connecting to `gitea-runner`, reading CI log files, correlating them with runner health, and producing targeted debug suggestions.
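+
+The failure classes defined later in this skill can be approximated with simple pattern checks before a full investigation. This is a hedged sketch, not the skill's authoritative heuristics: the `classify_log` helper name and its patterns are illustrative only.
+
+```bash
+# Rough first-pass triage; real classification uses the full evidence rules below.
+classify_log() {
+  log_file="$1"
+  if grep -q -i -E 'oom-kill|out of memory|killed process' "$log_file"; then
+    echo "runner_infrastructure"   # checked first so an OOM kill is not misread as a test failure
+  elif grep -q -E 'ModuleNotFoundError|ImportError' "$log_file"; then
+    echo "dependency_packaging"
+  elif grep -q -E 'AssertionError|FAILED' "$log_file"; then
+    echo "application_test"
+  else
+    echo "unknown"
+  fi
+}
+```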
+ +## Activation +Activate this skill when: +- a Gitea workflow fails and the UI log is incomplete or inconvenient +- Windsurf needs direct access to runner-side CI logs +- you need to distinguish workflow failures from runner failures +- you need evidence-backed debug suggestions instead of generic guesses +- a job appears to fail because of OOM, restart loops, path mismatches, or missing dependencies + +## Known Environment Facts +- Runner host: `ssh gitea-runner` +- Runner service: `gitea-runner.service` +- Runner binary: `/opt/gitea-runner/act_runner` +- Persistent CI logs: `/opt/gitea-runner/logs` +- Indexed log manifest: `/opt/gitea-runner/logs/index.tsv` +- Latest log symlink: `/opt/gitea-runner/logs/latest.log` +- Gitea Actions on this runner exposes GitHub-compatible runtime variables, so `GITHUB_RUN_ID` is the correct run identifier to prefer over `GITEA_RUN_ID` + +## Inputs + +### Minimum Input +- failing workflow name, job name, or pasted error output + +### Best Input +```json +{ + "workflow_name": "Staking Tests", + "job_name": "test-staking-service", + "run_id": "1787", + "symptoms": [ + "ModuleNotFoundError: No module named click" + ], + "needs_runner_health_check": true +} +``` + +## Expected Outputs +```json +{ + "failure_class": "workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown", + "root_cause": "string", + "evidence": ["string"], + "minimal_fix": "string", + "follow_up_checks": ["string"], + "confidence": "low | medium | high" +} +``` + +## Investigation Sequence + +### 1. Connect and Verify Runner +```bash +ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner' +``` + +### 2. Locate Relevant CI Logs +Prefer indexed job logs first. 
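+
+The run-id filter used in this section assumes `index.tsv` is tab-separated with the run id in column 2; that layout is inferred from the awk filter itself, not separately documented. The lookup can be rehearsed locally against a synthetic index (column order here is an assumption):
+
+```bash
+# Synthetic one-row index: <timestamp> <run_id> <workflow> <log_path> (layout assumed).
+printf '2026-04-20T12:05\t1787\tStaking Tests\t/opt/gitea-runner/logs/1787.log\n' > /tmp/index-sample.tsv
+awk -F '\t' '$2 == "1787" {print $4}' /tmp/index-sample.tsv   # prints the assumed log-path column
+```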
+
+```bash
+ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
+```
+
+If a run id is known:
+
+```bash
+ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
+```
+
+If only workflow/job names are known:
+
+```bash
+ssh gitea-runner 'grep -i "production tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
+ssh gitea-runner 'grep -i "test-production" /opt/gitea-runner/logs/index.tsv | tail -n 20'
+```
+
+### 3. Read the Job Log Before the Runner Log
+```bash
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<job-log-file>.log'
+```
+
+### 4. Correlate With Runner State
+```bash
+ssh gitea-runner 'systemctl status gitea-runner --no-pager'
+ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
+```
+
+### 5. Check for Resource Exhaustion Only if Indicated
+```bash
+ssh gitea-runner 'free -h; df -h /opt /var /tmp'
+ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
+```
+
+## Classification Rules
+
+### Workflow Config Failure
+Evidence patterns:
+- script path not found
+- wrong repo path
+- wrong service/unit name
+- wrong import target or startup command
+- missing environment export
+
+Default recommendation:
+- patch the workflow with the smallest targeted fix
+
+### Dependency / Packaging Failure
+Evidence patterns:
+- `ModuleNotFoundError`
+- `ImportError`
+- failed editable install
+- Poetry package discovery failure
+- missing pip/Node dependency in a lean CI setup
+
+Default recommendation:
+- add only the missing dependency when truly required
+- otherwise fix the import chain or packaging metadata root cause
+
+### Application / Test Failure
+Evidence patterns:
+- normal environment setup completes
+- tests collect and run
+- failure is an assertion or application traceback
+
+Default recommendation:
+- patch code or tests, not the runner
+
+### Service Readiness Failure
+Evidence patterns:
+- health endpoint timeout
+- process exits immediately
+- server log shows a startup/config exception
+
+Default recommendation:
+- inspect service startup logs and verify host/path/port assumptions
+
+### Runner / Infrastructure Failure
+Evidence patterns:
+- `oom-kill` in `journalctl`
+- runner daemon restart loop
+- truncated logs across unrelated workflows
+- disk exhaustion or temp space errors
+
+Default recommendation:
+- treat as a runner capacity/stability issue only when the evidence is direct
+
+## Decision Heuristics
+- Prefer the job log over `journalctl` for code/workflow failures
+- Prefer the smallest fix that explains all evidence
+- Do not suggest restarting the runner unless the user asks or the runner is clearly unhealthy
+- Ignore internal `task <id>` values for workflow naming or file lookup
+- If `/opt/gitea-runner/logs` is missing a run, check whether the workflow had the logging initializer at that time
+
+## Debug Suggestion Template
+When reporting back, use this structure:
+
+### Failure Class
+`<one of the failure classes above>`
+
+### Root Cause
+One sentence describing the most likely issue.
+
+### Evidence
+- `<evidence line 1>`
+- `<evidence line 2>`
+- `<evidence line 3>`
+
+### Minimal Fix
+One focused change that addresses the root cause.
+
+### Optional Follow-up
+- `<follow-up check 1>`
+- `<follow-up check 2>`
+
+### Confidence
+`low | medium | high`
+
+## Safety Constraints
+- Read-only first
+- No service restarts without explicit user approval
+- No deletion of runner files during diagnosis
+- Do not conflate application tracebacks with runner instability
+
+## Fast First-Pass Bundle
+```bash
+ssh gitea-runner '
+  echo "=== latest runs ===";
+  tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
+  echo "=== latest log ===";
+  tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
+  echo "=== runner service ===";
+  systemctl status gitea-runner --no-pager | tail -n 40 || true;
+  echo "=== runner journal ===";
+  journalctl -u gitea-runner -n 80 --no-pager || true
+'
+```
+
+## Related Assets
+- `.windsurf/workflows/gitea-runner-ci-debug.md`
+- `scripts/ci/setup-job-logging.sh`

diff --git a/.windsurf/workflows/gitea-runner-ci-debug.md b/.windsurf/workflows/gitea-runner-ci-debug.md
new file mode 100644
index 00000000..16290d1b
--- /dev/null
+++ b/.windsurf/workflows/gitea-runner-ci-debug.md
@@ -0,0 +1,228 @@
+---
+description: SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions
+---
+
+# Gitea Runner CI Debug Workflow
+
+## Purpose
+Use this workflow when a Gitea Actions job fails and you need Windsurf to:
+- SSH to `gitea-runner`
+- locate the most relevant CI log files
+- inspect runner health and runner-side failures
+- separate workflow/application failures from runner/infrastructure failures
+- produce actionable debug suggestions with evidence
+
+## Key Environment Facts
+- The actual runner host is reachable via `ssh gitea-runner`
+- The runner service is `gitea-runner.service`
+- The runner binary is `/opt/gitea-runner/act_runner`
+- Gitea Actions on this runner behaves like a GitHub-compatibility layer
+- Prefer `GITHUB_RUN_ID` and `GITHUB_RUN_NUMBER`, not `GITEA_RUN_ID`
+- Internal runner `task <id>` messages in `journalctl` are useful for runner debugging, but are not stable workflow-facing identifiers
+- CI job logs created by the reusable logging wrapper live under `/opt/gitea-runner/logs`
+
+## Safety Rules
+- Start with read-only inspection only
+- Do not restart the runner or mutate files unless the user explicitly asks
+- Prefer scoped log reads over dumping entire files
+- If a failure is clearly application-level, stop proposing runner changes
+
+## Primary Log Sources
+
+### Job Logs
+- `/opt/gitea-runner/logs/index.tsv`
+- `/opt/gitea-runner/logs/latest.log`
+- `/opt/gitea-runner/logs/latest-<workflow>.log`
+- `/opt/gitea-runner/logs/latest-<workflow>-<job>.log`
+
+### Runner Logs
+- `journalctl -u gitea-runner`
+- `/opt/gitea-runner/runner.log`
+- `systemctl status gitea-runner --no-pager`
+
+## Workflow Steps
+
+### Step 1: Confirm Runner Reachability
+```bash
+ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
+```
+
+Expected outcome:
+- host is `gitea-runner`
+- user is usually `root`
+- service is `active`
+
+### Step 2: Find Candidate CI Logs
+If you know the workflow or job name, start there.
+
+```bash
+ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
+ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
+```
+
+If you know the run id:

+```bash
+ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
+```
+
+If you know the workflow/job name:
+
+```bash
+ssh gitea-runner 'grep -i "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
+ssh gitea-runner 'grep -i "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
+```
+
+### Step 3: Read the Most Relevant Job Log
+After identifying the file path from `index.tsv`, inspect the tail first.
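+
+Tailing pairs well with a quick error scan, so evidence lines can be cited with their line numbers. A minimal local sketch (the pattern list is illustrative, not exhaustive, and `$JOB_LOG` stands for the log path resolved from `index.tsv`):
+
+```bash
+# Surface likely failure lines from the tail of a job log, numbered for citation.
+tail -n 200 "$JOB_LOG" | grep -n -i -E 'error|traceback|failed' || echo "no obvious error lines"
+```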
+
+```bash
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<job-log-file>.log'
+```
+
+If `latest.log` already matches the failing run:
+
+```bash
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
+```
+
+### Step 4: Correlate With Runner Health
+Only do this after reading the job log, so you do not confuse test failures with runner failures.
+
+```bash
+ssh gitea-runner 'systemctl status gitea-runner --no-pager'
+ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
+ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
+```
+
+### Step 5: Check for Infrastructure Pressure
+Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.
+
+```bash
+ssh gitea-runner 'free -h; df -h /opt /var /tmp'
+ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
+ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | grep -i -E "oom|killed|failed|panic|error"'
+```
+
+### Step 6: Classify the Failure
+Use the evidence to classify the failure into one of these buckets.
+
+#### A. Workflow / Config Regression
+Typical evidence:
+- missing script path
+- wrong workspace path
+- wrong import target
+- wrong service name
+- bad YAML logic
+
+Typical fixes:
+- patch the workflow
+- correct repo-relative paths
+- fix `PYTHONPATH`, script invocation, or job dependencies
+
+#### B. Dependency / Packaging Failure
+Typical evidence:
+- `ModuleNotFoundError`
+- editable install failure
+- Poetry/pyproject packaging errors
+- missing test/runtime packages
+
+Typical fixes:
+- add the minimal missing dependency
+- avoid broadening installs unnecessarily
+- fix package metadata only if the install is actually required
+
+#### C. Application / Test Failure
+Typical evidence:
+- assertion failures
+- application tracebacks after setup completes
+- service starts but endpoint behavior is wrong
+
+Typical fixes:
+- patch code or tests
+- address the real failing import chain or runtime logic
+
+#### D. Service Readiness / Integration Failure
+Typical evidence:
+- health-check timeout
+- `curl` connection refused
+- server never starts
+- dependent services unavailable
+
+Typical fixes:
+- inspect service logs
+- fix the startup command or environment
+- ensure readiness probes hit the correct host/path
+
+#### E. Runner / Infrastructure Failure
+Typical evidence:
+- `oom-kill` in `journalctl`
+- runner daemon restart loop
+- disk full or temp space exhaustion
+- SSH reachable but job logs end abruptly
+
+Typical fixes:
+- reduce the CI memory footprint
+- split large jobs
+- investigate runner/container resource limits
+- only restart the runner if explicitly requested
+
+## Analysis Heuristics
+
+### Prefer the Smallest Plausible Root Cause
+Do not blame the runner for a clean Python traceback in a job log.
+
+### Use Job Logs Before Runner Logs
+Job logs usually explain application/workflow failures better than runner logs.
+
+### Treat OOM as a Runner Problem Only With Evidence
+Look for `oom-kill`, `killed process`, or abrupt job termination without a normal traceback.
+
+### Distinguish Missing Logs From Missing Logging
+If `/opt/gitea-runner/logs` does not contain the run you want, verify whether the workflow had the logging initializer yet.
+
+## Recommended Windsurf Output Format
+When the investigation is complete, report findings in this structure:
+
+```text
+Failure class: <one of the Step 6 buckets>
+Root cause: <one sentence>
+Evidence:
+- <log line or journal excerpt>
+- <log line or journal excerpt>
+Why this is the likely cause: <short justification>
+Minimal fix: <one focused change>
+Optional follow-up checks: <commands or files to verify>
+Confidence: <low | medium | high>
+```
+
+## Quick Command Bundle
+Use this bundle when you need a fast first pass.
+ +```bash +ssh gitea-runner ' + echo "=== service ==="; + systemctl is-active gitea-runner; + echo "=== latest indexed runs ==="; + tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true; + echo "=== latest job log ==="; + tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true; + echo "=== runner journal ==="; + journalctl -u gitea-runner -n 80 --no-pager || true +' +``` + +## Escalation Guidance +Escalate to a deeper infrastructure review when: +- the runner repeatedly shows `oom-kill` +- job logs are truncated across unrelated workflows +- the runner daemon is flapping +- disk or tmp space is exhausted +- the same failure occurs across multiple independent workflows without a shared code change + +## Related Files +- `/opt/aitbc/scripts/ci/setup-job-logging.sh` +- `/opt/aitbc/.gitea/workflows/staking-tests.yml` +- `/opt/aitbc/.gitea/workflows/production-tests.yml` +- `/opt/aitbc/.gitea/workflows/systemd-sync.yml`
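+
+## Optional: Truncation Spot-Check
+The "truncated logs" escalation signal can be spot-checked mechanically. This is a hedged sketch that assumes the logging wrapper ends each completed log with a final marker line; the `=== job finished ===` marker is hypothetical, so substitute whatever terminator the wrapper actually writes.
+
+```bash
+# Flag job logs whose last line lacks the (assumed) completion marker.
+for f in /opt/gitea-runner/logs/*.log; do
+  [ "$(tail -n 1 "$f")" = "=== job finished ===" ] || echo "possibly truncated: $f"
+done
+```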