docs: add gitea-runner SSH-based CI log debugging skill and workflow

Added comprehensive documentation for autonomous investigation of failed Gitea Actions runs via SSH access to the gitea-runner host. Includes log location mapping, classification heuristics for distinguishing workflow/dependency/application/service/infrastructure failures, and evidence-based debug suggestion templates. Provides read-only investigation sequences with safety constraints to prevent conflating application failures with runner instability.
This commit is contained in:
aitbc
2026-04-20 12:05:31 +02:00
parent eb51363ea9
commit 3df724d9fc
2 changed files with 439 additions and 0 deletions


@@ -0,0 +1,211 @@
---
description: Autonomous skill for SSH-based investigation of gitea-runner CI logs, runner health, and root-cause-oriented debug guidance
title: Gitea Runner Log Debugger
version: 1.0
---
# Gitea Runner Log Debugger Skill
## Purpose
Use this skill to diagnose failed Gitea Actions runs by connecting to `gitea-runner`, reading CI log files, correlating them with runner health, and producing targeted debug suggestions.
## Activation
Activate this skill when:
- a Gitea workflow fails and the UI log is incomplete or inconvenient
- Windsurf needs direct access to runner-side CI logs
- you need to distinguish workflow failures from runner failures
- you need evidence-backed debug suggestions instead of generic guesses
- a job appears to fail because of OOM, restart loops, path mismatches, or missing dependencies
## Known Environment Facts
- Runner host: `ssh gitea-runner`
- Runner service: `gitea-runner.service`
- Runner binary: `/opt/gitea-runner/act_runner`
- Persistent CI logs: `/opt/gitea-runner/logs`
- Indexed log manifest: `/opt/gitea-runner/logs/index.tsv`
- Latest log symlink: `/opt/gitea-runner/logs/latest.log`
- Gitea Actions on this runner exposes GitHub-compatible runtime variables, so `GITHUB_RUN_ID` is the correct run identifier to prefer over `GITEA_RUN_ID`
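To confirm the manifest layout before filtering it, a read-only peek is enough. A minimal sketch, assuming only that the file exists; the run-id lookups later in this skill assume the run id sits in column 2, so verify that against the real rows:
```bash
# Read-only peek at the log manifest; the awk lookups below assume the run id
# sits in column 2 of this TSV, so confirm that before filtering on it.
ssh gitea-runner 'head -n 5 /opt/gitea-runner/logs/index.tsv'
```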
## Inputs
### Minimum Input
- failing workflow name, job name, or pasted error output
### Best Input
```json
{
"workflow_name": "Staking Tests",
"job_name": "test-staking-service",
"run_id": "1787",
"symptoms": [
"ModuleNotFoundError: No module named click"
],
"needs_runner_health_check": true
}
```
## Expected Outputs
```json
{
"failure_class": "workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown",
"root_cause": "string",
"evidence": ["string"],
"minimal_fix": "string",
"follow_up_checks": ["string"],
"confidence": "low | medium | high"
}
```
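For illustration, a hypothetical filled-in output for the sample input above could look like this (every value is an example, not a recorded result):
```json
{
  "failure_class": "dependency_packaging",
  "root_cause": "test-staking-service imports click, but the lean CI environment never installs it",
  "evidence": ["ModuleNotFoundError: No module named click"],
  "minimal_fix": "add click to the job's dependency install step, or fix the import chain if click should not be required",
  "follow_up_checks": ["re-run the Staking Tests workflow", "scan the new job log for further missing modules"],
  "confidence": "high"
}
```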
## Investigation Sequence
### 1. Connect and Verify Runner
```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```
### 2. Locate Relevant CI Logs
Start with the indexed job logs.
```bash
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```
If a run id is known:
```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```
If only workflow/job names are known:
```bash
ssh gitea-runner 'grep -i "production tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-production" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```
### 3. Read the Job Log Before the Runner Log
```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log>.log'
```
### 4. Correlate With Runner State
```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```
### 5. Check for Resource Exhaustion Only if Indicated
```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
```
## Classification Rules
### Workflow Config Failure
Evidence patterns:
- script path not found
- wrong repo path
- wrong service/unit name
- wrong import target or startup command
- missing environment export
Default recommendation:
- patch the workflow with the smallest targeted fix
### Dependency / Packaging Failure
Evidence patterns:
- `ModuleNotFoundError`
- `ImportError`
- failed editable install
- Poetry package discovery failure
- missing pip/Node dependency in lean CI setup
Default recommendation:
- add only the missing dependency when truly required
- otherwise fix the import chain or packaging metadata root cause
### Application / Test Failure
Evidence patterns:
- normal environment setup completes
- tests collect and run
- failure is an assertion or application traceback
Default recommendation:
- patch code or tests, not the runner
### Service Readiness Failure
Evidence patterns:
- health endpoint timeout
- process exits immediately
- server log shows startup/config exception
Default recommendation:
- inspect service startup logs and verify host/path/port assumptions
### Runner / Infrastructure Failure
Evidence patterns:
- `oom-kill` in `journalctl`
- runner daemon restart loop
- truncated logs across unrelated workflows
- disk exhaustion or temp space errors
Default recommendation:
- treat as runner capacity/stability issue only when evidence is direct
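As a first pass over a resolved job log, a single pattern scan can hint at which class applies. This is a sketch only: the patterns mirror the evidence lists above and are neither exhaustive nor authoritative, and `<resolved-log>` is the placeholder resolved in step 3:
```bash
# Illustrative pattern scan; a hit suggests a failure class but does not prove it.
ssh gitea-runner 'grep -n -i -E "ModuleNotFoundError|ImportError|Traceback|AssertionError|No such file|connection refused|oom|killed process" /opt/gitea-runner/logs/<resolved-log>.log | tail -n 40'
```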
## Decision Heuristics
- Prefer the job log over `journalctl` for code/workflow failures
- Prefer the smallest fix that explains all evidence
- Do not suggest restarting the runner unless the user asks or the runner is clearly unhealthy
- Ignore internal `task <id>` values for workflow naming or file lookup
- If `/opt/gitea-runner/logs` is missing a run, check whether the workflow had the logging initializer at that time
## Debug Suggestion Template
When reporting back, use this structure:
### Failure Class
`<workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown>`
### Root Cause
One sentence describing the most likely issue.
### Evidence
- `<specific log line>`
- `<specific log line>`
- `<runner health correlation if relevant>`
### Minimal Fix
One focused change that addresses the root cause.
### Optional Follow-up
- `<verification step>`
- `<secondary diagnostic if needed>`
### Confidence
`low | medium | high`
## Safety Constraints
- Read-only first
- No service restarts without explicit user approval
- No deletion of runner files during diagnosis
- Do not conflate application tracebacks with runner instability
## Fast First-Pass Bundle
```bash
ssh gitea-runner '
echo "=== latest runs ===";
tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
echo "=== latest log ===";
tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
echo "=== runner service ===";
systemctl status gitea-runner --no-pager | tail -n 40 || true;
echo "=== runner journal ===";
journalctl -u gitea-runner -n 80 --no-pager || true
'
```
## Related Assets
- `.windsurf/workflows/gitea-runner-ci-debug.md`
- `scripts/ci/setup-job-logging.sh`


@@ -0,0 +1,228 @@
---
description: SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions
---
# Gitea Runner CI Debug Workflow
## Purpose
Use this workflow when a Gitea Actions job fails and you need Windsurf to:
- SSH to `gitea-runner`
- locate the most relevant CI log files
- inspect runner health and runner-side failures
- separate workflow/application failures from runner/infrastructure failures
- produce actionable debug suggestions with evidence
## Key Environment Facts
- The actual runner host is reachable via `ssh gitea-runner`
- The runner service is `gitea-runner.service`
- The runner binary is `/opt/gitea-runner/act_runner`
- Gitea Actions on this runner behaves like a GitHub compatibility layer
- Prefer `GITHUB_RUN_ID` and `GITHUB_RUN_NUMBER`, not `GITEA_RUN_ID`
- Internal runner `task <id>` messages in `journalctl` are useful for runner debugging, but are not stable workflow-facing identifiers
- CI job logs created by the reusable logging wrapper live under `/opt/gitea-runner/logs`
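When a workflow step needs to reference its own run, the GitHub-compatible variables are the ones to log. A minimal sketch (exact step placement depends on the workflow):
```bash
# Inside a job step: record the stable, GitHub-compatible run identifiers.
echo "run_id=${GITHUB_RUN_ID} run_number=${GITHUB_RUN_NUMBER}"
```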
## Safety Rules
- Start with read-only inspection only
- Do not restart the runner or mutate files unless the user explicitly asks
- Prefer scoped log reads over dumping entire files
- If a failure is clearly application-level, stop proposing runner changes
## Primary Log Sources
### Job Logs
- `/opt/gitea-runner/logs/index.tsv`
- `/opt/gitea-runner/logs/latest.log`
- `/opt/gitea-runner/logs/latest-<workflow>.log`
- `/opt/gitea-runner/logs/latest-<workflow>-<job>.log`
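To see which of these symlinks exist on the runner, list them read-only; a name like `latest-staking-tests-test-staking-service.log` is an assumed example of the pattern above, not a guaranteed file:
```bash
# List only the "latest" symlinks (read-only); which names exist depends on
# which workflows already use the logging wrapper.
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs/latest*.log'
```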
### Runner Logs
- `journalctl -u gitea-runner`
- `/opt/gitea-runner/runner.log`
- `systemctl status gitea-runner --no-pager`
## Workflow Steps
### Step 1: Confirm Runner Reachability
```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```
Expected outcome:
- host is `gitea-runner`
- user is usually `root`
- service is `active`
### Step 2: Find Candidate CI Logs
If you know the workflow or job name, start there.
```bash
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```
If you know the run id:
```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```
If you know the workflow/job name:
```bash
ssh gitea-runner 'grep -i "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```
### Step 3: Read the Most Relevant Job Log
After identifying the file path from `index.tsv`, inspect the tail first.
```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log-file>.log'
```
If `latest.log` already matches the failing run:
```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```
### Step 4: Correlate With Runner Health
Only do this after reading the job log, so you do not confuse test failures with runner failures.
```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```
### Step 5: Check for Infrastructure Pressure
Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.
```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | grep -i -E "oom|killed|failed|panic|error"'
```
### Step 6: Classify the Failure
Use the evidence to classify the failure into one of these buckets.
#### A. Workflow / Config Regression
Typical evidence:
- missing script path
- wrong workspace path
- wrong import target
- wrong service name
- bad YAML logic
Typical fixes:
- patch the workflow
- correct repo-relative paths
- fix `PYTHONPATH`, script invocation, or job dependencies
#### B. Dependency / Packaging Failure
Typical evidence:
- `ModuleNotFoundError`
- editable install failure
- Poetry/pyproject packaging errors
- missing test/runtime packages
Typical fixes:
- add the minimal missing dependency
- avoid broadening installs unnecessarily
- fix package metadata only if the install is actually required
#### C. Application / Test Failure
Typical evidence:
- assertion failures
- application tracebacks after setup completes
- service starts but endpoint behavior is wrong
Typical fixes:
- patch code or tests
- address the real failing import chain or runtime logic
#### D. Service Readiness / Integration Failure
Typical evidence:
- health-check timeout
- `curl` connection refused
- server never starts
- dependent services unavailable
Typical fixes:
- inspect service logs
- fix startup command or environment
- ensure readiness probes hit the correct host/path
#### E. Runner / Infrastructure Failure
Typical evidence:
- `oom-kill` in `journalctl`
- runner daemon restart loop
- disk full or temp space exhaustion
- SSH reachable but job logs end abruptly
Typical fixes:
- reduce CI memory footprint
- split large jobs
- investigate runner/container resource limits
- only restart runner if explicitly requested
## Analysis Heuristics
### Prefer the Smallest Plausible Root Cause
Do not blame the runner for a clean Python traceback in a job log.
### Use Job Logs Before Runner Logs
Job logs usually explain application/workflow failures better than runner logs.
### Treat OOM as a Runner Problem Only With Evidence
Look for `oom-kill`, `killed process`, or abrupt job termination without a normal traceback.
### Distinguish Missing Logs From Missing Logging
If `/opt/gitea-runner/logs` does not contain the run you want, verify whether the workflow had the logging initializer yet.
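One way to verify this is to grep the workflow file for the logging wrapper (paths are assumptions taken from the Related Files section below; for older runs, check the workflow as it existed at that commit):
```bash
# Does this workflow call the logging wrapper yet? (paths assumed)
grep -n "setup-job-logging" /opt/aitbc/.gitea/workflows/staking-tests.yml
```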
## Recommended Windsurf Output Format
When the investigation is complete, report findings in this structure:
```text
Failure class:
Root cause:
Evidence:
- <log line or command result>
- <log line or command result>
Why this is the likely cause:
Minimal fix:
Optional follow-up checks:
Confidence: <low|medium|high>
```
## Quick Command Bundle
Use this bundle when you need a fast first pass.
```bash
ssh gitea-runner '
echo "=== service ===";
systemctl is-active gitea-runner;
echo "=== latest indexed runs ===";
tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
echo "=== latest job log ===";
tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
echo "=== runner journal ===";
journalctl -u gitea-runner -n 80 --no-pager || true
'
```
## Escalation Guidance
Escalate to a deeper infrastructure review when:
- the runner repeatedly shows `oom-kill`
- job logs are truncated across unrelated workflows
- the runner daemon is flapping
- disk or tmp space is exhausted
- the same failure occurs across multiple independent workflows without a shared code change
## Related Files
- `/opt/aitbc/scripts/ci/setup-job-logging.sh`
- `/opt/aitbc/.gitea/workflows/staking-tests.yml`
- `/opt/aitbc/.gitea/workflows/production-tests.yml`
- `/opt/aitbc/.gitea/workflows/systemd-sync.yml`