docs: add gitea-runner SSH-based CI log debugging skill and workflow
Added comprehensive documentation for autonomous investigation of failed Gitea Actions runs via SSH access to the gitea-runner host. Includes log location mapping, classification heuristics for distinguishing workflow/dependency/application/service/infrastructure failures, and evidence-based debug suggestion templates. Provides read-only investigation sequences with safety constraints to prevent conflating application failures with runner instability.
This commit is contained in:
.windsurf/skills/gitea-runner-log-debugger.md (211 lines, new file)

@@ -0,0 +1,211 @@
---
description: Autonomous skill for SSH-based investigation of gitea-runner CI logs, runner health, and root-cause-oriented debug guidance
title: Gitea Runner Log Debugger
version: 1.0
---

# Gitea Runner Log Debugger Skill

## Purpose

Use this skill to diagnose failed Gitea Actions runs by connecting to `gitea-runner`, reading CI log files, correlating them with runner health, and producing targeted debug suggestions.

## Activation

Activate this skill when:

- a Gitea workflow fails and the UI log is incomplete or inconvenient
- Windsurf needs direct access to runner-side CI logs
- you need to distinguish workflow failures from runner failures
- you need evidence-backed debug suggestions instead of generic guesses
- a job appears to fail because of OOM, restart loops, path mismatches, or missing dependencies

## Known Environment Facts

- Runner host: `ssh gitea-runner`
- Runner service: `gitea-runner.service`
- Runner binary: `/opt/gitea-runner/act_runner`
- Persistent CI logs: `/opt/gitea-runner/logs`
- Indexed log manifest: `/opt/gitea-runner/logs/index.tsv`
- Latest log symlink: `/opt/gitea-runner/logs/latest.log`
- Gitea Actions on this runner exposes GitHub-compatible runtime variables, so `GITHUB_RUN_ID` is the correct run identifier to prefer over `GITEA_RUN_ID`

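The run-identifier preference above can be sketched as a small shell fallback chain. The fallback order and the `unknown` sentinel are illustrative assumptions; on this runner `GITHUB_RUN_ID` is expected to be populated.

```shell
# Resolve the run identifier as the facts above recommend:
# GITHUB_RUN_ID first, GITEA_RUN_ID as a fallback, then a sentinel.
RUN_ID="${GITHUB_RUN_ID:-${GITEA_RUN_ID:-unknown}}"
echo "resolved run id: ${RUN_ID}"
```
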
## Inputs

### Minimum Input

- failing workflow name, job name, or pasted error output

### Best Input

```json
{
  "workflow_name": "Staking Tests",
  "job_name": "test-staking-service",
  "run_id": "1787",
  "symptoms": [
    "ModuleNotFoundError: No module named click"
  ],
  "needs_runner_health_check": true
}
```

## Expected Outputs

```json
{
  "failure_class": "workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown",
  "root_cause": "string",
  "evidence": ["string"],
  "minimal_fix": "string",
  "follow_up_checks": ["string"],
  "confidence": "low | medium | high"
}
```

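As a hedged illustration only, a completed output for the `ModuleNotFoundError` case from the Best Input example might look like this (all values are invented for illustration, not taken from a real run):

```json
{
  "failure_class": "dependency_packaging",
  "root_cause": "The click package is missing from the lean CI environment used by test-staking-service.",
  "evidence": ["ModuleNotFoundError: No module named click"],
  "minimal_fix": "Add click to the job's dependency install step only if the import is genuinely required.",
  "follow_up_checks": ["re-run the job and confirm the import succeeds"],
  "confidence": "medium"
}
```
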
## Investigation Sequence

### 1. Connect and Verify Runner

```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```

### 2. Locate Relevant CI Logs

Prefer indexed job logs first.

```bash
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

If a run id is known:

```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```

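The `awk` filter above can be exercised locally against a synthetic manifest. The column layout used here (timestamp, run id, workflow, job, log file) is an assumed example for illustration, not the documented `index.tsv` schema:

```shell
# Build a tiny stand-in for /opt/gitea-runner/logs/index.tsv (no SSH needed).
idx=$(mktemp)
printf '2025-01-01T00:00:00\t1786\tStaking Tests\ttest-staking-service\trun-1786.log\n' >> "$idx"
printf '2025-01-01T01:00:00\t1787\tStaking Tests\ttest-staking-service\trun-1787.log\n' >> "$idx"

# Same filter the ssh command runs remotely: exact match on column 2 (run id),
# printing only the log file column.
awk -F '\t' '$2 == "1787" {print $5}' "$idx"
```
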
If only workflow/job names are known:

```bash
ssh gitea-runner 'grep -i "production tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-production" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```

### 3. Read the Job Log Before the Runner Log

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log>.log'
```

### 4. Correlate With Runner State

```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```

### 5. Check for Resource Exhaustion Only if Indicated

```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
```

## Classification Rules

### Workflow Config Failure

Evidence patterns:

- script path not found
- wrong repo path
- wrong service/unit name
- wrong import target or startup command
- missing environment export

Default recommendation:

- patch the workflow with the smallest targeted fix

### Dependency / Packaging Failure

Evidence patterns:

- `ModuleNotFoundError`
- `ImportError`
- failed editable install
- Poetry package discovery failure
- missing pip/Node dependency in a lean CI setup

Default recommendation:

- add only the missing dependency when truly required
- otherwise fix the root cause in the import chain or packaging metadata

### Application / Test Failure

Evidence patterns:

- normal environment setup completes
- tests collect and run
- failure is an assertion or application traceback

Default recommendation:

- patch code or tests, not the runner

### Service Readiness Failure

Evidence patterns:

- health endpoint timeout
- process exits immediately
- server log shows a startup/config exception

Default recommendation:

- inspect service startup logs and verify host/path/port assumptions

### Runner / Infrastructure Failure

Evidence patterns:

- `oom-kill` in `journalctl`
- runner daemon restart loop
- truncated logs across unrelated workflows
- disk exhaustion or temp space errors

Default recommendation:

- treat as a runner capacity/stability issue only when the evidence is direct

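The classes above can be approximated by a first-pass triage function that greps the tail of a job log for the signature patterns. This is a heuristic sketch only; the pattern list and branch order are assumptions, and a candidate class must always be confirmed against the actual evidence before reporting.

```shell
# Hedged first-pass classifier: print a candidate failure class for a log
# file. Branch order matters: infrastructure signals are checked first so an
# OOM-killed test run is not misread as an application failure.
classify_log() {
  log="$1"
  if grep -qiE 'oom-kill|out of memory|no space left on device' "$log"; then
    echo runner_infrastructure
  elif grep -qiE 'ModuleNotFoundError|ImportError' "$log"; then
    echo dependency_packaging
  elif grep -qiE 'no such file or directory|command not found' "$log"; then
    echo workflow_config
  elif grep -qiE 'connection refused|health.*timed out' "$log"; then
    echo service_readiness
  elif grep -qiE 'assertionerror|=+ FAILURES =+' "$log"; then
    echo application_test
  else
    echo unknown
  fi
}
```
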
## Decision Heuristics

- Prefer the job log over `journalctl` for code/workflow failures
- Prefer the smallest fix that explains all of the evidence
- Do not suggest restarting the runner unless the user asks or the runner is clearly unhealthy
- Ignore internal `task <id>` values for workflow naming or file lookup
- If `/opt/gitea-runner/logs` is missing a run, check whether the workflow had the logging initializer at that time

## Debug Suggestion Template

When reporting back, use this structure:

### Failure Class

`<workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown>`

### Root Cause

One sentence describing the most likely issue.

### Evidence

- `<specific log line>`
- `<specific log line>`
- `<runner health correlation if relevant>`

### Minimal Fix

One focused change that addresses the root cause.

### Optional Follow-up

- `<verification step>`
- `<secondary diagnostic if needed>`

### Confidence

`low | medium | high`

## Safety Constraints

- Read-only first
- No service restarts without explicit user approval
- No deletion of runner files during diagnosis
- Do not conflate application tracebacks with runner instability

## Fast First-Pass Bundle

```bash
ssh gitea-runner '
  echo "=== latest runs ===";
  tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
  echo "=== latest log ===";
  tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
  echo "=== runner service ===";
  systemctl status gitea-runner --no-pager | tail -n 40 || true;
  echo "=== runner journal ===";
  journalctl -u gitea-runner -n 80 --no-pager || true
'
```

## Related Assets

- `.windsurf/workflows/gitea-runner-ci-debug.md`
- `scripts/ci/setup-job-logging.sh`

.windsurf/workflows/gitea-runner-ci-debug.md (228 lines, new file)

@@ -0,0 +1,228 @@
---
description: SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions
---

# Gitea Runner CI Debug Workflow

## Purpose

Use this workflow when a Gitea Actions job fails and you need Windsurf to:

- SSH to `gitea-runner`
- locate the most relevant CI log files
- inspect runner health and runner-side failures
- separate workflow/application failures from runner/infrastructure failures
- produce actionable debug suggestions with evidence

## Key Environment Facts

- The actual runner host is reachable via `ssh gitea-runner`
- The runner service is `gitea-runner.service`
- The runner binary is `/opt/gitea-runner/act_runner`
- Gitea Actions on this runner behaves like a GitHub-compatibility layer
- Prefer `GITHUB_RUN_ID` and `GITHUB_RUN_NUMBER`, not `GITEA_RUN_ID`
- Internal runner `task <id>` messages in `journalctl` are useful for runner debugging, but are not stable workflow-facing identifiers
- CI job logs created by the reusable logging wrapper live under `/opt/gitea-runner/logs`

## Safety Rules

- Start with read-only inspection only
- Do not restart the runner or mutate files unless the user explicitly asks
- Prefer scoped log reads over dumping entire files
- If a failure is clearly application-level, stop proposing runner changes

## Primary Log Sources

### Job Logs

- `/opt/gitea-runner/logs/index.tsv`
- `/opt/gitea-runner/logs/latest.log`
- `/opt/gitea-runner/logs/latest-<workflow>.log`
- `/opt/gitea-runner/logs/latest-<workflow>-<job>.log`

### Runner Logs

- `journalctl -u gitea-runner`
- `/opt/gitea-runner/runner.log`
- `systemctl status gitea-runner --no-pager`

## Workflow Steps

### Step 1: Confirm Runner Reachability

```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```

Expected outcome:

- host is `gitea-runner`
- user is usually `root`
- service is `active`

### Step 2: Find Candidate CI Logs

Start by listing the log directory and the most recent indexed runs.

```bash
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

If you know the run id:

```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```

If you know the workflow/job name:

```bash
ssh gitea-runner 'grep -i "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```

### Step 3: Read the Most Relevant Job Log

After identifying the file path from `index.tsv`, inspect the tail first.

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log-file>.log'
```

If `latest.log` already matches the failing run:

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

### Step 4: Correlate With Runner Health

Only do this after reading the job log, so you do not confuse test failures with runner failures.

```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```

### Step 5: Check for Infrastructure Pressure

Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.

```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | grep -i -E "oom|killed|failed|panic|error"'
```

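Interpreting the `df` output above can itself be scripted. The sketch below flags filesystems at or above a 90% usage threshold; the threshold is an arbitrary example value, and a fixed sample of `df -P`-style output is used so the illustration stays deterministic.

```shell
# Flag near-full filesystems from df -P-style output.
# Column 5 is the Capacity percentage, column 6 the mount point.
df_sample='Filesystem 1024-blocks Used Available Capacity Mounted
/dev/sda1 100 95 5 95% /var
/dev/sdb1 100 10 90 10% /opt'

printf '%s\n' "$df_sample" | awk 'NR > 1 {
  use = $5; sub("%", "", use)
  if (use + 0 >= 90) print $6 " is " use "% used"
}'
```
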
### Step 6: Classify the Failure

Use the evidence to classify the failure into one of these buckets.

#### A. Workflow / Config Regression

Typical evidence:

- missing script path
- wrong workspace path
- wrong import target
- wrong service name
- bad YAML logic

Typical fixes:

- patch the workflow
- correct repo-relative paths
- fix `PYTHONPATH`, script invocation, or job dependencies

#### B. Dependency / Packaging Failure

Typical evidence:

- `ModuleNotFoundError`
- editable install failure
- Poetry/pyproject packaging errors
- missing test/runtime packages

Typical fixes:

- add the minimal missing dependency
- avoid broadening installs unnecessarily
- fix package metadata only if the install is actually required

#### C. Application / Test Failure

Typical evidence:

- assertion failures
- application tracebacks after setup completes
- service starts but endpoint behavior is wrong

Typical fixes:

- patch code or tests
- address the real failing import chain or runtime logic

#### D. Service Readiness / Integration Failure

Typical evidence:

- health-check timeout
- `curl` connection refused
- server never starts
- dependent services unavailable

Typical fixes:

- inspect service logs
- fix the startup command or environment
- ensure readiness probes hit the correct host/path

#### E. Runner / Infrastructure Failure

Typical evidence:

- `oom-kill` in `journalctl`
- runner daemon restart loop
- disk full or temp space exhaustion
- SSH reachable but job logs end abruptly

Typical fixes:

- reduce the CI memory footprint
- split large jobs
- investigate runner/container resource limits
- only restart the runner if explicitly requested

## Analysis Heuristics

### Prefer the Smallest Plausible Root Cause

Do not blame the runner for a clean Python traceback in a job log.

### Use Job Logs Before Runner Logs

Job logs usually explain application/workflow failures better than runner logs.

### Treat OOM as a Runner Problem Only With Evidence

Look for `oom-kill`, `killed process`, or abrupt job termination without a normal traceback.

### Distinguish Missing Logs From Missing Logging

If `/opt/gitea-runner/logs` does not contain the run you want, verify whether the workflow had the logging initializer yet.

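The missing-logs-vs-missing-logging check can be sketched as a quick index lookup: a run id that never appears in `index.tsv` most likely predates the logging initializer rather than indicating lost logs. The snippet uses a temp-file stand-in for `/opt/gitea-runner/logs/index.tsv` (with an assumed column layout where column 2 is the run id) so it can run anywhere.

```shell
# Stand-in manifest containing only run 1787.
idx=$(mktemp)
printf '2025-01-01T01:00:00\t1787\tStaking Tests\ttest-staking-service\trun-1787.log\n' > "$idx"

# Exit 0 if the given run id appears anywhere in column 2 of the index.
check_indexed() {
  awk -F '\t' -v id="$1" '$2 == id {found = 1} END {exit !found}' "$idx"
}

check_indexed 1787 && echo "run 1787 is indexed"
check_indexed 1788 || echo "run 1788 is not indexed; the workflow may predate the logging initializer"
```
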
## Recommended Windsurf Output Format

When the investigation is complete, report findings in this structure:

```text
Failure class:
Root cause:
Evidence:
- <log line or command result>
- <log line or command result>
Why this is the likely cause:
Minimal fix:
Optional follow-up checks:
Confidence: <low|medium|high>
```

## Quick Command Bundle

Use this bundle when you need a fast first pass.

```bash
ssh gitea-runner '
  echo "=== service ===";
  systemctl is-active gitea-runner;
  echo "=== latest indexed runs ===";
  tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
  echo "=== latest job log ===";
  tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
  echo "=== runner journal ===";
  journalctl -u gitea-runner -n 80 --no-pager || true
'
```

## Escalation Guidance

Escalate to a deeper infrastructure review when:

- the runner repeatedly shows `oom-kill`
- job logs are truncated across unrelated workflows
- the runner daemon is flapping
- disk or tmp space is exhausted
- the same failure occurs across multiple independent workflows without a shared code change

## Related Files

- `/opt/aitbc/scripts/ci/setup-job-logging.sh`
- `/opt/aitbc/.gitea/workflows/staking-tests.yml`
- `/opt/aitbc/.gitea/workflows/production-tests.yml`
- `/opt/aitbc/.gitea/workflows/systemd-sync.yml`