Files
aitbc/.windsurf/workflows/gitea-runner-ci-debug.md
aitbc 92f175c54f docs: improve gitea-runner CI debug workflow with ripgrep and failure markers
- Added ripgrep (rg) usage notes and preference over grep for targeted searches
- Updated log discovery examples to use rg with --fixed-strings for workflow/job name searches
- Added failure-marker search pattern (|Traceback|FAILED|etc.) for quick issue identification
- Replaced grep with rg in runner health checks (dmesg, journalctl)
- Added failure marker search to quick-start one-liner
- Preserved awk usage for tab
2026-04-20 13:21:19 +02:00

240 lines
7.9 KiB
Markdown

---
description: SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions
---
# Gitea Runner CI Debug Workflow
## Purpose
Use this workflow when a Gitea Actions job fails and you need Windsurf to:
- SSH to `gitea-runner`
- locate the most relevant CI log files
- inspect runner health and runner-side failures
- separate workflow/application failures from runner/infrastructure failures
- produce actionable debug suggestions with evidence
## Key Environment Facts
- The actual runner host is reachable via `ssh gitea-runner`
- The runner service is `gitea-runner.service`
- The runner binary is `/opt/gitea-runner/act_runner`
- Gitea Actions on this runner behaves like a GitHub-compatibility layer
- Prefer `GITHUB_RUN_ID` and `GITHUB_RUN_NUMBER`, not `GITEA_RUN_ID`
- Internal runner `task <id>` messages in `journalctl` are useful for runner debugging, but are not stable workflow-facing identifiers
- CI job logs created by the reusable logging wrapper live under `/opt/gitea-runner/logs`
- `rg` is installed on `gitea-runner`; prefer it over `grep` for targeted log discovery and failure-marker searches
## Safety Rules
- Start with read-only inspection only
- Do not restart the runner or mutate files unless the user explicitly asks
- Prefer scoped log reads over dumping entire files
- If a failure is clearly application-level, stop proposing runner changes
## Primary Log Sources
### Job Logs
- `/opt/gitea-runner/logs/index.tsv`
- `/opt/gitea-runner/logs/latest.log`
- `/opt/gitea-runner/logs/latest-<workflow>.log`
- `/opt/gitea-runner/logs/latest-<workflow>-<job>.log`
### Runner Logs
- `journalctl -u gitea-runner`
- `/opt/gitea-runner/runner.log`
- `systemctl status gitea-runner --no-pager`
## Workflow Steps
### Step 1: Confirm Runner Reachability
```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```
Expected outcome:
- host is `gitea-runner`
- user is usually `root`
- service is `active`
### Step 2: Find Candidate CI Logs
If you know the workflow or job name, start there.
```bash
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'rg -n --fixed-strings "Production Tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'rg -n --fixed-strings "test-production" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```
If you know the run id, keep using `awk` because `index.tsv` is tab-separated and you want an exact column match:
```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```
If you know the workflow/job name:
```bash
ssh gitea-runner 'rg -n -i --fixed-strings "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'rg -n -i --fixed-strings "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```
### Step 3: Read the Most Relevant Job Log
After identifying the file path from `index.tsv`, inspect the tail first.
```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log-file>.log'
```
If `latest.log` already matches the failing run:
```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```
For a fast failure-marker pass inside a resolved log file:
```bash
ssh gitea-runner 'rg -n "❌|Traceback|FAILED|FAILURES|ModuleNotFoundError|AssertionError|not ready|oom|Killed" /opt/gitea-runner/logs/<resolved-log-file>.log'
```
### Step 4: Correlate With Runner Health
Only do this after reading the job log, so you do not confuse test failures with runner failures.
```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```
### Step 5: Check for Infrastructure Pressure
Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.
```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | rg -i "oom|out of memory|killed process" | tail -n 50'
ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | rg -i "oom|killed|failed|panic|error"'
```
### Step 6: Classify the Failure
Use the evidence to classify the failure into one of these buckets.
#### A. Workflow / Config Regression
Typical evidence:
- missing script path
- wrong workspace path
- wrong import target
- wrong service name
- bad YAML logic
Typical fixes:
- patch the workflow
- correct repo-relative paths
- fix `PYTHONPATH`, script invocation, or job dependencies
#### B. Dependency / Packaging Failure
Typical evidence:
- `ModuleNotFoundError`
- editable install failure
- Poetry/pyproject packaging errors
- missing test/runtime packages
Typical fixes:
- add the minimal missing dependency
- avoid broadening installs unnecessarily
- fix package metadata only if the install is actually required
#### C. Application / Test Failure
Typical evidence:
- assertion failures
- application tracebacks after setup completes
- service starts but endpoint behavior is wrong
Typical fixes:
- patch code or tests
- address the real failing import chain or runtime logic
#### D. Service Readiness / Integration Failure
Typical evidence:
- health-check timeout
- `curl` connection refused
- server never starts
- dependent services unavailable
Typical fixes:
- inspect service logs
- fix startup command or environment
- ensure readiness probes hit the correct host/path
#### E. Runner / Infrastructure Failure
Typical evidence:
- `oom-kill` in `journalctl`
- runner daemon restart loop
- disk full or temp space exhaustion
- SSH reachable but job logs end abruptly
Typical fixes:
- reduce CI memory footprint
- split large jobs
- investigate runner/container resource limits
- only restart runner if explicitly requested
## Analysis Heuristics
### Prefer the Smallest Plausible Root Cause
Do not blame the runner for a clean Python traceback in a job log.
### Use Job Logs Before Runner Logs
Job logs usually explain application/workflow failures better than runner logs.
### Treat OOM as a Runner Problem Only With Evidence
Look for `oom-kill`, `killed process`, or abrupt job termination without a normal traceback.
### Distinguish Missing Logs From Missing Logging
If `/opt/gitea-runner/logs` does not contain the run you want, verify whether the workflow had the logging initializer yet.
## Recommended Windsurf Output Format
When the investigation is complete, report findings in this structure:
```text
Failure class:
Root cause:
Evidence:
- <log line or command result>
- <log line or command result>
Why this is the likely cause:
Minimal fix:
Optional follow-up checks:
Confidence: <low|medium|high>
```
## Quick Command Bundle
Use this bundle when you need a fast first pass.
```bash
ssh gitea-runner '
echo "=== service ===";
systemctl is-active gitea-runner;
echo "=== latest indexed runs ===";
tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
echo "=== latest job log ===";
tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
echo "=== latest job markers ===";
rg -n "❌|Traceback|FAILED|FAILURES|ModuleNotFoundError|AssertionError|not ready|oom|Killed" /opt/gitea-runner/logs/latest.log 2>/dev/null | tail -n 40 || true;
echo "=== runner journal ===";
journalctl -u gitea-runner -n 80 --no-pager || true
'
```
## Escalation Guidance
Escalate to a deeper infrastructure review when:
- the runner repeatedly shows `oom-kill`
- job logs are truncated across unrelated workflows
- the runner daemon is flapping
- disk or tmp space is exhausted
- the same failure occurs across multiple independent workflows without a shared code change
## Related Files
- `/opt/aitbc/scripts/ci/setup-job-logging.sh`
- `/opt/aitbc/.gitea/workflows/staking-tests.yml`
- `/opt/aitbc/.gitea/workflows/production-tests.yml`
- `/opt/aitbc/.gitea/workflows/systemd-sync.yml`