Added comprehensive documentation for autonomous investigation of failed Gitea Actions runs via SSH access to gitea-runner host. Includes log location mapping, classification heuristics for distinguishing workflow/dependency/application/service/infrastructure failures, and evidence-based debug suggestion templates. Provides read-only investigation sequences with safety constraints to prevent conflating application failures with runner inst
7.0 KiB
description
| description |
|---|
| SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions |
Gitea Runner CI Debug Workflow
Purpose
Use this workflow when a Gitea Actions job fails and you need Windsurf to:
- SSH to
gitea-runner - locate the most relevant CI log files
- inspect runner health and runner-side failures
- separate workflow/application failures from runner/infrastructure failures
- produce actionable debug suggestions with evidence
Key Environment Facts
- The actual runner host is reachable via
ssh gitea-runner - The runner service is
gitea-runner.service - The runner binary is
/opt/gitea-runner/act_runner - Gitea Actions on this runner behaves like a GitHub-compatibility layer
- Prefer
GITHUB_RUN_IDandGITHUB_RUN_NUMBER, notGITEA_RUN_ID - Internal runner
task <id>messages injournalctlare useful for runner debugging, but are not stable workflow-facing identifiers - CI job logs created by the reusable logging wrapper live under
/opt/gitea-runner/logs
Safety Rules
- Start with read-only inspection only
- Do not restart the runner or mutate files unless the user explicitly asks
- Prefer scoped log reads over dumping entire files
- If a failure is clearly application-level, stop proposing runner changes
Primary Log Sources
Job Logs
/opt/gitea-runner/logs/index.tsv/opt/gitea-runner/logs/latest.log/opt/gitea-runner/logs/latest-<workflow>.log/opt/gitea-runner/logs/latest-<workflow>-<job>.log
Runner Logs
journalctl -u gitea-runner/opt/gitea-runner/runner.logsystemctl status gitea-runner --no-pager
Workflow Steps
Step 1: Confirm Runner Reachability
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
Expected outcome:
- host is
gitea-runner - user is usually
root - service is
active
Step 2: Find Candidate CI Logs
If you know the workflow or job name, start there.
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
If you know the run id:
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
If you know the workflow/job name:
ssh gitea-runner 'grep -i "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
Step 3: Read the Most Relevant Job Log
After identifying the file path from index.tsv, inspect the tail first.
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log-file>.log'
If latest.log already matches the failing run:
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
Step 4: Correlate With Runner Health
Only do this after reading the job log, so you do not confuse test failures with runner failures.
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
Step 5: Check for Infrastructure Pressure
Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | grep -i -E "oom|killed|failed|panic|error"'
Step 6: Classify the Failure
Use the evidence to classify the failure into one of these buckets.
A. Workflow / Config Regression
Typical evidence:
- missing script path
- wrong workspace path
- wrong import target
- wrong service name
- bad YAML logic
Typical fixes:
- patch the workflow
- correct repo-relative paths
- fix
PYTHONPATH, script invocation, or job dependencies
B. Dependency / Packaging Failure
Typical evidence:
ModuleNotFoundError- editable install failure
- Poetry/pyproject packaging errors
- missing test/runtime packages
Typical fixes:
- add the minimal missing dependency
- avoid broadening installs unnecessarily
- fix package metadata only if the install is actually required
C. Application / Test Failure
Typical evidence:
- assertion failures
- application tracebacks after setup completes
- service starts but endpoint behavior is wrong
Typical fixes:
- patch code or tests
- address the real failing import chain or runtime logic
D. Service Readiness / Integration Failure
Typical evidence:
- health-check timeout
curlconnection refused- server never starts
- dependent services unavailable
Typical fixes:
- inspect service logs
- fix startup command or environment
- ensure readiness probes hit the correct host/path
E. Runner / Infrastructure Failure
Typical evidence:
oom-killinjournalctl- runner daemon restart loop
- disk full or temp space exhaustion
- SSH reachable but job logs end abruptly
Typical fixes:
- reduce CI memory footprint
- split large jobs
- investigate runner/container resource limits
- only restart runner if explicitly requested
Analysis Heuristics
Prefer the Smallest Plausible Root Cause
Do not blame the runner for a clean Python traceback in a job log.
Use Job Logs Before Runner Logs
Job logs usually explain application/workflow failures better than runner logs.
Treat OOM as a Runner Problem Only With Evidence
Look for oom-kill, killed process, or abrupt job termination without a normal traceback.
Distinguish Missing Logs From Missing Logging
If /opt/gitea-runner/logs does not contain the run you want, verify whether the workflow had the logging initializer yet.
Recommended Windsurf Output Format
When the investigation is complete, report findings in this structure:
Failure class:
Root cause:
Evidence:
- <log line or command result>
- <log line or command result>
Why this is the likely cause:
Minimal fix:
Optional follow-up checks:
Confidence: <low|medium|high>
Quick Command Bundle
Use this bundle when you need a fast first pass.
ssh gitea-runner '
echo "=== service ===";
systemctl is-active gitea-runner;
echo "=== latest indexed runs ===";
tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
echo "=== latest job log ===";
tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
echo "=== runner journal ===";
journalctl -u gitea-runner -n 80 --no-pager || true
'
Escalation Guidance
Escalate to a deeper infrastructure review when:
- the runner repeatedly shows
oom-kill - job logs are truncated across unrelated workflows
- the runner daemon is flapping
- disk or tmp space is exhausted
- the same failure occurs across multiple independent workflows without a shared code change
Related Files
/opt/aitbc/scripts/ci/setup-job-logging.sh/opt/aitbc/.gitea/workflows/staking-tests.yml/opt/aitbc/.gitea/workflows/production-tests.yml/opt/aitbc/.gitea/workflows/systemd-sync.yml