docs: add gitea-runner SSH-based CI log debugging skill and workflow
Added comprehensive documentation for autonomous investigation of failed Gitea Actions runs via SSH access to the gitea-runner host. Includes log location mapping, classification heuristics for distinguishing workflow/dependency/application/service/infrastructure failures, and evidence-based debug suggestion templates. Provides read-only investigation sequences with safety constraints to prevent conflating application failures with runner instability.
This commit is contained in:
.windsurf/skills/gitea-runner-log-debugger.md (211 lines, new file)

@@ -0,0 +1,211 @@
---
description: Autonomous skill for SSH-based investigation of gitea-runner CI logs, runner health, and root-cause-oriented debug guidance
title: Gitea Runner Log Debugger
version: 1.0
---

# Gitea Runner Log Debugger Skill

## Purpose

Use this skill to diagnose failed Gitea Actions runs by connecting to `gitea-runner`, reading CI log files, correlating them with runner health, and producing targeted debug suggestions.

## Activation

Activate this skill when:

- a Gitea workflow fails and the UI log is incomplete or inconvenient
- Windsurf needs direct access to runner-side CI logs
- you need to distinguish workflow failures from runner failures
- you need evidence-backed debug suggestions instead of generic guesses
- a job appears to fail because of OOM, restart loops, path mismatches, or missing dependencies

## Known Environment Facts

- Runner host: `ssh gitea-runner`
- Runner service: `gitea-runner.service`
- Runner binary: `/opt/gitea-runner/act_runner`
- Persistent CI logs: `/opt/gitea-runner/logs`
- Indexed log manifest: `/opt/gitea-runner/logs/index.tsv`
- Latest log symlink: `/opt/gitea-runner/logs/latest.log`
- Gitea Actions on this runner exposes GitHub-compatible runtime variables, so `GITHUB_RUN_ID` is the correct run identifier to prefer over `GITEA_RUN_ID`

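The run-identifier preference above can be sketched as a small shell fallback chain. The fallback order and the `unknown` sentinel are illustrative assumptions; on this runner `GITHUB_RUN_ID` is expected to be populated.

```shell
# Resolve the run identifier as the facts above recommend:
# GITHUB_RUN_ID first, GITEA_RUN_ID as a fallback, then a sentinel.
RUN_ID="${GITHUB_RUN_ID:-${GITEA_RUN_ID:-unknown}}"
echo "resolved run id: ${RUN_ID}"
```
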
## Inputs

### Minimum Input

- failing workflow name, job name, or pasted error output

### Best Input

```json
{
  "workflow_name": "Staking Tests",
  "job_name": "test-staking-service",
  "run_id": "1787",
  "symptoms": [
    "ModuleNotFoundError: No module named click"
  ],
  "needs_runner_health_check": true
}
```

## Expected Outputs

```json
{
  "failure_class": "workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown",
  "root_cause": "string",
  "evidence": ["string"],
  "minimal_fix": "string",
  "follow_up_checks": ["string"],
  "confidence": "low | medium | high"
}
```

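As a hedged illustration only, a completed output for the `ModuleNotFoundError` case from the Best Input example might look like this (all values are invented for illustration, not taken from a real run):

```json
{
  "failure_class": "dependency_packaging",
  "root_cause": "The click package is missing from the lean CI environment used by test-staking-service.",
  "evidence": ["ModuleNotFoundError: No module named click"],
  "minimal_fix": "Add click to the job's dependency install step only if the import is genuinely required.",
  "follow_up_checks": ["re-run the job and confirm the import succeeds"],
  "confidence": "medium"
}
```
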
## Investigation Sequence

### 1. Connect and Verify Runner

```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```

### 2. Locate Relevant CI Logs

Prefer indexed job logs first.

```bash
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

If a run id is known:

```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```

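The `awk` filter above can be exercised locally against a synthetic manifest. The column layout used here (timestamp, run id, workflow, job, log file) is an assumed example for illustration, not the documented `index.tsv` schema:

```shell
# Build a tiny stand-in for /opt/gitea-runner/logs/index.tsv (no SSH needed).
idx=$(mktemp)
printf '2025-01-01T00:00:00\t1786\tStaking Tests\ttest-staking-service\trun-1786.log\n' >> "$idx"
printf '2025-01-01T01:00:00\t1787\tStaking Tests\ttest-staking-service\trun-1787.log\n' >> "$idx"

# Same filter the ssh command runs remotely: exact match on column 2 (run id),
# printing only the log file column.
awk -F '\t' '$2 == "1787" {print $5}' "$idx"
```
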
If only workflow/job names are known:

```bash
ssh gitea-runner 'grep -i "production tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-production" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```

### 3. Read the Job Log Before the Runner Log

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log>.log'
```

### 4. Correlate With Runner State

```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```

### 5. Check for Resource Exhaustion Only if Indicated

```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
```

## Classification Rules

### Workflow Config Failure

Evidence patterns:

- script path not found
- wrong repo path
- wrong service/unit name
- wrong import target or startup command
- missing environment export

Default recommendation:

- patch the workflow with the smallest targeted fix

### Dependency / Packaging Failure

Evidence patterns:

- `ModuleNotFoundError`
- `ImportError`
- failed editable install
- Poetry package discovery failure
- missing pip/Node dependency in a lean CI setup

Default recommendation:

- add only the missing dependency when truly required
- otherwise fix the root cause in the import chain or packaging metadata

### Application / Test Failure

Evidence patterns:

- normal environment setup completes
- tests collect and run
- failure is an assertion or application traceback

Default recommendation:

- patch code or tests, not the runner

### Service Readiness Failure

Evidence patterns:

- health endpoint timeout
- process exits immediately
- server log shows a startup/config exception

Default recommendation:

- inspect service startup logs and verify host/path/port assumptions

### Runner / Infrastructure Failure

Evidence patterns:

- `oom-kill` in `journalctl`
- runner daemon restart loop
- truncated logs across unrelated workflows
- disk exhaustion or temp space errors

Default recommendation:

- treat as a runner capacity/stability issue only when the evidence is direct

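The classes above can be approximated by a first-pass triage function that greps the tail of a job log for the signature patterns. This is a heuristic sketch only; the pattern list and branch order are assumptions, and a candidate class must always be confirmed against the actual evidence before reporting.

```shell
# Hedged first-pass classifier: print a candidate failure class for a log
# file. Branch order matters: infrastructure signals are checked first so an
# OOM-killed test run is not misread as an application failure.
classify_log() {
  log="$1"
  if grep -qiE 'oom-kill|out of memory|no space left on device' "$log"; then
    echo runner_infrastructure
  elif grep -qiE 'ModuleNotFoundError|ImportError' "$log"; then
    echo dependency_packaging
  elif grep -qiE 'no such file or directory|command not found' "$log"; then
    echo workflow_config
  elif grep -qiE 'connection refused|health.*timed out' "$log"; then
    echo service_readiness
  elif grep -qiE 'assertionerror|=+ FAILURES =+' "$log"; then
    echo application_test
  else
    echo unknown
  fi
}
```
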
## Decision Heuristics

- Prefer the job log over `journalctl` for code/workflow failures
- Prefer the smallest fix that explains all of the evidence
- Do not suggest restarting the runner unless the user asks or the runner is clearly unhealthy
- Ignore internal `task <id>` values for workflow naming or file lookup
- If `/opt/gitea-runner/logs` is missing a run, check whether the workflow had the logging initializer at that time

## Debug Suggestion Template

When reporting back, use this structure:

### Failure Class

`<workflow_config | dependency_packaging | application_test | service_readiness | runner_infrastructure | unknown>`

### Root Cause

One sentence describing the most likely issue.

### Evidence

- `<specific log line>`
- `<specific log line>`
- `<runner health correlation if relevant>`

### Minimal Fix

One focused change that addresses the root cause.

### Optional Follow-up

- `<verification step>`
- `<secondary diagnostic if needed>`

### Confidence

`low | medium | high`

## Safety Constraints

- Read-only first
- No service restarts without explicit user approval
- No deletion of runner files during diagnosis
- Do not conflate application tracebacks with runner instability

## Fast First-Pass Bundle

```bash
ssh gitea-runner '
  echo "=== latest runs ===";
  tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
  echo "=== latest log ===";
  tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
  echo "=== runner service ===";
  systemctl status gitea-runner --no-pager | tail -n 40 || true;
  echo "=== runner journal ===";
  journalctl -u gitea-runner -n 80 --no-pager || true
'
```

## Related Assets

- `.windsurf/workflows/gitea-runner-ci-debug.md`
- `scripts/ci/setup-job-logging.sh`

.windsurf/workflows/gitea-runner-ci-debug.md (228 lines, new file)

@@ -0,0 +1,228 @@
---
description: SSH to gitea-runner, inspect CI job logs, correlate runner health, and produce root-cause-focused debug suggestions
---

# Gitea Runner CI Debug Workflow

## Purpose

Use this workflow when a Gitea Actions job fails and you need Windsurf to:

- SSH to `gitea-runner`
- locate the most relevant CI log files
- inspect runner health and runner-side failures
- separate workflow/application failures from runner/infrastructure failures
- produce actionable debug suggestions with evidence

## Key Environment Facts

- The actual runner host is reachable via `ssh gitea-runner`
- The runner service is `gitea-runner.service`
- The runner binary is `/opt/gitea-runner/act_runner`
- Gitea Actions on this runner behaves like a GitHub-compatibility layer
- Prefer `GITHUB_RUN_ID` and `GITHUB_RUN_NUMBER`, not `GITEA_RUN_ID`
- Internal runner `task <id>` messages in `journalctl` are useful for runner debugging, but are not stable workflow-facing identifiers
- CI job logs created by the reusable logging wrapper live under `/opt/gitea-runner/logs`

## Safety Rules

- Start with read-only inspection only
- Do not restart the runner or mutate files unless the user explicitly asks
- Prefer scoped log reads over dumping entire files
- If a failure is clearly application-level, stop proposing runner changes

## Primary Log Sources

### Job Logs

- `/opt/gitea-runner/logs/index.tsv`
- `/opt/gitea-runner/logs/latest.log`
- `/opt/gitea-runner/logs/latest-<workflow>.log`
- `/opt/gitea-runner/logs/latest-<workflow>-<job>.log`

### Runner Logs

- `journalctl -u gitea-runner`
- `/opt/gitea-runner/runner.log`
- `systemctl status gitea-runner --no-pager`

## Workflow Steps

### Step 1: Confirm Runner Reachability

```bash
ssh gitea-runner 'hostname; whoami; systemctl is-active gitea-runner'
```

Expected outcome:

- host is `gitea-runner`
- user is usually `root`
- service is `active`

### Step 2: Find Candidate CI Logs

Start by listing the log directory and the most recent indexed runs.

```bash
ssh gitea-runner 'ls -lah /opt/gitea-runner/logs'
ssh gitea-runner 'tail -n 20 /opt/gitea-runner/logs/index.tsv'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

If you know the run id:

```bash
ssh gitea-runner "awk -F '\t' '\$2 == \"1787\" {print}' /opt/gitea-runner/logs/index.tsv"
```

If you know the workflow/job name:

```bash
ssh gitea-runner 'grep -i "staking tests" /opt/gitea-runner/logs/index.tsv | tail -n 20'
ssh gitea-runner 'grep -i "test-staking-service" /opt/gitea-runner/logs/index.tsv | tail -n 20'
```

### Step 3: Read the Most Relevant Job Log

After identifying the file path from `index.tsv`, inspect the tail first.

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/<resolved-log-file>.log'
```

If `latest.log` already matches the failing run:

```bash
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/logs/latest.log'
```

### Step 4: Correlate With Runner Health

Only do this after reading the job log, so you do not confuse test failures with runner failures.

```bash
ssh gitea-runner 'systemctl status gitea-runner --no-pager'
ssh gitea-runner 'journalctl -u gitea-runner -n 200 --no-pager'
ssh gitea-runner 'tail -n 200 /opt/gitea-runner/runner.log'
```

### Step 5: Check for Infrastructure Pressure

Use these when the log suggests abrupt termination, hanging setup, missing containers, or unexplained exits.

```bash
ssh gitea-runner 'free -h; df -h /opt /var /tmp'
ssh gitea-runner 'dmesg -T | grep -i -E "oom|out of memory|killed process" | tail -n 50'
ssh gitea-runner 'journalctl -u gitea-runner --since "2 hours ago" --no-pager | grep -i -E "oom|killed|failed|panic|error"'
```

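Interpreting the `df` output above can itself be scripted. The sketch below flags filesystems at or above a 90% usage threshold; the threshold is an arbitrary example value, and a fixed sample of `df -P`-style output is used so the illustration stays deterministic.

```shell
# Flag near-full filesystems from df -P-style output.
# Column 5 is the Capacity percentage, column 6 the mount point.
df_sample='Filesystem 1024-blocks Used Available Capacity Mounted
/dev/sda1 100 95 5 95% /var
/dev/sdb1 100 10 90 10% /opt'

printf '%s\n' "$df_sample" | awk 'NR > 1 {
  use = $5; sub("%", "", use)
  if (use + 0 >= 90) print $6 " is " use "% used"
}'
```
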
### Step 6: Classify the Failure

Use the evidence to classify the failure into one of these buckets.

#### A. Workflow / Config Regression

Typical evidence:

- missing script path
- wrong workspace path
- wrong import target
- wrong service name
- bad YAML logic

Typical fixes:

- patch the workflow
- correct repo-relative paths
- fix `PYTHONPATH`, script invocation, or job dependencies

#### B. Dependency / Packaging Failure

Typical evidence:

- `ModuleNotFoundError`
- editable install failure
- Poetry/pyproject packaging errors
- missing test/runtime packages

Typical fixes:

- add the minimal missing dependency
- avoid broadening installs unnecessarily
- fix package metadata only if the install is actually required

#### C. Application / Test Failure

Typical evidence:

- assertion failures
- application tracebacks after setup completes
- service starts but endpoint behavior is wrong

Typical fixes:

- patch code or tests
- address the real failing import chain or runtime logic

#### D. Service Readiness / Integration Failure

Typical evidence:

- health-check timeout
- `curl` connection refused
- server never starts
- dependent services unavailable

Typical fixes:

- inspect service logs
- fix the startup command or environment
- ensure readiness probes hit the correct host/path

#### E. Runner / Infrastructure Failure

Typical evidence:

- `oom-kill` in `journalctl`
- runner daemon restart loop
- disk full or temp space exhaustion
- SSH reachable but job logs end abruptly

Typical fixes:

- reduce the CI memory footprint
- split large jobs
- investigate runner/container resource limits
- only restart the runner if explicitly requested

## Analysis Heuristics

### Prefer the Smallest Plausible Root Cause

Do not blame the runner for a clean Python traceback in a job log.

### Use Job Logs Before Runner Logs

Job logs usually explain application/workflow failures better than runner logs.

### Treat OOM as a Runner Problem Only With Evidence

Look for `oom-kill`, `killed process`, or abrupt job termination without a normal traceback.

### Distinguish Missing Logs From Missing Logging

If `/opt/gitea-runner/logs` does not contain the run you want, verify whether the workflow had the logging initializer yet.

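The missing-logs-vs-missing-logging check can be sketched as a quick index lookup: a run id that never appears in `index.tsv` most likely predates the logging initializer rather than indicating lost logs. The snippet uses a temp-file stand-in for `/opt/gitea-runner/logs/index.tsv` (with an assumed column layout where column 2 is the run id) so it can run anywhere.

```shell
# Stand-in manifest containing only run 1787.
idx=$(mktemp)
printf '2025-01-01T01:00:00\t1787\tStaking Tests\ttest-staking-service\trun-1787.log\n' > "$idx"

# Exit 0 if the given run id appears anywhere in column 2 of the index.
check_indexed() {
  awk -F '\t' -v id="$1" '$2 == id {found = 1} END {exit !found}' "$idx"
}

check_indexed 1787 && echo "run 1787 is indexed"
check_indexed 1788 || echo "run 1788 is not indexed; the workflow may predate the logging initializer"
```
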
## Recommended Windsurf Output Format

When the investigation is complete, report findings in this structure:

```text
Failure class:
Root cause:
Evidence:
- <log line or command result>
- <log line or command result>
Why this is the likely cause:
Minimal fix:
Optional follow-up checks:
Confidence: <low|medium|high>
```

## Quick Command Bundle

Use this bundle when you need a fast first pass.

```bash
ssh gitea-runner '
  echo "=== service ===";
  systemctl is-active gitea-runner;
  echo "=== latest indexed runs ===";
  tail -n 10 /opt/gitea-runner/logs/index.tsv 2>/dev/null || true;
  echo "=== latest job log ===";
  tail -n 120 /opt/gitea-runner/logs/latest.log 2>/dev/null || true;
  echo "=== runner journal ===";
  journalctl -u gitea-runner -n 80 --no-pager || true
'
```

## Escalation Guidance

Escalate to a deeper infrastructure review when:

- the runner repeatedly shows `oom-kill`
- job logs are truncated across unrelated workflows
- the runner daemon is flapping
- disk or tmp space is exhausted
- the same failure occurs across multiple independent workflows without a shared code change

## Related Files

- `/opt/aitbc/scripts/ci/setup-job-logging.sh`
- `/opt/aitbc/.gitea/workflows/staking-tests.yml`
- `/opt/aitbc/.gitea/workflows/production-tests.yml`
- `/opt/aitbc/.gitea/workflows/systemd-sync.yml`