# Debugging Playbook
This is a collection of diagnostic checklists and debugging techniques for common issues in the AITBC system.

---
## 1. CLI Import Errors
**Symptom**: `aitbc` command crashes with `ImportError` or `ModuleNotFoundError`.
**Checklist**:
- [ ] Verify `apps/coordinator-api/src/app/services/__init__.py` exists.
- [ ] Check that `cli/aitbc_cli/commands/*` modules use correct relative imports.
- [ ] Ensure coordinator-api is importable; this command should exit without errors: `python -c "import sys; sys.path.append('apps/coordinator-api/src'); from app.services import trading_surveillance"`.
- [ ] Run `aitbc --help`; if the base CLI loads, the fault is in a command module.
- [ ] Look for absolute paths in command modules; replace them with package-relative ones.
**Common Fixes**: See failure-archive for the hardcoded path issue.
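
The importability check in the list above can also be scripted so that it reports the exact exception instead of a bare crash. This is a minimal sketch, not part of the AITBC tooling: `check_import` is a hypothetical helper, and the path and module names are copied from the checklist.

```python
import importlib
import sys
from pathlib import Path


def check_import(module_name, extra_path=None):
    """Try to import module_name; return (ok, message)."""
    if extra_path is not None:
        resolved = str(Path(extra_path).resolve())
        if resolved not in sys.path:
            sys.path.insert(0, resolved)
    try:
        importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Run from the repository root; names below come from the checklist.
    for name in ("app.services", "app.services.trading_surveillance"):
        ok, message = check_import(name, extra_path="apps/coordinator-api/src")
        print(f"{name}: {message}")
```
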

---
## 2. Coordinator API Won't Start
**Symptom**: `uvicorn app.main:app` fails or hangs.
**Checklist**:
- [ ] Check port 8000 availability (`lsof -i:8000`).
- [ ] Verify the database file exists or can be created under `apps/coordinator-api/data/`.
- [ ] Ensure the `pyproject.toml` dependencies are installed in the active venv.
- [ ] Check logs for specific exception (traceback).
- [ ] Verify `REDIS_URL` if using broadcast; Redis must be running.
**Common Causes**:
- Missing `aiohttp` or `sqlalchemy`
- Database locked or permission denied
- Redis not running (if used)
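
Where `lsof` is not available, the port check can be approximated from Python. The sketch below assumes the service listens on localhost; `port_in_use` is a hypothetical helper name, and it only tells you that something accepts connections, not which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8000 is the coordinator API port named in the checklist.
    print("port 8000 in use:", port_in_use(8000))
```
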
---
## 3. Blockchain Node Not Producing Blocks
**Symptom**: RPC `/status` shows `height` not increasing.
**Checklist**:
- [ ] Is the node process running? (`ps aux | grep blockchain`)
- [ ] Check logs for consensus errors or DB errors.
- [ ] Verify ports 8006 (RPC) and 8005 (P2P) are open.
- [ ] Ensure the wallet daemon is running on port 8015 (if needed for transactions).
- [ ] Confirm the network setup: are other peers connected? If running devnet, is the proposer account funded?
- [ ] Run `aitbc blockchain status` to see RPC response.
**Common Causes**:
- Not initialized (`scripts/devnet_up.sh` not executed)
- Genesis proposer has no funds
- P2P connectivity not established (check Redis for gossip)
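
To confirm the symptom rather than eyeball it, the height can be sampled a few times. This is a sketch under assumptions: the RPC is taken to be reachable at `http://127.0.0.1:8006/status` and to return JSON with a `height` field; both the URL shape and the helper names are illustrative, so adjust them to the actual response.

```python
import json
import time
import urllib.request


def fetch_height(rpc_url="http://127.0.0.1:8006/status", timeout=5.0):
    """Read `height` from the RPC status endpoint (JSON shape is assumed)."""
    with urllib.request.urlopen(rpc_url, timeout=timeout) as response:
        return int(json.load(response)["height"])


def height_is_advancing(get_height=fetch_height, polls=3, interval=5.0,
                        sleep=time.sleep):
    """Sample the height `polls` times; report whether it increased."""
    samples = []
    for attempt in range(polls):
        if attempt:
            sleep(interval)  # wait between samples, not before the first
        samples.append(get_height())
    return samples[-1] > samples[0], samples


if __name__ == "__main__":
    advancing, seen = height_is_advancing()
    print("advancing" if advancing else "stalled", seen)
```
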
---
## 4. AI Provider Job Fails with Payment Error
**Symptom**: Provider returns 403 or reports an insufficient balance.
**Checklist**:
- [ ] Did the buyer send funds first? `aitbc blockchain send ...` must precede the job request.
- [ ] Check provider's balance before/after; confirm expected amount transferred.
- [ ] Verify provider and buyer are on same network (ait-devnet).
- [ ] Ensure provider's wallet daemon is running (port 8015).
- [ ] Check that the coordinator job URL (`--marketplace-url`) is reachable.
**Resolution**: Follow the correct payment flow: the buyer sends the transaction, waits for confirmation, then sends `POST /job`.
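
The ordering in the resolution can be enforced in code so a job is never posted before the payment confirms. The sketch below only illustrates the sequence: `send_payment`, `is_confirmed`, and `submit_job` are hypothetical callables that the caller supplies (for example, wrappers around the CLI and the job endpoint); none of them is an existing AITBC API.

```python
import time


def pay_then_request(send_payment, is_confirmed, submit_job,
                     timeout=60.0, poll_interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Send payment, wait for confirmation, and only then submit the job."""
    tx_id = send_payment()
    deadline = clock() + timeout
    while not is_confirmed(tx_id):
        if clock() >= deadline:
            raise TimeoutError(f"payment {tx_id} not confirmed after {timeout}s")
        sleep(poll_interval)
    # Reached only after the transaction is confirmed.
    return submit_job(tx_id)
```
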

---
## 5. Gitea API Calls Fail (Transient)
**Symptom**: Scripts fail with connection reset, 502, etc.
**Checklist**:
- [ ] Is Gitea instance up? Can you `curl` the API?
- [ ] Check network connectivity and DNS.
- [ ] Add retry with exponential backoff (already in `monitor-prs.py`).
- [ ] If persistent, check Gitea logs for server-side issues.
**Temporary Workaround**: Wait and re-run the script manually.
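
For scripts that do not yet retry, a generic exponential backoff wrapper might look like the sketch below. It is not the code in `monitor-prs.py`; `retry_with_backoff` and its defaults are assumptions, and the exception types to retry on depend on the HTTP client in use, so pass them via `retry_on`.

```python
import random
import time


def retry_with_backoff(func, attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps parallel scripts from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```
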

---
## 6. Redis Pub/Sub Not Delivering Messages
**Symptom**: Agents don't receive broadcast messages.
**Checklist**:
- [ ] Is Redis running? `redis-cli ping` should return PONG.
- [ ] Check that all agents use the same `REDIS_URL`.
- [ ] Verify message channel names match exactly.
- [ ] Ensure agents are subscribed before messages are published.
- [ ] Use `redis-cli SUBSCRIBE <channel>` to debug manually.
**Note**: Redis pub/sub is used in development only; production will use direct P2P.

---
## 7. Starlette Import Errors After Upgrade
**Symptom**: `ImportError: cannot import name 'Broadcast'`.
**Fix**: Pin Starlette to `<0.38` as documented. Alternatively, refactor to use a different broadcast mechanism (future work).

---
## 8. Test Isolation Failures
**Symptom**: Tests pass individually but fail when run together.
**Checklist**:
- [ ] Look for shared resources (database files, ports, temporary files).
- [ ] Use fixtures with `scope="function"` and proper teardown.
- [ ] Clean up after each test: close DB connections, stop servers.
- [ ] Avoid global state; inject dependencies.
**Action**: Refactor tests to be hermetic.
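
One way to make a test hermetic is to give it a private database that is always torn down. The sketch below uses only the standard library and assumes a SQLite file database; `isolated_db` is a hypothetical helper. In pytest, the same body could sit inside a function-scoped fixture (the default scope), with the `yield` separating setup from teardown.

```python
import contextlib
import sqlite3
import tempfile
from pathlib import Path


@contextlib.contextmanager
def isolated_db():
    """Yield (connection, path) for a fresh SQLite file removed afterwards."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        db_path = Path(tmp_dir) / "test.db"
        connection = sqlite3.connect(str(db_path))
        try:
            yield connection, db_path
        finally:
            # Teardown runs even if the test body raises; the temporary
            # directory is deleted right after the connection is closed.
            connection.close()
```
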

---
## 9. Port Conflicts
**Symptom**: `OSError: Address already in use`.
**Checklist**:
- [ ] Identify which process owns the port: `lsof -i:<port>`.
- [ ] Kill lingering processes from previous runs.
- [ ] Use dynamic port allocation for tests if possible.
- [ ] Ensure services shut down cleanly on exit (signals).
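
Dynamic port allocation for tests usually means asking the OS for a free port by binding to port 0. A minimal sketch follows; `get_free_port` is a hypothetical helper. Another process can still take the port between this call and the moment the service binds it, so treat the result as a good guess rather than a reservation.

```python
import socket


def get_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("free port:", get_free_port())
```
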
---
## 10. Memory Conflicts (Concurrent Editing)
**Symptom**: Two agents editing the same file cause Git merge conflicts.
**Prevention**:
- Use `ai-memory/daily/` with one file per day; agents append, not edit.
- Avoid editing the same file simultaneously; coordinate via claims if necessary.
- If conflict occurs, resolve manually by merging entries; preserve both contributions.
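
The append-only convention can be wrapped in a small helper so agents never open a daily file for rewriting. This is a sketch under assumptions: the `YYYY-MM-DD.md` file name and the entry format are illustrative, and `append_daily_entry` is a hypothetical function. Entries appended on two branches may still need the manual merge described in the last bullet.

```python
from datetime import date, datetime
from pathlib import Path


def append_daily_entry(memory_dir, agent, text, today=None):
    """Append one entry to the day's file; existing content is never rewritten."""
    today = today or date.today()
    daily_dir = Path(memory_dir) / "daily"
    daily_dir.mkdir(parents=True, exist_ok=True)
    path = daily_dir / f"{today.isoformat()}.md"  # file naming is an assumption
    stamp = datetime.now().strftime("%H:%M:%S")
    with path.open("a", encoding="utf-8") as handle:  # "a" opens in append mode
        handle.write(f"- [{stamp}] {agent}: {text}\n")
    return path
```
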
---
## 11. Cron Jobs Not Running
**Symptom**: Expected periodic tasks not executing.
**Checklist**:
- [ ] Verify cron entries (`crontab -l` for the user).
- [ ] Check system cron logs (`/var/log/cron`, `journalctl`).
- [ ] Ensure scripts are executable and paths are absolute or correctly relative (use `cd` first).
- [ ] Redirect output to a log file for debugging: `>> /var/log/claim-task.log 2>&1`.
---
## 12. Wallet Operations Fail (Unknown Wallet)
**Symptom**: `aitbc wallet balance` returns "wallet not found".
**Checklist**:
- [ ] Has the wallet been created? Use `aitbc wallet create` first.
- [ ] Check the wallet name and hostname pattern: `<hostname><wallet_name>_simple`.
- [ ] Verify the wallet daemon is running on port 8015.
- [ ] Ensure RPC URL matches (coordinator API running on 8000).
---
## 13. CI Jobs Stuck / Timeout
**Symptom**: CI job runs for > 1 hour without finishing.
**Checklist**:
- [ ] Check for infinite loops or deadlocks in tests.
- [ ] Increase the CI timeout if a test is legitimately long-running.
- [ ] Add `-x` to the `pytest` invocation to fail fast on the first error and identify the root cause.
- [ ] Split tests into smaller batches.
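
To reproduce a hang locally without waiting an hour, a test batch can be run under a hard time limit. The sketch below relies on the `timeout` parameter of `subprocess.run`; `run_with_timeout` is a hypothetical helper, and the `tests/` path in the usage lines is an assumption about the repository layout.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run command; return (finished, returncode), returncode None on timeout."""
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False, None
    return True, completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: ten-minute budget for one batch, stop at first failure.
    finished, code = run_with_timeout(
        [sys.executable, "-m", "pytest", "-x", "tests/"], seconds=600)
    print("finished" if finished else "timed out", code)
```
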
---
## 14. Permission Denied on Git Operations
**Symptom**: `fatal: could not read Username` or `Permission denied (publickey)`.
**Cause**: SSH key not loaded or Gitea token not set.
**Fix**:
- Ensure SSH agent has the key (`ssh-add -l`).
- Set `GITEA_TOKEN` environment variable for API operations.
- Test with `git push` manually.
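
Scripts that call the Gitea API can fail early with a clear message when the token is missing, instead of surfacing an authentication error later. A minimal sketch, assuming the token is sent in Gitea's `Authorization: token ...` header form; `gitea_auth_headers` is a hypothetical helper.

```python
import os


def gitea_auth_headers(environ=os.environ):
    """Build API headers from GITEA_TOKEN, failing early if it is missing."""
    token = environ.get("GITEA_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITEA_TOKEN is not set; export it before calling the API")
    return {"Authorization": f"token {token}", "Accept": "application/json"}
```
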
---
## 15. Merge Conflict in Claim Branch
**Symptom**: Pulling latest main into claim branch causes conflicts.
**Resolution**:
- Resolve conflicts manually; keep both sets of changes if they are independent.
- Re-run tests after resolution.
- Push resolved branch.
- Consider rebasing instead of merging to keep history linear.
---
*Add new debugging patterns as they emerge.*