
Debugging Playbook

This is a collection of diagnostic checklists and debugging techniques for common issues in the AITBC system.


1. CLI Import Errors

Symptom: aitbc command crashes with ImportError or ModuleNotFoundError.

Checklist:

  • Verify apps/coordinator-api/src/app/services/__init__.py exists.
  • Check that cli/aitbc_cli/commands/* modules use correct relative imports.
  • Ensure coordinator-api is importable from the repository root: python -c "import sys; sys.path.append('apps/coordinator-api/src'); from app.services import trading_surveillance" should run without errors.
  • Run aitbc --help to check whether the base CLI loads; if it does, the failure is in a specific command module.
  • Look for absolute paths in command modules and replace them with package-relative imports.

Common Fixes: See failure-archive for the hardcoded path issue.
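The import checks above can be scripted. The following is a minimal sketch, assuming it runs from the repository root. `find_import_failures` is a hypothetical helper, and the module names you pass it should be the real command and service modules.

```python
import importlib
import sys


def find_import_failures(module_names, extra_paths=()):
    """Try to import each module and return {module_name: error} for those that fail."""
    # Mirror the manual check: make extra source roots (e.g. apps/coordinator-api/src) importable.
    for path in extra_paths:
        if path not in sys.path:
            sys.path.append(path)
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also catches ModuleNotFoundError, a subclass
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Example (from the repository root):
#   find_import_failures(["app.services.trading_surveillance"],
#                        extra_paths=["apps/coordinator-api/src"])
```

An empty dict means every listed module imported cleanly. Each entry names a module and the exception raised while importing it.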


2. Coordinator API Won't Start

Symptom: uvicorn app.main:app fails or hangs.

Checklist:

  • Check port 8000 availability (lsof -i:8000).
  • Verify the database file exists in apps/coordinator-api/data/, or that the directory is writable so the file can be created.
  • Ensure the dependencies listed in pyproject.toml are installed in the active venv.
  • Check the logs for the specific exception and its traceback.
  • If broadcast is enabled, verify REDIS_URL and confirm Redis is running.

Common Causes:

  • Missing aiohttp or sqlalchemy
  • Database locked or permission denied
  • Redis not running (if used)
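Several of these checks can run before uvicorn is launched. Below is a minimal preflight sketch, assuming the defaults from this section (port 8000, apps/coordinator-api/data, aiohttp and sqlalchemy). `port_is_free` and `preflight` are hypothetical helper names. The port probe only binds on the loopback address, and `find_spec` confirms a package is installed, not that its version is correct.

```python
import importlib.util
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if no socket is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # typically EADDRINUSE
            return False
    return True


def preflight(port=8000, data_dir="apps/coordinator-api/data",
              modules=("sqlalchemy", "aiohttp")):
    """Return a list of problems that would stop the coordinator API from starting."""
    problems = []
    if not port_is_free(port):
        problems.append(f"port {port} is already in use")
    # The DB file can only be created if the data dir (or its parent, if absent) is writable.
    target = data_dir if os.path.isdir(data_dir) else os.path.dirname(os.path.abspath(data_dir))
    if not os.access(target, os.W_OK):
        problems.append(f"{target} is not writable")
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing dependency: {name}")
    return problems
```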

3. Blockchain Node Not Producing Blocks

Symptom: RPC /status shows the block height is not increasing.

Checklist:

  • Is the node process running? (ps aux | grep blockchain)
  • Check logs for consensus errors or DB errors.
  • Verify ports 8006 (RPC) and 8005 (P2P) are open.
  • Ensure the wallet daemon is running on port 8015 (if needed for transactions).
  • Confirm the network setup: are other peers connected? If running devnet, is the proposer account funded?
  • Run aitbc blockchain status to see the RPC response.

Common Causes:

  • Not initialized (scripts/devnet_up.sh not executed)
  • Genesis proposer has no funds
  • P2P connectivity not established (check Redis for gossip)
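A stalled chain can be confirmed by sampling the height twice. This is a minimal sketch under two assumptions that are not in this playbook: the RPC answers at http://127.0.0.1:8006/status, and the response is JSON with a top-level "height" key. Check the real response shape with aitbc blockchain status first. The fetch and sleep functions are injectable so the logic can be tested offline.

```python
import json
import time
import urllib.request

# ASSUMPTION: RPC listens on 8006 and /status returns JSON like {"height": 123, ...}.
STATUS_URL = "http://127.0.0.1:8006/status"


def fetch_height(url=STATUS_URL):
    """Read the current block height from the RPC /status endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return int(json.load(resp)["height"])


def height_is_advancing(fetch=fetch_height, wait_seconds=30.0, sleep=time.sleep):
    """Sample the height twice, wait_seconds apart; True if it increased."""
    before = fetch()
    sleep(wait_seconds)
    after = fetch()
    return after > before
```

Pick a `wait_seconds` comfortably longer than the expected block interval. Otherwise a healthy node can look stalled.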

4. AI Provider Job Fails with Payment Error

Symptom: Provider returns 403 or reports an insufficient balance.

Checklist:

  • Did the buyer send funds first? The transfer (aitbc blockchain send ...) must precede the job request.
  • Check the provider's balance before and after; confirm the expected amount was transferred.
  • Verify the provider and buyer are on the same network (ait-devnet).
  • Ensure the provider's wallet daemon is running (port 8015).
  • Check that the coordinator job URL (--marketplace-url) is reachable.

Resolution: Follow the correct payment flow: the buyer sends the transaction, waits for confirmation, and only then sends POST /job.
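The ordering that matters here (pay, confirm, then submit) can be enforced in client code. This is a minimal sketch in which the three callables are hypothetical hooks, not real AITBC APIs:

- `send_payment()` returns a transaction id.
- `is_confirmed(tx_id)` reports confirmation.
- `submit_job(tx_id)` performs the POST /job.

```python
import time


def pay_then_submit(send_payment, is_confirmed, submit_job,
                    timeout=60.0, poll_interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Send payment, wait for on-chain confirmation, and only then submit the job."""
    tx_id = send_payment()
    deadline = clock() + timeout
    while not is_confirmed(tx_id):
        if clock() >= deadline:
            # Never fall through to the job request with an unconfirmed payment.
            raise TimeoutError(f"payment {tx_id} not confirmed within {timeout}s")
        sleep(poll_interval)
    return submit_job(tx_id)
```

Raising on timeout rather than submitting anyway avoids the 403 / insufficient-balance symptom described above.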


5. Gitea API Calls Fail (Transient)

Symptom: Scripts fail with connection resets, HTTP 502 errors, or similar transient network failures.

Checklist:

  • Is the Gitea instance up? Can you reach the API with curl?
  • Check network connectivity and DNS.
  • Add retries with exponential backoff (monitor-prs.py already does this).
  • If persistent, check Gitea logs for server-side issues.

Temporary Workaround: Wait and re-run the script manually.
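For scripts that do not retry yet, here is a generic sketch of retry with exponential backoff. It is not a copy of monitor-prs.py. It treats connection errors, timeouts, and HTTP 502/503/504 as transient, and retries them with exponentially growing, jittered delays. Any other error (for example a 404 or 401) is raised immediately, since retrying will not fix it.

```python
import random
import time
import urllib.error

TRANSIENT_HTTP_CODES = {502, 503, 504}


def is_transient(exc):
    """Decide whether an error is worth retrying."""
    if isinstance(exc, urllib.error.HTTPError):  # HTTPError subclasses URLError; check it first
        return exc.code in TRANSIENT_HTTP_CODES
    return isinstance(exc, (urllib.error.URLError, ConnectionError, TimeoutError))


def with_retries(call, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on transient errors wait base_delay * 2**n (plus jitter) and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from several scripts hitting Gitea at once.
            sleep(delay + random.uniform(0, delay / 2))
```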


6. Redis Pub/Sub Not Delivering Messages

Symptom: Agents don't receive broadcast messages.

Checklist:

  • Is Redis running? redis-cli ping should return PONG.
  • Check that all agents use the same REDIS_URL.
  • Verify message channel names match exactly.
  • Ensure agents are subscribed before messages are published.
  • Use redis-cli SUBSCRIBE <channel> to debug manually.

Note: Redis pub/sub is used only in development; production will use direct P2P.
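Redis pub/sub is fire-and-forget. A message published while nobody is subscribed to that exact channel is dropped, not queued, and PUBLISH returns the number of clients that received it. The toy in-memory model below is not a Redis client. It reproduces that behavior to show why subscription order and exact channel names matter.

```python
from collections import defaultdict


class ToyPubSub:
    """In-memory model of fire-and-forget pub/sub (NOT a Redis client)."""

    def __init__(self):
        self._inboxes = defaultdict(list)  # channel name -> list of subscriber inboxes

    def subscribe(self, channel):
        inbox = []
        self._inboxes[channel].append(inbox)
        return inbox

    def publish(self, channel, message):
        # Like Redis PUBLISH: deliver only to current subscribers, store nothing,
        # and return how many subscribers received the message.
        receivers = self._inboxes.get(channel, [])
        for inbox in receivers:
            inbox.append(message)
        return len(receivers)
```

Against a single real server, `redis-cli PUBLISH <channel> <message>` returning 0 is the same signal: no agent was subscribed to that channel at that moment.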


7. Starlette Import Errors After Upgrade

Symptom: ImportError: cannot import name 'Broadcast'.

Fix: Pin Starlette to <0.38 as documented. Alternatively, refactor to use a different broadcast mechanism (future work).
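To catch an accidental upgrade early, a service can check the installed version at startup. This minimal sketch uses only the standard library (`importlib.metadata`). The simple release-tuple parsing is an assumption that holds for plain versions such as 0.37.2. `packaging.version` is the more robust choice if that package is available. `parse_release` and `starlette_pin_ok` are hypothetical helper names.

```python
import re
from importlib import metadata

MAX_STARLETTE = (0, 38)  # the pin from this playbook: starlette < 0.38


def parse_release(version):
    """Turn '0.37.2' or '0.38.0rc1' into the leading integer tuple, e.g. (0, 37, 2)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):  # stop at suffixes like 'rc1'
            break
    return tuple(parts)


def starlette_pin_ok(version=None):
    """True if the given (or installed) Starlette version satisfies the < 0.38 pin."""
    if version is None:
        version = metadata.version("starlette")  # raises PackageNotFoundError if absent
    return parse_release(version) < MAX_STARLETTE
```

Pre-releases of 0.38 (such as 0.38.0rc1) are rejected, which matches how a `<0.38` specifier treats them.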


8. Test Isolation Failures

Symptom: Tests pass individually but fail when run together.

Checklist:

  • Look for shared resources (database files, ports, temporary files).
  • Use fixtures with scope="function" and proper teardown.
  • Clean up after each test: close DB connections, stop servers.
  • Avoid global state; inject dependencies.

Action: Refactor tests to be hermetic.
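A hermetic database setup gives each test its own file and removes it afterwards. This is a minimal standard-library sketch, assuming SQLite. `isolated_db` is a hypothetical helper. In pytest the same body fits a function-scoped fixture (the default scope) that yields the connection, or the built-in `tmp_path` fixture can supply the directory.

```python
import contextlib
import os
import sqlite3
import tempfile


@contextlib.contextmanager
def isolated_db():
    """Yield (connection, path) for a brand-new SQLite file that is deleted afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.db")
        conn = sqlite3.connect(path)
        try:
            yield conn, path
        finally:
            # Close before the directory is removed (required on Windows, good hygiene elsewhere).
            conn.close()
```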


9. Port Conflicts

Symptom: OSError: Address already in use.

Checklist:

  • Identify which process owns the port: lsof -i:<port>.
  • Kill lingering processes from previous runs.
  • Use dynamic port allocation for tests if possible.
  • Ensure services shut down cleanly on exit by handling termination signals (SIGINT, SIGTERM).
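For dynamic port allocation, the usual technique is to bind to port 0 and let the OS pick an unused port. `free_port` is a hypothetical helper. Note the small race: another process can take the port between this call and the server binding it. Passing an already-bound socket to the server avoids that, where the server supports it.

```python
import socket


def free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```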

10. Memory Conflicts (Concurrent Editing)

Symptom: Two agents editing the same file cause Git merge conflicts.

Prevention:

  • Use ai-memory/daily/ with one file per day; agents append new entries rather than editing existing ones.
  • Avoid editing the same file simultaneously; coordinate via claims if necessary.
  • If a conflict occurs, resolve it manually by merging the entries and preserving both contributions.
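Here is a minimal sketch of the append-only convention. It assumes one file per day named YYYY-MM-DD.md; that naming is an assumption, so match whatever ai-memory/daily/ actually uses. Because agents only add lines at the end, a Git conflict in these files consists of two sets of additions, and it is resolved by keeping both.

```python
import datetime
import os


def append_daily_entry(text, root="ai-memory/daily", today=None):
    """Append one entry to today's file; never rewrite existing content."""
    today = today or datetime.date.today()
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, f"{today.isoformat()}.md")  # ASSUMPTION: YYYY-MM-DD.md naming
    # Mode "a" always writes at the end of the file, so earlier entries are left untouched.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text.rstrip("\n") + "\n")
    return path
```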

11. Cron Jobs Not Running

Symptom: Expected periodic tasks not executing.

Checklist:

  • Verify cron entries (crontab -l for the user).
  • Check system cron logs (/var/log/cron, journalctl).
  • Ensure scripts are executable and that paths are absolute or correctly relative (cd into the project directory first).
  • Redirect output to a log file for debugging: >> /var/log/claim-task.log 2>&1.

12. Wallet Operations Fail (Unknown Wallet)

Symptom: aitbc wallet balance returns "wallet not found".

Checklist:

  • Has the wallet been created? Run aitbc wallet create first.
  • Check the wallet name against the hostname pattern: <hostname><wallet_name>_simple.
  • Verify the wallet daemon is running on port 8015.
  • Ensure the RPC URL matches the coordinator API (running on port 8000).
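The last three checks can be scripted with two small, hypothetical helpers:

- `expected_wallet_name` concatenates the documented `<hostname><wallet_name>_simple` pattern. Using `socket.gethostname()` as the hostname source is an assumption; a short versus fully qualified hostname changes the result.
- `port_accepts_connections` tests whether anything is listening on the wallet daemon (8015) or coordinator API (8000) port.

```python
import socket


def expected_wallet_name(wallet_name, hostname=None):
    """Build a name following the <hostname><wallet_name>_simple pattern."""
    hostname = hostname or socket.gethostname()  # ASSUMPTION: same hostname source as the CLI
    return f"{hostname}{wallet_name}_simple"


def port_accepts_connections(port, host="127.0.0.1", timeout=2.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Example: port_accepts_connections(8015) for the wallet daemon, 8000 for the coordinator API.
```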

13. CI Jobs Stuck / Timeout

Symptom: A CI job runs for more than an hour without finishing.

Checklist:

  • Check for infinite loops or deadlocks in tests.
  • Increase the CI timeout if the test is legitimately long-running.
  • Run pytest with -x to stop at the first failure and isolate the root cause.
  • Split tests into smaller batches.
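For suspected deadlocks or infinite loops, the standard library's `faulthandler.dump_traceback_later` can dump every thread's stack once a run exceeds a time limit. That shows where the run is stuck before CI kills the job. This is a minimal sketch. `hang_watchdog` is a hypothetical helper name, and the limit should sit below the CI timeout.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(seconds, file=None):
    """Dump all thread stacks to `file` (default: stderr) if the block runs longer than `seconds`."""
    target = file if file is not None else sys.stderr
    # exit=False: only report; leave killing the process to the CI timeout.
    faulthandler.dump_traceback_later(seconds, repeat=False, file=target, exit=False)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
```

Wrapping a whole test session, for example in a session-scoped fixture in conftest.py, leaves a stack dump in the CI log instead of a silent timeout.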

14. Permission Denied on Git Operations

Symptom: fatal: could not read Username or Permission denied (publickey).

Cause: SSH key not loaded or Gitea token not set.

Fix:

  • Ensure SSH agent has the key (ssh-add -l).
  • Set GITEA_TOKEN environment variable for API operations.
  • Test with git push manually.
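Both causes can be checked from a script. This minimal sketch relies on the documented `ssh-add -l` exit codes:

- 0: keys are listed.
- 1: the listing fails, which in practice means the agent has no identities.
- 2: no agent can be reached.

`diagnose_git_auth` is a hypothetical helper. It only checks that `GITEA_TOKEN` is set, not that the token is valid.

```python
import os
import subprocess


def diagnose_git_auth(env=None, run=subprocess.run):
    """Return a list of likely causes for Git/Gitea authentication failures."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GITEA_TOKEN"):
        problems.append("GITEA_TOKEN is not set (needed for API operations)")
    try:
        result = run(["ssh-add", "-l"], capture_output=True, text=True)
    except FileNotFoundError:
        problems.append("ssh-add not found; is OpenSSH installed?")
        return problems
    # ssh-add -l exit codes: 0 = keys loaded, 1 = no identities, 2 = agent unreachable.
    if result.returncode == 1:
        problems.append("ssh-agent is running but has no keys loaded (run ssh-add)")
    elif result.returncode == 2:
        problems.append("cannot contact ssh-agent (is SSH_AUTH_SOCK set?)")
    return problems
```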

15. Merge Conflict in Claim Branch

Symptom: Pulling the latest main into a claim branch causes conflicts.

Resolution:

  • Resolve conflicts manually; keep both sets of changes if they are independent.
  • Re-run tests after resolution.
  • Push resolved branch.
  • Consider rebasing instead of merging to keep history linear.

Add new debugging patterns as they emerge.