feat: implement structured agent memory architecture
ai-memory/failures/ci-failures.md
@@ -0,0 +1,116 @@
# CI Failures

This file tracks continuous integration failures, their diagnosis, and fixes. Consult it when CI breaks.

---

## CI Failure: Poetry Build Error – Missing README

**Date**: 2026-03-13

**Symptom**: Gitea Actions job fails during `poetry build`:
```
FileNotFoundError: [Errno 2] No such file or directory: 'README.md'
```

**Package**: `packages/py/aitbc-agent-sdk`

**Cause**: The package directory lacked a `README.md`, which Poetry requires at build time when `pyproject.toml` declares it as the package `readme`.

**Fix**: Added a minimal `README.md` (later expanded with usage examples). Re-ran CI; the build passed.

**Action**: Recorded in `failures/failure-archive.md` as "Package Build Fails Due to Missing README.md".

---

## CI Failure: ImportError in CLI Tests

**Symptom**: Test job for `cli` or import validation fails with:
```
ImportError: cannot import name 'trading_surveillance' from 'app.services'
```

**Cause**: One of the following: a Starlette/Broadcast version mismatch, a missing `app/services/__init__.py`, or incorrect import paths.

**Resolution**: Ensured `app/services/__init__.py` exists; fixed command module imports as described in `failure-archive.md`; pinned the Starlette version.

---

## CI Failure: Pytest Fails Due to Database Lock

**Symptom**: Intermittent test failures with `sqlite3.OperationalError: database is locked`.

**Cause**: Tests ran in parallel against the same SQLite file without proper isolation.

**Fix**: Switched to in-memory SQLite (`sqlite+aiosqlite:///:memory:`) for unit tests and ensured each test gets a fresh DB. Alternatively, use a file-based database per test (for example under a temporary directory) with proper cleanup.

**Action**: Add test isolation to `conftest.py`; ensure fixtures tear down connections.
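
A minimal sketch of the per-test isolation described above. It uses the standard-library `sqlite3` driver rather than the project's `aiosqlite` setup, and the `jobs` table is a hypothetical schema chosen only for illustration:

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def fresh_db():
    """Yield a brand-new in-memory SQLite connection and always close it."""
    conn = sqlite3.connect(":memory:")  # private to this one connection
    try:
        # Hypothetical schema, for illustration only.
        conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
        yield conn
    finally:
        conn.close()  # teardown: release the connection even if the test fails


# In conftest.py this could be wrapped as a function-scoped fixture:
#
#     @pytest.fixture
#     def db():
#         with fresh_db() as conn:
#             yield conn
```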

---

## CI Failure: Missing aiohttp Dependency

**Symptom**: Import error for `aiohttp` in `kyc_aml_providers.py`.

**Cause**: Dependency not declared in `pyproject.toml`.

**Fix**: Added `aiohttp` to the dependencies. Pushed the fix; CI passed after install.

---

## CI Failure: Syntax Error in Sibling's PR

**Symptom**: `monitor-prs.py` auto-requests changes because `py_compile` fails.

**Typical Cause**: Simple syntax mistake (missing colon, unmatched parentheses).

**Response**: Comment on the PR with the syntax error. The developer fixes and pushes; CI re-runs.

**Note**: This is expected behavior; the script is doing its job.
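
For reference, a sketch of how a `py_compile` check of this kind can be written. This is not the code in `monitor-prs.py`; the helper name `syntax_error_in` is hypothetical:

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_error_in(path):
    """Return the compiler's error message for `path`, or None if it compiles."""
    try:
        # Write the bytecode to a scratch directory so no __pycache__ is left behind.
        with tempfile.TemporaryDirectory() as scratch:
            py_compile.compile(
                str(path), cfile=str(Path(scratch) / "out.pyc"), doraise=True
            )
    except py_compile.PyCompileError as exc:
        return exc.msg
    return None
```

The returned message already contains the file name and line, so it can be pasted into the PR comment as-is.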

---

## CI Failure: Redis Connection Refused

**Symptom**: Tests that rely on Redis connectivity fail:
```
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.
```

**Cause**: Redis service not running in the CI environment.

**Fix**: Either start Redis in the CI job before tests, or mock Redis in tests. For integration tests that need Redis, add a service container or start Redis as a background process.
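
When Redis is started as a background process, tests can race its startup. A small readiness check, sketched here with the standard library only (the helper name and the default timings are assumptions, not project code), avoids that:

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:  # refused or timed out: the service is not up yet
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A CI step could call `wait_for_port("localhost", 6379)` and fail early with a clear message instead of letting every Redis test error out.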

---

## CI Failure: Port Already in Use

**Symptom**: Test that starts a server fails with `OSError: [Errno 98] Address already in use`.

**Cause**: A previous test did not cleanly shut down the server; port 8006 (or another port) is still bound.

**Fix**: Ensure proper shutdown of servers in test teardown; use `asyncio` cancellation and wait for port release. Alternatively, use dynamic port allocation for CI.
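
Dynamic port allocation can be as simple as binding to port 0 and letting the OS choose. A sketch (the helper name is hypothetical):

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

There is a small window between releasing the port and the test server binding it, so the port should be handed to the server immediately.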

---

## CI Failure: Out of Memory (OOM)

**Symptom**: CI job killed with signal SIGKILL (exit code 137).

**Cause**: Building many packages or running heavy tests exceeded CI container memory limits.

**Fix**: Reduce parallelism; use swap if allowed; split CI into smaller jobs; optimize tests.
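
Exit code 137 follows the shell convention of 128 plus the signal number (137 - 128 = 9, SIGKILL). The helper below is an illustrative sketch for decoding such codes in CI logs. Note that SIGKILL is consistent with an OOM kill but does not prove one:

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit status; values above 128 mean 'killed by signal code - 128'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:  # signal number not defined on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return "success" if code == 0 else f"exited with error {code}"


print(describe_exit_code(137))  # "killed by SIGKILL" on Linux
```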

---

## CI Failure: Permission Denied on Executable Scripts

**Symptom**: `./scripts/claim-task.py: Permission denied` when cron tries to run it.

**Cause**: Script file not executable (`chmod +x` missing).

**Fix**: `chmod +x scripts/claim-task.py`; ensure all scripts are committed with the executable bit set (for example `git update-index --chmod=+x scripts/claim-task.py`).
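
To catch this before cron does, a check along these lines could run in CI. It is a sketch; the function name and the choice to inspect only `*.py` files directly under the directory are assumptions:

```python
import stat
from pathlib import Path


def non_executable_scripts(scripts_dir):
    """Return the *.py files directly under `scripts_dir` that lack the owner-execute bit."""
    missing = []
    for path in sorted(Path(scripts_dir).glob("*.py")):
        if not path.stat().st_mode & stat.S_IXUSR:
            missing.append(path.name)
    return missing
```

A non-empty result from `non_executable_scripts("scripts")` would fail the job and list the files to fix.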

---

*Log new CI failures chronologically.*
ai-memory/failures/debugging-notes.md
@@ -0,0 +1,209 @@
# Debugging Playbook

This is a collection of diagnostic checklists and debugging techniques for common issues in the AITBC system.

---

## 1. CLI Import Errors

**Symptom**: `aitbc` command crashes with `ImportError` or `ModuleNotFoundError`.

**Checklist**:
- [ ] Verify `apps/coordinator-api/src/app/services/__init__.py` exists.
- [ ] Check that `cli/aitbc_cli/commands/*` modules use correct relative imports.
- [ ] Ensure coordinator-api is importable; this command should succeed: `python -c "import sys; sys.path.append('apps/coordinator-api/src'); from app.services import trading_surveillance"`.
- [ ] Run `aitbc --help` to see whether the base CLI loads; the result helps isolate a command-module import problem.
- [ ] Look for absolute paths in command modules; replace them with package-relative paths.

**Common Fixes**: See `failure-archive.md` for the hardcoded path issue.

---

## 2. Coordinator API Won't Start

**Symptom**: `uvicorn app.main:app` fails or hangs.

**Checklist**:
- [ ] Check port 8000 availability (`lsof -i:8000`).
- [ ] Verify the database file exists or can be created: `apps/coordinator-api/data/`.
- [ ] Ensure `pyproject.toml` dependencies are installed in the active venv.
- [ ] Check logs for the specific exception (traceback).
- [ ] Verify `REDIS_URL` if using broadcast; Redis must be running.

**Common Causes**:
- Missing `aiohttp` or `sqlalchemy`
- Database locked or permission denied
- Redis not running (if used)

---

## 3. Blockchain Node Not Producing Blocks

**Symptom**: RPC `/status` shows `height` not increasing.

**Checklist**:
- [ ] Is the node process running? (`ps aux | grep blockchain`)
- [ ] Check logs for consensus errors or DB errors.
- [ ] Verify ports 8006 (RPC) and 8005 (P2P) are open.
- [ ] Ensure the wallet daemon is running on 8015 (if needed for transactions).
- [ ] Confirm the network setup: are other peers connected? If running a devnet, is the proposer account funded?
- [ ] Run `aitbc blockchain status` to see the RPC response.

**Common Causes**:
- Not initialized (`scripts/devnet_up.sh` not executed)
- Genesis proposer has no funds
- P2P connectivity not established (check Redis for gossip)

---

## 4. AI Provider Job Fails with Payment Error

**Symptom**: Provider returns 403 or reports an insufficient balance.

**Checklist**:
- [ ] Did the buyer send funds first? `aitbc blockchain send ...` should precede the job request.
- [ ] Check the provider's balance before and after; confirm the expected amount was transferred.
- [ ] Verify provider and buyer are on the same network (ait-devnet).
- [ ] Ensure the provider's wallet daemon is running (port 8015).
- [ ] Check that the coordinator job URL (`--marketplace-url`) is reachable.

**Resolution**: Follow the correct payment flow: the buyer sends the transaction, waits for confirmation, then issues `POST /job`.

---

## 5. Gitea API Calls Fail (Transient)

**Symptom**: Scripts fail with connection reset, 502, etc.

**Checklist**:
- [ ] Is the Gitea instance up? Can you `curl` the API?
- [ ] Check network connectivity and DNS.
- [ ] Add retry with exponential backoff (already in `monitor-prs.py`).
- [ ] If persistent, check Gitea logs for server-side issues.

**Temporary Workaround**: Wait and re-run the script manually.
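
For scripts that do not have it yet, retry with exponential backoff can be sketched as below. This is not the implementation in `monitor-prs.py`; the function name, defaults, and the set of retried exceptions are assumptions. An HTTP 502 would have to be turned into one of the retried exceptions by the caller:

```python
import time


def retry_with_backoff(call, attempts=4, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `call()`; on a transient error wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.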

---

## 6. Redis Pub/Sub Not Delivering Messages

**Symptom**: Agents don't receive broadcast messages.

**Checklist**:
- [ ] Is Redis running? `redis-cli ping` should return PONG.
- [ ] Check that all agents use the same `REDIS_URL`.
- [ ] Verify message channel names match exactly.
- [ ] Ensure agents are subscribed before messages are published.
- [ ] Use `redis-cli SUBSCRIBE <channel>` to debug manually.

**Note**: This is dev-only; production will use direct P2P.

---

## 7. Starlette Import Errors After Upgrade

**Symptom**: `ImportError: cannot import name 'Broadcast'`.

**Fix**: Pin Starlette to `<0.38` as documented in `failure-archive.md`. Alternatively, refactor to use a different broadcast mechanism (future work).

---

## 8. Test Isolation Failures

**Symptom**: Tests pass individually but fail when run together.

**Checklist**:
- [ ] Look for shared resources (database files, ports, files).
- [ ] Use fixtures with `scope="function"` and proper teardown.
- [ ] Clean up after each test: close DB connections, stop servers.
- [ ] Avoid global state; inject dependencies.

**Action**: Refactor tests to be hermetic.

---

## 9. Port Conflicts

**Symptom**: `OSError: Address already in use`.

**Checklist**:
- [ ] Identify which process owns the port: `lsof -i:<port>`.
- [ ] Kill lingering processes from previous runs.
- [ ] Use dynamic port allocation for tests if possible.
- [ ] Ensure services shut down cleanly on exit (signals).

---

## 10. Memory Conflicts (Concurrent Editing)

**Symptom**: Two agents editing the same file cause Git merge conflicts.

**Prevention**:
- Use `ai-memory/daily/` with one file per day; agents append, not edit.
- Avoid editing the same file simultaneously; coordinate via claims if necessary.
- If a conflict occurs, resolve it manually by merging entries; preserve both contributions.

---

## 11. Cron Jobs Not Running

**Symptom**: Expected periodic tasks not executing.

**Checklist**:
- [ ] Verify cron entries (`crontab -l` for the user).
- [ ] Check system cron logs (`/var/log/cron`, `journalctl`).
- [ ] Ensure scripts are executable and paths are absolute or correctly relative (use `cd` first).
- [ ] Redirect output to a log file for debugging: `>> /var/log/claim-task.log 2>&1`.

---

## 12. Wallet Operations Fail (Unknown Wallet)

**Symptom**: `aitbc wallet balance` returns "wallet not found".

**Checklist**:
- [ ] Has the wallet been created? Use `aitbc wallet create` first.
- [ ] Check the wallet name and hostname pattern: `<hostname><wallet_name>_simple`.
- [ ] Verify the wallet daemon is running on port 8015.
- [ ] Ensure the RPC URL matches (coordinator API running on 8000).

---

## 13. CI Jobs Stuck / Timeout

**Symptom**: CI job runs for > 1 hour without finishing.

**Checklist**:
- [ ] Check for infinite loops or deadlocks in tests.
- [ ] Increase the CI timeout if the test is legitimately long-running.
- [ ] Run `pytest -x` to stop at the first failure and narrow down the root cause.
- [ ] Split tests into smaller batches.

---

## 14. Permission Denied on Git Operations

**Symptom**: `fatal: could not read Username` or `Permission denied (publickey)`.

**Cause**: SSH key not loaded or Gitea token not set.

**Fix**:
- Ensure the SSH agent has the key (`ssh-add -l`).
- Set the `GITEA_TOKEN` environment variable for API operations.
- Test with `git push` manually.

---

## 15. Merge Conflict in Claim Branch

**Symptom**: Pulling latest main into the claim branch causes conflicts.

**Resolution**:
- Resolve conflicts manually; keep both sets of changes if they are independent.
- Re-run tests after resolution.
- Push the resolved branch.
- Consider rebasing instead of merging to keep history linear.

---

*Add new debugging patterns as they emerge.*
ai-memory/failures/failure-archive.md
@@ -0,0 +1,134 @@
# Failure Archive

This archive collects known failure patterns experienced during development, along with their causes and resolutions. Agents should consult it before debugging similar symptoms.

---

## Failure: CLI Fails to Launch – Hardcoded Absolute Paths

**Date**: 2026-03-13

**Symptom**: `ImportError: No module named 'trading_surveillance'` when running `aitbc --help` or any subcommand.

**Cause**: Multiple command modules in `cli/aitbc_cli/commands/` used:
```python
sys.path.append('/home/oib/windsurf/aitbc/apps/coordinator-api/src/app/services')
```
This path is user-specific and does not exist on the `aitbc1` host.

**Modules affected**:
- `surveillance.py`
- `ai_trading.py`
- `ai_surveillance.py`
- `advanced_analytics.py`
- `regulatory.py`
- `enterprise_integration.py`

**Resolution**:
1. Added `__init__.py` to `apps/coordinator-api/src/app/services/` to make it a proper package.
2. Updated each affected command module to use:
   ```python
   import os
   import sys
   sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'apps', 'coordinator-api', 'src'))
   from app.services.trading_surveillance import ...
   ```
   (or simply `from app.services import <module>` after path setup)
3. Removed hardcoded fallback absolute paths.
4. Verified: `aitbc --help` loads without errors; `aitbc surveillance start` works.

**Prevention**: Use package-relative imports; avoid user-specific absolute paths. Consider making coordinator-api a proper installable dependency.
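
The same path setup can be expressed with `pathlib`, which makes the directory arithmetic easier to read. This is a sketch of an alternative, not the code that was committed; both helper names are hypothetical:

```python
import sys
from pathlib import Path


def coordinator_src(module_file):
    """Return <repo>/apps/coordinator-api/src for a module in cli/aitbc_cli/commands/."""
    # parents[0]=commands, [1]=aitbc_cli, [2]=cli, [3]=repository root
    repo_root = Path(module_file).parents[3]
    return repo_root / "apps" / "coordinator-api" / "src"


def ensure_on_path(directory):
    """Append `directory` to sys.path once, so repeated imports do not duplicate it."""
    entry = str(directory)
    if entry not in sys.path:
        sys.path.append(entry)
```

A command module would call `ensure_on_path(coordinator_src(Path(__file__).resolve()))` before importing from `app.services`.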

---

## Failure: Missing Dependency – aiohttp

**Symptom**: `ModuleNotFoundError: No module named 'aiohttp'` when importing `kyc_aml_providers.py`.

**Cause**: `cli/pyproject.toml` did not declare `aiohttp`.

**Resolution**: `poetry add aiohttp` (or `pip install aiohttp` in the venv). Updated `pyproject.toml` accordingly.

**Prevention**: Keep dependencies declared; run tests in a fresh environment.

---

## Failure: Package Build Fails Due to Missing README.md

**Symptom**: `poetry build` for `packages/py/aitbc-agent-sdk` fails with `FileNotFoundError: README.md`.

**Cause**: The package directory lacked a `README.md`, which the build requires when `pyproject.toml` declares it as the package `readme`.

**Resolution**: Created a placeholder `README.md`. Later enhanced it with usage examples.

**Prevention**: Ensure each package has at least a minimal README; add a pre-commit hook to check.
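
The check behind such a hook could look like the sketch below. The function name is hypothetical, and it assumes every directory containing a `pyproject.toml` counts as a package:

```python
from pathlib import Path


def packages_missing_readme(root):
    """Return directories under `root` that contain pyproject.toml but no README.md."""
    missing = []
    for pyproject in sorted(Path(root).rglob("pyproject.toml")):
        if not (pyproject.parent / "README.md").is_file():
            missing.append(pyproject.parent.relative_to(root).as_posix())
    return missing
```

A hook script would exit non-zero when `packages_missing_readme("packages")` returns a non-empty list.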

---

## Failure: Starlette Broadcast Module Missing After Upgrade

**Symptom**: `ImportError: cannot import name 'Broadcast' from 'starlette'` after upgrading Starlette to 0.38+.

**Cause**: Starlette removed the Broadcast module in version 0.38.

**Impact**: P2P gossip backend (using Redis broadcast) fails to import. Services crash on startup.

**Resolution**:
- Pinned Starlette to `>=0.37.2,<0.38` in `pyproject.toml`.
- Added a comment explaining the pin and that production should replace broadcast with direct P2P.

**Prevention**: Avoid upgrading Starlette without testing; track deprecations.

**See also**: `debugging-notes.md` for diagnostic steps.

---

## Failure: Docker Compose Not Found

**Symptom**: `docker-compose: command not found` even though Docker is installed.

**Cause**: The system has Docker Compose v2 (`docker compose`) but not v1 (`docker-compose`). The project documentation referenced `docker-compose`.

**Resolution**: Updated documentation to use `docker compose` (or detect whichever is available). Alternatively, create a symlink or alias.

**Prevention**: Detect both variants in scripts; document both names.
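
Detection of both variants might look like this sketch. The function name is hypothetical; `which` and `run` are injectable only so the logic can be tested without Docker installed:

```python
import shutil
import subprocess


def compose_command(which=shutil.which, run=subprocess.run):
    """Return the argv prefix for Docker Compose: v2 plugin first, then v1 binary, else None."""
    if which("docker"):
        # `docker compose version` exits 0 only when the v2 plugin is installed.
        probe = run(["docker", "compose", "version"], capture_output=True)
        if probe.returncode == 0:
            return ["docker", "compose"]
    if which("docker-compose"):
        return ["docker-compose"]
    return None
```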

---

## Failure: Test Scripts Use Absolute Paths

**Symptom**: `run_all_tests.sh` fails with "No such file or directory" for test scenario scripts located in `/home/oib/windsurf/aitbc/...`.

**Cause**: Test scripts referenced a specific user's home directory, not the project root.

**Resolution**: Rewrote paths to be project-relative using `$(dirname "$0")`. Example: `$(dirname "$0")/test_scenario_a.sh`.

**Prevention**: Never hardcode absolute paths; always compute them relative to the project root or script location.

---

## Failure: Gitea API Unstable During PR Approval

**Symptom**: Script `monitor-prs.py` fails to post approvals due to "connection reset" or 5xx errors from Gitea.

**Cause**: Gitea instance may be under load or temporarily unavailable.

**Resolution**: Added retry logic with exponential backoff. If it still fails, the script logs the error and skips the PR; the next scheduled run retries.

**Prevention**: Make API clients resilient to transient failures.

---

## Failure: Coordinator API DB Init Not Idempotent

**Symptom**: Running `init_db()` multiple times causes `sqlite3.IntegrityError` due to duplicate index creation.

**Cause**: `init_db()` did not handle duplicate index errors; it assumed a fresh DB.

**Resolution**: Wrapped index creation in try/except blocks catching `sqlite3.IntegrityError` (or used `IF NOT EXISTS` where supported). This made initialization idempotent.

**Impact**: Coordinator can be started repeatedly without manual DB cleanup.

**Prevention**: Design DB initialization to be idempotent from the start.
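
A sketch of the `IF NOT EXISTS` variant using the standard-library `sqlite3` module. The table and index names are hypothetical and do not reflect the coordinator's real schema:

```python
import sqlite3


def init_db(conn):
    """Create the schema only where it is missing, so repeated calls are harmless."""
    # Hypothetical table and index names, for illustration only.
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_state ON jobs (state)")
    conn.commit()
```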

---

*Add new failures chronologically below.*