feat: implement structured agent memory architecture

aitbc1
2026-03-15 21:09:39 +00:00
commit 2d68f66405
17 changed files with 2273 additions and 0 deletions

@@ -0,0 +1,116 @@
# CI Failures
This file tracks continuous integration failures, their diagnosis, and fixes. Consult when CI breaks.
---
## CI Failure: Poetry Build Error (Missing README)
**Date**: 2026-03-13
**Symptom**: Gitea Actions job fails during `poetry build`:
```
FileNotFoundError: [Errno 2] No such file or directory: 'README.md'
```
**Package**: `packages/py/aitbc-agent-sdk`
**Cause**: The package directory lacked a README.md. `poetry build` fails when `pyproject.toml` references a readme file that does not exist.
**Fix**: Added a minimal README.md (later expanded with usage examples). Re-ran CI; build passed.
**Action**: Recorded in `failures/failure-archive.md` as "Package Build Fails Due to Missing README.md".
---
## CI Failure: ImportError in CLI Tests
**Symptom**: Test job for `cli` or import validation fails with:
```
ImportError: cannot import name 'trading_surveillance' from 'app.services'
```
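**Cause**: One or more of the following: a missing `app/services/__init__.py`, broken import paths in the CLI command modules, or an installed Starlette version that does not match the `Broadcast` import.
**Resolution**: Ensured `app/services/__init__.py` exists; fixed the command module imports as described in the failure archive; pinned the Starlette version.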
---
## CI Failure: Pytest Fails Due to Database Lock
**Symptom**: Intermittent test failures with `sqlite3.OperationalError: database is locked`.
**Cause**: Tests using the same SQLite file in parallel without proper isolation.
**Fix**: Switched to in-memory SQLite (`sqlite+aiosqlite:///:memory:`) for unit tests and ensured each test gets a fresh DB. Alternatively, use a file-based database with `cache=shared` and proper cleanup.
**Action**: Add test isolation to `conftest.py`; ensure fixtures tear down connections.
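**Example**: A minimal sketch of per-test isolation. It uses the standard-library `sqlite3` module rather than the project's async SQLAlchemy setup, and the `fresh_db` helper and `jobs` table are hypothetical names chosen for illustration. In `conftest.py` the same body could be wrapped in a function-scoped pytest fixture.

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical schema, used only for this illustration.
SCHEMA = "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"


@contextmanager
def fresh_db():
    """Yield a private in-memory database and always close it afterwards."""
    conn = sqlite3.connect(":memory:")  # each connection gets its own database
    try:
        conn.execute(SCHEMA)
        yield conn
    finally:
        conn.close()  # teardown: nothing is left behind for the next test
```

Because every call opens a new `:memory:` connection, rows written in one test are not visible to the next, and no file exists that parallel workers could lock.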
---
## CI Failure: Missing aiohttp Dependency
**Symptom**: Import error for `aiohttp` in `kyc_aml_providers.py`.
**Cause**: Dependency not declared in `pyproject.toml`.
**Fix**: Added `aiohttp` to dependencies. Pushed fix; CI passed after install.
---
## CI Failure: Syntax Error in Sibling's PR
**Symptom**: `monitor-prs.py` auto-requests changes because `py_compile` fails.
**Typical Cause**: Simple syntax mistake (missing colon, unmatched parentheses).
**Response**: Comment on PR with the syntax error. Developer fixes and pushes; CI re-runs.
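**Example**: A sketch of the kind of syntax check described above, built on the standard-library `py_compile` module. How `monitor-prs.py` actually performs the check is not shown in this file; the function name `find_syntax_errors` is an assumption.

```python
import py_compile
from pathlib import Path


def find_syntax_errors(paths):
    """Return (path, message) pairs for files that fail to byte-compile."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns compile problems into PyCompileError
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((Path(path), exc.msg))
    return failures
```

The returned message can be pasted into the PR comment. Note that `py_compile.compile` writes a cached `.pyc` file for each file that compiles successfully.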
**Note**: This is expected behavior; the script is doing its job.
---
## CI Failure: Redis Connection Refused
**Symptom**: Tests that rely on Redis connectivity fail:
```
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.
```
**Cause**: Redis service not running in CI environment.
**Fix**: Either start Redis in the CI job before the tests run, or mock Redis in the tests. For integration tests that need a real Redis, add a service container or start Redis as a background process.
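**Example**: One possible way to keep Redis-dependent tests from failing hard when the service is absent is to probe the port first. The helper below is a sketch using only the standard library; `redis_reachable` is a hypothetical name, and it checks TCP reachability only, not that the listener is really Redis.

```python
import socket


def redis_reachable(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes ConnectionRefusedError and timeouts
        return False
```

A test module could then use `pytest.mark.skipif(not redis_reachable(), reason="Redis not running")` so that integration tests are skipped rather than reported as failures.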
---
## CI Failure: Port Already in Use
**Symptom**: Test that starts a server fails with `OSError: [Errno 98] Address already in use`.
**Cause**: A previous test did not shut its server down cleanly, so port 8006 (or another port) is still bound.
**Fix**: Ensure proper shutdown of servers in test teardown; use `asyncio` cancellation and wait for port release. Alternatively, use dynamic port allocation for CI.
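**Example**: Dynamic port allocation can be sketched by binding to port 0 and letting the OS choose. `find_free_port` is a hypothetical helper name, not part of the codebase.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]
```

The port is released when the function returns, so another process could claim it before the test server binds. Where the server supports it, passing port 0 to the server itself and reading back the bound port avoids that window.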
---
## CI Failure: Out of Memory (OOM)
**Symptom**: CI job killed with signal SIGKILL (exit code 137).
**Cause**: Building many packages or running heavy tests exceeded CI container memory limits.
**Fix**: Reduce parallelism; use swap if allowed; split CI into smaller jobs; optimize tests.
---
## CI Failure: Permission Denied on Executable Scripts
**Symptom**: `./scripts/claim-task.py: Permission denied` when cron tries to run it.
**Cause**: Script file not executable (`chmod +x` missing).
**Fix**: `chmod +x scripts/claim-task.py`; ensure all scripts have correct mode in repo.
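**Example**: A small check like the following could run in CI or a pre-commit hook to catch scripts committed without an execute bit. The function name and the `*.py` default pattern are assumptions for illustration.

```python
from pathlib import Path


def non_executable_scripts(directory, pattern="*.py"):
    """Return scripts in directory that have no execute permission bit set."""
    missing = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.stat().st_mode & 0o111:  # no x bit for user, group, other
            missing.append(path)
    return missing
```

For example, `non_executable_scripts("scripts")` returns an empty list once every script has been made executable. Git records the execute bit, so the mode change has to be committed to reach other hosts.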
---
*Log new CI failures chronologically.*

@@ -0,0 +1,209 @@
# Debugging Playbook
This is a collection of diagnostic checklists and debugging techniques for common issues in the AITBC system.
---
## 1. CLI Import Errors
**Symptom**: `aitbc` command crashes with `ImportError` or `ModuleNotFoundError`.
**Checklist**:
- [ ] Verify `apps/coordinator-api/src/app/services/__init__.py` exists.
- [ ] Check that `cli/aitbc_cli/commands/*` modules use correct relative imports.
- [ ] Ensure coordinator-api is importable: `python -c "import sys; sys.path.append('apps/coordinator-api/src'); from app.services import trading_surveillance"` should work.
- [ ] Run `aitbc --help` to see whether the base CLI loads; if it does, the problem lies in a specific command module.
- [ ] Look for absolute paths in command modules; replace with package-relative.
**Common Fixes**: See failure-archive for the hardcoded path issue.
---
## 2. Coordinator API Won't Start
**Symptom**: `uvicorn app.main:app` fails or hangs.
**Checklist**:
- [ ] Check port 8000 availability (`lsof -i:8000`).
- [ ] Verify database file exists or can be created: `apps/coordinator-api/data/`.
- [ ] Ensure `pyproject.toml` dependencies installed in active venv.
- [ ] Check logs for specific exception (traceback).
- [ ] Verify `REDIS_URL` if using broadcast; Redis must be running.
**Common Causes**:
- Missing `aiohttp` or `sqlalchemy`
- Database locked or permission denied
- Redis not running (if used)
---
## 3. Blockchain Node Not Producing Blocks
**Symptom**: RPC `/status` shows `height` not increasing.
**Checklist**:
- [ ] Is the node process running? (`ps aux | grep blockchain`)
- [ ] Check logs for consensus errors or DB errors.
- [ ] Verify ports 8006 (RPC) and 8005 (P2P) are open.
- [ ] Ensure wallet daemon running on 8015 (if needed for transactions).
- [ ] Confirm the network state: are other peers connected? On devnet, is the proposer account funded?
- [ ] Run `aitbc blockchain status` to see RPC response.
**Common Causes**:
- Not initialized (`scripts/devnet_up.sh` not executed)
- Genesis proposer has no funds
- P2P connectivity not established (check Redis for gossip)
---
## 4. AI Provider Job Fails with Payment Error
**Symptom**: Provider returns 403 or reports an insufficient balance.
**Checklist**:
- [ ] Did the buyer send funds first? `aitbc blockchain send ...` must precede the job request.
- [ ] Check provider's balance before/after; confirm expected amount transferred.
- [ ] Verify provider and buyer are on same network (ait-devnet).
- [ ] Ensure provider's wallet daemon is running (port 8015).
- [ ] Check coordinator job URL (`--marketplace-url`) reachable.
**Resolution**: Follow the correct payment flow: the buyer sends the transaction, waits for confirmation, and only then issues `POST /job`.
---
## 5. Gitea API Calls Fail (Transient)
**Symptom**: Scripts fail with connection reset, 502, etc.
**Checklist**:
- [ ] Is Gitea instance up? Can you `curl` the API?
- [ ] Check network connectivity and DNS.
- [ ] Add retry with exponential backoff (already in `monitor-prs.py`).
- [ ] If persistent, check Gitea logs for server-side issues.
**Temporary Workaround**: Wait and re-run the script manually.
---
## 6. Redis Pub/Sub Not Delivering Messages
**Symptom**: Agents don't receive broadcast messages.
**Checklist**:
- [ ] Is Redis running? `redis-cli ping` should return PONG.
- [ ] Check that all agents use the same `REDIS_URL`.
- [ ] Verify message channel names match exactly.
- [ ] Ensure agents are subscribed before messages are published.
- [ ] Use `redis-cli SUBSCRIBE <channel>` to debug manually.
**Note**: This is dev-only; production will use direct P2P.
---
## 7. Starlette Import Errors After Upgrade
**Symptom**: `ImportError: cannot import name 'Broadcast'`.
**Fix**: Pin Starlette to `>=0.37.2,<0.38` as documented in the failure archive. Alternatively, refactor to use a different broadcast mechanism (future work).
---
## 8. Test Isolation Failures
**Symptom**: Tests pass individually but fail when run together.
**Checklist**:
- [ ] Look for shared resources (database files, ports, files).
- [ ] Use fixtures with `scope="function"` and proper teardown.
- [ ] Clean up after each test: close DB connections, stop servers.
- [ ] Avoid global state; inject dependencies.
**Action**: Refactor tests to be hermetic.
---
## 9. Port Conflicts
**Symptom**: `OSError: Address already in use`.
**Checklist**:
- [ ] Identify which process owns the port: `lsof -i:<port>`.
- [ ] Kill lingering processes from previous runs.
- [ ] Use dynamic port allocation for tests if possible.
- [ ] Ensure services shut down cleanly on exit (signals).
---
## 10. Memory Conflicts (Concurrent Editing)
**Symptom**: Two agents editing the same file cause Git merge conflicts.
**Prevention**:
- Use `ai-memory/daily/` with one file per day; agents append, not edit.
- Avoid editing the same file simultaneously; coordinate via claims if necessary.
- If conflict occurs, resolve manually by merging entries; preserve both contributions.
---
## 11. Cron Jobs Not Running
**Symptom**: Expected periodic tasks not executing.
**Checklist**:
- [ ] Verify cron entries (`crontab -l` for the user).
- [ ] Check system cron logs (`/var/log/cron`, `journalctl`).
- [ ] Ensure scripts are executable and paths are absolute or correctly relative (use `cd` first).
- [ ] Redirect output to a log file for debugging: `>> /var/log/claim-task.log 2>&1`.
---
## 12. Wallet Operations Fail (Unknown Wallet)
**Symptom**: `aitbc wallet balance` returns "wallet not found".
**Checklist**:
- [ ] Has wallet been created? Use `aitbc wallet create` first.
- [ ] Check the wallet name and hostname pattern: `<hostname><wallet_name>_simple`.
- [ ] Verify wallet daemon running on port 8015.
- [ ] Ensure RPC URL matches (coordinator API running on 8000).
---
## 13. CI Jobs Stuck / Timeout
**Symptom**: CI job runs for > 1 hour without finishing.
**Checklist**:
- [ ] Check for infinite loops or deadlocks in tests.
- [ ] Increase CI timeout if legitimate long test.
- [ ] Run `pytest -x` to stop at the first failure and isolate the root cause.
- [ ] Split tests into smaller batches.
---
## 14. Permission Denied on Git Operations
**Symptom**: `fatal: could not read Username` or `Permission denied (publickey)`.
**Cause**: SSH key not loaded or Gitea token not set.
**Fix**:
- Ensure SSH agent has the key (`ssh-add -l`).
- Set `GITEA_TOKEN` environment variable for API operations.
- Test with `git push` manually.
---
## 15. Merge Conflict in Claim Branch
**Symptom**: Pulling latest main into claim branch causes conflicts.
**Resolution**:
- Resolve conflicts manually; keep both sets of changes if they are independent.
- Re-run tests after resolution.
- Push resolved branch.
- Consider rebasing instead of merging to keep history linear.
---
*Add new debugging patterns as they emerge.*

@@ -0,0 +1,134 @@
# Failure Archive
This archive collects known failure patterns experienced during development, along with their causes and resolutions. Agents should consult before debugging similar symptoms.
---
## Failure: CLI Fails to Launch (Hardcoded Absolute Paths)
**Date**: 2026-03-13
**Symptom**: `ImportError: No module named 'trading_surveillance'` when running `aitbc --help` or any subcommand.
**Cause**: Multiple command modules in `cli/aitbc_cli/commands/` used:
```python
sys.path.append('/home/oib/windsurf/aitbc/apps/coordinator-api/src/app/services')
```
This path is user-specific and does not exist on the `aitbc1` host.
**Modules affected**:
- `surveillance.py`
- `ai_trading.py`
- `ai_surveillance.py`
- `advanced_analytics.py`
- `regulatory.py`
- `enterprise_integration.py`
**Resolution**:
1. Added `__init__.py` to `apps/coordinator-api/src/app/services/` to make it a proper package.
2. Updated each affected command module to use:
```python
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'apps', 'coordinator-api', 'src'))
from app.services.trading_surveillance import ...
```
(or simply `from app.services import <module>` after path setup)
3. Removed hardcoded fallback absolute paths.
4. Verified: `aitbc --help` loads without errors; `aitbc surveillance start` works.
**Prevention**: Use package-relative imports; avoid user-specific absolute paths. Consider making coordinator-api a proper installable dependency.
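**Example**: The path setup from the resolution could be factored into a shared helper so the six command modules do not each repeat it. This is a sketch only: `coordinator_src` and `ensure_on_sys_path` are hypothetical names, and the `parents[3]` index assumes the module lives at `cli/aitbc_cli/commands/<module>.py`.

```python
import sys
from pathlib import Path


def coordinator_src(module_file):
    """Locate apps/coordinator-api/src relative to a CLI command module."""
    # cli/aitbc_cli/commands/<module>.py -> parents[3] is the repository root
    repo_root = Path(module_file).resolve().parents[3]
    return repo_root / "apps" / "coordinator-api" / "src"


def ensure_on_sys_path(path):
    """Add path to sys.path once, so repeated imports do not duplicate it."""
    entry = str(path)
    if entry not in sys.path:
        sys.path.append(entry)
```

A command module would call `ensure_on_sys_path(coordinator_src(__file__))` before importing from `app.services`. Installing coordinator-api as a real dependency would remove the need for any `sys.path` manipulation.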
---
## Failure: Missing Dependency (aiohttp)
**Symptom**: `ModuleNotFoundError: No module named 'aiohttp'` when importing `kyc_aml_providers.py`.
**Cause**: `cli/pyproject.toml` did not declare `aiohttp`.
**Resolution**: `poetry add aiohttp` (or `pip install aiohttp` in venv). Updated `pyproject.toml` accordingly.
**Prevention**: Keep dependencies declared; run tests in fresh environment.
---
## Failure: Package Build Fails Due to Missing README.md
**Symptom**: `poetry build` for `packages/py/aitbc-agent-sdk` fails with `FileNotFoundError: README.md`.
**Cause**: The package directory lacked a README.md, which some build configurations require.
**Resolution**: Created an empty or placeholder README.md. Later enhanced with usage examples.
**Prevention**: Ensure each package has at least a minimal README; add pre-commit hook to check.
---
## Failure: Starlette Broadcast Module Missing After Upgrade
**Symptom**: `ImportError: cannot import name 'Broadcast' from 'starlette'` after upgrading Starlette to 0.38+.
**Cause**: Starlette removed the Broadcast module in version 0.38.
**Impact**: P2P gossip backend (using Redis broadcast) fails to import. Services crash on startup.
**Resolution**:
- Pinned Starlette to `>=0.37.2,<0.38` in `pyproject.toml`.
- Added comment explaining the pin and that production should replace broadcast with direct P2P.
**Prevention**: Avoid upgrading Starlette without testing; track deprecations.
**See also**: `debugging-notes.md` for diagnostic steps.
---
## Failure: Docker Compose Not Found
**Symptom**: `docker-compose: command not found` even though Docker is installed.
**Cause**: System has Docker Compose v2 (`docker compose`) but not v1 (`docker-compose`). The project documentation referenced `docker-compose`.
**Resolution**: Updated documentation to use `docker compose` (or detect whichever is available). Alternatively, create a symlink or alias.
**Prevention**: Detect both variants in scripts; document both names.
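**Example**: Detecting both variants might look like the sketch below. It prefers the v2 plugin (`docker compose`) and falls back to the v1 binary (`docker-compose`). The function names are assumptions, and the `which` and `has_plugin` parameters exist only so the logic can be exercised without Docker installed.

```python
import shutil
import subprocess


def _docker_has_compose_plugin():
    """Return True if `docker compose version` runs successfully."""
    result = subprocess.run(
        ["docker", "compose", "version"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def detect_compose(which=shutil.which, has_plugin=_docker_has_compose_plugin):
    """Return the argv prefix for Compose, preferring v2, or None if absent."""
    if which("docker") and has_plugin():
        return ["docker", "compose"]
    if which("docker-compose"):
        return ["docker-compose"]
    return None
```

A caller can then build commands as `detect_compose() + ["up", "-d"]` after checking that the result is not `None`.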
---
## Failure: Test Scripts Use Absolute Paths
**Symptom**: `run_all_tests.sh` fails with "No such file or directory" for test scenario scripts located in `/home/oib/windsurf/aitbc/...`.
**Cause**: Test scripts referenced a specific user's home directory, not the project root.
**Resolution**: Rewrote paths to be project-relative using `$(dirname "$0")`. Example: `$(dirname "$0")/test_scenario_a.sh`.
**Prevention**: Never hardcode absolute paths; always compute relative to project root or script location.
---
## Failure: Gitea API Unstable During PR Approval
**Symptom**: Script `monitor-prs.py` fails to post approvals due to "connection reset" or 5xx errors from Gitea.
**Cause**: Gitea instance may be under load or temporarily unavailable.
**Resolution**: Added retry logic with exponential backoff. If the call still fails, the script logs the error and skips the PR; the next run retries it.
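**Example**: The retry logic in `monitor-prs.py` is not reproduced in this archive; the following is a generic sketch of exponential backoff under stated assumptions. `retry_with_backoff` is a hypothetical name, and the default `retry_on` tuple covers connection resets and timeouts only, so an HTTP client would have to raise one of these (or a type added to the tuple) for 5xx responses.

```python
import time


def retry_with_backoff(call, attempts=4, base_delay=1.0, sleep=time.sleep,
                       retry_on=(ConnectionError, TimeoutError)):
    """Run call(); on a transient error wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log and skip
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 1, 2, and 4 seconds. The `sleep` parameter is injectable so the behaviour can be tested without real delays.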
**Prevention**: Make API clients resilient to transient failures.
---
## Failure: Coordinator API DB Init Not Idempotent
**Symptom**: Running `init_db()` multiple times causes `sqlite3.IntegrityError` due to duplicate index creation.
**Cause**: `init_db()` did not catch duplicate index errors; it assumed fresh DB.
**Resolution**: Wrapped index creation in try/except blocks catching `sqlite3.IntegrityError` (or using `IF NOT EXISTS` where supported). This made initialization idempotent.
**Impact**: Coordinator can be started repeatedly without manual DB cleanup.
**Prevention**: Design DB initialization to be idempotent from the start.
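**Example**: A minimal sketch of idempotent initialization using the `IF NOT EXISTS` variant mentioned above. The coordinator's real schema is not shown here; the `jobs` table and `idx_jobs_status` index are hypothetical.

```python
import sqlite3


def init_db(conn):
    """Create the schema; safe to call on a fresh or an existing database."""
    # Hypothetical table and index names, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs (status)")
    conn.commit()
```

Calling `init_db(conn)` twice on the same connection leaves exactly one table and one index, with no exception to catch.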
---
*Add new failures chronologically below.*