feat: implement structured agent memory architecture

2026-03-15 21:09:39 +00:00
commit 2d68f66405
17 changed files with 2273 additions and 0 deletions
--- a/ai-memory/decisions/architectural-decisions.md
+++ b/ai-memory/decisions/architectural-decisions.md
@@ -0,0 +1,172 @@
+# Architectural Decisions
+
+This log records significant architectural decisions made during the AITBC project to prevent re-debating past choices.
+
+## Format
+- **Decision**: What was decided
+- **Date**: YYYY-MM-DD
+- **Context**: Why the decision was needed
+- **Alternatives Considered**: Other options
+- **Reason**: Why this option was chosen
+- **Impact**: Consequences (positive and negative)
+
+---
+
+## Decision 1: Stability Rings for PR Review Automation
+
+**Date**: 2026-03-15
+
+**Context**: Need to automate PR reviews while maintaining quality. Different parts of the codebase have different risk profiles; a blanket policy would either be too strict or too lax.
+
+**Alternatives Considered**:
+1. Manual review for all PRs (high overhead, slow)
+2. Path-based auto-approve with no ring classification (fragile)
+3. Single threshold based on file count (too coarse)
+
+**Reason**: Rings provide clear, scalable zones of trust. Core packages (Ring 0) require human review; lower rings can be automated. This balances safety and velocity.
+
+**Impact**:
+- Reduces review burden for non-critical changes
+- Maintains rigor where it matters (packages, blockchain)
+- Requires maintainers to understand and respect ring boundaries
+- Automated scripts can enforce ring policies consistently
+
+**Status**: Implemented and documented in `architecture/agent-roles.md` and `architecture/system-overview.md`
+
+---
+
+## Decision 2: Hierarchical Memory System (ai-memory/)
+
+**Date**: 2026-03-15
+
+**Context**: Existing memory was unstructured (`memory/` with hourly files per agent, `MEMORY.md` notes). This made information retrieval slow, scattered knowledge across files, and made coordination difficult.
+
+**Alternatives Considered**:
+1. Single large daily file for all agents (edit conflicts, hard to parse)
+2. Wiki system (external dependency, complexity)
+3. Tag-based file system (ad-hoc, hard to enforce)
+
+**Reason**: A structured hierarchy with explicit layers (daily, architecture, decisions, failures, knowledge, agents) aligns with how agents need to consume information. Clear protocols for read/write operations improve consistency.
+
+**Impact**:
+- Agents have a predictable memory layout
+- Faster recall through organized documents
+- Reduces hallucinations by providing reliable sources
+- Encourages documentation discipline (record decisions, failures)
+
+**Status**: Implemented; this file is part of it.
+
+---
+
+## Decision 3: Distributed Task Claiming via Atomic Git Branches
+
+**Date**: 2026-03-15
+
+**Context**: Multiple autonomous agents need to claim issues without stepping on each other. There is no central task queue service; we rely on Git as the coordination point.
+
+**Alternatives Considered**:
+1. Gitea issue assignment API (requires locking, may race)
+2. Shared JSON lock file in repo (prone to merge conflicts)
+3. Cron-based claiming with sleep-and-retry (simple but effective)
+
+**Reason**: Atomic Git branch creation is a distributed mutex provided by Git itself. It's race-safe without extra infrastructure. Combining with a claiming script and issue labels yields a simple, robust system.
+
+**Impact**:
+- Eliminates duplicate work
+- Allows agents to operate independently
+- Easy to audit: branch names reveal claims
+- Claim branches are cleaned up after PR merge/close
+
+**Status**: Implemented in `scripts/claim-task.py`
+
+---
+
+## Decision 4: P2P Gossip via Redis Broadcast (Dev Only)
+
+**Date**: 2026-03-15
+
+**Context**: Agents need to broadcast messages to peers on the network. The initial implementation needed something quick and reliable for local development.
+
+**Alternatives Considered**:
+1. Direct peer-to-peer sockets (complex NAT traversal)
+2. Central message broker with auth (more setup)
+3. Multicast (limited to local network)
+
+**Reason**: Redis pub/sub is simple to set up, reliable, and works well on a local network. It's explicitly marked as dev-only; production will require a secure direct P2P mechanism.
+
+**Impact**:
+- Fast development iteration
+- No security for internet deployment (known limitation)
+- Forces future redesign for production (good constraint)
+
+**Status**: Dev environment uses Redis; production path deferred.
+
+---
+
+## Decision 5: Starlette Version Pinning (<0.38)
+
+**Date**: 2026-03-15
+
+**Context**: Starlette removed the `Broadcast` module in version 0.38, breaking the gossip backend that depends on it.
+
+**Alternatives Considered**:
+1. Migrate to a different broadcast library (effort, risk)
+2. Reimplement broadcast on top of Redis only (eventual)
+3. Pin Starlette version until production P2P is ready
+
+**Reason**: Pinning is the quickest way to restore dev environment functionality with minimal changes. The broadcast module is already dev-only; replacing it can be scheduled for production hardening.
+
+**Impact**:
+- Dev environment stable again
+- Must remember to bump/remove pin before production
+- Prevents accidental upgrades that break things
+
+**Status**: `pyproject.toml` pins `starlette>=0.37.2,<0.38`
+
+---
+
+## Decision 6: Use Poetry for Package Management
+
+**Date**: Prior to 2026-03-15
+
+**Context**: Need a consistent way to define dependencies, build packages, and manage virtualenvs across multiple packages in the monorepo.
+
+**Alternatives Considered**:
+1. pip + requirements.txt (flat, no build isolation)
+2. Hatch (similar, but Poetry chosen)
+3. Custom Makefile (reinventing wheel)
+
+**Reason**: Poetry provides a modern, all-in-one solution: dependency resolution, virtualenv management, building, publishing. It works well with monorepos via workspace-style handling (or multiple pyproject files).
+
+**Impact**:
+- Standardized packaging
+- Faster onboarding (poetry install)
+- Some learning curve; additional tool
+
+**Status**: Adopted across packages; ongoing.
+
+---
+
+## Decision 7: Blockchain Node Separate from Coordinator
+
+**Date**: Prior to 2026-03-15
+
+**Context**: The system needs a ledger for payments and consensus, but also a marketplace for job matching. Should they be one service or two?
+
+**Alternatives Considered**:
+1. Monolithic service (simpler deployment but tighter coupling)
+2. Separate services with well-defined API (more flexible, scalable)
+3. On-chain marketplace (too slow, costly)
+
+**Reason**: Separation of concerns: blockchain handles consensus and accounts; coordinator handles marketplace logic. This allows each to evolve independently and be scaled separately.
+
+**Impact**:
+- Clear service boundaries
+- Requires cross-service communication (HTTP)
+- More processes to run in dev
+
+**Status**: Two services in production (devnet).
+
+---
+
+*Add subsequent decisions below as they arise.*
--- a/ai-memory/decisions/protocol-decisions.md
+++ b/ai-memory/decisions/protocol-decisions.md
@@ -0,0 +1,188 @@
+# Protocol Decisions
+
+This document records decisions about agent coordination protocols, communication standards, and operational procedures.
+
+---
+
+## Decision 1: Memory Access Protocol
+
+**Date**: 2026-03-15
+
+**Protocol**: Before any task, agents must read:
+1. `ai-memory/architecture/system-overview.md`
+2. `ai-memory/failures/debugging-notes.md` (or relevant failure logs)
+3. `ai-memory/daily/YYYY-MM-DD.md` (today's entry; if none, create it)
+
+After completing a task, agents must:
+- Append a concise summary to `ai-memory/daily/YYYY-MM-DD.md`
+- Add it as a new bullet point, with a timestamp or section heading as needed
+
+**Rationale**: Ensures agents operate with current context and avoid repeating mistakes. Creates a persistent, searchable audit trail.
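
As a rough illustration of the read step, the sketch below returns the files an agent would read before a task and creates today's daily file when it is missing. The function name `preflight`, the `memory_root` parameter, and the `# YYYY-MM-DD` heading written into a new daily file are assumptions made for this example; the protocol itself only names the three paths.

```python
from datetime import date
from pathlib import Path

# Relative paths taken from the protocol's read list.
REQUIRED_READS = (
    "architecture/system-overview.md",
    "failures/debugging-notes.md",
)


def preflight(memory_root: Path, today: date) -> list[Path]:
    """Return the files to read before a task, creating today's daily file if absent."""
    daily = memory_root / "daily" / f"{today.isoformat()}.md"
    if not daily.exists():
        daily.parent.mkdir(parents=True, exist_ok=True)
        # Assumed initial content: a single date heading.
        daily.write_text(f"# {today.isoformat()}\n\n", encoding="utf-8")
    return [memory_root / rel for rel in REQUIRED_READS] + [daily]
```

A caller might run `preflight(Path("ai-memory"), date.today())` and read each returned file in order.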
+
+---
+
+## Decision 2: Failure Recording Protocol
+
+**Date**: 2026-03-15
+
+**Protocol**: When an agent discovers a new failure pattern (bug, misconfiguration, CI break):
+1. Append to `ai-memory/failures/failure-archive.md` in the format:
+   ```
+   ## Failure: [Short title] – [Date]
+   - **Symptom**: ...
+   - **Cause**: ...
+   - **Resolution**: ...
+   ```
+2. If the failure is CI-specific, also append to `ci-failures.md`
+3. If the failure involves debugging steps, also append to `debugging-notes.md`
+
+**Rationale**: Centralizes troubleshooting knowledge; agents can proactively consult before debugging similar issues.
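
One way an agent could produce entries in the archive format is a small formatter like the one below. This is a minimal sketch: `format_failure_entry` and its parameters are hypothetical names, and the ISO date format is an assumption, since the template only says `[Date]`.

```python
from datetime import date


def format_failure_entry(title: str, when: date, symptom: str, cause: str, resolution: str) -> str:
    """Render one failure record in the archive format from the protocol."""
    return (
        f"## Failure: {title} – {when.isoformat()}\n"
        f"- **Symptom**: {symptom}\n"
        f"- **Cause**: {cause}\n"
        f"- **Resolution**: {resolution}\n"
    )
```

The returned string would then be appended to `ai-memory/failures/failure-archive.md`, and to the CI or debugging files when steps 2 and 3 apply.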
+
+---
+
+## Decision 3: Decision Logging Protocol
+
+**Date**: 2026-03-15
+
+**Protocol**: When an agent makes an architectural decision (affecting system structure, APIs, dependencies):
+1. Record in `ai-memory/decisions/architectural-decisions.md` using the ADR format (Decision, Date, Context, Alternatives Considered, Reason, Impact)
+2. Keep entries concise but complete enough for future readers
+
+**Rationale**: Prevents re-debating settled questions; preserves reasoning for new agents.
+
+---
+
+## Decision 4: Task Claiming Protocol (Distributed Lock)
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- Issues must have label `task`, `bug`, `feature`, or `good-first-task-for-agent` to be claimable.
+- Agent runs `scripts/claim-task.py` periodically (cron every 5 minutes).
+- Script:
+  1. Fetches open, unassigned issues with eligible labels.
+  2. Skips issues already associated with the agent's branch or claimed by another agent.
+  3. Picks highest-scoring issue (simple scoring: label priority + age).
+  4. Creates an atomic Git branch: `aitbc1/<issue-number>-<slugified-title>`.
+  5. If branch creation succeeds, posts comment on issue: "Claiming this task (branch: ...)"
+  6. Works on the issue in that branch.
+- Claim is released when the branch is deleted (PR merged or closed).
+
+**Rationale**: Uses Git's atomic branch creation as a distributed mutex. No central coordinator needed; agents can run independently.
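
The selection and naming steps (3 and 4) could look roughly like the sketch below. It is not the contents of `scripts/claim-task.py`: the `Issue` type, the label weights in `LABEL_PRIORITY`, and the "one point per day of age" scoring are assumptions, because the protocol only says "label priority + age". The remote branch creation that acts as the lock is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights; the protocol does not specify them.
LABEL_PRIORITY = {"bug": 3, "feature": 2, "task": 1, "good-first-task-for-agent": 1}


@dataclass
class Issue:
    number: int
    title: str
    labels: list[str]
    age_days: int


def score(issue: Issue) -> int:
    """Highest label weight plus age in days (illustrative weighting)."""
    weight = max((LABEL_PRIORITY.get(label, 0) for label in issue.labels), default=0)
    return weight + issue.age_days


def pick_issue(issues: list[Issue]) -> Optional[Issue]:
    """Return the highest-scoring issue that carries an eligible label, if any."""
    eligible = [i for i in issues if any(label in LABEL_PRIORITY for label in i.labels)]
    return max(eligible, key=score, default=None)


def claim_branch(agent: str, issue: Issue) -> str:
    """Build `<agent>/<issue-number>-<slugified-title>`."""
    slug = re.sub(r"[^a-z0-9]+", "-", issue.title.lower()).strip("-")
    return f"{agent}/{issue.number}-{slug}"
```

With these assumed weights, an older `task` issue can outrank a fresh `bug`; a real script may well weight labels more heavily.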
+
+---
+
+## Decision 5: PR Review Protocol
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- All PRs must request review from `@aitbc` (the reviewer agent).
+- Reviewer runs `scripts/monitor-prs.py` every 10 minutes (cron).
+- For **sibling PRs** (author ≠ self):
+  - Fetch branch, run `py_compile` on changed Python files.
+  - If syntax fails: request changes with comment "Syntax error detected."
+  - If syntax passes:
+    - Check Stability Ring of modified paths.
+    - Ring 0: cannot auto-approve; manual review required (request changes or comment)
+    - Ring 1: auto-approve with caution (approve + comment "Auto-approved Ring 1")
+    - Ring 2: auto-approve (approve)
+    - Ring 3: auto-approve (approve)
+- For **own PRs**: script auto-requests review from `@aitbc`.
+- CI status is monitored; failing CI triggers comment "CI failed; please fix."
+- After approvals and CI pass, reviewer may merge (or allow auto-merge if configured).
+- On merge: delete the claim branch; close linked issues.
+
+**Rationale**: Automates routine reviews while preserving human attention for critical areas. Enforces Ring policy consistently.
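
The sibling-PR branch of this protocol reduces to a small decision function, sketched below. It is an illustration rather than the logic of `scripts/monitor-prs.py`. Two assumptions are made here that the protocol does not state: when a PR touches several rings, the most sensitive one (lowest number) governs, and a PR with no classified paths falls back to manual review.

```python
from enum import Enum


class Action(Enum):
    REQUEST_CHANGES = "request changes"
    MANUAL_REVIEW = "manual review"
    APPROVE_WITH_COMMENT = "approve with comment"
    APPROVE = "approve"


def review_action(syntax_ok: bool, rings: list[int]) -> Action:
    """Map a sibling PR's syntax result and touched rings to a review action."""
    if not syntax_ok:
        return Action.REQUEST_CHANGES
    # Assumption: lowest ring number wins; no rings at all is treated as Ring 0.
    ring = min(rings, default=0)
    if ring == 0:
        return Action.MANUAL_REVIEW
    if ring == 1:
        return Action.APPROVE_WITH_COMMENT
    return Action.APPROVE  # Rings 2 and 3
```

For example, `review_action(True, [2, 0])` yields `Action.MANUAL_REVIEW`, because one of the modified paths is in Ring 0.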
+
+---
+
+## Decision 6: Memory File Naming and Format
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- Daily memory: `ai-memory/daily/YYYY-MM-DD.md` (one file per day, Markdown)
+- Append-only: never edit or delete past entries; only add new bullets at the end.
+- Timestamps are optional but helpful: use headings like `## 15:00–15:59 UTC Update` or a bullet that starts with the time.
+- Each entry should be concise but informative: what was done, what was learned, any blockers.
+- Today's file must exist when needed; create it if absent.
+
+**Rationale**: Chronological, easy to scan. Append-only prevents tampering with history.
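
A minimal sketch of an append-only write is shown below, assuming a bullet layout of `- HH:MM UTC [agent] text`. That layout, the function name `append_daily_entry`, and its parameters are illustrative choices; the protocol fixes only the file path and the append-only rule.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_daily_entry(memory_root: Path, agent: str, text: str, now: datetime) -> Path:
    """Append one timestamped bullet to the daily file for `now`; earlier entries are untouched."""
    daily = memory_root / "daily" / f"{now:%Y-%m-%d}.md"
    daily.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" creates the file if it is missing and only ever writes at the end.
    with daily.open("a", encoding="utf-8") as fh:
        fh.write(f"- {now:%H:%M} UTC [{agent}] {text}\n")
    return daily
```

Callers would pass a UTC time, for instance `datetime.now(timezone.utc)`, so that the file name and the bullet agree on the date.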
+
+---
+
+## Decision 7: Commit Message Convention
+
+**Date**: 2026-03-15
+
+**Protocol**: Use Conventional Commits format:
+- `feat: ...` for new features
+- `fix: ...` for bug fixes
+- `docs: ...` for documentation only
+- `refactor: ...` for code restructuring
+- `test: ...` for adding tests
+- `chore: ...` for trivial/maintenance changes
+
+Optionally add scope in parentheses: `feat(coordinator): add job cancellation`
+
+**Rationale**: Enables automated changelog generation and quick scanning of history.
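
A subject line can be checked against this convention with a short regular expression, as in the sketch below. The allowed scope characters (lowercase letters, digits, hyphens) are an assumption, and the check covers only the six types listed in this protocol, not the full Conventional Commits specification.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

# type, optional "(scope)", then ": " and a non-empty description.
SUBJECT_RE = re.compile(rf"^(?:{'|'.join(COMMIT_TYPES)})(?:\([a-z0-9-]+\))?: \S.*$")


def is_conventional(subject: str) -> bool:
    """Return True if the subject line matches the convention above."""
    return SUBJECT_RE.match(subject) is not None
```

Both `feat(coordinator): add job cancellation` and this commit's own subject, `feat: implement structured agent memory architecture`, pass the check; a capitalized type or a missing type does not.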
+
+---
+
+## Decision 8: Issue and PR Naming
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- Issue title: concise summary, optionally with label prefix: `[Bug] CLI crashes on import`
+- PR title: follows the commit convention, e.g. `fix: resolve CLI service imports` or `feat: implement claim-task script`
+- Branch name: `<agent>/<issue>-<slug>` (e.g., `aitbc1/3-add-tests-for-aitbc-core`)
+- Branch naming enforces claim linkage to issue.
+
+**Rationale**: Consistent naming makes it easy to trace work items to code changes.
+
+---
+
+## Decision 9: Agent Identity in Communications
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- In Gitea comments and issues, sign as the agent identity (e.g., "— aitbc1").
+- Use the same identity in daily memory entries.
+- When collaborating with a sibling agent, mention `@aitbc` or `@aitbc1` as appropriate.
+
+**Rationale**: Clear attribution for accountability and coordination.
+
+---
+
+## Decision 10: Emergency Protocol for Critical CI Failure
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- If a PR's CI fails due to infrastructure (not code), the agent who owns the PR should:
+  1. Investigate via `ai-memory/failures/ci-failures.md` and `debugging-notes.md`
+  2. If it is a new failure pattern, record it immediately.
+  3. Comment on the PR with diagnosis and ETA.
+  4. If unable to resolve quickly, escalate to human via other channels (outside scope).
+
+**Rationale**: CI failures block all work; swift response is critical. Documentation prevents duplicated troubleshooting.
+
+---
+
+## Decision 11: Environment Configuration Drift Detection
+
+**Date**: 2026-03-15
+
+**Protocol**:
+- `ai-memory/knowledge/environment.md` must reflect the current dev environment (ports, URLs, credential placeholders).
+- If an agent changes a configuration (e.g., port, database path), they must update this file immediately.
+- Before starting a task, agents should verify that their environment matches the documented settings.
+
+**Rationale**: Prevents "works on my machine" issues and keeps knowledge current.
+
+---
+
+*Add subsequent protocol decisions below.*