aitbc/ai-memory/decisions/architectural-decisions.md

# Architectural Decisions
This log records significant architectural decisions made during the AITBC project to prevent re-debating past choices.
## Format
- **Decision**: What was decided
- **Date**: YYYY-MM-DD
- **Context**: Why the decision was needed
- **Alternatives Considered**: Other options
- **Reason**: Why this option was chosen
- **Impact**: Consequences (positive and negative)
---
## Decision 1: Stability Rings for PR Review Automation
**Date**: 2026-03-15
**Context**: Need to automate PR reviews while maintaining quality. Different parts of the codebase have different risk profiles; a blanket policy would either be too strict or too lax.
**Alternatives Considered**:
1. Manual review for all PRs (high overhead, slow)
2. Path-based auto-approve with no ring classification (fragile)
3. Single threshold based on file count (too coarse)
**Reason**: Rings provide clear, scalable zones of trust. Core packages (Ring 0) require human review; lower rings can be automated. This balances safety and velocity.
**Impact**:
- Reduces review burden for non-critical changes
- Maintains rigor where it matters (packages, blockchain)
- Requires maintainers to understand and respect ring boundaries
- Automated scripts can enforce ring policies consistently
**Status**: Implemented and documented in `architecture/agent-roles.md` and `architecture/system-overview.md`
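As a sketch, the ring policy could be enforced mechanically like this. The ring numbers and path prefixes below are illustrative assumptions, not the project's actual mapping (see `architecture/agent-roles.md` for the real classification):

```python
# Sketch of ring-based PR gating. Lower ring number = more critical.
# The path prefixes are hypothetical examples, not the real configuration.
RING_PREFIXES = {
    0: ("packages/", "blockchain/"),   # core: human review required
    1: ("services/",),                 # mid: automated checks plus spot review
    2: ("docs/", "scripts/"),          # outer: eligible for auto-approve
}

def ring_of(path: str) -> int:
    """Return the most critical (lowest-numbered) ring whose prefix matches."""
    for ring in sorted(RING_PREFIXES):
        if any(path.startswith(prefix) for prefix in RING_PREFIXES[ring]):
            return ring
    return max(RING_PREFIXES)  # unclassified paths default to the outer ring

def needs_human_review(changed_files: list[str]) -> bool:
    """A PR inherits the strictest ring among all of its changed files."""
    return min(ring_of(f) for f in changed_files) == 0
```

The key design point is the `min()`: a PR touching one Ring 0 file is treated as a Ring 0 PR, no matter how many docs-only files it also touches.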
---
## Decision 2: Hierarchical Memory System (ai-memory/)
**Date**: 2026-03-15
**Context**: Existing memory was unstructured (`memory/` with hourly files per agent, plus ad-hoc `MEMORY.md` notes). This made retrieval slow, scattered knowledge across files, and complicated coordination between agents.
**Alternatives Considered**:
1. Single large daily file for all agents (edit conflicts, hard to parse)
2. Wiki system (external dependency, complexity)
3. Tag-based file system (ad-hoc, hard to enforce)
**Reason**: A structured hierarchy with explicit layers (daily, architecture, decisions, failures, knowledge, agents) aligns with how agents need to consume information. Clear protocols for read/write operations improve consistency.
**Impact**:
- Agents have a predictable memory layout
- Faster recall through organized documents
- Reduces hallucinations by providing reliable sources
- Encourages documentation discipline (record decisions, failures)
**Status**: Implemented; this file is part of it.
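A minimal sketch of how an agent could resolve paths inside the hierarchy. The layer names come from the decision above; the `memory_path` helper and its date-stamped naming scheme for daily notes are hypothetical:

```python
from datetime import date
from pathlib import Path

# The layers named in this decision, as top-level directories under ai-memory/.
LAYERS = {"daily", "architecture", "decisions", "failures", "knowledge", "agents"}

def memory_path(root: str, layer: str, name: str) -> Path:
    """Resolve a document path inside the ai-memory/ hierarchy.

    Daily notes are date-stamped per agent so concurrent writers
    never contend for the same file.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown memory layer: {layer!r}")
    if layer == "daily":
        name = f"{date.today().isoformat()}-{name}"
    return Path(root) / layer / f"{name}.md"
```

Rejecting unknown layers at write time is what makes the hierarchy enforceable, in contrast to the ad-hoc tag-based alternative.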
---
## Decision 3: Distributed Task Claiming via Atomic Git Branches
**Date**: 2026-03-15
**Context**: Multiple autonomous agents need to claim issues without stepping on each other. There is no central task queue service; we rely on Git as the coordination point.
**Alternatives Considered**:
1. Gitea issue assignment API (requires locking, may race)
2. Shared JSON lock file in repo (prone to merge conflicts)
3. Cron-based claiming with sleep-and-retry (polling adds latency, and retries can still race)
**Reason**: Atomic Git branch creation is a distributed mutex provided by Git itself. It's race-safe without extra infrastructure. Combining with a claiming script and issue labels yields a simple, robust system.
**Impact**:
- Eliminates duplicate work
- Allows agents to operate independently
- Easy to audit: branch names reveal claims
- Claim branches are cleaned up after PR merge/close
**Status**: Implemented in `scripts/claim-task.py`
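The atomic primitive can be sketched with `git push --force-with-lease=<ref>:` (an empty expected value tells Git the remote ref must not already exist). The `try_claim` helper below is illustrative, not the actual `scripts/claim-task.py`; the empty claim commit is an assumption about how claims are recorded:

```python
import subprocess

def try_claim(repo_dir: str, remote: str, issue: int, agent_id: str) -> bool:
    """Attempt to claim an issue by creating claim/issue-<n> on the remote.

    A unique empty commit records who claimed (and guarantees two pushes
    can never be byte-identical); --force-with-lease with an empty expected
    value makes the push succeed only if the remote ref does not exist yet,
    so exactly one concurrent claimer wins the race.
    """
    ref = f"refs/heads/claim/issue-{issue}"
    subprocess.run(
        ["git", "-C", repo_dir, "commit", "--allow-empty",
         "-m", f"claim: issue #{issue} by {agent_id}"],
        check=True, capture_output=True,
    )
    push = subprocess.run(
        ["git", "-C", repo_dir, "push", f"--force-with-lease={ref}:",
         remote, f"HEAD:{ref}"],
        capture_output=True,
    )
    return push.returncode == 0
```

A plain (non-lease) push would be insufficient: if a second agent's tip happened to be a descendant of the first claim, the push would fast-forward and both agents would believe they won.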
---
## Decision 4: P2P Gossip via Redis Broadcast (Dev Only)
**Date**: 2026-03-15
**Context**: Agents need to broadcast messages to peers on the network. The initial implementation needed something quick and reliable for local development.
**Alternatives Considered**:
1. Direct peer-to-peer sockets (complex NAT traversal)
2. Central message broker with auth (more setup)
3. Multicast (limited to local network)
**Reason**: Redis pub/sub is simple to set up, reliable, and works well on a local network. It's explicitly marked as dev-only; production will require a secure direct P2P mechanism.
**Impact**:
- Fast development iteration
- No security for internet deployment (known limitation)
- Forces future redesign for production (good constraint)
**Status**: Dev environment uses Redis; production path deferred.
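A dev-only broadcast along these lines might look like the following, using the `redis` Python client. The channel name and message envelope fields are assumptions, not the project's actual wire format:

```python
import json
import time

def make_envelope(agent_id: str, topic: str, payload: dict) -> str:
    """Wrap a gossip message with sender and timestamp metadata."""
    return json.dumps({"from": agent_id, "topic": topic,
                       "ts": time.time(), "payload": payload})

def parse_envelope(raw: str) -> dict:
    """Decode an envelope produced by make_envelope."""
    return json.loads(raw)

def broadcast(agent_id: str, topic: str, payload: dict,
              channel: str = "aitbc.gossip",
              url: str = "redis://localhost:6379") -> None:
    """Publish to the shared dev channel. Dev-only: no auth, no encryption."""
    import redis  # imported lazily so the envelope helpers stay stdlib-only
    client = redis.Redis.from_url(url)
    client.publish(channel, make_envelope(agent_id, topic, payload))
```

Because every agent subscribes to one well-known channel, adding a peer requires no configuration beyond the Redis URL; that zero-setup property is exactly what makes it unsuitable for the open internet.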
---
## Decision 5: Starlette Version Pinning (<0.38)
**Date**: 2026-03-15
**Context**: Starlette removed the `Broadcast` module in version 0.38, breaking the gossip backend that depends on it.
**Alternatives Considered**:
1. Migrate to a different broadcast library (effort, risk)
2. Reimplement broadcast on top of Redis only (eventual)
3. Pin Starlette version until production P2P is ready
**Reason**: Pinning is the quickest way to restore dev environment functionality with minimal changes. The broadcast module is already dev-only; replacing it can be scheduled for production hardening.
**Impact**:
- Dev environment stable again
- Must remember to bump/remove pin before production
- Prevents accidental upgrades that break things
**Status**: `pyproject.toml` pins `starlette>=0.37.2,<0.38`
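For reference, the pin as it would appear in a Poetry `pyproject.toml` (the section name assumes a standard Poetry dependency table):

```toml
[tool.poetry.dependencies]
starlette = ">=0.37.2,<0.38"
```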
---
## Decision 6: Use Poetry for Package Management
**Date**: Prior to 2026-03-15
**Context**: Need a consistent way to define dependencies, build packages, and manage virtualenvs across multiple packages in the monorepo.
**Alternatives Considered**:
1. pip + requirements.txt (flat, no build isolation)
2. Hatch (comparable scope, but offered no clear advantage over Poetry)
3. Custom Makefile (reinventing wheel)
**Reason**: Poetry provides a modern, all-in-one solution: dependency resolution, virtualenv management, building, publishing. It works well with monorepos via workspace-style handling (or multiple pyproject files).
**Impact**:
- Standardized packaging
- Faster onboarding (poetry install)
- Some learning curve; additional tool
**Status**: Adopted across packages; ongoing.
---
## Decision 7: Blockchain Node Separate from Coordinator
**Date**: Prior to 2026-03-15
**Context**: The system needs a ledger for payments and consensus, but also a marketplace for job matching. Should they be one service or two?
**Alternatives Considered**:
1. Monolithic service (simpler deployment but tighter coupling)
2. Separate services with well-defined API (more flexible, scalable)
3. On-chain marketplace (too slow, costly)
**Reason**: Separation of concerns: blockchain handles consensus and accounts; coordinator handles marketplace logic. This allows each to evolve independently and be scaled separately.
**Impact**:
- Clear service boundaries
- Requires cross-service communication (HTTP)
- More processes to run in dev
**Status**: Implemented as two separate services; currently deployed on the devnet.
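The HTTP boundary could be consumed from the coordinator side roughly like this. The endpoint paths and port are hypothetical, not the blockchain node's actual API:

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen

class BlockchainClient:
    """Minimal coordinator-side client for the blockchain node's HTTP API.

    Illustrative only: the real node may expose different routes.
    """

    def __init__(self, base_url: str):
        # Normalize so urljoin treats the base as a directory.
        self.base_url = base_url.rstrip("/") + "/"

    def endpoint(self, *parts: str) -> str:
        """Build a full URL from path segments."""
        return urljoin(self.base_url, "/".join(parts))

    def get_balance(self, account: str) -> dict:
        with urlopen(self.endpoint("accounts", account, "balance")) as resp:
            return json.load(resp)

    def submit_payment(self, tx: dict) -> dict:
        req = Request(self.endpoint("transactions"),
                      data=json.dumps(tx).encode(),
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            return json.load(resp)
```

Keeping all cross-service calls behind one small client class is what preserves the boundary this decision establishes: the coordinator's marketplace logic never touches consensus internals, only the node's public API.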
---
*Add subsequent decisions below as they arise.*