refactor: improve error handling and remove hardcoded credentials

- Changed bare except clauses to specific exception types in web3_utils.py, testing.py, messages.py, and message_storage.py
- Replaced print() calls with logger in testing.py, agent_discovery.py, compliance_agent.py, coordinator.py, trading_agent.py, keys.py, escrow.py, persistent_spending_tracker.py, sync_cli.py, and client.py
- Added logger initialization using get_logger(__name__) in compliance_agent.py, coordinator.py, trading_agent.py, keys.py, escrow.py, persistent_spending_tracker.py, and client.py
- Removed hardcoded secret
2026-05-12 17:01:57 +02:00
parent 9133609603
commit 745f791eda
279 changed files with 12284 additions and 5061 deletions
--- a/.hermes/plans/2026-05-12_104500-coordinator-decomposition.md
+++ b/.hermes/plans/2026-05-12_104500-coordinator-decomposition.md
@@ -0,0 +1,97 @@
+
+# Coordinator-API Decomposition Plan
+
+## Current State
+- **1 monolith**: apps/coordinator-api/src/app/
+  - 89 service files, 46,594 LOC
+  - 53 routers
+  - 51 files over 500 LOC
+  - Largest: agent_service.py (1,159 LOC)
+
+## Decomposition Strategy: Bounded Contexts
+
+Based on domain analysis, split into 7 microservices:
+
+1. **agent-management** (agent lifecycle, performance, communication)
+2. **blockchain** (chain operations, transactions, smart contracts)
+3. **computing** (GPU, resources, marketplace for compute)
+4. **enterprise** (integration, scalability, compliance)
+5. **identity** (authentication, authorization, agent identity)
+6. **payment** (billing, transactions, financial operations)
+7. **ai-models** (AI services, RL, multi-modal fusion)
+
+Each will be a separate FastAPI app with:
+- Its own routers/, services/, models/
+- Shared libraries: app.core.config, app.core.logging, app.core.database
+- Independent systemd service
+- Clear API boundaries
+
+## Implementation Phases
+
+### Phase 1: Infrastructure Setup (Week 1-2)
+- Create apps/ directory structure: agent-management/, blockchain/, etc.
+- Create shared core library: apps/coordinator-api/src/app/core/
+- Extract common config, logging, DB session, exceptions
+- Update pyproject.toml to support multiple packages
+
+### Phase 2: Extract Agent Management (Week 2-3)
+- Move agent_*.py, agent_service_marketplace.py -> agent-management
+- Move agent_communication.py, agent_performance_service.py -> agent-management
+- Create new systemd service for agent-management
+- Update reverse proxy (nginx) routes
+
+### Phase 3: Extract Blockchain (Week 3-4)
+- Move blockchain_context.py, contract_service.py, transaction_service.py -> blockchain
+- Move escrow.py, persistent_spending_tracker.py, etc.
+- Create blockchain systemd service
+
+### Phase 4: Extract Enterprise (Week 4-5)
+- Move enterprise_integration.py, compliance_engine.py, and certification-related files -> enterprise
+- Create enterprise systemd service
+
+### Phase 5: Extract Identity (Week 5-6)
+- Move auth/identity service files -> identity
+- Create identity systemd service
+
+### Phase 6: Extract AI Models (Week 6-7)
+- Move advanced_*.py, multi_modal_fusion, ai verification -> ai-models
+- Create ai-models systemd service
+
+### Phase 7: Extract Computing & Payment (Week 7-8)
+- Move gpu, resource, payment services to their own packages
+
+### Phase 8: Final Integration (Week 8-9)
+- Update all clients to use new service endpoints
+- Test inter-service communication
+- Update documentation
+- Deprecate old monolith
+
+## Files to Create/Modify
+
+### New shared core (apps/coordinator-api/src/app/core/)
+- config.py (extracted from existing config.py)
+- logging.py (centralized logger setup)
+- database.py (SQLAlchemy session, Base)
+- exceptions.py (common exceptions)
+- security.py (auth dependencies)
+
+### New service apps (7 directories total)
+Each: apps/<service>/src/app/{routers,services,models,main.py}
+
+### Modified files
+- Root pyproject.toml: add service packages
+- Systemd: add 7 new .service files
+- Nginx config: new upstream blocks
+- Docker compose: add 7 new containers
+- Monitoring: new service endpoints for health
+
+## Rollback Plan
+- Keep original monolith running alongside new services during transition
+- Use feature flags to route traffic
+- Comprehensive integration tests before cutover
+
+## Success Criteria
+- Each service < 3,000 LOC (target 1,500)
+- Each service independently deployable
+- API contracts stable and documented
+- CI/CD per service
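
The LOC targets in the success criteria can be checked mechanically. The following is a minimal sketch, assuming "LOC" means non-blank lines in `.py` files and that each service lives under its own directory. `count_loc`, `service_loc`, and `check_budget` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path


def count_loc(path: Path) -> int:
    """Count non-blank lines in a single file (assumed definition of LOC)."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def service_loc(service_root: Path) -> dict[str, int]:
    """Map each .py file under service_root (relative path) to its line count."""
    return {
        str(py.relative_to(service_root)): count_loc(py)
        for py in sorted(service_root.rglob("*.py"))
    }


def check_budget(service_root: Path, budget: int = 3000) -> tuple[int, bool]:
    """Return (total LOC, whether the total is strictly under the budget)."""
    total = sum(service_loc(service_root).values())
    return total, total < budget
```

A call such as `check_budget(Path("apps/agent-management/src/app"))` would then report whether that service meets the 3,000 LOC ceiling; passing `budget=1500` checks the stricter target.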
--- a/.hermes/plans/2026-05-12_142930-agent-management-extraction.md
+++ b/.hermes/plans/2026-05-12_142930-agent-management-extraction.md
@@ -0,0 +1,239 @@
+# Agent-Management Service Extraction Plan
+
+## Overview
+
+Extract the agent-related functionality from the coordinator-api monolith into a standalone microservice while maintaining operational continuity.
+
+## Current State
+
+**Monolith:** `apps/coordinator-api/src/app/`
+- Services: 46,594 LOC across 89 files
+- Domain layer: `domain/` contains all business entities (Agent, AgentExecution, AgentStatus, etc.)
+- Target agent files to extract: **18 files** (6 routers, 12 services)
+- Largest files: agent_service.py (1,159 LOC), agent_integration.py (1,117 LOC), agent_communication.py (988 LOC)
+
+## Bounded Context: Agent-Management
+
+**Responsibility:** AI agent lifecycle, orchestration, performance tracking, security, and marketplace registry.
+
+**In-Scope Files:**
+
+### Services (12)
+```
+services/agent_service.py (1,159 LOC)
+services/agent_integration.py (1,117 LOC)
+services/agent_communication.py (988 LOC)
+services/agent_orchestrator.py
+services/agent_performance_service.py
+services/agent_security.py
+services/agent_portfolio_manager.py
+services/agent_service_marketplace.py
+services/advanced_rl/agents.py (+ sub-agents: ppo_agent.py, rainbow_dqn_agent.py, sac_agent.py)
+```
+
+### Routers (6)
+```
+routers/agent_router.py
+routers/agent_integration_router.py
+routers/agent_performance.py
+routers/agent_creativity.py
+routers/agent_security_router.py
+routers/services.py (agent services listing endpoint)
+```
+
+## Critical Dependencies
+
+1. **Domain Layer** (`app.domain`)
+   - All agent services import from `..domain.agent` (AgentExecution, AgentStatus, AIAgentWorkflow, etc.)
+   - Solution: Keep domain/ in monolith for now; new service imports via a **shared-domain package** to be created
+   - Create `apps/shared-domain/src/app/domain/` as a symlink or copy that both services can import
+   - Long-term: Extract entire domain layer to shared-domain package
+
+2. **aitbc package**
+   - Already available as root package. Use directly.
+
+3. **SQLModel/SQLAlchemy**
+   - Already in dependencies via root pyproject.toml
+
+4. **Other monolith services**
+   - Some routers may call agent endpoints. These will need to be updated to call the new service through an HTTP client (see Step 3: internal routing via nginx)
+
+## Implementation Steps
+
+### Step 0: Prepare Shared Domain Package (Prerequisite)
+- Create `apps/shared-domain/src/app/domain/`
+- Copy all files from coordinator-api's `domain/`, optionally omitting the non-agent ones
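+- Or simpler: symlink the entire domain directory: `ln -s ../../../coordinator-api/src/app/domain apps/shared-domain/src/app/` (a relative target is resolved from the link's own directory, `apps/shared-domain/src/app/`)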
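+- Update imports in new service to use `from shared_domain.app.domain.agent import ...` (Python module names cannot contain hyphens, so the importable package must be `shared_domain` even though the directory is `shared-domain`)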
+- Add `shared-domain` to pyproject.toml dependencies in consuming services
+
+**Recommendation:** Use symlink for rapid iteration, then formalize package later.
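
The symlink approach is easy to get wrong because a relative link target is resolved from the directory that contains the link, not from the directory where the command is run. The following is a minimal sketch that computes the target instead of counting `../` segments by hand. `link_shared_domain` is a hypothetical helper, and the directory layout is taken from the paths named in this plan.

```python
import os
from pathlib import Path


def link_shared_domain(repo_root: Path) -> Path:
    """Create apps/shared-domain/src/app/domain as a relative symlink to the monolith's domain/."""
    target = repo_root / "apps" / "coordinator-api" / "src" / "app" / "domain"
    link = repo_root / "apps" / "shared-domain" / "src" / "app" / "domain"
    link.parent.mkdir(parents=True, exist_ok=True)
    # A relative symlink is resolved from the directory that contains the link,
    # so compute the path from link.parent rather than from the repo root.
    relative_target = os.path.relpath(target, start=link.parent)
    if link.is_symlink():
        link.unlink()  # make the helper safe to re-run
    link.symlink_to(relative_target, target_is_directory=True)
    return link
```

Keeping the link relative means the checkout can be moved (for example to `/opt/aitbc`) without breaking it.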
+
+### Step 1: Create agent-management Service Skeleton
+```
+apps/agent-management/
+├── pyproject.toml
+├── README.md
+└── src/
+    └── app/
+        ├── __init__.py
+        ├── main.py
+        ├── core/
+        │   ├── __init__.py
+        │   ├── config.py (import from shared-core)
+        │   ├── logging.py (import from shared-core)
+        │   └── database.py (import from shared-core)
+        ├── domain/ → symlink to ../../../shared-domain/src/app/domain
+        ├── routers/
+        │   ├── __init__.py
+        │   ├── agent_router.py (copied & adapted)
+        │   ├── agent_integration_router.py
+        │   ├── agent_performance.py
+        │   ├── agent_creativity.py
+        │   ├── agent_security_router.py
+        │   └── services.py
+        └── services/
+            ├── __init__.py
+            ├── agent_service.py
+            ├── agent_orchestrator.py
+            ├── agent_communication.py
+            ├── agent_performance_service.py
+            ├── agent_security.py
+            ├── agent_integration.py
+            ├── agent_portfolio_manager.py
+            ├── agent_service_marketplace.py
+            └── advanced_rl/
+                ├── __init__.py
+                ├── agents.py
+                └── ppo_agent.py, rainbow_dqn_agent.py, sac_agent.py
+```
+
+### Step 2: Adapt Code for Service Boundaries
+
+**Changes needed per file:**
+
+- Update all `from ..domain.agent import X` to `from shared_domain.app.domain.agent import X` (the package name must use an underscore; `shared-domain` is not a valid Python module name)
+- Remove any imports from other monolith services (e.g., `from ..services.other_service import X`)
+- Replace internal service calls with HTTP client calls or event bus (defer to later phase)
+- Update `ServiceSettings` to use agent-management specific defaults (port 8012)
+- Add health check endpoint (already in template)
+- Verify database setup: AgentExecution and the other agent models use the shared Base, so `Base.metadata.create_all(bind=engine)` must be called on startup
+
+**Special Case: advanced_rl/**
+- These are AI model inference services. Consider moving to `ai-models` service instead.
+- For now, keep in agent-management to maintain functionality.
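
The import rewrite in Step 2 is mechanical enough to script. The following is a minimal sketch, assuming the shared package is importable as `shared_domain.app.domain` (an underscore is required, since hyphens are not valid in Python module names). `rewrite_domain_imports` is a hypothetical helper and only handles the single-line `from ..domain... import` form.

```python
import re

# Matches "from ..domain import ..." and "from ..domain.<module> import ..."
# at the start of a line, preserving any indentation.
_DOMAIN_IMPORT = re.compile(r"^([ \t]*)from \.\.domain(\.[\w.]+)? import ", re.MULTILINE)


def rewrite_domain_imports(source: str, package: str = "shared_domain.app.domain") -> str:
    """Rewrite relative domain imports to absolute imports from `package`."""
    return _DOMAIN_IMPORT.sub(
        lambda m: f"{m.group(1)}from {package}{m.group(2) or ''} import ",
        source,
    )
```

Imports from other monolith services (for example `from ..services.other_service import X`) are deliberately left untouched so they surface as import errors and can be reviewed by hand.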
+
+### Step 3: Update Monolith to Proxy Requests (During Transition)
+
+**Option A: Nginx Routing**
+- Add nginx upstream for agent-management on port 8012
+- Change coordinator-api routes for `/api/v1/agent/*` to proxy to agent-management
+- Monolith no longer handles agent endpoints
+
+**Option B: In-app Redirection**
+- Keep routers in monolith but replace handlers with `HTTPClient` calls to new service
+- More gradual migration but adds latency
+
+**Recommendation:** Option A - cleaner separation, easier to rollback.
+
+### Step 4: Create Systemd Service
+
+```
+# /etc/systemd/system/aitbc-agent-management.service
+[Unit]
+Description=AITBC Agent Management Service
+After=network.target
+
+[Service]
+Type=simple
+User=aitbc
+WorkingDirectory=/opt/aitbc/apps/agent-management
+Environment=PATH=/opt/aitbc/venv/bin
+Environment=PYTHONPATH=/opt/aitbc
+ExecStart=/opt/aitbc/venv/bin/uvicorn app.main:app --app-dir src --host 127.0.0.1 --port 8012
+Restart=on-failure
+RestartSec=10
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### Step 5: Database Migration
+
+- Agent domain models likely already have tables defined via SQLModel
+- In `main.py` startup event, call `Base.metadata.create_all(bind=engine)` to ensure tables exist
+- Ensure the new service uses same database as monolith (coordinator.db) initially
+- Later: separate database (Phase 8)
+
+### Step 6: Integration Testing
+
+1. Start agent-management service
+2. Verify health endpoint: `curl http://localhost:8012/health`
+3. Test agent creation via API
+4. Verify coordinator-api can still access agent data (through new service or direct DB if keeping shared DB)
+5. Run existing integration tests against new service
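
The health check in step 2 can also be scripted for use in integration tests or deployment scripts. The following is a minimal sketch using only the standard library. `is_healthy` is a hypothetical helper, and it assumes the service answers `GET /health` with HTTP 200 when it is up.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8012", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as unhealthy.
        return False
```

Returning a boolean instead of raising keeps the check easy to poll in a retry loop while the service starts.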
+
+### Step 7: Update Coordinator-API
+
+- Remove the 18 extracted files from monolith
+- Remove domain/agent related imports from remaining monolith services if they now use agent-management API
+- Update any remaining references to agent endpoints to use HTTP client or nginx proxy
+
+### Step 8: Documentation & Monitoring
+
+- Update README with agent-management API docs
+- Add metrics endpoint if enabled
+- Update deployment scripts
+
+## Rollback Plan
+
+1. Keep monolith files in git history (do not delete, just move)
+2. Keep both nginx routing configurations available so the upstream routing can be reverted
+3. Database shared initially, so data is accessible to both
+4. Systemd service can be disabled; monolith still runs
+
+## Success Criteria
+
+- [ ] Agent-management service starts and health check passes on port 8012
+- [ ] Can create/query agents via API
+- [ ] Existing coordinator-api functionality that depends on agents still works
+- [ ] No errors in logs during integration test
+- [ ] Systemd service auto-restarts on failure
+
+## Open Questions
+
+1. **RL Agents**: Should advanced_rl be part of agent-management or ai-models?
+   - Recommendation: Keep in agent-management for now (AI agent inference is part of agent runtime). Can split later if ai-models becomes a separate inference service.
+
+2. **Database**: Separate or shared?
+   - Phase 1: Shared (same coordinator.db) for simplicity
+   - Phase 8: Split to dedicated agent-management database
+
+3. **Cross-service calls**: Currently agent integration uses other services directly (imports). Need to replace with HTTP or event bus.
+   - Defer until Phase 8 (Final Integration) to avoid breaking existing flow
+
+4. **Domain extraction**: The domain models are currently in monolith. Should we extract entire domain to a package?
+   - Immediate need: Create shared-domain package (symlink) to break import cycle
+   - Future: Extract domain to true package with independent version
+
+## Timeline Estimate
+
+- Step 0 (shared-domain): 2h
+- Step 1 (skeleton): 4h
+- Step 2 (adaptation): 8h (bulk of work - fixing imports, resolving dependencies)
+- Step 3 (nginx routing): 2h
+- Step 4 (systemd): 1h
+- Step 5 (DB): 1h
+- Step 6 (testing): 4h
+- Step 7 (monolith cleanup): 4h
+- Step 8 (docs): 2h
+
+**Total: ~28 hours (3-4 days)**
+
+## Risks
+
+- Hidden dependencies on other monolith services may cause runtime import errors
+- Domain models may have cross-references that require co-migration
+- Database migrations may be needed if agent tables don't exist yet
+- Existing integration tests may fail and need updating
+- Breaking changes if API contracts differ from original
--- a/.hermes/plans/2026-05-12_150000-tighten-mypy-config.md
+++ b/.hermes/plans/2026-05-12_150000-tighten-mypy-config.md
@@ -0,0 +1,218 @@
+
+# Tighten Mypy Configuration Plan
+
+## Current State
+
+**Root pyproject.toml [tool.mypy] settings:**
+```toml
+warn_return_any = true
+warn_unused_configs = true
+check_untyped_defs = false
+disallow_incomplete_defs = false
+disallow_untyped_defs = false
+disallow_untyped_decorators = false
+no_implicit_optional = false
+warn_redundant_casts = false
+warn_unused_ignores = false
+warn_no_return = true
+warn_unreachable = false
+strict_equality = false
+```
+
+**Overrides:**
+- Heavy libraries (torch, cv2, pandas, numpy, web3, etc.) are `ignore_missing_imports = true`
+- Coordinator-api modules are `ignore_errors = true` (catch-all)
+
+This is **extremely permissive**: it warns only about functions returning `Any`, unused config sections, and missing return statements. It does not enforce:
+- Function argument/return type completeness
+- Avoiding implicit `Any`
+- Avoiding unnecessary type: ignore comments
+- Detecting unreachable code
+- Strict equality checks (comparisons between non-overlapping types, e.g. `int` vs `str`)
+
+## Proposed Tightening Phases
+
+### Phase 1: Enable Foundational Checks (Low Effort, High Value)
+Target: enable 4 key options that catch real bugs with minimal friction
+
+```toml
+disallow_untyped_defs = true
+disallow_incomplete_defs = true
+warn_redundant_casts = true
+warn_unused_ignores = true
+```
+
+**Impact:**
+- Functions must have complete type signatures (all args+returns typed)
+- Redundant cast() calls will be flagged
+- Unused `# type: ignore` comments will be flagged
+- Minimal code changes required (most functions already typed)
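
To make the Phase 1 impact concrete, the following is a minimal sketch of code that each flag would reject or accept. The function names are hypothetical and the comments describe the behaviour documented for these mypy options; everything runs unchanged at runtime, since the flags only affect type checking.

```python
from typing import cast


# Rejected under disallow_untyped_defs: the function has no annotations at all.
def scale_untyped(values, factor):
    return [v * factor for v in values]


# Rejected under disallow_incomplete_defs (and disallow_untyped_defs):
# only part of the signature is annotated.
def scale_partial(values: list[float], factor):
    return [v * factor for v in values]


# Accepted: every argument and the return value are annotated.
def scale(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]


def first_name(names: list[str]) -> str:
    # Flagged by warn_redundant_casts: names[0] is already inferred as str.
    return cast(str, names[0])
```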
+
+**Estimated effort:**
+- 1 hour to update config
+- 2-4 hours to fix violations in production code
+- Total: ~1 day
+
+**Validation:**
+- Run `mypy apps` and ensure 0 errors
+- Keep existing overrides for external libraries and coordinator-api
+
+### Phase 2: Stricter Optional Handling (Medium Effort)
+Enable:
+```toml
+no_implicit_optional = true
+warn_unreachable = true
+strict_equality = true
+```
+
+**Impact:**
+- Function arguments defaulting to `None` must be annotated explicitly as `Optional[...]` (or `X | None`)
+- Unreachable code will be flagged (dead code detection)
+- Equality, identity, and container checks between non-overlapping types (e.g. `int == str`) will be flagged
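
The Phase 2 flags are easiest to understand from small cases. The following is a minimal sketch with hypothetical function names; the comments describe what mypy is expected to report once each flag is enabled, while the code itself still runs as ordinary Python.

```python
from typing import Optional


# Rejected under no_implicit_optional: the default is None but the annotation is plain str.
def greet_implicit(name: str = None) -> str:
    return f"hello {name or 'anonymous'}"


# Accepted: the None default is reflected in the annotation.
def greet(name: Optional[str] = None) -> str:
    return f"hello {name or 'anonymous'}"


def is_admin(user_id: int) -> bool:
    # Flagged by strict_equality: int and str never overlap, so this is always False.
    return user_id == "admin"


def describe(count: int) -> str:
    if isinstance(count, int):
        return "integer"
    # Flagged by warn_unreachable: count is always an int, so this line cannot run.
    return "unknown"
```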
+
+**Estimated effort:** 2-3 days to fix violations across codebase
+
+### Phase 3: Gradual Per-Module Strictness (Long-term)
+- Move coordinator-api out of catch-all `ignore_errors`
+- Add per-module overrides as we achieve correctness
+- Eventually remove `ignore_errors` blanket
+
+**Estimated effort:** Ongoing as part of decomposition
+
+## Implementation Steps
+
+### Step 1: Backup Current Config
+```bash
+cp pyproject.toml pyproject.toml.backup
+```
+
+### Step 2: Update Root Configuration
+
+Modify `/opt/aitbc/pyproject.toml` [tool.mypy] section:
+
+```diff
+ [tool.mypy]
+ python_version = "3.13"
+ warn_return_any = true
+ warn_unused_configs = true
+ check_untyped_defs = false
+-disallow_incomplete_defs = false
+-disallow_untyped_defs = false
++disallow_incomplete_defs = true
++disallow_untyped_defs = true
+ disallow_untyped_decorators = false
+ no_implicit_optional = false
+ warn_redundant_casts = false
+ warn_unused_ignores = false
+ warn_no_return = true
+ warn_unreachable = false
+ strict_equality = false
+```
+
+### Step 3: Run Mypy and Collect Errors
+
+```bash
+cd /opt/aitbc
+venv/bin/mypy apps --show-error-codes --no-color-output > mypy_errors.txt 2>&1
+```
+
+### Step 4: Categorize Errors
+
+Typical violations we'll see:
+- `Function is missing a return type annotation` (from disallow_untyped_defs)
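+- `Function is missing a type annotation for one or more arguments` (from disallow_untyped_defs and disallow_incomplete_defs)
+- `Missing type parameters for generic type` on generic classes (rare; reported only when `disallow_any_generics` is enabled)
+- `dict`, `list`, etc. used without type parameters (also from `disallow_any_generics`, not `disallow_incomplete_defs`; Phase 1 does not enable it)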
+- `Redundant cast to X` (from warn_redundant_casts)
+- `Unused "type: ignore" comment` (from warn_unused_ignores)
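
Categorizing the collected errors can be automated by grouping on the bracketed error code that `--show-error-codes` appends to each line. The following is a minimal sketch. `count_by_code` is a hypothetical helper, and the regular expression assumes mypy's usual `path:line: error: message  [code]` layout.

```python
import re
from collections import Counter

# Matches lines such as:
#   apps/x/main.py:12: error: Function is missing a return type annotation  [no-untyped-def]
_ERROR_LINE = re.compile(
    r"^(?P<path>[^:]+):\d+(?::\d+)?: error: .*\[(?P<code>[a-z0-9-]+)\]\s*$"
)


def count_by_code(report: str) -> Counter[str]:
    """Count mypy errors per error code in the text of a report."""
    counts: Counter[str] = Counter()
    for line in report.splitlines():
        match = _ERROR_LINE.match(line)
        if match:  # skips notes and the final "Found N errors" summary
            counts[match.group("code")] += 1
    return counts
```

Running `count_by_code(Path("mypy_errors.txt").read_text()).most_common()` (with `from pathlib import Path`) then lists the codes from most to least frequent, which gives a natural order for Step 5.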
+
+### Step 5: Fix in Order of Impact
+
+**A. Add missing type annotations to functions**
+- Priority: functions in shared-core, services, routers
+- Use explicit return types; if truly dynamic, use `-> Any` (but rarely needed)
+- Example:
+  ```python
+  def get_engine(settings): ...  # BEFORE: untyped signature
+  def get_engine(settings: ServiceSettings) -> Engine: ...  # AFTER: fully annotated
+  ```
+
+**B. Add generic type parameters**
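+- `list` -> `list[str]` (or `typing.List[str]` on older Python versions)
+- `dict` -> `dict[str, Any]`
+- With `python_version = "3.13"` the built-in generics need no import; only `Any` requires `from typing import Any`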
+
+**C. Remove redundant casts**
+- Delete `cast(Type, value)` if type is already clear to mypy
+- Use `reveal_type(value)` to check actual inferred type before removing
+
+**D. Remove unused type: ignore**
+- Some `# type: ignore` comments are legacy and no longer needed
+- Delete them; if mypy then reports an error, either restore the comment with a specific error code or fix the underlying issue
+
+### Step 6: Iterate and Validate
+
+After fixing categories, re-run mypy. Continue until `mypy apps` exits with code 0.
+
+**Note:** We preserve `ignore_missing_imports` for heavy libraries, and `ignore_errors` for coordinator-api (since we're deferring decomposition).
+
+### Step 7: Add CI Enforcement
+
+Update pre-commit hooks or CI to run mypy on PRs:
+```yaml
+# .pre-commit-config.yaml or GitHub Actions
+- repo: local
+  hooks:
+    - id: mypy
+      name: mypy
+      entry: mypy apps
+      language: system
+      pass_filenames: false
+```
+
+## Rollback Plan
+
+If the effort becomes too large:
+1. Revert pyproject.toml from backup
+2. Keep per-module `# mypy: ignore-errors` as needed
+3. Approach incrementally: enable one flag at a time
+
+## Success Criteria
+
+- `mypy apps` completes with 0 errors
+- No new type: ignore comments added without explanation
+- Production code has complete type signatures
+- CI pipeline includes mypy check
+
+## Risks & Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| Overwhelming number of errors | Enable flags incrementally (2 at a time), fix in batches by module |
+| Breaking existing functionality by incorrect type fixes | Run test suite after each batch; use `reveal_type` to debug |
+| Third-party library types incompatible | Keep `ignore_missing_imports` for those packages |
+| Coordinator-api too messy to fix now | Keep `ignore_errors` override; revisit after decomposition |
+
+## Related Tasks
+
+- **Decompose coordinator-api** - Once strict mypy is in place, easier to validate new services
+- **Shared-core library** - Strict typing ensures compatibility across services
+- **Connection pooling** - Use proper typed database sessions
+
+## Open Questions
+
+1. Should we also enable `strict` mode for new services? (Probably yes)
+2. Should we add type-checking to pre-commit hook for changed files only? (Yes, pass the changed files as positional arguments: `mypy <changed files>`; mypy has no `--files` command-line flag)
+3. How to handle legacy coordinator-api code? (Keep ignore_errors for now)
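
For the changed-files-only check raised in question 2, mypy takes the files to check as positional arguments. The following is a minimal sketch of a wrapper that could back a local hook. The function names, the `origin/main` base ref, and the `apps/` root filter are assumptions for illustration.

```python
import subprocess


def changed_python_files(diff_output: str, roots: tuple[str, ...] = ("apps/",)) -> list[str]:
    """Filter `git diff --name-only` output down to .py files under the given roots."""
    return [
        path
        for path in diff_output.splitlines()
        if path.endswith(".py") and path.startswith(roots)
    ]


def mypy_command(files: list[str]) -> list[str]:
    """Build the mypy invocation; the files are passed as positional arguments."""
    return ["mypy", "--show-error-codes", *files]


def run_on_changed(base: str = "origin/main") -> int:
    """Type-check files changed relative to `base`; returns mypy's exit code."""
    diff = subprocess.run(
        # --diff-filter=d excludes deleted files, which mypy could not open.
        ["git", "diff", "--name-only", "--diff-filter=d", base],
        capture_output=True, text=True, check=True,
    ).stdout
    files = changed_python_files(diff)
    return subprocess.run(mypy_command(files)).returncode if files else 0
```

Checking only changed files is faster but can miss errors that a change introduces in unchanged callers, so a full `mypy apps` run in CI remains useful.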
+
+## Estimated Timeline
+
+- **0-2 days:** Implement Phase 1, fix immediate violations
+- **3-7 days:** Address accumulated type errors, reach clean mypy
+- **Week 2:** Add CI enforcement, document guidelines
+- **Ongoing:** Maintain strict typing in new code
+
+## References
+
+- Mypy configuration: https://mypy.readthedocs.io/en/stable/config_file.html
+- Strict mode: https://mypy.readthedocs.io/en/stable/command_line.html#cmdoption-mypy-strict