aitbc/.windsurf/skills/aitbc-training.md
aitbc b1edfe446a Add Redis URL configuration documentation and cross-node setup lessons learned
- Document explicit REDIS_URL environment variable usage in lifespan.py
- Add critical shared Redis configuration section for multi-node setups
- Document environment file configuration in /etc/aitbc/.env
- Add verification steps for cross-node agent discovery
- Add lessons learned section covering Redis configuration issues, integration test patterns, service deployment, and cross-node setup
- Update service initialization documentation to show
2026-05-07 19:18:29 +02:00

AITBC Training - Agent Coordinator Architecture

Important: Real Coordinator Location

The actual AITBC Agent Coordinator service is located at:

  • Path: /opt/aitbc/apps/agent-coordinator/src/app/
  • Port: 9001
  • Service: aitbc-agent-coordinator.service

DO NOT Use

  • Wrong location: /opt/aitbc/apps/agent-services/agent-coordinator/src/coordinator.py
  • This is a different/older implementation and is NOT the active service

Key Components

Core Files

  • agent_discovery.py - Redis-backed agent registry with persistence
  • load_balancer.py - Load balancer with multiple strategies (least_connections, round_robin, etc.)
  • routers/agents.py - Agent management REST API endpoints
  • routers/tasks.py - Task submission and distribution API endpoints
  • lifespan.py - Service initialization and component startup
  • state.py - Global state management for coordinator components

Service Initialization

The service initializes in lifespan.py:

  1. Creates AgentRegistry(redis_url=os.getenv("REDIS_URL", "redis://localhost:6379/1")) with Redis backing
  2. Starts the registry's Redis connection
  3. Creates LoadBalancer(registry) with least_connections strategy
  4. Creates TaskDistributor(balancer) with priority queues
  5. Starts background task distribution loop
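
The five steps above can be sketched as an async lifespan context manager. This is a minimal illustration, not the code in lifespan.py: the AgentRegistry, LoadBalancer and TaskDistributor classes below are hypothetical stand-ins, and their method names (start, stop, run) are assumptions. Only the REDIS_URL lookup and the order of the steps come from this document.

```python
import asyncio
import os
from contextlib import asynccontextmanager
from types import SimpleNamespace


# Hypothetical stand-ins. The real classes live in agent_discovery.py and
# load_balancer.py, and their actual signatures may differ.
class AgentRegistry:
    def __init__(self, redis_url: str):
        self.redis_url = redis_url
        self.started = False

    async def start(self) -> None:
        self.started = True  # the real registry would open its Redis connection here

    async def stop(self) -> None:
        self.started = False


class LoadBalancer:
    def __init__(self, registry: AgentRegistry, strategy: str = "least_connections"):
        self.registry = registry
        self.strategy = strategy


class TaskDistributor:
    def __init__(self, balancer: LoadBalancer):
        self.balancer = balancer

    async def run(self) -> None:
        while True:  # placeholder for the background distribution loop
            await asyncio.sleep(1)


# Stand-in for the global state kept in state.py
state = SimpleNamespace(agent_registry=None, load_balancer=None, task_distributor=None)


@asynccontextmanager
async def lifespan(app=None):
    # Step 1: registry configured from REDIS_URL, with the documented fallback
    redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/1")
    state.agent_registry = AgentRegistry(redis_url=redis_url)
    # Step 2: open the registry's connection
    await state.agent_registry.start()
    # Steps 3 and 4: balancer on top of the registry, distributor on top of the balancer
    state.load_balancer = LoadBalancer(state.agent_registry)
    state.task_distributor = TaskDistributor(state.load_balancer)
    # Step 5: run the distribution loop in the background
    loop_task = asyncio.create_task(state.task_distributor.run())
    try:
        yield
    finally:
        # Shut down in reverse order: stop the loop, then close the registry
        loop_task.cancel()
        try:
            await loop_task
        except asyncio.CancelledError:
            pass
        await state.agent_registry.stop()
```

Reading REDIS_URL inside the lifespan function, rather than at import time, means the value loaded by systemd from the environment file is the one that reaches the registry.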

Agent Registration

API Endpoint

POST /agents/register

Example

curl -X POST http://localhost:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "hermes-agent",
    "agent_type": "worker",
    "capabilities": ["data-processing", "analysis", "general"],
    "services": ["task-execution", "analysis"],
    "endpoints": {"http": "http://localhost:9002"},
    "metadata": {"version": "1.0.0"}
  }'

Response

{
  "status": "success",
  "message": "Agent hermes-agent registered successfully",
  "agent_id": "hermes-agent",
  "registered_at": "2026-05-07T16:26:55.464178+00:00"
}
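
For agents written in Python, the same registration call might look like the sketch below. It uses only the standard library. The helper names build_registration and register_agent are hypothetical, and the default version string is simply copied from the curl example. The URL, path and payload fields are the ones shown above.

```python
import json
import urllib.request

COORDINATOR_URL = "http://localhost:9001"  # coordinator port from this document


def build_registration(agent_id, capabilities, endpoint, agent_type="worker",
                       services=(), version="1.0.0"):
    """Build the JSON body used in the curl example (hypothetical helper)."""
    return {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "capabilities": list(capabilities),
        "services": list(services),
        "endpoints": {"http": endpoint},
        "metadata": {"version": version},
    }


def register_agent(payload, base_url=COORDINATOR_URL, timeout=5.0):
    """POST the payload to /agents/register and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/agents/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running coordinator, register_agent(build_registration("hermes-agent", ["data-processing", "analysis", "general"], "http://localhost:9002", services=["task-execution", "analysis"])) should return a reply shaped like the response above.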

Task Distribution

API Endpoint

POST /tasks/submit

Example

curl -X POST http://localhost:9001/tasks/submit \
  -H "Content-Type: application/json" \
  -d '{
    "task_data": {
      "model": "llama2",
      "prompt": "test prompt"
    },
    "priority": "normal",
    "requirements": {}
  }'

Distribution Flow

  1. Task submitted to TaskDistributor
  2. Distributor finds eligible agents via AgentRegistry.discover_agents()
  3. Load balancer selects agent using configured strategy (default: least_connections)
  4. Task assigned to selected agent
  5. Agent metrics updated in Redis
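
To make steps 2 to 5 concrete, here is a small in-memory sketch of least_connections selection. The Agent dataclass, its field names and the capability-matching rule are assumptions made for illustration. The real logic lives in agent_discovery.py and load_balancer.py and keeps its metrics in Redis.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    # Hypothetical shape; the real registry record may carry different fields
    agent_id: str
    capabilities: set = field(default_factory=set)
    status: str = "active"
    active_connections: int = 0


def discover_agents(agents, required_capabilities=frozenset()):
    """Step 2: keep active agents that offer every required capability."""
    required = set(required_capabilities)
    return [a for a in agents if a.status == "active" and required <= a.capabilities]


def select_least_connections(candidates) -> Optional[Agent]:
    """Step 3: pick the candidate with the fewest in-flight tasks."""
    if not candidates:
        return None
    # min() returns the first of several equally loaded agents
    return min(candidates, key=lambda a: a.active_connections)


def assign(agents, required_capabilities=frozenset()) -> Optional[Agent]:
    """Steps 2-5 in one call; returns the chosen agent, or None if nobody is eligible."""
    chosen = select_least_connections(discover_agents(agents, required_capabilities))
    if chosen is not None:
        # Step 5: the real service would persist the updated metric in Redis
        chosen.active_connections += 1
    return chosen
```

With an empty requirement set, as in the "requirements": {} example above, every active agent is eligible and the least-loaded one is chosen.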

Task Status

API Endpoint

GET /tasks/status

Example

curl http://localhost:9001/tasks/status

Response

{
  "status": "success",
  "stats": {
    "tasks_distributed": 1,
    "tasks_completed": 1,
    "tasks_failed": 0,
    "load_balancer_stats": {
      "strategy": "least_connections",
      "active_agents": 1,
      "total_assignments": 1,
      "avg_agent_load": 1
    }
  }
}

Agent Discovery

API Endpoint

POST /agents/discover

Example

curl -X POST http://localhost:9001/agents/discover \
  -H "Content-Type: application/json" \
  -d '{
    "status": "active",
    "agent_type": "worker"
  }'

Redis Persistence

The agent registry uses Redis for persistence:

  • Agent data stored as hashes: agent:{agent_id}
  • Active agents indexed in set: agents:active
  • Load metrics tracked per agent
  • Health scores calculated from heartbeats
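
The key layout above can be illustrated as follows. FakeRedis is an in-memory stand-in that mimics only the four redis-py calls used here (hset, sadd, smembers, hgetall), so the sketch runs without a server. JSON-encoding the nested fields is an assumption about how list and dict values are flattened into hash fields, not a statement about what agent_discovery.py does.

```python
import json

ACTIVE_SET = "agents:active"  # set of active agent IDs, as documented above


def agent_key(agent_id: str) -> str:
    """Key of the per-agent hash: agent:{agent_id}."""
    return f"agent:{agent_id}"


def to_hash_fields(agent: dict) -> dict:
    """Redis hash values are flat strings, so non-string values are JSON-encoded (assumption)."""
    return {k: v if isinstance(v, str) else json.dumps(v) for k, v in agent.items()}


class FakeRedis:
    """In-memory stand-in exposing only the calls used in this sketch."""

    def __init__(self):
        self.hashes, self.sets = {}, {}

    def hset(self, name, mapping=None):
        self.hashes.setdefault(name, {}).update(mapping or {})

    def sadd(self, name, *values):
        self.sets.setdefault(name, set()).update(values)

    def smembers(self, name):
        return set(self.sets.get(name, set()))

    def hgetall(self, name):
        return dict(self.hashes.get(name, {}))


def save_agent(client, agent: dict) -> None:
    """Write the agent hash and add its ID to the active index."""
    client.hset(agent_key(agent["agent_id"]), mapping=to_hash_fields(agent))
    client.sadd(ACTIVE_SET, agent["agent_id"])
```

Because every coordinator reads and writes these same keys, two coordinators see each other's agents only when they connect to the same Redis host and database number.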

Service Status

Health Check

curl http://localhost:9001/health

Service Management

systemctl status aitbc-agent-coordinator.service
systemctl restart aitbc-agent-coordinator.service
journalctl -u aitbc-agent-coordinator.service -f

Cross-Node Distribution

Critical: Shared Redis Configuration

For cross-node task distribution to work, ALL coordinator instances MUST use the same shared Redis instance:

  1. Environment Configuration: Set REDIS_URL in /etc/aitbc/.env:

    REDIS_URL=redis://10.1.223.93:6379/0
    
  2. Service Configuration: The systemd unit loads environment variables from that file:

    EnvironmentFile=/etc/aitbc/.env
    
  3. Application Configuration: The coordinator MUST read the environment variable in lifespan.py:

    redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/1")
    state.agent_registry = AgentRegistry(redis_url=redis_url)
    
  4. Verification: Check if agents registered on one node appear on another:

    # Register on aitbc1
    curl -X POST http://aitbc1:9001/agents/register \
      -H "Content-Type: application/json" \
      -d '{...}'
    
    # Discover on localhost
    curl -X POST http://localhost:9001/agents/discover \
      -H "Content-Type: application/json" \
      -d '{}'
    
    # Should show the aitbc1 agent
    

Example Cross-Node Setup

# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"aitbc1-agent", ...}'

# Submit task on localhost
curl -X POST http://localhost:9001/tasks/submit \
  -H "Content-Type: application/json" \
  -d '{"task_data":{...}}'

# Task will be distributed to any active agent across nodes
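
Before testing discovery across nodes, it can help to confirm that the REDIS_URL values on the two nodes really name the same Redis target. The sketch below compares host, port and database number using only the standard library. The function names are hypothetical, and the comparison is purely textual, so it cannot tell that localhost on the Redis host itself is the same server as its LAN address.

```python
from urllib.parse import urlparse


def redis_target(url: str):
    """Return (host, port, db) for a redis:// URL, applying the usual Redis defaults."""
    parsed = urlparse(url)
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError(f"not a Redis URL: {url!r}")
    host = parsed.hostname or "localhost"
    port = parsed.port or 6379
    db = int(parsed.path.lstrip("/") or 0)  # no path means database 0
    return host, port, db


def shares_registry(url_a: str, url_b: str) -> bool:
    """True when both URLs name the same host, port and database number."""
    return redis_target(url_a) == redis_target(url_b)


def is_node_local(url: str) -> bool:
    """True when the URL points at the node's own loopback address."""
    return redis_target(url)[0] in {"localhost", "127.0.0.1", "::1"}
```

For example, shares_registry("redis://10.1.223.93:6379/0", "redis://localhost:6379/1") is False: the host differs and so does the database number, and either difference alone would keep the two registries apart.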

Lessons Learned

Redis Configuration Issues

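  • Problem: Coordinators defaulted to redis://localhost:6379/1 instead of reading the REDIS_URL environment variable, so each node kept a separate agent registry
  • Solution: Explicitly read REDIS_URL in lifespan.py and pass it to AgentRegistry
  • Verification: Check the shared Redis keys with redis-cli -h <host> -n <db> KEYS 'agent:*', where <db> matches the database number in REDIS_URL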

Integration Test Patterns

  • Use httpx.AsyncClient for async HTTP requests
  • Use pytest fixtures for test setup/teardown
  • Mark async test classes with @pytest.mark.asyncio
  • Test both success and failure cases
  • Verify actual Redis state for persistence tests

Service Deployment

  • Copy code changes to remote nodes before restarting
  • Use systemctl restart to pick up code changes
  • Check journalctl logs for startup errors
  • Verify health endpoint after restart
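
The last step can be automated with a short polling helper run after systemctl restart. This is a sketch using only the standard library. The function names, retry count and delay are arbitrary choices, and it treats any HTTP 200 from /health as healthy because the response body is not documented here.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(url="http://localhost:9001/health", attempts=10, delay=1.0,
                       probe=http_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give the service time to finish starting
    return False
```

The probe and sleep parameters exist so the helper can be exercised in tests without a running service. If wait_until_healthy() returns False, the journalctl logs are the next place to look.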

Cross-Node Setup

  • Both coordinators must use the same Redis instance (same host, port, and database number)
  • REDIS_URL must be set in /etc/aitbc/.env on every node
  • Services must be restarted to pick up environment and code changes
  • Test agent discovery across nodes before task distribution