- Document explicit REDIS_URL environment variable usage in lifespan.py - Add critical shared Redis configuration section for multi-node setups - Document environment file configuration in /etc/aitbc/.env - Add verification steps for cross-node agent discovery - Add lessons learned section covering Redis configuration issues, integration test patterns, service deployment, and cross-node setup - Update service initialization documentation to show
6.0 KiB
6.0 KiB
AITBC Training - Agent Coordinator Architecture
Important: Real Coordinator Location
The actual AITBC Agent Coordinator service is located at:
- Path:
/opt/aitbc/apps/agent-coordinator/src/app/ - Port: 9001
- Service:
aitbc-agent-coordinator.service
DO NOT Use
- Wrong location:
/opt/aitbc/apps/agent-services/agent-coordinator/src/coordinator.py - This is a different/older implementation and is NOT the active service
Key Components
Core Files
agent_discovery.py- Redis-backed agent registry with persistenceload_balancer.py- Load balancer with multiple strategies (least_connections, round_robin, etc.)routers/agents.py- Agent management REST API endpointsrouters/tasks.py- Task submission and distribution API endpointslifespan.py- Service initialization and component startupstate.py- Global state management for coordinator components
Service Initialization
The service initializes in lifespan.py:
- Creates
AgentRegistry(redis_url=os.getenv("REDIS_URL", "redis://localhost:6379/1"))with Redis backing - Starts registry Redis connection
- Creates
LoadBalancer(registry)with least_connections strategy - Creates
TaskDistributor(balancer)with priority queues - Starts background task distribution loop
Agent Registration
API Endpoint
POST /agents/register
Example
curl -X POST http://localhost:9001/agents/register \
-H "Content-Type: application/json" \
-d '{
"agent_id": "hermes-agent",
"agent_type": "worker",
"capabilities": ["data-processing", "analysis", "general"],
"services": ["task-execution", "analysis"],
"endpoints": {"http": "http://localhost:9002"},
"metadata": {"version": "1.0.0"}
}'
Response
{
"status": "success",
"message": "Agent hermes-agent registered successfully",
"agent_id": "hermes-agent",
"registered_at": "2026-05-07T16:26:55.464178+00:00"
}
Task Distribution
API Endpoint
POST /tasks/submit
Example
curl -X POST http://localhost:9001/tasks/submit \
-H "Content-Type: application/json" \
-d '{
"task_data": {
"model": "llama2",
"prompt": "test prompt"
},
"priority": "normal",
"requirements": {}
}'
Distribution Flow
- Task submitted to
TaskDistributor - Distributor finds eligible agents via
AgentRegistry.discover_agents() - Load balancer selects agent using configured strategy (default: least_connections)
- Task assigned to selected agent
- Agent metrics updated in Redis
Task Status
API Endpoint
GET /tasks/status
Example
curl http://localhost:9001/tasks/status
Response
{
"status": "success",
"stats": {
"tasks_distributed": 1,
"tasks_completed": 1,
"tasks_failed": 0,
"load_balancer_stats": {
"strategy": "least_connections",
"active_agents": 1,
"total_assignments": 1,
"avg_agent_load": 1
}
}
}
Agent Discovery
API Endpoint
POST /agents/discover
Example
curl -X POST http://localhost:9001/agents/discover \
-H "Content-Type: application/json" \
-d '{
"status": "active",
"agent_type": "worker"
}'
Redis Persistence
The agent registry uses Redis for persistence:
- Agent data stored as hashes:
agent:{agent_id} - Active agents indexed in set:
agents:active - Load metrics tracked per agent
- Health scores calculated from heartbeats
Service Status
Health Check
curl http://localhost:9001/health
Service Management
systemctl status aitbc-agent-coordinator.service
systemctl restart aitbc-agent-coordinator.service
journalctl -u aitbc-agent-coordinator.service -f
Cross-Node Distribution
Critical: Shared Redis Configuration
For cross-node task distribution to work, ALL coordinator instances MUST use the same shared Redis instance:
-
Environment Configuration: Set
REDIS_URLin/etc/aitbc/.env:REDIS_URL=redis://10.1.223.93:6379/0 -
Service Configuration: The systemd service loads environment variables:
EnvironmentFile=/etc/aitbc/.env -
Application Configuration: The coordinator MUST read the environment variable in
lifespan.py:redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/1") state.agent_registry = AgentRegistry(redis_url=redis_url) -
Verification: Check if agents registered on one node appear on another:
# Register on aitbc1 curl -X POST http://aitbc1:9001/agents/register -d '{...}' # Discover on localhost curl -X POST http://localhost:9001/agents/discover -d '{}' # Should show the aitbc1 agent
Example Cross-Node Setup
# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
-d '{"agent_id":"aitbc1-agent", ...}'
# Submit task on localhost
curl -X POST http://localhost:9001/tasks/submit \
-d '{"task_data":{...}}'
# Task will be distributed to any active agent across nodes
Lessons Learned
Redis Configuration Issues
- Problem: Coordinators default to
redis://localhost:6379/1instead of reading environment variable - Solution: Explicitly read
REDIS_URLinlifespan.pyand pass toAgentRegistry - Verification: Check shared Redis keys:
redis-cli -h <host> KEYS 'agent:*'
Integration Test Patterns
- Use
httpx.AsyncClientfor async HTTP requests - Use pytest fixtures for test setup/teardown
- Mark async test classes with
@pytest.mark.asyncio - Test both success and failure cases
- Verify actual Redis state for persistence tests
Service Deployment
- Copy code changes to remote nodes before restarting
- Use
systemctl restartto pick up code changes - Check journalctl logs for startup errors
- Verify health endpoint after restart
Cross-Node Setup
- Both coordinators must use same Redis instance
- Environment variables must be set correctly
- Service must be restarted to pick up code changes
- Test agent discovery across nodes before task distribution