AITBC Agent Coordinator - Architecture Documentation
System Overview
The AITBC Agent Coordinator is a distributed system that manages AI agents, coordinates task assignment, and balances load across multiple agent instances. It uses Redis for persistence and FastAPI for its REST API endpoints.
Service Location
Actual Service: /opt/aitbc/apps/agent-coordinator/src/app/
Port: 9001
Systemd Service: aitbc-agent-coordinator.service
DO NOT USE: /opt/aitbc/apps/agent-services/agent-coordinator/src/coordinator.py (this is an older/incorrect implementation)
Core Components
1. Agent Registry (agent_discovery.py)
The Agent Registry is the central component for managing agent lifecycle and discovery.
Key Features:
- Redis-backed persistence for agent data
- Agent registration and deregistration
- Agent discovery with filtering (by type, status, capabilities)
- Health score calculation based on heartbeat frequency
- Load metrics tracking (active connections, pending tasks)
Data Model:
- Agent data stored as Redis hashes: agent:{agent_id}
- Active agents indexed in a Redis set: agents:active
- Agent status tracked: active, inactive, busy, stale
Key Classes:
- AgentInfo - Dataclass representing agent information
- AgentRegistry - Main registry class with Redis integration
- AgentDiscoveryService - Service for discovering agents with criteria
2. Load Balancer (load_balancer.py)
The Load Balancer distributes tasks across eligible agents using configurable strategies.
Load Balancing Strategies:
- LEAST_CONNECTIONS - Selects agent with fewest active connections (default)
- ROUND_ROBIN - Distributes tasks in circular order
- WEIGHTED_ROUND_ROBIN - Based on agent performance weights
- RESOURCE_BASED - Based on CPU/memory metrics
- GEOGRAPHIC - Based on agent location
- RANDOM - For testing purposes
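To make the first two strategies concrete, here is a minimal sketch of least-connections and round-robin selection. It is not the code in load_balancer.py: the Candidate class, the function names, and the tie-breaking rule (first agent wins) are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Candidate:
    """Hypothetical view of an agent as the selector sees it."""
    agent_id: str
    active_connections: int = 0


def least_connections(agents):
    # Pick the agent with the fewest active connections.
    # min() returns the first minimum, so ties go to the earliest agent.
    return min(agents, key=lambda a: a.active_connections)


class RoundRobin:
    """Cycles through whatever candidate list it is given, in order."""

    def __init__(self):
        self._counter = count()

    def select(self, agents):
        return agents[next(self._counter) % len(agents)]
```

Both selectors expect a non-empty candidate list; a caller would check for "no eligible agents" before invoking either one.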
Key Classes:
- LoadBalancer - Main load balancer class
- TaskDistributor - Manages task priority queues and distribution
- TaskPriority - Enum for task priorities (urgent, critical, high, normal, low)
Task Distribution Flow:
- Task submitted to TaskDistributor.submit_task()
- Task placed in appropriate priority queue
- Background distribution loop processes queues
- Load balancer finds eligible agents via find_eligible_agents()
- Agent selected using configured strategy
- Task assigned and agent metrics updated
3. REST API Routers
Agent Management (routers/agents.py)
Endpoints:
- POST /agents/register - Register new agent
- POST /agents/discover - Discover agents with filtering
- GET /agents/{agent_id} - Get agent information
- PUT /agents/{agent_id}/status - Update agent status
Task Management (routers/tasks.py)
Endpoints:
- POST /tasks/submit - Submit task for distribution
- GET /tasks/status - Get task distribution statistics
Service Initialization
The service initializes in lifespan.py during FastAPI startup:
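import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI

# state, AgentRegistry, LoadBalancer, LoadBalancingStrategy and TaskDistributor come from the app's own modules
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create AgentRegistry with Redis backing
    state.agent_registry = AgentRegistry()
    await state.agent_registry.start()
    # Create LoadBalancer with registry
    state.load_balancer = LoadBalancer(state.agent_registry)
    state.load_balancer.set_strategy(LoadBalancingStrategy.LEAST_CONNECTIONS)
    # Create TaskDistributor
    state.task_distributor = TaskDistributor(state.load_balancer)
    # Start background tasks
    asyncio.create_task(state.task_distributor.start_distribution())
    asyncio.create_task(state.message_processor.start_processing())
    # Hand control to FastAPI while the application is running
    yield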
Redis Persistence Model
Agent Data Structure
Hash Key: agent:{agent_id}
Fields:
- agent_id - Unique identifier
- agent_type - Type (worker, provider, consumer, general)
- status - Current status (active, inactive, busy, stale)
- capabilities - JSON array of capabilities
- services - JSON array of available services
- endpoints - JSON object of service endpoints
- metadata - JSON object of additional metadata
- last_heartbeat - Timestamp of last heartbeat
- registration_time - Timestamp of registration
- load_metrics - JSON object of load metrics
- health_score - Calculated health score (0.0-1.0)
- version - Agent version
- tags - JSON array of tags
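Because Redis hash values are strings, the structured fields have to be encoded before they are written. The sketch below shows one way a record with these fields could be flattened and restored. AgentRecord, to_hash, and from_hash are hypothetical names rather than the project's AgentInfo API, and storing timestamps as Unix epoch seconds is an assumption.

```python
import json
from dataclasses import dataclass, field

# Fields the document describes as JSON arrays or objects.
JSON_FIELDS = ("capabilities", "services", "endpoints", "metadata", "load_metrics", "tags")
FLOAT_FIELDS = ("last_heartbeat", "registration_time", "health_score")


@dataclass
class AgentRecord:
    """Illustrative stand-in for AgentInfo; not the project's actual class."""
    agent_id: str
    agent_type: str = "general"
    status: str = "active"
    capabilities: list = field(default_factory=list)
    services: list = field(default_factory=list)
    endpoints: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    last_heartbeat: float = 0.0      # assumed: Unix epoch seconds
    registration_time: float = 0.0   # assumed: Unix epoch seconds
    load_metrics: dict = field(default_factory=dict)
    health_score: float = 1.0
    version: str = ""
    tags: list = field(default_factory=list)

    def key(self) -> str:
        # Hash key layout from the document: agent:{agent_id}
        return f"agent:{self.agent_id}"

    def to_hash(self) -> dict:
        """Flatten every field to a string value suitable for a Redis hash."""
        return {
            name: json.dumps(value) if name in JSON_FIELDS else str(value)
            for name, value in vars(self).items()
        }

    @classmethod
    def from_hash(cls, data: dict) -> "AgentRecord":
        """Rebuild a record from the string mapping produced by to_hash()."""
        parsed = dict(data)
        for name in JSON_FIELDS:
            parsed[name] = json.loads(parsed[name])
        for name in FLOAT_FIELDS:
            parsed[name] = float(parsed[name])
        return cls(**parsed)
```

The mapping returned by to_hash() is the kind of value a Redis client would write under the agent:{agent_id} key; the actual client calls are left out so the example runs without a Redis server.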
Indexes
Set Key: agents:active - Contains IDs of all active agents
Agent Lifecycle
Registration
- Agent sends POST /agents/register with agent information
- Coordinator validates agent data
- Agent info stored in Redis
- Agent added to active agents set
- Success response returned
Heartbeat
- Agent sends heartbeat (not yet implemented as endpoint)
- Last heartbeat timestamp updated
- Health score recalculated
- Stale agents marked as inactive (configurable timeout)
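The document does not give the health score formula, so the sketch below assumes one: the score is 1.0 immediately after a heartbeat and falls linearly to 0.0 at the timeout. The function names and the plain-dict agent records are likewise illustrative; only the 300-second default and the "stale agents become inactive" rule come from this document.

```python
import time

HEARTBEAT_TIMEOUT = 300.0  # seconds, the documented default


def health_score(last_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Assumed formula: linear decay from 1.0 (fresh) to 0.0 (at the timeout)."""
    now = time.time() if now is None else now
    age = max(0.0, now - last_heartbeat)
    return max(0.0, 1.0 - age / timeout)


def is_stale(last_heartbeat, now=None, timeout=HEARTBEAT_TIMEOUT):
    now = time.time() if now is None else now
    return (now - last_heartbeat) > timeout


def record_heartbeat(agent, now=None):
    """Update the timestamp and recompute the score on a dict-based record."""
    now = time.time() if now is None else now
    agent["last_heartbeat"] = now
    agent["health_score"] = health_score(now, now)


def mark_stale(agents, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Mark active agents whose heartbeat is too old as inactive; return their IDs."""
    now = time.time() if now is None else now
    changed = []
    for agent in agents:
        if agent["status"] == "active" and is_stale(agent["last_heartbeat"], now, timeout):
            agent["status"] = "inactive"
            changed.append(agent["agent_id"])
    return changed
```

Passing `now` explicitly keeps the functions deterministic in tests; in the running service the sweep would be called periodically with the current time.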
Status Update
- Agent sends PUT /agents/{agent_id}/status
- Status and load metrics updated
- Load balancer uses updated metrics for task assignment
Deregistration
- Agent marked as inactive
- Removed from active agents set
- Data retained in Redis for historical purposes
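The registration and deregistration steps can be sketched with a dict and a set standing in for the Redis hashes and the agents:active set. InMemoryRegistry is a hypothetical class for illustration only; the real AgentRegistry talks to Redis and its validation rules are not described here, so the single check below is an assumption.

```python
class InMemoryRegistry:
    """Dict/set stand-in for the Redis hash and set; illustrative only."""

    def __init__(self):
        self.hashes = {}     # plays the role of the agent:{agent_id} hashes
        self.active = set()  # plays the role of the agents:active set

    def register(self, agent_id, info):
        # Assumed validation: the only rule enforced here is a non-empty ID.
        if not agent_id:
            raise ValueError("agent_id is required")
        self.hashes[f"agent:{agent_id}"] = {**info, "agent_id": agent_id, "status": "active"}
        self.active.add(agent_id)

    def deregister(self, agent_id):
        record = self.hashes.get(f"agent:{agent_id}")
        if record is None:
            return False
        record["status"] = "inactive"   # the record is kept; only its status changes
        self.active.discard(agent_id)   # no longer discoverable as active
        return True
```

After deregister(), the agent's record is still present under its key, which matches the "data retained for historical purposes" step above.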
Task Distribution Flow
Task Submission
sequenceDiagram
    participant Client
    participant Coordinator
    participant TaskDistributor
    participant LoadBalancer
    participant AgentRegistry
    participant Redis
    participant Agent
    Client->>Coordinator: POST /tasks/submit
    Coordinator->>TaskDistributor: submit_task()
    TaskDistributor->>TaskDistributor: add to priority queue
    TaskDistributor->>LoadBalancer: find_eligible_agents()
    LoadBalancer->>AgentRegistry: discover_agents(criteria)
    AgentRegistry->>Redis: query active agents
    Redis-->>AgentRegistry: agent data
    AgentRegistry-->>LoadBalancer: eligible agents
    LoadBalancer->>LoadBalancer: select_agent(strategy)
    LoadBalancer->>Redis: update agent metrics
    LoadBalancer-->>TaskDistributor: selected agent
    TaskDistributor->>Agent: assign task
    Coordinator-->>Client: task submitted
Load Balancing
The load balancer uses the following criteria to select agents:
- Agent status must be "active"
- Agent must have required capabilities
- Agent type must match requirements
- Health score must be above threshold
- Load metrics must be within limits
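These five criteria amount to a single predicate applied to each agent before a strategy picks among the survivors. The sketch below is an assumption-laden illustration: the Agent class, the is_eligible signature, the max_pending limit of 10, and treating a score equal to the threshold as acceptable are not taken from the codebase. Only the 0.5 health threshold comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical, trimmed-down agent record for this example."""
    agent_id: str
    agent_type: str
    status: str
    capabilities: set = field(default_factory=set)
    health_score: float = 1.0
    pending_tasks: int = 0


def is_eligible(agent, required_caps, agent_type=None, min_health=0.5, max_pending=10):
    """Apply the five documented criteria; max_pending is a made-up limit."""
    return (
        agent.status == "active"
        and set(required_caps) <= agent.capabilities          # has every required capability
        and (agent_type is None or agent.agent_type == agent_type)
        and agent.health_score >= min_health                  # assumed: threshold is inclusive
        and agent.pending_tasks <= max_pending
    )


def find_eligible(agents, required_caps, **kwargs):
    return [a for a in agents if is_eligible(a, required_caps, **kwargs)]
```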
Configuration
Environment Variables
- AITBC_REDIS_URL - Redis connection URL (default: redis://localhost:6379)
- AITBC_COORDINATOR_PORT - Coordinator service port (default: 9001)
- AITBC_LOG_LEVEL - Logging level (default: INFO)
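A small loader makes the defaults explicit. The variable names and default values below are the ones listed above; the Settings class and load_settings function are hypothetical and may not match how the service actually reads its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    redis_url: str
    port: int
    log_level: str


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        redis_url=env.get("AITBC_REDIS_URL", "redis://localhost:6379"),
        port=int(env.get("AITBC_COORDINATOR_PORT", "9001")),
        log_level=env.get("AITBC_LOG_LEVEL", "INFO").upper(),
    )
```

Accepting an optional mapping instead of always reading os.environ makes the loader easy to exercise in tests without touching the process environment.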
Load Balancing Configuration
- Default strategy: LEAST_CONNECTIONS
- Strategy can be changed via LoadBalancer.set_strategy()
- Priority queues: urgent, critical, high, normal, low
Health Check Configuration
- Heartbeat timeout: 300 seconds (configurable)
- Health score threshold: 0.5 (configurable)
- Stale agent detection: enabled by default
Monitoring
Metrics Available
- Active agents count
- Tasks distributed/completed/failed
- Average distribution time
- Load balancer success rate
- Agent load distribution
- Queue sizes per priority
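Several of these metrics can be derived from a handful of counters. The sketch below shows one possible bookkeeping scheme; DistributionStats and its fields are illustrative names, and the definitions used (success rate as successful attempts over all attempts, average time over all attempts) are assumptions rather than the service's documented formulas.

```python
from dataclasses import dataclass


@dataclass
class DistributionStats:
    distributed: int = 0      # tasks successfully handed to an agent
    failed: int = 0           # distribution attempts that found no agent or errored
    total_seconds: float = 0.0

    def record(self, success, seconds):
        if success:
            self.distributed += 1
        else:
            self.failed += 1
        self.total_seconds += seconds

    @property
    def attempts(self):
        return self.distributed + self.failed

    @property
    def success_rate(self):
        return self.distributed / self.attempts if self.attempts else 0.0

    @property
    def average_seconds(self):
        return self.total_seconds / self.attempts if self.attempts else 0.0
```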
Monitoring Endpoints
- GET /tasks/status - Task distribution statistics
- GET /health - Service health check
- Future: Prometheus metrics endpoint
Security
Authentication
- API key authentication via middleware (optional)
- JWT token support (optional)
- Role-based access control (optional)
Rate Limiting
- Not currently implemented
- Can be added via FastAPI middleware
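If rate limiting is added, a token bucket is one common choice for the per-client check a middleware would perform. The sketch below covers only the limiter itself and swaps in a generic technique; nothing like it exists in the service today, and the TokenBucket class and its parameters are hypothetical. Wiring it into FastAPI middleware is left out.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        # Refill according to elapsed time, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A middleware would typically keep one bucket per API key or client address and reject the request with HTTP 429 when allow() returns False. With multiple coordinator instances, the bucket state would need to live in shared storage such as Redis rather than in process memory.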
Scalability
Horizontal Scaling
- Multiple coordinator instances can run behind a load balancer
- Redis provides shared state across instances
- Agent registry is distributed via Redis
Performance Considerations
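- Per-agent Redis operations (hash field reads/writes, set add/remove) are O(1); listing all active agents with SMEMBERS is O(N) in the number of agents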
- Task distribution is asynchronous
- Priority queues keep high-priority tasks from being starved behind lower-priority work
- Load balancing strategies can be tuned
Troubleshooting
Common Issues
No active agents:
- Check Redis connection
- Verify agents are registered
- Check agent status (may be inactive/stale)
Tasks not distributing:
- Check task distributor is running
- Verify eligible agents exist
- Check load balancer strategy
- Review task requirements
Agent not discovered:
- Verify agent registration succeeded
- Check agent status is active
- Verify capabilities match query
- Check Redis connection
Debug Commands
# Check service status
systemctl status aitbc-agent-coordinator.service
# View logs
journalctl -u aitbc-agent-coordinator.service -f
# Check Redis
redis-cli
> KEYS agent:*
> SMEMBERS agents:active
# Test API
curl http://localhost:9001/health
curl http://localhost:9001/tasks/status
Future Enhancements
Planned improvements (see Phase 3):
- Agent heartbeat mechanism
- Additional load balancing strategies
- Task priority queue management
- Agent metrics dashboard
- WebSocket support for real-time updates