Add message storage, broadcast, and peer management features to agent coordinator

- Import MessageStorage and PeerStorage in lifespan
- Initialize message_storage and peer_storage with Redis URL
- Add start/stop lifecycle management for storage services
- Add protocol field to MessageRequest model with validation
- Add BroadcastRequest model with agent_type and capabilities filters
- Store sent messages in Redis with metadata (message_id, sender, receiver, type, priority, protocol, timestamp)
- Add /

2026-05-07 20:16:50 +02:00

AITBC Agent Coordinator - Architecture Documentation

System Overview

The AITBC Agent Coordinator is a task distribution service that manages AI agents, assigns tasks to them, and balances load across multiple agent instances. It uses Redis for persistence and FastAPI for its REST API endpoints.

Service Location

Actual Service: /opt/aitbc/apps/agent-coordinator/src/app/
Port: 9001
Systemd Service: aitbc-agent-coordinator.service

DO NOT USE: /opt/aitbc/apps/agent-services/agent-coordinator/src/coordinator.py (this is an older/incorrect implementation)

Core Components

1. Agent Registry (`agent_discovery.py`)

The Agent Registry is the central component for managing agent lifecycle and discovery.

Key Features:

Redis-backed persistence for agent data
Agent registration and deregistration
Agent discovery with filtering (by type, status, capabilities)
Health score calculation based on heartbeat frequency
Load metrics tracking (active connections, pending tasks)

Data Model:

Agent data stored as Redis hashes: agent:{agent_id}
Active agents indexed in Redis set: agents:active
Agent status tracked: active, inactive, busy, stale

Key Classes:

AgentInfo - Dataclass representing agent information
AgentRegistry - Main registry class with Redis integration
AgentDiscoveryService - Service for discovering agents with criteria

2. Load Balancer (`load_balancer.py`)

The Load Balancer distributes tasks across eligible agents using configurable strategies.

Load Balancing Strategies:

LEAST_CONNECTIONS - Selects agent with fewest active connections (default)
ROUND_ROBIN - Distributes tasks in circular order
WEIGHTED_ROUND_ROBIN - Based on agent performance weights
RESOURCE_BASED - Based on CPU/memory metrics
GEOGRAPHIC - Based on agent location
RANDOM - For testing purposes
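
To make the first two strategies concrete, here is a minimal sketch of how a selector could choose between LEAST_CONNECTIONS and ROUND_ROBIN. The `AgentLoad` record, the `StrategySelector` class, and the enum string values are hypothetical stand-ins for illustration; the real `LoadBalancer` in load_balancer.py may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count
from typing import List, Optional


class LoadBalancingStrategy(Enum):
    # Only two of the documented strategies are sketched; string values are assumed.
    LEAST_CONNECTIONS = "least_connections"
    ROUND_ROBIN = "round_robin"


@dataclass
class AgentLoad:
    # Hypothetical, trimmed-down stand-in for the registry's agent record.
    agent_id: str
    active_connections: int


class StrategySelector:
    """Picks one agent from a candidate list using the configured strategy."""

    def __init__(
        self,
        strategy: LoadBalancingStrategy = LoadBalancingStrategy.LEAST_CONNECTIONS,
    ) -> None:
        self.strategy = strategy
        self._rr_counter = count()  # monotonically increasing round-robin cursor

    def select(self, agents: List[AgentLoad]) -> Optional[AgentLoad]:
        if not agents:
            return None
        if self.strategy is LoadBalancingStrategy.LEAST_CONNECTIONS:
            # On ties, min() returns the first agent in list order.
            return min(agents, key=lambda a: a.active_connections)
        # ROUND_ROBIN: walk the candidate list in circular order.
        return agents[next(self._rr_counter) % len(agents)]
```

Round-robin here assumes the candidate list keeps a stable order between calls; if the eligible set changes often, the rotation is only approximately fair.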

Key Classes:

LoadBalancer - Main load balancer class
TaskDistributor - Manages task priority queues and distribution
TaskPriority - Enum for task priorities (urgent, critical, high, normal, low)

Task Distribution Flow:

Task submitted to TaskDistributor.submit_task()
Task placed in appropriate priority queue
Background distribution loop processes queues
Load balancer finds eligible agents via find_eligible_agents()
Agent selected using configured strategy
Task assigned and agent metrics updated
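
The queueing steps of this flow can be sketched with one FIFO queue per priority level. This is an illustration only: the `PriorityQueues` class, the `next_task()` method, and the numeric enum values are assumptions, and the actual `TaskDistributor` may order or store tasks differently.

```python
from collections import deque
from enum import IntEnum
from typing import Any, Deque, Dict, Optional


class TaskPriority(IntEnum):
    # Lower value is served first; the numeric values are an assumption.
    URGENT = 0
    CRITICAL = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4


class PriorityQueues:
    """One FIFO queue per priority; higher priorities are always drained first."""

    def __init__(self) -> None:
        self._queues: Dict[TaskPriority, Deque[Dict[str, Any]]] = {
            priority: deque() for priority in TaskPriority
        }

    def submit_task(
        self, task: Dict[str, Any], priority: TaskPriority = TaskPriority.NORMAL
    ) -> None:
        self._queues[priority].append(task)

    def next_task(self) -> Optional[Dict[str, Any]]:
        # Enum iteration follows definition order: URGENT first, LOW last.
        for priority in TaskPriority:
            if self._queues[priority]:
                return self._queues[priority].popleft()
        return None

    def sizes(self) -> Dict[str, int]:
        return {p.name.lower(): len(q) for p, q in self._queues.items()}
```

With strict draining like this, a steady stream of urgent tasks can delay low-priority tasks indefinitely, so a real distribution loop may need aging or per-priority quotas.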

3. REST API Routers

Agent Management (`routers/agents.py`)

Endpoints:

POST /agents/register - Register new agent
POST /agents/discover - Discover agents with filtering
GET /agents/{agent_id} - Get agent information
PUT /agents/{agent_id}/status - Update agent status
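
As a rough illustration of calling the registration endpoint, the sketch below builds a POST request with the standard library. The request body schema is an assumption: the field names are borrowed from the Agent Data Structure section, and the router may require different or additional fields. `build_register_request` and `register_agent` are hypothetical helper names.

```python
import json
import urllib.request
from typing import List

COORDINATOR_URL = "http://localhost:9001"  # port taken from the Service Location section


def build_register_request(
    agent_id: str,
    agent_type: str,
    capabilities: List[str],
    base_url: str = COORDINATOR_URL,
) -> urllib.request.Request:
    # Assumed payload shape; check routers/agents.py for the real request model.
    payload = {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "capabilities": capabilities,
    }
    return urllib.request.Request(
        url=f"{base_url}/agents/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_agent(request: urllib.request.Request) -> str:
    # Sends the request; requires a running coordinator.
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")
```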

Task Management (`routers/tasks.py`)

Endpoints:

POST /tasks/submit - Submit task for distribution
GET /tasks/status - Get task distribution statistics

4. Agent Communication (`protocols/communication.py`)

The Agent Communication system enables agents to communicate with each other through the coordinator using various protocols.

Message Types:

DIRECT - Point-to-point messages between specific agents
BROADCAST - Messages sent to all connected agents
HIERARCHICAL - Master-agent to sub-agent communication
PEER_TO_PEER - Direct agent-to-agent communication
COORDINATION - Coordination and synchronization messages
TASK_ASSIGNMENT - Task distribution messages
STATUS_UPDATE - Agent status updates
HEARTBEAT - Keep-alive messages
DISCOVERY - Agent discovery messages
CONSENSUS - Consensus protocol messages

Message Priorities:

LOW - Low priority messages
NORMAL - Normal priority (default)
HIGH - High priority messages
CRITICAL - Critical priority messages

Communication Protocols:

Hierarchical Protocol:

Master agents manage sub-agents
Messages flow from master to sub-agents
Sub-agents can send messages back to master
Suitable for coordinated task execution

Peer-to-Peer Protocol:

Direct agent-to-agent communication
Agents maintain peer connections
Messages sent directly between peers
Suitable for decentralized coordination

Message Structure:

AgentMessage:
  - id: Unique message ID (UUID)
  - sender_id: Sending agent ID
  - receiver_id: Target agent ID (optional for broadcast)
  - message_type: Type of message
  - priority: Message priority level
  - timestamp: Message creation time (UTC)
  - payload: Message data (dictionary)
  - correlation_id: For request-response correlation
  - reply_to: For reply messages
  - ttl: Time-to-live in seconds (default: 300)
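
The structure above maps naturally onto a dataclass. The following is a sketch rather than the definition in communication.py: field types, the string defaults for `message_type` and `priority`, and the `is_expired()` helper are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


@dataclass
class AgentMessage:
    sender_id: str
    message_type: str                      # e.g. "direct", "broadcast" (values assumed)
    payload: Dict[str, Any] = field(default_factory=dict)
    receiver_id: Optional[str] = None      # None for broadcast messages
    priority: str = "normal"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    correlation_id: Optional[str] = None   # ties a response to its request
    reply_to: Optional[str] = None
    ttl: int = 300                         # seconds

    def is_expired(self, now: Optional[datetime] = None) -> bool:
        """True once more than `ttl` seconds have passed since creation."""
        now = now or datetime.now(timezone.utc)
        return now - self.timestamp > timedelta(seconds=self.ttl)
```

A message handler could call `is_expired()` before processing and drop the message when it returns True.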

Communication Flow:

Agents register with coordinator via POST /agents/register
Agents establish connections via endpoints
Messages routed through coordinator or direct connections
Message handlers process incoming messages
TTL ensures expired messages are discarded
Priority levels ensure important messages are processed first

Current Implementation Status:

Implemented:

POST /messages/send - Send messages (hardcoded to "hierarchical" protocol only)
GET /load-balancer/stats - Load balancer statistics
GET /registry/stats - Agent registry statistics
GET /agents/service/{service} - Find agents by service
GET /agents/capability/{capability} - Find agents by capability
PUT /load-balancer/strategy - Change load balancing strategy

Missing / Incomplete:

POST /messages/send only uses "hierarchical" protocol - doesn't support:
- peer_to_peer protocol
- broadcast protocol
- Other protocols defined in MessageType enum
No broadcast endpoint - Can't send broadcast messages via API
No message history/storage - Messages aren't persisted
No peer management endpoints - Can't add/remove peers via API

Note: The protocols (Hierarchical, P2P, Broadcast) are well-implemented in communication.py, but the API layer (messages.py) doesn't fully expose them yet.

Service Initialization

The service initializes in lifespan.py during FastAPI startup:

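import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI

# AgentRegistry, LoadBalancer, LoadBalancingStrategy, TaskDistributor and the
# shared `state` object come from the app's own modules (import paths omitted).
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create AgentRegistry with Redis backing
    state.agent_registry = AgentRegistry()
    await state.agent_registry.start()

    # Create LoadBalancer with registry
    state.load_balancer = LoadBalancer(state.agent_registry)
    state.load_balancer.set_strategy(LoadBalancingStrategy.LEAST_CONNECTIONS)

    # Create TaskDistributor
    state.task_distributor = TaskDistributor(state.load_balancer)

    # Start background tasks, keeping references so they are not garbage-collected.
    # state.message_processor is assumed to be created elsewhere before this point.
    background_tasks = [
        asyncio.create_task(state.task_distributor.start_distribution()),
        asyncio.create_task(state.message_processor.start_processing()),
    ]

    yield  # the application serves requests here

    for task in background_tasks:  # shutdown: stop the background loops
        task.cancel()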

Redis Persistence Model

Agent Data Structure

Hash Key: agent:{agent_id}

Fields:

agent_id - Unique identifier
agent_type - Type (worker, provider, consumer, general)
status - Current status (active, inactive, busy, stale)
capabilities - JSON array of capabilities
services - JSON array of available services
endpoints - JSON object of service endpoints
metadata - JSON object of additional metadata
last_heartbeat - Timestamp of last heartbeat
registration_time - Timestamp of registration
load_metrics - JSON object of load metrics
health_score - Calculated health score (0.0-1.0)
version - Agent version
tags - JSON array of tags

Indexes

Set Key: agents:active - Contains IDs of all active agents
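
Because Redis hash values are flat strings, the JSON-typed fields listed above have to be encoded on write and decoded on read. A minimal sketch of that round trip follows; the helper names `agent_key`, `to_redis_hash`, and `from_redis_hash` are hypothetical, and the real registry may serialize differently.

```python
import json
from typing import Any, Dict

# Fields the data model describes as JSON arrays or objects.
JSON_FIELDS = ("capabilities", "services", "endpoints", "metadata", "load_metrics", "tags")


def agent_key(agent_id: str) -> str:
    return f"agent:{agent_id}"


def to_redis_hash(agent: Dict[str, Any]) -> Dict[str, str]:
    """Flatten an agent record into string values suitable for a Redis hash."""
    return {
        key: json.dumps(value) if key in JSON_FIELDS else str(value)
        for key, value in agent.items()
    }


def from_redis_hash(raw: Dict[str, str]) -> Dict[str, Any]:
    """Inverse of to_redis_hash for the JSON fields and the health score."""
    agent: Dict[str, Any] = dict(raw)
    for key in JSON_FIELDS:
        if key in agent:
            agent[key] = json.loads(agent[key])
    if "health_score" in agent:
        agent["health_score"] = float(agent["health_score"])
    return agent
```

With the redis-py client, the mapping could then be written with `hset(agent_key(agent_id), mapping=...)` and the ID added to the index with `sadd("agents:active", agent_id)`.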

Agent Lifecycle

Registration

Agent sends POST /agents/register with agent information
Coordinator validates agent data
Agent info stored in Redis
Agent added to active agents set
Success response returned

Heartbeat

Agent sends heartbeat (not yet implemented as endpoint)
Last heartbeat timestamp updated
Health score recalculated
Stale agents marked as inactive (configurable timeout)
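
The document does not give the health score formula, so the following is only one plausible sketch: a score that decays linearly from 1.0 at the moment of a heartbeat to 0.0 at the heartbeat timeout. Both function names and the linear decay are assumptions; the 300-second default matches the timeout listed under Health Check Configuration.

```python
def health_score(seconds_since_heartbeat: float, timeout: float = 300.0) -> float:
    """Linear decay from 1.0 (fresh heartbeat) to 0.0 (timeout reached). Assumed formula."""
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    score = 1.0 - seconds_since_heartbeat / timeout
    return max(0.0, min(1.0, score))


def is_stale(seconds_since_heartbeat: float, timeout: float = 300.0) -> bool:
    """An agent counts as stale once the timeout has fully elapsed."""
    return seconds_since_heartbeat >= timeout
```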

Status Update

Agent sends PUT /agents/{agent_id}/status
Status and load metrics updated
Load balancer uses updated metrics for task assignment

Deregistration

Agent marked as inactive
Removed from active agents set
Data retained in Redis for historical purposes

Task Distribution Flow

Task Submission

sequenceDiagram
    participant Client
    participant Coordinator
    participant TaskDistributor
    participant LoadBalancer
    participant AgentRegistry
    participant Redis
    participant Agent
    
    Client->>Coordinator: POST /tasks/submit
    Coordinator->>TaskDistributor: submit_task()
    TaskDistributor->>TaskDistributor: add to priority queue
    TaskDistributor->>LoadBalancer: find_eligible_agents()
    LoadBalancer->>AgentRegistry: discover_agents(criteria)
    AgentRegistry->>Redis: query active agents
    Redis-->>AgentRegistry: agent data
    AgentRegistry-->>LoadBalancer: eligible agents
    LoadBalancer->>LoadBalancer: select_agent(strategy)
    LoadBalancer->>Redis: update agent metrics
    LoadBalancer-->>TaskDistributor: selected agent
    TaskDistributor->>Agent: assign task
    Coordinator-->>Client: task submitted

Load Balancing

The load balancer uses the following criteria to select agents:

Agent status must be "active"
Agent must have required capabilities
Agent type must match requirements
Health score must be above threshold
Load metrics must be within limits
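
These five criteria can be expressed as a single filter. The sketch below is an assumption about how `find_eligible_agents()` might behave: the trimmed `AgentInfo` fields, the `pending_tasks` metric name, the `max_pending_tasks` limit, and treating the health threshold as inclusive are all illustrative choices.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class AgentInfo:
    # Trimmed-down, hypothetical version of the registry's AgentInfo.
    agent_id: str
    agent_type: str
    status: str
    capabilities: Set[str] = field(default_factory=set)
    health_score: float = 1.0
    load_metrics: Dict[str, float] = field(default_factory=dict)


def find_eligible_agents(
    agents: Iterable[AgentInfo],
    required_capabilities: Iterable[str],
    agent_type: Optional[str] = None,
    min_health: float = 0.5,          # default threshold from Health Check Configuration
    max_pending_tasks: int = 10,      # assumed load limit
) -> List[AgentInfo]:
    required = set(required_capabilities)
    eligible = []
    for agent in agents:
        if agent.status != "active":
            continue
        if not required <= agent.capabilities:  # agent must have every required capability
            continue
        if agent_type is not None and agent.agent_type != agent_type:
            continue
        if agent.health_score < min_health:     # threshold treated as inclusive (assumption)
            continue
        if agent.load_metrics.get("pending_tasks", 0) > max_pending_tasks:
            continue
        eligible.append(agent)
    return eligible
```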

Configuration

Environment Variables

AITBC_REDIS_URL - Redis connection URL (default: redis://localhost:6379)
AITBC_COORDINATOR_PORT - Coordinator service port (default: 9001)
AITBC_LOG_LEVEL - Logging level (default: INFO)
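
A small loader can read these variables with the documented defaults. This is a sketch: `CoordinatorSettings` and `load_settings` are hypothetical names, and the service may use a different settings mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CoordinatorSettings:
    redis_url: str
    port: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> CoordinatorSettings:
    """Read the AITBC_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return CoordinatorSettings(
        redis_url=env.get("AITBC_REDIS_URL", "redis://localhost:6379"),
        port=int(env.get("AITBC_COORDINATOR_PORT", "9001")),
        log_level=env.get("AITBC_LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the process environment.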

Load Balancing Configuration

Default strategy: LEAST_CONNECTIONS
Strategy can be changed via LoadBalancer.set_strategy()
Priority queues: urgent, critical, high, normal, low

Health Check Configuration

Heartbeat timeout: 300 seconds (configurable)
Health score threshold: 0.5 (configurable)
Stale agent detection: enabled by default

Monitoring

Metrics Available

Active agents count
Tasks distributed/completed/failed
Average distribution time
Load balancer success rate
Agent load distribution
Queue sizes per priority

Monitoring Endpoints

GET /tasks/status - Task distribution statistics
GET /health - Service health check
Future: Prometheus metrics endpoint

Security

Authentication

API key authentication via middleware (optional)
JWT token support (optional)
Role-based access control (optional)

Rate Limiting

Not currently implemented
Can be added via FastAPI middleware
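
One possible building block for such middleware is a per-client sliding-window counter. The sketch below is not part of the coordinator; the class name, its parameters, and the injectable clock are assumptions chosen to keep the example self-contained.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowRateLimiter:
    """Allows at most `max_requests` per client within any `window_seconds` span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A FastAPI `@app.middleware("http")` function could call `allow()` with the client's API key or IP address and return a 429 response when it yields False. State held in process memory is not shared between coordinator instances, so a multi-instance deployment would need the counters in Redis instead.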

Scalability

Horizontal Scaling

Multiple coordinator instances can run behind a load balancer
Redis provides shared state across instances
Agent registry is distributed via Redis

Performance Considerations

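Per-agent Redis operations (hash reads and writes, set add/remove) are O(1); listing all active agents with SMEMBERS is O(N) in the number of agents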
Task distribution is asynchronous
Priority queues keep urgent tasks from waiting behind lower-priority work
Load balancing strategies can be tuned

Troubleshooting

Common Issues

No active agents:

Check Redis connection
Verify agents are registered
Check agent status (may be inactive/stale)

Tasks not distributing:

Check task distributor is running
Verify eligible agents exist
Check load balancer strategy
Review task requirements

Agent not discovered:

Verify agent registration succeeded
Check agent status is active
Verify capabilities match query
Check Redis connection

Debug Commands

# Check service status
systemctl status aitbc-agent-coordinator.service

# View logs
journalctl -u aitbc-agent-coordinator.service -f

# Check Redis (SCAN iterates incrementally; KEYS can block the server on a large keyspace)
redis-cli
> SCAN 0 MATCH agent:* COUNT 100
> SMEMBERS agents:active

# Test API
curl http://localhost:9001/health
curl http://localhost:9001/tasks/status

Future Enhancements

Planned improvements (see Phase 3):

Agent heartbeat mechanism
Additional load balancing strategies
Task priority queue management
Agent metrics dashboard
WebSocket support for real-time updates
