aitbc/.windsurf/skills/aitbc-training.md
aitbc b1edfe446a Add Redis URL configuration documentation and cross-node setup lessons learned
- Document explicit REDIS_URL environment variable usage in lifespan.py
- Add critical shared Redis configuration section for multi-node setups
- Document environment file configuration in /etc/aitbc/.env
- Add verification steps for cross-node agent discovery
- Add lessons learned section covering Redis configuration issues, integration test patterns, service deployment, and cross-node setup
- Update service initialization documentation to show
2026-05-07 19:18:29 +02:00

# AITBC Training - Agent Coordinator Architecture
## Important: Real Coordinator Location
The actual AITBC Agent Coordinator service is located at:
- **Path:** `/opt/aitbc/apps/agent-coordinator/src/app/`
- **Port:** 9001
- **Service:** `aitbc-agent-coordinator.service`
## DO NOT Use
- **Wrong location:** `/opt/aitbc/apps/agent-services/agent-coordinator/src/coordinator.py`
- This is a different/older implementation and is NOT the active service
## Key Components
### Core Files
- `agent_discovery.py` - Redis-backed agent registry with persistence
- `load_balancer.py` - Load balancer with multiple strategies (least_connections, round_robin, etc.)
- `routers/agents.py` - Agent management REST API endpoints
- `routers/tasks.py` - Task submission and distribution API endpoints
- `lifespan.py` - Service initialization and component startup
- `state.py` - Global state management for coordinator components
### Service Initialization
The service initializes in `lifespan.py`:
1. Creates `AgentRegistry(redis_url=os.getenv("REDIS_URL", "redis://localhost:6379/1"))` with Redis backing
2. Starts registry Redis connection
3. Creates `LoadBalancer(registry)` with least_connections strategy
4. Creates `TaskDistributor(balancer)` with priority queues
5. Starts background task distribution loop
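The startup order above can be sketched as an async context manager. This is a minimal illustration, not the coordinator's actual code: the stub classes, their method names (`start`, `stop`, `run`), and the `state` dictionary are assumptions standing in for the real `AgentRegistry`, `LoadBalancer`, `TaskDistributor`, and `state.py`.

```python
import asyncio
import os
from contextlib import asynccontextmanager


class AgentRegistry:
    """Stub for illustration; the real class lives in agent_discovery.py."""

    def __init__(self, redis_url):
        self.redis_url = redis_url
        self.connected = False

    async def start(self):  # hypothetical method name
        # The real registry would open its Redis connection here.
        self.connected = True

    async def stop(self):  # hypothetical method name
        self.connected = False


class LoadBalancer:
    """Stub for illustration."""

    def __init__(self, registry, strategy="least_connections"):
        self.registry = registry
        self.strategy = strategy


class TaskDistributor:
    """Stub for illustration."""

    def __init__(self, balancer):
        self.balancer = balancer

    async def run(self):  # hypothetical background loop
        while True:
            await asyncio.sleep(1)


state = {}  # stand-in for the module-level state in state.py


@asynccontextmanager
async def lifespan(app=None):
    # Steps 1-2: build the registry from REDIS_URL and connect it.
    registry = AgentRegistry(
        redis_url=os.getenv("REDIS_URL", "redis://localhost:6379/1")
    )
    await registry.start()
    # Steps 3-4: wire the balancer and the distributor on top of it.
    balancer = LoadBalancer(registry)
    distributor = TaskDistributor(balancer)
    # Step 5: start the background distribution loop.
    loop_task = asyncio.create_task(distributor.run())
    state.update(registry=registry, balancer=balancer, distributor=distributor)
    try:
        yield
    finally:
        # Shut down in reverse order: stop the loop, then close the registry.
        loop_task.cancel()
        try:
            await loop_task
        except asyncio.CancelledError:
            pass
        await registry.stop()
```

Creating the components in dependency order and tearing them down in reverse keeps the distribution loop from running against a closed registry.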
## Agent Registration
### API Endpoint
```
POST /agents/register
```
### Example
```bash
curl -X POST http://localhost:9001/agents/register \
-H "Content-Type: application/json" \
-d '{
"agent_id": "hermes-agent",
"agent_type": "worker",
"capabilities": ["data-processing", "analysis", "general"],
"services": ["task-execution", "analysis"],
"endpoints": {"http": "http://localhost:9002"},
"metadata": {"version": "1.0.0"}
}'
```
### Response
```json
{
"status": "success",
"message": "Agent hermes-agent registered successfully",
"agent_id": "hermes-agent",
"registered_at": "2026-05-07T16:26:55.464178+00:00"
}
```
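The same registration can be sent from Python using only the standard library. This is a sketch that mirrors the curl example; the helper names `build_registration` and `register_agent` are hypothetical and not part of the coordinator code.

```python
import json
import urllib.request


def build_registration(agent_id, agent_type, capabilities, services,
                       http_endpoint, version="1.0.0"):
    """Build the JSON body used in the curl example above."""
    return {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "capabilities": list(capabilities),
        "services": list(services),
        "endpoints": {"http": http_endpoint},
        "metadata": {"version": version},
    }


def register_agent(base_url, payload, timeout=5.0):
    """POST the payload to /agents/register and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/agents/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running coordinator, `register_agent("http://localhost:9001", payload)` should return a reply shaped like the response above.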
## Task Distribution
### API Endpoint
```
POST /tasks/submit
```
### Example
```bash
curl -X POST http://localhost:9001/tasks/submit \
-H "Content-Type: application/json" \
-d '{
"task_data": {
"model": "llama2",
"prompt": "test prompt"
},
"priority": "normal",
"requirements": {}
}'
```
### Distribution Flow
1. Task submitted to `TaskDistributor`
2. Distributor finds eligible agents via `AgentRegistry.discover_agents()`
3. Load balancer selects agent using configured strategy (default: least_connections)
4. Task assigned to selected agent
5. Agent metrics updated in Redis
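The flow above can be illustrated with a small in-memory model. This sketch is not the coordinator's implementation: the `Agent` dataclass, the `capabilities` key inside `requirements`, and the function names are assumptions, and the load metric is kept in memory instead of Redis.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: str
    capabilities: set
    status: str = "active"
    active_connections: int = 0


def discover_agents(agents, required_capabilities=()):
    """Step 2: keep active agents that offer every required capability."""
    needed = set(required_capabilities)
    return [a for a in agents if a.status == "active" and needed <= a.capabilities]


def select_least_connections(candidates):
    """Step 3: pick the agent with the fewest active connections."""
    if not candidates:
        return None
    # min() returns the first agent in list order when several tie.
    return min(candidates, key=lambda a: a.active_connections)


def distribute(task, agents):
    """Steps 2-5: discover, select, assign, and update the load metric."""
    required = task.get("requirements", {}).get("capabilities", ())
    chosen = select_least_connections(discover_agents(agents, required))
    if chosen is None:
        return None  # no eligible agent; a real distributor might queue the task
    chosen.active_connections += 1  # step 5, held in memory for this sketch
    return chosen.agent_id
```

With `least_connections`, repeated submissions spread across agents because each assignment raises the chosen agent's load before the next selection.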
## Task Status
### API Endpoint
```
GET /tasks/status
```
### Example
```bash
curl http://localhost:9001/tasks/status
```
### Response
```json
{
"status": "success",
"stats": {
"tasks_distributed": 1,
"tasks_completed": 1,
"tasks_failed": 0,
"load_balancer_stats": {
"strategy": "least_connections",
"active_agents": 1,
"total_assignments": 1,
"avg_agent_load": 1
}
}
}
```
## Agent Discovery
### API Endpoint
```
POST /agents/discover
```
### Example
```bash
curl -X POST http://localhost:9001/agents/discover \
-H "Content-Type: application/json" \
-d '{
"status": "active",
"agent_type": "worker"
}'
```
## Redis Persistence
The agent registry uses Redis for persistence:
- Agent data stored as hashes: `agent:{agent_id}`
- Active agents indexed in set: `agents:active`
- Load metrics tracked per agent
- Health scores calculated from heartbeats
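The key layout above can be sketched as follows. The `client` argument is assumed to be a redis-py `redis.Redis` instance (its `hset(name, mapping=...)` and `sadd(name, *values)` methods are used); the helper names and the choice to JSON-encode nested fields are assumptions, not the registry's confirmed storage format.

```python
import json


def agent_key(agent_id):
    """Return the hash key for one agent, following the agent:{agent_id} layout."""
    return f"agent:{agent_id}"


def to_hash_fields(agent):
    """Flatten an agent record for a Redis hash.

    Hash values must be strings or numbers, so lists, dicts, booleans,
    and None are JSON-encoded (an assumption about the storage format).
    """
    fields = {}
    for name, value in agent.items():
        is_scalar = isinstance(value, (str, int, float)) and not isinstance(value, bool)
        fields[name] = value if is_scalar else json.dumps(value)
    return fields


def save_agent(client, agent):
    """Store the agent hash and add its ID to the active-agent index."""
    client.hset(agent_key(agent["agent_id"]), mapping=to_hash_fields(agent))
    client.sadd("agents:active", agent["agent_id"])
```

Keeping a separate `agents:active` set lets discovery list active agents without scanning every `agent:*` key.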
## Service Status
### Health Check
```bash
curl http://localhost:9001/health
```
### Service Management
```bash
systemctl status aitbc-agent-coordinator.service
systemctl restart aitbc-agent-coordinator.service
journalctl -u aitbc-agent-coordinator.service -f
```
## Cross-Node Distribution
### Critical: Shared Redis Configuration
For cross-node task distribution to work, ALL coordinator instances MUST use the same shared Redis instance:
1. **Environment Configuration:** Set `REDIS_URL` in `/etc/aitbc/.env`:
```
REDIS_URL=redis://10.1.223.93:6379/0
```
2. **Service Configuration:** The systemd service loads environment variables:
```
EnvironmentFile=/etc/aitbc/.env
```
3. **Application Configuration:** The coordinator MUST read the environment variable in `lifespan.py`:
```python
redis_url = os.getenv("REDIS_URL", "redis://localhost:6379/1")
state.agent_registry = AgentRegistry(redis_url=redis_url)
```
4. **Verification:** Check if agents registered on one node appear on another:
```bash
# Register on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
-H "Content-Type: application/json" -d '{...}'
# Discover on localhost
curl -X POST http://localhost:9001/agents/discover \
-H "Content-Type: application/json" -d '{}'
# Should show the aitbc1 agent
```
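The verification step can also be scripted. The sketch below assumes nothing about the exact shape of the `/agents/discover` response, which this document does not show; it searches the decoded JSON recursively for an `agent_id` field. The helper names are hypothetical.

```python
import json
import urllib.request


def post_json(url, payload, timeout=5.0):
    """POST a JSON body and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def contains_agent(document, agent_id):
    """Return True if any nested object has "agent_id" equal to agent_id."""
    if isinstance(document, dict):
        if document.get("agent_id") == agent_id:
            return True
        return any(contains_agent(value, agent_id) for value in document.values())
    if isinstance(document, list):
        return any(contains_agent(item, agent_id) for item in document)
    return False


def visible_on_node(node_url, agent_id):
    """Ask one coordinator for all agents and check whether agent_id is listed."""
    reply = post_json(f"{node_url}/agents/discover", {})
    return contains_agent(reply, agent_id)
```

After registering an agent on `aitbc1`, `visible_on_node("http://localhost:9001", "aitbc1-agent")` returning `False` suggests the two coordinators are not sharing the same Redis instance.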
### Example Cross-Node Setup
```bash
# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
-H "Content-Type: application/json" \
-d '{"agent_id":"aitbc1-agent", ...}'
# Submit task on localhost
curl -X POST http://localhost:9001/tasks/submit \
-H "Content-Type: application/json" \
-d '{"task_data":{...}}'
# Task will be distributed to any active agent across nodes
```
## Lessons Learned
### Redis Configuration Issues
- **Problem:** Coordinators default to `redis://localhost:6379/1` instead of reading the `REDIS_URL` environment variable
- **Solution:** Explicitly read `REDIS_URL` in `lifespan.py` and pass it to `AgentRegistry`
- **Verification:** Check the shared Redis keys, selecting the database number from `REDIS_URL` (`redis-cli` uses database 0 unless `-n` is given): `redis-cli -h <host> -n <db> KEYS 'agent:*'`
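A quick way to catch this misconfiguration is to inspect the URL a coordinator would actually use. The sketch below parses `REDIS_URL` with the standard library and flags a loopback host; the helper names are hypothetical, and the fallback value matches the default used in `lifespan.py`.

```python
import os
from urllib.parse import urlsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}
DEFAULT_REDIS_URL = "redis://localhost:6379/1"


def effective_redis_url(environ=os.environ):
    """Return the URL the coordinator would use given the environment."""
    return environ.get("REDIS_URL", DEFAULT_REDIS_URL)


def describe_redis_url(url):
    """Split a redis:// URL into host, port, database number, and a loopback flag."""
    parts = urlsplit(url)
    database = parts.path.lstrip("/") or "0"  # no path means database 0
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,
        "db": int(database),
        "is_local": parts.hostname in LOCAL_HOSTS,
    }
```

If `describe_redis_url(effective_redis_url())["is_local"]` is true on a node that is meant to share state, the environment file is probably not being loaded. Comparing `host` and `db` across nodes also catches the case where they point at the same server but different database numbers.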
### Integration Test Patterns
- Use `httpx.AsyncClient` for async HTTP requests
- Use pytest fixtures for test setup/teardown
- Mark async test classes with `@pytest.mark.asyncio`
- Test both success and failure cases
- Verify actual Redis state for persistence tests
### Service Deployment
- Copy code changes to remote nodes before restarting
- Use `systemctl restart` to pick up code changes
- Check journalctl logs for startup errors
- Verify health endpoint after restart
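The post-restart health check can be automated with a short polling loop, since the service may need a moment before it accepts connections. This is a sketch with hypothetical helper names; it assumes only that `/health` returns HTTP 200 when the service is up.

```python
import time
import urllib.error
import urllib.request


def http_health_ok(url, timeout=2.0):
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

For example, `wait_until_healthy(lambda: http_health_ok("http://localhost:9001/health"))` can be run right after `systemctl restart`; a `None` result is a cue to check the `journalctl` logs.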
### Cross-Node Setup
- Both coordinators must use the same Redis instance and the same database number
- `REDIS_URL` must be set in `/etc/aitbc/.env` on every node
- The service must be restarted to pick up code or environment changes
- Test agent discovery across nodes before task distribution