AITBC Agent Coordinator - Operator Guide
This guide provides operators with the knowledge to deploy, configure, monitor, and troubleshoot the AITBC Agent Coordinator service.
Service Deployment
Prerequisites
- Redis server running on localhost or remote host
- Python 3.13+
- Systemd (for service management)
- AITBC blockchain node (optional, for blockchain integration)
Installation
- Install dependencies:
cd /opt/aitbc/apps/agent-coordinator
pip install -r requirements.txt
- Configure environment:
# Edit /etc/aitbc/.env
export AITBC_REDIS_URL=redis://localhost:6379
export AITBC_COORDINATOR_PORT=9001
export AITBC_LOG_LEVEL=INFO
- Start Redis:
systemctl start redis
systemctl enable redis
- Start coordinator service:
systemctl start aitbc-agent-coordinator.service
systemctl enable aitbc-agent-coordinator.service
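As a minimal sketch of how a process might read the three settings from the environment step, the snippet below loads them with the values from this guide as defaults. `CoordinatorSettings` and `load_settings` are hypothetical names used for illustration only; the coordinator's actual configuration code may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinatorSettings:
    """Hypothetical container for the settings shown in /etc/aitbc/.env."""
    redis_url: str
    port: int
    log_level: str


def load_settings(env=None) -> CoordinatorSettings:
    """Read AITBC_* variables, falling back to the values used in this guide."""
    env = os.environ if env is None else env
    port = int(env.get("AITBC_COORDINATOR_PORT", "9001"))
    if not 1 <= port <= 65535:
        raise ValueError(f"AITBC_COORDINATOR_PORT out of range: {port}")
    return CoordinatorSettings(
        redis_url=env.get("AITBC_REDIS_URL", "redis://localhost:6379"),
        port=port,
        log_level=env.get("AITBC_LOG_LEVEL", "INFO").upper(),
    )
```

Calling `load_settings()` with no argument reads the real process environment; passing a plain dict makes the function easy to check before a deployment.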
Service Configuration
Service file location: /etc/systemd/system/aitbc-agent-coordinator.service
Key configuration parameters:
- PYTHONPATH=apps/agent-coordinator/src - Python module path
- uvicorn app.main:app - FastAPI application entry point
- --host 0.0.0.0 - Bind to all interfaces
- --port 9001 - Service port
Redis Configuration
Connection URL: redis://localhost:6379/0
Redis data persistence:
- Agent data: agent:{agent_id} (hash)
- Active agents: agents:active (set)
- Load metrics: Stored in agent hash
Redis monitoring:
redis-cli
> KEYS agent:*
> SMEMBERS agents:active
> HGETALL agent:hermes-agent
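The key layout above can be captured in two small helpers, which is handy when scripting against Redis. This is an illustrative sketch: the function and constant names are hypothetical and not taken from the coordinator source.

```python
ACTIVE_AGENTS_KEY = "agents:active"  # set holding the IDs of active agents
AGENT_KEY_PREFIX = "agent:"          # prefix of each per-agent hash


def agent_key(agent_id: str) -> str:
    """Return the Redis hash key for an agent, e.g. agent:hermes-agent."""
    if not agent_id:
        raise ValueError("agent_id must not be empty")
    return f"{AGENT_KEY_PREFIX}{agent_id}"


def agent_id_from_key(key: str) -> str:
    """Recover the agent ID from a key returned by KEYS agent:*."""
    if not key.startswith(AGENT_KEY_PREFIX):
        raise ValueError(f"not an agent key: {key!r}")
    # Slice off only the prefix so IDs that contain ":" stay intact.
    return key[len(AGENT_KEY_PREFIX):]
```

Note that `agents:active` does not start with `agent:`, so the pattern `agent:*` never returns the active set itself.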
Agent Registration Procedures
Manual Registration via CLI
Basic registration:
aitbc-cli agent sdk register \
--agent-id my-agent \
--type worker \
--coordinator-url http://localhost:9001
Full registration with capabilities:
aitbc-cli agent sdk register \
--agent-id my-agent \
--type worker \
--capabilities "data-processing,analysis,debugging" \
--services "task-execution,coordination" \
--endpoints '{"http":"http://my-host:9002"}' \
--metadata '{"version":"1.0.0","owner":"my-team"}' \
--coordinator-url http://localhost:9001
Automated Registration Script
#!/bin/bash
# register_agents.sh
COORDINATOR_URL="http://localhost:9001"
register_agent() {
local agent_id=$1
local agent_type=$2
local capabilities=$3
aitbc-cli agent sdk register \
--agent-id "$agent_id" \
--type "$agent_type" \
--capabilities "$capabilities" \
--coordinator-url "$COORDINATOR_URL"
}
# Register agents
register_agent "worker-1" "worker" "data-processing,analysis"
register_agent "worker-2" "worker" "data-processing,analysis"
register_agent "worker-3" "worker" "inference,training"
Cross-Node Registration
Register agents on multiple nodes for distributed task distribution:
# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
-H "Content-Type: application/json" \
-d '{
"agent_id": "aitbc1-worker",
"agent_type": "worker",
"capabilities": ["data-processing"],
"endpoints": {"http": "http://aitbc1:9002"}
}'
# Register agent on aitbc2
curl -X POST http://aitbc2:9001/agents/register \
-H "Content-Type: application/json" \
-d '{
"agent_id": "aitbc2-worker",
"agent_type": "worker",
"capabilities": ["inference"],
"endpoints": {"http": "http://aitbc2:9002"}
}'
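The same registration call can be built from Python with only the standard library. The sketch below constructs the request without sending it, so it can be inspected first; `build_register_request` is a hypothetical helper, and the payload fields mirror the curl examples above.

```python
import json
import urllib.request


def build_register_request(coordinator_url, agent_id, agent_type,
                           capabilities, http_endpoint):
    """Build (but do not send) the POST /agents/register request."""
    payload = {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "capabilities": list(capabilities),
        "endpoints": {"http": http_endpoint},
    }
    return urllib.request.Request(
        url=f"{coordinator_url.rstrip('/')}/agents/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a reachable coordinator:
#   with urllib.request.urlopen(request, timeout=10) as response:
#       print(response.status)
```

Building one request per node (for example `http://aitbc1:9001` and `http://aitbc2:9001`) reproduces the two curl calls.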
Monitoring and Troubleshooting
Health Checks
Service health:
curl http://localhost:9001/health
Expected response:
{
"status": "healthy",
"version": "1.0.0",
"timestamp": "2026-05-07T16:00:00.000000+00:00"
}
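For scripted monitoring, the response body can be checked programmatically. The following is a sketch that assumes the response always has the shape shown above; `is_healthy` is a hypothetical helper, not part of aitbc-cli.

```python
import json
from datetime import datetime


def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports a healthy service."""
    try:
        data = json.loads(body)
        # The timestamp in the example is ISO 8601 with a UTC offset.
        datetime.fromisoformat(data["timestamp"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing fields, or an unparseable timestamp.
        return False
    return data.get("status") == "healthy"
```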
Task distribution stats:
curl http://localhost:9001/tasks/status
CLI health check:
aitbc-cli ai status
Service Status
Check systemd service:
systemctl status aitbc-agent-coordinator.service
View service logs:
journalctl -u aitbc-agent-coordinator.service -f
View recent logs:
journalctl -u aitbc-agent-coordinator.service -n 100
Agent Monitoring
List all agents:
aitbc-cli agent sdk list
List active agents only:
aitbc-cli agent sdk list --status active
Check specific agent:
aitbc-cli agent sdk status --agent-id my-agent
Monitor distribution stats:
aitbc-cli ai distribution-stats
Redis Monitoring
Check Redis connection:
redis-cli ping
View all registered agents:
redis-cli
> KEYS agent:*
View active agents:
redis-cli
> SMEMBERS agents:active
View agent details:
redis-cli
> HGETALL agent:my-agent
Monitor Redis memory:
redis-cli INFO memory
Common Issues and Solutions
Service won't start
Symptoms:
Failed to start aitbc-agent-coordinator.service
Solutions:
- Check Redis is running:
systemctl status redis
- Check Redis connection:
redis-cli ping
- Check service logs:
journalctl -u aitbc-agent-coordinator.service -n 50
- Verify PYTHONPATH:
echo $PYTHONPATH
# Should include: /opt/aitbc/apps/agent-coordinator/src
No agents discovered
Symptoms:
aitbc-cli agent sdk list
Found 0 agents
Solutions:
- Check if agents are registered:
redis-cli SMEMBERS agents:active
- Register an agent:
aitbc-cli agent sdk register --agent-id test-agent --type worker
- Check agent status:
aitbc-cli agent sdk status --agent-id test-agent
Tasks not distributing
Symptoms:
- Tasks submitted but not assigned
- tasks_distributed count not increasing
Solutions:
- Check for active agents:
aitbc-cli agent sdk list --status active
- Check task distributor status:
curl http://localhost:9001/tasks/status
- Verify agent capabilities match task requirements
- Check load balancer strategy
- Review service logs for errors
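When verifying that agent capabilities match task requirements, it can help to reason about the check explicitly. The sketch below assumes an agent is eligible only if it advertises every capability the task requires; the coordinator's real matching rule may be different, and `eligible_agents` is a hypothetical name.

```python
def eligible_agents(agents, required_capabilities):
    """Return IDs of agents whose capabilities cover every requirement.

    `agents` maps agent ID -> iterable of capability names.
    """
    required = set(required_capabilities)
    return sorted(
        agent_id
        for agent_id, capabilities in agents.items()
        if required <= set(capabilities)  # subset test: all requirements met
    )
```

An empty result for a task's requirements would explain why that task stays unassigned even though agents are active.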
Agent marked as stale
Symptoms:
- Agent status changes from active to stale
- Agent not receiving new tasks
Solutions:
- Update agent status:
aitbc-cli agent sdk update-status --agent-id my-agent --status active
- Check heartbeat mechanism (if implemented)
- Verify agent is still running
- Check network connectivity
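Stale detection is usually a comparison between the last heartbeat time and a timeout. The sketch below illustrates that idea only: the 60-second default is an assumed value, and `is_stale` is a hypothetical function rather than the coordinator's implementation.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_heartbeat: datetime, now: datetime,
             timeout: timedelta = timedelta(seconds=60)) -> bool:
    """Return True when the last heartbeat is older than the timeout."""
    return now - last_heartbeat > timeout


def utc_now() -> datetime:
    """Timezone-aware current time, suitable for the `now` argument."""
    return datetime.now(timezone.utc)
```

Both timestamps must be either timezone-aware or naive; mixing the two raises a TypeError in Python.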
Redis connection errors
Symptoms:
Error connecting to Redis
Solutions:
- Check Redis service:
systemctl status redis
- Restart Redis:
systemctl restart redis
- Check Redis configuration:
redis-cli INFO server
- Verify Redis URL in environment:
echo $AITBC_REDIS_URL
Performance Tuning
Load Balancing Strategies
Current default: LEAST_CONNECTIONS
Available strategies:
- LEAST_CONNECTIONS - Fewest active connections
- ROUND_ROBIN - Circular distribution
- WEIGHTED_ROUND_ROBIN - Performance-based
- RESOURCE_BASED - CPU/memory metrics
- GEOGRAPHIC - Location-based
- RANDOM - Random selection (testing)
Changing strategy: (requires code modification in lifespan.py)
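To make the difference between two of these strategies concrete, here is a small illustration of least-connections and round-robin selection. It is a generic sketch of the techniques, not the coordinator's load balancer; the function names and the tie-breaking rule are assumptions.

```python
import itertools


def pick_least_connections(active_connections):
    """Return the agent with the fewest active connections.

    `active_connections` maps agent ID -> current connection count.
    Ties are broken by agent ID so the result is deterministic.
    """
    if not active_connections:
        raise ValueError("no agents available")
    return min(active_connections, key=lambda a: (active_connections[a], a))


def round_robin(agent_ids):
    """Return an iterator that cycles through agents in a fixed order."""
    if not agent_ids:
        raise ValueError("no agents available")
    return itertools.cycle(sorted(agent_ids))
```

Least-connections reacts to current load, while round-robin ignores load and simply rotates, which is why the former is the more common default when task durations vary.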
Priority Queue Configuration
Priority levels:
- urgent
- critical
- high
- normal
- low
Queue sizing: Configured in TaskDistributor class
Monitoring queue sizes:
curl http://localhost:9001/tasks/status | jq .stats.queue_sizes
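The behaviour of a priority queue with these five levels can be sketched as follows. This toy class assumes the levels are listed from highest to lowest priority and that tasks within a level are served first-in, first-out; it is not the TaskDistributor class, and `PriorityTaskQueue` is a hypothetical name.

```python
import heapq
import itertools

# Assumed ordering: the first entry is served first.
PRIORITIES = ["urgent", "critical", "high", "normal", "low"]
_RANK = {name: rank for rank, name in enumerate(PRIORITIES)}


class PriorityTaskQueue:
    """Toy priority queue for illustration only."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within a level

    def put(self, task_id, priority="normal"):
        heapq.heappush(self._heap, (_RANK[priority], next(self._counter), task_id))

    def get(self):
        """Remove and return the highest-priority, oldest task ID."""
        return heapq.heappop(self._heap)[2]

    def sizes(self):
        """Return the number of queued tasks per priority level."""
        counts = {name: 0 for name in PRIORITIES}
        for rank, _, _ in self._heap:
            counts[PRIORITIES[rank]] += 1
        return counts
```

The `sizes()` result has one entry per priority level, similar in spirit to a per-priority queue size report.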
Resource Limits
Redis memory limits:
redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
Service memory limits: (configure in systemd service file)
MemoryMax=2G
MemorySwapMax=2G
Connection limits: (configure in uvicorn startup)
--limit-concurrency 100
Security Considerations
Network Security
Bind to specific interface:
# In service file, change --host 0.0.0.0 to --host 127.0.0.1 for local only
--host 127.0.0.1
Use firewall:
# Allow only specific IPs
ufw allow from 192.168.1.0/24 to any port 9001
Authentication
Future implementation: API key authentication and JWT tokens
Current status: No authentication (open access)
Recommendation: Deploy behind reverse proxy with authentication
Data Encryption
Redis encryption: Configure Redis with TLS
API encryption: Use HTTPS in production
Backup and Recovery
Redis Backup
Manual backup:
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
Automated backup:
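#!/bin/bash
# backup_redis.sh
redis-cli BGSAVE
# Wait for the background save to finish instead of sleeping a fixed time
while redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:1'; do
  sleep 1
done
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d-%H%M%S).rdb
# Keep last 7 days
find /backup -name "redis-*.rdb" -mtime +7 -delete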
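The retention rule can also be expressed in Python, for example to preview which files a cleanup would remove. Unlike `find -mtime`, which looks at file modification time, this sketch uses the timestamp embedded in the file name written by the automated backup script; `expired_backups` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def expired_backups(filenames, now, keep_days=7):
    """Return backup names whose embedded timestamp is older than keep_days.

    Expects names like redis-20260507-160000.rdb; names that do not match
    that pattern are left alone.
    """
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        try:
            stamp = datetime.strptime(name, "redis-%Y%m%d-%H%M%S.rdb")
        except ValueError:
            continue  # not a timestamped backup file
        if stamp < cutoff:
            expired.append(name)
    return expired
```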
Restore from backup:
systemctl stop redis
cp /backup/redis-20260507.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
systemctl start redis
Service Configuration Backup
Backup service file:
cp /etc/systemd/system/aitbc-agent-coordinator.service /backup/
Backup environment:
cp /etc/aitbc/.env /backup/
Scaling
Horizontal Scaling
Multiple coordinator instances:
- Deploy multiple coordinator instances behind load balancer
- Use shared Redis instance
- Configure consistent PYTHONPATH across instances
Load balancer configuration:
upstream coordinator {
server localhost:9001;
server localhost:9002;
server localhost:9003;
}
server {
listen 80;
location / {
proxy_pass http://coordinator;
}
}
Redis Clustering
For high availability:
- Use Redis Sentinel for failover
- Use Redis Cluster for sharding
- Configure coordinator to use Redis Sentinel
Maintenance
Regular Maintenance Tasks
Daily:
- Monitor service health
- Check task distribution stats
- Review error logs
Weekly:
- Backup Redis data
- Review agent registrations
- Clean up stale agents
Monthly:
- Review performance metrics
- Update software dependencies
- Audit security configurations
Agent Cleanup
Remove inactive agents:
redis-cli
> SREM agents:active "stale-agent-id"
> DEL agent:stale-agent-id
Bulk cleanup script:
#!/bin/bash
# cleanup_stale_agents.sh
redis-cli --scan --pattern "agent:*" | while read -r key; do
  status=$(redis-cli HGET "$key" status)
  if [ "$status" = "stale" ]; then
    # Strip only the "agent:" prefix so IDs containing ":" stay intact
    agent_id=${key#agent:}
    redis-cli SREM agents:active "$agent_id"
    redis-cli DEL "$key"
    echo "Removed stale agent: $agent_id"
  fi
done
Service Restart
Graceful restart (works only if the service file defines ExecReload=):
systemctl reload aitbc-agent-coordinator.service
Force restart:
systemctl restart aitbc-agent-coordinator.service
Rolling restart (multiple instances):
for i in {1..3}; do
systemctl restart aitbc-agent-coordinator@$i.service
sleep 10
done
Alerting
Recommended Alerts
Service alerts:
- Service down (health check fails)
- High error rate (> 5%)
- High response time (> 5s)
Agent alerts:
- No active agents
- Agent registration failures
- Agent stale count increasing
Task alerts:
- Task queue backlog (> 100 tasks)
- Task failure rate (> 10%)
- Distribution time increasing
Redis alerts:
- Redis connection failures
- Redis memory usage > 80%
- Redis latency > 100ms
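The numeric thresholds above translate directly into a simple rule check. The sketch below is illustrative only: the metric names are hypothetical, since the guide does not define a metrics schema, and metrics missing from the input are treated as zero.

```python
# Thresholds taken from the alert lists above; metric names are hypothetical.
THRESHOLDS = {
    "error_rate_pct": 5,
    "response_time_s": 5,
    "queue_backlog": 100,
    "task_failure_rate_pct": 10,
    "redis_memory_pct": 80,
    "redis_latency_ms": 100,
}


def firing_alerts(metrics):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )
```

The comparison is strict, matching the ">" used in the lists: a backlog of exactly 100 tasks does not fire.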
Monitoring Tools
Prometheus metrics: (future implementation)
- Export metrics at /metrics endpoint
- Use Grafana for visualization
Log aggregation:
- Send logs to ELK stack
- Use Loki for log storage
- Configure alerting based on log patterns
Troubleshooting Checklist
When issues occur, check in this order:
- Service status
  - Service running?
  - Health check passing?
  - Logs showing errors?
- Redis status
  - Redis running?
  - Connection successful?
  - Memory usage normal?
- Agent status
  - Agents registered?
  - Agents active?
  - Agent capabilities valid?
- Task status
  - Tasks submitting?
  - Tasks distributing?
  - Tasks completing?
- Network
  - Connectivity to Redis?
  - Connectivity to agents?
  - Firewall rules correct?
- Configuration
  - Environment variables set?
  - PYTHONPATH correct?
  - Port available?