Add agent heartbeat and task queue management endpoints to coordinator API

- Add /agents/{agent_id}/heartbeat endpoint to receive and process agent heartbeats - Add /tasks/queues endpoint to retrieve task queue sizes across all priorities - Add /tasks/queues/{priority}/clear endpoint to clear specific priority queues - Add /tasks/queues/stats endpoint to get detailed queue and distribution statistics - Implement get_queue_sizes() method in TaskDistributor to return queue sizes by priority - Implement clear_queue() method in TaskDistributor to drain
2026-05-07 18:49:17 +02:00
parent 5343d20f6d
commit a9e727dac8
8 changed files with 2331 additions and 0 deletions
--- a/docs/agent-coordinator/OPERATOR_GUIDE.md
+++ b/docs/agent-coordinator/OPERATOR_GUIDE.md
@@ -0,0 +1,645 @@
+# AITBC Agent Coordinator - Operator Guide
+
+This guide provides operators with the knowledge to deploy, configure, monitor, and troubleshoot the AITBC Agent Coordinator service.
+
+## Service Deployment
+
+### Prerequisites
+
+- Redis server running on localhost or remote host
+- Python 3.13+
+- Systemd (for service management)
+- AITBC blockchain node (optional, for blockchain integration)
+
+### Installation
+
+1. **Install dependencies:**
+```bash
+cd /opt/aitbc/apps/agent-coordinator
+pip install -r requirements.txt
+```
+
+2. **Configure environment:**
+```bash
+# Edit /etc/aitbc/.env
+export AITBC_REDIS_URL=redis://localhost:6379
+export AITBC_COORDINATOR_PORT=9001
+export AITBC_LOG_LEVEL=INFO
+```
+
+3. **Start Redis:**
+```bash
+systemctl start redis
+systemctl enable redis
+```
+
+4. **Start coordinator service:**
+```bash
+systemctl start aitbc-agent-coordinator.service
+systemctl enable aitbc-agent-coordinator.service
+```
+
+### Service Configuration
+
+**Service file location:** `/etc/systemd/system/aitbc-agent-coordinator.service`
+
+**Key configuration parameters:**
+- `PYTHONPATH=apps/agent-coordinator/src` - Python module path
+- `uvicorn app.main:app` - FastAPI application entry point
+- `--host 0.0.0.0` - Bind to all interfaces
+- `--port 9001` - Service port
+
+### Redis Configuration
+
+**Connection URL:** `redis://localhost:6379/0`
+
+**Redis data persistence:**
+- Agent data: `agent:{agent_id}` (hash)
+- Active agents: `agents:active` (set)
+- Load metrics: Stored in agent hash
+
+**Redis monitoring:**
+```bash
+redis-cli
+> KEYS agent:*
+> SMEMBERS agents:active
+> HGETALL agent:hermes-agent
+```
+
+## Agent Registration Procedures
+
+### Manual Registration via CLI
+
+**Basic registration:**
+```bash
+aitbc-cli agent sdk register \
+  --agent-id my-agent \
+  --type worker \
+  --coordinator-url http://localhost:9001
+```
+
+**Full registration with capabilities:**
+```bash
+aitbc-cli agent sdk register \
+  --agent-id my-agent \
+  --type worker \
+  --capabilities "data-processing,analysis,debugging" \
+  --services "task-execution,coordination" \
+  --endpoints '{"http":"http://my-host:9002"}' \
+  --metadata '{"version":"1.0.0","owner":"my-team"}' \
+  --coordinator-url http://localhost:9001
+```
+
+### Automated Registration Script
+
+```bash
+#!/bin/bash
+# register_agents.sh
+
+COORDINATOR_URL="http://localhost:9001"
+
+register_agent() {
+  local agent_id=$1
+  local agent_type=$2
+  local capabilities=$3
+  
+  aitbc-cli agent sdk register \
+    --agent-id "$agent_id" \
+    --type "$agent_type" \
+    --capabilities "$capabilities" \
+    --coordinator-url "$COORDINATOR_URL"
+}
+
+# Register agents
+register_agent "worker-1" "worker" "data-processing,analysis"
+register_agent "worker-2" "worker" "data-processing,analysis"
+register_agent "worker-3" "worker" "inference,training"
+```
+
+### Cross-Node Registration
+
+Register agents on multiple nodes for distributed task distribution:
+
+```bash
+# Register agent on aitbc1
+curl -X POST http://aitbc1:9001/agents/register \
+  -H "Content-Type: application/json" \
+  -d '{
+    "agent_id": "aitbc1-worker",
+    "agent_type": "worker",
+    "capabilities": ["data-processing"],
+    "endpoints": {"http": "http://aitbc1:9002"}
+  }'
+
+# Register agent on aitbc2
+curl -X POST http://aitbc2:9001/agents/register \
+  -H "Content-Type: application/json" \
+  -d '{
+    "agent_id": "aitbc2-worker",
+    "agent_type": "worker",
+    "capabilities": ["inference"],
+    "endpoints": {"http": "http://aitbc2:9002"}
+  }'
+```
+
+## Monitoring and Troubleshooting
+
+### Health Checks
+
+**Service health:**
+```bash
+curl http://localhost:9001/health
+```
+
+**Expected response:**
+```json
+{
+  "status": "healthy",
+  "version": "1.0.0",
+  "timestamp": "2026-05-07T16:00:00.000000+00:00"
+}
+```
+
+**Task distribution stats:**
+```bash
+curl http://localhost:9001/tasks/status
+```
+
+**CLI health check:**
+```bash
+aitbc-cli ai status
+```
+
+### Service Status
+
+**Check systemd service:**
+```bash
+systemctl status aitbc-agent-coordinator.service
+```
+
+**View service logs:**
+```bash
+journalctl -u aitbc-agent-coordinator.service -f
+```
+
+**View recent logs:**
+```bash
+journalctl -u aitbc-agent-coordinator.service -n 100
+```
+
+### Agent Monitoring
+
+**List all agents:**
+```bash
+aitbc-cli agent sdk list
+```
+
+**List active agents only:**
+```bash
+aitbc-cli agent sdk list --status active
+```
+
+**Check specific agent:**
+```bash
+aitbc-cli agent sdk status --agent-id my-agent
+```
+
+**Monitor distribution stats:**
+```bash
+aitbc-cli ai distribution-stats
+```
+
+### Redis Monitoring
+
+**Check Redis connection:**
+```bash
+redis-cli ping
+```
+
+**View all registered agents:**
+```bash
+redis-cli
+> KEYS agent:*
+```
+
+**View active agents:**
+```bash
+redis-cli
+> SMEMBERS agents:active
+```
+
+**View agent details:**
+```bash
+redis-cli
+> HGETALL agent:my-agent
+```
+
+**Monitor Redis memory:**
+```bash
+redis-cli INFO memory
+```
+
+### Common Issues and Solutions
+
+#### Service won't start
+
+**Symptoms:**
+```
+Failed to start aitbc-agent-coordinator.service
+```
+
+**Solutions:**
+1. Check Redis is running:
+```bash
+systemctl status redis
+```
+
+2. Check Redis connection:
+```bash
+redis-cli ping
+```
+
+3. Check service logs:
+```bash
+journalctl -u aitbc-agent-coordinator.service -n 50
+```
+
+4. Verify PYTHONPATH:
+```bash
+echo $PYTHONPATH
+# Should include: /opt/aitbc/apps/agent-coordinator/src
+```
+
+#### No agents discovered
+
+**Symptoms:**
+```bash
+aitbc-cli agent sdk list
+Found 0 agents
+```
+
+**Solutions:**
+1. Check if agents are registered:
+```bash
+redis-cli SMEMBERS agents:active
+```
+
+2. Register an agent:
+```bash
+aitbc-cli agent sdk register --agent-id test-agent --type worker
+```
+
+3. Check agent status:
+```bash
+aitbc-cli agent sdk status --agent-id test-agent
+```
+
+#### Tasks not distributing
+
+**Symptoms:**
+- Tasks submitted but not assigned
+- `tasks_distributed` count not increasing
+
+**Solutions:**
+1. Check for active agents:
+```bash
+aitbc-cli agent sdk list --status active
+```
+
+2. Check task distributor status:
+```bash
+curl http://localhost:9001/tasks/status
+```
+
+3. Verify agent capabilities match task requirements
+4. Check load balancer strategy
+5. Review service logs for errors
+
+#### Agent marked as stale
+
+**Symptoms:**
+- Agent status changes from active to stale
+- Agent not receiving new tasks
+
+**Solutions:**
+1. Update agent status:
+```bash
+aitbc-cli agent sdk update-status --agent-id my-agent --status active
+```
+
+2. Check heartbeat mechanism (if implemented)
+3. Verify agent is still running
+4. Check network connectivity
+
+#### Redis connection errors
+
+**Symptoms:**
+```
+Error connecting to Redis
+```
+
+**Solutions:**
+1. Check Redis service:
+```bash
+systemctl status redis
+```
+
+2. Restart Redis:
+```bash
+systemctl restart redis
+```
+
+3. Check Redis configuration:
+```bash
+redis-cli INFO server
+```
+
+4. Verify Redis URL in environment:
+```bash
+echo $AITBC_REDIS_URL
+```
+
+## Performance Tuning
+
+### Load Balancing Strategies
+
+**Current default:** `LEAST_CONNECTIONS`
+
+**Available strategies:**
+- `LEAST_CONNECTIONS` - Fewest active connections
+- `ROUND_ROBIN` - Circular distribution
+- `WEIGHTED_ROUND_ROBIN` - Performance-based
+- `RESOURCE_BASED` - CPU/memory metrics
+- `GEOGRAPHIC` - Location-based
+- `RANDOM` - Random selection (testing)
+
+**Changing strategy:** (requires code modification in `lifespan.py`)
+
+### Priority Queue Configuration
+
+**Priority levels:**
+1. urgent
+2. critical
+3. high
+4. normal
+5. low
+
+**Queue sizing:** Configured in `TaskDistributor` class
+
+**Monitoring queue sizes:**
+```bash
+curl http://localhost:9001/tasks/status | jq .stats.queue_sizes
+```
+
+### Resource Limits
+
+**Redis memory limits:**
+```bash
+redis-cli CONFIG SET maxmemory 1gb
+redis-cli CONFIG SET maxmemory-policy allkeys-lru
+```
+
+**Service memory limits:** (configure in systemd service file)
+```
+MemoryLimit=2G
+MemorySwap=2G
+```
+
+**Connection limits:** (configure in uvicorn startup)
+```
+--limit-concurrency 100
+```
+
+## Security Considerations
+
+### Network Security
+
+**Bind to specific interface:**
+```bash
+# In service file, change --host 0.0.0.0 to --host 127.0.0.1 for local only
+--host 127.0.0.1
+```
+
+**Use firewall:**
+```bash
+# Allow only specific IPs
+ufw allow from 192.168.1.0/24 to any port 9001
+```
+
+### Authentication
+
+**Future implementation:** API key authentication and JWT tokens
+
+**Current status:** No authentication (open access)
+
+**Recommendation:** Deploy behind reverse proxy with authentication
+
+### Data Encryption
+
+**Redis encryption:** Configure Redis with TLS
+**API encryption:** Use HTTPS in production
+
+## Backup and Recovery
+
+### Redis Backup
+
+**Manual backup:**
+```bash
+redis-cli SAVE
+cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
+```
+
+**Automated backup:**
+```bash
+#!/bin/bash
+# backup_redis.sh
+redis-cli BGSAVE
+sleep 5
+cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d-%H%M%S).rdb
+# Keep last 7 days
+find /backup -name "redis-*.rdb" -mtime +7 -delete
+```
+
+**Restore from backup:**
+```bash
+systemctl stop redis
+cp /backup/redis-20260507.rdb /var/lib/redis/dump.rdb
+chown redis:redis /var/lib/redis/dump.rdb
+systemctl start redis
+```
+
+### Service Configuration Backup
+
+**Backup service file:**
+```bash
+cp /etc/systemd/system/aitbc-agent-coordinator.service /backup/
+```
+
+**Backup environment:**
+```bash
+cp /etc/aitbc/.env /backup/
+```
+
+## Scaling
+
+### Horizontal Scaling
+
+**Multiple coordinator instances:**
+1. Deploy multiple coordinator instances behind load balancer
+2. Use shared Redis instance
+3. Configure consistent PYTHONPATH across instances
+
+**Load balancer configuration:**
+```nginx
+upstream coordinator {
+    server localhost:9001;
+    server localhost:9002;
+    server localhost:9003;
+}
+
+server {
+    listen 80;
+    location / {
+        proxy_pass http://coordinator;
+    }
+}
+```
+
+### Redis Clustering
+
+**For high availability:**
+- Use Redis Sentinel for failover
+- Use Redis Cluster for sharding
+- Configure coordinator to use Redis Sentinel
+
+## Maintenance
+
+### Regular Maintenance Tasks
+
+**Daily:**
+- Monitor service health
+- Check task distribution stats
+- Review error logs
+
+**Weekly:**
+- Backup Redis data
+- Review agent registrations
+- Clean up stale agents
+
+**Monthly:**
+- Review performance metrics
+- Update software dependencies
+- Audit security configurations
+
+### Agent Cleanup
+
+**Remove inactive agents:**
+```bash
+redis-cli
+> SREM agents:active "stale-agent-id"
+> DEL agent:stale-agent-id
+```
+
+**Bulk cleanup script:**
+```bash
+#!/bin/bash
+# cleanup_stale_agents.sh
+redis-cli --scan --pattern "agent:*" | while read key; do
+  status=$(redis-cli HGET "$key" status)
+  if [ "$status" = "stale" ]; then
+    agent_id=$(echo "$key" | cut -d: -f2)
+    redis-cli SREM agents:active "$agent_id"
+    redis-cli DEL "$key"
+    echo "Removed stale agent: $agent_id"
+  fi
+done
+```
+
+### Service Restart
+
+**Graceful restart:**
+```bash
+systemctl reload aitbc-agent-coordinator.service
+```
+
+**Force restart:**
+```bash
+systemctl restart aitbc-agent-coordinator.service
+```
+
+**Rolling restart (multiple instances):**
+```bash
+for i in {1..3}; do
+  systemctl restart aitbc-agent-coordinator@$i.service
+  sleep 10
+done
+```
+
+## Alerting
+
+### Recommended Alerts
+
+**Service alerts:**
+- Service down (health check fails)
+- High error rate (> 5%)
+- High response time (> 5s)
+
+**Agent alerts:**
+- No active agents
+- Agent registration failures
+- Agent stale count increasing
+
+**Task alerts:**
+- Task queue backlog (> 100 tasks)
+- Task failure rate (> 10%)
+- Distribution time increasing
+
+**Redis alerts:**
+- Redis connection failures
+- Redis memory usage > 80%
+- Redis latency > 100ms
+
+### Monitoring Tools
+
+**Prometheus metrics:** (future implementation)
+- Export metrics at `/metrics` endpoint
+- Use Grafana for visualization
+
+**Log aggregation:**
+- Send logs to ELK stack
+- Use Loki for log storage
+- Configure alerting based on log patterns
+
+## Troubleshooting Checklist
+
+When issues occur, check in this order:
+
+1. **Service status**
+   - [ ] Service running?
+   - [ ] Health check passing?
+   - [ ] Logs showing errors?
+
+2. **Redis status**
+   - [ ] Redis running?
+   - [ ] Connection successful?
+   - [ ] Memory usage normal?
+
+3. **Agent status**
+   - [ ] Agents registered?
+   - [ ] Agents active?
+   - [ ] Agent capabilities valid?
+
+4. **Task status**
+   - [ ] Tasks submitting?
+   - [ ] Tasks distributing?
+   - [ ] Tasks completing?
+
+5. **Network**
+   - [ ] Connectivity to Redis?
+   - [ ] Connectivity to agents?
+   - [ ] Firewall rules correct?
+
+6. **Configuration**
+   - [ ] Environment variables set?
+   - [ ] PYTHONPATH correct?
+   - [ ] Port available?