AITBC Agent Coordinator - Operator Guide

This guide provides operators with the knowledge to deploy, configure, monitor, and troubleshoot the AITBC Agent Coordinator service.

Service Deployment

Prerequisites

  • Redis server running on localhost or remote host
  • Python 3.13+
  • Systemd (for service management)
  • AITBC blockchain node (optional, for blockchain integration)

Installation

  1. Install dependencies:
cd /opt/aitbc/apps/agent-coordinator
pip install -r requirements.txt
  2. Configure environment:
# Edit /etc/aitbc/.env
export AITBC_REDIS_URL=redis://localhost:6379
export AITBC_COORDINATOR_PORT=9001
export AITBC_LOG_LEVEL=INFO
  3. Start Redis:
systemctl start redis
systemctl enable redis
  4. Start coordinator service:
systemctl start aitbc-agent-coordinator.service
systemctl enable aitbc-agent-coordinator.service
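
Before starting the service, it can help to confirm that the environment file defines the three variables from step 2. The following is a minimal sketch and not part of the AITBC tooling: `parse_env` and `missing_vars` are hypothetical helper names, and the parser assumes simple `KEY=VALUE` lines with an optional `export` prefix.

```python
REQUIRED_VARS = ("AITBC_REDIS_URL", "AITBC_COORDINATOR_PORT", "AITBC_LOG_LEVEL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if sep:
            # Drop surrounding quotes, if any, around the value.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(text: str, required=REQUIRED_VARS) -> list[str]:
    """Return the required variable names that are absent or empty."""
    env = parse_env(text)
    return [name for name in required if not env.get(name)]


# Sample content; in practice read the text from /etc/aitbc/.env.
sample = """\
# Edit /etc/aitbc/.env
export AITBC_REDIS_URL=redis://localhost:6379
export AITBC_COORDINATOR_PORT=9001
"""
print(missing_vars(sample))  # ['AITBC_LOG_LEVEL']
```

An empty list means all three variables are present.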

Service Configuration

Service file location: /etc/systemd/system/aitbc-agent-coordinator.service

Key configuration parameters:

  • PYTHONPATH=apps/agent-coordinator/src - Python module path
  • uvicorn app.main:app - FastAPI application entry point
  • --host 0.0.0.0 - Bind to all interfaces
  • --port 9001 - Service port

Redis Configuration

Connection URL: redis://localhost:6379/0

Redis key layout:

  • Agent data: agent:{agent_id} (hash)
  • Active agents: agents:active (set)
  • Load metrics: Stored in agent hash

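Redis monitoring (KEYS walks the whole keyspace and blocks the server while it runs; on large instances prefer redis-cli --scan --pattern "agent:*"):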

redis-cli
> KEYS agent:*
> SMEMBERS agents:active
> HGETALL agent:hermes-agent

Agent Registration Procedures

Manual Registration via CLI

Basic registration:

aitbc-cli agent sdk register \
  --agent-id my-agent \
  --type worker \
  --coordinator-url http://localhost:9001

Full registration with capabilities:

aitbc-cli agent sdk register \
  --agent-id my-agent \
  --type worker \
  --capabilities "data-processing,analysis,debugging" \
  --services "task-execution,coordination" \
  --endpoints '{"http":"http://my-host:9002"}' \
  --metadata '{"version":"1.0.0","owner":"my-team"}' \
  --coordinator-url http://localhost:9001

Automated Registration Script

#!/bin/bash
# register_agents.sh

COORDINATOR_URL="http://localhost:9001"

register_agent() {
  local agent_id=$1
  local agent_type=$2
  local capabilities=$3
  
  aitbc-cli agent sdk register \
    --agent-id "$agent_id" \
    --type "$agent_type" \
    --capabilities "$capabilities" \
    --coordinator-url "$COORDINATOR_URL"
}

# Register agents
register_agent "worker-1" "worker" "data-processing,analysis"
register_agent "worker-2" "worker" "data-processing,analysis"
register_agent "worker-3" "worker" "inference,training"

Cross-Node Registration

Register agents on multiple nodes for distributed task distribution:

# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "aitbc1-worker",
    "agent_type": "worker",
    "capabilities": ["data-processing"],
    "endpoints": {"http": "http://aitbc1:9002"}
  }'

# Register agent on aitbc2
curl -X POST http://aitbc2:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "aitbc2-worker",
    "agent_type": "worker",
    "capabilities": ["inference"],
    "endpoints": {"http": "http://aitbc2:9002"}
  }'
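
The same registrations can be scripted from Python with only the standard library. This is a sketch under assumptions: `build_registration_request` is a hypothetical helper, the payload fields mirror the curl examples above, and the requests are built but not sent so the snippet runs without a live coordinator.

```python
import json
import urllib.request


def build_registration_request(coordinator_url, agent_id, agent_type,
                               capabilities, endpoints):
    """Build (but do not send) a POST request for /agents/register."""
    payload = {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "capabilities": list(capabilities),
        "endpoints": dict(endpoints),
    }
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/agents/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The same two registrations as the curl commands above.
requests_to_send = [
    build_registration_request(
        f"http://{node}:9001", f"{node}-worker", "worker",
        [capability], {"http": f"http://{node}:9002"},
    )
    for node, capability in [("aitbc1", "data-processing"),
                             ("aitbc2", "inference")]
]

# To actually send one: urllib.request.urlopen(req, timeout=10)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches a node.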

Monitoring and Troubleshooting

Health Checks

Service health:

curl http://localhost:9001/health

Expected response:

{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-05-07T16:00:00.000000+00:00"
}
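
A monitoring script can validate this response rather than only checking that the request succeeded. The sketch below assumes the three fields shown above; `check_health` and the clock-skew check (`max_skew`) are illustrative additions, not behaviour defined by the coordinator.

```python
import json
from datetime import datetime, timedelta, timezone


def check_health(body: str, now: datetime,
                 max_skew: timedelta = timedelta(minutes=5)):
    """Return (ok, reason) for a /health response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(data, dict):
        return False, "response is not a JSON object"
    if data.get("status") != "healthy":
        return False, f"status is {data.get('status')!r}"
    try:
        reported = datetime.fromisoformat(data["timestamp"])
    except (KeyError, TypeError, ValueError):
        return False, "timestamp missing or malformed"
    if reported.tzinfo is None:
        return False, "timestamp has no timezone offset"
    # Assumption: a timestamp far from the local clock suggests a stale reply.
    if abs(now - reported) > max_skew:
        return False, "timestamp differs from local clock by more than max_skew"
    return True, "ok"


body = ('{"status": "healthy", "version": "1.0.0", '
        '"timestamp": "2026-05-07T16:00:00.000000+00:00"}')
now = datetime(2026, 5, 7, 16, 1, tzinfo=timezone.utc)
print(check_health(body, now))  # (True, 'ok')
```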

Task distribution stats:

curl http://localhost:9001/tasks/status

CLI health check:

aitbc-cli ai status

Service Status

Check systemd service:

systemctl status aitbc-agent-coordinator.service

View service logs:

journalctl -u aitbc-agent-coordinator.service -f

View recent logs:

journalctl -u aitbc-agent-coordinator.service -n 100

Agent Monitoring

List all agents:

aitbc-cli agent sdk list

List active agents only:

aitbc-cli agent sdk list --status active

Check specific agent:

aitbc-cli agent sdk status --agent-id my-agent

Monitor distribution stats:

aitbc-cli ai distribution-stats

Redis Monitoring

Check Redis connection:

redis-cli ping

View all registered agents:

redis-cli
> KEYS agent:*

View active agents:

redis-cli
> SMEMBERS agents:active

View agent details:

redis-cli
> HGETALL agent:my-agent

Monitor Redis memory:

redis-cli INFO memory

Common Issues and Solutions

Service won't start

Symptoms:

Failed to start aitbc-agent-coordinator.service

Solutions:

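  1. Check Redis is running:
systemctl status redis
  2. Check Redis connection:
redis-cli ping
  3. Check service logs:
journalctl -u aitbc-agent-coordinator.service -n 50
  4. Verify the PYTHONPATH set in the unit file (echo $PYTHONPATH shows only the current shell's value, not the service's):
systemctl cat aitbc-agent-coordinator.service | grep PYTHONPATH
# Should include: /opt/aitbc/apps/agent-coordinator/src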

No agents discovered

Symptoms:

aitbc-cli agent sdk list
Found 0 agents

Solutions:

  1. Check if agents are registered:
redis-cli SMEMBERS agents:active
  2. Register an agent:
aitbc-cli agent sdk register --agent-id test-agent --type worker
  3. Check agent status:
aitbc-cli agent sdk status --agent-id test-agent

Tasks not distributing

Symptoms:

  • Tasks submitted but not assigned
  • tasks_distributed count not increasing

Solutions:

  1. Check for active agents:
aitbc-cli agent sdk list --status active
  2. Check task distributor status:
curl http://localhost:9001/tasks/status
  3. Verify agent capabilities match task requirements
  4. Check load balancer strategy
  5. Review service logs for errors
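
When checking whether agent capabilities match task requirements, a quick offline comparison can narrow things down. The sketch below assumes a task is eligible for an agent only when every required capability is present; the coordinator's actual matching rules may differ, and `eligible_agents` is a hypothetical helper.

```python
def eligible_agents(agents, required_capabilities):
    """Return IDs of agents whose capabilities cover all requirements."""
    required = set(required_capabilities)
    return sorted(
        agent_id for agent_id, capabilities in agents.items()
        if required <= set(capabilities)  # subset test
    )


# Capabilities taken from the registration script earlier in this guide.
agents = {
    "worker-1": ["data-processing", "analysis"],
    "worker-2": ["data-processing", "analysis"],
    "worker-3": ["inference", "training"],
}
print(eligible_agents(agents, ["inference"]))              # ['worker-3']
print(eligible_agents(agents, ["inference", "analysis"]))  # []
```

An empty result for a task's requirements would explain why that task stays unassigned.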

Agent marked as stale

Symptoms:

  • Agent status changes from active to stale
  • Agent not receiving new tasks

Solutions:

  1. Update agent status:
aitbc-cli agent sdk update-status --agent-id my-agent --status active
  2. Check heartbeat mechanism (if implemented)
  3. Verify agent is still running
  4. Check network connectivity
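
If a heartbeat mechanism is in place, staleness usually comes down to comparing each agent's last heartbeat time against a timeout. The following sketch illustrates that idea only: the 90-second timeout, the `last_heartbeat` mapping, and `stale_agents` are assumptions and do not describe the coordinator's internals.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=90)  # assumed value, not from the coordinator


def stale_agents(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return IDs whose last heartbeat is older than the timeout, sorted."""
    return sorted(
        agent_id for agent_id, seen in last_heartbeat.items()
        if now - seen > timeout
    )


now = datetime(2026, 5, 7, 16, 0, 0, tzinfo=timezone.utc)
last_heartbeat = {
    "worker-1": now - timedelta(seconds=10),
    "worker-2": now - timedelta(minutes=5),
}
print(stale_agents(last_heartbeat, now))  # ['worker-2']
```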

Redis connection errors

Symptoms:

Error connecting to Redis

Solutions:

  1. Check Redis service:
systemctl status redis
  2. Restart Redis:
systemctl restart redis
  3. Check Redis configuration:
redis-cli INFO server
  4. Verify the Redis URL the service reads (echo $AITBC_REDIS_URL shows only the current shell's value):
grep AITBC_REDIS_URL /etc/aitbc/.env

Performance Tuning

Load Balancing Strategies

Current default: LEAST_CONNECTIONS

Available strategies:

  • LEAST_CONNECTIONS - Fewest active connections
  • ROUND_ROBIN - Circular distribution
  • WEIGHTED_ROUND_ROBIN - Performance-based
  • RESOURCE_BASED - CPU/memory metrics
  • GEOGRAPHIC - Location-based
  • RANDOM - Random selection (testing)

Changing strategy: (requires code modification in lifespan.py)
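
To make the difference between the first two strategies concrete, here is a generic illustration of least-connections and round-robin selection. It is not the coordinator's implementation; the function names and the tie-breaking rule (lowest agent ID wins) are assumptions.

```python
import itertools


def least_connections(active_connections: dict[str, int]) -> str:
    """Pick the agent with the fewest active connections (ties: lowest ID)."""
    # min() returns the first minimum, so sorting the IDs breaks ties by ID.
    return min(sorted(active_connections), key=active_connections.__getitem__)


def round_robin(agent_ids):
    """Return an iterator that cycles through the agents in ID order."""
    return itertools.cycle(sorted(agent_ids))


connections = {"worker-1": 4, "worker-2": 1, "worker-3": 1}
print(least_connections(connections))  # worker-2

rr = round_robin(connections)
print([next(rr) for _ in range(4)])
# ['worker-1', 'worker-2', 'worker-3', 'worker-1']
```

Least-connections adapts to uneven task durations, while round-robin ignores current load and simply alternates.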

Priority Queue Configuration

Priority levels:

  1. urgent
  2. critical
  3. high
  4. normal
  5. low

Queue sizing: Configured in TaskDistributor class

Monitoring queue sizes:

curl http://localhost:9001/tasks/status | jq .stats.queue_sizes
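
The priority levels can be pictured as one FIFO queue per level, drained from highest to lowest. The sketch below assumes the list above is ordered from highest priority (urgent) to lowest (low); `PriorityQueues` is a hypothetical class for illustration and is not the TaskDistributor code.

```python
from collections import deque

# Assumed order: highest priority first, as listed in this guide.
PRIORITIES = ("urgent", "critical", "high", "normal", "low")


class PriorityQueues:
    """One FIFO queue per priority; get() serves the highest non-empty one."""

    def __init__(self):
        self._queues = {priority: deque() for priority in PRIORITIES}

    def put(self, task, priority="normal"):
        # Raises KeyError for an unknown priority name.
        self._queues[priority].append(task)

    def get(self):
        for priority in PRIORITIES:
            if self._queues[priority]:
                return self._queues[priority].popleft()
        return None  # all queues are empty

    def sizes(self):
        return {priority: len(q) for priority, q in self._queues.items()}


queues = PriorityQueues()
queues.put("rebuild-index", "low")
queues.put("process-batch")  # defaults to "normal"
queues.put("rotate-keys", "urgent")
print(queues.get())    # rotate-keys
print(queues.sizes())  # {'urgent': 0, 'critical': 0, 'high': 0, 'normal': 1, 'low': 1}
```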

Resource Limits

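Redis memory limits (CONFIG SET changes are lost on restart unless they are also written to redis.conf or persisted with CONFIG REWRITE; note that allkeys-lru may evict agent registration keys under memory pressure):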

redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

Service memory limits: (configure in the [Service] section of the systemd service file)

MemoryMax=2G
MemorySwapMax=2G

Connection limits: (configure in uvicorn startup)

--limit-concurrency 100

Security Considerations

Network Security

Bind to specific interface:

# In service file, change --host 0.0.0.0 to --host 127.0.0.1 for local only
--host 127.0.0.1

Use firewall:

# Allow only specific IPs
ufw allow from 192.168.1.0/24 to any port 9001

Authentication

Future implementation: API key authentication and JWT tokens

Current status: No authentication (open access)

Recommendation: Deploy behind reverse proxy with authentication

Data Encryption

Redis encryption: Configure Redis with TLS

API encryption: Use HTTPS in production

Backup and Recovery

Redis Backup

Manual backup:

redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb

Automated backup:

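#!/bin/bash
# backup_redis.sh
# Start a background save of the dataset to dump.rdb.
redis-cli BGSAVE
# Wait for the save to finish instead of assuming it takes 5 seconds.
while redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:1'; do
  sleep 1
done
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d-%H%M%S).rdb
# Keep last 7 days
find /backup -name "redis-*.rdb" -mtime +7 -delete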

Restore from backup:

systemctl stop redis
cp /backup/redis-20260507.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
systemctl start redis

Service Configuration Backup

Backup service file:

cp /etc/systemd/system/aitbc-agent-coordinator.service /backup/

Backup environment:

cp /etc/aitbc/.env /backup/

Scaling

Horizontal Scaling

Multiple coordinator instances:

  1. Deploy multiple coordinator instances behind load balancer
  2. Use shared Redis instance
  3. Configure consistent PYTHONPATH across instances

Load balancer configuration:

upstream coordinator {
    server localhost:9001;
    server localhost:9002;
    server localhost:9003;
}

server {
    listen 80;
    location / {
        proxy_pass http://coordinator;
    }
}

Redis Clustering

For high availability:

  • Use Redis Sentinel for failover
  • Use Redis Cluster for sharding
  • Configure coordinator to use Redis Sentinel

Maintenance

Regular Maintenance Tasks

Daily:

  • Monitor service health
  • Check task distribution stats
  • Review error logs

Weekly:

  • Backup Redis data
  • Review agent registrations
  • Clean up stale agents

Monthly:

  • Review performance metrics
  • Update software dependencies
  • Audit security configurations

Agent Cleanup

Remove inactive agents:

redis-cli
> SREM agents:active "stale-agent-id"
> DEL agent:stale-agent-id

Bulk cleanup script:

#!/bin/bash
# cleanup_stale_agents.sh
# Scan agent hashes without blocking Redis and remove those marked stale.
redis-cli --scan --pattern "agent:*" | while read -r key; do
  status=$(redis-cli HGET "$key" status)
  if [ "$status" = "stale" ]; then
    # Strip the "agent:" prefix; this also works if the ID contains a colon.
    agent_id=${key#agent:}
    redis-cli SREM agents:active "$agent_id"
    redis-cli DEL "$key"
    echo "Removed stale agent: $agent_id"
  fi
done

Service Restart

Graceful restart (requires an ExecReload= directive in the unit file; without one, systemctl reload fails):

systemctl reload aitbc-agent-coordinator.service

Force restart:

systemctl restart aitbc-agent-coordinator.service

Rolling restart (multiple instances):

for i in {1..3}; do
  systemctl restart aitbc-agent-coordinator@$i.service
  sleep 10
done
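
A fixed sleep between restarts does not confirm that an instance came back. One possible refinement is to wait for each instance to report healthy before moving on, and to stop if one does not recover. The sketch below is a hypothetical helper with injected `restart` and `is_healthy` callables so it runs without systemd; the per-instance wiring described in the comments is an assumption.

```python
import time


def rolling_restart(instances, restart, is_healthy, timeout=60.0,
                    interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Restart instances one at a time, waiting for each to become healthy.

    Returns the instances restarted successfully; stops at the first
    instance that is still unhealthy when the timeout expires.
    """
    completed = []
    for instance in instances:
        restart(instance)
        deadline = clock() + timeout
        while not is_healthy(instance):
            if clock() >= deadline:
                return completed  # leave remaining instances untouched
            sleep(interval)
        completed.append(instance)
    return completed


# In practice, restart could call
#   subprocess.run(["systemctl", "restart",
#                   f"aitbc-agent-coordinator@{i}.service"], check=True)
# and is_healthy could request /health on that instance's port.
restarted = []
result = rolling_restart([1, 2, 3], restart=restarted.append,
                         is_healthy=lambda i: True)
print(result)  # [1, 2, 3]
```

Stopping at the first failure keeps the remaining instances serving traffic while the broken one is investigated.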

Alerting

Service alerts:

  • Service down (health check fails)
  • High error rate (> 5%)
  • High response time (> 5s)

Agent alerts:

  • No active agents
  • Agent registration failures
  • Agent stale count increasing

Task alerts:

  • Task queue backlog (> 100 tasks)
  • Task failure rate (> 10%)
  • Distribution time increasing

Redis alerts:

  • Redis connection failures
  • Redis memory usage > 80%
  • Redis latency > 100ms
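
The numeric thresholds above translate directly into a small rule table. This is a sketch only: the metric names (`error_rate`, `queue_backlog`, and so on) are hypothetical, since the guide does not define how these values are collected, and rates are assumed to be fractions between 0 and 1.

```python
# (metric name, breach test, alert message); thresholds from the lists above.
RULES = [
    ("error_rate", lambda v: v > 0.05, "error rate above 5%"),
    ("response_time_s", lambda v: v > 5, "response time above 5s"),
    ("active_agents", lambda v: v == 0, "no active agents"),
    ("queue_backlog", lambda v: v > 100, "task queue backlog above 100"),
    ("task_failure_rate", lambda v: v > 0.10, "task failure rate above 10%"),
    ("redis_memory_ratio", lambda v: v > 0.80, "Redis memory usage above 80%"),
    ("redis_latency_ms", lambda v: v > 100, "Redis latency above 100ms"),
]


def evaluate_alerts(metrics: dict) -> list[str]:
    """Return messages for every rule breached by the supplied metrics."""
    return [
        message
        for name, breached, message in RULES
        if name in metrics and breached(metrics[name])
    ]


metrics = {"error_rate": 0.02, "queue_backlog": 250, "redis_memory_ratio": 0.91}
print(evaluate_alerts(metrics))
# ['task queue backlog above 100', 'Redis memory usage above 80%']
```

Metrics missing from the input are skipped rather than treated as breaches, so a partial snapshot does not raise false alarms.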

Monitoring Tools

Prometheus metrics: (future implementation)

  • Export metrics at /metrics endpoint
  • Use Grafana for visualization

Log aggregation:

  • Send logs to ELK stack
  • Use Loki for log storage
  • Configure alerting based on log patterns

Troubleshooting Checklist

When issues occur, check in this order:

  1. Service status

    • Service running?
    • Health check passing?
    • Logs showing errors?
  2. Redis status

    • Redis running?
    • Connection successful?
    • Memory usage normal?
  3. Agent status

    • Agents registered?
    • Agents active?
    • Agent capabilities valid?
  4. Task status

    • Tasks submitting?
    • Tasks distributing?
    • Tasks completing?
  5. Network

    • Connectivity to Redis?
    • Connectivity to agents?
    • Firewall rules correct?
  6. Configuration

    • Environment variables set?
    • PYTHONPATH correct?
    • Port available?
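
The ordering of this checklist can be automated by running checks in sequence and reporting the first one that fails. The sketch below uses stub checks so it runs anywhere; `first_failure` is a hypothetical helper, and real checks would wrap the commands shown earlier in this guide.

```python
def first_failure(checks):
    """Run (name, check) pairs in order; return the first failing name.

    A check fails if it returns a falsy value or raises an exception.
    Returns None when every check passes.
    """
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            return name
    return None


# Stub checks in the order of the checklist above.
checks = [
    ("service", lambda: True),
    ("redis", lambda: True),
    ("agents", lambda: False),  # e.g. no agents registered
    ("tasks", lambda: True),
    ("network", lambda: True),
    ("configuration", lambda: True),
]
print(first_failure(checks))  # agents
```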