2026-05-07 18:49:17 +02:00


AITBC Agent Coordinator - Operator Guide

This guide provides operators with the knowledge to deploy, configure, monitor, and troubleshoot the AITBC Agent Coordinator service.

Service Deployment

Prerequisites

Redis server running on localhost or remote host
Python 3.13+
Systemd (for service management)
AITBC blockchain node (optional, for blockchain integration)

Installation

Install dependencies:

cd /opt/aitbc/apps/agent-coordinator
pip install -r requirements.txt

Configure environment:

# Edit /etc/aitbc/.env
export AITBC_REDIS_URL=redis://localhost:6379
export AITBC_COORDINATOR_PORT=9001
export AITBC_LOG_LEVEL=INFO

Start Redis:

systemctl start redis
systemctl enable redis

Start coordinator service:

systemctl start aitbc-agent-coordinator.service
systemctl enable aitbc-agent-coordinator.service

Service Configuration

Service file location: /etc/systemd/system/aitbc-agent-coordinator.service

Key configuration parameters:

PYTHONPATH=apps/agent-coordinator/src - Python module path
uvicorn app.main:app - FastAPI application entry point
--host 0.0.0.0 - Bind to all interfaces
--port 9001 - Service port
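
As a rough illustration of how the parameters above fit together, the sketch below prints a unit file that could be compared against the installed one. Only PYTHONPATH, the uvicorn entry point, the host, and the port come from this guide. The Description, WorkingDirectory=/opt/aitbc, the /usr/bin/python3 path, the redis.service unit name, and Restart=on-failure are assumptions, and render_coordinator_unit is a hypothetical helper name.

```shell
#!/bin/bash
# Print a minimal unit file sketch for the coordinator.
# Assumptions (not confirmed by this guide): WorkingDirectory, the python3 path,
# the redis.service unit name, and the Restart= policy.
render_coordinator_unit() {
  cat <<'EOF'
[Unit]
Description=AITBC Agent Coordinator
After=network.target redis.service

[Service]
WorkingDirectory=/opt/aitbc
Environment=PYTHONPATH=apps/agent-coordinator/src
ExecStart=/usr/bin/python3 -m uvicorn app.main:app --host 0.0.0.0 --port 9001
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage: compare the sketch with the installed unit instead of overwriting it:
#   render_coordinator_unit | diff - /etc/systemd/system/aitbc-agent-coordinator.service
```

Because PYTHONPATH is relative here, it only resolves correctly if the working directory is the repository root, which is why the sketch sets WorkingDirectory explicitly.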

Redis Configuration

Connection URL: redis://localhost:6379/0

Redis key layout:

Agent data: agent:{agent_id} (hash)
Active agents: agents:active (set)
Load metrics: Stored in agent hash

Redis monitoring:

redis-cli
> KEYS agent:*
> SMEMBERS agents:active
> HGETALL agent:hermes-agent

Agent Registration Procedures

Manual Registration via CLI

Basic registration:

aitbc-cli agent sdk register \
  --agent-id my-agent \
  --type worker \
  --coordinator-url http://localhost:9001

Full registration with capabilities:

aitbc-cli agent sdk register \
  --agent-id my-agent \
  --type worker \
  --capabilities "data-processing,analysis,debugging" \
  --services "task-execution,coordination" \
  --endpoints '{"http":"http://my-host:9002"}' \
  --metadata '{"version":"1.0.0","owner":"my-team"}' \
  --coordinator-url http://localhost:9001

Automated Registration Script

#!/bin/bash
# register_agents.sh

COORDINATOR_URL="http://localhost:9001"

register_agent() {
  local agent_id=$1
  local agent_type=$2
  local capabilities=$3
  
  aitbc-cli agent sdk register \
    --agent-id "$agent_id" \
    --type "$agent_type" \
    --capabilities "$capabilities" \
    --coordinator-url "$COORDINATOR_URL"
}

# Register agents
register_agent "worker-1" "worker" "data-processing,analysis"
register_agent "worker-2" "worker" "data-processing,analysis"
register_agent "worker-3" "worker" "inference,training"

Cross-Node Registration

# Register agent on aitbc1
curl -X POST http://aitbc1:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "aitbc1-worker",
    "agent_type": "worker",
    "capabilities": ["data-processing"],
    "endpoints": {"http": "http://aitbc1:9002"}
  }'

# Register agent on aitbc2
curl -X POST http://aitbc2:9001/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "aitbc2-worker",
    "agent_type": "worker",
    "capabilities": ["inference"],
    "endpoints": {"http": "http://aitbc2:9002"}
  }'

Monitoring and Troubleshooting

Health Checks

Service health:

curl http://localhost:9001/health

Expected response:

{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-05-07T16:00:00.000000+00:00"
}
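
For scripted checks, the expected response above can be evaluated automatically. The following is a minimal sketch, assuming the status field is always the string "healthy" when the service is up; health_body_ok and check_coordinator are hypothetical helper names, and the pattern match is a dependency-free shortcut, not a full JSON parser.

```shell
#!/bin/bash
# Succeed only when the JSON body on stdin reports "status": "healthy".
# A plain pattern match keeps the sketch dependency-free.
health_body_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Hypothetical wrapper: fetch /health and evaluate the body.
# -f makes curl fail on HTTP errors; --max-time bounds a hung service.
check_coordinator() {
  local base_url=${1:-http://localhost:9001}
  curl -fsS --max-time 5 "$base_url/health" | health_body_ok
}

# Usage:
#   check_coordinator http://localhost:9001 && echo "coordinator healthy"
```

If curl fails, the pipe delivers no body, so the check fails for an unreachable service as well as for an unhealthy one.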

Task distribution stats:

curl http://localhost:9001/tasks/status

CLI health check:

aitbc-cli ai status

Service Status

Check systemd service:

systemctl status aitbc-agent-coordinator.service

View service logs:

journalctl -u aitbc-agent-coordinator.service -f

View recent logs:

journalctl -u aitbc-agent-coordinator.service -n 100

Agent Monitoring

List all agents:

aitbc-cli agent sdk list

List active agents only:

aitbc-cli agent sdk list --status active

Check specific agent:

aitbc-cli agent sdk status --agent-id my-agent

Monitor distribution stats:

aitbc-cli ai distribution-stats

Redis Monitoring

Check Redis connection:

redis-cli ping

View all registered agents:

redis-cli
> KEYS agent:*

View active agents:

redis-cli
> SMEMBERS agents:active

View agent details:

redis-cli
> HGETALL agent:my-agent

Monitor Redis memory:

redis-cli INFO memory

Common Issues and Solutions

Service won't start

Symptoms:

Failed to start aitbc-agent-coordinator.service

Solutions:

Check Redis is running:

systemctl status redis

Check Redis connection:

redis-cli ping

Check service logs:

journalctl -u aitbc-agent-coordinator.service -n 50

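Verify the PYTHONPATH the service actually uses (the value in your login shell is not what systemd passes to the service):

systemctl cat aitbc-agent-coordinator.service | grep PYTHONPATH
# Should resolve to: /opt/aitbc/apps/agent-coordinator/src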

No agents discovered

Symptoms:

aitbc-cli agent sdk list
Found 0 agents

Solutions:

Check if agents are registered:

redis-cli SMEMBERS agents:active

If the set is empty, register a test agent:

aitbc-cli agent sdk register --agent-id test-agent --type worker

Check agent status:

aitbc-cli agent sdk status --agent-id test-agent

Tasks not distributing

Symptoms:

Tasks submitted but not assigned
tasks_distributed count not increasing

Solutions:

Check for active agents:

aitbc-cli agent sdk list --status active

Check task distributor status:

curl http://localhost:9001/tasks/status

Verify agent capabilities match task requirements
Check load balancer strategy
Review service logs for errors

Agent marked as stale

Symptoms:

Agent status changes from active to stale
Agent not receiving new tasks

Solutions:

Update agent status:

aitbc-cli agent sdk update-status --agent-id my-agent --status active

Check heartbeat mechanism (if implemented)
Verify agent is still running
Check network connectivity

Redis connection errors

Symptoms:

Error connecting to Redis

Solutions:

Check Redis service:

systemctl status redis

Restart Redis:

systemctl restart redis

Check Redis configuration:

redis-cli INFO server

Verify the Redis URL configured for the service:

grep AITBC_REDIS_URL /etc/aitbc/.env

Performance Tuning

Load Balancing Strategies

Current default: LEAST_CONNECTIONS

Available strategies:

LEAST_CONNECTIONS - Fewest active connections
ROUND_ROBIN - Circular distribution
WEIGHTED_ROUND_ROBIN - Performance-based
RESOURCE_BASED - CPU/memory metrics
GEOGRAPHIC - Location-based
RANDOM - Random selection (testing)

Changing strategy: (requires code modification in lifespan.py)

Priority Queue Configuration

Priority levels:

urgent
critical
high
normal
low

Queue sizing: Configured in TaskDistributor class

Monitoring queue sizes:

curl http://localhost:9001/tasks/status | jq .stats.queue_sizes
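
The same response can feed a simple backlog check. This is a sketch only: it assumes .stats.queue_sizes is a JSON object mapping each priority name to an integer, which this guide does not confirm, and queue_backlog_total and backlog_exceeds are hypothetical helper names. It requires jq.

```shell
#!/bin/bash
# Sum all per-priority queue sizes from a /tasks/status JSON body on stdin.
# Assumption: .stats.queue_sizes is an object mapping priority name -> integer.
queue_backlog_total() {
  jq '[.stats.queue_sizes[]?] | add // 0'
}

# Succeed (exit 0) when the total backlog exceeds the threshold (default 100).
# Returns 2 if the body could not be parsed.
backlog_exceeds() {
  local threshold=${1:-100}
  local total
  total=$(queue_backlog_total) || return 2
  [ "$total" -gt "$threshold" ]
}

# Usage against a live coordinator:
#   curl -fsS http://localhost:9001/tasks/status | backlog_exceeds 100 && echo "backlog alert"
```

A missing queue_sizes field is treated as an empty backlog (total 0) instead of an error, so a changed response shape would go unnoticed; tighten that if the field is required.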

Resource Limits

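Redis memory limits:

redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

With allkeys-lru, Redis may evict any key once maxmemory is reached, including agent:{agent_id} hashes. Settings applied with CONFIG SET are lost on restart unless they are also written to redis.conf (for example with redis-cli CONFIG REWRITE).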

Service memory limits: (configure in the [Service] section of the systemd service file)

MemoryMax=2G
MemorySwapMax=2G

Connection limits: (configure in uvicorn startup)

--limit-concurrency 100

Security Considerations

Network Security

Bind to specific interface:

# In service file, change --host 0.0.0.0 to --host 127.0.0.1 for local only
--host 127.0.0.1

Use firewall:

# Allow only specific IPs
ufw allow from 192.168.1.0/24 to any port 9001

Authentication

Future implementation: API key authentication and JWT tokens

Current status: No authentication (open access)

Recommendation: Deploy behind reverse proxy with authentication

Data Encryption

Redis encryption: Configure Redis with TLS

API encryption: Use HTTPS in production

Backup and Recovery

Redis Backup

Manual backup:

redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb

Automated backup:

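#!/bin/bash
# backup_redis.sh
set -euo pipefail
redis-cli BGSAVE
# Wait for the background save to finish instead of assuming 5 seconds is enough
while redis-cli INFO persistence | grep -q '^rdb_bgsave_in_progress:1'; do
  sleep 1
done
cp /var/lib/redis/dump.rdb "/backup/redis-$(date +%Y%m%d-%H%M%S).rdb"
# Keep last 7 days
find /backup -name "redis-*.rdb" -mtime +7 -delete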

Restore from backup:

systemctl stop redis
cp /backup/redis-20260507.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
systemctl start redis

Service Configuration Backup

Backup service file:

cp /etc/systemd/system/aitbc-agent-coordinator.service /backup/

Backup environment:

cp /etc/aitbc/.env /backup/

Scaling

Horizontal Scaling

Multiple coordinator instances:

Deploy multiple coordinator instances behind load balancer
Use shared Redis instance
Configure consistent PYTHONPATH across instances

Load balancer configuration:

upstream coordinator {
    server localhost:9001;
    server localhost:9002;
    server localhost:9003;
}

server {
    listen 80;
    location / {
        proxy_pass http://coordinator;
    }
}

Redis Clustering

For high availability:

Use Redis Sentinel for failover
Use Redis Cluster for sharding
Configure coordinator to use Redis Sentinel

Maintenance

Regular Maintenance Tasks

Daily:

Monitor service health
Check task distribution stats
Review error logs

Weekly:

Backup Redis data
Review agent registrations
Clean up stale agents

Monthly:

Review performance metrics
Update software dependencies
Audit security configurations

Agent Cleanup

Remove inactive agents:

redis-cli
> SREM agents:active "stale-agent-id"
> DEL agent:stale-agent-id

Bulk cleanup script:

#!/bin/bash
# cleanup_stale_agents.sh
# Remove every agent whose hash has status "stale"
redis-cli --scan --pattern "agent:*" | while IFS= read -r key; do
  status=$(redis-cli HGET "$key" status)
  if [ "$status" = "stale" ]; then
    # Strip only the "agent:" prefix so IDs that contain ":" stay intact
    agent_id=${key#agent:}
    redis-cli SREM agents:active "$agent_id"
    redis-cli DEL "$key"
    echo "Removed stale agent: $agent_id"
  fi
done

Service Restart

Graceful restart (requires an ExecReload= directive in the service file; without one, systemctl reload fails):

systemctl reload aitbc-agent-coordinator.service

Force restart:

systemctl restart aitbc-agent-coordinator.service

Rolling restart (multiple instances):

for i in {1..3}; do
  systemctl restart aitbc-agent-coordinator@$i.service
  sleep 10
done
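
A fixed sleep does not confirm that an instance is serving again before the next one goes down. The sketch below waits for each instance's health endpoint instead. It assumes instance N listens on port 9000+N (matching the ports in the load balancer example) and that three instances exist; wait_until and rolling_restart are hypothetical helper names.

```shell
#!/bin/bash
# Retry a command once per second until it succeeds or the timeout (seconds) elapses.
wait_until() {
  local timeout=$1; shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Restart instances one at a time, stopping at the first that fails to recover.
# Assumption: instance N listens on port 9000+N.
rolling_restart() {
  local i
  for i in 1 2 3; do
    systemctl restart "aitbc-agent-coordinator@$i.service"
    wait_until 30 curl -fsS --max-time 2 "http://localhost:$((9000 + i))/health" >/dev/null || {
      echo "instance $i did not become healthy; aborting" >&2
      return 1
    }
  done
}

# Usage:
#   rolling_restart
```

Aborting on the first unhealthy instance leaves the remaining instances untouched, so the load balancer still has working backends while the failure is investigated.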

Alerting

Recommended Alerts

Service alerts:

Service down (health check fails)
High error rate (> 5%)
High response time (> 5s)

Agent alerts:

No active agents
Agent registration failures
Agent stale count increasing

Task alerts:

Task queue backlog (> 100 tasks)
Task failure rate (> 10%)
Distribution time increasing

Redis alerts:

Redis connection failures
Redis memory usage > 80%
Redis latency > 100ms
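
The Redis memory threshold can be evaluated from the output of redis-cli INFO memory. A minimal sketch, using the used_memory and maxmemory fields from that output; redis_memory_percent is a hypothetical helper name, and a maxmemory of 0 (no limit configured) is reported as "unlimited" because no percentage can be computed.

```shell
#!/bin/bash
# Read `redis-cli INFO memory` output on stdin and print used memory
# as an integer percentage of maxmemory, or "unlimited" when no limit is set.
redis_memory_percent() {
  # INFO lines end in CRLF, so strip carriage returns before parsing
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }'
}

# Usage:
#   pct=$(redis-cli INFO memory | redis_memory_percent)
#   [ "$pct" != "unlimited" ] && [ "$pct" -gt 80 ] && echo "Redis memory above 80%"
```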

Monitoring Tools

Prometheus metrics: (future implementation)

Export metrics at /metrics endpoint
Use Grafana for visualization

Log aggregation:

Send logs to ELK stack
Use Loki for log storage
Configure alerting based on log patterns

Troubleshooting Checklist

When issues occur, check in this order:

1. Service status
   - Service running?
   - Health check passing?
   - Logs showing errors?
2. Redis status
   - Redis running?
   - Connection successful?
   - Memory usage normal?
3. Agent status
   - Agents registered?
   - Agents active?
   - Agent capabilities valid?
4. Task status
   - Tasks submitting?
   - Tasks distributing?
   - Tasks completing?
5. Network
   - Connectivity to Redis?
   - Connectivity to agents?
   - Firewall rules correct?
6. Configuration
   - Environment variables set?
   - PYTHONPATH correct?
   - Port available?
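
The first few checks in this list can be scripted so they always run in the same order. The following is a sketch under stated assumptions: check, has_active_agents, and run_checklist are hypothetical helper names, and only commands already used in this guide (systemctl, curl against /health, redis-cli) are called. It covers the service, Redis, and agent items; the remaining items still need manual review.

```shell
#!/bin/bash
# Run a command quietly and report the outcome under a label.
check() {
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "[ok] $label"
  else
    echo "[FAIL] $label"
    return 1
  fi
}

# True when the agents:active set has at least one member.
has_active_agents() {
  [ "$(redis-cli SCARD agents:active)" -gt 0 ]
}

# Checks run in order and stop at the first failure, so the first
# [FAIL] line points at the layer to investigate.
run_checklist() {
  check "service running" systemctl is-active --quiet aitbc-agent-coordinator.service &&
  check "health endpoint" curl -fsS --max-time 5 http://localhost:9001/health &&
  check "redis reachable" redis-cli ping &&
  check "active agents present" has_active_agents
}

# Usage:
#   run_checklist || journalctl -u aitbc-agent-coordinator.service -n 50
```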
