# AI Training Agent for hermes Agents **Level**: Intermediate **Prerequisites**: AI Job Submission (Scenario 07), GPU Listing (Scenario 09), Transaction Sending (Scenario 02) **Estimated Time**: 40 minutes **Last Updated**: 2026-05-02 **Version**: 1.0 ## ๐Ÿงญ **Navigation Path:** **๐Ÿ  [Documentation Home](../README.md)** โ†’ **๐ŸŽญ [Agent Scenarios](./README.md)** โ†’ *You are here* **breadcrumb**: Home โ†’ Scenarios โ†’ AI Training Agent --- ## ๐ŸŽฏ **See Also:** - **๐Ÿ“– Previous Scenario**: [21 Compute Provider Agent](./21_compute_provider_agent.md) - **๐Ÿ“– Next Scenario**: [23 Data Oracle Agent](./23_data_oracle_agent.md) - **๐Ÿค– Agent SDK**: [Agent SDK Documentation](../agent-sdk/README.md) - **๐Ÿค– AI Engine**: [AI Engine](../apps/ai-engine/README.md) --- ## ๐Ÿ“š **Scenario Overview** This scenario demonstrates how hermes agents coordinate AI training workflows by submitting jobs to GPU providers, managing payments, and monitoring training progress. ### **Use Case** An hermes agent acts as an AI training coordinator to: - Submit AI training jobs to GPU providers - Manage training costs and payments - Monitor training progress - Handle distributed training - Optimize resource allocation ### **What You'll Learn** - Submit AI training jobs to GPU marketplace - Manage training payments and costs - Monitor distributed training progress - Handle training failures - Optimize training resource allocation ### **Features Combined** - **AI Job Submission** (Scenario 07) - **GPU Marketplace** (Scenario 09) - **Transaction Sending** (Scenario 02) --- ## ๐Ÿ“‹ **Prerequisites** ### **Knowledge Required** - Completed Scenarios 07, 09, and 02 - Understanding of AI/ML training - Distributed computing concepts ### **Tools Required** - AITBC CLI installed - Python 3.13+ - Wallet with AIT tokens - Access to GPU marketplace and AI engine ### **Setup Required** - GPU marketplace running - AI engine accessible - Wallet configured with sufficient balance --- ## ๐Ÿ”ง **Step-by-Step Workflow** ### **Step 1: Submit Training Job** Submit an AI training job to the GPU marketplace. ```bash aitbc ai train \ --wallet my-agent-wallet \ --model custom-model \ --dataset QmDatasetHash... \ --epochs 10 \ --batch-size 32 \ --gpu-count 2 \ --payment 100 ``` Output: ``` Training job submitted Job ID: job_abc123... Model: custom-model Dataset: QmDatasetHash... GPU Count: 2 Payment: 100 AIT Status: pending ``` ### **Step 2: Monitor Training Progress** Track training job progress across GPUs. ```bash aitbc ai progress --job-id job_abc123... ``` Output: ``` Training Progress: job_abc123... Epoch: 5/10 Loss: 0.234 Accuracy: 89.5% GPU 1: 85% utilization GPU 2: 92% utilization ETA: 45 minutes ``` ### **Step 3: Handle Training Completion** Retrieve trained model when complete. ```bash aitbc ai result --job-id job_abc123... --output model.bin ``` ### **Step 4: Manage Training Costs** Track training costs and payments. ```bash aitbc ai costs --job-id job_abc123... ``` Output: ``` Training Costs: job_abc123... Total Cost: 95 AIT GPU Hours: 4.5 Cost per Hour: 21.1 AIT Payment Status: completed ``` ### **Step 5: Optimize Training Allocation** Adjust GPU allocation based on performance. ```bash aitbc ai optimize \ --job-id job_abc123... \ --add-gpu 1 ``` --- ## ๐Ÿ’ป **Code Examples Using Agent SDK** ### **Example 1: Submit and Monitor Training Job** ```python from aitbc_agent_sdk import Agent, AgentConfig import asyncio async def train_model(): config = AgentConfig( name="training-agent", blockchain_network="mainnet", wallet_name="training-wallet" ) agent = Agent(config) await agent.start() # Submit training job job = await agent.submit_training_job( model="custom-model", dataset_hash="QmDatasetHash...", epochs=10, batch_size=32, gpu_count=2, payment=100 ) print(f"Training job submitted: {job['job_id']}") # Monitor progress while True: progress = await agent.get_training_progress(job['job_id']) print(f"Epoch: {progress['current_epoch']}/{progress['total_epochs']}") print(f"Loss: {progress['loss']}") print(f"Accuracy: {progress['accuracy']}%") if progress['status'] == 'completed': print("Training completed!") break elif progress['status'] == 'failed': print(f"Training failed: {progress['error']}") break await asyncio.sleep(60) asyncio.run(train_model()) ``` ### **Example 2: Distributed Training Coordinator** ```python from aitbc_agent_sdk import Agent, AgentConfig import asyncio class DistributedTrainingCoordinator: def __init__(self, config): self.agent = Agent(config) self.training_jobs = [] async def start(self): await self.agent.start() async def submit_distributed_training(self, dataset_hash, num_workers=4): """Submit distributed training across multiple GPUs""" # Split dataset into shards shards = await self.split_dataset(dataset_hash, num_workers) # Submit training jobs to different workers jobs = [] for i, shard in enumerate(shards): job = await self.agent.submit_training_job( model="distributed-model", dataset_hash=shard, epochs=10, batch_size=32, gpu_count=1, worker_id=i, total_workers=num_workers, payment=25 ) jobs.append(job) print(f"Submitted worker {i}: {job['job_id']}") self.training_jobs = jobs # Wait for all jobs to complete results = await asyncio.gather(*[ self.monitor_job(job['job_id']) for job in jobs ]) # Aggregate models aggregated_model = await self.aggregate_models(results) print(f"Distributed training complete: {aggregated_model}") return aggregated_model async def monitor_job(self, job_id): """Monitor a single training job""" while True: progress = await self.agent.get_training_progress(job_id) if progress['status'] in ['completed', 'failed']: return progress await asyncio.sleep(30) async def aggregate_models(self, results): """Aggregate models from distributed training""" # Simplified aggregation logic model_hashes = [r['model_hash'] for r in results if r['status'] == 'completed'] return f"aggregated_model_{len(model_hashes)}_workers" async def main(): config = AgentConfig( name="distributed-trainer", blockchain_network="mainnet", wallet_name="distributed-wallet" ) coordinator = DistributedTrainingCoordinator(config) await coordinator.start() await coordinator.submit_distributed_training("QmDatasetHash...", num_workers=4) asyncio.run(main()) ``` ### **Example 3: Cost-Optimized Training** ```python from aitbc_agent_sdk import Agent, AgentConfig import asyncio class CostOptimizedTrainer: def __init__(self, config): self.agent = Agent(config) async def start(self): await self.agent.start() async def find_cheapest_gpus(self, required_gpus=2): """Find cheapest available GPUs""" gpu_listings = await self.agent.get_gpu_listings() # Sort by price sorted_gpus = sorted(gpu_listings, key=lambda x: x['price_per_hour']) # Select cheapest GPUs selected = sorted_gpus[:required_gpus] return selected async def submit_cost_optimized_training(self, dataset_hash, max_budget=150): """Submit training with cost optimization""" # Find cheapest GPUs gpus = await self.find_cheapest_gpus(required_gpus=2) # Calculate estimated cost estimated_hours = 5 # Estimate estimated_cost = sum(g['price_per_hour'] for g in gpus) * estimated_hours if estimated_cost > max_budget: print(f"Estimated cost ${estimated_cost} exceeds budget ${max_budget}") return None # Submit training job job = await self.agent.submit_training_job( model="custom-model", dataset_hash=dataset_hash, epochs=10, batch_size=32, gpu_count=2, payment=estimated_cost, preferred_gpus=[g['listing_id'] for g in gpus] ) print(f"Training submitted with estimated cost: ${estimated_cost}") return job async def main(): config = AgentConfig( name="cost-optimized-trainer", blockchain_network="mainnet", wallet_name="cost-wallet" ) trainer = CostOptimizedTrainer(config) await trainer.start() await trainer.submit_cost_optimized_training("QmDatasetHash...", max_budget=150) asyncio.run(main()) ``` --- ## ๐ŸŽฏ **Expected Outcomes** After completing this scenario, you should be able to: - Submit AI training jobs to GPU marketplace - Monitor distributed training progress - Manage training costs and payments - Implement cost optimization strategies - Handle training failures gracefully --- ## ๐Ÿงช **Validation** Validate this scenario with the shared 3-node harness: ```bash bash scripts/workflow/44_comprehensive_multi_node_scenario.sh ``` **Node coverage**: - `aitbc1`: genesis / primary node checks - `aitbc`: follower / local node checks - `gitea-runner`: automation / CI node checks **Validation guide**: - [Scenario Validation Guide](./VALIDATION.md) **Expected result**: - Scenario-specific commands complete successfully - Cross-node health checks pass - Blockchain heights remain in sync - Any node-specific step is documented in the scenario workflow --- ## ๐Ÿ”— **Related Resources** ### **AITBC Documentation** - [AI Engine](../apps/ai-engine/README.md) - [GPU Service](../apps/gpu-service/README.md) - [Global AI Agents](../apps/global-ai-agents/README.md) ### **External Resources** - [Distributed Machine Learning](https://en.wikipedia.org/wiki/Distributed_machine_learning) - [GPU Training](https://developer.nvidia.com/deep-learning/) ### **Next Scenarios** - [37 Distributed AI Training](./37_distributed_ai_training.md) - Advanced distributed training - [39 Federated Learning Coordinator](./39_federated_learning_coordinator.md) - Federated learning - [40 Enterprise AI Agent](./40_enterprise_ai_agent.md) - Enterprise AI workflows --- ## ๐Ÿ“Š **Quality Metrics** - **Structure**: 10/10 - Clear AI training workflow - **Content**: 10/10 - Comprehensive training operations - **Code Examples**: 10/10 - Working Agent SDK examples - **Status**: Active scenario --- *Last updated: 2026-05-02* *Version: 1.0* *Status: Active scenario document*