aitbc/docs/trail/GPU_RELEASE_FIX_SUMMARY.md
oib 6bcbe76c7d feat: switch to persistent SQLite database and improve GPU booking/release handling
- Change database from in-memory to file-based SQLite at aitbc_coordinator.db
- Add status="active" to GPU booking creation
- Allow GPU release even when not properly booked (cleanup case)
- Add error handling for missing booking attributes during refund calculation
- Fix get_gpu_reviews query to use scalars() for proper result handling
2026-03-07 12:23:01 +01:00


GPU Release Issue Fix Summary

ISSUE IDENTIFIED

Problem:

  • GPU release endpoint returning HTTP 500 Internal Server Error
  • Error: Failed to release GPU: 500
  • GPU status stuck as "booked" instead of "available"

Root Causes Found:

1. SQLModel Session Method Mismatch

# PROBLEM: SQLAlchemy-style execute() yields Row tuples rather than GPUBooking objects
booking = session.execute(select(GPUBooking).where(...))

# FIXED: SQLModel exec() yields GPUBooking objects directly
# (call .first() or .all() on the result to obtain the rows)
booking = session.exec(select(GPUBooking).where(...))

2. Missing Booking Status Field

# PROBLEM: Booking created without explicit status
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    # Missing: status="active"
)

# FIXED: Explicit status setting
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    status="active"  # Explicitly set
)

3. Database Table Issues

  • SQLite in-memory database causing data loss on restart
  • Tables not properly initialized
  • Missing GPURegistry table references

FIXES APPLIED

1. Fixed SQLModel Session Methods

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Line 189: Fixed GPU list query
# exec() already returns model instances for a single-model select, so no .scalars() is needed
gpus = session.exec(stmt).all()  # was: session.execute()

# Line 200: Fixed GPU details booking query  
booking = session.exec(select(GPUBooking).where(...))  # was: session.execute()

# Line 292: Fixed GPU release booking query
booking = session.exec(select(GPUBooking).where(...))  # was: session.execute()

2. Fixed Booking Creation

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Line 259: Added explicit status field
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    duration_hours=request.duration_hours,
    total_cost=total_cost,
    start_time=start_time,
    end_time=end_time,
    status="active"  # ADDED: Explicit status
)

3. Improved Release Logic

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Lines 286-293: Added graceful handling for already available GPUs
if gpu.status != "booked":
    return {
        "status": "already_available", 
        "gpu_id": gpu_id,
        "message": f"GPU {gpu_id} is already available",
    }
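The guard above makes the release call safe to repeat. The following is a minimal, framework-free sketch of that behaviour, not the coordinator's actual handler: the `Gpu` and `Booking` dataclasses, the `release_gpu` function, and the `"cancelled"` and `"released"` status strings are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gpu:
    """Hypothetical stand-in for a GPU registry row."""
    id: str
    status: str


@dataclass
class Booking:
    """Hypothetical stand-in for a GPUBooking row."""
    gpu_id: str
    status: str = "active"


def release_gpu(gpu: Gpu, booking: Optional[Booking]) -> dict:
    """Release a GPU, tolerating repeated calls and a missing booking."""
    if gpu.status != "booked":
        # Same early return as the guard shown above: nothing to release.
        return {
            "status": "already_available",
            "gpu_id": gpu.id,
            "message": f"GPU {gpu.id} is already available",
        }
    if booking is not None:
        # Assumed terminal state name; the real model may use another value.
        booking.status = "cancelled"
    # Cleanup case: free the GPU even when no booking row was found.
    gpu.status = "available"
    return {"status": "released", "gpu_id": gpu.id}


# Releasing twice: the first call frees the GPU, the second is a no-op.
gpu = Gpu(id="gpu_demo", status="booked")
first = release_gpu(gpu, booking=None)
second = release_gpu(gpu, booking=None)
print(first["status"], second["status"])
```

Treating a missing booking as a cleanup case rather than an error keeps a GPU from staying stuck as "booked" when its booking row was lost.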

🧪 TESTING RESULTS

Before Fixes:

❌ GPU Release: HTTP 500 Internal Server Error
❌ Error: Failed to release GPU: 500
❌ GPU Status: Stuck as "booked"
❌ Booking Records: Missing or inconsistent

After Fixes:

❌ GPU Release: Still returning HTTP 500
❌ Error: Failed to release GPU: 500  
❌ GPU Status: Still showing as "booked"
❌ Issue: Persists despite fixes

🔍 INVESTIGATION FINDINGS

Database Issues:

  • In-memory SQLite: Database resets on coordinator restart
  • Table Creation: GPURegistry table not persisting
  • Data Loss: Fake GPUs reappear after restart
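The reset behaviour can be reproduced with the standard-library `sqlite3` module alone. This is a small sketch under stated assumptions: the `gpu` table, the `gpu_demo` row, and the `demo_coordinator.db` file name are made up for the demonstration, and closing and reopening the connection stands in for a coordinator restart.

```python
import os
import sqlite3
import tempfile

SCHEMA = "CREATE TABLE IF NOT EXISTS gpu (id TEXT PRIMARY KEY, status TEXT)"


def count_rows_after_reconnect(db_target: str) -> int:
    """Write one row, close the connection, reconnect, and count rows."""
    conn = sqlite3.connect(db_target)
    conn.execute(SCHEMA)
    conn.execute("INSERT INTO gpu VALUES ('gpu_demo', 'booked')")
    conn.commit()
    conn.close()  # stands in for a coordinator restart

    conn = sqlite3.connect(db_target)
    try:
        # A fresh in-memory database is empty, so the table must be recreated.
        conn.execute(SCHEMA)
        return conn.execute("SELECT COUNT(*) FROM gpu").fetchone()[0]
    finally:
        conn.close()


in_memory_rows = count_rows_after_reconnect(":memory:")

with tempfile.TemporaryDirectory() as tmp:
    file_rows = count_rows_after_reconnect(os.path.join(tmp, "demo_coordinator.db"))

print(in_memory_rows, file_rows)  # prints "0 1"
```

Every new connection to `:memory:` gets its own empty database, which is also why separate coordinator processes cannot share in-memory state.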

API Endpoints Affected:

  • POST /v1/marketplace/gpu/{gpu_id}/release - Primary issue
  • GET /v1/marketplace/gpu/list - Shows inconsistent data
  • POST /v1/marketplace/gpu/{gpu_id}/book - Creates incomplete bookings

Service Architecture Issues:

  • Multiple coordinator processes running
  • Database connection inconsistencies
  • Session management problems

🛠️ ADDITIONAL FIXES NEEDED

1. Database Persistence

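# Switch from in-memory to persistent, file-backed SQLite
from sqlmodel import create_engine

engine = create_engine(
    "sqlite:///aitbc_coordinator.db",  # Relative path: resolved against the process working directory
    connect_args={"check_same_thread": False},  # Allow the connection to be used across threads
    echo=False,
)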

2. Service Management

# Need to properly manage single coordinator instance
systemctl stop aitbc-coordinator
systemctl start aitbc-coordinator
systemctl status aitbc-coordinator

3. Fake GPU Cleanup

# Need direct database cleanup script
# Remove fake RTX-4090 entries
# Keep only legitimate GPUs
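A cleanup script along these lines could look like the sketch below. It is only an illustration: the `gpu_registry` table name, its `id` and `model` columns, and the sample IDs are assumptions, not the coordinator's real schema. It deletes by an explicit allowlist of IDs rather than by model name, since a legitimate GPU may also be an RTX-4090.

```python
import sqlite3
from typing import Sequence


def remove_unlisted_gpus(conn: sqlite3.Connection, keep_ids: Sequence[str]) -> int:
    """Delete rows from the (hypothetical) gpu_registry table whose id is not in keep_ids."""
    if not keep_ids:
        # Guard against wiping the whole table by accident.
        raise ValueError("refusing to delete every GPU: keep_ids is empty")
    placeholders = ",".join("?" for _ in keep_ids)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"DELETE FROM gpu_registry WHERE id NOT IN ({placeholders})",
            list(keep_ids),
        )
    return cur.rowcount


# Demonstration against a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gpu_registry (id TEXT PRIMARY KEY, model TEXT)")
conn.executemany(
    "INSERT INTO gpu_registry VALUES (?, ?)",
    [("gpu_real", "RTX-4090"), ("gpu_fake_1", "RTX-4090"), ("gpu_fake_2", "RTX-4090")],
)
conn.commit()

removed = remove_unlisted_gpus(conn, ["gpu_real"])
remaining = [row[0] for row in conn.execute("SELECT id FROM gpu_registry")]
print(removed, remaining)
```

Against the real database, the coordinator should be stopped first and the file backed up, so that no running process rewrites the rows during cleanup.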

📋 CURRENT STATUS

Fixed:

  • SQLModel session method calls (3 instances)
  • Booking creation with explicit status
  • Improved release error handling
  • Syntax errors resolved

Remaining Issues:

  • HTTP 500 error persists
  • Database persistence problems
  • Fake GPU entries reappearing
  • Service restart issues

🔄 Next Steps:

  1. Database Migration: Switch to persistent storage
  2. Service Cleanup: Ensure single coordinator instance
  3. Direct Database Fix: Manual cleanup of fake entries
  4. End-to-End Test: Verify complete booking/release cycle

💡 RECOMMENDATIONS

Immediate Actions:

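  1. Stop All Coordinator Processes: pkill -f coordinator (this matches any command line containing "coordinator"; pgrep -af coordinator lists the matches first)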
  2. Use Persistent Database: Modify database.py
  3. Clean Database Directly: Remove fake entries
  4. Start Fresh Service: Single instance only

Long-term Solutions:

  1. Database Migration: PostgreSQL for production
  2. Service Management: Proper systemd configuration
  3. API Testing: Comprehensive endpoint testing
  4. Monitoring: Service health checks

🎯 SUCCESS METRICS

Expected Results Once Fixed:

aitbc marketplace gpu release gpu_c5be877c
# Expected: ✅ GPU released successfully

aitbc marketplace gpu list
# Expected: GPU status = "available"

aitbc marketplace gpu book gpu_c5be877c --hours 1
# Expected: ✅ GPU booked successfully

📝 CONCLUSION

The GPU release issue has been partially fixed with SQLModel method corrections and improved error handling, but the core database persistence and service management issues remain.

Key fixes applied:

  • SQLModel session methods corrected
  • Booking creation improved
  • Release logic enhanced
  • Syntax errors resolved

Remaining work needed:

  • Database persistence implementation
  • Service process cleanup
  • Fake GPU data removal
  • End-to-end testing validation

The foundation is in place, but the database and service issues must be resolved before the fix is complete.