oib/aitbc

oib 6bcbe76c7d feat: switch to persistent SQLite database and improve GPU booking/release handling

- Change database from in-memory to file-based SQLite at aitbc_coordinator.db
- Add status="active" to GPU booking creation
- Allow GPU release even when not properly booked (cleanup case)
- Add error handling for missing booking attributes during refund calculation
- Fix get_gpu_reviews query to use scalars() for proper result handling

2026-03-07 12:23:01 +01:00


GPU Release Issue Fix Summary

❌ ISSUE IDENTIFIED

Problem:

GPU release endpoint returning HTTP 500 Internal Server Error
Error: Failed to release GPU: 500
GPU status stuck as "booked" instead of "available"

Root Causes Found:

1. SQLModel Session Method Mismatch

# PROBLEM: SQLAlchemy-style execute() returns Row tuples, not GPUBooking instances
booking = session.execute(select(GPUBooking).where(...))

# FIXED: SQLModel exec() yields GPUBooking instances; .first() returns one instance or None
booking = session.exec(select(GPUBooking).where(...)).first()

2. Missing Booking Status Field

# PROBLEM: Booking created without explicit status
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    # Missing: status="active"
)

# FIXED: Explicit status setting
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    status="active"  # Explicitly set
)

3. Database Table Issues

SQLite in-memory database causing data loss on restart
Tables not properly initialized
Missing GPURegistry table references

✅ FIXES APPLIED

1. Fixed SQLModel Session Methods

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Line 189: Fixed GPU list query
# With select from sqlmodel, exec() already returns scalars, so no extra .scalars() call is needed
gpus = session.exec(stmt).all()  # was: session.execute()

# Line 200: Fixed GPU details booking query
booking = session.exec(select(GPUBooking).where(...)).first()  # was: session.execute()

# Line 292: Fixed GPU release booking query
booking = session.exec(select(GPUBooking).where(...)).first()  # was: session.execute()

2. Fixed Booking Creation

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Line 259: Added explicit status field
booking = GPUBooking(
    gpu_id=gpu_id,
    job_id=request.job_id,
    duration_hours=request.duration_hours,
    total_cost=total_cost,
    start_time=start_time,
    end_time=end_time,
    status="active"  # ADDED: Explicit status
)
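How the booking fields relate to each other can be sketched without a database. This is an illustration only: `build_booking_fields`, the `BookingFields` container, and the `price_per_hour` parameter are assumptions, and the actual cost formula in `marketplace_gpu.py` may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BookingFields:
    gpu_id: str
    job_id: Optional[str]
    duration_hours: float
    total_cost: float
    start_time: datetime
    end_time: datetime
    status: str = "active"  # set explicitly so later queries can filter on it


def build_booking_fields(
    gpu_id: str,
    job_id: Optional[str],
    duration_hours: float,
    price_per_hour: float,  # hypothetical pricing input
    start_time: Optional[datetime] = None,
) -> BookingFields:
    """Derive end_time and total_cost from the requested duration."""
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    start = start_time or datetime.now(timezone.utc)
    return BookingFields(
        gpu_id=gpu_id,
        job_id=job_id,
        duration_hours=duration_hours,
        total_cost=price_per_hour * duration_hours,
        start_time=start,
        end_time=start + timedelta(hours=duration_hours),
        status="active",
    )


demo = build_booking_fields(
    "gpu_demo", "job_1", 1.5, 0.5, datetime(2026, 1, 1, tzinfo=timezone.utc)
)
```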

3. Improved Release Logic

File: /apps/coordinator-api/src/app/routers/marketplace_gpu.py

Changes Made:

# Lines 286-293: Added graceful handling for already available GPUs
if gpu.status != "booked":
    return {
        "status": "already_available", 
        "gpu_id": gpu_id,
        "message": f"GPU {gpu_id} is already available",
    }
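The release flow around this guard can be sketched as a pure function, which makes the state transitions easy to test without the API. The `Gpu` and `Booking` classes, the `"released"` response, and the `"cancelled"` booking status are hypothetical; only the `already_available` branch mirrors the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gpu:
    id: str
    status: str  # "booked" or "available"


@dataclass
class Booking:
    gpu_id: str
    status: str = "active"


def release_gpu(gpu: Gpu, booking: Optional[Booking]) -> dict:
    """Release a GPU, tolerating a missing booking record (cleanup case)."""
    if gpu.status != "booked":
        return {
            "status": "already_available",
            "gpu_id": gpu.id,
            "message": f"GPU {gpu.id} is already available",
        }
    if booking is not None:
        booking.status = "cancelled"  # hypothetical terminal status
    gpu.status = "available"
    return {"status": "released", "gpu_id": gpu.id}


gpu = Gpu(id="gpu_demo", status="booked")
booking = Booking(gpu_id="gpu_demo")
first_result = release_gpu(gpu, booking)
second_result = release_gpu(gpu, booking)  # second call hits the guard
```

Calling the function twice shows that a repeated release is answered with a normal response instead of an error.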

🧪 TESTING RESULTS

Before Fixes:

❌ GPU Release: HTTP 500 Internal Server Error
❌ Error: Failed to release GPU: 500
❌ GPU Status: Stuck as "booked"
❌ Booking Records: Missing or inconsistent

After Fixes:

❌ GPU Release: Still returning HTTP 500
❌ Error: Failed to release GPU: 500  
❌ GPU Status: Still showing as "booked"
❌ Issue: Persists despite fixes

🔍 INVESTIGATION FINDINGS

Database Issues:

In-memory SQLite: Database resets on coordinator restart
Table Creation: GPURegistry table not persisting
Data Loss: Fake GPUs reappear after restart

API Endpoints Affected:

POST /v1/marketplace/gpu/{gpu_id}/release - Primary issue
GET /v1/marketplace/gpu/list - Shows inconsistent data
POST /v1/marketplace/gpu/{gpu_id}/book - Creates incomplete bookings
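For reproducing the problem against these endpoints, a small request builder may help. This is a sketch using only the standard library; `BASE_URL` is an assumed local address, and the JSON body fields follow the `request.job_id` and `request.duration_hours` attributes seen in the booking code, so the real request schema may contain more.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: adjust to the coordinator's address


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for a coordinator endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def list_request() -> urllib.request.Request:
    return build_request("GET", "/v1/marketplace/gpu/list")


def book_request(gpu_id: str, duration_hours: float, job_id: Optional[str] = None) -> urllib.request.Request:
    payload = {"duration_hours": duration_hours, "job_id": job_id}
    return build_request("POST", f"/v1/marketplace/gpu/{gpu_id}/book", payload)


def release_request(gpu_id: str) -> urllib.request.Request:
    return build_request("POST", f"/v1/marketplace/gpu/{gpu_id}/release")


release = release_request("gpu_c5be877c")
book = book_request("gpu_c5be877c", 1)
```

Passing any of these objects to `urllib.request.urlopen` sends the request; an HTTP 500 surfaces as `urllib.error.HTTPError`.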

Service Architecture Issues:

Multiple coordinator processes running
Database connection inconsistencies
Session management problems

🛠️ ADDITIONAL FIXES NEEDED

1. Database Persistence

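# Need to switch from in-memory to persistent SQLite
from sqlmodel import create_engine

engine = create_engine(
    "sqlite:///aitbc_coordinator.db",  # Persistent file, resolved relative to the working directory
    connect_args={"check_same_thread": False},  # allow the connection to be used across threads
    echo=False
)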
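Why the in-memory database loses its tables can be shown with the standard `sqlite3` module alone. The sketch below creates a table on one connection and checks whether a second connection still sees it; the `gpu` table and the `survives_reconnect` helper are illustrative names only.

```python
import os
import sqlite3
import tempfile


def survives_reconnect(db_target: str) -> bool:
    """Create a table on one connection, then look for it from a new one."""
    first = sqlite3.connect(db_target)
    first.execute("CREATE TABLE gpu (id TEXT PRIMARY KEY)")
    first.execute("INSERT INTO gpu VALUES ('gpu_demo')")
    first.commit()
    first.close()

    second = sqlite3.connect(db_target)
    try:
        found = second.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'gpu'"
        ).fetchone()
        return found is not None
    finally:
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    file_result = survives_reconnect(os.path.join(tmp, "demo.db"))

# Every new ":memory:" connection starts with an empty database.
memory_result = survives_reconnect(":memory:")
```

The file-backed database keeps the table across connections, while the in-memory one starts empty each time, which matches the data loss seen after a coordinator restart.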

2. Service Management

# Need to properly manage single coordinator instance
systemctl stop aitbc-coordinator
systemctl start aitbc-coordinator
systemctl status aitbc-coordinator
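In addition to managing the service through systemd, the process itself could refuse to start a second instance. The following is one possible guard, assuming a POSIX system; `acquire_instance_lock` and the lock file location are hypothetical and not part of the coordinator today.

```python
import fcntl
import os
import tempfile


def acquire_instance_lock(lock_path: str):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep this object alive for the lifetime of the process


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "coordinator.lock")
    first = acquire_instance_lock(lock_path)   # first instance gets the lock
    second = acquire_instance_lock(lock_path)  # second attempt is rejected
    first.close()                              # closing the file releases the lock
    third = acquire_instance_lock(lock_path)
    third_acquired = third is not None
    third.close()
```

The lock is released automatically when the process exits, so a crashed coordinator does not leave a stale lock behind.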

3. Fake GPU Cleanup

# Need direct database cleanup script
# Remove fake RTX-4090 entries
# Keep only legitimate GPUs
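A cleanup script along these lines could look like the sketch below. It is built on assumptions: the table name `gpuregistry`, the `id` and `model` columns, and the idea of keeping an explicit allowlist of legitimate GPU ids are all hypothetical and should be checked against the real schema first.

```python
import sqlite3
from typing import List


def remove_fake_gpus(conn: sqlite3.Connection, legit_ids: List[str], table: str = "gpuregistry") -> int:
    """Delete RTX-4090 rows whose id is not on the allowlist; return the number removed."""
    placeholders = ",".join("?" for _ in legit_ids)
    # The table name cannot be bound as a parameter, so it must come from trusted code.
    sql = f"DELETE FROM {table} WHERE model = ? AND id NOT IN ({placeholders})"
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(sql, ["RTX-4090", *legit_ids])
    return cursor.rowcount


# Demonstration against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gpuregistry (id TEXT PRIMARY KEY, model TEXT)")
conn.executemany(
    "INSERT INTO gpuregistry VALUES (?, ?)",
    [("gpu_real", "RTX-4090"), ("gpu_fake1", "RTX-4090"), ("gpu_other", "A100")],
)
conn.commit()

removed = remove_fake_gpus(conn, ["gpu_real"])
remaining = [row[0] for row in conn.execute("SELECT id FROM gpuregistry ORDER BY id")]
conn.close()
```

Stopping the coordinator and backing up `aitbc_coordinator.db` before running anything like this against the real file is advisable.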

📋 CURRENT STATUS

✅ Fixed:

SQLModel session method calls (3 instances)
Booking creation with explicit status
Improved release error handling
Syntax errors resolved

❌ Remaining Issues:

HTTP 500 error persists
Database persistence problems
Fake GPU entries reappearing
Service restart issues

🔄 Next Steps:

Database Migration: Switch to persistent storage
Service Cleanup: Ensure single coordinator instance
Direct Database Fix: Manual cleanup of fake entries
End-to-End Test: Verify complete booking/release cycle

💡 RECOMMENDATIONS

Immediate Actions:

Stop All Coordinator Processes: pkill -f coordinator
Use Persistent Database: Modify database.py
Clean Database Directly: Remove fake entries
Start Fresh Service: Single instance only

Long-term Solutions:

Database Migration: PostgreSQL for production
Service Management: Proper systemd configuration
API Testing: Comprehensive endpoint testing
Monitoring: Service health checks

🎯 SUCCESS METRICS

Expected Results Once Fixed:

aitbc marketplace gpu release gpu_c5be877c
# Expected: ✅ GPU released successfully

aitbc marketplace gpu list
# Expected: GPU status = "available"

aitbc marketplace gpu book gpu_c5be877c --hours 1
# Expected: ✅ GPU booked successfully

📝 CONCLUSION

The GPU release issue has been partially fixed with SQLModel method corrections and improved error handling, but the core database persistence and service management issues remain.

Key fixes applied:

✅ SQLModel session methods corrected
✅ Booking creation improved
✅ Release logic enhanced
✅ Syntax errors resolved

Remaining work needed:

❌ Database persistence implementation
❌ Service process cleanup
❌ Fake GPU data removal
❌ End-to-end testing validation

The foundation is in place, but the database and service issues must be resolved for a complete fix.
