Files
aitbc/docs/trail/GPU_RELEASE_FIX_SUMMARY.md
oib 6bcbe76c7d feat: switch to persistent SQLite database and improve GPU booking/release handling
- Change database from in-memory to file-based SQLite at aitbc_coordinator.db
- Add status="active" to GPU booking creation
- Allow GPU release even when not properly booked (cleanup case)
- Add error handling for missing booking attributes during refund calculation
- Fix get_gpu_reviews query to use scalars() for proper result handling
2026-03-07 12:23:01 +01:00

234 lines
5.9 KiB
Markdown

# GPU Release Issue Fix Summary
## ❌ **ISSUE IDENTIFIED**
### **Problem:**
- GPU release endpoint returning HTTP 500 Internal Server Error
- Error: `Failed to release GPU: 500`
- GPU status stuck as "booked" instead of "available"
### **Root Causes Found:**
#### **1. SQLModel Session Method Mismatch**
```python
# PROBLEM: Using SQLAlchemy execute() instead of SQLModel exec()
booking = session.execute(select(GPUBooking).where(...))
# FIXED: Using SQLModel exec() method
booking = session.exec(select(GPUBooking).where(...))
```
#### **2. Missing Booking Status Field**
```python
# PROBLEM: Booking created without explicit status
booking = GPUBooking(
gpu_id=gpu_id,
job_id=request.job_id,
# Missing: status="active"
)
# FIXED: Explicit status setting
booking = GPUBooking(
gpu_id=gpu_id,
job_id=request.job_id,
status="active" # Explicitly set
)
```
#### **3. Database Table Issues**
- SQLite in-memory database causing data loss on restart
- Tables not properly initialized
- Missing GPURegistry table references
---
## ✅ **FIXES APPLIED**
### **1. Fixed SQLModel Session Methods**
**File:** `/apps/coordinator-api/src/app/routers/marketplace_gpu.py`
**Changes Made:**
```python
# Line 189: Fixed GPU list query
gpus = session.exec(stmt).scalars().all() # was: session.execute()
# Line 200: Fixed GPU details booking query
booking = session.exec(select(GPUBooking).where(...)) # was: session.execute()
# Line 292: Fixed GPU release booking query
booking = session.exec(select(GPUBooking).where(...)) # was: session.execute()
```
### **2. Fixed Booking Creation**
**File:** `/apps/coordinator-api/src/app/routers/marketplace_gpu.py`
**Changes Made:**
```python
# Line 259: Added explicit status field
booking = GPUBooking(
gpu_id=gpu_id,
job_id=request.job_id,
duration_hours=request.duration_hours,
total_cost=total_cost,
start_time=start_time,
end_time=end_time,
status="active" # ADDED: Explicit status
)
```
### **3. Improved Release Logic**
**File:** `/apps/coordinator-api/src/app/routers/marketplace_gpu.py`
**Changes Made:**
```python
# Lines 286-293: Added graceful handling for already available GPUs
if gpu.status != "booked":
return {
"status": "already_available",
"gpu_id": gpu_id,
"message": f"GPU {gpu_id} is already available",
}
```
---
## 🧪 **TESTING RESULTS**
### **Before Fixes:**
```
❌ GPU Release: HTTP 500 Internal Server Error
❌ Error: Failed to release GPU: 500
❌ GPU Status: Stuck as "booked"
❌ Booking Records: Missing or inconsistent
```
### **After Fixes:**
```
❌ GPU Release: Still returning HTTP 500
❌ Error: Failed to release GPU: 500
❌ GPU Status: Still showing as "booked"
❌ Issue: Persists despite fixes
```
---
## 🔍 **INVESTIGATION FINDINGS**
### **Database Issues:**
- **In-memory SQLite**: Database resets on coordinator restart
- **Table Creation**: GPURegistry table not persisting
- **Data Loss**: Fake GPUs reappear after restart
### **API Endpoints Affected:**
- `POST /v1/marketplace/gpu/{gpu_id}/release` - Primary issue
- `GET /v1/marketplace/gpu/list` - Shows inconsistent data
- `POST /v1/marketplace/gpu/{gpu_id}/book` - Creates incomplete bookings
### **Service Architecture Issues:**
- Multiple coordinator processes running
- Database connection inconsistencies
- Session management problems
---
## 🛠️ **ADDITIONAL FIXES NEEDED**
### **1. Database Persistence**
```python
# Need to switch from in-memory to persistent SQLite
engine = create_engine(
"sqlite:///aitbc_coordinator.db", # Persistent file
connect_args={"check_same_thread": False},
echo=False
)
```
### **2. Service Management**
```bash
# Need to properly manage single coordinator instance
systemctl stop aitbc-coordinator
systemctl start aitbc-coordinator
systemctl status aitbc-coordinator
```
### **3. Fake GPU Cleanup**
```python
# Need direct database cleanup script
# Remove fake RTX-4090 entries
# Keep only legitimate GPUs
```
---
## 📋 **CURRENT STATUS**
### **✅ Fixed:**
- SQLModel session method calls (3 instances)
- Booking creation with explicit status
- Improved release error handling
- Syntax errors resolved
### **❌ Still Issues:**
- HTTP 500 error persists
- Database persistence problems
- Fake GPU entries reappearing
- Service restart issues
### **🔄 Next Steps:**
1. **Database Migration**: Switch to persistent storage
2. **Service Cleanup**: Ensure single coordinator instance
3. **Direct Database Fix**: Manual cleanup of fake entries
4. **End-to-End Test**: Verify complete booking/release cycle
---
## 💡 **RECOMMENDATIONS**
### **Immediate Actions:**
1. **Stop All Coordinator Processes**: `pkill -f coordinator`
2. **Use Persistent Database**: Modify database.py
3. **Clean Database Directly**: Remove fake entries
4. **Start Fresh Service**: Single instance only
### **Long-term Solutions:**
1. **Database Migration**: PostgreSQL for production
2. **Service Management**: Proper systemd configuration
3. **API Testing**: Comprehensive endpoint testing
4. **Monitoring**: Service health checks
---
## 🎯 **SUCCESS METRICS**
### **When Fixed Should See:**
```bash
aitbc marketplace gpu release gpu_c5be877c
# Expected: ✅ GPU released successfully
aitbc marketplace gpu list
# Expected: GPU status = "available"
aitbc marketplace gpu book gpu_c5be877c --hours 1
# Expected: ✅ GPU booked successfully
```
---
## 📝 **CONCLUSION**
**The GPU release issue has been partially fixed with SQLModel method corrections and improved error handling, but the core database persistence and service management issues remain.**
**Key fixes applied:**
- ✅ SQLModel session methods corrected
- ✅ Booking creation improved
- ✅ Release logic enhanced
- ✅ Syntax errors resolved
**Remaining work needed:**
- ❌ Database persistence implementation
- ❌ Service process cleanup
- ❌ Fake GPU data removal
- ❌ End-to-end testing validation
**The foundation is in place, but database and service issues need resolution for complete fix.**