Files
aitbc/docs/8_development/EVENT_DRIVEN_CACHE_STRATEGY.md
AITBC System b033923756 chore: normalize file permissions across repository
- Remove executable permissions from configuration files (.editorconfig, .env.example, .gitignore)
- Remove executable permissions from documentation files (README.md, LICENSE, SECURITY.md)
- Remove executable permissions from web assets (HTML, CSS, JS files)
- Remove executable permissions from data files (JSON, SQL, YAML, requirements.txt)
- Remove executable permissions from source code files across all apps
- Add executable permissions to Python
2026-03-08 11:26:18 +01:00

459 lines
12 KiB
Markdown

# Event-Driven Redis Caching Strategy for Global Edge Nodes
## Overview
This document describes the implementation of an event-driven Redis caching strategy for the AITBC platform, specifically designed to handle distributed edge nodes with immediate propagation of GPU availability and pricing changes on booking/cancellation events.
## Architecture
### Multi-Tier Caching
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Edge Node 1 │ │ Edge Node 2 │ │ Edge Node N │
│ │ │ │ │ │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ L1 Cache │ │ │ │ L1 Cache │ │ │ │ L1 Cache │ │
│ │ (Memory) │ │ │ │ (Memory) │ │ │ │ (Memory) │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
└─────────┬───────┘ └─────────┬───────┘ └─────────┬───────┘
│ │ │
└──────────────────────┼──────────────────────┘
┌─────────────┴─────────────┐
│ Redis Cluster │
│ (L2 Distributed) │
│ │
│ ┌─────────────────────┐ │
│ │ Pub/Sub Channel │ │
│ │ Cache Invalidation │ │
│ └─────────────────────┘ │
└─────────────────────────┘
```
### Event-Driven Invalidation Flow
```
Booking/Cancellation Event
Event Publisher
Redis Pub/Sub
Event Subscribers
(All Edge Nodes)
Cache Invalidation
(L1 + L2 Cache)
Immediate Propagation
```
## Key Features
### 1. Event-Driven Cache Invalidation
**Problem Solved**: TTL-only caching causes stale data propagation delays across edge nodes.
**Solution**: Real-time event-driven invalidation using Redis pub/sub for immediate propagation.
**Critical Data Types**:
- GPU availability status
- GPU pricing information
- Order book data
- Provider status
### 2. Multi-Tier Cache Architecture
**L1 Cache (Memory)**:
- Fastest access (sub-millisecond)
- Limited size (1000-5000 entries)
- Shorter TTL (30-60 seconds)
- Immediate invalidation on events
**L2 Cache (Redis)**:
- Distributed across all edge nodes
- Larger capacity (GBs)
- Longer TTL (5-60 minutes)
- Event-driven updates
### 3. Distributed Edge Node Coordination
**Node Identification**:
- Unique node IDs for each edge node
- Regional grouping for optimization
- Network tier classification (edge/regional/global)
**Event Propagation**:
- Pub/sub for real-time events
- Event queuing for reliability
- Automatic failover and recovery
## Implementation Details
### Cache Event Types
```python
class CacheEventType(Enum):
GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
PRICING_UPDATED = "pricing_updated"
BOOKING_CREATED = "booking_created"
BOOKING_CANCELLED = "booking_cancelled"
PROVIDER_STATUS_CHANGED = "provider_status_changed"
MARKET_STATS_UPDATED = "market_stats_updated"
ORDER_BOOK_UPDATED = "order_book_updated"
MANUAL_INVALIDATION = "manual_invalidation"
```
### Cache Configurations
| Data Type | TTL | Event-Driven | Critical | Memory Limit |
|-----------|-----|--------------|----------|--------------|
| GPU Availability | 30s | ✅ | ✅ | 100MB |
| GPU Pricing | 60s | ✅ | ✅ | 50MB |
| Order Book | 5s | ✅ | ✅ | 200MB |
| Provider Status | 120s | ✅ | ❌ | 50MB |
| Market Stats | 300s | ✅ | ❌ | 100MB |
| Historical Data | 3600s | ❌ | ❌ | 500MB |
### Event Structure
```python
@dataclass
class CacheEvent:
event_type: CacheEventType
resource_id: str
data: Dict[str, Any]
timestamp: float
source_node: str
event_id: str
affected_namespaces: List[str]
```
## Usage Examples
### Basic Cache Operations
```python
from aitbc_cache import init_marketplace_cache, get_marketplace_cache
# Initialize cache manager
cache_manager = await init_marketplace_cache(
redis_url="redis://redis-cluster:6379/0",
node_id="edge_node_us_east_1",
region="us-east"
)
# Get GPU availability
gpus = await cache_manager.get_gpu_availability(
region="us-east",
gpu_type="RTX 3080"
)
# Update GPU status (triggers event)
await cache_manager.update_gpu_status("gpu_123", "busy")
```
### Booking Operations with Cache Updates
```python
# Create booking (automatically updates caches)
booking = BookingInfo(
booking_id="booking_456",
gpu_id="gpu_123",
user_id="user_789",
start_time=datetime.utcnow(),
end_time=datetime.utcnow() + timedelta(hours=2),
status="active",
total_cost=0.2
)
success = await cache_manager.create_booking(booking)
# This triggers:
# 1. GPU availability update
# 2. Pricing recalculation
# 3. Order book invalidation
# 4. Market stats update
# 5. Event publishing to all nodes
```
### Event-Driven Pricing Updates
```python
# Update pricing (immediately propagated)
await cache_manager.update_gpu_pricing("RTX 3080", 0.15, "us-east")
# All edge nodes receive this event instantly
# and invalidate their pricing caches
```
## Deployment Configuration
### Environment Variables
```bash
# Redis Configuration
REDIS_HOST=redis-cluster.internal
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=your_redis_password
REDIS_SSL=true
REDIS_MAX_CONNECTIONS=50
# Edge Node Configuration
EDGE_NODE_ID=edge_node_us_east_1
EDGE_NODE_REGION=us-east
EDGE_NODE_DATACENTER=dc1
EDGE_NODE_CACHE_TIER=edge
# Cache Configuration
CACHE_L1_SIZE=1000
CACHE_ENABLE_EVENT_DRIVEN=true
CACHE_ENABLE_METRICS=true
CACHE_HEALTH_CHECK_INTERVAL=30
# Security
CACHE_ENABLE_TLS=true
CACHE_REQUIRE_AUTH=true
CACHE_AUTH_TOKEN=your_auth_token
```
### Redis Cluster Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
redis-master:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --appendonly yes --cluster-enabled yes
redis-replica-1:
image: redis:7-alpine
ports:
- "6380:6379"
command: redis-server --appendonly yes --cluster-enabled yes
redis-replica-2:
image: redis:7-alpine
ports:
- "6381:6379"
command: redis-server --appendonly yes --cluster-enabled yes
```
## Performance Optimization
### Cache Hit Ratios
**Target Performance**:
- L1 Cache Hit Ratio: >80%
- L2 Cache Hit Ratio: >95%
- Event Propagation Latency: <100ms
- Total Cache Response Time: <5ms
### Optimization Strategies
1. **L1 Cache Sizing**:
- Edge nodes: 500 entries (faster lookup)
- Regional nodes: 2000 entries (better coverage)
- Global nodes: 5000 entries (maximum coverage)
2. **Event Processing**:
- Batch event processing for high throughput
- Event deduplication to prevent storms
- Priority queues for critical events
3. **Memory Management**:
- LFU eviction for frequently accessed data
- Time-based expiration for stale data
- Memory pressure monitoring
## Monitoring and Observability
### Cache Metrics
```python
# Get cache statistics
stats = await cache_manager.get_cache_stats()
# Key metrics:
# - cache_hits / cache_misses
# - events_processed
# - invalidations
# - l1_cache_size
# - redis_memory_used_mb
```
### Health Checks
```python
# Comprehensive health check
health = await cache_manager.health_check()
# Health indicators:
# - redis_connected
# - pubsub_active
# - event_queue_size
# - last_event_age
```
### Alerting Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| Cache Hit Ratio | <70% | <50% |
| Event Queue Size | >1000 | >5000 |
| Event Latency | >500ms | >2000ms |
| Redis Memory | >80% | >95% |
| Connection Failures | >5/min | >20/min |
## Security Considerations
### Network Security
1. **TLS Encryption**: All Redis connections use TLS
2. **Authentication**: Redis AUTH tokens required
3. **Network Isolation**: Redis cluster in private VPC
4. **Access Control**: IP whitelisting for edge nodes
### Data Security
1. **Sensitive Data**: No private keys or passwords cached
2. **Data Encryption**: At-rest encryption for Redis
3. **Access Logging**: All cache operations logged
4. **Data Retention**: Automatic cleanup of old data
## Troubleshooting
### Common Issues
1. **Stale Cache Data**:
- Check event propagation
- Verify pub/sub connectivity
- Review event queue size
2. **High Memory Usage**:
- Monitor L1 cache size
- Check TTL configurations
- Review eviction policies
3. **Slow Performance**:
- Check Redis connection pool
- Monitor network latency
- Review cache hit ratios
### Debug Commands
```python
# Check cache health
health = await cache_manager.health_check()
print(f"Cache status: {health['status']}")
# Check event processing
stats = await cache_manager.get_cache_stats()
print(f"Events processed: {stats['events_processed']}")
# Manual cache invalidation
await cache_manager.invalidate_cache('gpu_availability', reason='debug')
```
## Best Practices
### 1. Cache Key Design
- Use consistent naming conventions
- Include relevant parameters in key
- Avoid key collisions
- Use appropriate TTL values
### 2. Event Design
- Include all necessary context
- Use unique event IDs
- Timestamp all events
- Handle event idempotency
### 3. Error Handling
- Graceful degradation on Redis failures
- Retry logic for transient errors
- Fallback to database when needed
- Comprehensive error logging
### 4. Performance Optimization
- Batch operations when possible
- Use connection pooling
- Monitor memory usage
- Optimize serialization
## Migration Guide
### From TTL-Only Caching
1. **Phase 1**: Deploy event-driven cache alongside existing cache
2. **Phase 2**: Enable event-driven invalidation for critical data
3. **Phase 3**: Migrate all data types to event-driven
4. **Phase 4**: Remove old TTL-only cache
### Configuration Migration
```python
# Old configuration
cache_ttl = {
'gpu_availability': 30,
'gpu_pricing': 60
}
# New configuration
cache_configs = {
'gpu_availability': CacheConfig(
namespace='gpu_avail',
ttl_seconds=30,
event_driven=True,
critical_data=True
),
'gpu_pricing': CacheConfig(
namespace='gpu_pricing',
ttl_seconds=60,
event_driven=True,
critical_data=True
)
}
```
## Future Enhancements
### Planned Features
1. **Intelligent Caching**: ML-based cache preloading
2. **Adaptive TTL**: Dynamic TTL based on access patterns
3. **Multi-Region Replication**: Cross-region cache synchronization
4. **Cache Analytics**: Advanced usage analytics and optimization
### Scalability Improvements
1. **Sharding**: Horizontal scaling of cache data
2. **Compression**: Data compression for memory efficiency
3. **Tiered Storage**: SSD/HDD tiering for large datasets
4. **Edge Computing**: Push cache closer to users
## Conclusion
The event-driven Redis caching strategy provides:
- **Immediate Propagation**: Sub-100ms event propagation across all edge nodes
- **High Performance**: Multi-tier caching with >95% hit ratios
- **Scalability**: Distributed architecture supporting global edge deployment
- **Reliability**: Automatic failover and recovery mechanisms
- **Security**: Enterprise-grade security with TLS and authentication
This system ensures that GPU availability and pricing changes are immediately propagated to all edge nodes, eliminating stale data issues and providing a consistent user experience across the global AITBC platform.