# Event-Driven Redis Caching Strategy for Global Edge Nodes

## Overview

This document describes an event-driven Redis caching strategy for the AITBC platform. It is designed for distributed edge nodes and propagates GPU availability and pricing changes to every node as soon as a booking or cancellation event occurs.

## Architecture

### Multi-Tier Caching

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Edge Node 1   │    │   Edge Node 2   │    │   Edge Node N   │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │  L1 Cache   │ │    │ │  L1 Cache   │ │    │ │  L1 Cache   │ │
│ │  (Memory)   │ │    │ │  (Memory)   │ │    │ │  (Memory)   │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │       Redis Cluster       │
                   │     (L2 Distributed)      │
                   │                           │
                   │  ┌─────────────────────┐  │
                   │  │   Pub/Sub Channel   │  │
                   │  │ Cache Invalidation  │  │
                   │  └─────────────────────┘  │
                   └───────────────────────────┘
```

### Event-Driven Invalidation Flow

```
Booking/Cancellation Event
        │
        ▼
Event Publisher
        │
        ▼
Redis Pub/Sub
        │
        ▼
Event Subscribers (All Edge Nodes)
        │
        ▼
Cache Invalidation (L1 + L2 Cache)
        │
        ▼
Immediate Propagation
```

## Key Features

### 1. Event-Driven Cache Invalidation

**Problem Solved**: With TTL-only caching, edge nodes keep serving stale data until each entry expires.

**Solution**: Real-time event-driven invalidation using Redis pub/sub, so changes reach every node without waiting for a TTL to elapse.

**Critical Data Types**:
- GPU availability status
- GPU pricing information
- Order book data
- Provider status

### 2. Multi-Tier Cache Architecture

**L1 Cache (Memory)**:
- Fastest access (sub-millisecond)
- Limited size (500-5000 entries, depending on node tier)
- Shorter TTL (30-60 seconds)
- Immediate invalidation on events

**L2 Cache (Redis)**:
- Distributed across all edge nodes
- Larger capacity (GBs)
- Longer TTL (5-60 minutes)
- Event-driven updates

### 3. Distributed Edge Node Coordination

**Node Identification**:
- Unique node IDs for each edge node
- Regional grouping for optimization
- Network tier classification (edge/regional/global)

**Event Propagation**:
- Pub/sub for real-time events
- Event queuing for reliability
- Automatic failover and recovery

## Implementation Details

### Cache Event Types

```python
from enum import Enum


class CacheEventType(Enum):
    GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
    PRICING_UPDATED = "pricing_updated"
    BOOKING_CREATED = "booking_created"
    BOOKING_CANCELLED = "booking_cancelled"
    PROVIDER_STATUS_CHANGED = "provider_status_changed"
    MARKET_STATS_UPDATED = "market_stats_updated"
    ORDER_BOOK_UPDATED = "order_book_updated"
    MANUAL_INVALIDATION = "manual_invalidation"
```

### Cache Configurations

| Data Type | TTL | Event-Driven | Critical | Memory Limit |
|-----------|-----|--------------|----------|--------------|
| GPU Availability | 30s | ✅ | ✅ | 100MB |
| GPU Pricing | 60s | ✅ | ✅ | 50MB |
| Order Book | 5s | ✅ | ✅ | 200MB |
| Provider Status | 120s | ✅ | ❌ | 50MB |
| Market Stats | 300s | ✅ | ❌ | 100MB |
| Historical Data | 3600s | ❌ | ❌ | 500MB |

### Event Structure

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CacheEvent:
    event_type: CacheEventType
    resource_id: str
    data: Dict[str, Any]
    timestamp: float
    source_node: str
    event_id: str
    affected_namespaces: List[str]
```

## Usage Examples

### Basic Cache Operations

```python
from aitbc_cache import init_marketplace_cache, get_marketplace_cache

# The calls below use await, so run them inside an async function
# (for example one started with asyncio.run).

# Initialize cache manager
# Use the rediss:// scheme instead when TLS is enabled (REDIS_SSL=true)
cache_manager = await init_marketplace_cache(
    redis_url="redis://redis-cluster:6379/0",
    node_id="edge_node_us_east_1",
    region="us-east"
)

# Get GPU availability
gpus = await cache_manager.get_gpu_availability(
    region="us-east",
    gpu_type="RTX 3080"
)

# Update GPU status (triggers event)
await cache_manager.update_gpu_status("gpu_123", "busy")
```

### Booking Operations with Cache Updates

```python
from datetime import datetime, timedelta, timezone

# BookingInfo comes from the platform code; its import path is not shown here.

# Create booking (automatically updates caches)
start = datetime.now(timezone.utc)  # one timestamp, so the booking lasts exactly 2 hours
booking = BookingInfo(
    booking_id="booking_456",
    gpu_id="gpu_123",
    user_id="user_789",
    start_time=start,
    end_time=start + timedelta(hours=2),
    status="active",
    total_cost=0.2
)

success = await cache_manager.create_booking(booking)
# This triggers:
# 1. GPU availability update
# 2. Pricing recalculation
# 3. Order book invalidation
# 4. Market stats update
# 5. Event publishing to all nodes
```

### Event-Driven Pricing Updates

```python
# Update pricing (propagated as soon as the event is published)
await cache_manager.update_gpu_pricing("RTX 3080", 0.15, "us-east")

# All subscribed edge nodes receive this event (target latency: <100ms)
# and invalidate their pricing caches
```

## Deployment Configuration

### Environment Variables

```bash
# Redis Configuration
REDIS_HOST=redis-cluster.internal
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=your_redis_password
REDIS_SSL=true
REDIS_MAX_CONNECTIONS=50

# Edge Node Configuration
EDGE_NODE_ID=edge_node_us_east_1
EDGE_NODE_REGION=us-east
EDGE_NODE_DATACENTER=dc1
EDGE_NODE_CACHE_TIER=edge

# Cache Configuration
CACHE_L1_SIZE=1000
CACHE_ENABLE_EVENT_DRIVEN=true
CACHE_ENABLE_METRICS=true
CACHE_HEALTH_CHECK_INTERVAL=30

# Security
CACHE_ENABLE_TLS=true
CACHE_REQUIRE_AUTH=true
CACHE_AUTH_TOKEN=your_auth_token
```

### Redis Cluster Setup

```yaml
# docker-compose.yml
# Note: --cluster-enabled only starts each node in cluster mode. The cluster
# itself must still be formed once the containers are running, for example
# with `redis-cli --cluster create`.
version: '3.8'
services:
  redis-master:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-1:
    image: redis:7-alpine
    ports:
      - "6380:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-2:
    image: redis:7-alpine
    ports:
      - "6381:6379"
    command: redis-server --appendonly yes --cluster-enabled yes
```

## Performance Optimization

### Cache Hit Ratios

**Target Performance**:
- L1 Cache Hit Ratio: >80%
- L2 Cache Hit Ratio: >95%
- Event Propagation Latency: <100ms
- Total Cache Response Time: <5ms

### Optimization Strategies

1. **L1 Cache Sizing**:
   - Edge nodes: 500 entries (faster lookup)
   - Regional nodes: 2000 entries (better coverage)
   - Global nodes: 5000 entries (maximum coverage)

2. **Event Processing**:
   - Batch event processing for high throughput
   - Event deduplication to prevent storms
   - Priority queues for critical events

3. **Memory Management**:
   - LFU eviction for frequently accessed data
   - Time-based expiration for stale data
   - Memory pressure monitoring

## Monitoring and Observability

### Cache Metrics

```python
# Get cache statistics
stats = await cache_manager.get_cache_stats()

# Key metrics:
# - cache_hits / cache_misses
# - events_processed
# - invalidations
# - l1_cache_size
# - redis_memory_used_mb
```

### Health Checks

```python
# Comprehensive health check
health = await cache_manager.health_check()

# Health indicators:
# - redis_connected
# - pubsub_active
# - event_queue_size
# - last_event_age
```

### Alerting Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Cache Hit Ratio | <70% | <50% |
| Event Queue Size | >1000 | >5000 |
| Event Latency | >500ms | >2000ms |
| Redis Memory | >80% | >95% |
| Connection Failures | >5/min | >20/min |

## Security Considerations

### Network Security

1. **TLS Encryption**: All Redis connections use TLS
2. **Authentication**: Redis AUTH tokens required
3. **Network Isolation**: Redis cluster in private VPC
4. **Access Control**: IP whitelisting for edge nodes

### Data Security

1. **Sensitive Data**: No private keys or passwords cached
2. **Data Encryption**: At-rest encryption for Redis
3. **Access Logging**: All cache operations logged
4. **Data Retention**: Automatic cleanup of old data

## Troubleshooting

### Common Issues

1. **Stale Cache Data**:
   - Check event propagation
   - Verify pub/sub connectivity
   - Review event queue size

2. **High Memory Usage**:
   - Monitor L1 cache size
   - Check TTL configurations
   - Review eviction policies

3. **Slow Performance**:
   - Check Redis connection pool
   - Monitor network latency
   - Review cache hit ratios

### Debug Commands

```python
# Check cache health
health = await cache_manager.health_check()
print(f"Cache status: {health['status']}")

# Check event processing
stats = await cache_manager.get_cache_stats()
print(f"Events processed: {stats['events_processed']}")

# Manual cache invalidation
await cache_manager.invalidate_cache('gpu_availability', reason='debug')
```

## Best Practices

### 1. Cache Key Design

- Use consistent naming conventions
- Include relevant parameters in key
- Avoid key collisions
- Use appropriate TTL values

### 2. Event Design

- Include all necessary context
- Use unique event IDs
- Timestamp all events
- Handle event idempotency

### 3. Error Handling

- Graceful degradation on Redis failures
- Retry logic for transient errors
- Fallback to database when needed
- Comprehensive error logging

### 4. Performance Optimization

- Batch operations when possible
- Use connection pooling
- Monitor memory usage
- Optimize serialization

## Migration Guide

### From TTL-Only Caching

1. **Phase 1**: Deploy event-driven cache alongside existing cache
2. **Phase 2**: Enable event-driven invalidation for critical data
3. **Phase 3**: Migrate all data types to event-driven
4. **Phase 4**: Remove old TTL-only cache

### Configuration Migration

```python
# Old configuration
cache_ttl = {
    'gpu_availability': 30,
    'gpu_pricing': 60
}

# New configuration
cache_configs = {
    'gpu_availability': CacheConfig(
        namespace='gpu_avail',
        ttl_seconds=30,
        event_driven=True,
        critical_data=True
    ),
    'gpu_pricing': CacheConfig(
        namespace='gpu_pricing',
        ttl_seconds=60,
        event_driven=True,
        critical_data=True
    )
}
```

## Future Enhancements

### Planned Features

1. **Intelligent Caching**: ML-based cache preloading
2. **Adaptive TTL**: Dynamic TTL based on access patterns
3. **Multi-Region Replication**: Cross-region cache synchronization
4. **Cache Analytics**: Advanced usage analytics and optimization

### Scalability Improvements

1. **Sharding**: Horizontal scaling of cache data
2. **Compression**: Data compression for memory efficiency
3. **Tiered Storage**: SSD/HDD tiering for large datasets
4. **Edge Computing**: Push cache closer to users

## Conclusion

The event-driven Redis caching strategy provides:

- **Immediate Propagation**: Event propagation across all edge nodes, with a target latency under 100ms
- **High Performance**: Multi-tier caching, targeting an L2 hit ratio above 95%
- **Scalability**: Distributed architecture supporting global edge deployment
- **Reliability**: Automatic failover and recovery mechanisms
- **Security**: TLS-encrypted and authenticated Redis connections

With this system, GPU availability and pricing changes reach all edge nodes as they happen, which minimizes stale data and gives users a consistent experience across the global AITBC platform.
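
## Appendix: L1 Invalidation Sketch

To make the invalidation flow described in this document more concrete, here is a minimal, Redis-free sketch of how an edge node's L1 cache might combine a TTL fallback with event-driven invalidation and idempotent event handling. It is not the `aitbc_cache` implementation: the `L1Cache` class, its method names, and the trimmed-down `CacheEvent` are hypothetical, and it evicts with LRU rather than the LFU policy mentioned under Memory Management, purely to keep the example short.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Set


@dataclass
class CacheEvent:
    """Hypothetical subset of the event structure used in this document."""
    event_id: str
    resource_id: str
    affected_namespaces: List[str]
    timestamp: float = field(default_factory=time.time)


class L1Cache:
    """In-memory cache with a TTL, a size cap, and event-driven invalidation."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        # (namespace, key) -> (expires_at, value); order tracks recency of use
        self._entries: "OrderedDict[tuple, tuple]" = OrderedDict()
        # Assumption: unbounded here; a real node would cap or expire this set
        self._seen_event_ids: Set[str] = set()

    def set(self, namespace: str, key: str, value: Any) -> None:
        self._entries[(namespace, key)] = (self._clock() + self._ttl, value)
        self._entries.move_to_end((namespace, key))
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, namespace: str, key: str) -> Optional[Any]:
        item = self._entries.get((namespace, key))
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # TTL fallback: covers events that were never delivered
            del self._entries[(namespace, key)]
            return None
        self._entries.move_to_end((namespace, key))
        return value

    def handle_event(self, event: CacheEvent) -> int:
        """Drop all entries in the affected namespaces; return how many."""
        if event.event_id in self._seen_event_ids:
            return 0  # duplicate delivery, already applied
        self._seen_event_ids.add(event.event_id)
        doomed = [k for k in self._entries if k[0] in event.affected_namespaces]
        for k in doomed:
            del self._entries[k]
        return len(doomed)


if __name__ == "__main__":
    cache = L1Cache(max_entries=1000, ttl_seconds=30.0)
    cache.set("gpu_avail", "us-east:RTX 3080", ["gpu_123"])
    cache.set("gpu_pricing", "us-east:RTX 3080", 0.15)
    booking = CacheEvent("evt-1", "gpu_123", ["gpu_avail"])
    print("entries invalidated:", cache.handle_event(booking))
    print("availability after event:", cache.get("gpu_avail", "us-east:RTX 3080"))
    print("pricing after event:", cache.get("gpu_pricing", "us-east:RTX 3080"))
```

In a deployment, `handle_event` would be called from the node's pub/sub subscriber loop. Tracking `event_id` values is one simple way to satisfy the "handle event idempotency" practice, and the TTL check in `get` is what keeps data bounded in staleness if a pub/sub message is lost.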