# Event-Driven Redis Caching Strategy for Global Edge Nodes

## Overview

This document describes an event-driven Redis caching strategy for the AITBC platform. It is designed for distributed edge nodes and propagates GPU availability and pricing changes to every node as soon as a booking or cancellation event occurs.

## Architecture

### Multi-Tier Caching

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Edge Node 1   │    │   Edge Node 2   │    │   Edge Node N   │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │  L1 Cache   │ │    │ │  L1 Cache   │ │    │ │  L1 Cache   │ │
│ │  (Memory)   │ │    │ │  (Memory)   │ │    │ │  (Memory)   │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │       Redis Cluster       │
                   │     (L2 Distributed)      │
                   │                           │
                   │  ┌─────────────────────┐  │
                   │  │   Pub/Sub Channel   │  │
                   │  │ Cache Invalidation  │  │
                   │  └─────────────────────┘  │
                   └───────────────────────────┘
```

### Event-Driven Invalidation Flow

```
Booking/Cancellation Event
            │
            ▼
     Event Publisher
            │
            ▼
      Redis Pub/Sub
            │
            ▼
    Event Subscribers
    (All Edge Nodes)
            │
            ▼
   Cache Invalidation
     (L1 + L2 Cache)
            │
            ▼
  Immediate Propagation
```

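The flow above can be sketched without a Redis server. The example below is only an illustration: `InMemoryBus` and `EdgeNode` are hypothetical names, the bus stands in for a Redis pub/sub channel, and the channel name `cache_invalidation` is an assumption rather than something this document specifies.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for a Redis pub/sub channel (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: dict) -> int:
        # Deliver to every subscriber and return how many handlers received it,
        # similar in spirit to the receiver count that Redis PUBLISH reports.
        handlers = self._subscribers[channel]
        for handler in handlers:
            handler(message)
        return len(handlers)


class EdgeNode:
    """Edge node with a local L1 cache that is invalidated by events."""

    def __init__(self, node_id: str, bus: InMemoryBus) -> None:
        self.node_id = node_id
        self.l1: Dict[str, object] = {}
        bus.subscribe("cache_invalidation", self.on_event)

    def on_event(self, event: dict) -> None:
        # Drop the affected entry; the next read repopulates it from L2 or the database.
        self.l1.pop(event["resource_id"], None)


bus = InMemoryBus()
nodes = [EdgeNode(f"edge_{i}", bus) for i in range(3)]
for node in nodes:
    node.l1["gpu_123"] = {"status": "available"}

# A booking event invalidates gpu_123 on every subscribed node.
receivers = bus.publish(
    "cache_invalidation",
    {"event_type": "booking_created", "resource_id": "gpu_123"},
)
```

In a real deployment the bus would be a Redis channel and each node would run a subscriber loop. The handler logic (drop the entry, let the next read refill it) stays the same.
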
## Key Features

### 1. Event-Driven Cache Invalidation

**Problem Solved**: With TTL-only caching, edge nodes keep serving stale data until each entry expires, so a change can take up to one full TTL to propagate.

**Solution**: Event-driven invalidation over Redis pub/sub. Every node drops the affected entries as soon as the event arrives.

**Critical Data Types**:
- GPU availability status
- GPU pricing information
- Order book data
- Provider status

### 2. Multi-Tier Cache Architecture

**L1 Cache (Memory)**:
- Fastest access (sub-millisecond)
- Limited size (1000-5000 entries)
- Shorter TTL (30-60 seconds)
- Immediate invalidation on events

**L2 Cache (Redis)**:
- Distributed across all edge nodes
- Larger capacity (GBs)
- Longer TTL (5-60 minutes)
- Event-driven updates

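A minimal sketch of the two-tier lookup described above, under stated assumptions: `TwoTierCache` is a hypothetical class, a plain dictionary stands in for Redis as L2, and L1 eviction is simple first-in-first-out rather than the LFU policy mentioned later in this document. The clock is injectable so that expiry can be exercised deterministically.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TwoTierCache:
    """L1 in-memory cache with TTL in front of an L2 store (a dict stands in for Redis)."""

    def __init__(
        self,
        l1_ttl: float = 30.0,
        l1_max_entries: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.l1: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.l2: Dict[str, Any] = {}
        self.l1_ttl = l1_ttl
        self.l1_max_entries = l1_max_entries
        self.clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self.l1.get(key)
        if entry is not None:
            expires_at, value = entry
            if self.clock() < expires_at:
                return value  # L1 hit
            del self.l1[key]  # L1 entry expired
        if key in self.l2:
            value = self.l2[key]
            self._store_l1(key, value)  # promote the L2 hit into L1
            return value
        return None  # miss in both tiers; the caller falls back to the database

    def set(self, key: str, value: Any) -> None:
        self.l2[key] = value
        self._store_l1(key, value)

    def invalidate(self, key: str) -> None:
        # Called when an invalidation event arrives: drop the key from both tiers.
        self.l1.pop(key, None)
        self.l2.pop(key, None)

    def _store_l1(self, key: str, value: Any) -> None:
        if key not in self.l1 and len(self.l1) >= self.l1_max_entries:
            # Simplified eviction: remove the oldest inserted entry (FIFO).
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = (self.clock() + self.l1_ttl, value)
```

Reads try L1 first, then L2, and promote L2 hits into L1. An invalidation event removes the key from both tiers, so the next read has to go to the source of truth.
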
### 3. Distributed Edge Node Coordination

**Node Identification**:
- Unique node IDs for each edge node
- Regional grouping for optimization
- Network tier classification (edge/regional/global)

**Event Propagation**:
- Pub/sub for real-time events
- Event queuing for reliability
- Automatic failover and recovery

## Implementation Details

### Cache Event Types

```python
from enum import Enum


class CacheEventType(Enum):
    """Events that trigger cache invalidation or refresh."""

    GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
    PRICING_UPDATED = "pricing_updated"
    BOOKING_CREATED = "booking_created"
    BOOKING_CANCELLED = "booking_cancelled"
    PROVIDER_STATUS_CHANGED = "provider_status_changed"
    MARKET_STATS_UPDATED = "market_stats_updated"
    ORDER_BOOK_UPDATED = "order_book_updated"
    MANUAL_INVALIDATION = "manual_invalidation"
```

### Cache Configurations

| Data Type | TTL | Event-Driven | Critical | Memory Limit |
|-----------|-----|--------------|----------|--------------|
| GPU Availability | 30s | ✅ | ✅ | 100MB |
| GPU Pricing | 60s | ✅ | ✅ | 50MB |
| Order Book | 5s | ✅ | ✅ | 200MB |
| Provider Status | 120s | ✅ | ❌ | 50MB |
| Market Stats | 300s | ✅ | ❌ | 100MB |
| Historical Data | 3600s | ❌ | ❌ | 500MB |

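### Event Structure

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CacheEvent:
    event_type: CacheEventType      # what happened
    resource_id: str                # affected resource, e.g. a GPU ID
    data: Dict[str, Any]            # event payload
    timestamp: float                # event time in seconds
    source_node: str                # node that published the event
    event_id: str                   # unique ID, used for deduplication
    affected_namespaces: List[str]  # cache namespaces to invalidate
```
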
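Pub/sub messages travel as strings or bytes, so an event like the one above has to be serialized before it is published. The sketch below assumes JSON as the wire format, which this document does not specify. `encode_event` and `decode_event` are hypothetical helpers, and the enum is trimmed to two members to keep the example short.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Dict, List


class CacheEventType(Enum):
    # Trimmed copy of the enum above, for a self-contained example.
    GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
    BOOKING_CREATED = "booking_created"


@dataclass
class CacheEvent:
    event_type: CacheEventType
    resource_id: str
    data: Dict[str, Any]
    timestamp: float
    source_node: str
    event_id: str
    affected_namespaces: List[str]


def encode_event(event: CacheEvent) -> str:
    """Serialize an event to JSON (assumed wire format)."""
    payload = asdict(event)
    payload["event_type"] = event.event_type.value  # enums are not JSON-serializable
    return json.dumps(payload)


def decode_event(raw: str) -> CacheEvent:
    """Rebuild an event from its JSON form."""
    payload = json.loads(raw)
    payload["event_type"] = CacheEventType(payload["event_type"])
    return CacheEvent(**payload)


event = CacheEvent(
    event_type=CacheEventType.BOOKING_CREATED,
    resource_id="gpu_123",
    data={"status": "busy"},
    timestamp=1700000000.0,
    source_node="edge_node_us_east_1",
    event_id="evt-1",
    affected_namespaces=["gpu_avail"],
)
round_tripped = decode_event(encode_event(event))
```

Storing the enum by value keeps the payload readable for subscribers written in other languages.
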
## Usage Examples

### Basic Cache Operations

```python
import asyncio

from aitbc_cache import init_marketplace_cache, get_marketplace_cache


async def main() -> None:
    # Initialize the cache manager for this edge node
    cache_manager = await init_marketplace_cache(
        redis_url="redis://redis-cluster:6379/0",
        node_id="edge_node_us_east_1",
        region="us-east",
    )

    # Get GPU availability
    gpus = await cache_manager.get_gpu_availability(
        region="us-east",
        gpu_type="RTX 3080",
    )

    # Update GPU status (triggers an invalidation event)
    await cache_manager.update_gpu_status("gpu_123", "busy")


asyncio.run(main())
```

### Booking Operations with Cache Updates

```python
from datetime import datetime, timedelta

# Assumes `cache_manager` from the previous example and an async context.
# BookingInfo is the project's booking model (import omitted here).

# Create booking (automatically updates caches)
start = datetime.utcnow()
booking = BookingInfo(
    booking_id="booking_456",
    gpu_id="gpu_123",
    user_id="user_789",
    start_time=start,
    end_time=start + timedelta(hours=2),
    status="active",
    total_cost=0.2,
)

success = await cache_manager.create_booking(booking)
# This triggers:
# 1. GPU availability update
# 2. Pricing recalculation
# 3. Order book invalidation
# 4. Market stats update
# 5. Event publishing to all nodes
```

### Event-Driven Pricing Updates

```python
# Assumes `cache_manager` from the examples above and an async context.

# Update pricing (propagated immediately via pub/sub)
await cache_manager.update_gpu_pricing("RTX 3080", 0.15, "us-east")

# All subscribed edge nodes receive this event
# and invalidate their pricing caches
```

## Deployment Configuration

### Environment Variables

```bash
# Redis Configuration
REDIS_HOST=redis-cluster.internal
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=your_redis_password
REDIS_SSL=true
REDIS_MAX_CONNECTIONS=50

# Edge Node Configuration
EDGE_NODE_ID=edge_node_us_east_1
EDGE_NODE_REGION=us-east
EDGE_NODE_DATACENTER=dc1
EDGE_NODE_CACHE_TIER=edge

# Cache Configuration
CACHE_L1_SIZE=1000
CACHE_ENABLE_EVENT_DRIVEN=true
CACHE_ENABLE_METRICS=true
CACHE_HEALTH_CHECK_INTERVAL=30

# Security
CACHE_ENABLE_TLS=true
CACHE_REQUIRE_AUTH=true
CACHE_AUTH_TOKEN=your_auth_token
```

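One way an edge node might turn these variables into typed settings is sketched below. `CacheSettings` and `load_settings` are hypothetical names, the defaults are assumptions, and only a subset of the variables is read. The password is deliberately left out of the URL so that it does not end up in logs. A real client would pass it separately.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings from environment variables."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CacheSettings:
    redis_url: str
    node_id: str
    region: str
    l1_size: int
    event_driven: bool


def load_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    # "rediss://" is the URL scheme for TLS-protected Redis connections.
    scheme = "rediss" if _as_bool(env.get("REDIS_SSL", "false")) else "redis"
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    db = int(env.get("REDIS_DB", "0"))
    return CacheSettings(
        redis_url=f"{scheme}://{host}:{port}/{db}",
        node_id=env["EDGE_NODE_ID"],  # required: fail fast if missing
        region=env["EDGE_NODE_REGION"],
        l1_size=int(env.get("CACHE_L1_SIZE", "1000")),
        event_driven=_as_bool(env.get("CACHE_ENABLE_EVENT_DRIVEN", "true")),
    )


settings = load_settings({
    "REDIS_HOST": "redis-cluster.internal",
    "REDIS_PORT": "6379",
    "REDIS_DB": "0",
    "REDIS_SSL": "true",
    "EDGE_NODE_ID": "edge_node_us_east_1",
    "EDGE_NODE_REGION": "us-east",
    "CACHE_L1_SIZE": "1000",
    "CACHE_ENABLE_EVENT_DRIVEN": "true",
})
```

Passing the environment as a mapping keeps the loader easy to test. Production code would simply call `load_settings()`.
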
### Redis Cluster Setup

```yaml
# docker-compose.yml
# Note: these nodes start in cluster mode, but they only form a cluster
# after `redis-cli --cluster create` has been run against them.
version: '3.8'
services:
  redis-master:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-1:
    image: redis:7-alpine
    ports:
      - "6380:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-2:
    image: redis:7-alpine
    ports:
      - "6381:6379"
    command: redis-server --appendonly yes --cluster-enabled yes
```

## Performance Optimization

### Cache Hit Ratios

**Target Performance**:
- L1 Cache Hit Ratio: >80%
- L2 Cache Hit Ratio: >95%
- Event Propagation Latency: <100ms
- Total Cache Response Time: <5ms

### Optimization Strategies

1. **L1 Cache Sizing**:
   - Edge nodes: 500 entries (faster lookup)
   - Regional nodes: 2000 entries (better coverage)
   - Global nodes: 5000 entries (maximum coverage)

2. **Event Processing**:
   - Batch event processing for high throughput
   - Event deduplication to prevent storms
   - Priority queues for critical events

3. **Memory Management**:
   - LFU eviction for frequently accessed data
   - Time-based expiration for stale data
   - Memory pressure monitoring

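Event deduplication, listed under Event Processing above, can be sketched as a bounded set of recently seen event IDs. `EventDeduplicator` is a hypothetical helper and the default limit of 10,000 tracked IDs is an assumption. The point is that a node processes each `event_id` at most once while memory use stays bounded.

```python
from collections import OrderedDict


class EventDeduplicator:
    """Remembers recently seen event IDs so repeated deliveries are ignored."""

    def __init__(self, max_tracked: int = 10_000) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_tracked = max_tracked

    def is_new(self, event_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = None
        if len(self._seen) > self._max_tracked:
            # Forget the oldest ID so memory use stays bounded.
            self._seen.popitem(last=False)
        return True


dedup = EventDeduplicator(max_tracked=2)
first = dedup.is_new("evt-a")
repeat = dedup.is_new("evt-a")
```

A subscriber would call `is_new(event.event_id)` before invalidating anything. The trade-off is that an ID older than the tracking window is treated as new again, so the window should comfortably exceed the expected redelivery horizon.
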
## Monitoring and Observability

### Cache Metrics

```python
# Assumes an initialized `cache_manager` and an async context.

# Get cache statistics
stats = await cache_manager.get_cache_stats()

# Key metrics:
# - cache_hits / cache_misses
# - events_processed
# - invalidations
# - l1_cache_size
# - redis_memory_used_mb
```

### Health Checks

```python
# Assumes an initialized `cache_manager` and an async context.

# Comprehensive health check
health = await cache_manager.health_check()

# Health indicators:
# - redis_connected
# - pubsub_active
# - event_queue_size
# - last_event_age
```

### Alerting Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Cache Hit Ratio | <70% | <50% |
| Event Queue Size | >1000 | >5000 |
| Event Latency | >500ms | >2000ms |
| Redis Memory | >80% | >95% |
| Connection Failures | >5/min | >20/min |

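The thresholds in the table can be evaluated with a small lookup, sketched below. The metric keys (`cache_hit_ratio_pct`, `event_queue_size`, and so on) and the `classify` function are hypothetical names chosen for this example. Only the numeric limits come from the table.

```python
from typing import Dict, Tuple

# metric -> (warning limit, critical limit, higher_is_worse)
THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "cache_hit_ratio_pct": (70.0, 50.0, False),
    "event_queue_size": (1000.0, 5000.0, True),
    "event_latency_ms": (500.0, 2000.0, True),
    "redis_memory_pct": (80.0, 95.0, True),
    "connection_failures_per_min": (5.0, 20.0, True),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical, higher_is_worse = THRESHOLDS[metric]
    if higher_is_worse:
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:
        # For ratios such as hit rate, lower readings are worse.
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "ok"


hit_ratio_level = classify("cache_hit_ratio_pct", 65.0)
queue_level = classify("event_queue_size", 5001)
```

The comparisons are strict, matching the `<` and `>` in the table, so a reading exactly on a limit does not raise the alert level.
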
## Security Considerations

### Network Security

1. **TLS Encryption**: All Redis connections use TLS
2. **Authentication**: Redis AUTH tokens required
3. **Network Isolation**: Redis cluster in private VPC
4. **Access Control**: IP whitelisting for edge nodes

### Data Security

1. **Sensitive Data**: No private keys or passwords cached
2. **Data Encryption**: At-rest encryption for Redis
3. **Access Logging**: All cache operations logged
4. **Data Retention**: Automatic cleanup of old data

## Troubleshooting

### Common Issues

1. **Stale Cache Data**:
   - Check event propagation
   - Verify pub/sub connectivity
   - Review event queue size

2. **High Memory Usage**:
   - Monitor L1 cache size
   - Check TTL configurations
   - Review eviction policies

3. **Slow Performance**:
   - Check Redis connection pool
   - Monitor network latency
   - Review cache hit ratios

### Debug Commands

```python
# Assumes an initialized `cache_manager` and an async context.

# Check cache health
health = await cache_manager.health_check()
print(f"Cache status: {health['status']}")

# Check event processing
stats = await cache_manager.get_cache_stats()
print(f"Events processed: {stats['events_processed']}")

# Manual cache invalidation
await cache_manager.invalidate_cache('gpu_availability', reason='debug')
```

## Best Practices

### 1. Cache Key Design

- Use consistent naming conventions
- Include relevant parameters in key
- Avoid key collisions
- Use appropriate TTL values

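A minimal sketch of a key builder that follows these guidelines. `build_cache_key` and the `namespace:name=value` layout are assumptions made for illustration, not the project's actual key format. Parameters are sorted so that the same logical query always maps to the same key.

```python
def _normalize(value: object) -> str:
    """Lowercase and replace spaces so equivalent inputs produce identical keys."""
    return str(value).strip().lower().replace(" ", "_")


def build_cache_key(namespace: str, **params: object) -> str:
    """Build a deterministic key such as 'gpu_avail:gpu_type=rtx_3080:region=us-east'."""
    # Sorting parameter names makes the key independent of argument order.
    parts = [namespace] + [
        f"{name}={_normalize(params[name])}" for name in sorted(params)
    ]
    return ":".join(parts)


key = build_cache_key("gpu_avail", region="us-east", gpu_type="RTX 3080")
```

This sketch does not escape `:` or `=` inside values. If such characters can appear, they would need escaping to rule out collisions.
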
### 2. Event Design

- Include all necessary context
- Use unique event IDs
- Timestamp all events
- Handle event idempotency

### 3. Error Handling

- Graceful degradation on Redis failures
- Retry logic for transient errors
- Fallback to database when needed
- Comprehensive error logging

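The retry and fallback points above can be sketched as follows. `read_with_fallback` is a hypothetical helper and the retry count and delays are assumptions. The built-in `ConnectionError` stands in for whatever connection error type the Redis client library raises. A real integration would catch that type instead.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def read_with_fallback(
    read_cache: Callable[[], T],
    read_database: Callable[[], T],
    retries: int = 2,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the cache with exponential backoff, then fall back to the database."""
    for attempt in range(retries + 1):
        try:
            return read_cache()
        except ConnectionError as exc:
            logger.warning("cache read failed (attempt %d): %s", attempt + 1, exc)
            if attempt < retries:
                # Back off before the next attempt: base_delay, then 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    # All cache attempts failed: degrade gracefully to the source of truth.
    return read_database()


def _always_down() -> str:
    raise ConnectionError("redis unavailable")


delays: list = []
result = read_with_fallback(_always_down, lambda: "from_db", sleep=delays.append)
```

With the defaults, a persistent outage costs three cache attempts and two short sleeps before the database is consulted.
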
### 4. Performance Optimization

- Batch operations when possible
- Use connection pooling
- Monitor memory usage
- Optimize serialization

## Migration Guide

### From TTL-Only Caching

1. **Phase 1**: Deploy event-driven cache alongside existing cache
2. **Phase 2**: Enable event-driven invalidation for critical data
3. **Phase 3**: Migrate all data types to event-driven
4. **Phase 4**: Remove old TTL-only cache

### Configuration Migration

```python
# Old configuration: TTL in seconds per data type
cache_ttl = {
    'gpu_availability': 30,
    'gpu_pricing': 60,
}

# New configuration
# CacheConfig is the project's cache configuration class (import omitted here).
cache_configs = {
    'gpu_availability': CacheConfig(
        namespace='gpu_avail',
        ttl_seconds=30,
        event_driven=True,
        critical_data=True,
    ),
    'gpu_pricing': CacheConfig(
        namespace='gpu_pricing',
        ttl_seconds=60,
        event_driven=True,
        critical_data=True,
    ),
}
```

## Future Enhancements

### Planned Features

1. **Intelligent Caching**: ML-based cache preloading
2. **Adaptive TTL**: Dynamic TTL based on access patterns
3. **Multi-Region Replication**: Cross-region cache synchronization
4. **Cache Analytics**: Advanced usage analytics and optimization

### Scalability Improvements

1. **Sharding**: Horizontal scaling of cache data
2. **Compression**: Data compression for memory efficiency
3. **Tiered Storage**: SSD/HDD tiering for large datasets
4. **Edge Computing**: Push cache closer to users

## Conclusion

The event-driven Redis caching strategy provides:

- **Immediate Propagation**: A target of sub-100ms event propagation across all edge nodes
- **High Performance**: Multi-tier caching with target hit ratios of >80% (L1) and >95% (L2)
- **Scalability**: Distributed architecture supporting global edge deployment
- **Reliability**: Automatic failover and recovery mechanisms
- **Security**: TLS encryption and authentication on all Redis connections

With this system, GPU availability and pricing changes reach every edge node as soon as the triggering event is published. Nodes no longer wait for a TTL to expire, which removes the main source of stale data and gives users a consistent view across the global AITBC platform.