feat: implement v0.2.0 release features - agent-first evolution
✅ v0.2 Release Preparation:
- Update version to 0.2.0 in pyproject.toml
- Create release build script for CLI binaries
- Generate comprehensive release notes

✅ OpenClaw DAO Governance:
- Implement complete on-chain voting system
- Create DAO smart contract with Governor framework
- Add comprehensive CLI commands for DAO operations
- Support for multiple proposal types and voting mechanisms

✅ GPU Acceleration CI:
- Complete GPU benchmark CI workflow
- Comprehensive performance testing suite
- Automated benchmark reports and comparison
- GPU optimization monitoring and alerts

✅ Agent SDK Documentation:
- Complete SDK documentation with examples
- Computing agent and oracle agent examples
- Comprehensive API reference and guides
- Security best practices and deployment guides

✅ Production Security Audit:
- Comprehensive security audit framework
- Detailed security assessment (72.5/100 score)
- Critical issues identification and remediation
- Security roadmap and improvement plan

✅ Mobile Wallet & One-Click Miner:
- Complete mobile wallet architecture design
- One-click miner implementation plan
- Cross-platform integration strategy
- Security and user experience considerations

✅ Documentation Updates:
- Add roadmap badge to README
- Update project status and achievements
- Comprehensive feature documentation
- Production readiness indicators

🚀 Ready for v0.2.0 release with agent-first architecture
docs/advanced/05_development/EVENT_DRIVEN_CACHE_STRATEGY.md
# Event-Driven Redis Caching Strategy for Global Edge Nodes

## Overview

This document describes the event-driven Redis caching strategy for the AITBC platform. It targets distributed edge nodes and propagates GPU availability and pricing changes to every node as soon as a booking or cancellation event occurs.

## Architecture

### Multi-Tier Caching

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Edge Node 1   │    │   Edge Node 2   │    │   Edge Node N   │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │  L1 Cache   │ │    │ │  L1 Cache   │ │    │ │  L1 Cache   │ │
│ │  (Memory)   │ │    │ │  (Memory)   │ │    │ │  (Memory)   │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │       Redis Cluster       │
                   │     (L2 Distributed)      │
                   │                           │
                   │  ┌─────────────────────┐  │
                   │  │   Pub/Sub Channel   │  │
                   │  │ Cache Invalidation  │  │
                   │  └─────────────────────┘  │
                   └───────────────────────────┘
```

### Event-Driven Invalidation Flow

```
Booking/Cancellation Event
            │
            ▼
     Event Publisher
            │
            ▼
      Redis Pub/Sub
            │
            ▼
    Event Subscribers
     (All Edge Nodes)
            │
            ▼
    Cache Invalidation
     (L1 + L2 Cache)
            │
            ▼
  Immediate Propagation
```

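
The flow above can be sketched without a Redis server. In this minimal illustration, `InProcessBus` and `EdgeNode` are hypothetical stand-ins (they are not part of the AITBC code base): the bus plays the role of the Redis pub/sub channel, and each node drops the affected key from its local cache when an event arrives.

```python
from typing import Callable, Dict, List


class InProcessBus:
    """Stand-in for a Redis pub/sub channel: delivers each message to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: dict) -> int:
        for handler in self._subscribers:
            handler(event)
        # Report how many subscribers were notified.
        return len(self._subscribers)


class EdgeNode:
    """Edge node holding a local (L1) dict cache that is invalidated by events."""

    def __init__(self, node_id: str, bus: InProcessBus) -> None:
        self.node_id = node_id
        self.l1: Dict[str, object] = {}
        bus.subscribe(self.on_event)

    def on_event(self, event: dict) -> None:
        # Drop the affected key so the next read refetches fresh data.
        self.l1.pop(event["resource_id"], None)


bus = InProcessBus()
nodes = [EdgeNode(f"edge_{i}", bus) for i in range(3)]
for node in nodes:
    node.l1["gpu_123"] = {"status": "available"}

# A booking event invalidates gpu_123 on every node.
receivers = bus.publish({"event_type": "booking_created", "resource_id": "gpu_123"})
```

A real deployment would replace the bus with a Redis subscription; note that Redis pub/sub delivers at most once, so a node that is disconnected when an event is published still relies on TTL expiry.
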
## Key Features

### 1. Event-Driven Cache Invalidation

**Problem Solved**: With TTL-only caching, an edge node can keep serving stale data until the entry's TTL expires.

**Solution**: Real-time event-driven invalidation over Redis pub/sub, so changes propagate without waiting for expiry.

**Critical Data Types**:
- GPU availability status
- GPU pricing information
- Order book data
- Provider status

### 2. Multi-Tier Cache Architecture

**L1 Cache (Memory)**:
- Fastest access (sub-millisecond)
- Limited size (500-5000 entries, depending on node tier)
- Shorter TTL (30-60 seconds)
- Immediate invalidation on events

**L2 Cache (Redis)**:
- Shared by all edge nodes
- Larger capacity (GBs)
- Longer TTL (5-60 minutes)
- Event-driven updates

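
As a rough illustration of the L1 tier, the sketch below implements a bounded in-memory cache with a per-entry TTL and explicit invalidation. The class name `L1Cache` and its parameters are assumptions made for this example, and it uses LRU eviction for brevity rather than the LFU policy mentioned under Memory Management.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class L1Cache:
    """Bounded in-memory cache with per-entry TTL and LRU eviction (illustrative)."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, key: str) -> bool:
        """Drop a key immediately, e.g. when an invalidation event arrives."""
        return self._entries.pop(key, None) is not None
```

On a miss, the caller would fall through to the L2 (Redis) tier and write the result back with `set`.
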
### 3. Distributed Edge Node Coordination

**Node Identification**:
- Unique node IDs for each edge node
- Regional grouping for optimization
- Network tier classification (edge/regional/global)

**Event Propagation**:
- Pub/sub for real-time events
- Event queuing for reliability
- Automatic failover and recovery

## Implementation Details

### Cache Event Types

```python
from enum import Enum


class CacheEventType(Enum):
    GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
    PRICING_UPDATED = "pricing_updated"
    BOOKING_CREATED = "booking_created"
    BOOKING_CANCELLED = "booking_cancelled"
    PROVIDER_STATUS_CHANGED = "provider_status_changed"
    MARKET_STATS_UPDATED = "market_stats_updated"
    ORDER_BOOK_UPDATED = "order_book_updated"
    MANUAL_INVALIDATION = "manual_invalidation"
```

### Cache Configurations

| Data Type | TTL | Event-Driven | Critical | Memory Limit |
|-----------|-----|--------------|----------|--------------|
| GPU Availability | 30s | ✅ | ✅ | 100MB |
| GPU Pricing | 60s | ✅ | ✅ | 50MB |
| Order Book | 5s | ✅ | ✅ | 200MB |
| Provider Status | 120s | ✅ | ❌ | 50MB |
| Market Stats | 300s | ✅ | ❌ | 100MB |
| Historical Data | 3600s | ❌ | ❌ | 500MB |

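
One way to carry this table into code is a mapping of per-data-type settings. The sketch below is an assumption-laden illustration: the `CacheConfig` field names follow the Configuration Migration example in this document, `memory_limit_mb` is a hypothetical extra field, and every namespace string other than `gpu_avail` and `gpu_pricing` is invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CacheConfig:
    namespace: str
    ttl_seconds: int
    event_driven: bool
    critical_data: bool
    memory_limit_mb: int  # hypothetical field mirroring the "Memory Limit" column


CACHE_CONFIGS: Dict[str, CacheConfig] = {
    "gpu_availability": CacheConfig("gpu_avail", 30, True, True, 100),
    "gpu_pricing": CacheConfig("gpu_pricing", 60, True, True, 50),
    "order_book": CacheConfig("order_book", 5, True, True, 200),
    "provider_status": CacheConfig("provider_status", 120, True, False, 50),
    "market_stats": CacheConfig("market_stats", 300, True, False, 100),
    "historical_data": CacheConfig("historical_data", 3600, False, False, 500),
}


def ttl_only_data_types(configs: Dict[str, CacheConfig]) -> List[str]:
    """Return data types that rely on TTL expiry alone (no event-driven invalidation)."""
    return sorted(name for name, cfg in configs.items() if not cfg.event_driven)
```
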
### Event Structure

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CacheEvent:
    event_type: CacheEventType
    resource_id: str
    data: Dict[str, Any]
    timestamp: float
    source_node: str
    event_id: str
    affected_namespaces: List[str]
```

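
Events have to be serialized before they can travel over a pub/sub channel. The following sketch shows one possible JSON encoding; `encode_event` and `decode_event` are hypothetical helpers, the enum is trimmed to two members, and the field order differs from the definition above only so that `timestamp` and `event_id` can have defaults.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List


class CacheEventType(Enum):
    GPU_AVAILABILITY_CHANGED = "gpu_availability_changed"
    BOOKING_CREATED = "booking_created"


@dataclass
class CacheEvent:
    event_type: CacheEventType
    resource_id: str
    data: Dict[str, Any]
    source_node: str
    affected_namespaces: List[str]
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def encode_event(event: CacheEvent) -> str:
    """Serialize an event to JSON; the enum is stored by its string value."""
    payload = asdict(event)
    payload["event_type"] = event.event_type.value
    return json.dumps(payload)


def decode_event(raw: str) -> CacheEvent:
    """Rebuild an event from the JSON produced by encode_event."""
    payload = json.loads(raw)
    payload["event_type"] = CacheEventType(payload["event_type"])
    return CacheEvent(**payload)
```

Storing the enum by value keeps the wire format readable and independent of Python's enum representation.
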
## Usage Examples

### Basic Cache Operations

```python
import asyncio

from aitbc_cache import init_marketplace_cache, get_marketplace_cache


async def main():
    # Initialize cache manager
    cache_manager = await init_marketplace_cache(
        redis_url="redis://redis-cluster:6379/0",
        node_id="edge_node_us_east_1",
        region="us-east"
    )

    # Get GPU availability
    gpus = await cache_manager.get_gpu_availability(
        region="us-east",
        gpu_type="RTX 3080"
    )

    # Update GPU status (triggers event)
    await cache_manager.update_gpu_status("gpu_123", "busy")


asyncio.run(main())
```

### Booking Operations with Cache Updates

```python
from datetime import datetime, timedelta, timezone

# Run inside an async function, with cache_manager initialized as above.
# BookingInfo comes from the AITBC code base (its import path is not shown here).

# Create booking (automatically updates caches)
start = datetime.now(timezone.utc)
booking = BookingInfo(
    booking_id="booking_456",
    gpu_id="gpu_123",
    user_id="user_789",
    start_time=start,
    end_time=start + timedelta(hours=2),
    status="active",
    total_cost=0.2
)

success = await cache_manager.create_booking(booking)
# This triggers:
# 1. GPU availability update
# 2. Pricing recalculation
# 3. Order book invalidation
# 4. Market stats update
# 5. Event publishing to all nodes
```

### Event-Driven Pricing Updates

```python
# Update pricing (propagated as soon as the event is published)
await cache_manager.update_gpu_pricing("RTX 3080", 0.15, "us-east")

# All subscribed edge nodes receive this event (target latency: <100ms)
# and invalidate their pricing caches
```

## Deployment Configuration

### Environment Variables

```bash
# Redis Configuration
REDIS_HOST=redis-cluster.internal
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=your_redis_password
REDIS_SSL=true
REDIS_MAX_CONNECTIONS=50

# Edge Node Configuration
EDGE_NODE_ID=edge_node_us_east_1
EDGE_NODE_REGION=us-east
EDGE_NODE_DATACENTER=dc1
EDGE_NODE_CACHE_TIER=edge

# Cache Configuration
CACHE_L1_SIZE=1000
CACHE_ENABLE_EVENT_DRIVEN=true
CACHE_ENABLE_METRICS=true
CACHE_HEALTH_CHECK_INTERVAL=30

# Security
CACHE_ENABLE_TLS=true
CACHE_REQUIRE_AUTH=true
CACHE_AUTH_TOKEN=your_auth_token
```

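
A small loader can turn these variables into a typed settings object. This is only a sketch: `CacheSettings` and `load_settings` are hypothetical names, the defaults are illustrative assumptions, and only a subset of the variables is read. The password is deliberately left out of the generated URL.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CacheSettings:
    redis_host: str
    redis_port: int
    redis_ssl: bool
    node_id: str
    region: str
    l1_size: int
    event_driven: bool

    @property
    def redis_url(self) -> str:
        scheme = "rediss" if self.redis_ssl else "redis"  # rediss:// selects TLS
        return f"{scheme}://{self.redis_host}:{self.redis_port}/0"


def load_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Read a subset of the variables listed above; defaults are assumptions."""
    return CacheSettings(
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_ssl=_as_bool(env.get("REDIS_SSL", "false")),
        node_id=env["EDGE_NODE_ID"],  # required: fails fast with KeyError
        region=env.get("EDGE_NODE_REGION", "unknown"),
        l1_size=int(env.get("CACHE_L1_SIZE", "1000")),
        event_driven=_as_bool(env.get("CACHE_ENABLE_EVENT_DRIVEN", "true")),
    )
```

Passing the environment as a mapping keeps the loader easy to test without mutating `os.environ`.
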
### Redis Cluster Setup

```yaml
# docker-compose.yml
# Note: each node starts in cluster mode but unjoined; form the cluster
# afterwards with `redis-cli --cluster create`.
version: '3.8'
services:
  redis-master:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-1:
    image: redis:7-alpine
    ports:
      - "6380:6379"
    command: redis-server --appendonly yes --cluster-enabled yes

  redis-replica-2:
    image: redis:7-alpine
    ports:
      - "6381:6379"
    command: redis-server --appendonly yes --cluster-enabled yes
```

## Performance Optimization

### Cache Hit Ratios

**Target Performance**:
- L1 Cache Hit Ratio: >80%
- L2 Cache Hit Ratio: >95%
- Event Propagation Latency: <100ms
- Total Cache Response Time: <5ms

### Optimization Strategies

1. **L1 Cache Sizing**:
   - Edge nodes: 500 entries (faster lookup)
   - Regional nodes: 2000 entries (better coverage)
   - Global nodes: 5000 entries (maximum coverage)

2. **Event Processing**:
   - Batch event processing for high throughput
   - Event deduplication to prevent storms
   - Priority queues for critical events

3. **Memory Management**:
   - LFU eviction, so frequently accessed data stays cached
   - Time-based expiration for stale data
   - Memory pressure monitoring

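
The event-processing points above (deduplication and priority handling) can be combined in one small structure. The sketch below is a hypothetical `EventQueue`, not the platform's implementation: it remembers a bounded window of recent event IDs and serves critical events before non-critical ones.

```python
import heapq
import itertools
from collections import OrderedDict
from typing import List, Optional, Tuple


class EventQueue:
    """Priority queue that drops events whose ID was already seen recently."""

    def __init__(self, dedup_window: int = 10_000) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._window = dedup_window
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def push(self, event_id: str, critical: bool) -> bool:
        if event_id in self._seen:
            return False  # duplicate: ignore
        self._seen[event_id] = None
        if len(self._seen) > self._window:
            self._seen.popitem(last=False)  # forget the oldest ID
        priority = 0 if critical else 1  # lower number is served first
        heapq.heappush(self._heap, (priority, next(self._order), event_id))
        return True

    def pop(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The bounded window trades memory for safety: a duplicate that arrives after its ID has been forgotten is processed again, which is harmless as long as handlers are idempotent.
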
## Monitoring and Observability

### Cache Metrics

```python
# Get cache statistics
stats = await cache_manager.get_cache_stats()

# Key metrics:
# - cache_hits / cache_misses
# - events_processed
# - invalidations
# - l1_cache_size
# - redis_memory_used_mb
```

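
The hit ratio used in the performance targets can be derived from these counters. This sketch assumes `stats` is a dict-like object with `cache_hits` and `cache_misses` keys, which matches the metric names listed above but is not confirmed by the source.

```python
from typing import Mapping


def hit_ratio(stats: Mapping[str, float]) -> float:
    """Return hits / (hits + misses), or 0.0 when no lookups were recorded."""
    hits = stats.get("cache_hits", 0)
    misses = stats.get("cache_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```
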
### Health Checks

```python
# Comprehensive health check
health = await cache_manager.health_check()

# Health indicators:
# - redis_connected
# - pubsub_active
# - event_queue_size
# - last_event_age
```

### Alerting Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Cache Hit Ratio | <70% | <50% |
| Event Queue Size | >1000 | >5000 |
| Event Latency | >500ms | >2000ms |
| Redis Memory | >80% | >95% |
| Connection Failures | >5/min | >20/min |

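
These thresholds translate directly into a classification function. In the sketch below the metric keys and the `classify` helper are hypothetical names; only the numeric limits come from the table.

```python
from typing import Dict, Tuple

# (warning, critical, higher_is_worse) per metric, copied from the table above.
THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "cache_hit_ratio_pct": (70, 50, False),
    "event_queue_size": (1000, 5000, True),
    "event_latency_ms": (500, 2000, True),
    "redis_memory_pct": (80, 95, True),
    "connection_failures_per_min": (5, 20, True),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical, higher_is_worse = THRESHOLDS[metric]
    if not higher_is_worse:
        # Flip signs so that "larger is worse" holds for every metric.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

The comparisons are strict, so a reading exactly at a limit (for example 80% Redis memory) is still reported as `ok`, matching the `>` and `<` signs in the table.
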
## Security Considerations

### Network Security

1. **TLS Encryption**: All Redis connections use TLS
2. **Authentication**: Redis AUTH tokens required
3. **Network Isolation**: Redis cluster in private VPC
4. **Access Control**: IP whitelisting for edge nodes

### Data Security

1. **Sensitive Data**: No private keys or passwords cached
2. **Data Encryption**: At-rest encryption for Redis
3. **Access Logging**: All cache operations logged
4. **Data Retention**: Automatic cleanup of old data

## Troubleshooting

### Common Issues

1. **Stale Cache Data**:
   - Check event propagation
   - Verify pub/sub connectivity
   - Review event queue size

2. **High Memory Usage**:
   - Monitor L1 cache size
   - Check TTL configurations
   - Review eviction policies

3. **Slow Performance**:
   - Check Redis connection pool
   - Monitor network latency
   - Review cache hit ratios

### Debug Commands

```python
# Check cache health
health = await cache_manager.health_check()
print(f"Cache status: {health['status']}")

# Check event processing
stats = await cache_manager.get_cache_stats()
print(f"Events processed: {stats['events_processed']}")

# Manual cache invalidation
await cache_manager.invalidate_cache('gpu_availability', reason='debug')
```

## Best Practices

### 1. Cache Key Design

- Use consistent naming conventions
- Include relevant parameters in key
- Avoid key collisions
- Use appropriate TTL values

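
A deterministic key builder is one way to follow these guidelines. The function below is a hypothetical helper, and the key layout it produces is an assumption rather than the scheme AITBC actually uses.

```python
from typing import Any


def build_cache_key(namespace: str, **params: Any) -> str:
    """Build a deterministic key such as "gpu_avail:gpu_type=rtx_3080:region=us-east"."""
    parts = [namespace]
    for name in sorted(params):  # sorted so argument order never changes the key
        value = str(params[name]).strip().lower().replace(" ", "_")
        parts.append(f"{name}={value}")
    return ":".join(parts)
```

Sorting the parameters and normalizing their values means two callers asking for the same data always produce the same key, which avoids accidental duplicates and collisions.
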
### 2. Event Design

- Include all necessary context
- Use unique event IDs
- Timestamp all events
- Handle event idempotency

### 3. Error Handling

- Graceful degradation on Redis failures
- Retry logic for transient errors
- Fallback to database when needed
- Comprehensive error logging

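
Retry and fallback can be combined in a single read helper. This is a minimal sketch under stated assumptions: `read_with_fallback` is a hypothetical function, and the built-in `ConnectionError` stands in for whatever exception the Redis client in use raises on a transient failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_cache: Callable[[], T],
    read_database: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the cache with exponential backoff, then fall back to the database."""
    for attempt in range(retries):
        try:
            return read_cache()
        except ConnectionError:
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))  # delay doubles on each retry
    return read_database()
```

Injecting `sleep` keeps the helper testable without real delays; production code would also log each failure, in line with the last point above.
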
### 4. Performance Optimization

- Batch operations when possible
- Use connection pooling
- Monitor memory usage
- Optimize serialization

## Migration Guide

### From TTL-Only Caching

1. **Phase 1**: Deploy event-driven cache alongside existing cache
2. **Phase 2**: Enable event-driven invalidation for critical data
3. **Phase 3**: Migrate all data types to event-driven
4. **Phase 4**: Remove old TTL-only cache

### Configuration Migration

```python
# Old configuration
cache_ttl = {
    'gpu_availability': 30,
    'gpu_pricing': 60
}

# New configuration
cache_configs = {
    'gpu_availability': CacheConfig(
        namespace='gpu_avail',
        ttl_seconds=30,
        event_driven=True,
        critical_data=True
    ),
    'gpu_pricing': CacheConfig(
        namespace='gpu_pricing',
        ttl_seconds=60,
        event_driven=True,
        critical_data=True
    )
}
```

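
For larger configurations the conversion can be automated. The helper below is a hypothetical sketch: the `CacheConfig` stand-in only mirrors the fields used in the example above, and it reuses the data type name as the namespace, so shortened namespaces such as `gpu_avail` would still need to be set by hand.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CacheConfig:  # stand-in mirroring the fields used in the example above
    namespace: str
    ttl_seconds: int
    event_driven: bool = False
    critical_data: bool = False


def migrate_ttl_config(cache_ttl: Dict[str, int],
                       critical: Iterable[str] = ()) -> Dict[str, CacheConfig]:
    """Build event-driven CacheConfig entries from a legacy TTL-only mapping."""
    critical_set = set(critical)
    return {
        name: CacheConfig(
            namespace=name,
            ttl_seconds=ttl,
            event_driven=True,
            critical_data=name in critical_set,
        )
        for name, ttl in cache_ttl.items()
    }


legacy = {"gpu_availability": 30, "gpu_pricing": 60}
migrated = migrate_ttl_config(legacy, critical=["gpu_availability", "gpu_pricing"])
```
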
## Future Enhancements

### Planned Features

1. **Intelligent Caching**: ML-based cache preloading
2. **Adaptive TTL**: Dynamic TTL based on access patterns
3. **Multi-Region Replication**: Cross-region cache synchronization
4. **Cache Analytics**: Advanced usage analytics and optimization

### Scalability Improvements

1. **Sharding**: Horizontal scaling of cache data
2. **Compression**: Data compression for memory efficiency
3. **Tiered Storage**: SSD/HDD tiering for large datasets
4. **Edge Computing**: Push cache closer to users

## Conclusion

The event-driven Redis caching strategy provides:

- **Immediate Propagation**: Event propagation across all edge nodes, with a sub-100ms latency target
- **High Performance**: Multi-tier caching targeting an L2 hit ratio above 95%
- **Scalability**: Distributed architecture supporting global edge deployment
- **Reliability**: Automatic failover and recovery mechanisms
- **Security**: TLS-encrypted, authenticated Redis connections

This system propagates GPU availability and pricing changes to all edge nodes as they happen, which minimizes stale data and gives users a consistent experience across the global AITBC platform.
