# Scaling Guide
Comprehensive guide for scaling Rizk SDK deployments from a single instance to enterprise-scale distributed systems, based on real-world patterns from production deployments.
## Scaling Overview
Rizk SDK supports multiple scaling patterns:
- **Horizontal Scaling**: Multiple instances with a shared Redis cache
- **Vertical Scaling**: A single instance with optimized resource allocation
- **Multi-Region Deployment**: Distributed deployments across regions
- **Auto-Scaling**: Dynamic scaling based on load
- **Load Balancing**: Intelligent request distribution
- **Cache Distribution**: Multi-tier distributed caching
## Single Instance to Multi-Instance
### 1. Shared Cache Configuration
```python
import os

from rizk.sdk.cache.cache_hierarchy import CacheHierarchy, CacheHierarchyConfig
from rizk.sdk.cache.redis_adapter import RedisConfig


# Multi-instance shared cache configuration
def create_shared_cache_config(instance_id: str) -> CacheHierarchyConfig:
    """Create a cache configuration for a multi-instance deployment."""
    redis_config = RedisConfig(
        url=os.getenv("REDIS_CLUSTER_URL", "redis://redis-cluster:6379"),
        max_connections=100,                # Per instance
        socket_timeout=1.0,
        socket_connect_timeout=2.0,
        retry_on_timeout=True,
        retry_attempts=3,
        enable_cluster=True,                # Critical for multi-instance
        cluster_nodes=[
            "redis-node1:6379",
            "redis-node2:6379",
            "redis-node3:6379"
        ],
        key_prefix=f"rizk:{instance_id}:",  # Instance-specific prefix
        default_ttl=1800
    )

    return CacheHierarchyConfig(
        # L1: instance-local cache
        l1_enabled=True,
        l1_max_size=10000,                  # Smaller per instance
        l1_ttl_seconds=300,

        # L2: shared Redis cluster
        l2_enabled=True,
        l2_redis_config=redis_config,
        l2_ttl_seconds=3600,
        l2_fallback_on_error=True,

        # Multi-instance optimizations
        async_write_behind=True,
        promotion_threshold=2,              # Less aggressive promotion

        # Instance coordination
        instance_id=instance_id,
        enable_instance_coordination=True,
        coordination_interval=30
    )


# Initialize with an instance-specific configuration
instance_id = os.getenv("INSTANCE_ID", f"instance-{os.getpid()}")
cache_config = create_shared_cache_config(instance_id)
cache_hierarchy = CacheHierarchy(cache_config)
```
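To verify that the shared L2 tier behaves as expected, it helps to confirm that a value written by one instance is visible from another. Below is a minimal sketch using the `redis` client directly against the same cluster URL as the configuration above (a real cluster deployment would use `redis.cluster.RedisCluster`; the key name is illustrative):

```python
import os

import redis

# Both instances point at the same cluster, so writes on one are visible on the other
r = redis.Redis.from_url(os.getenv("REDIS_CLUSTER_URL", "redis://redis-cluster:6379"))

# Instance A caches a result with a TTL...
r.setex("rizk:shared:guardrail:policy-123", 1800, "allow")

# ...and instance B (same code, different pod) reads it back
print(r.get("rizk:shared:guardrail:policy-123"))  # b"allow"
```

Keep in mind that keys written under the instance-specific prefix `rizk:{instance_id}:` are scoped to a single instance; entries that must be shared across instances need a common prefix.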
### 2. Load Balancer Configuration
```nginx
# nginx.conf for Rizk SDK load balancing

# Rate limiting zone (limit_req_zone must live in the http context)
limit_req_zone $binary_remote_addr zone=rizk_rate_limit:10m rate=100r/s;

upstream rizk_backend {
    # Passive health checks via max_fails/fail_timeout
    server rizk-instance-1:8000 max_fails=3 fail_timeout=30s;
    server rizk-instance-2:8000 max_fails=3 fail_timeout=30s;
    server rizk-instance-3:8000 max_fails=3 fail_timeout=30s;

    # Load balancing method: route to the instance with the fewest connections
    least_conn;

    # Optional sticky sessions for stateful operations.
    # Note: ip_hash is itself a balancing method -- enable it *instead of*
    # least_conn, never both.
    # ip_hash;

    # Reuse idle upstream connections
    keepalive 32;
}

server {
    listen 80;
    server_name api.yourcompany.com;

    # Rate limiting
    limit_req zone=rizk_rate_limit burst=200 nodelay;

    location / {
        proxy_pass http://rizk_backend;

        # Load balancing headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Instance-ID $upstream_addr;

        # Streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;

        # Fail over to the next upstream on errors
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://rizk_backend/health;
        access_log off;
    }

    # Metrics endpoint (internal networks only)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;

        proxy_pass http://rizk_backend/metrics;
    }
}
```
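Both the nginx health checks above and the Kubernetes probes below assume the application exposes `/health` and `/ready` endpoints. A minimal sketch of those handlers, assuming a FastAPI service (the Redis ping as the readiness criterion is an illustrative choice, not part of the Rizk SDK):

```python
import os

import redis
from fastapi import FastAPI, Response

app = FastAPI()
r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://redis-cluster:6379"))


@app.get("/health")
def health() -> dict:
    # Liveness: the process is up and able to serve requests
    return {"status": "ok"}


@app.get("/ready")
def ready(response: Response) -> dict:
    # Readiness: only accept traffic once dependencies are reachable
    try:
        r.ping()
        return {"status": "ready"}
    except redis.RedisError:
        response.status_code = 503
        return {"status": "not ready"}
```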
## Kubernetes Scaling
### 1. Horizontal Pod Autoscaler
```yaml
# hpa.yaml - Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rizk-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rizk-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: rizk_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 5
        periodSeconds: 15
      selectPolicy: Max
```
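Note that `rizk_requests_per_second` is a custom Pods metric: the HPA can only consume it through a custom-metrics pipeline, typically Prometheus plus prometheus-adapter. A minimal sketch of the application side, exporting a request counter on the metrics port that the Deployment below annotates for scraping (metric and function names are illustrative):

```python
from prometheus_client import Counter, start_http_server

# Exposed to Prometheus as rizk_requests_total (the client adds the _total suffix)
REQUESTS = Counter("rizk_requests", "Requests handled by this instance")


def handle_request() -> None:
    """Call from the request path; illustrative stand-in for real handling."""
    REQUESTS.inc()


if __name__ == "__main__":
    # Serve /metrics on the port the Deployment's prometheus.io annotations target
    start_http_server(8080)
```

A prometheus-adapter rule would then expose a rate such as `rate(rizk_requests_total[1m])` to the HPA under the name `rizk_requests_per_second`.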
### 2. Deployment with Resource Optimization
```yaml
# deployment.yaml - Optimized for scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rizk-app
  labels:
    app: rizk-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: rizk-app
  template:
    metadata:
      labels:
        app: rizk-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: rizk-app
        image: your-company/rizk-app:latest
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8080
          name: metrics

        # Resource allocation for scaling
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

        # Environment configuration
        env:
        - name: RIZK_API_KEY
          valueFrom:
            secretKeyRef:
              name: rizk-secrets
              key: api-key
        - name: REDIS_URL
          value: "redis://redis-cluster:6379"
        - name: INSTANCE_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RIZK_FRAMEWORK_CACHE_SIZE
          value: "10000"  # Optimized for multiple instances

        # Health checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2

        # Startup probe for slow initialization
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 30
          periodSeconds: 10

      # Spread replicas across nodes (pair with a PodDisruptionBudget)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rizk-app
              topologyKey: kubernetes.io/hostname
```
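Rolling updates and HPA scale-down both terminate pods with SIGTERM, so each instance should drain in-flight work before exiting (graceful shutdown also appears in the reliability checklist later in this guide). A minimal sketch; the drain steps in the handler are assumptions about your wiring, not Rizk SDK calls:

```python
import signal
import sys


def _graceful_shutdown(signum, frame):
    """Drain before exit: stop accepting work, flush buffers, close clients."""
    print("SIGTERM received, draining in-flight requests...")
    # Illustrative steps -- wire these to your actual server and cache objects:
    # server.stop_accepting_connections()
    # cache_hierarchy.flush()
    sys.exit(0)


signal.signal(signal.SIGTERM, _graceful_shutdown)
```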
## Auto-Scaling Strategies
### 1. Custom Metrics for Scaling
```python
from dataclasses import dataclass
from typing import Any, Dict

import psutil


@dataclass
class ScalingMetrics:
    """Metrics used for auto-scaling decisions."""
    cpu_percent: float
    memory_percent: float
    requests_per_second: float
    cache_hit_rate: float
    error_rate: float
    response_time_ms: float
    active_connections: int


class AutoScalingController:
    """Control auto-scaling based on Rizk SDK metrics."""

    def __init__(self):
        self.metrics_history = []
        self.scaling_thresholds = {
            "scale_up": {
                "cpu_percent": 70,
                "memory_percent": 80,
                "requests_per_second": 100,
                "response_time_ms": 1000,
                "error_rate": 5
            },
            "scale_down": {
                "cpu_percent": 30,
                "memory_percent": 40,
                "requests_per_second": 20,
                "response_time_ms": 200,
                "error_rate": 1
            }
        }

    def collect_metrics(self) -> ScalingMetrics:
        """Collect current system and Rizk metrics."""
        # System metrics
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        memory_percent = memory.percent

        # Rizk-specific metrics (collected from the SDK cache hierarchy)
        cache_stats = cache_hierarchy.get_stats() if "cache_hierarchy" in globals() else {}

        metrics = ScalingMetrics(
            cpu_percent=cpu_percent,
            memory_percent=memory_percent,
            requests_per_second=self._get_requests_per_second(),
            cache_hit_rate=cache_stats.get("overall_hit_rate", 0),
            error_rate=self._get_error_rate(),
            response_time_ms=self._get_avg_response_time(),
            active_connections=self._get_active_connections()
        )

        self.metrics_history.append(metrics)

        # Keep only recent metrics
        if len(self.metrics_history) > 100:
            self.metrics_history = self.metrics_history[-100:]

        return metrics

    def should_scale_up(self, metrics: ScalingMetrics) -> bool:
        """Determine if scaling up is needed."""
        thresholds = self.scaling_thresholds["scale_up"]

        conditions = [
            metrics.cpu_percent > thresholds["cpu_percent"],
            metrics.memory_percent > thresholds["memory_percent"],
            metrics.requests_per_second > thresholds["requests_per_second"],
            metrics.response_time_ms > thresholds["response_time_ms"],
            metrics.error_rate > thresholds["error_rate"]
        ]

        # Scale up if any 2 conditions are met
        return sum(conditions) >= 2

    def should_scale_down(self, metrics: ScalingMetrics) -> bool:
        """Determine if scaling down is safe."""
        thresholds = self.scaling_thresholds["scale_down"]

        # Only scale down if ALL conditions are met, for safety
        conditions = [
            metrics.cpu_percent < thresholds["cpu_percent"],
            metrics.memory_percent < thresholds["memory_percent"],
            metrics.requests_per_second < thresholds["requests_per_second"],
            metrics.response_time_ms < thresholds["response_time_ms"],
            metrics.error_rate < thresholds["error_rate"]
        ]

        return all(conditions)

    def get_scaling_recommendation(self) -> Dict[str, Any]:
        """Get a scaling recommendation based on current metrics."""
        if len(self.metrics_history) < 3:
            return {"action": "wait", "reason": "Insufficient metrics history"}

        current_metrics = self.metrics_history[-1]

        # Check the recent trend
        recent_metrics = self.metrics_history[-3:]
        avg_cpu = sum(m.cpu_percent for m in recent_metrics) / len(recent_metrics)
        avg_memory = sum(m.memory_percent for m in recent_metrics) / len(recent_metrics)
        avg_response_time = sum(m.response_time_ms for m in recent_metrics) / len(recent_metrics)

        if self.should_scale_up(current_metrics):
            return {
                "action": "scale_up",
                "reason": (
                    f"High resource usage: CPU {avg_cpu:.1f}%, Memory {avg_memory:.1f}%, "
                    f"Response Time {avg_response_time:.1f}ms"
                ),
                "recommended_replicas": self._calculate_scale_up_replicas(current_metrics)
            }
        elif self.should_scale_down(current_metrics):
            return {
                "action": "scale_down",
                "reason": f"Low resource usage: CPU {avg_cpu:.1f}%, Memory {avg_memory:.1f}%",
                "recommended_replicas": self._calculate_scale_down_replicas(current_metrics)
            }
        else:
            return {
                "action": "maintain",
                "reason": "Metrics within acceptable range"
            }

    def _calculate_scale_up_replicas(self, metrics: ScalingMetrics) -> int:
        """Calculate how many replicas to add."""
        # Simple calculation based on CPU usage
        if metrics.cpu_percent > 90:
            return 3  # Aggressive scaling for very high CPU
        elif metrics.cpu_percent > 80:
            return 2
        else:
            return 1

    def _calculate_scale_down_replicas(self, metrics: ScalingMetrics) -> int:
        """Calculate how many replicas to remove."""
        # Conservative: remove one at a time
        return 1

    def _get_requests_per_second(self) -> float:
        """Get current requests per second (implementation specific)."""
        # This would typically come from your web framework metrics
        return 50.0  # Placeholder

    def _get_error_rate(self) -> float:
        """Get the current error rate percentage."""
        # This would typically come from your application metrics
        return 2.0  # Placeholder

    def _get_avg_response_time(self) -> float:
        """Get average response time in milliseconds."""
        # This would typically come from your application metrics
        return 250.0  # Placeholder

    def _get_active_connections(self) -> int:
        """Get the number of active connections."""
        # This would typically come from your web server metrics
        return 50  # Placeholder


# Usage
scaling_controller = AutoScalingController()


def monitor_and_scale():
    """Monitor metrics and print scaling recommendations."""
    metrics = scaling_controller.collect_metrics()
    recommendation = scaling_controller.get_scaling_recommendation()

    print("Current Metrics:")
    print(f"  CPU: {metrics.cpu_percent:.1f}%")
    print(f"  Memory: {metrics.memory_percent:.1f}%")
    print(f"  RPS: {metrics.requests_per_second:.1f}")
    print(f"  Response Time: {metrics.response_time_ms:.1f}ms")
    print(f"  Cache Hit Rate: {metrics.cache_hit_rate:.1f}%")

    print(f"\nScaling Recommendation: {recommendation['action']}")
    print(f"Reason: {recommendation['reason']}")

    if "recommended_replicas" in recommendation:
        print(f"Recommended Replica Change: {recommendation['recommended_replicas']}")

    return recommendation
```
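The controller above only emits recommendations. One way to act on them is a small control loop, sketched here by shelling out to `kubectl`. This is purely illustrative: in-cluster you would normally export the metrics and let the HPA react, or use the official `kubernetes` Python client instead of subprocess.

```python
import subprocess
import time


def run_scaling_loop(interval_seconds: int = 30) -> None:
    """Poll metrics and apply scale recommendations to the rizk-app Deployment."""
    while True:
        recommendation = monitor_and_scale()
        if recommendation["action"] in ("scale_up", "scale_down"):
            delta = recommendation["recommended_replicas"]
            change = delta if recommendation["action"] == "scale_up" else -delta
            # Read the current replica count, then apply the change
            current = int(subprocess.check_output(
                ["kubectl", "get", "deployment", "rizk-app",
                 "-o", "jsonpath={.spec.replicas}"]
            ).decode())
            target = max(1, current + change)
            subprocess.run(
                ["kubectl", "scale", "deployment/rizk-app", f"--replicas={target}"],
                check=True
            )
        time.sleep(interval_seconds)
```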
## Scaling Best Practices
### 1. Scaling Checklist
```python
from typing import Any, Dict

SCALING_CHECKLIST = {
    "infrastructure": [
        "✅ Redis cluster configured for high availability",
        "✅ Load balancer configured with health checks",
        "✅ Auto-scaling policies defined",
        "✅ Resource limits and requests configured",
        "✅ Pod disruption budgets set"
    ],
    "configuration": [
        "✅ Instance-specific cache prefixes",
        "✅ Shared cache configuration",
        "✅ Regional endpoints configured",
        "✅ Cross-region sync enabled",
        "✅ Monitoring and alerting setup"
    ],
    "performance": [
        "✅ Cache hit rates > 80%",
        "✅ Response times < 500ms",
        "✅ Error rates < 1%",
        "✅ CPU utilization 60-80%",
        "✅ Memory utilization < 85%"
    ],
    "reliability": [
        "✅ Multi-AZ deployment",
        "✅ Graceful shutdown handling",
        "✅ Circuit breakers implemented",
        "✅ Retry logic configured",
        "✅ Backup and recovery tested"
    ]
}


def validate_scaling_readiness() -> Dict[str, Any]:
    """Validate readiness for a scaled deployment."""
    print("🚀 Validating scaling readiness...")

    # This would include actual checks
    validation_results = {
        "ready": True,
        "warnings": [],
        "errors": []
    }

    # Example checks
    redis_health = check_redis_cluster_health()
    if not redis_health["healthy"]:
        validation_results["errors"].append("Redis cluster not healthy")
        validation_results["ready"] = False

    cache_hit_rate = get_current_cache_hit_rate()
    if cache_hit_rate < 70:
        validation_results["warnings"].append(f"Cache hit rate low: {cache_hit_rate:.1f}%")

    return validation_results


def check_redis_cluster_health() -> Dict[str, Any]:
    """Check Redis cluster health (placeholder)."""
    return {"healthy": True, "nodes": 3, "status": "ok"}


def get_current_cache_hit_rate() -> float:
    """Get the current cache hit rate (placeholder)."""
    return 85.0
```
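A typical use of the validator is as a gate in a deployment pipeline: surface warnings, and fail fast on errors before adding replicas.

```python
results = validate_scaling_readiness()
for warning in results["warnings"]:
    print(f"⚠️ {warning}")
for error in results["errors"]:
    print(f"❌ {error}")
if not results["ready"]:
    raise SystemExit("Deployment is not ready to scale")
```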
### 2. Deployment Strategy
```bash
#!/bin/bash
# deploy-scaled.sh - Deploy a scaled Rizk SDK application

set -e

echo "🚀 Starting scaled deployment..."

# 1. Validate prerequisites
echo "📋 Validating prerequisites..."
kubectl cluster-info
kubectl get nodes
kubectl get pv  # Check persistent volumes

# 2. Deploy the Redis cluster first
echo "🔧 Deploying Redis cluster..."
kubectl apply -f redis-cluster.yaml
kubectl wait --for=condition=ready pod -l app=redis-cluster --timeout=300s

# 3. Deploy the application with minimal replicas
echo "🚀 Deploying application (minimal replicas)..."
kubectl apply -f deployment.yaml
kubectl wait --for=condition=available deployment/rizk-app --timeout=300s

# 4. Run health checks
echo "🩺 Running health checks..."
kubectl get pods -l app=rizk-app
kubectl exec deployment/rizk-app -- curl -f http://localhost:8000/health

# 5. Deploy auto-scaling
echo "📈 Enabling auto-scaling..."
kubectl apply -f hpa.yaml
kubectl apply -f pdb.yaml

# 6. Configure monitoring
echo "📊 Setting up monitoring..."
kubectl apply -f monitoring.yaml

# 7. Run a small load test to verify scaling
echo "🧪 Running scaling verification..."
kubectl run load-test --image=busybox --rm -it --restart=Never -- \
  sh -c 'for i in $(seq 1 100); do wget -q -O- http://rizk-app-service/health; done'

# 8. Monitor scaling behavior
echo "👀 Monitoring scaling behavior..."
kubectl get hpa rizk-app-hpa --watch

echo "✅ Scaled deployment completed successfully!"
```
## Next Steps
- **Production Setup** - Deploy your scaled architecture
- **Performance Tuning** - Optimize for scale
- **Security Best Practices** - Secure your scaled deployment
## Scaling Implementation Checklist
- ✅ Multi-instance configuration ready
- ✅ Shared cache infrastructure deployed
- ✅ Load balancing configured
- ✅ Auto-scaling policies defined
- ✅ Multi-region strategy planned
- ✅ Monitoring and alerting setup
- ✅ Performance benchmarks established
- ✅ Disaster recovery tested
*Enterprise-scale LLM governance architecture*