Scaling Guide

A comprehensive guide to scaling Rizk SDK deployments from a single instance to enterprise-scale distributed systems, based on patterns from real-world production deployments.

Scaling Overview

Rizk SDK supports multiple scaling patterns:

  • Horizontal Scaling: Multiple instances with a shared Redis cache (contrasted with vertical scaling in the sketch after this list)
  • Vertical Scaling: Single instance with optimized resource allocation
  • Multi-Region Deployment: Distributed deployments across regions
  • Auto-Scaling: Dynamic scaling based on load
  • Load Balancing: Intelligent request distribution
  • Cache Distribution: Multi-tier distributed caching
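
The first two patterns differ mainly in how the cache hierarchy is configured. As a rough sketch using the CacheHierarchyConfig class shown throughout this guide (the values are illustrative, and a real multi-instance setup also needs the RedisConfig developed in the next section):

from rizk.sdk.cache.cache_hierarchy import CacheHierarchyConfig

# Vertical scaling: one instance with a large in-process (L1) cache, no shared tier.
single_instance = CacheHierarchyConfig(
    l1_enabled=True,
    l1_max_size=50000,   # illustrative: give the lone instance a big L1
    l1_ttl_seconds=300,
    l2_enabled=False,
)

# Horizontal scaling: a smaller per-instance L1 plus a shared Redis L2 tier
# (the full configuration, including l2_redis_config, follows below).
multi_instance = CacheHierarchyConfig(
    l1_enabled=True,
    l1_max_size=10000,
    l1_ttl_seconds=300,
    l2_enabled=True,     # shared Redis cluster; see the next section
)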

Single Instance to Multi-Instance

1. Shared Cache Configuration

import os

from rizk.sdk.cache.cache_hierarchy import CacheHierarchy, CacheHierarchyConfig
from rizk.sdk.cache.redis_adapter import RedisConfig

# Multi-instance shared cache configuration
def create_shared_cache_config(instance_id: str) -> CacheHierarchyConfig:
    """Create a cache configuration for a multi-instance deployment."""
    redis_config = RedisConfig(
        url=os.getenv("REDIS_CLUSTER_URL", "redis://redis-cluster:6379"),
        max_connections=100,        # Per instance
        socket_timeout=1.0,
        socket_connect_timeout=2.0,
        retry_on_timeout=True,
        retry_attempts=3,
        enable_cluster=True,        # Critical for multi-instance
        cluster_nodes=[
            "redis-node1:6379",
            "redis-node2:6379",
            "redis-node3:6379"
        ],
        key_prefix="rizk:shared:",  # Common prefix so all instances share L2 entries
        default_ttl=1800
    )
    return CacheHierarchyConfig(
        # L1: instance-local cache
        l1_enabled=True,
        l1_max_size=10000,          # Smaller per instance
        l1_ttl_seconds=300,
        # L2: shared Redis cluster
        l2_enabled=True,
        l2_redis_config=redis_config,
        l2_ttl_seconds=3600,
        l2_fallback_on_error=True,
        # Multi-instance optimizations
        async_write_behind=True,
        promotion_threshold=2,      # Less aggressive promotion
        # Instance coordination
        instance_id=instance_id,
        enable_instance_coordination=True,
        coordination_interval=30
    )

# Initialize with an instance-specific identifier
instance_id = os.getenv("INSTANCE_ID", f"instance-{os.getpid()}")
cache_config = create_shared_cache_config(instance_id)
cache_hierarchy = CacheHierarchy(cache_config)
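
After initialization, it is worth verifying that both tiers are reachable before taking traffic. A minimal check, assuming the get_stats() accessor used later in this guide (the l1_size and l2_connected keys are assumptions; adjust to what your SDK version actually reports):

# Sanity-check the cache tiers at startup
stats = cache_hierarchy.get_stats()
print(f"L1 entries: {stats.get('l1_size', 'n/a')}")
print(f"L2 connected: {stats.get('l2_connected', 'n/a')}")
print(f"Overall hit rate: {stats.get('overall_hit_rate', 0):.1f}%")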

2. Load Balancer Configuration

# nginx.conf for Rizk SDK load balancing
# (these directives belong in the http context, e.g. an included conf.d file)

# Rate-limiting zones must be defined at the http level, outside server{}
limit_req_zone $binary_remote_addr zone=rizk_rate_limit:10m rate=100r/s;

upstream rizk_backend {
    # Load balancing method: route to the instance with the fewest active
    # connections. nginx allows only one method per upstream block; replace
    # least_conn with ip_hash if you need sticky sessions instead.
    least_conn;

    # Passive health checks via max_fails/fail_timeout
    server rizk-instance-1:8000 max_fails=3 fail_timeout=30s;
    server rizk-instance-2:8000 max_fails=3 fail_timeout=30s;
    server rizk-instance-3:8000 max_fails=3 fail_timeout=30s;

    # Upstream connection pooling
    keepalive 32;
}

server {
    listen 80;
    server_name api.yourcompany.com;

    # Rate limiting (zone defined above)
    limit_req zone=rizk_rate_limit burst=200 nodelay;

    location / {
        proxy_pass http://rizk_backend;

        # Forwarding headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Instance-ID $upstream_addr;

        # Streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;

        # Failover to the next upstream on errors
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://rizk_backend/health;
        access_log off;
    }

    # Metrics endpoint (internal networks only)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;
        proxy_pass http://rizk_backend/metrics;
    }
}
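
The configuration above (and the Kubernetes probes below) assume the application exposes /health and /ready endpoints. A minimal sketch of those endpoints; FastAPI is an assumed framework here, so adapt the handlers to whatever web framework your service uses:

# health_endpoints.py - illustrative /health and /ready handlers
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Liveness: the process is up and able to serve requests
    return {"status": "ok"}

@app.get("/ready")
def ready(response: Response) -> dict:
    # Readiness: shared dependencies (e.g. the Redis-backed L2 cache) are reachable
    try:
        stats = cache_hierarchy.get_stats()  # cache_hierarchy from the setup above
        return {"status": "ready", "hit_rate": stats.get("overall_hit_rate", 0)}
    except Exception as exc:
        response.status_code = 503
        return {"status": "not ready", "error": str(exc)}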

Kubernetes Scaling

1. Horizontal Pod Autoscaler

# hpa.yaml - Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rizk-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rizk-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: rizk_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 5
          periodSeconds: 15
      selectPolicy: Max
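
Note that the rizk_requests_per_second Pods metric is not built into Kubernetes: each pod has to publish it, and a custom-metrics adapter such as prometheus-adapter must surface it to the HPA. A minimal sketch of the pod side using prometheus_client (the metric name and the 8080 port are assumptions chosen to match the manifests in this guide):

# metrics_exporter.py - expose a per-pod request counter on :8080/metrics.
# An adapter (e.g. prometheus-adapter) converts its rate into the
# rizk_requests_per_second pods metric consumed by the HPA above.
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("rizk_requests_total", "Total requests handled by this pod")

def handle_request() -> None:
    REQUESTS.inc()  # call this on every request your app serves
    # ... actual request handling ...

if __name__ == "__main__":
    start_http_server(8080)  # matches the metrics containerPort in deployment.yaml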

2. Deployment with Resource Optimization

# deployment.yaml - Optimized for scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rizk-app
  labels:
    app: rizk-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: rizk-app
  template:
    metadata:
      labels:
        app: rizk-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: rizk-app
          image: your-company/rizk-app:latest
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8080
              name: metrics
          # Resource allocation for scaling
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          # Environment configuration
          env:
            - name: RIZK_API_KEY
              valueFrom:
                secretKeyRef:
                  name: rizk-secrets
                  key: api-key
            - name: REDIS_URL
              value: "redis://redis-cluster:6379"
            - name: INSTANCE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: RIZK_FRAMEWORK_CACHE_SIZE
              value: "10000"  # Optimized for multiple instances
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          # Startup probe for slow initialization
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
      # Spread replicas across nodes (pair with a PodDisruptionBudget)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - rizk-app
                topologyKey: kubernetes.io/hostname
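
Rolling updates and scale-down events deliver SIGTERM to pods, so handling it cleanly avoids dropped requests and unflushed write-behind cache entries. A minimal sketch; the flush() and close() method names are assumptions to adapt to your SDK version:

# graceful_shutdown.py - drain cleanly when a pod is terminated
import signal
import sys

def handle_sigterm(signum, frame):
    # 1. Stop accepting new work (framework-specific, omitted here).
    # 2. Flush pending async write-behind entries to the shared cache.
    #    flush()/close() are assumed method names; check your SDK version.
    try:
        cache_hierarchy.flush()
        cache_hierarchy.close()
    except Exception:
        pass  # best effort while shutting down
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)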

Auto-Scaling Strategies

1. Custom Metrics for Scaling

from dataclasses import dataclass
from typing import Dict, Any

import psutil

@dataclass
class ScalingMetrics:
    """Metrics used for auto-scaling decisions."""
    cpu_percent: float
    memory_percent: float
    requests_per_second: float
    cache_hit_rate: float
    error_rate: float
    response_time_ms: float
    active_connections: int

class AutoScalingController:
    """Control auto-scaling based on Rizk SDK metrics."""

    def __init__(self) -> None:
        self.metrics_history = []
        self.scaling_thresholds = {
            "scale_up": {
                "cpu_percent": 70,
                "memory_percent": 80,
                "requests_per_second": 100,
                "response_time_ms": 1000,
                "error_rate": 5
            },
            "scale_down": {
                "cpu_percent": 30,
                "memory_percent": 40,
                "requests_per_second": 20,
                "response_time_ms": 200,
                "error_rate": 1
            }
        }

    def collect_metrics(self) -> ScalingMetrics:
        """Collect current system and Rizk metrics."""
        # System metrics
        cpu_percent = psutil.cpu_percent(interval=1)
        memory_percent = psutil.virtual_memory().percent
        # Rizk-specific metrics (collected from the SDK when available)
        cache_stats = cache_hierarchy.get_stats() if "cache_hierarchy" in globals() else {}
        metrics = ScalingMetrics(
            cpu_percent=cpu_percent,
            memory_percent=memory_percent,
            requests_per_second=self._get_requests_per_second(),
            cache_hit_rate=cache_stats.get("overall_hit_rate", 0),
            error_rate=self._get_error_rate(),
            response_time_ms=self._get_avg_response_time(),
            active_connections=self._get_active_connections()
        )
        self.metrics_history.append(metrics)
        # Keep only recent metrics
        if len(self.metrics_history) > 100:
            self.metrics_history = self.metrics_history[-100:]
        return metrics

    def should_scale_up(self, metrics: ScalingMetrics) -> bool:
        """Determine if scaling up is needed."""
        thresholds = self.scaling_thresholds["scale_up"]
        conditions = [
            metrics.cpu_percent > thresholds["cpu_percent"],
            metrics.memory_percent > thresholds["memory_percent"],
            metrics.requests_per_second > thresholds["requests_per_second"],
            metrics.response_time_ms > thresholds["response_time_ms"],
            metrics.error_rate > thresholds["error_rate"]
        ]
        # Scale up if any two conditions are met
        return sum(conditions) >= 2

    def should_scale_down(self, metrics: ScalingMetrics) -> bool:
        """Determine if scaling down is safe."""
        thresholds = self.scaling_thresholds["scale_down"]
        # Only scale down if ALL conditions are met, for safety
        conditions = [
            metrics.cpu_percent < thresholds["cpu_percent"],
            metrics.memory_percent < thresholds["memory_percent"],
            metrics.requests_per_second < thresholds["requests_per_second"],
            metrics.response_time_ms < thresholds["response_time_ms"],
            metrics.error_rate < thresholds["error_rate"]
        ]
        return all(conditions)

    def get_scaling_recommendation(self) -> Dict[str, Any]:
        """Get a scaling recommendation based on current metrics."""
        if len(self.metrics_history) < 3:
            return {"action": "wait", "reason": "Insufficient metrics history"}
        current_metrics = self.metrics_history[-1]
        # Check the recent trend
        recent_metrics = self.metrics_history[-3:]
        avg_cpu = sum(m.cpu_percent for m in recent_metrics) / len(recent_metrics)
        avg_memory = sum(m.memory_percent for m in recent_metrics) / len(recent_metrics)
        avg_response_time = sum(m.response_time_ms for m in recent_metrics) / len(recent_metrics)
        if self.should_scale_up(current_metrics):
            return {
                "action": "scale_up",
                "reason": (
                    f"High resource usage: CPU {avg_cpu:.1f}%, "
                    f"Memory {avg_memory:.1f}%, "
                    f"Response Time {avg_response_time:.1f}ms"
                ),
                "recommended_replicas": self._calculate_scale_up_replicas(current_metrics)
            }
        if self.should_scale_down(current_metrics):
            return {
                "action": "scale_down",
                "reason": f"Low resource usage: CPU {avg_cpu:.1f}%, Memory {avg_memory:.1f}%",
                "recommended_replicas": self._calculate_scale_down_replicas(current_metrics)
            }
        return {
            "action": "maintain",
            "reason": "Metrics within acceptable range"
        }

    def _calculate_scale_up_replicas(self, metrics: ScalingMetrics) -> int:
        """Calculate how many replicas to add."""
        # Simple calculation based on CPU usage
        if metrics.cpu_percent > 90:
            return 3  # Aggressive scaling for very high CPU
        if metrics.cpu_percent > 80:
            return 2
        return 1

    def _calculate_scale_down_replicas(self, metrics: ScalingMetrics) -> int:
        """Calculate how many replicas to remove."""
        # Conservative: remove one at a time
        return 1

    def _get_requests_per_second(self) -> float:
        """Get current requests per second (implementation specific)."""
        # Typically comes from your web framework's metrics
        return 50.0  # Placeholder

    def _get_error_rate(self) -> float:
        """Get the current error rate percentage."""
        # Typically comes from your application metrics
        return 2.0  # Placeholder

    def _get_avg_response_time(self) -> float:
        """Get the average response time in milliseconds."""
        # Typically comes from your application metrics
        return 250.0  # Placeholder

    def _get_active_connections(self) -> int:
        """Get the number of active connections."""
        # Typically comes from your web server metrics
        return 50  # Placeholder

# Usage
scaling_controller = AutoScalingController()

def monitor_and_scale() -> Dict[str, Any]:
    """Monitor metrics and print a scaling recommendation."""
    metrics = scaling_controller.collect_metrics()
    recommendation = scaling_controller.get_scaling_recommendation()
    print("Current Metrics:")
    print(f"  CPU: {metrics.cpu_percent:.1f}%")
    print(f"  Memory: {metrics.memory_percent:.1f}%")
    print(f"  RPS: {metrics.requests_per_second:.1f}")
    print(f"  Response Time: {metrics.response_time_ms:.1f}ms")
    print(f"  Cache Hit Rate: {metrics.cache_hit_rate:.1f}%")
    print(f"\nScaling Recommendation: {recommendation['action']}")
    print(f"Reason: {recommendation['reason']}")
    if "recommended_replicas" in recommendation:
        print(f"Recommended Replica Change: {recommendation['recommended_replicas']}")
    return recommendation
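
If the recommendation should actually drive scaling (for example, in a controller running alongside or instead of the HPA), it can be applied through the Kubernetes API. A sketch using the official kubernetes Python client; the deployment name, namespace, and replica bounds are assumptions matching the manifests above, and note that manually patching a Deployment the HPA also manages will conflict with it:

# apply_scaling.py - act on a recommendation via the Kubernetes API
# Requires the official client: pip install kubernetes
from kubernetes import client, config

def apply_recommendation(recommendation: dict, namespace: str = "default") -> None:
    """Apply a scale_up/scale_down recommendation to the rizk-app Deployment."""
    if recommendation["action"] not in ("scale_up", "scale_down"):
        return
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    scale = apps.read_namespaced_deployment_scale("rizk-app", namespace)
    current = scale.spec.replicas or 0
    delta = recommendation["recommended_replicas"]
    if recommendation["action"] == "scale_up":
        target = current + delta
    else:
        target = current - delta
    target = max(3, min(50, target))  # stay within the HPA's minReplicas/maxReplicas
    scale.spec.replicas = target
    apps.patch_namespaced_deployment_scale("rizk-app", namespace, scale)
    print(f"Scaled rizk-app: {current} -> {target} replicas")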

Scaling Best Practices

1. Scaling Checklist

from typing import Dict, Any

SCALING_CHECKLIST = {
    "infrastructure": [
        "✅ Redis cluster configured for high availability",
        "✅ Load balancer configured with health checks",
        "✅ Auto-scaling policies defined",
        "✅ Resource limits and requests configured",
        "✅ Pod disruption budgets set"
    ],
    "configuration": [
        "✅ Instance-specific cache prefixes",
        "✅ Shared cache configuration",
        "✅ Regional endpoints configured",
        "✅ Cross-region sync enabled",
        "✅ Monitoring and alerting setup"
    ],
    "performance": [
        "✅ Cache hit rates > 80%",
        "✅ Response times < 500ms",
        "✅ Error rates < 1%",
        "✅ CPU utilization 60-80%",
        "✅ Memory utilization < 85%"
    ],
    "reliability": [
        "✅ Multi-AZ deployment",
        "✅ Graceful shutdown handling",
        "✅ Circuit breakers implemented",
        "✅ Retry logic configured",
        "✅ Backup and recovery tested"
    ]
}

def validate_scaling_readiness() -> Dict[str, Any]:
    """Validate readiness for a scaled deployment."""
    print("🚀 Validating scaling readiness...")
    validation_results: Dict[str, Any] = {
        "ready": True,
        "warnings": [],
        "errors": []
    }
    # Example checks (replace the placeholders below with real probes)
    redis_health = check_redis_cluster_health()
    if not redis_health["healthy"]:
        validation_results["errors"].append("Redis cluster not healthy")
        validation_results["ready"] = False
    cache_hit_rate = get_current_cache_hit_rate()
    if cache_hit_rate < 70:
        validation_results["warnings"].append(f"Cache hit rate low: {cache_hit_rate:.1f}%")
    return validation_results

def check_redis_cluster_health() -> Dict[str, Any]:
    """Check Redis cluster health (placeholder)."""
    return {"healthy": True, "nodes": 3, "status": "ok"}

def get_current_cache_hit_rate() -> float:
    """Get the current cache hit rate (placeholder)."""
    return 85.0
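
One way to use this validation is as a pre-deployment gate, so a rollout aborts when the environment is not ready (the exit-code convention here is an assumption to fit your CI pipeline):

import sys

if __name__ == "__main__":
    results = validate_scaling_readiness()
    for warning in results["warnings"]:
        print(f"⚠️  {warning}")
    for error in results["errors"]:
        print(f"❌ {error}")
    sys.exit(0 if results["ready"] else 1)  # non-zero aborts the CI deploy step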

2. Deployment Strategy

#!/bin/bash
# deploy-scaled.sh - Deploy a scaled Rizk SDK application
set -e

echo "🚀 Starting scaled deployment..."

# 1. Validate prerequisites
echo "📋 Validating prerequisites..."
kubectl cluster-info
kubectl get nodes
kubectl get pv  # Check persistent volumes

# 2. Deploy the Redis cluster first
echo "🔧 Deploying Redis cluster..."
kubectl apply -f redis-cluster.yaml
kubectl wait --for=condition=ready pod -l app=redis-cluster --timeout=300s

# 3. Deploy the application with minimal replicas
echo "🚀 Deploying application (minimal replicas)..."
kubectl apply -f deployment.yaml
kubectl wait --for=condition=available deployment/rizk-app --timeout=300s

# 4. Run health checks (assumes curl is available in the container image)
echo "🏥 Running health checks..."
kubectl get pods -l app=rizk-app
kubectl exec deployment/rizk-app -- curl -f http://localhost:8000/health

# 5. Deploy auto-scaling
echo "📈 Enabling auto-scaling..."
kubectl apply -f hpa.yaml
kubectl apply -f pdb.yaml

# 6. Configure monitoring
echo "📊 Setting up monitoring..."
kubectl apply -f monitoring.yaml

# 7. Run a load test to verify scaling
echo "🧪 Running scaling verification..."
kubectl run load-test --image=busybox --rm -it --restart=Never -- \
    sh -c 'for i in $(seq 1 100); do wget -q -O- http://rizk-app-service/health; done'

echo "✅ Scaled deployment completed successfully!"

# 8. Monitor scaling behavior (--watch blocks; press Ctrl+C to stop)
echo "👀 Monitoring scaling behavior..."
kubectl get hpa rizk-app-hpa --watch

Next Steps

  1. Production Setup - Deploy your scaled architecture
  2. Performance Tuning - Optimize for scale
  3. Security Best Practices - Secure your scaled deployment

Scaling Implementation Checklist

✅ Multi-instance configuration ready
✅ Shared cache infrastructure deployed
✅ Load balancing configured
✅ Auto-scaling policies defined
✅ Multi-region strategy planned
✅ Monitoring and alerting setup
✅ Performance benchmarks established
✅ Disaster recovery tested

Enterprise-scale LLM governance architecture