Advanced Patterns Overview
Building on SRE fundamentals, advanced patterns address complex scenarios at scale:
Beginner SRE Patterns:
✅ Basic monitoring and alerting
✅ Simple runbooks
✅ Error budgets
✅ On-call rotations
Advanced SRE Patterns:
🚀 Predicting failures (machine learning)
🚀 Multi-region resilience
🚀 Game days and resilience testing
🚀 Advanced observability (semantic)
🚀 SRE for microservices
🚀 Cost-aware reliability
Pattern 1: Predictive Reliability (AIOps)
Predicting Before Failure
Instead of reactive alerting, predict and prevent:
Traditional (Reactive):
- Metric exceeds threshold
→ Alert fires
→ Response begins
→ (Service already degraded)
Predictive (Proactive):
- Trend analysis shows steady increase
→ Predict capacity will be exceeded in days
→ Proactive scaling before threshold
→ (Service maintains performance)
Machine Learning for Anomaly Detection
# Example: ML-based anomaly detection
from sklearn.ensemble import IsolationForest
import numpy as np

def detect_anomalies(time_series_data, contamination=0.01):
    """
    Use Isolation Forest to detect anomalous metrics.

    Advantages over threshold-based alerting:
    - Learns normal patterns (handles seasonality)
    - Adapts to changing baselines
    - Detects subtle anomalies
    """
    X = np.array(time_series_data).reshape(-1, 1)

    # Fit the model on the historical data, then flag anomalies:
    # -1 = anomaly, 1 = normal
    iso_forest = IsolationForest(
        contamination=contamination,  # expected fraction of anomalies
        random_state=42
    )
    predictions = iso_forest.fit_predict(X)
    anomaly_indices = np.where(predictions == -1)[0]
    return anomaly_indices

# Usage
cpu_metrics = [45, 48, 46, 47, 45, 75, 76, 74, 45, 46]  # Spike at indices 5-7
anomalies = detect_anomalies(cpu_metrics, contamination=0.3)
print(f"Anomalous points: {anomalies}")  # the spike indices
Forecasting Capacity
# Predict when capacity will be exceeded
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def forecast_capacity(historical_usage, periods_ahead=30, capacity_limit=85):
    """Forecast when usage will hit capacity"""
    # Fit exponential smoothing model
    model = ExponentialSmoothing(
        historical_usage,
        seasonal_periods=7,  # Weekly seasonality
        trend='add',
        seasonal='add'
    )
    fit = model.fit()

    # Forecast ahead
    forecast = fit.forecast(steps=periods_ahead)

    # Find the first day the forecast exceeds capacity
    days_until_capacity = None
    for day, value in enumerate(forecast):
        if value > capacity_limit:
            days_until_capacity = day
            break

    return {
        'forecast': forecast,
        'days_until_capacity': days_until_capacity,
        'action': ('Scale now'
                   if days_until_capacity is not None and days_until_capacity < 14
                   else 'Monitor')
    }
Pattern 2: Multi-Region Resilience
Active-Active Replication
Systems that operate across multiple regions simultaneously:
Active-Active Architecture:
┌─────────────────────────────────────────┐
│ Global Traffic │
│ (Anycast / Geo-routing) │
└────────┬────────────────┬───────────────┘
│ │
Region A Region B
┌─────────┐ ┌─────────┐
│ API │ │ API │
│ Server │◄─────►│ Server │
│ DB Sync │ │ DB Sync │
└─────────┘ └─────────┘
│ │
└────────┬───────┘
│
(Data replication, ~150ms latency)
Consistency vs Availability Trade-off
# CP (Consistency + Partition tolerance)
- Strong consistency across regions
- May not be available (wait for confirmation)
- Example: Financial transactions
# AP (Availability + Partition tolerance)
- Always available
- May have eventual consistency
- Example: Social media likes/comments
SRE Decision:
- For critical operations: Sacrifice some availability for consistency
- For UX features: Sacrifice some consistency for availability
- Monitor and test in chaos scenarios
Database Replication Strategies
// Example: Multi-region database
// Approach 1: Leader-Follower (Master-Slave)
//   Leader in Region A → Replica in Region B (one-way replication)
//   Problem: Region B can't take writes (must failover)
// Approach 2: Multi-Master (Active-Active)
//   Region A and Region B both accept writes
//   Problem: Potential conflicts, need conflict resolution

@Entity
public class ConflictResolution {
    private String data;  // the replicated payload

    // Last-Writer-Wins: latest timestamp wins
    @Column(columnDefinition = "timestamp DEFAULT CURRENT_TIMESTAMP")
    private LocalDateTime lastUpdated;

    // LWW conflict resolution
    public void mergeUpdates(ConflictResolution remote) {
        if (remote.lastUpdated.isAfter(this.lastUpdated)) {
            this.data = remote.data;
            this.lastUpdated = remote.lastUpdated;
        }
    }
}
Pattern 3: Chaos Engineering (Game Days)
Structured Resilience Testing
Game Days are scheduled chaos events:
Game Day Scenario: "Region Failure"
Objective: Can we failover if US-East region disappears?
Timeline:
09:00 - Kick-off meeting (explain scenario)
09:15 - Check monitoring setup
09:30 - Chaos starts: Block all traffic to US-East
(Simulated, not actual)
09:31 - Team begins incident response
09:45 - Service should have failed over to other regions
10:00 - Verify service works
10:15 - Restore US-East traffic
10:30 - Debrief (what went wrong?)
11:00 - Document lessons learned
Examples of findings:
- Failover took 5 minutes (need faster)
- Monitoring didn't alert properly on traffic shift
- Database replication lag caused data loss
- Domain DNS wasn't configured for fallback
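Findings like the dependency-failure case can also be rehearsed between game days with a small fault-injection harness. A minimal sketch, where the wrapper, service, and 50% failure rate are all hypothetical:

```python
import random

# Hypothetical fault injector: wraps a dependency call and fails
# a configurable fraction of requests, as a chaos tool would.
def flaky(call, failure_rate=0.5, rng=random.Random(42)):
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return call(*args, **kwargs)
    return wrapped

def get_user(user_id):  # the real dependency
    return {"id": user_id, "name": "Ada"}

def handler(user_id, fetch=get_user):  # caller with a fallback path
    try:
        return fetch(user_id)
    except ConnectionError:
        return {"id": user_id, "name": None, "degraded": True}

# Chaos run: inject 50% failures and verify the caller never crashes
chaotic_fetch = flaky(get_user, failure_rate=0.5)
results = [handler(1, fetch=chaotic_fetch) for _ in range(100)]
degraded = sum(1 for r in results if r.get("degraded"))
print(f"degraded responses: {degraded}/100")  # roughly half
```

The same pattern scales up: swap the in-process wrapper for a chaos tool at the network layer and assert the same invariant (every request gets a well-formed response).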
Chaos Engineering Tools
Tools for chaos:
- Gremlin: Commercial chaos platform
- Chaos Mesh: Open source Kubernetes chaos
- Locust: Load testing (useful for stress scenarios)
- Toxiproxy: Network fault injection (latency, timeouts)
- Custom scripts: Specific to your systems
What to test:
- Server failure (terminate process/instance)
- Network latency (slow down or delay)
- Packet loss (drop % of traffic)
- Database failure (make queries timeout)
- Dependency failure (mock service returns errors)
Pattern 4: Observability at Scale
Semantic Observability
Moving beyond raw metrics to semantic understanding:
Traditional Metrics:
- CPU: 75%
- Memory: 82%
- Requests/sec: 1250
- Error rate: 0.5%
Semantic Observability:
- Service A: Healthy (all SLOs met)
- Service B: Degraded (latency SLO violated)
- Service C: Unhealthy (error budget depleted)
- Critical path: Database service delay root cause
- Likely impact: Payment processing affected
- Recommendation: Scale database or throttle writes
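The jump from raw metrics to semantic status can start small: evaluate each service's SLOs and error budget and emit a health verdict. A minimal sketch, where the SLO targets and service data are made up:

```python
# Map raw metrics to a semantic health status per service.
# SLO targets and sample values below are illustrative.
SLOS = {"latency_p99_ms": 200, "error_rate_pct": 0.5}

def health_status(metrics, budget_remaining_pct):
    """Classify a service as healthy / degraded / unhealthy."""
    if budget_remaining_pct <= 0:
        return "unhealthy"  # error budget depleted
    violations = [
        name for name, limit in SLOS.items()
        if metrics.get(name, 0) > limit
    ]
    return "degraded" if violations else "healthy"

services = {
    "service-a": ({"latency_p99_ms": 120, "error_rate_pct": 0.1}, 80),
    "service-b": ({"latency_p99_ms": 350, "error_rate_pct": 0.2}, 40),
    "service-c": ({"latency_p99_ms": 150, "error_rate_pct": 2.0}, 0),
}

for name, (metrics, budget) in services.items():
    print(name, health_status(metrics, budget))
# service-a healthy, service-b degraded, service-c unhealthy
```

Root-cause hints and recommendations layer on top of these verdicts by walking the service dependency graph.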
Structured Logging
# Instead of free-text logs, use structured logs

# ❌ Bad (text log)
logger.info("Request from 192.168.1.1 completed in 150ms with status 200")

# ✅ Good (structured log, e.g. with structlog's key-value API)
log.info(
    "request_complete",
    timestamp='2024-01-15T09:30:00Z',
    client_ip='192.168.1.1',
    service='payment-api',
    endpoint='/payments',
    method='POST',
    duration_ms=150,
    status_code=200,
    user_id='user-12345',
    transaction_id='txn-98765',
    trace_id='trace-abc123',
    severity='info'
)
Real-Time Alerting with Context
# Advanced alerting with context
Alert: High Error Rate
Service: payment-api
Severity: Critical
Context:
- Deployment: v2.3.1 deployed 15 min ago
- Recent changes: Added new payment processor
- Database: Replication lag is normal
- Dependencies: All healthy
Root cause detection:
- 90% of errors from new payment processor
- Processor failing with auth timeouts
- Auth service healthy (not cause)
- New processor has wrong credentials
Recommended action:
- Immediate: Roll back to v2.3.0
- Quick: Verify credentials for new processor
- Follow-up: Add pre-deployment validation
Pattern 5: SRE for Microservices
Distributed Systems Challenges
Monolith: 1 service, 1 database, 1 failure mode
Microservices: 50+ services, 10+ databases, 1000+ failure modes
SRE must adapt:
- More services = more monitoring needed
- More services = more dependencies = harder to trace
- More services = more deployment complexity
- More surface area for failures
Service Mesh for Observability
Service mesh (like Istio) provides:
# Automatically (without code changes):
- Request tracing across services
- Latency metrics per service pair
- mTLS encryption between services
- Retry logic and circuit breakers
- Rate limiting and load balancing
# Example metrics provided automatically (Istio-style, illustrative):
requests_total{
    source_service="payment-api",
    dest_service="user-service",
    status="success"
}
request_duration_seconds{...}  # latency histogram per service pair

# Immediately get insights:
- Traffic flow between services
- Error rates per service pair
- Latency per path
- Automatic identification of bottlenecks
Dependency Management
# Track service dependencies
class ServiceDependencyGraph:
    dependencies = {
        'payments': ['user-service', 'auth', 'database'],
        'orders': ['user-service', 'inventory', 'database'],
        'inventory': ['warehouse', 'database'],
    }

    def critical_services(self):
        """Find services that break everything if they fail"""
        dependency_count = {}
        for service, deps in self.dependencies.items():
            for dep in deps:
                dependency_count[dep] = dependency_count.get(dep, 0) + 1
        # Services depended on by many others come first
        return sorted(dependency_count.items(), key=lambda x: x[1], reverse=True)

# Here: database (3 dependents) and user-service (2) are most critical
# Use this to prioritize:
# - Which services need the highest SLOs?
# - Which services to focus chaos testing on?
# - Where to invest reliability efforts?
Pattern 6: Cost-Aware Reliability
Balancing Cost and Reliability
Too cheap (<< industry standard):
- Not enough infrastructure
- SLO frequently missed
- Customers unhappy
- Business loses money
Too expensive (>> necessary):
- Over-provisioned
- Wasting money
- Not competitive
- Shareholders unhappy
Goldilocks (right reliability level):
- SLO achievable and met
- Appropriate cost for business
- Competitive pricing
- Sustainable business
Reliability ROI Calculation
# Investment in reliability vs. benefit
class ReliabilityROI:
    def __init__(self, service):
        self.service = service

    def calculate_optimal_reliability(self):
        # Monthly infrastructure cost for each reliability level
        cost_99_0 = 10_000    # $10k/month for 99%
        cost_99_9 = 30_000    # $30k/month for 99.9%
        cost_99_99 = 100_000  # $100k/month for 99.99%

        # Cost of downtime
        downtime_cost_per_min = 1_000  # $1,000 per minute

        # Downtime allowed at each reliability level (minutes per year)
        downtime_99_0 = 365 * 24 * 60 * 0.01     # ~5,256 minutes
        downtime_99_9 = 365 * 24 * 60 * 0.001    # ~526 minutes
        downtime_99_99 = 365 * 24 * 60 * 0.0001  # ~53 minutes

        # Total cost per year (infrastructure + downtime)
        total_99_0 = cost_99_0 * 12 + downtime_99_0 * downtime_cost_per_min
        total_99_9 = cost_99_9 * 12 + downtime_99_9 * downtime_cost_per_min
        total_99_99 = cost_99_99 * 12 + downtime_99_99 * downtime_cost_per_min

        return {
            '99%': total_99_0,
            '99.9%': total_99_9,
            '99.99%': total_99_99,
            'optimal': min([
                ('99%', total_99_0),
                ('99.9%', total_99_9),
                ('99.99%', total_99_99)
            ], key=lambda x: x[1])
        }

# With these numbers, 99.9% is optimal: ~$886k/year vs ~$5.4M (99%) and ~$1.25M (99.99%)
Pattern 7: Gradual Rollouts
Reducing Risk in Deployment
Traditional deployment (high risk):
- Deploy to all users
- If broken → many impacts
- MTTR is critical
Gradual deployment (low risk):
- Deploy to 1% of users
- Monitor for issues
- If good → 5% of users
- If good → 25% of users
- If good → 100% of users
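The promote-or-rollback decision above can be sketched as a simple gating loop; the stage percentages and the health check here are illustrative:

```python
# Sketch of a gradual-rollout gate: promote through stages only
# while the canary stays healthy; roll back on the first failure.
STAGES = [1, 5, 25, 100]  # percent of users

def run_rollout(is_healthy, stages=STAGES):
    """is_healthy(pct) -> bool, e.g. backed by error-rate metrics."""
    for pct in stages:
        # (route pct% of traffic to the new version here)
        if not is_healthy(pct):
            return ("rolled_back", pct)
    return ("deployed", 100)

# Healthy canary reaches 100%:
print(run_rollout(lambda pct: True))      # ('deployed', 100)
# Failure at 25% triggers rollback before full exposure:
print(run_rollout(lambda pct: pct < 25))  # ('rolled_back', 25)
```

In production, a controller handles this loop and the traffic shifting for you, as in the configuration that follows.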
Implementation
# Example: automated canary analysis with Flagger (fields per the Canary CRD)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  progressDeadlineSeconds: 60
  skipAnalysis: false
  service:
    port: 8080
  analysis:
    interval: 1m     # check metrics every minute
    threshold: 5     # failed checks before automatic rollback
    maxWeight: 50    # shift traffic 5% → 10% → ... → 50%, then promote
    stepWeight: 5
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99.5  # i.e. max 0.5% errors
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 100   # max 100ms latency
        interval: 1m
    webhooks:
      - name: smoke-tests
        url: http://flagger-loadtester/
# Flagger raises the canary weight by stepWeight after each healthy interval,
# then promotes to 100%; any sustained metric breach rolls traffic back automatically.
Pattern 8: Error Budgets as Throttle
Advanced use of error budgets:
Error Budget as Deployment Gating:
HIGH error budget remaining (>50%):
Deployment strategy: Aggressive
- Canary: 5% → 25% → 100%
- Speed: Deploy by end of day
Risk tolerance: High
MEDIUM error budget (20-50%):
Deployment strategy: Conservative
- Canary: 5% → 10% → 50% → 100%
- Speed: Staggered over 2-3 hours
Risk tolerance: Medium
LOW error budget (under 20%):
Deployment strategy: Ultra-conservative
- Canary: 1% → 5% → 25% → 100%
- Speed: Staggered over full day
Risk tolerance: Low
NO error budget (under 5%):
Deployment strategy: Emergency only
- No new features
- Critical fixes only
- Full hands-on monitoring
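This gating policy is straightforward to encode so deployment tooling can enforce it automatically. A minimal sketch using the budget tiers above:

```python
# Pick a deployment strategy from the remaining error budget,
# mirroring the policy tiers described above.
def deployment_strategy(budget_remaining_pct):
    if budget_remaining_pct > 50:
        return {"mode": "aggressive", "canary": [5, 25, 100]}
    if budget_remaining_pct >= 20:
        return {"mode": "conservative", "canary": [5, 10, 50, 100]}
    if budget_remaining_pct >= 5:
        return {"mode": "ultra-conservative", "canary": [1, 5, 25, 100]}
    return {"mode": "emergency-only", "canary": []}  # critical fixes only

print(deployment_strategy(70)["mode"])  # aggressive
print(deployment_strategy(10)["mode"])  # ultra-conservative
print(deployment_strategy(3)["mode"])   # emergency-only
```

Wiring this check into the CI/CD pipeline turns the error budget from a reporting number into an actual throttle on release velocity.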
Bringing It All Together
The Advanced SRE Maturity Model
Level 1: Basic SRE
- Manual runbooks, basic monitoring
- Reactive incident response
- Fixed SLOs
Level 2: Intermediate SRE
- Automated response, good monitoring
- Proactive incident prevention
- Error budgets driving decisions
Level 3: Advanced SRE
- Predictive reliability (ML)
- Multi-region resilience
- Cost-aware reliability
- Advanced deployment strategies
Level 4: Strategic SRE
- AI-driven operations (AIOps)
- Self-healing systems
- Reliability as competitive advantage
- Organization-wide reliability culture
Key Takeaways
✓ Advanced patterns require scale and maturity
✓ Predictive reliability prevents incidents
✓ Multi-region systems need careful design
✓ Chaos testing validates resilience
✓ Observability must be semantic
✓ Microservices multiply complexity
✓ Cost and reliability are both important
✓ Gradual deployments reduce risk
✓ Error budgets gate deployment strategy
✓ Continuous improvement mindset essential