Prerequisites
Before starting with Chaos Engineering, ensure you have foundational knowledge and tools in place.
Required Knowledge
1. Infrastructure Basics
- Understand how servers, networks, and storage work
- Familiar with Linux/Unix concepts (processes, network interfaces, file systems)
- Know what a load balancer does
- Understand database replication
Quick Self-Check:
- Can you explain what happens when a server crashes? ✓
- Do you know how traffic is routed to healthy servers? ✓
- Do you understand what latency and packet loss mean? ✓
2. Application Architecture
- Know the difference between monoliths and microservices
- Understand service-to-service communication (REST, gRPC, messaging)
- Familiar with health checks and load balancer integration
- Know what circuit breakers and retries do
Quick Self-Check:
- Can you draw your system's architecture? ✓
- Can you explain how your services depend on each other? ✓
- Do you know what happens when one service is slow? ✓
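To make the retry item above concrete, here is a minimal shell sketch of retrying a command with exponential backoff. The function name `retry_with_backoff` and its arguments are illustrative and not part of any tool used in this guide.

```shell
#!/bin/bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The delay doubles after every failed attempt (for example 1s, 2s, 4s).
retry_with_backoff() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

A call such as `retry_with_backoff 5 1 curl -fsS http://localhost:8080/health` (a hypothetical endpoint) would try up to five times, waiting 1, 2, 4, and 8 seconds between attempts. A circuit breaker goes one step further and stops calling the dependency altogether after repeated failures.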
3. Monitoring and Observability
- Can read and interpret metrics (latency, throughput, error rate)
- Familiar with logging and log parsing
- Understand distributed tracing concepts
- Can query monitoring systems (Prometheus, Datadog, CloudWatch, etc.)
Quick Self-Check:
- Can you find error spikes in your monitoring dashboard? ✓
- Can you correlate metrics across services? ✓
- Do you know where to look when something breaks? ✓
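As a small illustration of the error-rate metric mentioned above, the sketch below computes the percentage of 5xx responses in an access log. It assumes the common/combined log format, in which the HTTP status code is the ninth whitespace-separated field; the function name `error_rate` is hypothetical.

```shell
#!/bin/bash
# error_rate LOGFILE
# Prints the share of requests that returned a 5xx status, as a percentage.
# Assumption: the status code is field 9 (nginx/Apache combined log format).
error_rate() {
  awk '
    { total++ }
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", (errors / total) * 100
    }
  ' "$1"
}
```

Running `error_rate /var/log/nginx/access.log` before and during an experiment gives a rough number to compare against what the dashboard shows.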
4. Containerization and Kubernetes (if using K8s-based tools)
- Understand how containers work
- Familiar with Kubernetes concepts (pods, services, deployments)
- Know how to run kubectl commands
- Understand persistent volumes and ConfigMaps
Quick Self-Check:
- Can you deploy an application to Kubernetes? ✓
- Can you kill a pod and see it restart? ✓
- Do you understand what a NodePort service does? ✓
Required Tools Installation
1. Container Runtime
Docker (for local testing and accessing tools):
# macOS
brew install --cask docker # Docker Desktop; the plain formula installs only the CLI
# Linux (Ubuntu/Debian)
sudo apt-get install docker.io
# Linux (RHEL/CentOS)
sudo yum install docker
# Verify
docker --version
docker run hello-world
2. Kubernetes (for labs)
Option A: Minikube (local, single-node cluster):
# Install
brew install minikube # macOS
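# Linux (x86-64): minikube is not in the default apt repositories
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube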
# Start cluster
minikube start --cpus=4 --memory=8192
# Verify
kubectl cluster-info
kubectl get nodes
Option B: Kind (lightweight, Docker-based):
# Install
brew install kind # macOS
# Create cluster
kind create cluster --name chaos-lab
# Verify
kubectl cluster-info --context kind-chaos-lab
Option C: Cloud Kubernetes (AWS EKS, Azure AKS, GCP GKE):
# AWS EKS with eksctl
eksctl create cluster --name chaos-lab --nodes=3
# Verify
kubectl get nodes
kubectl get services -n default
3. Kubernetes Tools
# kubectl - Kubernetes CLI
# Usually comes with cluster installation
# Helm - Package manager for Kubernetes
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Verify
kubectl version
helm version
4. Monitoring Stack
Option A: Prometheus + Grafana (local):
# Using docker-compose
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
EOF
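The compose file above bind-mounts ./prometheus.yml, so that file should exist before the stack starts; otherwise Docker may create a directory in its place and Prometheus will not find a configuration. A minimal sketch, assuming Prometheus only needs to scrape itself for now (the 15-second interval and the job name are illustrative choices):

```shell
# Write a minimal Prometheus configuration next to docker-compose.yml.
# Assumption: Prometheus scrapes only its own metrics endpoint.
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
EOF
```

With that file in place, start the stack: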
docker-compose up -d
Option B: Cloud Monitoring (Datadog, New Relic, CloudWatch):
# Datadog agent example
docker run -d \
-e DD_AGENT_MAJOR_VERSION=7 \
-e DD_API_KEY=<YOUR_API_KEY> \
-e DD_SITE=datadoghq.com \
datadog/agent:latest
5. Text Editor / IDE
# VS Code
brew install visual-studio-code # macOS
# Or any editor you prefer (Vim, Neovim, Sublime, JetBrains, etc.)
Optional but Recommended
# htop - interactive process monitor
brew install htop # macOS
sudo apt-get install htop # Linux (Ubuntu/Debian)
# curl - for testing APIs
brew install curl
# jq - JSON parsing
brew install jq
# yq - YAML parsing
brew install yq
# git - for version control
git --version
Environment Setup
1. Create a Lab Project
# Create directory structure
mkdir -p ~/chaos-engineering-lab
cd ~/chaos-engineering-lab
mkdir -p {experiments,manifests,monitoring,scripts}
2. Initialize Git Repository
git init
git config user.name "Your Name"
git config user.email "you@example.com"
# Create basic README
cat > README.md << 'EOF'
# Chaos Engineering Lab
## Structure
- experiments/: Chaos experiment definitions
- manifests/: Kubernetes manifests for test application
- monitoring/: Monitoring configuration
- scripts/: Helper scripts
## Quick Start
1. Start minikube: `minikube start`
2. Deploy test app: `kubectl apply -f manifests/`
3. Run experiment: `./scripts/first-experiment.sh`
EOF
git add README.md
git commit -m "Initial commit"
3. Deploy a Test Application
Create a simple application to run chaos tests against:
# manifests/test-app.yaml
cat > manifests/test-app.yaml << 'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-testing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: chaos-testing
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
  namespace: chaos-testing
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 80
  type: LoadBalancer
EOF
# Deploy it
kubectl apply -f manifests/test-app.yaml
# Verify
kubectl get pods -n chaos-testing
kubectl get service web-app -n chaos-testing
4. Install a Chaos Engineering Tool
For Kubernetes: Install Litmus Chaos
# Add namespace label for pod security
kubectl label namespace chaos-testing pod-security.kubernetes.io/enforce=baseline
# Install Litmus (if using Kubernetes 1.25+)
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus \
--namespace litmus \
--create-namespace \
--set adminUser.name=admin \
--set adminUser.password=litmus
# Verify installation
kubectl get pods -n litmus
For VMs/Servers: Install Gremlin
# On RPM-based Linux (RHEL/CentOS)
curl -O https://downloads.gremlin.com/gremlin/downloads/client/latest/linux/gremlin-latest.linux_amd64.rpm
sudo rpm -i gremlin-latest.linux_amd64.rpm
# Authenticate
sudo gremlin config set -c <TEAM_ID> -p <PRIVATE_KEY>
# Start
sudo systemctl start gremlin
gremlin check
First Experiment
Simple Test: Kill a Container
#!/bin/bash
# scripts/first-experiment.sh
echo "=== First Chaos Experiment: Pod Deletion ==="
# Baseline: Check how many pods are running
echo "Initial pod count:"
kubectl get pods -n chaos-testing
# Run first experiment
echo "Deleting one pod..."
kubectl delete pod -n chaos-testing \
$(kubectl get pods -n chaos-testing -l app=web-app -o jsonpath='{.items[0].metadata.name}')
# Observe immediate recovery
echo "Checking pod status after 5 seconds..."
sleep 5
kubectl get pods -n chaos-testing
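# Count ready pods via the Deployment status
# (kubectl's JSONPath does not handle nested filters reliably;
# readyReplicas is omitted from the status when it is zero)
READY=$(kubectl get deployment web-app -n chaos-testing -o jsonpath='{.status.readyReplicas}')
READY=${READY:-0}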
echo "Ready pods: $READY/3"
if [ "$READY" -eq 3 ]; then
echo "✓ PASS: System recovered automatically"
else
echo "✗ FAIL: Expected 3 pods, found $READY"
fi
Run it:
chmod +x scripts/first-experiment.sh
./scripts/first-experiment.sh
Expected Output
=== First Chaos Experiment: Pod Deletion ===
Initial pod count:
NAME READY STATUS RESTARTS AGE
web-app-85d98d8c68-4kqm5 1/1 Running 0 5m
web-app-85d98d8c68-bx2jk 1/1 Running 0 5m
web-app-85d98d8c68-mnkl9 1/1 Running 0 5m
Deleting one pod...
pod "web-app-85d98d8c68-4kqm5" deleted
Checking pod status after 5 seconds...
NAME READY STATUS RESTARTS AGE
web-app-85d98d8c68-bx2jk 1/1 Running 0 5m
web-app-85d98d8c68-mnkl9 1/1 Running 0 5m
web-app-85d98d8c68-p9nml2 1/1 Running 0 3s
Ready pods: 3/3
✓ PASS: System recovered automatically
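A fixed five-second sleep can report a false failure on a slow cluster. One alternative is to poll until the expected number of pods is ready or a deadline passes. The sketch below shows the idea; `wait_for_count` and its arguments are hypothetical helpers, not part of kubectl.

```shell
#!/bin/bash
# wait_for_count EXPECTED TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND roughly once per second until it prints EXPECTED
# or TIMEOUT_SECONDS have elapsed. Empty output is treated as 0.
wait_for_count() {
  local expected=$1
  local timeout=$2
  shift 2
  local elapsed=0
  local actual=""
  while [ "$elapsed" -le "$timeout" ]; do
    # A failing command is treated like "not ready yet".
    actual=$("$@" 2>/dev/null) || true
    if [ "${actual:-0}" = "$expected" ]; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out after ${timeout}s (last value: ${actual:-0})" >&2
  return 1
}
```

In the experiment script this could replace the sleep and the manual count, for example `wait_for_count 3 60 kubectl get deployment web-app -n chaos-testing -o jsonpath='{.status.readyReplicas}'`. The elapsed time it prints is also a rough measure of recovery time.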
Verification Checklist
Before moving on to the main tutorials, verify that you can:
- Start a Kubernetes cluster
- Deploy an application
- Check pod status with kubectl
- Access monitoring dashboard
- Trigger a simple chaos experiment
- Observe system recovery
- Read experiment results
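Part of this checklist can be automated. The sketch below reports which command-line tools are missing from PATH; the function name `missing_tools` is hypothetical, and the list of tools to check is up to you.

```shell
#!/bin/bash
# missing_tools TOOL...
# Prints the name of each tool that is not found on PATH, one per line.
# Returns 0 when every tool is present and 1 otherwise.
missing_tools() {
  local status=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}
```

For the setup in this guide, `missing_tools docker kubectl helm git jq` prints nothing and returns 0 when everything is installed.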
Troubleshooting
Minikube won't start
minikube delete
minikube start --cpus=4 --memory=8192
Docker not running
# macOS
open /Applications/Docker.app
# Linux
sudo systemctl start docker
kubectl connection refused
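# Reset kubeconfig (move it aside rather than deleting it: it may hold credentials for other clusters)
mv ~/.kube/config ~/.kube/config.bak
minikube start # or re-authenticate with your cluster
Metrics not showing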
# Install metrics-server for Kubernetes
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify
kubectl top nodes
Next Steps
Now that your environment is set up:
- Read Foundations → "Introduction to Chaos Engineering"
- Learn Principles → "Principles of Chaos"
- Understand Benefits → "Why Chaos Engineering Matters"
- Run Your First Test → Follow tool-specific tutorials
- Design Experiments → Use the design methodology