GuideDevOps
Lesson 7 of 14

Chaos Monkey & Simian Army

Part of the Chaos Engineering tutorial series.

Netflix's Chaos Engineering Innovation

Netflix pioneered Chaos Engineering in the cloud through a suite of tools known as the Simian Army. These tools automate the injection of various types of failures to ensure system resilience.

Chaos Monkey

Overview

Chaos Monkey is the most famous member of the Simian Army. It randomly terminates instances (virtual machines) in production to ensure the system handles server failures gracefully.

How It Works

  1. Runs on a Schedule: Typically runs during business hours (e.g., 9am-3pm), when engineers are on hand to respond
  2. Random Selection: Randomly picks instances across regions/availability zones
  3. Termination: Kills the selected instances without warning
  4. Observation: System should maintain steady-state and recover automatically
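
The steps above can be sketched as a small simulation over an in-memory fleet. This is only an illustration of the idea, not Chaos Monkey's actual implementation: `Instance`, `in_business_hours`, `pick_victim`, and `run_once` are hypothetical names, and the 9am-3pm weekday window is taken from the example in step 1.

```python
import random
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Instance:
    instance_id: str
    zone: str
    running: bool = True


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """Step 1: only act on weekdays inside the configured window."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(fleet: List[Instance], rng: random.Random) -> Optional[Instance]:
    """Step 2: choose one running instance uniformly at random."""
    candidates = [i for i in fleet if i.running]
    return rng.choice(candidates) if candidates else None


def run_once(fleet: List[Instance], now: datetime, rng: random.Random,
             leashed: bool = True) -> Optional[Instance]:
    """Steps 1-3. Returns the selected instance, or None if nothing was selected.

    When leashed, the victim is only reported (dry run) and keeps running.
    """
    if not in_business_hours(now):
        return None
    victim = pick_victim(fleet, rng)
    if victim is not None and not leashed:
        victim.running = False  # Step 3: terminate without warning
    return victim
```

Step 4 (observation) is deliberately left out: whether the system held steady is something monitoring has to answer, not the tool doing the killing.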

Why Kill Instances Randomly?

  • Prevents Complacency: Teams can't assume instances will always be there
  • Forces Resilience: Systems must tolerate abrupt instance loss, not just planned, graceful shutdowns
  • Tests Load Balancers: Ensures load balancers detect failures and route around them
  • Tests Auto-Scaling: Verifies new instances spin up and rejoin the load balancer
  • Tests Health Checks: Confirms health checks work properly

Configuration Example

# Chaos Monkey configuration (illustrative key names, not the tool's actual file format)
monkey:
  enabled: true
  termination_schedule: "0 9 * * *"  # 9am daily (minute hour day-of-month month day-of-week)
  regions:
    - us-east-1
    - us-west-2
  frequency:
    mean_time_between_kills: 1  # Kill an instance roughly every 1 day
  
  leashed: false  # true = dry-run, false = actual termination
  
  exceptions:
    - tag: "do_not_kill"
    - name: "*-prod-critical-*"
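
To make the `exceptions` entries concrete, here is a minimal sketch of how a tag and a name pattern could be used to filter kill candidates. It assumes instances are plain dictionaries with `name` and `tags` keys, and `is_exempt` and `eligible_targets` are hypothetical helpers; the real tool's matching rules may differ.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Mirrors the `exceptions` entries in the example configuration above.
EXEMPT_TAG = "do_not_kill"
EXEMPT_NAME_PATTERNS = ["*-prod-critical-*"]


def is_exempt(instance: Dict) -> bool:
    """True if the instance carries the exempt tag or matches a name pattern."""
    if EXEMPT_TAG in instance.get("tags", []):
        return True
    # fnmatchcase does case-sensitive shell-style matching on every platform.
    return any(fnmatchcase(instance["name"], p) for p in EXEMPT_NAME_PATTERNS)


def eligible_targets(fleet: List[Dict]) -> List[Dict]:
    """Instances that remain candidates for termination."""
    return [i for i in fleet if not is_exempt(i)]
```

For example, an instance named `api-prod-critical-01` would be skipped by the name pattern, and any instance tagged `do_not_kill` would be skipped regardless of its name.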

Expected System Behavior

Good: Instance is killed → New instance starts → Traffic reroutes → No user impact

Bad: Instance is killed → Traffic fails → Users see errors → Manual intervention needed
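
The difference between the two outcomes can be shown with a toy model. The sketch below assumes a hypothetical `ServiceGroup` whose steady state is defined as "at least `min_healthy` instances are serving"; it is a simplification for illustration, not how any particular cloud provider behaves.

```python
from typing import List, Tuple


class ServiceGroup:
    """Toy model of a service with optional automatic replacement."""

    def __init__(self, desired: int, min_healthy: int, auto_replace: bool):
        self.desired = desired
        self.min_healthy = min_healthy
        self.auto_replace = auto_replace
        self._counter = 0
        self.instances: List[str] = []
        self._fill()

    def _fill(self) -> None:
        # Launch instances until the desired capacity is reached.
        while len(self.instances) < self.desired:
            self._counter += 1
            self.instances.append(f"i-{self._counter}")

    def kill(self, instance_id: str) -> None:
        self.instances.remove(instance_id)

    def reconcile(self) -> None:
        """What an auto-scaling group does: replace missing capacity."""
        if self.auto_replace:
            self._fill()

    def steady_state(self) -> bool:
        return len(self.instances) >= self.min_healthy


def experiment(group: ServiceGroup) -> Tuple[bool, bool]:
    """Kill one instance; report (steady during failure, fully recovered)."""
    group.kill(group.instances[0])
    during = group.steady_state()
    group.reconcile()
    recovered = len(group.instances) == group.desired
    return during, recovered
```

A redundant group such as `ServiceGroup(desired=3, min_healthy=2, auto_replace=True)` yields `(True, True)`, the "good" path. A single instance without replacement yields `(False, False)`, the "bad" path that needs manual intervention.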

The Simian Army

Netflix expanded beyond Chaos Monkey with additional "monkeys" targeting different failure modes:

Chaos Gorilla

Targets: Full availability zone failures

What It Does: Terminates all instances in an entire availability zone

Why It Matters: Tests multi-AZ failover, data replication across zones, and DNS failover

Risk Level: High blast radius—typically run less frequently

chaos_gorilla:
  enabled: true
  kill_probability: 0.5  # Only 50% chance to actually run when triggered
  frequency: "monthly"
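
A rough sketch of what `kill_probability` and a zone-wide outage mean in practice: the run is first gated by a random draw, then every instance in one zone is removed and the survivors are checked against required capacity. The dictionary shape and the names `zone_outage` and `survives` are assumptions made for this example.

```python
import random
from typing import Dict, List, Optional, Tuple


def zone_outage(fleet: List[Dict], rng: random.Random,
                kill_probability: float = 0.5) -> Tuple[Optional[str], List[Dict]]:
    """Return (failed_zone, survivors).

    failed_zone is None when the probability gate skips this run
    or when the fleet is empty.
    """
    if rng.random() >= kill_probability:
        return None, list(fleet)
    zones = sorted({i["zone"] for i in fleet})
    if not zones:
        return None, []
    zone = rng.choice(zones)
    return zone, [i for i in fleet if i["zone"] != zone]


def survives(survivors: List[Dict], required_capacity: int) -> bool:
    """Can the remaining zones carry the load on their own?"""
    return len(survivors) >= required_capacity
```

With six instances spread evenly over three zones, losing any one zone leaves four, so the service survives only if it can run on four instances. That headroom is exactly what the experiment is meant to verify.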

Chaos Kong

Targets: Entire region failures

What It Does: Simulates an entire AWS region becoming unavailable

Why It Matters: Tests global failover, multi-region data consistency, and disaster recovery

Risk Level: Very high—typically used for special testing events
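
One question a region-failure exercise answers is whether the remaining regions can absorb the evacuated traffic. The sketch below assumes traffic is split by fractional weights per region and redistributes the failed region's share proportionally; `evacuate_region` and `overloaded` are hypothetical helpers, and real global routing involves far more than this arithmetic.

```python
from typing import Dict, List


def evacuate_region(weights: Dict[str, float], failed_region: str) -> Dict[str, float]:
    """Redistribute the failed region's traffic share across the others, proportionally."""
    if failed_region not in weights:
        raise KeyError(failed_region)
    remaining = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(remaining.values())
    if total == 0:
        raise RuntimeError("no healthy region left to absorb traffic")
    return {r: w / total for r, w in remaining.items()}


def overloaded(new_weights: Dict[str, float], total_rps: float,
               capacity_rps: Dict[str, float]) -> List[str]:
    """Regions whose post-failover load exceeds their capacity."""
    return [r for r, w in new_weights.items() if w * total_rps > capacity_rps[r]]
```

For instance, with weights of 0.5, 0.25, and 0.25, losing the largest region doubles the share of each survivor to 0.5. A region sized only for its usual quarter of traffic would show up in `overloaded`.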

Latency Monkey

Targets: Network latency issues

What It Does: Injects artificial latency (delays) into inter-service communication

Why It Matters: Tests timeout handling, circuit breakers, and graceful degradation

Common Injected Latencies:

  • 100-500ms: Client-perceptible slowdown
  • 1-5s: Service timeout scenarios
  • 10-30s: Hard timeout scenarios

latency_monkey:
  enabled: true
  rpc_latency_ms: 500  # Add 500ms to remote calls
  correlation_id_pattern: ".*latency.*"  # Only apply to requests matching pattern
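
The two settings above map onto a simple wrapper: delay a call only when its correlation ID matches the pattern. This is a minimal sketch assuming a hypothetical `LatencyInjector` class; the `sleep` parameter is injectable so the behavior can be checked without actually waiting.

```python
import re
import time
from typing import Any, Callable


class LatencyInjector:
    """Adds a fixed delay to calls whose correlation ID matches a pattern."""

    def __init__(self, delay_ms: int, correlation_id_pattern: str,
                 sleep: Callable[[float], None] = time.sleep):
        self.delay_s = delay_ms / 1000.0
        self.pattern = re.compile(correlation_id_pattern)
        self.sleep = sleep

    def call(self, func: Callable[..., Any], correlation_id: str,
             *args: Any, **kwargs: Any) -> Any:
        # Only requests that opt in via their correlation ID are slowed down.
        if self.pattern.fullmatch(correlation_id):
            self.sleep(self.delay_s)
        return func(*args, **kwargs)
```

Usage might look like `LatencyInjector(500, ".*latency.*").call(fetch_profile, "req-latency-42", user_id)`, where `fetch_profile` stands in for any remote call. The caller's timeout and fallback logic is what is really under test.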

Conformity Monkey

Targets: Configuration drift

What It Does: Verifies instances comply with expected configuration standards and terminates non-compliant instances

Why It Matters: Forces proper configuration management and prevents snowflake servers
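
Conformity checks are essentially a list of named rules applied to each instance. The rules below are invented for illustration (membership in an auto-scaling group, a health check URL, an owner tag), and the dictionary shape is an assumption rather than any real API response.

```python
from typing import Callable, Dict, List

# Hypothetical rule set; a real deployment would define its own standards.
RULES: Dict[str, Callable[[Dict], bool]] = {
    "in_auto_scaling_group": lambda i: i.get("asg") is not None,
    "has_health_check": lambda i: bool(i.get("health_check_url")),
    "has_owner_tag": lambda i: "owner" in i.get("tags", {}),
}


def violations(instance: Dict) -> List[str]:
    """Names of every rule the instance fails, in rule-definition order."""
    return [name for name, check in RULES.items() if not check(instance)]


def non_conforming(fleet: List[Dict]) -> Dict[str, List[str]]:
    """Map instance ID to its violations, omitting conforming instances."""
    report = {i["id"]: violations(i) for i in fleet}
    return {iid: v for iid, v in report.items() if v}
```

Whether a violation leads to termination or only to a notification to the owning team is a policy decision that sits on top of a report like this.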

Security Monkey

Targets: AWS security configuration issues

What It Does: Scans AWS accounts for security misconfigurations

Why It Matters: Identifies security group issues, public buckets, etc.
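
As an example of one such check, the sketch below flags ingress rules that expose sensitive ports to the whole internet. The port list and the simplified rule dictionaries (`cidr`, `from_port`, `to_port`) are assumptions for this example and do not match the shape of real AWS API responses.

```python
import ipaddress
from typing import Dict, List, Tuple

# Assumed set of ports that should never be world-reachable (SSH, RDP, MySQL, PostgreSQL).
SENSITIVE_PORTS = {22, 3389, 3306, 5432}


def open_to_world(rule: Dict) -> bool:
    """True for 0.0.0.0/0 or ::/0, i.e. a network with prefix length 0."""
    return ipaddress.ip_network(rule["cidr"]).prefixlen == 0


def risky_rules(security_groups: List[Dict]) -> List[Tuple[str, int, int]]:
    """Return (group_id, from_port, to_port) for each world-open sensitive rule."""
    findings = []
    for sg in security_groups:
        for rule in sg["ingress"]:
            exposes_sensitive = any(
                rule["from_port"] <= p <= rule["to_port"] for p in SENSITIVE_PORTS
            )
            if open_to_world(rule) and exposes_sensitive:
                findings.append((sg["id"], rule["from_port"], rule["to_port"]))
    return findings
```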

Janitor Monkey

Targets: Unused resources

What It Does: Cleans up unused resources (dangling security groups, unused load balancers, unattached volumes)

Why It Matters: Reduces costs and prevents configuration clutter
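
A cleanup pass usually combines "is it unused?" with "has it been unused long enough?". The sketch below applies that to unattached volumes; the 7-day retention period, the dictionary keys, and the name `cleanup_candidates` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def cleanup_candidates(volumes: List[Dict], now: datetime,
                       retention_days: int = 7) -> List[str]:
    """IDs of volumes that are unattached and have been so for the full retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [
        v["id"]
        for v in volumes
        # Attached volumes are skipped before their detach time is examined.
        if v["attached_to"] is None and v["detached_at"] <= cutoff
    ]
```

A grace period like this gives owners a chance to reclaim a resource before it is deleted.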

The Simian Army Architecture

How They Work Together

┌─────────────────────────────────┐
│   Chaos Monkey (Foundation)     │
│   - Random instance kills       │
└──────────────┬──────────────────┘
               │
      ┌────────┴────────┬─────────────┬──────────────┐
      │                 │             │              │
      ▼                 ▼             ▼              ▼
┌──────────┐  ┌──────────────┐  ┌───────────┐  ┌─────────────┐
│Gorilla   │  │Latency Monkey│  │Conformity │  │Security     │
│(AZ fail) │  │(network lag) │  │(config)   │  │(compliance) │
└──────────┘  └──────────────┘  └───────────┘  └─────────────┘

Running order, from smallest to largest blast radius: Conformity → Janitor → Chaos Monkey → Latency → Gorilla → Kong

Scheduling Considerations

Monday-Friday (Business Hours):
  9am  - Conformity Monkey runs
  10am - Chaos Monkey runs (random)
  11am - Latency Monkey runs (targeted)
  3pm  - Security Monkey audit

Weekly (Once):
  Friday 8pm - Chaos Gorilla runs (full AZ)

Monthly (Once):
  First Sunday of month - Chaos Kong runs (full region)
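
The example schedule above can be encoded as a lookup from a timestamp to the tools that are due. This is only a sketch of that calendar with hypothetical function names. Because the schedule gives no time of day for Chaos Kong, the code only answers whether a given date is a Kong day.

```python
from datetime import date, datetime
from typing import List

# Hour of day -> tool, for Monday to Friday.
WEEKDAY_SLOTS = {
    9: "Conformity Monkey",
    10: "Chaos Monkey",
    11: "Latency Monkey",
    15: "Security Monkey",
}


def monkeys_due(now: datetime) -> List[str]:
    """Tools scheduled to start in the current hour."""
    due = []
    if now.weekday() < 5 and now.hour in WEEKDAY_SLOTS:
        due.append(WEEKDAY_SLOTS[now.hour])
    if now.weekday() == 4 and now.hour == 20:  # Friday 8pm
        due.append("Chaos Gorilla")
    return due


def is_kong_day(day: date) -> bool:
    """First Sunday of the month: a Sunday that falls on day 1-7."""
    return day.weekday() == 6 and day.day <= 7
```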

Implementing Chaos Monkey

Prerequisites

  1. Auto-scaling Groups: Terminated instances must be replaced automatically
  2. Load Balancers: Traffic must reroute to healthy instances
  3. Health Checks: System must detect failures automatically
  4. Monitoring: You need to observe what happens
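
One way to enforce these prerequisites is a pre-flight gate that keeps the tool leashed (dry run) until all four are in place. The sketch assumes a service is described by a dictionary of boolean flags; the key names and both functions are hypothetical.

```python
from typing import Dict, List

PREREQUISITES = ("auto_scaling_group", "load_balancer", "health_check", "monitoring")


def missing_prerequisites(service: Dict[str, bool]) -> List[str]:
    """Prerequisites that are absent or disabled for this service."""
    return [p for p in PREREQUISITES if not service.get(p)]


def effective_leash(service: Dict[str, bool], requested_leashed: bool) -> bool:
    """Force a dry run whenever any prerequisite is missing."""
    if missing_prerequisites(service):
        return True
    return requested_leashed
```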

Basic Setup

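# NOTE: The image name and CHAOS_MONKEY_* variables below are illustrative.
# The open-source Chaos Monkey is deployed alongside Spinnaker and reads its
# settings from a chaosmonkey.toml file; check the project docs before use.

# 1. Install Chaos Monkey
docker pull netflix/chaosmonkey:latest

# 2. Configure (via environment variables)
export CHAOS_MONKEY_ENABLED=true
export CHAOS_MONKEY_LEASHED=false  # Actually kill instances (true = dry run)
export CHAOS_MONKEY_REGIONS=us-east-1,us-west-2
export CHAOS_MONKEY_SCHEDULE="0 9 * * MON-FRI"  # 9am weekdays

# 3. Run, passing every setting from step 2 into the container
docker run -e CHAOS_MONKEY_ENABLED="$CHAOS_MONKEY_ENABLED" \
           -e CHAOS_MONKEY_LEASHED="$CHAOS_MONKEY_LEASHED" \
           -e CHAOS_MONKEY_REGIONS="$CHAOS_MONKEY_REGIONS" \
           -e CHAOS_MONKEY_SCHEDULE="$CHAOS_MONKEY_SCHEDULE" \
           netflix/chaosmonkey:latest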

When to Use Chaos Monkey

✅ Good Use Cases

  • Testing auto-scaling behavior
  • Verifying load balancer health checks
  • Verifying that services tolerate abrupt instance termination
  • Testing service discovery mechanisms

❌ Avoid With

  • Custom hardware with long startup times
  • Non-redundant systems (no auto-scaling)
  • Stateful services without replication
  • Peak traffic periods

Modern Alternatives

While Chaos Monkey is powerful, newer tools offer additional features:

Tool          Focus                   Generation  Platform
------------  ----------------------  ----------  -----------
Chaos Monkey  Instance termination    Legacy      AWS-focused
Gremlin       Comprehensive failures  Modern      Multi-cloud
Litmus        Kubernetes-native       Modern      Kubernetes
Chaos Mesh    Kubernetes-native       Modern      Kubernetes

Key Takeaways

  1. Chaos Monkey established the practice: Random instance termination at Netflix changed the industry
  2. The Simian Army expanded the concept: Different tools target different failure types
  3. Not just for Netflix: These principles apply to any auto-scaling, multi-instance system
  4. Evolution continues: Modern tools like Gremlin and Litmus build on these foundations