Chaos Engineering 101: Simulating Process Failures with ‘Process Roulette’ Safely
Hands-on chaos engineering lab: safely run a sandboxed 'process roulette' to learn fault injection, graceful degradation, and observability.
Learn resilience without breaking your machine
Struggling to learn chaos engineering because every tutorial tells you to "break things"—and you don't want to break your laptop, classmates' VMs, or production? You're not alone. Many students and early-career engineers want hands-on fault injection but lack a safe, structured environment and the know-how to limit blast radius. This tutorial uses the playful idea of process roulette—a tool that randomly kills processes—to teach foundational chaos engineering concepts like fault injection, graceful degradation, and safe sandboxing. By the end you'll be able to run controlled experiments in containers and Kubernetes, observe impact with Prometheus, and design abort-safe experiments driven by SLOs.
Quick summary: What you will get
- Why process-level chaos is a valuable learning tool in 2026
- A safe, step-by-step lab to run a constrained "process roulette" in a sandbox
- Code: a minimal process-roulette Python script with safety controls
- How to integrate telemetry and abort criteria using Prometheus and SLOs
- Advanced paths: eBPF-based fault injection, k8s chaos operators, and GitOps-driven experiments
Why process-level chaos matters now (2026 trends)
In late 2025 and into 2026 the focus of chaos engineering has shifted from only killing entire nodes or pods to more fine-grained, realistic faults: thread crashes, process OOMs, signal-based failures, and syscall-level errors. Modern stacks use ephemeral containers, sidecars, and service meshes, and teams need to validate graceful degradation at the process and container level. At the same time, observability is better integrated into CI pipelines and GitOps flows, letting educators and teams automate safety checks before experiments run.
Newer toolchains also give learners safer primitives: eBPF-based fault injection, sandboxed namespaces, and policies that scope experiments by namespace, label, or Kubernetes service—so you can teach true failure modes without risking production.
Core principles you'll use
- Minimize blast radius – run experiments in isolated environments (containers, VMs, namespaces) and only against test services.
- Declare hypotheses and abort criteria – state what you expect and when to stop the experiment.
- Measure before and after – baseline latency, error rates, and resource metrics.
- Iterate from process to system – start small (kill a background worker) and increase scope to containers/pods only after controlled success.
Safe setup: where to run process roulette
Never run random process-killing tools against your laptop's primary OS or any system holding real work. Use one of the following sandboxed environments:
- Docker container with its own PID namespace—easy for labs.
- Local VM via Vagrant, Multipass, or a cloud sandbox.
- Kubernetes namespace dedicated to experiments (with NetworkPolicy and resource quotas).
- Ephemeral CI job where experiments run inside a disposable runner.
We'll focus on the Docker + Kubernetes paths because they are reproducible for students.
Minimal demo: build a safe process-roulette
We'll create a minimal web service and a constrained process-roulette that randomly sends signals to processes. The key safety controls are:
- whitelist and blacklist of PIDs/names
- dry-run mode
- rate limits on kills
- max-kills and timeout
- logging and telemetry hooks
1. Simple web service (test target)
Use a tiny Python HTTP server that logs requests and exposes a health endpoint. Run it inside a container so process-terminations don't escape your sandbox.
```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health endpoint for liveness checks
        if self.path == '/health':
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'hello from test service')


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 8080), Handler)
    print('serving on 8080')
    server.serve_forever()
```
2. process_roulette.py (safe defaults)
The following script demonstrates a compact, safe process-killer you can run inside a container. It only targets processes whose command line matches a whitelist pattern, defaults to dry-run mode, and obeys both a max-kills limit and a hard timeout.
```python
#!/usr/bin/env python3
import os
import random
import signal
import time

WHITELIST = ['python']              # only target processes whose cmdline contains these
BLACKLIST_PIDS = {1, os.getpid()}   # never kill PID 1 or this script itself
DRY_RUN = True                      # set to False to actually send signals
MAX_KILLS = 3
KILL_INTERVAL = 2                   # seconds between attempts


def list_candidate_pids():
    """Scan /proc for PIDs whose command line matches the whitelist."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        pid = int(entry)
        if pid in BLACKLIST_PIDS:
            continue
        try:
            with open(f'/proc/{pid}/cmdline', 'rb') as f:
                cmd = f.read().decode('utf-8').replace('\x00', ' ').strip()
            for w in WHITELIST:
                if w in cmd:
                    pids.append((pid, cmd))
                    break
        except Exception:
            # Process exited or is unreadable; skip it
            continue
    return pids


def attempt_kill(pid, sig=signal.SIGTERM):
    if DRY_RUN:
        print(f'dry-run: would send {sig} to {pid}')
        return True
    try:
        os.kill(pid, sig)
        print(f'sent {sig} to {pid}')
        return True
    except ProcessLookupError:
        print(f'pid {pid} not found')
    except PermissionError:
        print(f'no permission to kill {pid}')
    return False


if __name__ == '__main__':
    kills = 0
    start = time.time()
    while kills < MAX_KILLS and time.time() - start < 300:  # hard 5-minute timeout
        candidates = list_candidate_pids()
        if not candidates:
            print('no candidates found')
            time.sleep(KILL_INTERVAL)
            continue
        pid, cmd = random.choice(candidates)
        print(f'selected pid {pid} cmd {cmd}')
        if attempt_kill(pid):
            kills += 1
        time.sleep(KILL_INTERVAL)
    print('done')
```
This script is intentionally conservative. For real experiments you will:
- set DRY_RUN=False
- add logging to stdout/stderr consumed by your observability stack
- restrict whitelist patterns to the exact binary names used by your test app
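Rather than editing constants in the script, one option is to read the safety knobs from environment variables so an instructor can flip dry-run per container at launch time. A minimal sketch, assuming the variable names below (they are not part of the script above):

```python
import os


def load_safety_config():
    """Read safety knobs from the environment, with conservative defaults.

    DRY_RUN defaults to true: a typo in the env var must never enable real kills.
    """
    return {
        'dry_run': os.environ.get('DRY_RUN', 'true').lower() != 'false',
        'max_kills': int(os.environ.get('MAX_KILLS', '3')),
        'kill_interval': float(os.environ.get('KILL_INTERVAL', '2')),
        'whitelist': os.environ.get('WHITELIST', 'python').split(','),
    }


cfg = load_safety_config()
print(cfg)
```

The instructor then enables real kills with `docker run -e DRY_RUN=false ...` while students default to the safe mode.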
Run it in Docker: safe, reproducible lab
- Build a Docker image that contains both the web server and process_roulette script.
- Run the container with a dedicated network and resource limits: CPU and memory quotas reduce the risk of noisy failures.
- Use Docker's PID namespace isolation so kills don't affect the host.
Example run commands (instructor):
```bash
docker build -t chaos-lab .
docker run --rm --name chaos-test -p 8080:8080 --memory=256m --cpus=0.5 chaos-lab
```
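A minimal Dockerfile sketch for the lab image; the file names `server.py` and `process_roulette.py` are assumptions about how you saved the scripts above:

```dockerfile
FROM python:3.12-slim
WORKDIR /app
# Both the target service and the roulette script live in one image,
# so kills stay inside the container's PID namespace
COPY server.py process_roulette.py ./
EXPOSE 8080
# If you make DRY_RUN env-configurable, pin the safe default here
CMD ["python", "server.py"]
```

From a second shell, `docker exec chaos-test python process_roulette.py` runs the roulette inside the same sandboxed namespace.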
Observability: prove what changed
No experiment is useful without metrics. For a student lab, instrument these points:
- Request latency and HTTP error rate from the test service
- Process restarts and exit codes
- CPU and memory usage of the container
- Logs from process_roulette and the target app
Use Prometheus scrape targets and Grafana dashboards in your sandbox. In 2026 many classrooms embed lightweight metrics stacks via docker-compose or a k8s local cluster (k3s, kind) to make this easy.
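As a dependency-free sketch of the idea, the test service can expose counters in Prometheus' text exposition format on a `/metrics` path; the metric names below are illustrative, and a real lab would normally use the `prometheus_client` library instead:

```python
import threading

# Illustrative in-process counters, incremented from request handlers
METRICS = {'http_requests_total': 0, 'http_errors_total': 0}
_LOCK = threading.Lock()


def inc(name, amount=1):
    """Thread-safe counter increment."""
    with _LOCK:
        METRICS[name] += amount


def render_metrics():
    """Render counters in Prometheus text exposition format."""
    lines = []
    for name, value in sorted(METRICS.items()):
        lines.append(f'# TYPE {name} counter')
        lines.append(f'{name} {value}')
    return '\n'.join(lines) + '\n'


inc('http_requests_total')
print(render_metrics())
```

Serving `render_metrics()` from the `/metrics` path of the test service gives Prometheus a scrape target with zero extra dependencies.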
Design an experiment: hypothesis, method, and abort criteria
Before you press run, write down:
- Hypothesis: "Killing a worker process causes at most 2% increase in 95th percentile latency for 5 minutes because the system will restart the worker and queue tasks."
- Method: Run process_roulette for 3 kills, monitor metrics and logs, and capture traces.
- Abort criteria: Error rate > 5% for 2 consecutive minutes OR system CPU > 90% OR container restarts > 5 in 5 minutes.
Make abort criteria machine-readable: a simple Prometheus alert rule or a small script that watches metrics and kills the experiment-runner container if thresholds are exceeded.
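The error-rate abort criterion above translates almost directly into a Prometheus alerting rule. A hedged sketch, where the metric names are assumptions about your instrumentation:

```yaml
groups:
  - name: chaos-abort
    rules:
      - alert: ChaosAbortErrorRate
        # Abort criterion: error rate > 5% for 2 consecutive minutes
        expr: |
          sum(rate(http_errors_total[1m]))
            / sum(rate(http_requests_total[1m])) > 0.05
        for: 2m
        labels:
          action: abort-experiment
```

A small watcher script (or Alertmanager webhook) can then stop the experiment-runner container whenever an alert with `action: abort-experiment` fires.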
From process kills to graceful degradation
Observing how your system reacts to process failures is a perfect time to learn graceful degradation patterns. Teach (and practice) these resilient patterns:
- Bulkhead: isolate risky components so failures don't cascade.
- Retries with backoff: avoid synchronized retries that create thundering herds.
- Circuit breaker: fail fast to protect downstream systems.
- Fallbacks: return cached data or a reduced experience instead of an error page.
In a lab, implement a fallback for your test service or a queue-based worker that requeues work when a worker dies. Then run process_roulette again and observe how graceful degradation minimizes user-facing impact.
Kubernetes: safe process-level chaos at scale
On k8s, killing a container process inside a pod differs from deleting the pod. When PID 1 inside a container dies, the container exits and Kubernetes may restart it depending on restartPolicy. This is a natural place to test crash-restart semantics.
Use a dedicated namespace and tools like LitmusChaos or Chaos Mesh that provide CRDs for fault injection, combined with NetworkPolicy and ResourceQuota to bound blast radius. Recent chaos operators (2025 onward) integrate with GitOps so you can review experiments as PRs—great for classroom settings where instructors approve chaos experiments.
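A hedged sketch of bounding blast radius with a dedicated namespace and ResourceQuota; the names are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-lab
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: chaos-lab-quota
  namespace: chaos-lab
spec:
  hard:
    pods: "10"            # cap how many pods an experiment can create
    requests.cpu: "2"
    requests.memory: 2Gi
```

Scoping every chaos CRD and workload to this namespace means a runaway experiment can exhaust only its own quota, never the cluster.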
Advanced techniques (for instructors or advanced students)
- eBPF-based faults: inject syscall failures or latency without killing processes—more realistic for certain classes of errors.
- Ephemeral workloads + WASM: as WebAssembly workloads and edge runtimes grow, teach students about process-like failures in WASM instances.
- Service mesh fault injection: simulate partial failures at the network layer (delay, abort) using Istio or Linkerd test capabilities.
- Automated experiment governance: use policy engines to auto-abort experiments that touch production-like services.
Sample classroom exercise (45-60 minutes)
- Provision: each student spins up the Docker image or k8s namespace.
- Baseline: run a simple load generator for 2 minutes and record P50/P95 latency and error rate.
- Experiment: enable process_roulette in dry-run then real mode (instructor sets DRY_RUN=False) with MAX_KILLS=2.
- Observe: watch metrics and logs. If abort criteria triggered, stop the experiment and capture artifacts.
- Postmortem: students write a short explanation of what failed and a plan for graceful degradation to prevent user-visible errors.
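For the baseline step, a minimal load-generator sketch; `TARGET_URL` is an assumption about your lab setup, and the percentile helper is kept separate so you can sanity-check it without a running service:

```python
import time
import urllib.request

TARGET_URL = 'http://localhost:8080/'  # assumed lab endpoint


def percentile(samples, pct):
    """Nearest-rank percentile, e.g. percentile(latencies, 95) for P95."""
    ordered = sorted(samples)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]


def run_load(duration_s=120, delay_s=0.1):
    """Hit the target for duration_s, recording latency; exceptions count as errors."""
    latencies, errors = [], 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        start = time.time()
        try:
            urllib.request.urlopen(TARGET_URL, timeout=2).read()
            latencies.append(time.time() - start)
        except Exception:
            errors += 1
        time.sleep(delay_s)
    return latencies, errors


demo = [0.01, 0.02, 0.03, 0.04, 0.20]
print('P95 of demo:', percentile(demo, 95))
```

Students call `run_load()` against the live service, then record `percentile(latencies, 50)`, `percentile(latencies, 95)`, and the error count as their baseline.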
Safety checklist before any process-killing experiment
- Is this environment isolated from production and real data?
- Are backup snapshots available to restore the sandbox quickly?
- Have you defined and automated abort criteria?
- Is observability in place and readable by the team running the experiment?
- Have you limited the blast radius (namespaces, quotas, whitelist)?
- Is there a runbook or a team member on-call to stop the experiment manually?
Common pitfalls and how to avoid them
- Running against real accounts: never use production credentials—use ephemeral or test credentials.
- Missing observability: always ensure metrics and logs are active before injecting faults.
- Not limiting restarts: misconfigured restart policies can cause restart storms; use restartPolicy and livenessProbe wisely.
- Permanent data loss: never run destructive operations on stateful databases; use snapshots and read-only clones if you must test stateful failure modes.
Why teach chaos engineering with process roulette?
Process-level fault injection gives students a clear mental model: services are just processes that can die. Starting here makes it easy to teach cascading failures, observability practices, and recovery patterns. The randomness of "roulette" forces engineers to think beyond happy paths and design systems that tolerate unexpected process death.
Actionable takeaways
- Start small and sandboxed: use containers or VMs for all experiments.
- Always declare hypothesis and abort criteria before any fault injection.
- Use telemetry to make experiment outcomes observable and repeatable.
- Teach graceful degradation patterns alongside fault injection; they are the remediation, not an afterthought.
- Adopt policy and GitOps controls to review and approve chaos runs in multi-student or team environments.
Further reading and tools (2026)
To extend this lab, explore:
- Chaos Mesh and LitmusChaos for Kubernetes-native experiments
- Gremlin and open-source chaos tools for structured chaos experiments
- eBPF toolkits for syscall-level fault injection
- Service mesh fault injection for network-layer failures
Tip: In late 2025 and early 2026 the community increased adoption of SLO-driven chaos—tie your experiment abort logic directly to SLO violations to protect user experience.
Wrap-up and next steps
Process roulette is a friendly, low-cost way to teach fault injection fundamentals. With sandboxing, observability, and clear abort criteria you can create repeatable labs that teach students how real systems fail and how to design for graceful degradation. Move from single-process experiments to container and k8s-level chaos as comfort grows, and always couple experiments with postmortems and system hardening tasks.
Call to action
Ready to try it? Spin up the Docker lab, run the process_roulette in dry-run, and post your findings. If you're an instructor, adapt the 45-minute exercise for your class and share the anonymized results with peers. For a scaffolded assignment and reproducible infra-as-code, fork a starter repo, add Prometheus scrapes, and submit a GitOps PR to run a controlled chaos experiment. Start safe, iterate confidently, and teach resilient thinking—one process kill at a time.