Interview Prep: Common OS & Process Management Questions Inspired by Process Roulette


Unknown
2026-03-02
11 min read

Targeted interview questions and answers on signals, process states, crash recovery, and robustness—practical labs for systems and DevOps roles.

Hook: Why mastering process management wins interviews (and prevents outages)

Interview prep for systems programming and DevOps often exposes a worrying gap: candidates can script CI pipelines and configure Kubernetes, but they struggle with low-level process behavior—the exact problems that cause midnight pager calls. If you want to ace interviews and build systems that survive real-world chaos (including tools that randomly kill processes, a.k.a. "process roulette"), this curated set of practice questions, answers, and actionable tips will tighten your understanding of signals, process states, crash recovery, and overall system robustness.

Why this matters in 2026

By 2026, production systems are more distributed and dynamic than ever: more workloads run as containers, observability relies on eBPF and programmable telemetry, and Rust's footprint in kernels and system tooling has grown since its inclusion in Linux. Chaos engineering tools (Chaos Mesh, Gremlin, Litmus) and process-killing utilities have moved from novelty to core testing practices. Interviewers increasingly expect candidates to explain not just how to configure orchestrators, but how OS-level behavior impacts availability, container limits, and recovery semantics.

“There are a surprising number of programs designed to randomly kill processes on your computer until it crashes—or you wimp out.” — inspiration for these questions (PC Gamer, 2024).

How to use this guide

Start at the top where fundamentals live, then practice the scenario and system-design questions. Use the code snippets to run quick experiments on your laptop or sandbox cluster. Each question is interview-style: concise prompt, a clear answer, and a short explanation with an actionable practice step.

Signals and Handlers

Q1: What's the difference between SIGTERM and SIGKILL? When should you prefer one over the other?

Answer: SIGTERM (15) is a polite request that the process should terminate; the process can catch it, run cleanup, and exit. SIGKILL (9) forcibly terminates the process; it cannot be caught or ignored. Prefer SIGTERM for graceful shutdown and SIGKILL as a last resort when a process doesn't respond.

Explanation: SIGTERM allows resource cleanup (flush files, deregister from service discovery). Systems like systemd and Kubernetes send SIGTERM followed by SIGKILL after a grace period. In interviews, emphasize graceful shutdown patterns, not only the signals.

Practice: Implement a process that traps SIGTERM to flush state and then sleeps—trigger SIGTERM and SIGKILL to observe differences.

// POSIX C: simple SIGTERM handler (uses only async-signal-safe calls)
#include <signal.h>
#include <unistd.h>

void term(int sig){
  // write() is async-signal-safe; printf() is not
  const char msg[] = "got SIGTERM, cleaning up...\n";
  write(STDOUT_FILENO, msg, sizeof(msg) - 1);
  sleep(1); // simulate cleanup
  _exit(0);
}

int main(void){
  struct sigaction sa = {0};
  sa.sa_handler = term;
  sigaction(SIGTERM, &sa, NULL); // prefer sigaction over signal()
  while(1) pause();
}

Q2: How do you reliably run cleanup code on fatal signals like SIGSEGV?

Answer: For fatal asynchronous signals (SIGSEGV, SIGBUS), use a signal handler registered with sigaction and SA_SIGINFO; but note handler limitations: only async-signal-safe functions may be called. To capture state reliably, write a minimal async-safe reporter (e.g., write to a pre-opened file descriptor) or trigger an external watchdog that collects state.

Explanation: You can't call malloc or printf safely in a SIGSEGV handler. Common patterns: use sigaltstack to ensure a safe stack for handlers, write compact binary dumps, or set a process death signal (prctl) for children so an external supervisor records crash info.

Practice: Create a minimal handler that records a stack pointer or writes a single error byte to a pipe your test harness reads.

Process States and Lifecycle

Q3: What is a zombie process? How do you create and then fix one?

Answer: A zombie is a process that has terminated but whose parent hasn't reaped it via wait/waitpid; it's visible as a 'Z' state in ps. Create one by having a child exit while the parent sleeps without calling wait. Fix by having the parent call wait/waitpid, or by re-parenting the child to init/systemd when the parent exits (init will reap it).

Explanation: Zombies don't consume CPU or memory but do hold a PID until reaped. Accumulation indicates parent logic bugs. In container environments, PID namespaces affect how init processes manage reaping.

Practice: Write a small C program that forks and purposely avoids wait to observe zombies; then modify to reap children with SIGCHLD handler.

Q4: Explain the difference between a process being 'stopped' and 'sleeping'.

Answer: 'Stopped' (SIGSTOP or job control) means execution is suspended until continued; it isn't using CPU. 'Sleeping' (TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE in the kernel) means waiting for I/O or event; it will be scheduled to run when the condition is satisfied. Sleep states matter for latency and resource accounting.

Explanation: Tools like ps and top show state codes; interpreting them helps diagnose why a process isn't progressing (e.g., uninterruptible sleep could indicate stuck kernel I/O).

Crash Recovery & System Robustness

Q5: What is an OOM killer and how should services prepare for it?

Answer: The OOM (Out-of-Memory) killer is the kernel mechanism that selects processes to kill when the system is out of memory. Services should set memory limits (cgroups), monitor usage, implement graceful degradation, and use memory-aware autoscaling. Configure oom_score_adj to influence selection; ensure critical services aren't killed unexpectedly.

Explanation: In containerized deployments, Kubernetes enforces memory limits which trigger container termination—different from host-level OOM. In interviews, discuss trade-offs: reserving memory vs. overcommit and how to instrument memory usage.

Practice: Deploy a memory-hungry container with limits and observe the behavior; tune oom_score_adj and test how your supervisor handles restarts.

Q6: How would you design a supervisor pattern that avoids thundering herds on restart?

Answer: Use exponential backoff with jitter for restarts, cap the maximum restart attempts per interval, and coordinate via leader election if multiple instances must not restart simultaneously. Supervisors like systemd implement Restart=on-failure with StartLimitInterval and StartLimitBurst to prevent flapping.

Explanation: The thundering herd occurs when many processes restart and simultaneously load resources. Jitter spreads retries and reduces load spikes. For high-availability, incorporate rolling restarts and health-check-driven promotion of instances.

Practice: Implement an exponential backoff loop in a supervisor script and test against a small fleet of processes killed by a chaos tool.

#!/bin/bash
# simple backoff supervisor (bash) with jitter and backoff reset
backoff=1
while true; do
  ./myservice &
  pid=$!
  start=$SECONDS
  wait "$pid"
  # reset backoff if the service ran long enough to be considered healthy
  if (( SECONDS - start > 60 )); then backoff=1; fi
  # sleep for backoff plus up to 1s of jitter to spread restarts across a fleet
  sleep $(( backoff + RANDOM % 2 ))
  backoff=$(( backoff * 2 ))
  if (( backoff > 60 )); then backoff=60; fi
done

Q7: What is a core dump, and how do you configure systems to collect and analyze them safely?

Answer: A core dump is a snapshot of a process's memory at crash time. Configure ulimit -c to allow dumps, or use systemd-coredump and kernel.core_pattern to forward dumps to an external collector. For security, ensure core dumps don't leak secrets: restrict generation by default, sanitize addresses if possible, and use secure channels to transfer dumps for analysis.

Explanation: Many cloud environments disable core files. In 2026, organizations often route dumps into observability pipelines and use symbolicators (e.g., breakpad, llvm-symbolizer) to analyze crash stacks. Rust and C++ services should produce debug symbols for meaningful traces.

Containers, Namespaces, and Orchestration

Q8: How do PID namespaces change process management semantics?

Answer: PID namespaces provide an isolated PID numbering view; PID 1 in a container is not the host's PID 1. The container's init must reap zombies inside the namespace. Parent-child relationships still cross namespace boundaries on the host side, but signaling a namespace's PID 1 needs care: the kernel ignores most signals sent to it unless that process has installed a handler (SIGKILL and SIGSTOP from an ancestor namespace are the exceptions).

Explanation: A common interview pitfall: assuming the host's init will reap container children. If you run a single process as PID 1 inside a container, implement reaping (or use tini) to avoid zombie buildup.

Practice: Start a busybox container without an init and fork children that exit; inspect with ps to see zombies. Then add tini and observe changes.

Q9: What happens to signals when you send them to a process in a cgroup or to a container via docker/kubectl?

Answer: Signals are delivered to PIDs; when you run docker stop/docker kill or delete a pod with kubectl, the runtime translates the command into signals sent to processes in the container's PID namespace. For process groups and sessions, processes started together inherit pgrp/session IDs; sending a signal to a negative PID targets the whole process group.

Explanation: In interview answers, mention how orchestration systems implement graceful termination: Kubernetes sends SIGTERM to the container process (PID 1 inside the container) and waits for terminationGracePeriod before SIGKILL.

Advanced Kernel & Observability Topics

Q10: Explain how eBPF changed process-level observability and give an example use-case relevant to process crashes.

Answer: eBPF provides programmable hooks in the kernel to capture events with low overhead. Use eBPF to trace syscalls, signal deliveries, or file access patterns that preceded a crash. For example, trace signal events and syscalls leading up to a SIGSEGV to find a correlation between a specific syscall and crash frequency.

Explanation: By 2026, eBPF is standard in observability stacks (e.g., Cilium, Pixie). In interviews, describe how you'd deploy a lightweight eBPF probe to gather pre-crash telemetry without modifying the application.

Q11: What role does Rust in the kernel and safer languages play in reducing process crashes?

Answer: Rust reduces classes of memory-safety bugs (use-after-free, double-free) that cause crashes like SIGSEGV. Kernel subsystems rewritten or implemented in Rust can improve overall system stability. However, logic bugs still exist; defensive tooling and observability remain critical.

Explanation: In 2025-2026, more drivers and user-space critical tooling use Rust. In interviews, balance optimism about memory safety with pragmatic deployment concerns and the need for interoperability with existing C systems.

Scenario & System Design Questions

Q12: You're on-call and an important service is flapping with frequent restarts after SIGSEGV. How do you debug and mitigate quickly?

Answer: Triage steps: (1) gather crash artifacts—core dumps, logs, recent deployments; (2) isolate version that triggers crash and roll back to a safe revision; (3) enable increased logging / capture stack traces with addr2line (ensure symbols); (4) deploy a temporary circuit breaker (traffic split) and increase replicas to reduce impact; (5) add a kill switch in runtime to avoid repeated restarts and gather more telemetry via eBPF. Use canary rollouts to prevent wide impact.

Explanation: The key is containment + forensic collection. Provide both operational actions and a path to root cause. Interviewers want to see you balance speed with correctness.

Q13: For a distributed job that must not be duplicated, how do you handle process crashes and restarts?

Answer: Implement idempotent workers with leasing (e.g., distributed lock with TTL in etcd/consul), use transactional checkpoints, and make re-delivery safe. On crash, another worker can claim the lease. Use persistent job queues with visibility timeouts (like SQS) or leader election so no two workers work on the same job.

Explanation: Avoiding duplication often requires coordination and careful handling of partial failures. For interviews, sketch a concrete example using etcd leases or a database row lock with transactional state update.

Practice Questions (Quick Drills)

  1. What does SIGCHLD indicate and how should a parent process handle it? (Answer: Child changed state; call wait/waitpid to reap or use SA_NOCLDWAIT.)
  2. Explain prctl(PR_SET_PDEATHSIG): why is it useful? (Answer: sets a signal delivered to a child when its parent dies—useful to avoid orphaned helper processes.)
  3. How do you prevent file descriptor leaks across exec? (Answer: set FD_CLOEXEC or open with O_CLOEXEC.)
  4. Where are PIDs reused and why can that lead to bugs? (Answer: Kernel reuses PIDs; long-lived data keyed by PID without validation can attach to wrong process.)

Actionable Takeaways

  • Practice signals and reaping locally: write small C programs to send, trap, and handle signals; observe process states with ps and /proc.
  • Automate crash collection: enable systemd-coredump or configure a secure core pipeline and ensure symbol files are archived for postmortems.
  • Test with chaos: add deliberate process kills to CI with chaos tools and assert system-level SLAs. Use process-killing experiments to validate health checks and graceful shutdown paths—Process Roulette-style drills are great to reveal gaps.
  • Use modern observability: deploy eBPF probes for low-overhead tracing of signals and syscalls when diagnosing intermittent crashes.
  • Design supervisors carefully: include exponential backoff with jitter and limits to avoid flapping and resource storms.

Interviewers in 2026 expect a mix of systems intuition and practical experience: can you explain kernel behavior, design robust restart strategies, and use modern tooling like eBPF and chaos engineering to validate assumptions? They also value clear incident handling and postmortem thinking—how you detect, mitigate, and prevent recurrence.

Next Steps — practice plan (30 days)

  1. Week 1: Build a signal-handling lab—SIGTERM/SIGKILL/SIGSEGV and SIGCHLD reaping experiments.
  2. Week 2: Create a supervisor with backoff & jitter; integrate with systemd unit files and test restart limits.
  3. Week 3: Add chaos tests (kill processes randomly) and observe system behavior; fix issues revealed by tests.
  4. Week 4: Instrument with eBPF and produce a short postmortem of one induced crash; prepare answers for interviews using your lab notes.

Call to action

If you're preparing for systems programming or DevOps interviews, clone this checklist as a focused practice repo, run the hands-on labs above, and rehearse the scenario answers with a partner. Ready for a mock interview? Sign up for our next systems-focused mock session and get feedback on signal handling, crash triage, and supervisor design—build confidence and reduce pager anxiety.


Related Topics

#careers #interview #systems