System Design: Implementing the Circuit Breaker Pattern

May 2, 2026

If you've ever had a single failing microservice take down your entire application, you already understand why the circuit breaker pattern exists. It's one of those patterns that feels obvious in hindsight but can save you from catastrophic cascading failures in production.

System design interviews love this pattern too — it signals that you understand distributed systems don't just fail, they fail in *interesting* ways that require deliberate defensive design.

Why Cascading Failures Are Brutal

Imagine Service A calls Service B, which calls Service C. Service C starts timing out. Now Service B is holding open threads waiting for responses that never come. Those threads pile up. Service B runs out of resources and starts timing out too. Now Service A is in the same boat. Your entire request chain is down because one downstream dependency got slow.

This is a cascading failure, and it's one of the most common ways distributed systems die. The naive fix — just retry — often makes things worse. You're hammering an already-struggling service with even more traffic.

The circuit breaker pattern borrows from electrical engineering. When a circuit is overloaded, a physical breaker *opens* the circuit to prevent damage. Same idea here: when a downstream service is failing, stop sending it requests until it recovers.

How the Circuit Breaker Works

A circuit breaker wraps calls to an external service and tracks the outcome of those calls. It operates in three states:

Closed — Everything is working normally. Requests flow through. The breaker monitors the failure rate.

Open — Too many failures have occurred. The breaker immediately rejects requests without even attempting to call the downstream service. This gives the failing service breathing room to recover.

Half-Open — After a timeout period, the breaker allows a limited number of test requests through. If they succeed, it transitions back to Closed. If they fail, it goes back to Open.

Here's a simple state machine to visualize it:

[Closed] ---(failure threshold exceeded)---> [Open]
[Open]   ---(timeout expires)--------------> [Half-Open]
[Half-Open] ---(test request succeeds)-----> [Closed]
[Half-Open] ---(test request fails)--------> [Open]
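
If it helps to see the diagram as data, here's a minimal sketch of the same four transitions as a lookup table. The `State` enum, the event names, and `next_state` are illustrative choices for this example, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are made up for this sketch.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "test_request_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "test_request_failed"): State.OPEN,
}


def next_state(state, event):
    """Return the next state, or stay in the current one if the event doesn't apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair that isn't in the table leaves the state unchanged, which matches the diagram: a timeout expiring means nothing to a breaker that is already closed.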

Implementing a Basic Circuit Breaker in Python

Let's build a simple one from scratch so you understand what's happening under the hood:

import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, success_threshold=2):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.success_threshold = success_threshold  # half-open successes needed to close
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                # Recovery timeout has passed: let test requests through.
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise CircuitOpenError("Circuit is OPEN: request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise  # re-raise with the original traceback
        self._on_success()
        return result

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self._reset()
        else:
            # A success while closed clears the consecutive-failure count.
            self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # Any failure while half-open reopens the circuit immediately.
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

    def _should_attempt_reset(self):
        # monotonic() is unaffected by system clock adjustments.
        return (time.monotonic() - self.last_failure_time) >= self.recovery_timeout

    def _reset(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0

And here's how you'd use it:

import requests

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10)


def fetch_user(user_id):
    response = requests.get(f"https://user-service/users/{user_id}", timeout=2)
    response.raise_for_status()
    return response.json()


# Wrap the call with the circuit breaker
try:
    user = breaker.call(fetch_user, 42)
    print(user)
except Exception as e:
    print(f"Request failed: {e}")
    # Fall back to cache, default response, etc.

This is deliberately simple. Production implementations handle thread safety, sliding window failure rates, and metrics — but this captures the core logic.

Using Battle-Tested Libraries

In real projects, you don't want to roll your own circuit breaker. There are solid libraries for this:

Python: [pybreaker](https://github.com/danielfm/pybreaker)

import pybreaker

breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)


@breaker
def call_payment_service(order_id):
    # your HTTP call here
    pass

Java/Spring: Resilience4j is the go-to:

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.charge(order));

Node.js: [opossum](https://nodeshift.dev/opossum/) is popular and well-maintained.

Key Design Decisions You Need to Think Through

What counts as a failure? Timeouts, 5xx errors, and network exceptions should trip the breaker. But a 404 is probably not a service failure — it's a valid response. Be deliberate about what you're measuring.
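
Here's a rough sketch of what that classification might look like in code. `HttpError` is a hypothetical exception type standing in for whatever your HTTP client raises, and treating every other exception as "not the service's fault" is an assumption you should revisit for your own system.

```python
class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def counts_as_failure(exc):
    """Decide whether an exception should count against the breaker."""
    # Timeouts and connection problems point at an unhealthy dependency.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    # 5xx means the service itself failed; 4xx (such as 404) is a valid answer.
    if isinstance(exc, HttpError):
        return 500 <= exc.status <= 599
    # Assumption: anything else is a caller-side problem, not a service failure.
    return False
```

A breaker would call something like this before bumping its failure count, so a burst of 404s for missing users never opens the circuit.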

Failure threshold: count vs. rate? A count-based threshold (e.g., 5 failures) is simple but fragile under variable traffic: five failures out of ten requests is an outage, while five out of ten thousand is noise. A rate-based threshold (e.g., 50% failure rate over the last 10 requests) is more robust. Resilience4j uses a sliding window for exactly this reason.
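
To make the rate-based idea concrete, here's a minimal sketch of a count-based sliding window. The class and method names are made up for this example, and this is not how Resilience4j implements it internally; it only shows the shape of the calculation.

```python
from collections import deque


class SlidingWindowFailureRate:
    """Tracks the failure rate over the last `window_size` calls."""

    def __init__(self, window_size=10, failure_rate_threshold=0.5):
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        # True = failure, False = success; the oldest outcome drops off automatically.
        self.outcomes = deque(maxlen=window_size)

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_open(self):
        # Wait for a full window so a single early failure isn't read as 100%.
        if len(self.outcomes) < self.window_size:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

Waiting for a full window before tripping is a design choice, not a requirement. Without it, the very first failed call after startup would look like a 100% failure rate.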

What happens when the breaker is open? This is the part people forget to design. You need a fallback strategy:

Return a cached response

Return a sensible default

Queue the request for later processing

Show a degraded UI

Failing fast is only useful if you have *somewhere* to fail to.
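
As a sketch of the first two options, here's one way to serve a cached value and then a default when the breaker rejects a call. `CircuitOpenError` and `get_with_fallback` are hypothetical names for this example, and the plain dict stands in for whatever cache you actually use.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised by a breaker that is rejecting calls."""


def get_with_fallback(fetch, key, cache, default=None):
    """Try the live call; on an open circuit serve cached data, then a default."""
    try:
        value = fetch(key)
    except CircuitOpenError:
        # Breaker is open: serve the last known good value if we have one.
        return cache.get(key, default)
    cache[key] = value  # remember the fresh value for future fallbacks
    return value
```

Note that only the "circuit is open" case falls back here. Other exceptions still propagate, so real bugs don't get silently papered over with stale data.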

Scope your breakers correctly. One circuit breaker per downstream service, not one global breaker. You don't want a failing payment service to trip the breaker for your user profile service.
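
One simple way to get that scoping is a small registry keyed by service name. This is a sketch under assumptions: `Breaker` is a stand-in that only counts failures, and `BreakerRegistry` is a made-up name. In practice you'd pass in a factory for your real breaker class.

```python
class Breaker:
    """Minimal stand-in for a real circuit breaker; it only counts failures."""

    def __init__(self, name):
        self.name = name
        self.failure_count = 0


class BreakerRegistry:
    """Hands out one breaker per downstream service name."""

    def __init__(self, factory=Breaker):
        self._factory = factory
        self._breakers = {}

    def get(self, service_name):
        # Create the breaker on first use, then always return the same one.
        if service_name not in self._breakers:
            self._breakers[service_name] = self._factory(service_name)
        return self._breakers[service_name]
```

Failures recorded against `registry.get("payment-service")` never touch the breaker returned by `registry.get("user-profile-service")`, which is the whole point.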

Circuit Breakers in System Design Interviews

When this comes up in interviews, the interviewer wants to see that you understand *why* you'd use it, not just that you know the name.

Good things to mention:

It prevents cascading failures by failing fast instead of blocking threads

It gives struggling services time to recover without being overwhelmed

It forces you to think about fallback behavior, which improves overall resilience

It pairs well with retries (but retries should happen *inside* the breaker, not outside it)

A common follow-up question: *"How does a circuit breaker differ from a retry?"* Retries assume transient failures and keep trying. Circuit breakers assume systemic failures and *stop* trying. They're complementary — use retries for brief hiccups, circuit breakers for sustained outages.
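
For the retry half of that answer, here's a minimal sketch assuming exponential backoff. The `retry` helper, its defaults, and the injectable `sleep` parameter are my own choices for illustration, not taken from a library.

```python
import time


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Back off 0.1s, 0.2s, 0.4s, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

If you nest retries inside the breaker as suggested above, you'd hand this helper to the breaker, something like `breaker.call(retry, fetch)`, so an exhausted retry sequence counts as a single failure rather than three.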

Actionable Next Steps

Implement the basic version above in whatever language you're comfortable with. Building it yourself cements the state machine in your head.

Add it to a side project that makes external API calls. Even a simple integration with pybreaker or opossum will teach you a lot about the configuration trade-offs.

Explore Resilience4j's documentation if you're in the Java ecosystem. It's one of the more feature-complete implementations, and the docs explain the design decisions clearly.

Practice explaining it out loud. Draw the state diagram, walk through a failure scenario, describe your fallback strategy. That's exactly what interviewers want to hear.

The circuit breaker pattern is one of those things that feels like overhead until the day it saves your production system at 2am. Build the habit of designing for failure from the start.