System Design: Implementing the Circuit Breaker Pattern
If you've ever had a single failing microservice take down your entire application, you already understand why the circuit breaker pattern exists. It's one of those patterns that feels obvious in hindsight but can save you from catastrophic cascading failures in production.
System design interviews love this pattern too — it signals that you understand distributed systems don't just fail, they fail in *interesting* ways that require deliberate defensive design.
Why Cascading Failures Are Brutal
Imagine Service A calls Service B, which calls Service C. Service C starts timing out. Now Service B is holding open threads waiting for responses that never come. Those threads pile up. Service B runs out of resources and starts timing out too. Now Service A is in the same boat. Your entire request chain is down because one downstream dependency got slow.
This is a cascading failure, and it's one of the most common ways distributed systems die. The naive fix — just retry — often makes things worse. You're hammering an already-struggling service with even more traffic.
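To put a rough number on that amplification, here is a minimal sketch. It assumes every caller in the chain retries independently with the same attempt budget; the function name `worst_case_calls` and the figures are illustrative, not measurements from a real system.

```python
def worst_case_calls(hops: int, attempts_per_hop: int) -> int:
    """Upper bound on calls hitting the deepest service for one user request.

    Each hop makes up to `attempts_per_hop` attempts (the first call plus
    retries), and every attempt triggers a full round of attempts one hop down.
    """
    return attempts_per_hop ** hops


# A -> B -> C is two hops. With 3 attempts per hop, one user request
# can turn into as many as 9 calls against the struggling Service C.
print(worst_case_calls(hops=2, attempts_per_hop=3))  # 9
```

The growth is exponential in the depth of the call chain, which is why layered retries can turn a slowdown into an outage.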
The circuit breaker pattern borrows from electrical engineering. When a circuit is overloaded, a physical breaker *opens* the circuit to prevent damage. Same idea here: when a downstream service is failing, stop sending it requests until it recovers.
How the Circuit Breaker Works
A circuit breaker wraps calls to an external service and tracks the outcome of those calls. It operates in three states:
Closed — Everything is working normally. Requests flow through. The breaker monitors the failure rate.
Open — Too many failures have occurred. The breaker immediately rejects requests without even attempting to call the downstream service. This gives the failing service breathing room to recover.
Half-Open — After a timeout period, the breaker allows a limited number of test requests through. If they succeed, it transitions back to Closed. If they fail, it goes back to Open.
Here's a simple state machine to visualize it:

    [Closed]    ---(failure threshold exceeded)---> [Open]
    [Open]      ---(timeout expires)--------------> [Half-Open]
    [Half-Open] ---(test request succeeds)--------> [Closed]
    [Half-Open] ---(test request fails)-----------> [Open]

Implementing a Basic Circuit Breaker in Python
Let's build a simple one from scratch so you understand what's happening under the hood:

    import time
    from enum import Enum


    class CircuitState(Enum):
        CLOSED = "closed"
        OPEN = "open"
        HALF_OPEN = "half_open"


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30, success_threshold=2):
            self.failure_threshold = failure_threshold  # failures before the breaker opens
            self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
            self.success_threshold = success_threshold  # trial successes needed to close again
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            self.success_count = 0
            self.last_failure_time = None

        def call(self, func, *args, **kwargs):
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    # Start a fresh trial period; discard successes from earlier trials.
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise Exception("Circuit is OPEN: request rejected")
            try:
                result = func(*args, **kwargs)
            except Exception:
                self._on_failure()
                raise  # re-raise with the original traceback
            self._on_success()
            return result

        def _on_success(self):
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._reset()
            else:
                self.failure_count = 0

        def _on_failure(self):
            self.failure_count += 1
            # Monotonic clock: unaffected by system clock adjustments.
            self.last_failure_time = time.monotonic()
            # Any failure during a trial reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

        def _should_attempt_reset(self):
            return (time.monotonic() - self.last_failure_time) >= self.recovery_timeout

        def _reset(self):
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            self.success_count = 0

And here's how you'd use it:

    import requests

    breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10)

    def fetch_user(user_id):
        response = requests.get(f"https://user-service/users/{user_id}", timeout=2)
        response.raise_for_status()
        return response.json()

    # Wrap the call with the circuit breaker
    try:
        user = breaker.call(fetch_user, 42)
        print(user)
    except Exception as e:
        print(f"Request failed: {e}")
        # Fall back to cache, default response, etc.

This is deliberately simple. Production implementations handle thread safety, sliding window failure rates, and metrics, but this captures the core logic.
Using Battle-Tested Libraries
In real projects, you don't want to roll your own circuit breaker. There are solid libraries for this:
Python: [pybreaker](https://github.com/danielfm/pybreaker)

    import pybreaker

    breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

    @breaker
    def call_payment_service(order_id):
        # your HTTP call here
        pass

Java/Spring: Resilience4j is the go-to:

    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .slidingWindowSize(10)
        .build();

    CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

    Supplier<String> decoratedSupplier = CircuitBreaker
        .decorateSupplier(circuitBreaker, () -> paymentService.charge(order));

Node.js: [opossum](https://nodeshift.dev/opossum/) is popular and well-maintained.
Key Design Decisions You Need to Think Through
What counts as a failure? Timeouts, 5xx errors, and network exceptions should trip the breaker. But a 404 is probably not a service failure — it's a valid response. Be deliberate about what you're measuring.
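One way to make that decision explicit is a small predicate that the breaker consults before counting an error. This is a sketch under assumptions: `HttpError` is a hypothetical exception type standing in for whatever your HTTP client raises, and treating every other exception as "not a service failure" is a choice you may want to reverse.

```python
class HttpError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def counts_as_failure(exc: Exception) -> bool:
    """Return True if this error should count toward tripping the breaker."""
    if isinstance(exc, HttpError):
        # 5xx means the service itself is unhealthy; 4xx (such as 404) is a
        # valid answer to a bad request and says nothing about its health.
        return exc.status_code >= 500
    # Timeouts and network-level errors also signal an unhealthy dependency.
    return isinstance(exc, (TimeoutError, ConnectionError))


print(counts_as_failure(HttpError(503)))  # True
print(counts_as_failure(HttpError(404)))  # False
```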
Failure threshold: count vs. rate? A count-based threshold (e.g., 5 failures) is simple but fragile under variable traffic. A rate-based threshold (e.g., 50% failure rate over the last 10 requests) is more robust. Resilience4j uses a sliding window for exactly this reason.
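A rate-based threshold can be sketched with a fixed-size window of recent outcomes. This is an illustration of the idea, not how Resilience4j implements it; the class name `SlidingWindowFailureRate` and the rule of waiting for a full window before tripping are assumptions.

```python
from collections import deque


class SlidingWindowFailureRate:
    """Tracks the failure rate over the last `window_size` calls."""

    def __init__(self, window_size: int = 10, rate_threshold: float = 0.5):
        self.window_size = window_size
        self.rate_threshold = rate_threshold
        # True marks a failed call; the deque drops the oldest entry when full.
        self.outcomes = deque(maxlen=window_size)

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self) -> bool:
        # Wait for a full window so one early failure does not read as 100%.
        window_full = len(self.outcomes) == self.window_size
        return window_full and self.failure_rate() >= self.rate_threshold


window = SlidingWindowFailureRate(window_size=4, rate_threshold=0.5)
for failed in (True, True, False, False):
    window.record(failed)
print(window.failure_rate(), window.should_trip())  # 0.5 True
```

Because old outcomes fall out of the window, a burst of failures stops counting against the service once enough healthy calls follow it.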
What happens when the breaker is open? This is the part people forget to design. You need a fallback strategy, such as serving cached data or returning a default response.
Failing fast is only useful if you have *somewhere* to fail to.
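As a rough sketch of a fallback path, the helper below serves the last known good value when the breaker rejects a call. `CircuitOpenError` and `call_with_fallback` are hypothetical names, and the plain dict stands in for a real cache.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


def call_with_fallback(primary, cache: dict, key, default=None):
    """Call `primary(key)`; if the breaker is open, serve cached or default data."""
    try:
        value = primary(key)
    except CircuitOpenError:
        # Fail fast, but fail *to* something: the last good value or a default.
        return cache.get(key, default)
    cache[key] = value  # remember the latest good response for later fallbacks
    return value


def rejected(user_id):
    raise CircuitOpenError("circuit is open")


cache = {}
call_with_fallback(lambda user_id: {"id": user_id}, cache, 42)
print(call_with_fallback(rejected, cache, 42))  # {'id': 42}
```

Only the breaker's rejection triggers the fallback here; other errors still propagate, so the caller can tell "dependency is down" apart from "request was invalid".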
Scope your breakers correctly. One circuit breaker per downstream service, not one global breaker. You don't want a failing payment service to trip the breaker for your user profile service.
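A small registry keyed by service name is one way to keep breakers scoped. In this sketch, `BreakerRegistry` and `StubBreaker` are hypothetical; in practice the factory would build whichever breaker implementation you use.

```python
class BreakerRegistry:
    """Hands out one breaker per downstream service name."""

    def __init__(self, breaker_factory):
        self._breaker_factory = breaker_factory
        self._breakers = {}

    def for_service(self, service_name: str):
        # Create the breaker lazily the first time a service is seen.
        if service_name not in self._breakers:
            self._breakers[service_name] = self._breaker_factory()
        return self._breakers[service_name]


class StubBreaker:
    """Stand-in for a real breaker; only tracks a state string."""

    def __init__(self):
        self.state = "closed"


registry = BreakerRegistry(StubBreaker)
registry.for_service("payment-service").state = "open"
print(registry.for_service("user-profile-service").state)  # closed
```

An open breaker for the payment service leaves the profile service's breaker untouched, since each name maps to its own instance.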
Circuit Breakers in System Design Interviews
When this comes up in interviews, the interviewer wants to see that you understand *why* you'd use it, not just that you know the name.
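Good things to mention are the points covered above: the cascading failures the pattern prevents, the three states and their transitions, what counts as a failure, the fallback behavior when the breaker is open, and scoping one breaker per downstream service.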
A common follow-up question: *"How does a circuit breaker differ from a retry?"* Retries assume transient failures and keep trying. Circuit breakers assume systemic failures and *stop* trying. They're complementary — use retries for brief hiccups, circuit breakers for sustained outages.
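The two can be layered: retry inside the breaker, and stop retrying the moment the breaker opens. The sketch below is deliberately stripped down. `TinyBreaker` is a hypothetical count-based breaker with no recovery timeout, and the retry loop omits the backoff delay a real one would need.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class TinyBreaker:
    """Minimal count-based breaker (no half-open state) for illustration."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.failure_threshold:
            raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result


def call_with_retries(breaker, func, attempts: int = 3):
    """Retry transient errors, but give up at once if the breaker is open."""
    last_error = None
    for _ in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # sustained outage: retrying would only add load
        except Exception as exc:
            last_error = exc  # possibly transient: try again
    raise last_error


calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("brief hiccup")
    return "ok"

print(call_with_retries(TinyBreaker(), flaky))  # ok
```

A single hiccup is absorbed by the retry, while repeated failures trip the breaker and cut the retry loop short.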
Actionable Next Steps
The circuit breaker pattern is one of those things that feels like overhead until the day it saves your production system at 2am. Build the habit of designing for failure from the start.