Resilience patterns: treat the network as hostile
Every distributed system eventually discovers the same lesson: the network is not a reliable function call. Packets drop. Connections stall. Downstream services slow down under load. Processes get killed mid-request. A client’s idea of whether an operation succeeded is not always the same as the server’s. A system built on the assumption that this rarely happens will be brittle in every specific way the assumption turns out to be wrong.
The resilience patterns are the vocabulary of responses — idempotency, retries, timeouts, circuit breakers, bulkheads, fallbacks, hedging, and a few others — that together let a system degrade instead of collapsing. None of them is interesting in isolation. They compose. A retry without a timeout is a hang. A timeout without idempotency is double-billing. A circuit breaker without a fallback is a faster 500. The design job is knowing which combinations fit which failure modes.
Michael Nygard’s Release It! is still the canonical reference, and most of what follows traces back to it. The framing, though, is worth stating up front: you are not trying to make failures stop happening. You are trying to make sure that when failures happen — and they will — the blast radius is bounded, the user experience degrades gracefully, and the system recovers without human intervention.
The fallacy: “if it fails, retry”
The instinct, in the face of a failed call, is to try again. This instinct is mostly right and occasionally catastrophic, and the difference between the two depends on whether the server actually failed to do the work or only failed to tell the client it succeeded.
Consider a payment service. The client sends "charge $100". The server charges the card, and the response is lost — the connection drops on the way back. The client sees a timeout, retries, and the server charges the card a second time. The user is charged $200 for a $100 order. The network was healthy; the payment service was healthy; the retry was the bug.
This is the problem idempotency solves, and the reason it is the foundation under every other resilience pattern. Until a request is idempotent, retrying it is unsafe in the general case. Everything else in this post assumes the idempotency layer is in place.
Idempotency
An operation is idempotent if performing it twice has the same effect as performing it once. "Set x = 5" is idempotent. "Increment x" is not. DELETE /users/42 is idempotent (deleting an already-deleted user is still "user 42 does not exist"). POST /charges is not, by default — each charge is a new one.
HTTP gives you idempotency for free on GET, PUT, and DELETE if you implement them correctly. POST is the one you have to design for, and the standard tool is the idempotency key. The client generates a unique key for each logical operation, sends it as a header (Idempotency-Key: a7f3...), and retries with the same key if it needs to retry. The server stores, for each key, the result of the first request. Subsequent requests with the same key return the stored result without redoing the work.
Two subtleties that matter:
- The key is the client’s responsibility, not the server’s. The server must treat the key as an input, not generate it. A server that generates the key (e.g., from request contents via a hash) can produce false matches on semantically different requests that happen to hash the same.
- The key has a lifetime. Most servers retain keys for 24 hours or so. This is enough for retries and not enough to treat the key as a permanent deduplication mechanism. If you need permanent dedup, you need a business-level natural key (an order id, a reservation id), not an idempotency key.
At the database level, the inbox pattern gives you idempotent event consumers by storing the event id in a dedupe table before processing. At the workflow level, orchestrators like Temporal enforce idempotency by remembering which steps have run. Whatever the mechanism, idempotency is the prerequisite for every retry. If your writes are not idempotent, you do not have a retry story; you have a double-write story.
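A minimal server-side sketch of the idempotency-key mechanism, in Python. Everything here is illustrative: the in-memory dict-behind-a-lock stands in for a database table with a unique constraint on the key and a TTL, and `charge_card` stands in for the real side effect.

```python
import threading
import uuid

class IdempotencyStore:
    """In-memory idempotency-key store. A sketch: a production version
    would use a database with a unique constraint on the key and a TTL
    (commonly ~24 hours), not a dict behind a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}  # key -> stored result of the first request

    def execute(self, key, operation):
        # Serializing under one lock is the toy version of "exactly one
        # request per key does the work"; retries get the stored result.
        with self._lock:
            if key not in self._results:
                self._results[key] = operation()
            return self._results[key]

charges = []

def charge_card():
    charges.append(100)  # the side effect we must not repeat
    return {"status": "charged", "amount": 100}

store = IdempotencyStore()
key = str(uuid.uuid4())                    # client-generated, reused on retry
first = store.execute(key, charge_card)
retry = store.execute(key, charge_card)    # lost response, client retries
```

The client owns the key: it is generated once per logical operation and sent verbatim on every retry, which is what makes the second `execute` a lookup instead of a second charge.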
Timeouts
Timeouts are the pattern engineers know about and underuse in practice. Every network call should have a timeout. Not an optional one. Not a “we’ll add it if we see problems” one. Every one.
The reason is mechanical. A call without a timeout is a call that can block forever. In a thread-per-request server, a blocked call holds a thread. A hundred blocked calls hold a hundred threads. A thread pool with no idle threads rejects new requests, and the service is effectively down — not because it is doing anything wrong, but because it is waiting for something that will never answer.
The cascading version is worse. Service A calls service B, which calls service C. C stalls. B’s threads fill up waiting on C. A’s threads fill up waiting on B. The front-end fills up waiting on A. A problem in one leaf service has taken down the entire path to it, and the blast radius is the full width of the dependency tree.
The rule: every network call has a timeout, and the timeout is shorter than the timeout of the call one layer up. If the front-end’s timeout to A is 2 seconds, A’s timeout to B is 1 second, B’s timeout to C is 500ms. When something stalls, it stalls at the leaf and every layer above has time to do something sensible with the failure instead of hanging alongside it.
What makes timeouts hard is picking them. Too short and you reject legitimate slow responses (especially during a load spike). Too long and you are back to the cascade. The working answer: pick a timeout that is meaningfully longer than the p99 latency under normal load and meaningfully shorter than “the user gives up.” Adjust when the latency distribution changes. And measure — a service with unknown latency percentiles does not have timeouts; it has guesses.
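One way to implement the shrinking-budget rule is to propagate an absolute deadline and hand each layer whatever time remains. A sketch, where `remaining_budget` and the 50ms reserve are illustrative names and numbers:

```python
import time

def remaining_budget(deadline, reserve=0.05):
    """Seconds left before the caller's deadline, minus a small reserve
    so this layer can still return a sensible error instead of timing
    out alongside its dependency. (The 50ms reserve is illustrative.)"""
    return max(0.0, deadline - time.monotonic() - reserve)

def call_downstream(deadline):
    timeout = remaining_budget(deadline)
    if timeout == 0.0:
        raise TimeoutError("no budget left for the downstream call")
    # A real client would pass this on, e.g. requests.get(url, timeout=timeout),
    # so each layer's timeout is strictly shorter than its caller's.
    return timeout

# The front-end gave us 2 seconds; the downstream gets slightly less.
deadline = time.monotonic() + 2.0
budget = call_downstream(deadline)
```

Deadline propagation gets you the layered-timeout rule automatically: every hop subtracts what has already elapsed, so the leaf always has the shortest timeout.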
Retries with backoff and jitter
Once operations are idempotent and calls have timeouts, retries become safe. The question is how to retry.
Naive retry — “call failed, try again immediately” — has two failure modes. First, it hits the same overloaded backend with another request immediately, which is the opposite of what the backend needs. Second, it synchronizes: if a shared dependency stumbles, every client retries at the same time, producing a thundering herd of requests exactly when the dependency is most vulnerable. Clients that were spaced out moments earlier are now all hammering together.
Exponential backoff solves the first problem: wait 100ms, then 200ms, then 400ms, then 800ms. Jitter solves the second: randomize each wait so that two clients that failed at the same moment do not retry at the same moment. The AWS architecture blog’s canonical formulation is “full jitter” — wait a random time between zero and the current (capped) exponential delay — which works well in practice and is simple to implement.
Budget the retries. Retry forever and you have reimplemented the hang. Retry three to five times, with exponential backoff and jitter, and then give up. “Give up” means return a failure to the caller, who has their own retry budget, not silently drop the request.
One more thing: do not retry everything. A 400-class HTTP error (bad request, unauthorized, not found) is not going to succeed on retry — retrying a 401 just burns the dependency’s CPU rejecting you four more times. Retry on transient failures: timeouts, connection errors, 503s, 429s (with Retry-After respected). Do not retry on semantic failures. The client library defaults usually get this right; custom retry code usually does not.
Circuit breakers
A circuit breaker is a state machine in front of a dependency that watches for repeated failures and, once a threshold is crossed, stops sending requests for a while. The metaphor is the electrical circuit breaker: a persistent fault trips the breaker, the circuit opens, and load is removed from the faulty component until conditions improve.
Three states:
- Closed — traffic flows normally. Failures are counted.
- Open — traffic is rejected immediately, without attempting the downstream call. Open for a configured duration (“open period”).
- Half-open — after the open period, a small number of trial requests are allowed. If they succeed, the breaker closes. If they fail, the breaker opens again.
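The state machine fits in a few dozen lines. A toy Python sketch with a plain failure counter and an injectable clock (real libraries use sliding failure-rate windows, but the transitions are the same):

```python
import time

class CircuitBreaker:
    """Toy three-state breaker. Thresholds and the injectable clock are
    illustrative; libraries like resilience4j track failure rates over
    windows rather than a plain counter."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"             # allow a trial request
            else:
                raise RuntimeError("circuit open")   # fail fast, no downstream call
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self.state = "closed"                        # success closes the breaker
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

now = [0.0]                                    # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, open_seconds=30.0, clock=lambda: now[0])

def dead_downstream():
    raise ConnectionError("downstream not responding")

for _ in range(2):                             # two failures trip the breaker
    try:
        breaker.call(dead_downstream)
    except ConnectionError:
        pass

fast_failed = False
try:
    breaker.call(dead_downstream)              # rejected without calling downstream
except RuntimeError:
    fast_failed = True

now[0] = 31.0                                  # open period elapses
recovered = breaker.call(lambda: "recovered")  # half-open trial succeeds, closes
```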
What the circuit breaker buys you is fail-fast. Without it, a service whose downstream is dead keeps trying — hitting the timeout on every request, holding threads, slowing its own responses. With it, once the pattern of failure is established, requests fail in microseconds with a clear “circuit open” signal, and the caller can act on that signal (return a fallback, degrade the feature, show a cached value, surface an error to the user).
Two common mistakes:
- Breakers on the wrong granularity. A breaker per dependency is correct; a breaker per endpoint on that dependency is sometimes also correct; a breaker per specific request is always wrong. You want to detect “the downstream is in trouble,” not “this specific request failed.”
- No fallback. A circuit breaker without a fallback is just a faster error. That is still useful — it stops the cascade — but it does not make the system more available; it makes it fail more cleanly. Pair breakers with fallbacks wherever the business logic can tolerate one.
The reference implementations (Hystrix, resilience4j, Polly) all handle the state machine for you. The work is deciding where to put breakers, what their thresholds should be, and what the fallback is when they open.
Bulkheads
A bulkhead is a resource isolation pattern: different classes of work get different thread pools, connection pools, or memory pools, so that a surge in one class cannot exhaust the resources the others need. The name comes from ships — watertight compartments that prevent a single hull breach from sinking the whole vessel.
The motivating failure is familiar. A service with one thread pool serves both a critical user-facing endpoint and a batch-job endpoint. The batch endpoint has a slow dependency. A surge of batch requests fills the thread pool waiting on the slow dependency. The user-facing endpoint, which had nothing to do with the batch dependency, starts returning 503 because there are no threads left to handle its fast requests.
Separate pools fix this. User-facing requests get their own threads; batch requests get their own threads. A saturation in one class leaves the other class unaffected. The cost is overhead — more pools, more tuning, more configuration to get wrong — and the benefit is a failure that is local instead of total.
Bulkheads apply at many levels: thread pools within a process, connection pools within a client library, separate service instances for different workloads, separate clusters for different tenants. The principle is the same: isolate resources so that contention in one class does not starve another.
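The thread-pool form of the pattern is easy to demonstrate with Python's concurrent.futures (pool sizes and the 200ms stall are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Separate pools per work class: a stalled batch workload cannot consume
# the threads user-facing requests need.
user_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="user")
batch_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="batch")

def slow_batch_job():
    time.sleep(0.2)            # stand-in for a slow downstream dependency
    return "batch done"

def fast_user_request():
    return "user ok"

# Saturate the batch bulkhead with more jobs than it has threads...
batch_futures = [batch_pool.submit(slow_batch_job) for _ in range(8)]
# ...and user traffic is still served immediately from its own pool.
user_result = user_pool.submit(fast_user_request).result(timeout=0.5)
batch_results = [f.result() for f in batch_futures]
```

With one shared pool, the eight batch jobs would occupy every worker and the user request would queue behind roughly 800ms of backlog; with separate pools it is served at once.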
Fallbacks and graceful degradation
A fallback is what the system does when a dependency is unavailable. Options, roughly in order of ambition:
- Fail and surface the error. Sometimes there is no meaningful fallback — you cannot place an order if the payment service is down. The honest move is to tell the user.
- Return stale data. A cached version of the last successful response. Appropriate when the data is read-heavy and some staleness is tolerable — product details, recommendations, configuration.
- Return partial data. The dashboard loads the seven sections that succeeded and shows “temporarily unavailable” on the eighth. Appropriate when the UI can render meaningfully without every component.
- Return a default. A reasonable placeholder: “your recommendations will be back shortly,” a generic banner instead of a personalized one. Appropriate when absence is worse than a fake.
- Queue for later. Accept the user’s request and promise to process it when the dependency recovers. Appropriate for writes that can tolerate delay (notifications, analytics, some order flows).
The pattern to avoid is the silent fallback — swallowing an error and returning a default without telling anyone. The user sees “fine,” the monitoring sees “success,” and the bug is invisible until someone notices downstream that nothing has worked for a week. Every fallback should be instrumented: a metric, a log, a trace attribute that says “this response was served by fallback, the primary was unavailable.” Otherwise you are shipping a system whose failure modes are hidden from its own operators.
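A sketch of the stale-data fallback with the instrumentation the previous paragraph demands (the module-level counter stands in for a real metrics client, and the fetch functions are stand-ins for a catalog service):

```python
import logging

log = logging.getLogger("fallback")
fallback_served = 0     # stand-in for a real metrics counter
_cache = {}             # last successful response per product id

def get_product(product_id, fetch):
    """Return (data, served_stale). Fresh on success; stale from cache when
    the primary is down; an honest error when there is nothing cached."""
    global fallback_served
    try:
        fresh = fetch(product_id)
        _cache[product_id] = fresh
        return fresh, False
    except Exception as exc:
        if product_id in _cache:
            fallback_served += 1   # never silent: count it...
            log.warning("serving stale product %s: %s", product_id, exc)  # ...and log it
            return _cache[product_id], True
        raise                      # no cached copy: surface the error

def fetch_ok(pid):
    return {"id": pid, "name": "Widget"}

def fetch_down(pid):
    raise ConnectionError("catalog service unavailable")

fresh, was_stale = get_product(42, fetch_ok)      # warms the cache
stale, now_stale = get_product(42, fetch_down)    # primary down, stale served
```

Returning the served-stale flag alongside the data lets the caller mark the response too (a header, a trace attribute), so the degradation is visible end to end.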
Hedged requests
A newer pattern, developed at Google and popularized in Jeff Dean and Luiz Barroso’s “The Tail at Scale” paper. For read requests where p99 latency is much worse than p50 — i.e., most distributed systems — send a second request after a delay slightly longer than p95, and use whichever response returns first. The duplicate effort is small (by construction, you only hedge on the slow tail, which is rare) and the tail latency improves dramatically.
The caveats are the same as always. Hedged writes are only safe if they are idempotent; otherwise you double the work. And hedging amplifies load during incidents — if the system is slow because it is overloaded, hedging adds more requests, which makes it slower. Production hedging implementations usually include a circuit on the hedge itself: stop hedging when the slow-tail rate gets too high.
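A sketch using two threads: send the request, wait just past the hedge delay, and duplicate only if the first attempt is still outstanding. `hedge_after` would be tuned near p95 in a real system, and `flaky_fetch` is a stand-in for a backend with a slow tail:

```python
import concurrent.futures
import time

def hedged_get(fetch, hedge_after):
    """Issue `fetch`; if it has not answered within `hedge_after` seconds,
    issue a duplicate and take whichever returns first. Only safe for
    idempotent reads."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch)]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
        if not done:                       # the slow tail: hedge
            futures.append(pool.submit(fetch))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

calls = []

def flaky_fetch():
    """First call hits the slow tail; the hedge comes back fast."""
    calls.append(1)
    if len(calls) == 1:
        time.sleep(0.3)
        return "slow"
    return "fast"

result = hedged_get(flaky_fetch, hedge_after=0.05)
```

Note that the hedge fires only on the slow tail: a response that arrives before `hedge_after` produces exactly one downstream call, which is why the duplicate-load cost stays small.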
How they compose
Each pattern addresses a specific failure mode, and they only produce a resilient system when composed honestly. A reasonable default stack for a service call:
- The operation is idempotent, or has an idempotency key.
- The call has a timeout shorter than the caller’s timeout.
- Failures are retried with exponential backoff and jitter, up to a bounded budget, only on transient errors.
- Persistent failures trip a circuit breaker that fails fast with a clear signal.
- The open breaker triggers a fallback — stale data, partial data, a default, or an honest error.
- The call shares a bulkhead with other calls of the same class and is isolated from different classes.
- Every one of these is instrumented — metrics on retries, breaker state transitions, fallback use — so that the operators can see what the system is doing.
That is seven patterns for one call. Most of it is handled by a library (resilience4j, Polly, a service mesh, gRPC’s interceptor chain) if you let it. Building each of these by hand is how you discover, at 3am, which one you forgot.
The principle under all of them
Every resilience pattern is a variation on the same principle: assume things will fail, and decide in advance what the system does when they do. A system with explicit answers to “what happens when the database is slow” and “what happens when the auth service is down” and “what happens when a single tenant’s batch job goes haywire” is a system that stays partially available through incidents. A system that hopes none of those things happen is a system whose failure modes are discovered, one at a time, in production.
You cannot prevent failure. You can bound its blast radius. Every pattern in this post is a specific shape of bound: a timeout bounds how long, a circuit breaker bounds how many, a bulkhead bounds how wide, a fallback bounds how visible. Design the bounds, instrument them, and exercise them. Treat the network as hostile. The network does not need your kindness; it will be hostile regardless.