Alok Ranjan Daftuar

Posted on • Originally published at aloknecessary.github.io

Idempotency in Distributed Systems: Design Patterns Beyond 'Retry Safely'

Every engineer in distributed systems has heard: "make your operations idempotent." It gets repeated in design reviews and dropped into architecture docs as if the word itself is a solution.

The gap between understanding what idempotency means and actually building systems that enforce it under real-world failure conditions is enormous.

"Retry safely" is the starting point — not the destination.


Idempotency vs. Deduplication

Teams often conflate these. They are related but not the same:

  • Idempotency — the outcome is the same regardless of how many times the operation is applied
  • Deduplication — detecting and discarding duplicate requests so the operation executes exactly once

A PUT /users/{id} is naturally idempotent. A POST /orders is not — calling it ten times creates ten orders. To make it safe to retry, you need deduplication. And deduplication requires idempotency keys.


Idempotency Keys

An idempotency key is a client-supplied identifier that scopes a request to a single logical operation. Key design decisions:

  • Client ownership — the client knows the intent; it generates the key
  • Time-bounded validity — keys should not live forever; align TTL with your retry window
  • Granularity — one key per logical operation, not per HTTP request
  • Request fingerprinting — hash the payload to detect same key with different payloads (client bugs)
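The client-side half of these decisions can be sketched in a few lines of Python. This is an illustrative sketch, not code from the article — `make_idempotency_key` and `fingerprint` are hypothetical names, and the canonical-JSON approach is one common way to implement payload fingerprinting:

```python
import hashlib
import json
import uuid

def make_idempotency_key() -> str:
    # The client generates the key once per logical operation and
    # reuses it verbatim on every retry of that same operation.
    return str(uuid.uuid4())

def fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so payloads that
    # differ only in field order hash identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

On the server side, seeing a known key arrive with a different fingerprint signals a client bug and should be rejected, not silently deduplicated.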

The Two-Phase Reservation Pattern

The naive check-then-act pattern has a race condition. Two concurrent requests with the same key can both pass the existence check.

The correct approach:

  1. Reservation — atomically insert the key as IN_PROGRESS (insert-if-not-exists)
  2. Completion — after processing, update to COMPLETED with the response payload

This eliminates concurrent duplicate processing. Every production idempotency implementation should use this or an equivalent.
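A minimal sketch of the two phases, using an in-memory SQLite table as a stand-in for a durable store such as Postgres (the schema, table name, and function names are illustrative; the same `INSERT ... ON CONFLICT DO NOTHING` shape works in Postgres):

```python
import sqlite3

# Illustrative schema: key, state machine status, and the stored response.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency_keys (
        key      TEXT PRIMARY KEY,
        status   TEXT NOT NULL,   -- IN_PROGRESS | COMPLETED
        response TEXT             -- response payload, replayed on retry
    )
""")

def reserve(key: str) -> bool:
    # Phase 1: atomic insert-if-not-exists. Under concurrency, exactly
    # one caller inserts the row; all others see rowcount == 0 and back off.
    cur = conn.execute(
        "INSERT INTO idempotency_keys (key, status) VALUES (?, 'IN_PROGRESS') "
        "ON CONFLICT(key) DO NOTHING",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1

def complete(key: str, response: str) -> None:
    # Phase 2: record the outcome so retries replay the stored response
    # instead of re-executing the operation.
    conn.execute(
        "UPDATE idempotency_keys SET status = 'COMPLETED', response = ? "
        "WHERE key = ?",
        (response, key),
    )
    conn.commit()
```

The losing caller's next step depends on the row it finds: replay the stored response if COMPLETED, or wait/return a conflict if still IN_PROGRESS.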


API Gateway vs. Application Layer

  • Gateway-level — short-circuit caching for the happy path; fast but doesn't protect your data layer
  • Application-level — owns the deduplication store, handles correctness, coordinates multi-step workflows

In production, they're complementary. The gateway is a performance optimization; the application layer is where correctness lives.


Failure Scenarios Most Teams Miss

  1. Completed-but-undelivered response — operation succeeds, connection drops before response reaches client
  2. Key reuse across different operations — client bug sends different payload with same key
  3. Concurrent requests with the same key — race condition without atomic reservation
  4. Partially-applied multi-step operations — saga crashes mid-workflow, retry re-runs completed steps
  5. Message queue consumers without idempotency — API-layer protection doesn't help queue consumers
  6. TTL expiry during active retry windows — key expires while client is still retrying
  7. Schema evolution breaking retry flows — new schema + old key = fingerprint mismatch rejection
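Scenario 5 above can be sketched as a consumer that claims each message id before producing any side effects. This is an illustrative in-memory sketch — in production the `seen` set would be the same durable deduplication store the API layer uses, since redeliveries can span consumer restarts:

```python
import threading

class IdempotentConsumer:
    """Dedup queue deliveries by message id before running the handler."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()               # stand-in for a durable dedup store
        self.lock = threading.Lock()

    def on_message(self, msg_id: str, body: dict) -> bool:
        # Atomically claim the message id; duplicate deliveries are
        # dropped before any side effect runs.
        with self.lock:
            if msg_id in self.seen:
                return False            # redelivery, skip processing
            self.seen.add(msg_id)
        self.handler(body)
        return True
```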

Key Takeaway

Idempotency done right is a system-wide property, not a feature you add to an endpoint. It requires deliberate decisions about key semantics, durable deduplication stores, clear responsibility allocation between gateway and application layers, and explicit handling of failure scenarios beyond "retry on 5xx."

"Retry safely" is where you start. Building a system that is actually safe to retry under real-world conditions is the harder, more interesting problem.


Read the Full Article

This is a summary of my comprehensive deep dive into idempotency patterns. The full article covers each pattern in detail with production-grade implementation strategies:

👉 Idempotency in Distributed Systems — Full Article

The full article includes:

  • Detailed deduplication store architecture with Redis, PostgreSQL, and DynamoDB trade-offs
  • Complete two-phase reservation pattern walkthrough
  • All 7 failure scenarios with mitigations
  • Cross-cutting concerns: observability, testing strategy, and downstream service idempotency

Top comments (4)

Mayckon Giovani

This is one of those posts that actually separates people who have read about distributed systems from people who have been burned by them in production.

The strongest part here is the reframing: idempotency as a system property, not an endpoint property. That sounds obvious, but in practice most systems still treat it as an HTTP concern, which is why everything falls apart the moment you introduce async boundaries, retries across queues, or partial failures inside a saga.

The distinction between idempotency and deduplication is also spot on and rarely articulated this cleanly. A lot of teams think “we added an idempotency key” means they solved correctness, when in reality they just added a best-effort cache. Without a durable, atomic reservation mechanism, you are still fundamentally in at-least-once execution territory with undefined side effects.

The Two-Phase Reservation pattern is the real backbone here. What’s interesting is that it’s effectively the same shape as a lightweight consensus primitive. You are establishing a single writer for a logical operation under concurrency. If that insert-if-not-exists is not strongly consistent, everything else in the system is built on sand. I’d push even harder on that point: the correctness of idempotency is only as strong as the consistency model of the store backing it.

One thing I’d expand is how this interacts with event-driven architectures. Idempotency at the API layer is meaningless if downstream consumers are not idempotent. In practice, every boundary that can replay, reorder, or duplicate messages needs its own idempotency domain, otherwise you get what I call idempotency fragmentation, where each layer assumes another one is handling it.

Another subtle but critical angle is semantic idempotency vs. structural idempotency. Hashing payloads works until business meaning diverges from structure. Two requests can be structurally identical and still be semantically different depending on timing, external state, or side-channel effects. That’s where systems start leaking invariants.

Also worth calling out explicitly: idempotency does not give you exactly-once semantics. It gives you deterministic convergence under retries. That’s a very different guarantee, and a lot of production incidents come from engineers assuming stronger properties than what was actually implemented.

The failure scenarios section is excellent, especially the completed-but-undelivered case. That single edge case alone is responsible for a huge class of financial inconsistencies in payment systems. If your idempotency layer does not persist the response artifact, not just the execution state, you’ve already lost.

If I were to push this even further into “production-grade thinking,” I’d bring in:

  • State machine formalization of idempotent operations (IN_PROGRESS → COMPLETED → FAILED with invariants)
  • Explicit discussion of consistency trade-offs (Redis vs. Postgres vs. DynamoDB under partition)
  • Interaction with sagas and compensating actions, especially when retries re-enter partially completed workflows
  • Observability at the idempotency layer itself (key lifecycle tracing, collision metrics, replay visibility)

Overall, this is the kind of write-up that should exist in every system design doc before the first line of code is written. Most teams only discover these patterns after their first serious incident.

Alok Ranjan Daftuar

Thank you — this is exactly the kind of comment that makes writing these posts worthwhile.

Your point on idempotency fragmentation is something I deliberately left implicit, and in hindsight it deserved its own section. The assumption that "someone else is handling it" at each boundary is one of the most persistent and expensive beliefs in distributed systems design. Every async boundary is a new idempotency domain, full stop.

The semantic vs. structural distinction is a sharp one. Payload hashing is a structural check, and you're right that it breaks down when two requests carry identical bytes but different intent — timing-dependent state, external context, caller assumptions that aren't encoded in the payload. I've seen this surface most painfully in pricing and inventory systems where the "same" request means something different depending on when it lands. It's an under-discussed failure class.

On the consistency model point — fully agree, and I probably understated it. The correctness guarantee of the two-phase reservation is strictly bounded by the consistency of the underlying store. A Redis cluster under a network partition, or a DynamoDB table relying on eventually consistent reads, gives you probabilistic deduplication, not deterministic correctness. That gap is where teams get surprised. A future post on distributed locking will go deeper there.

The exactly-once vs. deterministic convergence distinction is precise and worth repeating loudly. Idempotency gives you a stable outcome under retries, not a guarantee that the operation executed once. Engineers conflating the two end up with systems that appear correct in testing and fail in subtle ways under production load.

The four areas you'd push further — state machine formalization, consistency trade-offs per store, saga re-entry, and idempotency-layer observability — are all on the roadmap. The saga interaction piece is covered partially in the follow-up post on orchestration vs. choreography, but the re-entry angle specifically (retries landing mid-compensation) deserves dedicated treatment. Adding it to the list.

Appreciate the thorough engagement with this.

Andre Cytryn

the consistency model of the backing store is the part that trips teams up most in practice. two-phase reservation only works if insert-if-not-exists is genuinely atomic, and that rules out naive Redis approaches unless you use Lua scripts for the check-and-set atomically. Postgres INSERT ... ON CONFLICT DO NOTHING inside a serializable transaction gives you much cleaner correctness guarantees, which is why I've seen teams reach for it even when Redis is already in the stack. curious whether you've run into the TTL alignment problem specifically, where a key expires while the operation is still IN_PROGRESS during a slow downstream call.

Alok Ranjan Daftuar

The Lua point is worth hammering — most Redis implementations get this wrong because SET NX + GET is two round trips, not one atomic op. The atomicity only exists inside a single Lua execution context.

And yes, Postgres INSERT ... ON CONFLICT DO NOTHING is underrated here. Correctness proof is simpler, and co-locating the deduplication record with your business data in the same transaction is the real win, not just the uniqueness guarantee.

On TTL expiry during slow downstream calls — hit this in a payment context. The fix was decoupling IN_PROGRESS TTL (sized to p99 of your slowest call) from COMPLETED TTL (sized to your full retry window), with a compare-and-swap on completion so a late-arriving first response can't silently overwrite a retried completion. Only clean solution short of making TTLs unreasonably long.
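The decoupled-TTL fix described above can be sketched in Python. This is an illustrative sketch under stated assumptions — the TTL values, class shape, and a plain dict standing in for Redis or Postgres are all hypothetical; the point is the two separate TTLs plus the compare-and-swap on completion:

```python
import time

class ReservationStore:
    """Sketch: short TTL for IN_PROGRESS, long TTL for COMPLETED,
    and CAS on completion so a stale first attempt cannot overwrite
    a retried attempt that took over after expiry."""

    IN_PROGRESS_TTL = 30          # ~p99 of the slowest downstream call (illustrative)
    COMPLETED_TTL = 24 * 3600     # the full client retry window (illustrative)

    def __init__(self):
        # key -> (status, attempt_id, response, expires_at)
        self.rows = {}

    def reserve(self, key, attempt_id, now=None):
        now = time.time() if now is None else now
        row = self.rows.get(key)
        if row and row[3] > now:
            return False          # a live reservation already exists
        # Expired or absent: this attempt takes ownership of the key.
        self.rows[key] = ("IN_PROGRESS", attempt_id, None,
                          now + self.IN_PROGRESS_TTL)
        return True

    def complete(self, key, attempt_id, response, now=None):
        now = time.time() if now is None else now
        row = self.rows.get(key)
        # CAS: only the attempt that currently owns the key may complete it,
        # so a late response from an expired attempt is rejected.
        if not row or row[0] != "IN_PROGRESS" or row[1] != attempt_id:
            return False
        self.rows[key] = ("COMPLETED", attempt_id, response,
                          now + self.COMPLETED_TTL)
        return True
```

Walking the failure through: attempt `a1` reserves the key, stalls past the IN_PROGRESS TTL, a retry `a2` takes over, and when `a1`'s late response finally arrives its `complete` call fails the CAS instead of clobbering `a2`'s result.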