Key takeaways

  • Every production webhook system is at-least-once. Exactly-once delivery is a distributed-systems fiction.
  • Idempotency starts with a stable event ID, gets enforced by an atomic dedupe store, and gets verified by side-effect design downstream.
  • A 24–72 hour TTL on your dedupe cache covers ~99% of real-world retry storms. Don’t store forever — storage grows unbounded and you lose the ability to replay.
  • Return the right HTTP codes. 2xx means “stop retrying.” 5xx means “try again.” Mixing them up is the most common bug in this whole space.
  • Test duplicate delivery in CI. Send every integration-test event twice on every run. If the second delivery changes the outcome, you have an idempotency bug waiting in production.

Webhook bugs don’t page you on Tuesday afternoon. They page you at 2am, when your consumer is looping on a retry, a partner’s system is melting under your accidental load, and your on-call is trying to figure out whether the duplicate charges in Stripe are your bug or their replay.

The list below is the one we run internally before any new webhook source or consumer goes live. It’s not novel. Most of it has been written before, in better blog posts. But nobody has the full checklist in one place, so here it is.

1. Assume at-least-once delivery

There is no webhook provider that guarantees exactly-once delivery, because in a distributed system it’s impossible. What they guarantee is at-least-once, which means you should expect the same event to arrive one, two, or ten times.

Every claim of “exactly-once” in this space is one of two things: at-least-once with idempotent handlers downstream (honest), or a marketing statement (dishonest). Build as if you’ll see every event multiple times, and the rest of the list falls out naturally.

2. Every event needs a stable ID

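Idempotency starts at the event itself, not at your handler. The event must carry a unique, stable identifier that you can dedupe on. Not a hash of the payload: two distinct events with identical payloads collide, and a retry whose payload differs in a single field (a delivery timestamp, say) no longer matches. A dedicated ID field the producer generates once and re-sends verbatim on retry.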

If you’re the producer, UUIDv4 or UUIDv7 is fine. UUIDv7 sorts chronologically, which is a nice property for dedupe-table indexes.
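As a rough sketch of what “generate once, re-send verbatim” can look like on the producer side: the envelope fields and the `send` callable below are our own assumptions for illustration, not any provider’s format. The point is that the ID is minted when the event is created, and every retry re-sends the same bytes.

```python
import json
import uuid


def build_event(event_type: str, data: dict) -> bytes:
    """Serialize an event once; the ID is generated here and never again."""
    envelope = {
        "id": str(uuid.uuid4()),  # minted at creation time, not at send time
        "type": event_type,
        "data": data,
    }
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


def deliver_with_retries(payload: bytes, send, max_attempts: int = 3) -> int:
    """Re-send the same bytes on every attempt so the ID never changes.

    `send` is a hypothetical callable that takes the payload and returns
    an HTTP status code. Returns the number of attempts made.
    """
    for attempt in range(1, max_attempts + 1):
        if 200 <= send(payload) < 300:
            return attempt
    return max_attempts
```

Because the retry loop never rebuilds the envelope, a consumer that dedupes on `id` sees all attempts as the same event.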

3. Use the provider’s ID when it exists

Most major providers already give you one. Stripe’s id field. GitHub’s X-GitHub-Delivery header. Linear’s Linear-Delivery header. Use those. Don’t hash the payload. Don’t generate your own on receive: you want the ID to be stable across the producer’s retries, and the producer is the only one that can guarantee that.

If the provider doesn’t send an ID, file an issue with them. Then fall back to hashing (timestamp, event_type, subject_id) to get something close to stable — but know that it’ll occasionally collide on legitimate near-simultaneous events, so treat your dedupe as best-effort.
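If you do end up on the fallback path, a minimal sketch might look like this. The function name and the choice of SHA-256 are ours; the three fields are the ones named above. Each field is length-prefixed so that different field boundaries can’t produce the same hash input.

```python
import hashlib


def fallback_event_id(timestamp: str, event_type: str, subject_id: str) -> str:
    """Derive a best-effort dedupe key when the provider sends no ID."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot
    # produce the same input to the hash.
    parts = [timestamp, event_type, subject_id]
    material = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two legitimate events with the same timestamp, type, and subject still map to the same key, which is exactly the best-effort caveat.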

4. Dedupe on receive, not on process

Before any business logic runs, check whether you’ve seen this event ID. The check happens in the HTTP request handler, not in the downstream worker. Reason: if dedupe happens in the worker, a retry that lands between “received” and “enqueued” produces two queued jobs, and you’ve already lost the battle.

# Python / Flask — dedupe at the handler boundary
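from flask import Flask, request

app = Flask(__name__)
# dedupe_store and queue are application-provided objects (not shown)

@app.post("/webhooks/stripe")
def stripe_webhook():
    event = request.get_json()
    event_id = event["id"]  # Stripe's event ID is in the payload, not a header
    if dedupe_store.seen_recently(event_id):
        return "", 200  # already accepted, so tell Stripe to stop retrying

    # Enqueue before marking seen: a crash in between yields a duplicate
    # job (safe under at-least-once), never a lost event.
    queue.enqueue("process_stripe_event", event)
    dedupe_store.mark_seen(event_id, ttl=86400 * 3)  # 3 days
    return "", 200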

5. Store dedupe state atomically

The dedupe store write and the enqueue need to be atomic — or at least, they need to happen in the right order so a crash between them is safe.

In Redis: SET key 1 NX EX 259200 gives you a conditional write with TTL in one round-trip. In Postgres:

INSERT INTO webhook_dedupe (event_id, received_at)
VALUES ($1, NOW())
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;

With ON CONFLICT DO NOTHING, a conflicting insert returns no rows, so the RETURNING clause tells you whether the row was new (one row back) or a duplicate (zero rows back), in a single round-trip. If the row was new, continue to processing. If it was a duplicate, return 200 and do nothing.
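Here is a minimal sketch of the same insert-if-absent check in Python. It uses the standard library’s sqlite3 as a stand-in for Postgres so it runs anywhere; the table name matches the SQL above, and the helper names are ours for illustration. The database decides in one statement whether the event is new, so there is no gap between “check” and “write.”

```python
import sqlite3


def open_dedupe_db() -> sqlite3.Connection:
    """Create an in-memory dedupe table (stand-in for a real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webhook_dedupe ("
        " event_id TEXT PRIMARY KEY,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def first_time_seen(conn: sqlite3.Connection, event_id: str) -> bool:
    """Atomically record the event ID; True only for the first delivery."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_dedupe (event_id) VALUES (?)",
            (event_id,),
        )
        # rowcount is 1 when the row was inserted, 0 when it already existed.
        return cur.rowcount == 1
```

A handler would call `first_time_seen` before enqueueing and return 200 immediately when it comes back False.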

6. Give your dedupe cache a TTL

Don’t store dedupe keys forever. Storage grows unbounded, and you lose the ability to deliberately replay an event for recovery.

For most providers, 24–72 hours covers the retry-storm window. Stripe gives up after 3 days. GitHub does not retry failed deliveries automatically, but a delivery can be redelivered by hand or through its API, so duplicates can still arrive later. Pick your TTL based on the longest window of any upstream you accept events from, and add a day of safety margin.
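The TTL arithmetic is small enough to keep in code next to the list of upstreams. In this sketch the Stripe window comes from the text above, while “internal-billing” and its 24-hour window are made up for illustration.

```python
from datetime import timedelta

# Retry windows per upstream. The Stripe figure comes from the text above;
# "internal-billing" is a hypothetical second source.
RETRY_WINDOWS = {
    "stripe": timedelta(days=3),
    "internal-billing": timedelta(hours=24),
}
SAFETY_MARGIN = timedelta(days=1)


def dedupe_ttl_seconds(windows: dict[str, timedelta]) -> int:
    """Longest upstream retry window plus a day of margin, in seconds."""
    return int((max(windows.values()) + SAFETY_MARGIN).total_seconds())
```

With these inputs the TTL works out to four days. Adding a new upstream with a longer window raises it automatically.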

7. Make side-effects conditional

Even with perfect dedupe, retries can happen inside your own pipeline — a worker crashes mid-job and the queue redelivers. So every side-effect your handler performs should be conditional on prior state.

Bad:

UPDATE deals SET status = 'won' WHERE id = $1;

Good:

UPDATE deals
SET status = 'won', won_at = NOW()
WHERE id = $1 AND status = 'negotiation';

The good version is a compare-and-swap. If the deal already moved to won (because a retry already processed), the update affects zero rows and you know to skip the downstream notifications.
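One way to act on the “zero rows” signal, sketched in Python with sqlite3 standing in for the real database. The `deals` columns follow the SQL above; `notify` is a hypothetical callable for whatever notification you send.

```python
import sqlite3


def mark_deal_won(conn: sqlite3.Connection, deal_id: str) -> bool:
    """Move a deal from 'negotiation' to 'won'; True if this call did it."""
    with conn:  # one transaction around the compare-and-swap
        cur = conn.execute(
            "UPDATE deals SET status = 'won', won_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'negotiation'",
            (deal_id,),
        )
    return cur.rowcount == 1


def handle_deal_won(conn: sqlite3.Connection, deal_id: str, notify) -> None:
    # Only the delivery that actually changed the row sends notifications.
    if mark_deal_won(conn, deal_id):
        notify(deal_id)
```

A crash between the update and the notification would still drop the notification. That gap is what item 9 is for.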

8. Propagate idempotency keys to downstream APIs

If your webhook handler calls out to another API — Stripe to issue a refund, Resend to send an email, Twilio to send an SMS — most modern APIs accept an Idempotency-Key header. Use the upstream event ID as the key, or derive it deterministically.

// Go — retry-safe charge creation
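req, err := http.NewRequest("POST", "https://api.stripe.com/v1/charges", body)
if err != nil {
    return err // assumes the enclosing function returns an error
}
req.Header.Set("Idempotency-Key", eventID) // stable across our retries
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/x-www-form-urlencoded") // Stripe expects form-encoded bodies
resp, err := httpClient.Do(req)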

A retry of the handler will send the same key. Stripe returns the original response — no duplicate charge.
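When one event triggers more than one downstream call, each call needs its own key, or the second call looks like a replay of the first. A minimal sketch of deriving keys deterministically, assuming a namespace UUID of our own choosing (the domain below is a placeholder):

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID works,
# as long as it never changes between deployments.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "webhooks.example.com")


def idempotency_key(event_id: str, action: str) -> str:
    """Same (event, action) pair gives the same key on every retry."""
    return str(uuid.uuid5(APP_NAMESPACE, f"{event_id}:{action}"))
```

For example, `idempotency_key(event_id, "refund")` and `idempotency_key(event_id, "email")` differ from each other but stay fixed across retries of the same event.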

9. Use the outbox pattern for multi-system writes

When your handler does more than one write — local DB + external API + message queue — you can’t make all of them atomic. The fix is the transactional outbox:

  1. In the same database transaction as your business-logic write, insert a row into an outbox table describing the side-effect (“send this email,” “call this API”).
  2. A separate worker polls the outbox and performs the side-effects, marking each row done.
  3. If a side-effect fails, the outbox row retries — naturally at-least-once.

Combined with downstream idempotency keys (item 8), the whole chain is safe under failure.

10. Return the right HTTP status codes

The rule is short: return 2xx once the event is durably accepted, and 5xx only when you need the provider to send it again. The most common bug here is returning 500 when you’ve actually processed the event but a downstream notification failed. The provider retries, your dedupe catches it, and your logs fill with spurious retries that look like real failures. Keep the retry signal honest.
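A small sketch of how that separation can look in a handler. `accept` and `notify` are hypothetical callables: the first durably records the event (dedupe plus enqueue), the second is a non-critical follow-up.

```python
import logging

log = logging.getLogger("webhooks")


def handle_delivery(event: dict, accept, notify) -> int:
    """Return the HTTP status to send back to the provider."""
    try:
        accept(event)
    except Exception:
        log.exception("could not accept event %s", event.get("id"))
        return 500  # not recorded: ask the provider to retry

    try:
        notify(event)
    except Exception:
        # The event is safely recorded; a provider retry would not help,
        # so log the failure and recover on our side.
        log.exception("notification failed for event %s", event.get("id"))
    return 200  # recorded: tell the provider to stop retrying
```

Only a failure to record the event produces a 500. Everything after that point is our problem, not the provider’s.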

11. Test duplicate delivery explicitly

Every integration test in your webhook suite should run the test event twice, back-to-back. If the result after two deliveries is different from the result after one, you have an idempotency bug.

// Node / Jest
test('deal-won webhook is idempotent', async () => {
  const event = buildDealWonEvent({ dealId: 'deal_123' });

  await postWebhook('/webhooks/deals', event);
  await postWebhook('/webhooks/deals', event);  // same event, again

  const deal = await db.deals.find('deal_123');
  const notifications = await db.notifications.where({ deal_id: 'deal_123' });

  expect(deal.status).toBe('won');
  expect(notifications).toHaveLength(1);  // not 2
});

Most idempotency bugs we’ve seen would have been caught by a one-line change to the test harness.
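To make the harness change concrete, here is a rough Python sketch. `FakeDealsApp` is an in-memory stand-in we invented so the example runs on its own; in a real suite, `post` would hit your actual handler.

```python
class FakeDealsApp:
    """In-memory stand-in for a webhook consumer, for illustration only."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.notifications: list[str] = []

    def post(self, path: str, event: dict) -> int:
        if event["id"] not in self.seen:  # dedupe on the event ID
            self.seen.add(event["id"])
            self.notifications.append(event["deal_id"])
        return 200


def deliver_twice(post, path: str, event: dict) -> list[int]:
    """The harness change: every test delivery goes out twice."""
    return [post(path, event), post(path, event)]


def test_deal_won_is_idempotent() -> None:
    app = FakeDealsApp()
    event = {"id": "evt_1", "deal_id": "deal_123"}
    statuses = deliver_twice(app.post, "/webhooks/deals", event)
    assert statuses == [200, 200]
    assert app.notifications == ["deal_123"]  # one notification, not two
```

Routing every test through `deliver_twice` means new handlers get the duplicate-delivery check without anyone remembering to write it.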

12. Monitor dedupe rate and handler percentiles

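Two metrics tell you more about webhook health than anything else: the dedupe rate (the share of incoming events your dedupe store rejects as duplicates) and handler latency.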

Alert on the 95th percentile of both, over 10-minute windows. That’s the fire alarm, not the dashboard.
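A minimal sketch of the alert check, assuming the two metrics are the dedupe rate and handler latency (as the heading suggests) and that each 10-minute window is available as a list of samples. The function names and thresholds are placeholders, not recommendations.

```python
def dedupe_rate(total_events: int, duplicate_events: int) -> float:
    """Share of incoming events that were rejected as duplicates."""
    return duplicate_events / total_events if total_events else 0.0


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def should_alert(window_samples: list[float], threshold: float) -> bool:
    """True when the window's 95th percentile exceeds the threshold."""
    return p95(window_samples) > threshold
```

You would call `should_alert` once per window for each metric: per-request latencies against a latency threshold, and per-minute dedupe rates against a rate threshold.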


Closing

Idempotency isn’t a library you install. It’s a discipline you enforce on every handler you write. The checklist above is roughly the order we run it in code review: event ID → dedupe store → conditional side-effects → downstream keys → tests. Skip any step and you ship a bug that only shows up under retry pressure, which is the worst time to find a bug.

If you’re designing a webhook system today and any item on this list is unchecked, it’s worth fifteen minutes to write down what your team would do if you caught a duplicate processed tomorrow at 2am.