Why webhook deliveries fail and how to replay them safely

A webhook that fires once is a feature. A webhook that fires once and is never seen by the consumer is a silent data loss. A webhook that fires twice and runs the side effect twice is a billing incident.

This is the working guide to the failure modes that actually take down webhook delivery in 2026, what to do when they do, and how to design a replay flow that does not make things worse.

The five failure modes that account for 95% of dropped webhooks

Out of every hundred webhook failures, almost all of them are one of these:

Endpoint timeout. The consumer took longer than the sender's timeout (usually 5–30 seconds) to respond. The sender retries, the consumer is still slow, the retry queue piles up.
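5xx from the consumer. The consumer is alive but errored. Common causes: a deploy in progress, a database that just hit max connections, a downstream API that is throttling. RFC 9110 §15.6 defines the 5xx class as errors on the server's side, so the request itself is usually fine and retrying later is the right default (a 503 may even carry a Retry-After header). The exceptions are 501 and 505, where resending the same request unchanged will not help.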
Signature mismatch. The HMAC signature did not validate on the consumer side. Almost always a configuration drift: the secret was rotated on one side and not the other.
TLS or DNS failure. The consumer's certificate expired, the domain stopped resolving, or the TLS handshake failed for a cipher reason. Rare but catastrophic when it happens, because every webhook fails until someone notices.
Network-level blackhole. The consumer's firewall, a WAF rule, or a CDN started 403'ing the sender's IP. Looks like a 4xx from the sender's side, but the consumer's application never saw the request, so nothing shows up in the consumer's own logs.

Each of these has a different fix. Treating them all as "the webhook failed, retry it" is how you end up with a queue that runs forever and a consumer that has been down for 48 hours without anyone noticing.
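
To make "a different fix for each" concrete, here is a minimal sketch of how a sender might sort a failed attempt into the five buckets above before deciding what to do. The `Outcome` fields, the transport labels, the 401-means-bad-signature convention, and the retry-versus-alert policy are all assumptions for illustration, not part of any particular platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    TIMEOUT = "endpoint timeout"
    SERVER_ERROR = "5xx from the consumer"
    SIGNATURE = "signature mismatch"
    TLS_OR_DNS = "TLS or DNS failure"
    BLOCKED = "network-level block"
    OTHER = "unclassified"


@dataclass
class Outcome:
    status: Optional[int] = None   # HTTP status; None if no response arrived
    error: Optional[str] = None    # hypothetical transport label: "timeout", "tls", "dns"


def classify(outcome: Outcome) -> Failure:
    """Map one failed attempt onto one of the five failure modes."""
    if outcome.error == "timeout":
        return Failure.TIMEOUT
    if outcome.error in ("tls", "dns"):
        return Failure.TLS_OR_DNS
    if outcome.status is None:
        return Failure.OTHER
    if 500 <= outcome.status <= 599:
        return Failure.SERVER_ERROR
    if outcome.status == 401:      # assumption: consumer answers 401 on a bad signature
        return Failure.SIGNATURE
    if outcome.status == 403:      # WAF, firewall, or CDN rejecting the sender
        return Failure.BLOCKED
    return Failure.OTHER


# Assumed policy: transient failures retry, configuration failures go to a human.
RETRYABLE = {Failure.TIMEOUT, Failure.SERVER_ERROR}


def next_action(failure: Failure) -> str:
    return "retry" if failure in RETRYABLE else "alert"
```

The point of the split is that a timeout and an expired certificate should not land in the same queue: one clears up on its own, the other waits for someone to act.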

Retry strategy that actually works

Exponential backoff with jitter, capped at a sensible maximum. The shape:

First retry: 30 seconds after the first failure.
Each subsequent retry: doubles the wait, plus random jitter of ±25%.
Cap at 1 hour between retries.
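Stop retrying after 24 hours total elapsed, or after 12 attempts, whichever comes first. With these numbers the attempt cap is reached first, after roughly five to six hours of nominal waiting, so the 24-hour limit acts as a backstop for delays the schedule does not model.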

The jitter matters. Without it, when a consumer comes back up, every queued webhook hits it at the same instant — a thundering herd that takes the consumer back down. With jitter, the retries spread across a window and the consumer recovers gracefully. The math is in the AWS Architecture Blog's Exponential Backoff and Jitter post, which is still the canonical reference on why randomisation is non-optional.
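
The schedule above can be sketched in a few lines. This is an illustration under stated assumptions: the constant and function names are made up, "12 attempts" is read as the original delivery plus 11 retries, and jitter is applied to the capped value, so an individual wait can land up to 25% above the one-hour cap.

```python
import random

BASE_DELAY_S = 30            # first retry: 30 seconds after the first failure
MAX_DELAY_S = 3600           # cap at 1 hour between retries
JITTER = 0.25                # plus or minus 25%
MAX_ATTEMPTS = 12            # assumption: includes the original delivery
MAX_ELAPSED_S = 24 * 3600    # stop after 24 hours total


def nominal_delay(retry_number: int) -> int:
    """Wait before retry N (1-based), before jitter: 30s, 60s, 120s, ... capped at 1h."""
    return min(BASE_DELAY_S * 2 ** (retry_number - 1), MAX_DELAY_S)


def jittered_delay(retry_number: int, rng: random.Random) -> float:
    """Nominal delay scaled by a random factor in [0.75, 1.25]."""
    return nominal_delay(retry_number) * rng.uniform(1 - JITTER, 1 + JITTER)


def should_retry(attempts_made: int, elapsed_s: float) -> bool:
    """True while neither the attempt cap nor the elapsed-time cap has been hit."""
    return attempts_made < MAX_ATTEMPTS and elapsed_s < MAX_ELAPSED_S
```

Calling `nominal_delay(n)` for n from 1 to 8 gives 30, 60, 120, 240, 480, 960, 1920, and 3600 seconds. Passing in a `random.Random` instance rather than using the module-level generator keeps the schedule reproducible in tests.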

Beyond 24 hours, retry is not the answer. At that point, the consumer needs to know there is a backlog and pull it themselves. Pushing forever just buries the signal.

Idempotency: the consumer's responsibility

The single most important contract between a webhook sender and a webhook consumer is the idempotency key. Every webhook delivery carries a unique ID. Every consumer must:

Record the ID on first successful processing.
Skip processing on every subsequent delivery of the same ID.

This is what makes replay safe. Without it, every replay is a chance to double-charge a customer, double-send an email, or double-create a record.

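The consumer's idempotency window should be at least 48 hours, twice the 24-hour retry window. Anything shorter and a delayed retry can still cause a double-process. If manual replays can reach further back than that, the window has to cover the oldest delivery that can still be replayed; storage beyond that point is wasted.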

A note on what the ID should look like: the sender's delivery ID, not a hash of the event's payload. The delivery ID stays the same across every retry and replay of a delivery, so repeats are recognised and skipped on the consumer side. A payload hash cannot make that distinction safely: two genuinely separate events with identical payloads (two identical purchases a minute apart, say) would collapse into one, and the second would be silently dropped.
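
A consumer-side sketch of that contract, assuming a hypothetical `IdempotentHandler` wrapper and an in-memory dictionary standing in for whatever durable store the consumer actually uses. A real deployment would want the check-and-record step to be atomic (a unique constraint, for example) so two concurrent deliveries of the same ID cannot both pass the check.

```python
import time
from typing import Callable, Dict

WINDOW_S = 48 * 3600   # idempotency window from the text


class IdempotentHandler:
    """Runs `process` at most once per delivery ID within the window."""

    def __init__(self, process: Callable[[dict], None],
                 clock: Callable[[], float] = time.time):
        self._process = process
        self._clock = clock
        self._seen: Dict[str, float] = {}   # delivery ID -> time it was processed

    def handle(self, delivery_id: str, payload: dict) -> bool:
        """Return True if the payload was processed, False if it was a repeat."""
        now = self._clock()
        self._evict(now)
        if delivery_id in self._seen:
            return False                    # already processed: skip the side effect
        self._process(payload)              # if this raises, the ID is not recorded
        self._seen[delivery_id] = now       # record only after successful processing
        return True

    def _evict(self, now: float) -> None:
        expired = [k for k, t in self._seen.items() if now - t > WINDOW_S]
        for k in expired:
            del self._seen[k]
```

Because the ID is recorded only after `process` succeeds, a delivery that crashed halfway through is still eligible for the next retry. The key is the delivery ID, so two different deliveries with identical payloads are both processed.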

When and how to replay

Replay is the manual override for the automatic retry. It exists for three cases:

The automatic retry window expired and the consumer is finally back up.
A bug in the consumer was fixed and a window of deliveries needs to be re-processed.
The consumer rotated their endpoint URL or signing secret and a window of deliveries needs to be re-sent to the new configuration.

For each case, the replay flow should:

Select the window of deliveries to replay. Almost never "all of them" — pick a date range or a specific delivery ID.
Confirm the consumer is ready. Hit the endpoint with a health check before flooding it with backlog. A replay into a still-down consumer is a wasted operation.
Replay with a delay between deliveries. Not the full retry backoff — usually 100ms is enough — but not zero. A burst replay can take down a consumer that is just barely back up.
Log every replay separately. A replayed delivery should be visible in the delivery log as a replay, not as a fresh delivery. This is essential for debugging "why did this event get processed twice".

Skip the replay if the consumer is not idempotent. In that case, the replay is more dangerous than the original drop.
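
The four steps can be sketched as one function. Everything here is hypothetical scaffolding: `Delivery`, `replay_window`, and the injected `is_healthy`, `send`, and `log` callables stand in for whatever the sender's real storage, HTTP client, and delivery log look like.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Delivery:
    delivery_id: str
    created_at: float   # Unix timestamp of the original delivery
    payload: dict


def replay_window(
    deliveries: Iterable[Delivery],
    start: float,
    end: float,
    is_healthy: Callable[[], bool],
    send: Callable[[Delivery], int],
    log: Callable[[dict], None],
    delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Replay deliveries created in [start, end); return how many were sent."""
    # 1. Select the window, never "all of them".
    selected = [d for d in deliveries if start <= d.created_at < end]
    if not selected:
        return 0
    # 2. Confirm the consumer is ready before sending any backlog.
    if not is_healthy():
        raise RuntimeError("consumer failed health check; replay aborted")
    for index, delivery in enumerate(selected):
        # 3. Pace the replay: a short pause between deliveries, not a burst.
        if index > 0:
            sleep(delay_s)
        status = send(delivery)
        # 4. Log it as a replay, with the original delivery ID.
        log({"delivery_id": delivery.delivery_id, "kind": "replay", "status": status})
    return len(selected)
```

The replay reuses the original delivery ID, which is what lets an idempotent consumer skip anything it already handled.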

The delivery log is the source of truth

Every webhook system needs a delivery log that records, per attempt:

Timestamp.
HTTP status returned by the consumer.
Response body (first 1KB, for debugging).
Request body and headers (for diagnosing signature mismatches).
Latency.
Attempt number.

Without this log, you cannot answer the two questions that come up every time something breaks: "did the webhook fire?" and "what did the consumer say?". With it, both are a two-second lookup.
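
One way to picture the per-attempt record, as a sketch: the `Attempt` field names and the list-backed log are assumptions for illustration, and a real system would use a database table with an index on the delivery ID.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional

MAX_RESPONSE_BYTES = 1024   # keep only the first 1KB of the response body


@dataclass(frozen=True)
class Attempt:
    delivery_id: str
    attempt_number: int
    timestamp: float                 # Unix time the attempt was made
    status: Optional[int]            # None when no HTTP response arrived
    response_body: bytes
    request_body: bytes
    request_headers: Dict[str, str]
    latency_ms: float


def record_attempt(log: List[Attempt], attempt: Attempt) -> Attempt:
    """Append the attempt to the log, truncating the response body to 1KB."""
    stored = replace(attempt, response_body=attempt.response_body[:MAX_RESPONSE_BYTES])
    log.append(stored)
    return stored


def attempts_for(log: List[Attempt], delivery_id: str) -> List[Attempt]:
    """Answer "did it fire?" and "what did the consumer say?" for one delivery."""
    matches = [a for a in log if a.delivery_id == delivery_id]
    return sorted(matches, key=lambda a: a.attempt_number)
```

An empty result from `attempts_for` means the webhook never fired; otherwise the last entry's `status` and `response_body` are what the consumer said.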

Retention on the delivery log should be at least 30 days. Past that, archive the headers and status but drop the body: full payload retention has GDPR implications you don't need. The relevant text is GDPR Article 5(1)(c), data minimisation (keep only what is necessary for the stated purpose), together with Article 5(1)(e), storage limitation (keep it no longer than necessary).
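
The retention rule is small enough to sketch directly. The dictionary keys are hypothetical, and "drop the body" is read here as dropping both the request body and the stored response body while keeping every other field.

```python
from typing import Dict, List

FULL_RETENTION_S = 30 * 24 * 3600             # 30 days with full detail
BODY_FIELDS = ("request_body", "response_body")   # assumed field names


def archive_old(records: List[Dict], now: float) -> List[Dict]:
    """Return the records with bodies stripped from any older than 30 days."""
    result = []
    for record in records:
        if now - record["timestamp"] > FULL_RETENTION_S:
            # Keep headers, status, and the rest; drop the payloads.
            record = {k: v for k, v in record.items() if k not in BODY_FIELDS}
        result.append(record)
    return result
```

The function builds new dictionaries rather than mutating its input, so a job can run it against a batch and write the result back in one step.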

Health monitoring that catches the silent failures

The dangerous failures are not the ones that fire alerts. They are the ones that don't.

Two health signals to monitor:

Delivery rate per endpoint, counting successful deliveries only. If a consumer that normally receives 100 deliveries per hour drops to 0, alert. This catches the cases where every delivery is failing (DNS, TLS, blackhole) and the consumer has no idea.
Success rate per endpoint, rolling 1-hour window. Below 95% for more than an hour, alert. This catches the slow drift cases where some deliveries are working and some are not.
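
Both signals reduce to short predicates. This is a sketch with invented names: it assumes the monitoring system already produces a count of successful deliveries for the last hour and a series of rolling one-hour success-rate samples per endpoint.

```python
from typing import List, Tuple


def delivery_stalled(successes_last_hour: int, normal_per_hour: float) -> bool:
    """Signal 1: an endpoint that normally receives traffic received none."""
    return normal_per_hour > 0 and successes_last_hour == 0


def success_rate_degraded(
    samples: List[Tuple[float, float]],
    threshold: float = 0.95,
    min_duration_s: float = 3600,
) -> bool:
    """Signal 2: rolling success rate below threshold for more than an hour.

    `samples` holds (timestamp, rolling_1h_success_rate) pairs, oldest first.
    """
    breach_start = None
    for timestamp, rate in samples:
        if rate < threshold:
            if breach_start is None:
                breach_start = timestamp   # start of the current breach
        else:
            breach_start = None            # recovered: reset the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration_s
```

Resetting the clock on recovery is a design choice: a brief dip that clears does not count toward the hour, so the alert fires on sustained degradation rather than on noise.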

Alert routing matters. The webhook sender's team is rarely the right responder; the consumer's team is. Alerts should go to whoever owns the consuming application, not to the sender (here, the form vendor).

The honest pitch

Webhook reliability is the boring infrastructure problem nobody thinks about until it breaks. The shape of getting it right is well understood: exponential backoff with jitter, consumer-side idempotency, a queryable delivery log, replay with confirmation, and two health alerts.

Most webhook outages are not failures of the platform. They are failures of the consumer to implement idempotency or to monitor their own endpoint. The platform's job is to make the failure visible and the replay safe. The consumer's job is to be ready when the replay arrives.

The Field Notes