Webhook retry strategies: exponential backoff explained

Florian Wartner · 2026-05-07 · 6 min read

A webhook receiver that's down for 30 seconds is normal — deploys, network blips, a Lambda cold start that timed out. A webhook sender that gives up after one attempt loses real data. The fix is retries, but naive retries (fixed-interval, no jitter, no idempotency key) cause their own problems: synchronized retry storms, double-processing of the same event, unbounded queue growth.

This post walks through how a well-behaved webhook sender retries — exponential backoff math, jitter, idempotency keys, and the four bugs that get past code review.

The baseline: exponential backoff

If the receiver returns a non-2xx status (or times out, or the connection is reset), retry the delivery. Each retry waits longer than the last:

Attempt 1: immediately
Attempt 2: 30 seconds later
Attempt 3: 90 seconds later  (3× the previous interval)
Attempt 4: 270 seconds later
Attempt 5: 810 seconds later  (~13.5 min)
Attempt 6: 2,430 seconds later  (~40 min)
Attempt 7: 7,290 seconds later  (~2 hours)
Attempt 8: 21,870 seconds later  (~6 hours)

Total span: about nine hours across 8 attempts. It's in the same spirit as Stripe's webhook retry schedule (Stripe retries with exponential backoff for up to three days), and it's what Formspring uses internally.

The math: delay(n) = base × multiplier^(n-1), where n is the retry number (the wait before attempt n+1), base = 30s, and multiplier = 3.
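As a sanity check, the whole schedule falls out of a one-liner (a sketch; the function name is mine):

```python
def backoff_delay(retry: int, base: float = 30.0, multiplier: float = 3.0) -> float:
    """Nominal wait in seconds before retry number `retry` (retry 1 precedes attempt 2)."""
    return base * multiplier ** (retry - 1)

# Retries 1..7: 30.0, 90.0, 270.0, 810.0, 2430.0, 7290.0, 21870.0 seconds
schedule = [backoff_delay(n) for n in range(1, 8)]
```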

Why exponential, not linear?

A linear retry schedule (every 30 seconds, 8 times) means a receiver that's been down for 4 minutes gets hit 8 times in 4 minutes — useless retries that just generate load. Exponential backs off as the outage continues, giving the receiver room to recover.

The flip side: a receiver that's down for hours genuinely won't see your delivery for hours. That's intentional — at that point, the receiver probably needs human attention, not more retries.

Adding jitter

The bug nobody catches: if 1,000 senders all use the exact same backoff schedule, when the receiver comes back up at minute 5, all 1,000 retry simultaneously. Receiver gets a thundering herd, immediately falls over, the cycle repeats.

The fix: add jitter to each delay.

delay(n) = base × multiplier^(n-1) × (0.5 + random(0, 1))

So the actual delays are 50%-150% of the nominal values. With jitter, the 1,000 senders' retries spread over minutes 4-7 instead of all hitting minute 5. The receiver recovers smoothly.
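In code, the jittered delay is one extra line (a sketch; `random.random()` plays the role of random(0, 1)):

```python
import random

def jittered_delay(retry: int, base: float = 30.0, multiplier: float = 3.0) -> float:
    """Nominal exponential delay, scaled to a random 50%-150% of itself."""
    nominal = base * multiplier ** (retry - 1)
    return nominal * (0.5 + random.random())
```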

Variants:

  • Full jitter: delay = random(0, base × multiplier^(n-1)). More randomness, sometimes too aggressive.
  • Equal jitter: delay = (base × multiplier^(n-1)) / 2 + random(0, base × multiplier^(n-1) / 2). Less aggressive.
  • Decorrelated jitter: each retry's delay is bounded by the previous delay × 3, with random in between. Good for very high-cardinality senders.

For a single-tenant form-backend like Formspring sending to one receiver per form, decorrelated jitter is overkill; equal jitter is fine.
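For reference, the three variants as functions (a sketch; the decorrelated bounds follow the common "random between base and 3× the previous delay, capped" formulation):

```python
import random

def full_jitter(nominal: float) -> float:
    # Anywhere from zero to the full nominal delay
    return random.uniform(0.0, nominal)

def equal_jitter(nominal: float) -> float:
    # Half the nominal delay is guaranteed; the other half is randomized
    return nominal / 2 + random.uniform(0.0, nominal / 2)

def decorrelated_jitter(previous: float, base: float = 30.0, cap: float = 21870.0) -> float:
    # Next delay drawn between base and 3x the previous actual delay, capped
    return min(cap, random.uniform(base, previous * 3))
```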

Idempotency keys: the receiver's contract

Retries mean the same logical event might arrive twice — first attempt timed out at the network layer (so the sender retries), but the receiver actually processed it. Without an idempotency key, the receiver double-processes: two CRM rows, two Slack notifications, two welcome emails.

The contract: every webhook delivery includes a unique idempotency key. The receiver records it; subsequent deliveries with the same key return success without reprocessing.

POST /webhook
X-Formspring-Signature: t=1715090123,v1=8e2c…
X-Formspring-Delivery: dlv_01HRX8YE9F5G7J0K3M5N7P9Q1R
Content-Type: application/json

{"submission_id":"sub_abc","data":{…}}

Receiver pseudocode:

def handle_webhook(request):
    delivery_id = request.headers.get("X-Formspring-Delivery")

    # Idempotency check. A plain exists()-then-create has a race under
    # concurrent deliveries; a unique constraint on id (or get_or_create)
    # closes it.
    if Delivery.objects.filter(id=delivery_id).exists():
        return Response(status=200)  # Already processed

    # Process the event, then record the delivery
    event = json.loads(request.body)
    process_submission(event)
    Delivery.objects.create(id=delivery_id, processed_at=now())

    return Response(status=200)

The store can be Redis (fast, with TTL), a database table, or even a file. Storage cost is one row per delivery; cheap.
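A minimal database-table version, with in-memory SQLite standing in and a table name of my choosing; the primary-key constraint makes the check-and-record step atomic, closing the race that a separate check-then-insert leaves open:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS deliveries (id TEXT PRIMARY KEY, processed_at TEXT)"
)

def already_processed(delivery_id: str) -> bool:
    """Record delivery_id; True if a previous delivery already claimed it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO deliveries (id, processed_at) VALUES (?, datetime('now'))",
                (delivery_id,),
            )
        return False
    except sqlite3.IntegrityError:  # duplicate primary key: already seen
        return True
```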

The four retry-related bugs that get past code review

1. Treating every response as success

Forgetting to check the response status code:

# WRONG
response = requests.post(webhook_url, json=payload)
# No status check — the next "if response.failed: retry" never fires

Always check that response.status_code is in the 2xx range before treating the delivery as successful. 4xx means "client error, don't retry" (the receiver said the request is malformed; retrying won't help). 5xx means "server error, retry."

2. Retrying on 4xx

# WRONG
if response.status_code != 200:
    queue_retry(payload)

A 401 or 422 won't fix itself by retrying. The webhook secret is wrong, or the payload schema is wrong, or the receiver has rejected this specific request. Retrying these wastes resources and might trip rate limits.

Retry only on 5xx (server error), on 429 (rate-limited; back off, ideally honoring the Retry-After header), and on network-level failures (timeout, connection reset, DNS failure).
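The whole decision table fits in one small function (a sketch; the function name is mine, and treating 429 as retryable is a common refinement since rate limits clear on their own):

```python
def classify_attempt(status_code=None, network_error=False):
    """Map one delivery attempt's outcome to 'success', 'retry', or 'fail'."""
    if network_error:          # timeout, connection reset, DNS failure
        return "retry"
    if 200 <= status_code < 300:
        return "success"
    if status_code == 429 or status_code >= 500:
        return "retry"         # rate-limited or server error
    return "fail"              # other 4xx: permanent, don't retry
```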

3. No retry budget

# WRONG
def deliver_with_retry(payload):
    while True:  # ← unbounded
        response = requests.post(url, json=payload, timeout=30)
        if response.ok:
            return
        time.sleep(backoff_delay())

If the receiver is down for a week, this thread never returns. Set a max attempt count (8 is reasonable) or a max total elapsed time (24 hours).
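A bounded version of the loop (a sketch; `send` is any callable returning True on success, and `base` is a parameter so tests can skip the sleeps):

```python
import time

def deliver_with_retry(send, payload, max_attempts=8, base=30.0, multiplier=3.0):
    """Try up to max_attempts times with exponential backoff; False = budget exhausted."""
    for attempt in range(1, max_attempts + 1):
        if send(payload):
            return True
        if attempt < max_attempts:  # no sleep after the final attempt
            time.sleep(base * multiplier ** (attempt - 1))
    return False  # permanently failed; eligible for manual replay
```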

4. Synchronous retry blocking the request thread

Retrying inside the same request thread that received the original webhook trigger blocks everything else. If you're sending 1,000 webhooks per minute and one receiver is slow, you exhaust the thread pool waiting on it.

Always offload to a background queue (Sidekiq, RQ, Laravel queues, BullMQ, Cloudflare Queues, AWS SQS). The original request thread enqueues a job and returns immediately. The job worker handles delivery + retries.
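The shape of that hand-off, with a plain thread and queue standing in for the job system (a sketch; `deliver` is whatever function owns the delivery-plus-retry logic):

```python
import queue
import threading

jobs = queue.Queue()

def handle_trigger(payload):
    """Runs on the request thread: enqueue and return immediately."""
    jobs.put(payload)
    return 202  # Accepted; delivery happens elsewhere

def worker(deliver):
    """Runs on a background thread: delivery and retries block here, not the request."""
    while True:
        payload = jobs.get()
        deliver(payload)
        jobs.task_done()
```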

Replay: the human-in-the-loop fallback

After the automated retry budget is exhausted, the delivery is "permanently failed." That's a misnomer — sometimes the receiver just had a longer outage than expected. A good webhook sender lets you manually replay failed deliveries from the dashboard.

Replay is just another delivery attempt with the same idempotency key. If the receiver has finally recovered, the message gets through, and idempotency means replaying twice produces the same result as replaying once: no double-processing.

Putting it all together: Formspring's actual retry schedule

Attempt 1: t+0s
Attempt 2: t+30s ± 50% jitter
Attempt 3: t+90s ± 50% jitter
Attempt 4: t+270s ± 50% jitter (~4.5min)
Attempt 5: t+810s ± 50% jitter (~13.5min)
Attempt 6: t+2,430s ± 50% jitter (~40min)
Attempt 7: t+7,290s ± 50% jitter (~2hr)
Attempt 8: t+21,870s ± 50% jitter (~6hr)

Each delivery gets:

  • X-Formspring-Signature HMAC header (Stripe-style)
  • X-Formspring-Delivery unique ID for idempotency
  • 30-second connect timeout, 30-second read timeout

Only 5xx and network errors trigger retries. 4xx is a permanent failure (logged in dashboard, ready for manual replay if you fix the receiver). After 8 attempts, manual replay is the escape hatch.

What this means for your receiver

If you're building the receiver side of webhook delivery:

  1. Verify HMAC signatures — see How to verify HMAC webhook signatures in Node, PHP, and Python.
  2. Track idempotency keys — Redis with 7-day TTL is sufficient.
  3. Return 2xx fast — get the work off the request thread, then process asynchronously. The sender expects sub-30-second responses.
  4. Return 5xx if you genuinely can't process now so the sender retries; return 2xx if you've enqueued the work; return 4xx only if the request is malformed and retrying won't help.
  5. Idempotent processing — if the same event arrives twice, your handler should produce the same result.

Formspring's webhook delivery is fully retry-aware — try the free tier and watch your delivery log: timestamps, status codes, latencies, manual replay buttons. The retry math is invisible until something fails; then it's the difference between "we lost the lead from yesterday" and "the lead arrived after their server came back up."

Florian Wartner

Founder of Formspring and Pixel & Process. Senior Laravel and Vue engineer based in Lübeck, Germany. Building developer-first SaaS with EU data residency and honest pricing.

Ship your form in two minutes.

No credit card. 50 free submissions a month, every month.