Backend Engineering

Webhook Design Patterns: How I Built a System That Never Loses an Event

Learn reliable webhook patterns including retry strategies, idempotency, signature verification, and dead letter queues with production code examples.

By SouvenirList

Last year, our payment provider silently dropped three webhook notifications over a weekend. Three customers paid for their subscriptions, but our system never activated their accounts. We did not find out until Monday morning when the support queue exploded with angry tickets. The total revenue at risk was small, but the trust damage was enormous.

That incident forced me to rethink everything about how we handle webhooks. The default approach — spin up an endpoint, parse the JSON, do some work — is dangerously fragile. Webhooks are fire-and-forget by design, which means if your receiver fails, the sender usually does not care. The responsibility for reliability falls entirely on you.

This guide covers the patterns I have built and battle-tested for making webhook systems that never lose an event, even when things go wrong.


TL;DR — Webhook Reliability Patterns

Pattern                        | What It Solves                     | Complexity | Priority
-------------------------------|------------------------------------|------------|---------
Signature Verification         | Prevents spoofed events            | Low        | Critical
Respond First, Process Later   | Avoids sender timeouts             | Low        | Critical
Idempotency Keys               | Prevents duplicate processing      | Medium     | High
Exponential Retry with Backoff | Handles transient failures         | Medium     | High
Dead Letter Queue              | Captures permanently failed events | Medium     | High
Event Log Table                | Enables audit and replay           | Medium     | Medium
Circuit Breaker                | Prevents cascade failures          | High       | Medium

If you implement nothing else: verify signatures, respond with 200 immediately, and store the raw payload before processing.


Why Webhooks Are Harder Than They Look

On the surface, a webhook receiver is the simplest backend code you will ever write. Accept a POST request, read the body, do something with it. Five lines of code in any framework.

The problem is that webhooks operate in a fundamentally adversarial environment. The sender controls the timing, the volume, and the retry behavior. Your receiver has to handle all of these scenarios:

  • Duplicate deliveries: Most webhook providers retry on timeout, and “timeout” can mean your server processed the event but responded too slowly. Now you have processed it twice.
  • Out-of-order delivery: Event A happens before Event B, but Event B’s webhook arrives first. If you process them in arrival order, your data is wrong.
  • Payload spoofing: Without signature verification, anyone who discovers your endpoint URL can send fake events.
  • Thundering herd: A provider recovers from an outage and replays hours of queued webhooks simultaneously. Your database connection pool evaporates.

I have encountered every single one of these in production. Let me walk you through the patterns that solve them.


Pattern 1 — Signature Verification

This is non-negotiable. Every webhook you accept without verifying the signature is a potential security hole. An attacker who discovers your endpoint can send fake payment confirmations, fake user deletions, or fake anything.

Most providers sign their payloads using HMAC-SHA256. Here is how to verify it properly:

import hashlib
import hmac
import json

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = "whsec_your_secret_here"  # load from an environment variable in production

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

@app.post("/webhooks/payment")
async def handle_payment_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature", "")
    
    if not verify_signature(payload, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")
    
    # Process the verified event
    event = json.loads(payload)
    await process_event(event)
    return {"status": "ok"}

Critical detail: use hmac.compare_digest() instead of == for the comparison. A regular string comparison is vulnerable to timing attacks — an attacker can measure response times to guess the correct signature one character at a time. compare_digest runs in constant time regardless of where the strings differ.

Also, verify against the raw request body, not a re-serialized version of the parsed JSON. JSON serialization is not deterministic — key ordering and whitespace can differ between libraries, which will produce a different hash.
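A quick demonstration of that failure mode, using a made-up secret and payload: parsing and re-serializing changes the bytes, so the signature no longer matches.

```python
import hashlib
import hmac
import json

secret = b"whsec_demo"  # made-up secret for illustration
raw_body = b'{"id": "evt_1",  "amount": 100}'  # note the extra space

sig_raw = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

# Re-serializing normalizes whitespace (and may reorder keys),
# producing different bytes and therefore a different HMAC.
reserialized = json.dumps(json.loads(raw_body)).encode()
sig_reserialized = hmac.new(secret, reserialized, hashlib.sha256).hexdigest()

print(sig_raw == sig_reserialized)  # → False
```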


Pattern 2 — Respond First, Process Later

This is the single most important reliability pattern for webhooks. Most providers have a timeout of 5-30 seconds. If your processing takes longer than that, the provider marks the delivery as failed and retries — even though you already started processing it.

The fix is simple: acknowledge the webhook immediately, then process it asynchronously.

import json
import uuid
from datetime import datetime

from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()

@app.post("/webhooks/payment")
async def handle_payment_webhook(
    request: Request,
    background_tasks: BackgroundTasks
):
    payload = await request.body()
    
    # Step 1: Verify signature (fast)
    verify_or_reject(payload, request.headers)
    
    # Step 2: Store the raw event (fast)
    event_id = await store_raw_event(payload)
    
    # Step 3: Queue background processing and return 200 immediately
    background_tasks.add_task(process_webhook_event, event_id)
    return {"status": "received", "event_id": event_id}


async def store_raw_event(payload: bytes, source: str = "payment") -> str:
    event = json.loads(payload)
    # Fall back to a random ID if the provider does not supply one
    event_id = event.get("id") or str(uuid.uuid4())
    
    await db.execute("""
        INSERT INTO webhook_events (event_id, source, payload, status, received_at)
        VALUES ($1, $2, $3, 'pending', $4)
        ON CONFLICT (event_id) DO NOTHING
    """, event_id, source, payload.decode(), datetime.utcnow())
    
    return event_id

The key insight: your webhook endpoint is not where business logic lives. It is a thin ingestion layer. Its only job is to accept the payload, verify it, store it, and return 200. All the real work happens in a background worker that reads from the event store.

This pattern also gives you free retry capability. If your background processor crashes, the event is still in the database with a pending status. A periodic sweep job can pick up any events that were received but never processed.
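That sweep can be sketched with plain dicts standing in for database rows; the `find_stuck_events` helper and its five-minute grace period are illustrative, not part of the schema shown later:

```python
from datetime import datetime, timedelta, timezone

def find_stuck_events(events, now, grace=timedelta(minutes=5)):
    """Find events still 'pending' after a grace period: stored, but the
    background task never ran (e.g. the worker crashed)."""
    return [
        e["event_id"]
        for e in events
        if e["status"] == "pending" and now - e["received_at"] > grace
    ]

now = datetime.now(timezone.utc)
events = [
    {"event_id": "evt_1", "status": "pending",
     "received_at": now - timedelta(minutes=10)},   # stuck: re-enqueue
    {"event_id": "evt_2", "status": "pending",
     "received_at": now - timedelta(seconds=30)},   # fresh: leave alone
    {"event_id": "evt_3", "status": "completed",
     "received_at": now - timedelta(hours=1)},      # done: ignore
]

print(find_stuck_events(events, now))  # → ['evt_1']
```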


Pattern 3 — Idempotency Keys

Duplicate webhook deliveries are not a bug — they are a feature. Providers deliberately retry to ensure at-least-once delivery. Your system needs to handle the same event arriving two, three, or ten times without processing it multiple times.

The standard approach is to track which events you have already processed:

async def process_webhook_event(event_id: str):
    # Atomic check-and-claim using database locking
    result = await db.execute("""
        UPDATE webhook_events
        SET status = 'processing', started_at = NOW()
        WHERE event_id = $1 AND status = 'pending'
        RETURNING event_id
    """, event_id)
    
    if not result:
        # Already processing or processed — skip
        return
    
    try:
        event = await db.fetchone(
            "SELECT payload FROM webhook_events WHERE event_id = $1",
            event_id
        )
        payload = json.loads(event["payload"])
        
        # Your actual business logic
        await handle_payment_event(payload)
        
        await db.execute("""
            UPDATE webhook_events
            SET status = 'completed', completed_at = NOW()
            WHERE event_id = $1
        """, event_id)
        
    except Exception as e:
        await db.execute("""
            UPDATE webhook_events
            SET status = 'failed', error = $2, attempts = attempts + 1
            WHERE event_id = $1
        """, event_id, str(e))
        raise

Why the database lock matters: if you check status == 'pending' and then update in separate queries, two workers can both see pending and both start processing. The UPDATE ... WHERE status = 'pending' RETURNING pattern is atomic — only one worker wins the race.
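The race can be shown in miniature with two threads, an in-process lock standing in for the single-statement atomicity the database provides (illustrative code, not the worker above):

```python
import threading

event = {"status": "pending"}
claim_lock = threading.Lock()

def claim(ev):
    """Mirror of UPDATE ... WHERE status = 'pending' RETURNING: the check
    and the state change happen as one atomic step, so only one caller
    can ever win."""
    with claim_lock:
        if ev["status"] != "pending":
            return False
        ev["status"] = "processing"
        return True

results = []
threads = [threading.Thread(target=lambda: results.append(claim(event)))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # → [False, True]
```

Without the lock (or without the atomic `UPDATE`), both workers could observe `pending` before either writes `processing`, and both would proceed.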

If the webhook provider does not include an event ID (some do not), generate a deterministic one by hashing the payload contents. This way, identical payloads always map to the same ID:

import hashlib

def generate_idempotency_key(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

Pattern 4 — Exponential Retry with Backoff

When your webhook processor fails, you need a retry strategy that balances speed with safety. Retrying immediately in a tight loop is a recipe for amplifying failures. If the database is overloaded, hammering it with retries makes things worse.

Exponential backoff with jitter is the standard approach:

import asyncio
import random

async def retry_with_backoff(
    func,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 300.0
):
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Final attempt, let it fail
            
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            
            print(f"Attempt {attempt + 1} failed: {e}. "
                  f"Retrying in {delay + jitter:.1f}s")
            await asyncio.sleep(delay + jitter)


# Usage in the webhook processor
async def process_with_retry(event_id: str):
    async def _process():
        event = await db.fetchone(
            "SELECT payload FROM webhook_events WHERE event_id = $1",
            event_id
        )
        await handle_payment_event(json.loads(event["payload"]))
    
    try:
        await retry_with_backoff(_process, max_retries=5)
        await mark_completed(event_id)
    except Exception:
        await mark_failed(event_id)

The retry schedule with these defaults: 1s, 2s, 4s, 8s (plus jitter). The fifth attempt raises instead of sleeping, so the total wait before giving up is about 15 seconds. For webhooks that interact with external services, I typically increase base_delay to 5 seconds and max_retries to 8, giving delays of 5s, 10s, 20s, 40s, 80s, 160s, and 300s (capped by max_delay), a total retry window of roughly 10 minutes.

Why jitter matters: without jitter, if 100 webhook events all fail at the same time (say, during a brief database hiccup), they all retry at the same time. The synchronized retry creates a thundering herd that can cause the same failure all over again. Adding random jitter spreads the retries across time.
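To make the numbers concrete, here is a small helper that computes the delays retry_with_backoff sleeps between attempts (the helper itself is just for illustration):

```python
import random

def backoff_schedule(max_retries=5, base_delay=1.0,
                     max_delay=300.0, jitter_frac=0.1):
    """Delays slept between attempts; the final attempt raises instead of
    sleeping, so there are max_retries - 1 entries."""
    delays = []
    for attempt in range(max_retries - 1):
        delay = min(base_delay * (2 ** attempt), max_delay)
        delays.append(delay + random.uniform(0, delay * jitter_frac))
    return delays

# Without jitter, the default schedule is 1s, 2s, 4s, 8s: about 15s total.
raw = [min(1.0 * 2 ** a, 300.0) for a in range(4)]
print(raw, sum(raw))  # → [1.0, 2.0, 4.0, 8.0] 15.0
```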


Pattern 5 — Dead Letter Queue

Even with retries, some events will permanently fail. Maybe the payload references a user that was deleted, or the event type is one your code does not handle yet. These events need to go somewhere you can inspect and replay them later.

async def move_to_dead_letter(event_id: str, error: str):
    # Carry source along: webhook_dead_letters.source is NOT NULL
    await db.execute("""
        INSERT INTO webhook_dead_letters (
            event_id, source, payload, error, original_received_at, moved_at
        )
        SELECT event_id, source, payload, $2, received_at, NOW()
        FROM webhook_events
        WHERE event_id = $1
    """, event_id, error)
    
    await db.execute("""
        UPDATE webhook_events SET status = 'dead_letter'
        WHERE event_id = $1
    """, event_id)

# Periodic sweep to catch stuck events
async def sweep_failed_events():
    """Move events that have exceeded max retry attempts."""
    stuck_events = await db.fetch("""
        SELECT event_id, error
        FROM webhook_events
        WHERE status = 'failed'
          AND attempts >= 5
          AND updated_at < NOW() - INTERVAL '1 hour'
    """)
    
    for event in stuck_events:
        await move_to_dead_letter(
            event["event_id"],
            f"Max retries exceeded: {event['error']}"
        )

Your dead letter queue should be easy to inspect and replay. I build a simple admin endpoint for this:

@app.post("/admin/webhooks/replay/{event_id}")
async def replay_dead_letter(event_id: str):
    event = await db.fetchone(
        "SELECT source, payload FROM webhook_dead_letters WHERE event_id = $1",
        event_id
    )
    if not event:
        raise HTTPException(404, "Event not found in dead letter queue")
    
    # Re-insert as pending for reprocessing (source is NOT NULL in the schema)
    await db.execute("""
        INSERT INTO webhook_events (event_id, source, payload, status, received_at)
        VALUES ($1, $2, $3, 'pending', NOW())
        ON CONFLICT (event_id) DO UPDATE SET status = 'pending', attempts = 0
    """, event_id, event["source"], event["payload"])
    
    return {"status": "replayed", "event_id": event_id}

I have used this replay capability dozens of times. Once, a third-party API changed their event schema without warning, and all our webhook processing broke for six hours. Because every raw payload was stored in the dead letter queue, we fixed the parser and replayed all 2,400 failed events in under a minute.


Pattern 6 — Event Ordering

Some webhook events have natural ordering. A subscription cannot be cancelled before it is created. An invoice cannot be paid before it is issued. If events arrive out of order, naive processing will produce incorrect state.

There are two approaches I have used in production:

Approach A — Timestamp-Based Resolution

If each event includes a timestamp, use it to detect and resolve ordering conflicts:

async def handle_subscription_event(payload: dict):
    sub_id = payload["subscription_id"]
    event_time = datetime.fromisoformat(payload["timestamp"])
    event_type = payload["type"]
    
    # Only process if this event is newer than what we have
    result = await db.execute("""
        UPDATE subscriptions
        SET status = $2, updated_at = $3
        WHERE id = $1 AND updated_at < $3
        RETURNING id
    """, sub_id, map_status(event_type), event_time)
    
    if not result:
        print(f"Skipping stale event {event_type} for {sub_id}")

Approach B — State Machine Validation

Define valid state transitions and reject events that do not follow them:

VALID_TRANSITIONS = {
    "created":   ["active", "cancelled"],
    "active":    ["paused", "cancelled", "past_due"],
    "paused":    ["active", "cancelled"],
    "past_due":  ["active", "cancelled"],
    "cancelled": []  # Terminal state
}

async def handle_with_state_machine(payload: dict):
    sub_id = payload["subscription_id"]
    new_status = map_status(payload["type"])
    
    current = await db.fetchone(
        "SELECT status FROM subscriptions WHERE id = $1", sub_id
    )
    
    if current and new_status not in VALID_TRANSITIONS.get(current["status"], []):
        # Queue for later reprocessing — the prerequisite event
        # might arrive soon
        await db.execute("""
            UPDATE webhook_events
            SET status = 'deferred', process_after = NOW() + INTERVAL '30 seconds'
            WHERE event_id = $1
        """, payload["event_id"])
        return
    
    await apply_subscription_update(sub_id, new_status)

I prefer Approach A for most cases because it is simpler and does not require you to enumerate every valid transition. But Approach B is better when you need strict consistency guarantees — particularly in financial systems where processing an event in the wrong order could mean charging a customer incorrectly.
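Approach A in miniature, with an in-memory dict standing in for the subscriptions row, shows how a stale event is skipped when deliveries arrive out of order (names here are illustrative):

```python
from datetime import datetime, timezone

# In-memory stand-in for the subscriptions row
subscription = {"status": "created",
                "updated_at": datetime.min.replace(tzinfo=timezone.utc)}

def apply_event(sub, new_status, event_time):
    """Apply the event only if it is newer than the current state."""
    if event_time <= sub["updated_at"]:
        return False  # stale event: skip
    sub["status"] = new_status
    sub["updated_at"] = event_time
    return True

t1 = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 3, 1, 12, 5, tzinfo=timezone.utc)

# The later "cancelled" event arrives first, then the earlier "active" one.
apply_event(subscription, "cancelled", t2)         # applied
applied = apply_event(subscription, "active", t1)  # stale, skipped

print(subscription["status"], applied)  # → cancelled False
```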


The Complete Webhook Database Schema

Here is the schema I use as a starting point for every webhook integration:

CREATE TABLE webhook_events (
    id            BIGSERIAL PRIMARY KEY,
    event_id      VARCHAR(255) UNIQUE NOT NULL,
    source        VARCHAR(100) NOT NULL,     -- 'stripe', 'github', etc.
    event_type    VARCHAR(100),
    payload       JSONB NOT NULL,
    status        VARCHAR(20) DEFAULT 'pending',
    attempts      INT DEFAULT 0,
    error         TEXT,
    received_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    started_at    TIMESTAMPTZ,
    completed_at  TIMESTAMPTZ,
    process_after TIMESTAMPTZ,               -- for deferred events (Pattern 6)
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()  -- maintain via trigger or set in every UPDATE
);

CREATE INDEX idx_webhook_events_status ON webhook_events (status)
    WHERE status IN ('pending', 'processing', 'failed');

CREATE INDEX idx_webhook_events_source ON webhook_events (source, event_type);

CREATE TABLE webhook_dead_letters (
    id                  BIGSERIAL PRIMARY KEY,
    event_id            VARCHAR(255) NOT NULL,
    source              VARCHAR(100) NOT NULL,
    payload             JSONB NOT NULL,
    error               TEXT,
    original_received_at TIMESTAMPTZ,
    moved_at            TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The partial index on status is intentional. Once an event is completed, you almost never query it by status. The partial index keeps the index small and fast for the queries that matter — finding pending and failed events.


Production Checklist

Before shipping a webhook integration to production, I run through this checklist:

Security

  • Signature verification is enabled and tested
  • Webhook secret is stored in environment variables, not code
  • Endpoint is HTTPS-only
  • Request body size is limited (prevent memory exhaustion attacks)

Reliability

  • Endpoint responds with 200 before processing starts
  • Raw payload is stored before any processing
  • Idempotency prevents duplicate processing
  • Failed events retry with exponential backoff
  • Permanently failed events go to dead letter queue

Observability

  • Log every received event (source, type, event ID)
  • Alert on dead letter queue growth
  • Monitor webhook processing latency
  • Track success/failure rates per webhook source

Operations

  • Dead letter events can be replayed via admin tool
  • Webhook endpoint has its own rate limiting
  • Health check endpoint is separate from webhook endpoint
  • Documentation exists for each webhook source and event type

FAQ

How do I test webhooks during local development?

Use a tunneling tool like ngrok or Cloudflare Tunnel to expose your local server to the internet. Then configure the webhook provider to send events to your tunnel URL. For automated testing, record real webhook payloads and replay them in your test suite. I keep a /tests/fixtures/webhooks/ directory with example payloads from every provider we integrate with.
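A minimal sketch of the replay idea: sign a recorded fixture exactly as the provider would, then run it through the same verification path the live endpoint uses (the secret and fixture below are made up):

```python
import hashlib
import hmac
import json

TEST_SECRET = "whsec_test"  # test-only secret, made up for this sketch

def sign(payload: bytes) -> str:
    return hmac.new(TEST_SECRET.encode(), payload, hashlib.sha256).hexdigest()

# A recorded fixture payload, normally loaded from tests/fixtures/webhooks/
fixture = b'{"id": "evt_fixture_1", "type": "payment.succeeded", "amount": 4200}'

# Replay: sign the raw bytes, then verify the same way the endpoint does
signature = sign(fixture)
expected = hmac.new(TEST_SECRET.encode(), fixture, hashlib.sha256).hexdigest()
verified = hmac.compare_digest(expected, signature)

event = json.loads(fixture)
print(verified, event["type"])  # → True payment.succeeded
```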

What should I return when webhook processing fails?

Return 200 OK anyway. If you return a 4xx or 5xx status, the provider will retry, but you have already stored the raw event for internal retry. Returning an error status just creates duplicate deliveries that your idempotency layer has to filter out. The only exception is signature verification failure — return 401 Unauthorized for those because you want the provider to know the secret may be misconfigured.

How long should I keep webhook event logs?

I keep completed events for 90 days and dead letter events for 1 year. The 90-day window covers most dispute and chargeback timelines for payment webhooks. Dead letters stay longer because they often represent edge cases you want to reference when debugging similar issues months later. Set up a scheduled job to archive or delete old events to keep your table performant.

Should I use a message queue instead of a database for webhook processing?

For most teams, the database approach is simpler and sufficient. You already have a database, and the throughput requirements for webhooks are usually modest — even a busy SaaS application might process a few thousand webhook events per day. If you are handling tens of thousands of events per minute, then yes, a dedicated message queue like RabbitMQ or Kafka makes sense. But do not add infrastructure complexity until you actually need it.

How do I handle webhook providers that do not include event IDs?

Generate a deterministic ID by hashing the payload. Use SHA-256 of the raw request body. This guarantees that identical payloads produce the same ID, which is exactly what you need for idempotency. Be aware that some providers add timestamps or nonces to each delivery attempt — if the payload differs between retries, this approach will not deduplicate them. In that case, hash only the stable fields (like the object ID and event type).
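A sketch of hashing only stable fields; which fields are actually stable depends on the provider, so the `fields` default here is purely illustrative:

```python
import hashlib
import json

def stable_idempotency_key(event: dict, fields=("id", "type")) -> str:
    """Hash only fields that are identical across delivery attempts.
    The field names are illustrative; choose whatever is stable for
    your provider."""
    stable = {f: event.get(f) for f in sorted(fields)}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two delivery attempts of the same event, differing only in the nonce:
first  = {"id": "evt_1", "type": "invoice.paid", "nonce": "a1"}
second = {"id": "evt_1", "type": "invoice.paid", "nonce": "b2"}

print(stable_idempotency_key(first) == stable_idempotency_key(second))  # → True
```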


Bottom Line

Webhooks are deceptively simple to implement and deceptively hard to make reliable. The patterns in this guide — verify, acknowledge, store, process, retry, dead-letter — form a pipeline that handles every failure mode I have encountered in production. Start with signature verification and async processing, then add the remaining patterns as your integration matures.



Tags: webhooks backend api system design event-driven reliability
