Webhook Design Patterns: How I Built a System That Never Loses an Event
Learn reliable webhook patterns including retry strategies, idempotency, signature verification, and dead letter queues with production code examples.
Last year, our payment provider silently dropped three webhook notifications over a weekend. Three customers paid for their subscriptions, but our system never activated their accounts. We did not find out until Monday morning when the support queue exploded with angry tickets. The total revenue at risk was small, but the trust damage was enormous.
That incident forced me to rethink everything about how we handle webhooks. The default approach — spin up an endpoint, parse the JSON, do some work — is dangerously fragile. Webhooks are fire-and-forget by design, which means if your receiver fails, the sender usually does not care. The responsibility for reliability falls entirely on you.
This guide covers the patterns I have built and battle-tested for making webhook systems that never lose an event, even when things go wrong.
TL;DR — Webhook Reliability Patterns
| Pattern | What It Solves | Complexity | Priority |
|---|---|---|---|
| Signature Verification | Prevents spoofed events | Low | Critical |
| Respond First, Process Later | Avoids sender timeouts | Low | Critical |
| Idempotency Keys | Prevents duplicate processing | Medium | High |
| Exponential Retry with Backoff | Handles transient failures | Medium | High |
| Dead Letter Queue | Captures permanently failed events | Medium | High |
| Event Log Table | Enables audit and replay | Medium | Medium |
| Circuit Breaker | Prevents cascade failures | High | Medium |
If you implement nothing else: verify signatures, respond with 200 immediately, and store the raw payload before processing.
Why Webhooks Are Harder Than They Look
On the surface, a webhook receiver is the simplest backend code you will ever write. Accept a POST request, read the body, do something with it. Five lines of code in any framework.
The problem is that webhooks operate in a fundamentally adversarial environment. The sender controls the timing, the volume, and the retry behavior. Your receiver has to handle all of these scenarios:
- Duplicate deliveries: Most webhook providers retry on timeout, and “timeout” can mean your server processed the event but responded too slowly. Now you have processed it twice.
- Out-of-order delivery: Event A happens before Event B, but Event B’s webhook arrives first. If you process them in arrival order, your data is wrong.
- Payload spoofing: Without signature verification, anyone who discovers your endpoint URL can send fake events.
- Thundering herd: A provider recovers from an outage and replays hours of queued webhooks simultaneously. Your database connection pool evaporates.
I have encountered every single one of these in production. Let me walk you through the patterns that solve them.
Pattern 1 — Signature Verification
This is non-negotiable. Every webhook you accept without verifying the signature is a potential security hole. An attacker who discovers your endpoint can send fake payment confirmations, fake user deletions, or fake anything.
Most providers sign their payloads using HMAC-SHA256. Here is how to verify it properly:
```python
import hashlib
import hmac
import json

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

WEBHOOK_SECRET = "whsec_your_secret_here"  # load from an env var in production

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

@app.post("/webhooks/payment")
async def handle_payment_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature", "")
    if not verify_signature(payload, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")
    # Process the verified event
    event = json.loads(payload)
    await process_event(event)
    return {"status": "ok"}
```
Critical detail: use hmac.compare_digest() instead of == for the comparison. A regular string comparison is vulnerable to timing attacks — an attacker can measure response times to guess the correct signature one character at a time. compare_digest runs in constant time regardless of where the strings differ.
Also, verify against the raw request body, not a re-serialized version of the parsed JSON. JSON serialization is not deterministic — key ordering and whitespace can differ between libraries, which will produce a different hash.
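A quick way to convince yourself: hash the raw bytes and a round-tripped copy of the same payload. Even semantically identical JSON can serialize to different bytes, which breaks the signature check:

```python
import hashlib
import json

# Raw bytes exactly as the provider sent them, including their whitespace
raw = b'{"id": "evt_123",\n  "amount": 500}'

# Round-trip through a JSON library: same data, different bytes
reserialized = json.dumps(json.loads(raw)).encode()

# The digests no longer match, so a signature computed over the raw
# body will never verify against the re-serialized version
assert hashlib.sha256(raw).hexdigest() != hashlib.sha256(reserialized).hexdigest()
```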
Pattern 2 — Respond First, Process Later
This is the single most important reliability pattern for webhooks. Most providers have a timeout of 5-30 seconds. If your processing takes longer than that, the provider marks the delivery as failed and retries — even though you already started processing it.
The fix is simple: acknowledge the webhook immediately, then process it asynchronously.
```python
import json
import uuid
from datetime import datetime, timezone

from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()

@app.post("/webhooks/payment")
async def handle_payment_webhook(
    request: Request,
    background_tasks: BackgroundTasks
):
    payload = await request.body()
    # Step 1: Verify signature (fast)
    verify_or_reject(payload, request.headers)
    # Step 2: Store the raw event (fast)
    event_id = await store_raw_event(payload)
    # Step 3: Queue the real work and return 200 immediately
    background_tasks.add_task(process_webhook_event, event_id)
    return {"status": "received", "event_id": event_id}

async def store_raw_event(payload: bytes) -> str:
    event = json.loads(payload)
    event_id = event.get("id", str(uuid.uuid4()))
    await db.execute("""
        INSERT INTO webhook_events (event_id, payload, status, received_at)
        VALUES ($1, $2, 'pending', $3)
        ON CONFLICT (event_id) DO NOTHING
    """, event_id, payload.decode(), datetime.now(timezone.utc))
    return event_id
```
The key insight: your webhook endpoint is not where business logic lives. It is a thin ingestion layer. Its only job is to accept the payload, verify it, store it, and return 200. All the real work happens in a background worker that reads from the event store.
This pattern also gives you free retry capability. If your background processor crashes, the event is still in the database with a pending status. A periodic sweep job can pick up any events that were received but never processed.
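A minimal sketch of that sweep job. The `db` and `process` callables are injected here so the loop stays testable; in our codebase they would be the shared connection pool and `process_webhook_event`, and the SQL assumes Postgres (`make_interval`):

```python
import asyncio

async def sweep_stale_pending(db, process, max_age_minutes: int = 10):
    """Re-dispatch events that were stored but never picked up."""
    rows = await db.fetch("""
        SELECT event_id
        FROM webhook_events
        WHERE status = 'pending'
          AND received_at < NOW() - make_interval(mins => $1)
    """, max_age_minutes)
    for row in rows:
        # Safe to re-dispatch: the processor only claims 'pending' rows
        await process(row["event_id"])
```

Run it on a schedule (cron, a background loop, whatever you already have); because the processor's pending-only claim is atomic, double-dispatching an event is harmless.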
Pattern 3 — Idempotency Keys
Duplicate webhook deliveries are not a bug — they are a feature. Providers deliberately retry to ensure at-least-once delivery. Your system needs to handle the same event arriving two, three, or ten times without processing it multiple times.
The standard approach is to track which events you have already processed:
```python
async def process_webhook_event(event_id: str):
    # Atomic check-and-claim using database locking
    result = await db.execute("""
        UPDATE webhook_events
        SET status = 'processing', started_at = NOW()
        WHERE event_id = $1 AND status = 'pending'
        RETURNING event_id
    """, event_id)
    if not result:
        # Already processing or processed — skip
        return

    try:
        event = await db.fetchone(
            "SELECT payload FROM webhook_events WHERE event_id = $1",
            event_id
        )
        payload = json.loads(event["payload"])

        # Your actual business logic
        await handle_payment_event(payload)

        await db.execute("""
            UPDATE webhook_events
            SET status = 'completed', completed_at = NOW()
            WHERE event_id = $1
        """, event_id)
    except Exception as e:
        await db.execute("""
            UPDATE webhook_events
            SET status = 'failed', error = $2, attempts = attempts + 1
            WHERE event_id = $1
        """, event_id, str(e))
        raise
```
Why the atomic claim matters: if you check `status == 'pending'` and then update in separate queries, two workers can both see pending and both start processing. The single `UPDATE ... WHERE status = 'pending' ... RETURNING` statement is atomic — the database locks the row while updating it, so only one worker wins the race.
If the webhook provider does not include an event ID (some do not), generate a deterministic one by hashing the payload contents. This way, identical payloads always map to the same ID:
```python
import hashlib

def generate_idempotency_key(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()
```
Pattern 4 — Exponential Retry with Backoff
When your webhook processor fails, you need a retry strategy that balances speed with safety. Retrying immediately in a tight loop is a recipe for amplifying failures. If the database is overloaded, hammering it with retries makes things worse.
Exponential backoff with jitter is the standard approach:
```python
import asyncio
import random

async def retry_with_backoff(
    func,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 300.0
):
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Final attempt, let it fail
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            print(f"Attempt {attempt + 1} failed: {e}. "
                  f"Retrying in {delay + jitter:.1f}s")
            await asyncio.sleep(delay + jitter)

# Usage in the webhook processor
async def process_with_retry(event_id: str):
    async def _process():
        event = await db.fetchone(
            "SELECT payload FROM webhook_events WHERE event_id = $1",
            event_id
        )
        await handle_payment_event(json.loads(event["payload"]))

    try:
        await retry_with_backoff(_process, max_retries=5)
        await mark_completed(event_id)
    except Exception:
        await mark_failed(event_id)
```
The retry schedule with these defaults: delays of 1s, 2s, 4s, and 8s between the five attempts (plus jitter), so the processor gives up after roughly 15 seconds of waiting. For webhooks that interact with external services, I typically increase base_delay to 5 seconds and max_retries to 8, which stretches the delays to 5s, 10s, 20s, 40s, 80s, 160s, and 300s (the last one capped by max_delay) — a total retry window of roughly 10 minutes.
Why jitter matters: without jitter, if 100 webhook events all fail at the same time (say, during a brief database hiccup), they all retry at the same time. The synchronized retry creates a thundering herd that can cause the same failure all over again. Adding random jitter spreads the retries across time.
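When tuning these parameters, I find it helpful to compute the delay schedule up front rather than reason about it in my head. A small helper (not part of the retry code itself — just a sanity check for a given configuration):

```python
def backoff_schedule(max_retries: int = 5, base_delay: float = 1.0,
                     max_delay: float = 300.0) -> list[float]:
    # One delay is slept after each failed attempt except the last,
    # so max_retries attempts produce max_retries - 1 delays
    return [min(base_delay * (2 ** i), max_delay) for i in range(max_retries - 1)]

print(backoff_schedule())             # [1.0, 2.0, 4.0, 8.0]
print(sum(backoff_schedule(8, 5.0)))  # 615.0 seconds, about 10 minutes
```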
Pattern 5 — Dead Letter Queue
Even with retries, some events will permanently fail. Maybe the payload references a user that was deleted, or the event type is one your code does not handle yet. These events need to go somewhere you can inspect and replay them later.
```python
async def move_to_dead_letter(event_id: str, error: str):
    await db.execute("""
        INSERT INTO webhook_dead_letters (
            event_id, source, payload, error, original_received_at, moved_at
        )
        SELECT event_id, source, payload, $2, received_at, NOW()
        FROM webhook_events
        WHERE event_id = $1
    """, event_id, error)
    await db.execute("""
        UPDATE webhook_events SET status = 'dead_letter'
        WHERE event_id = $1
    """, event_id)

# Periodic sweep to catch stuck events
async def sweep_failed_events():
    """Move events that have exceeded max retry attempts."""
    stuck_events = await db.fetch("""
        SELECT event_id, error
        FROM webhook_events
        WHERE status = 'failed'
          AND attempts >= 5
          AND updated_at < NOW() - INTERVAL '1 hour'
    """)
    for event in stuck_events:
        await move_to_dead_letter(
            event["event_id"],
            f"Max retries exceeded: {event['error']}"
        )
```
Your dead letter queue should be easy to inspect and replay. I build a simple admin endpoint for this:
```python
@app.post("/admin/webhooks/replay/{event_id}")
async def replay_dead_letter(event_id: str):
    event = await db.fetchone(
        "SELECT payload FROM webhook_dead_letters WHERE event_id = $1",
        event_id
    )
    if not event:
        raise HTTPException(404, "Event not found in dead letter queue")

    # Re-insert as pending for reprocessing
    await db.execute("""
        INSERT INTO webhook_events (event_id, payload, status, received_at)
        VALUES ($1, $2, 'pending', NOW())
        ON CONFLICT (event_id) DO UPDATE SET status = 'pending', attempts = 0
    """, event_id, event["payload"])
    return {"status": "replayed", "event_id": event_id}
```
I have used this replay capability dozens of times. Once, a third-party API changed their event schema without warning, and all our webhook processing broke for six hours. Because every raw payload was stored in the dead letter queue, we fixed the parser and replayed all 2,400 failed events in under a minute.
Pattern 6 — Event Ordering
Some webhook events have natural ordering. A subscription cannot be cancelled before it is created. An invoice cannot be paid before it is issued. If events arrive out of order, naive processing will produce incorrect state.
There are two approaches I have used in production:
Approach A — Timestamp-Based Resolution
If each event includes a timestamp, use it to detect and resolve ordering conflicts:
```python
async def handle_subscription_event(payload: dict):
    sub_id = payload["subscription_id"]
    event_time = datetime.fromisoformat(payload["timestamp"])
    event_type = payload["type"]

    # Only process if this event is newer than what we have
    result = await db.execute("""
        UPDATE subscriptions
        SET status = $2, updated_at = $3
        WHERE id = $1 AND updated_at < $3
        RETURNING id
    """, sub_id, map_status(event_type), event_time)

    if not result:
        print(f"Skipping stale event {event_type} for {sub_id}")
```
Approach B — State Machine Validation
Define valid state transitions and reject events that do not follow them:
```python
VALID_TRANSITIONS = {
    "created": ["active", "cancelled"],
    "active": ["paused", "cancelled", "past_due"],
    "paused": ["active", "cancelled"],
    "past_due": ["active", "cancelled"],
    "cancelled": []  # Terminal state
}

async def handle_with_state_machine(payload: dict):
    sub_id = payload["subscription_id"]
    new_status = map_status(payload["type"])

    current = await db.fetchone(
        "SELECT status FROM subscriptions WHERE id = $1", sub_id
    )
    if current and new_status not in VALID_TRANSITIONS.get(current["status"], []):
        # Queue for later reprocessing — the prerequisite event
        # might arrive soon
        await db.execute("""
            UPDATE webhook_events
            SET status = 'deferred', process_after = NOW() + INTERVAL '30 seconds'
            WHERE event_id = $1
        """, payload["event_id"])
        return

    await apply_subscription_update(sub_id, new_status)
```
I prefer Approach A for most cases because it is simpler and does not require you to enumerate every valid transition. But Approach B is better when you need strict consistency guarantees — particularly in financial systems where processing an event in the wrong order could mean charging a customer incorrectly.
The Complete Webhook Database Schema
Here is the schema I use as a starting point for every webhook integration:
```sql
CREATE TABLE webhook_events (
    id BIGSERIAL PRIMARY KEY,
    event_id VARCHAR(255) UNIQUE NOT NULL,
    source VARCHAR(100) NOT NULL,  -- 'stripe', 'github', etc.
    event_type VARCHAR(100),
    payload JSONB NOT NULL,
    status VARCHAR(20) DEFAULT 'pending',
    attempts INT DEFAULT 0,
    error TEXT,
    received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    process_after TIMESTAMPTZ,  -- set for deferred events (Pattern 6)
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()  -- maintain via trigger or set on every UPDATE
);

CREATE INDEX idx_webhook_events_status ON webhook_events (status)
    WHERE status IN ('pending', 'processing', 'failed');

CREATE INDEX idx_webhook_events_source ON webhook_events (source, event_type);

CREATE TABLE webhook_dead_letters (
    id BIGSERIAL PRIMARY KEY,
    event_id VARCHAR(255) NOT NULL,
    source VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    error TEXT,
    original_received_at TIMESTAMPTZ,
    moved_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
The partial index on status is intentional. Once an event is completed, you almost never query it by status. The partial index keeps the index small and fast for the queries that matter — finding pending and failed events.
Production Checklist
Before shipping a webhook integration to production, I run through this checklist:
Security
- Signature verification is enabled and tested
- Webhook secret is stored in environment variables, not code
- Endpoint is HTTPS-only
- Request body size is limited (prevent memory exhaustion attacks)
Reliability
- Endpoint responds with 200 before processing starts
- Raw payload is stored before any processing
- Idempotency prevents duplicate processing
- Failed events retry with exponential backoff
- Permanently failed events go to dead letter queue
Observability
- Log every received event (source, type, event ID)
- Alert on dead letter queue growth
- Monitor webhook processing latency
- Track success/failure rates per webhook source
Operations
- Dead letter events can be replayed via admin tool
- Webhook endpoint has its own rate limiting
- Health check endpoint is separate from webhook endpoint
- Documentation exists for each webhook source and event type
FAQ
How do I test webhooks during local development?
Use a tunneling tool like ngrok or Cloudflare Tunnel to expose your local server to the internet. Then configure the webhook provider to send events to your tunnel URL. For automated testing, record real webhook payloads and replay them in your test suite. I keep a /tests/fixtures/webhooks/ directory with example payloads from every provider we integrate with.
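For the fixture-replay approach, I sign each recorded payload with a test-only secret so the full verification path runs in the test suite. The secret and fixture below are made up; the signing simply mirrors the HMAC-SHA256 scheme from Pattern 1:

```python
import hashlib
import hmac

TEST_SECRET = "whsec_test_only"  # hypothetical test-only value, never the real secret

def sign(payload: bytes, secret: str) -> str:
    # Mirror the provider's HMAC-SHA256 signature for replayed fixtures
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Replay a recorded fixture through the same verification path production uses
fixture = b'{"id": "evt_123", "type": "payment.succeeded", "amount": 500}'
assert verify_signature(fixture, sign(fixture, TEST_SECRET), TEST_SECRET)
assert not verify_signature(fixture, sign(fixture, "wrong_secret"), TEST_SECRET)
```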
What should I return when webhook processing fails?
Return 200 OK anyway. If you return a 4xx or 5xx status, the provider will retry, but you have already stored the raw event for internal retry. Returning an error status just creates duplicate deliveries that your idempotency layer has to filter out. The only exception is signature verification failure — return 401 Unauthorized for those because you want the provider to know the secret may be misconfigured.
How long should I keep webhook event logs?
I keep completed events for 90 days and dead letter events for 1 year. The 90-day window covers most dispute and chargeback timelines for payment webhooks. Dead letters stay longer because they often represent edge cases you want to reference when debugging similar issues months later. Set up a scheduled job to archive or delete old events to keep your table performant.
Should I use a message queue instead of a database for webhook processing?
For most teams, the database approach is simpler and sufficient. You already have a database, and the throughput requirements for webhooks are usually modest — even a busy SaaS application might process a few thousand webhook events per day. If you are handling tens of thousands of events per minute, then yes, a dedicated message queue like RabbitMQ or Kafka makes sense. But do not add infrastructure complexity until you actually need it.
How do I handle webhook providers that do not include event IDs?
Generate a deterministic ID by hashing the payload. Use SHA-256 of the raw request body. This guarantees that identical payloads produce the same ID, which is exactly what you need for idempotency. Be aware that some providers add timestamps or nonces to each delivery attempt — if the payload differs between retries, this approach will not deduplicate them. In that case, hash only the stable fields (like the object ID and event type).
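A sketch of that stable-field variant — the field names here are placeholders, so substitute whichever fields your provider keeps constant across redeliveries:

```python
import hashlib
import json

def stable_event_id(payload: dict, fields: tuple = ("object_id", "type")) -> str:
    # Keep only fields that are identical on every delivery attempt
    stable = {k: payload[k] for k in fields if k in payload}
    # sort_keys + fixed separators make the serialization deterministic
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two delivery attempts differing only in a per-attempt timestamp
first = {"object_id": "sub_42", "type": "subscription.updated",
         "sent_at": "2024-05-01T10:00:00Z"}
retry = {"object_id": "sub_42", "type": "subscription.updated",
         "sent_at": "2024-05-01T10:00:31Z"}
assert stable_event_id(first) == stable_event_id(retry)
```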
Bottom Line
Webhooks are deceptively simple to implement and deceptively hard to make reliable. The patterns in this guide — verify, acknowledge, store, process, retry, dead-letter — form a pipeline that handles every failure mode I have encountered in production. Start with signature verification and async processing, then add the remaining patterns as your integration matures.