4 Webhook Patterns for Production
Webhooks are real-time HTTP callbacks — your system reacts the instant something happens instead of polling every few minutes. Naive architectures work fine at low traffic, but as events scale to thousands per second, they fail in catastrophic and unpredictable ways.
Here are the four architecture patterns every production SaaS system needs.
Pattern 1 — Fan-out
Problem: One webhook event (payment.succeeded) needs to trigger email delivery, CRM update, analytics, and database write simultaneously. Sequential processing is slow and creates cascade failure risk.
Solution: Receiver returns HTTP 200 immediately → publishes event to message bus → workers process async, fully parallel and isolated.
Stripe Webhook → [Receiver: return 200] → [Message Bus]
                                          ↙    ↓    ↓    ↘
                                      Email   CRM   DB   Analytics
                                      (async, fully isolated)
Golden rule: Return 200 within 2–3 seconds. Providers enforce short timeouts: Stripe and Shopify retry when no timely response arrives, while GitHub marks the delivery as failed and leaves redelivery to you. Never process inline in the request handler.
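The acknowledge-then-fan-out flow can be sketched with nothing but the standard library. This is a toy in-process stand-in for a real message bus, and `FanOutBus`, `receive`, and the handler names are hypothetical; a production system would publish to an external broker instead.

```python
import queue
import threading


class FanOutBus:
    """Toy in-process bus: one queue and one worker thread per subscriber."""

    def __init__(self):
        self._queues = []

    def subscribe(self, handler):
        q = queue.Queue()

        def worker():
            while True:
                event = q.get()
                try:
                    handler(event)
                except Exception as exc:
                    # A failing subscriber must not affect the others
                    print(f"{handler!r} failed: {exc!r}")
                finally:
                    q.task_done()

        threading.Thread(target=worker, daemon=True).start()
        self._queues.append(q)

    def publish(self, event):
        # Each subscriber gets its own copy, so handlers cannot interfere
        for q in self._queues:
            q.put(dict(event))

    def drain(self):
        # Block until every subscriber has handled everything published so far
        for q in self._queues:
            q.join()


def receive(bus, event):
    """Receiver: hand the event to the bus and acknowledge right away."""
    bus.publish(event)
    return 200
```

Because each subscriber owns its queue and thread, an exception in the CRM handler leaves email and database writes untouched, which is the isolation property the pattern is after.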
Pattern 2 — Queue-based Buffering
Problem: Traffic spikes (flash sales, Product Hunt launches) generate thousands of simultaneous events. Downstream services get overwhelmed → timeouts → retries → exponentially worse congestion.
Solution: Redis Streams or RabbitMQ as a buffer between receiver and workers.
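import json
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def receive_webhook(payload: dict):
    # Append the event to a capped stream; workers consume it asynchronously
    r.xadd('webhook:events', {
        'event_id': payload.get('id', ''),    # Redis rejects None values
        'event_type': payload.get('type', ''),
        'data': json.dumps(payload)
    }, maxlen=10000)
    return {'status': 'queued'}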
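A worker pool reads from the queue and scales horizontally on demand. The queue absorbs spikes, and when workers read through a consumer group and acknowledge each entry only after processing it, delivery is at-least-once even if a worker crashes mid-event. Note that maxlen caps the stream: size it above your worst-case backlog, or the oldest unprocessed entries may be trimmed.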
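At-least-once delivery with Redis Streams depends on consumer groups and explicit acknowledgements. The following is a minimal worker sketch under a few assumptions: `client` is a redis-py client created with `decode_responses=True`, the group name `webhook-workers` is hypothetical, and `handler` is whatever function processes one event.

```python
import json

STREAM = 'webhook:events'
GROUP = 'webhook-workers'  # hypothetical consumer group name


def ensure_group(client):
    """Create the consumer group once; ignore the error if it already exists."""
    try:
        client.xgroup_create(STREAM, GROUP, id='0', mkstream=True)
    except Exception as exc:
        if 'BUSYGROUP' not in str(exc):
            raise


def process_batch(client, consumer, handler, count=10, block_ms=5000):
    """Read one batch, handle each entry, and acknowledge only on success."""
    batches = client.xreadgroup(GROUP, consumer, {STREAM: '>'},
                                count=count, block=block_ms)
    acked = 0
    for _stream, entries in batches or []:
        for entry_id, fields in entries:
            try:
                handler(json.loads(fields['data']))
            except Exception:
                # Not acknowledged: the entry stays in the group's pending
                # list and can be claimed and retried later
                continue
            client.xack(STREAM, GROUP, entry_id)
            acked += 1
    return acked
```

A worker would call `ensure_group(client)` at startup and then loop over `process_batch(client, 'worker-1', handle_event)`. Entries that were never acknowledged need a separate sweep (for example with `XAUTOCLAIM`) to be picked up again.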
Pattern 3 — Circuit Breaker
Problem: When a downstream service (Slack, HubSpot) is slow or down, timeouts accumulate → thread pool exhaustion → cascading failure brings down the whole system.
Solution: Three-state machine:
- CLOSED: Normal operation
- OPEN: Fail-fast immediately, no service calls (after reaching failure threshold)
- HALF-OPEN: Send one probe → success returns to CLOSED, failure returns to OPEN
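# One breaker per downstream service, each with its own thresholds
slack_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
crm_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=120)

def send_notification(message):
    # Raises immediately, without calling Slack, while the breaker is OPEN
    slack_breaker.call(slack_client.post_message, channel='#alerts', text=message)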
Create one breaker per downstream service — a Slack outage must not trip your CRM breaker.
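To make the three states concrete, here is one possible minimal `CircuitBreaker` with the same constructor arguments and `call` method used above. It is a sketch rather than any particular library's implementation: `CircuitOpenError` and the injectable `clock` parameter are assumptions added for illustration, and the class is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError('circuit open, failing fast')
            self.state = 'HALF-OPEN'  # timeout elapsed: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success (including a successful probe) closes the circuit
        self._failures = 0
        self.state = 'CLOSED'
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens at once; otherwise open at the threshold
        if self.state == 'HALF-OPEN' or self._failures >= self.failure_threshold:
            self.state = 'OPEN'
            self._opened_at = self._clock()
```

Callers should catch `CircuitOpenError` separately from real service errors, since it signals a skipped call rather than a new failure.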
Pattern 4 — Idempotency
Problem: Providers retry when they don't receive HTTP 200 in time. If your server processed the event successfully but the response was dropped, the retry is processed again → double charge, duplicate email, duplicate record.
Solution: Event ID + Redis deduplication with 24-hour TTL.
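def process_idempotent(event_id: str, payload: dict):
    key = f"webhook:processed:{event_id}"
    # SET NX claims the event atomically, so two concurrent retries
    # cannot both pass the check (a plain GET followed by SETEX can race)
    if not r.set(key, 'in_progress', nx=True, ex=86400):
        return {'status': 'already_processed'}
    try:
        result = handle_event(payload)
    except Exception:
        r.delete(key)  # release the claim so a provider retry can reprocess
        raise
    r.setex(key, 86400, json.dumps(result))
    return {'status': 'processed', 'result': result}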
Event ID sources: Stripe uses event.id, GitHub uses X-GitHub-Delivery header, Shopify uses X-Shopify-Webhook-Id. Custom providers: add X-Event-ID with UUID v4.
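A small helper can hide the per-provider differences listed above. This is a sketch: the provider keys (`'stripe'`, `'github'`, `'shopify'`, `'custom'`) and the function name are hypothetical, and `headers` is assumed to be a plain dict.

```python
# Header names as listed above, lowercased because HTTP header
# names are case-insensitive
ID_HEADERS = {
    'github': 'x-github-delivery',
    'shopify': 'x-shopify-webhook-id',
    'custom': 'x-event-id',
}


def extract_event_id(provider: str, headers: dict, payload: dict) -> str:
    if provider == 'stripe':
        event_id = payload.get('id')  # Stripe carries the ID in the event body
    else:
        normalized = {k.lower(): v for k, v in headers.items()}
        event_id = normalized.get(ID_HEADERS[provider])
    if not event_id:
        # Without an ID there is nothing to deduplicate on
        raise ValueError(f'no event ID found for provider {provider!r}')
    return event_id
```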
Security: HMAC Verification
Never trust a webhook payload without verifying its signature. Skipping this means accepting requests from anyone on the internet.
Two most common mistakes:
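- Using == instead of compare_digest → vulnerable to timing attacks
- Parsing JSON before signature verification → the re-serialized body differs from the bytes the provider signed, so the signature never matches. Always verify against the raw request body.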
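# raw_body is the unparsed request bytes; secret is the shared signing key (bytes)
expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
if not hmac.compare_digest(f"sha256={expected}", x_webhook_signature):
    raise HTTPException(status_code=401)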
Beyond signatures, validate the timestamp — reject requests older than 5 minutes to prevent replay attacks.
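Signature and timestamp checks combine naturally into one function. The sketch below assumes a signing scheme in which the provider sends a Unix timestamp and signs `<timestamp>.<raw body>` with HMAC-SHA256; real providers each define their own format, so treat the scheme, the function name, and the parameters as illustrative.

```python
import hashlib
import hmac
import time

MAX_AGE_SECONDS = 300  # reject anything older than 5 minutes


def verify_webhook(secret: bytes, raw_body: bytes, timestamp: str,
                   signature: str, now=None) -> bool:
    """Verify against the raw bytes, before any JSON parsing."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Stale (or far-future) timestamps are treated as replays
    if abs(now - sent_at) > MAX_AGE_SECONDS:
        return False
    # Assumed scheme: HMAC-SHA256 over "<timestamp>.<raw body>"
    signed = timestamp.encode() + b'.' + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes
    return hmac.compare_digest(f'sha256={expected}'.encode(),
                               signature.encode())
```

Including the timestamp in the signed material matters: otherwise an attacker could replay an old body with a fresh timestamp header.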
Dead Letter Queue (DLQ)
Failed events must not be lost. After 3 retries with exponential backoff (1s → 4s → 16s), move to DLQ and alert immediately.
[Main Queue] → [Worker: fail] → [Retry: 1s, 4s, 16s] → [DLQ + Slack Alert]
                                                                ↓
                                                    [Manual Review / Replay]
The DLQ is your insurance policy. No DLQ means accepting data loss when things go wrong.
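The retry schedule and the hand-off to the DLQ can be sketched as follows. The delays follow 4^n seconds for retry n = 0, 1, 2. The names `process_with_retries`, `dead_letters`, and `alert` are hypothetical, and the DLQ is modelled as a plain list where a real system would use a dedicated queue or stream.

```python
import time

MAX_RETRIES = 3


def backoff_delay(retry: int) -> int:
    """Delay in seconds before retry 0, 1, 2: 1, 4, 16."""
    return 4 ** retry


def process_with_retries(event, handler, dead_letters, alert, sleep=time.sleep):
    """One initial attempt plus up to MAX_RETRIES retries, then dead-letter."""
    last_error = None
    for attempt in range(MAX_RETRIES + 1):
        if attempt > 0:
            sleep(backoff_delay(attempt - 1))
        try:
            return handler(event)
        except Exception as exc:
            last_error = exc
    # All attempts failed: keep the event for manual review or replay
    dead_letters.append({'event': event, 'error': repr(last_error)})
    alert(f"event {event.get('id')} moved to DLQ: {last_error!r}")
    return None
```

Storing the original event alongside the error is what makes replay possible once the underlying problem is fixed.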
Webhooks + AI Agents: Real-World Use Cases
Combining webhooks with AI creates genuinely intelligent automation — not just moving data from A to B.
Email Triage Agent: Gmail/Outlook webhook → AI classifier (Claude/GPT-4o) → Leads into HubSpot, support tickets into Zendesk with priority score. Real result: 80% reduction in manual triage time.
Churn Prevention: Stripe webhook (customer.subscription.deleted) → AI analyzes full customer history → personalized intervention: discount email (price objection), check-in call (support issues), feature tutorial (feature gap). Real result: 23% higher recovery rate vs generic win-back emails.
AI Code Review: GitHub webhook (pull_request.opened) → fetch PR diff → Claude Sonnet reviews for bugs/security/performance → posts inline comments + updates Linear ticket automatically.
Integration rule: Always process AI responses asynchronously — never block the webhook handler. Use structured output schemas for reliable parsing. Always have a fallback for when the AI service is unavailable.
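The structured-output and fallback rules can be illustrated with a small validation layer that runs in a background worker, not in the webhook handler. Everything here is a sketch: `call_model` stands for whatever function sends the prompt to your AI provider and returns its raw text reply, and the schema (a `category` plus a 1 to 5 `priority`) is an invented example.

```python
import json

ALLOWED_CATEGORIES = {'lead', 'support', 'other'}
# Safe default used whenever the model is down or its reply is unusable
FALLBACK = {'category': 'other', 'priority': 3, 'source': 'fallback'}


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON reply against a small fixed schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object')
    category = data['category']
    priority = int(data['priority'])
    if category not in ALLOWED_CATEGORIES or not 1 <= priority <= 5:
        raise ValueError('reply outside the allowed schema')
    return {'category': category, 'priority': priority, 'source': 'model'}


def classify_email(text: str, call_model) -> dict:
    """Never raises: falls back to a default route on any failure."""
    try:
        return parse_classification(call_model(text))
    except Exception:
        return dict(FALLBACK)
```

Tagging each result with its `source` makes it easy to measure how often the fallback path is taken and to requeue those items once the AI service recovers.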
Production Checklist
Reliability: Return 200 within 3 seconds, idempotency with Redis (TTL 24h), queue-based when >100 events/min, circuit breaker per service, DLQ with alerting, backoff 1s→4s→16s.
Security: HMAC with timing-safe comparison, validate timestamp (reject if >5 min old), whitelist provider IP ranges, HTTPS-only, rotate secrets every 6 months.
Observability: Log all payloads (redact PII), track latency p50/p99, alert on new DLQ events, real-time health dashboard.
Master these four patterns and you have the foundation for genuinely production-grade automation infrastructure. Continue with n8n AI Workflows with OpenAI and Automate Customer Onboarding.