9 min read · Johnny Unar

Postgres Outbox Beats RabbitMQ for Webhook Ingestion Under 5k RPM

Before you bolt on SQS for webhook reliability, read this. A Postgres outbox table and a Go worker will serve most startups better.

the incident that starts every conversation

A Stripe webhook fires. Your server returns a 500 because your database was briefly overloaded. Stripe retries after 5 seconds, then 30 seconds, then 2 minutes. Meanwhile, your server recovered at the 4-second mark, processed the original event on the third retry, charged the customer, and then your handler ran again on the fourth retry and charged them twice. You now have an angry customer, a support ticket, and a post-mortem where someone writes "we need a proper queue" on a whiteboard.

This is the moment teams reach for RabbitMQ, SQS, or Kafka without thinking through whether the problem actually requires those tools. The operational weight of a dedicated message broker is real: you're now running another stateful service, writing dead-letter queue consumers, configuring visibility timeouts, dealing with at-least-once delivery semantics on top of your existing at-least-once webhook delivery semantics, and paying for the privilege. We've watched several clients go through exactly this cycle, spinning up RabbitMQ on EC2 for a workload that peaks at maybe 800 events per minute from a payment provider and a CRM, and then spending more engineering time babysitting the broker than building product.

You actually need to solve three problems: deduplicate incoming events so retries are harmless, guarantee delivery so an event doesn't get dropped if your worker crashes mid-process, and get visibility into failures without grepping through CloudWatch logs at 2am. None of these requires a dedicated queue. Postgres can handle all three, and if you're already running Postgres, which you almost certainly are, you're adding zero new infrastructure.

idempotency keys done right

Every decent webhook provider sends an idempotency key. Stripe calls it the event ID (evt_1PxK...), GitHub calls it the X-GitHub-Delivery header, Shopify sends X-Shopify-Webhook-Id. The pattern is the same everywhere: a unique string that identifies this logical event, stable across retries.

Most teams store this in a column and do a SELECT before inserting. That's a race condition: two concurrent requests carrying the same event ID can both pass the SELECT check before either INSERT commits, and you're back to processing the same event twice. The fix is a unique index plus INSERT ... ON CONFLICT DO NOTHING, which makes the check and the insert a single atomic statement. In Postgres this looks like:

```sql
INSERT INTO webhook_events (event_id, source, payload, received_at, status)
VALUES ($1, $2, $3, NOW(), 'pending')
ON CONFLICT (event_id) DO NOTHING;
```

If that statement affects zero rows, you return 200 immediately without touching your business logic. The provider thinks delivery succeeded, the duplicate is silently discarded, and your customer doesn't get charged twice. That's the entire deduplication story. No distributed locks, no Redis SETNX, no saga choreography.

The unique index on event_id needs to actually enforce uniqueness at the database level, not just in application code. I've seen codebases with a comment saying "unique per source+event_id" and no actual constraint, which means the guarantee evaporates the moment someone adds a second worker process or deploys two pods simultaneously. Put the constraint in the schema, not in your head.

One nuance: some providers reuse event IDs across environments. Stripe doesn't, but some CRM webhooks do, or their staging environment bleeds event IDs into your production table if you're not careful about routing. Scope your event_id uniqueness to (source, event_id) rather than just event_id and you'll avoid that particular class of confusion.
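With the scoped constraint, the conflict target in the insert widens to match. A sketch of the same statement keyed on the pair:

```sql
-- Dedup scoped to provider + event: two sources can reuse the same ID safely.
INSERT INTO webhook_events (event_id, source, payload, received_at, status)
VALUES ($1, $2, $3, NOW(), 'pending')
ON CONFLICT (source, event_id) DO NOTHING;
```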

the outbox table as your queue

The outbox pattern is nothing exotic. You have a table of pending work, a worker that polls it, and a mechanism to mark items done or failed. Postgres has had the primitives for this forever, and Postgres 17 specifically made some of this nicer with improvements to logical replication and vacuum behavior that reduce table bloat on high-churn workloads.

The schema I'd start with looks roughly like this:

```sql
CREATE TABLE webhook_events (
    id                BIGSERIAL PRIMARY KEY,
    event_id          TEXT NOT NULL,
    source            TEXT NOT NULL,
    payload           JSONB NOT NULL,
    received_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status            TEXT NOT NULL DEFAULT 'pending',
    attempt_count     INT NOT NULL DEFAULT 0,
    last_attempted_at TIMESTAMPTZ,
    error_detail      TEXT,
    UNIQUE (source, event_id)
);

CREATE INDEX ON webhook_events (status, received_at)
    WHERE status IN ('pending', 'retrying');
```

The partial index on status matters a lot here. Once an event is 'completed' or 'dead', you never want it in your worker's scan, and without a partial index, your polling query will slow down as the table grows because it's scanning completed rows that should be invisible to the worker.

Your Go worker does a SELECT ... FOR UPDATE SKIP LOCKED to grab a batch of pending events, processes them, and updates the status. SKIP LOCKED is the important part: multiple worker goroutines can run concurrently without blocking each other or processing the same row twice. This is the Postgres feature that makes the polling pattern actually viable at scale. Without it, you'd get lock contention that would kill your throughput and produce confusing latency spikes.
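The polling query itself is short. A sketch against the schema above (the batch size and ordering are illustrative choices, not gospel):

```sql
-- Grab up to 50 unclaimed events, oldest first. Rows locked by another
-- worker are skipped rather than waited on, so concurrent workers never
-- block each other or double-process a row.
SELECT id, source, event_id, payload, attempt_count
FROM webhook_events
WHERE status IN ('pending', 'retrying')
ORDER BY received_at
LIMIT 50
FOR UPDATE SKIP LOCKED;
```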

At 5,000 events per minute, that's roughly 83 events per second. A single Go worker polling in batches of 50 with a 100ms tick can comfortably handle this on a db.t4g.medium. We've run similar setups at steezr for document processing pipelines that have spiky ingestion patterns, and the bottleneck is almost always the downstream business logic, not the Postgres polling overhead.

dead-letter handling without the drama

A dead-letter queue in the traditional broker sense is a separate queue that failed messages get routed to. In your Postgres outbox, it's just a status transition and optionally a separate side-table you can query without polluting your main events table.

The retry logic I'd recommend: three attempts with exponential backoff (5 seconds, 30 seconds, 5 minutes), and after the third failure, transition the event to 'dead' and write the last error to error_detail. Then have a separate dead_webhook_events table that you INSERT into when something dies, with a foreign key back to the original event.

```sql
CREATE TABLE dead_webhook_events (
    id               BIGSERIAL PRIMARY KEY,
    webhook_event_id BIGINT REFERENCES webhook_events(id),
    failed_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    final_error      TEXT,
    payload_snapshot JSONB
);
```

The payload_snapshot column is worth the storage cost. If you're ever trying to debug why a Shopify order.fulfilled event died three months ago, you want to see exactly what payload your handler received, not reconstruct it from sparse logs.

What you get from this that you don't get from most queue setups: you can write a SQL query to find everything that died this week from Stripe. You can JOIN it against your orders table to see which customers were affected. You can write an admin endpoint that replays a single dead event by ID, setting its status back to 'pending' and resetting attempt_count to 0. Try doing that with an SQS DLQ at 2am when you're half asleep, the Terraform state has drifted, and the IAM role has the wrong permissions. Postgres lets you just run an UPDATE and watch it work.

the Go worker that ties it together

The worker code is intentionally boring. A ticker fires every 100ms, grabs up to 50 pending events with SELECT ... FOR UPDATE SKIP LOCKED, dispatches each to a handler function, and commits or rolls back accordingly. The whole thing is maybe 150 lines of Go with pgx/v5 and no external dependencies besides Postgres.

One thing that trips people up: you want to process each event in its own transaction so that a failure on event N doesn't roll back events 1 through N-1. The batch SELECT should happen in one transaction to lock the rows, but each individual handler invocation should commit independently. A simple way to structure this is to select and lock a batch, mark those rows as claimed with a status UPDATE in the same transaction, commit it (the row locks release at commit, and the status change keeps other workers' polls away), and then spawn goroutines or process sequentially with per-event transactions from that point.
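One way to get that shape is to fold the lock and the status change into a single claim statement, commit it, and then work through the returned rows with per-event transactions. A sketch, assuming a 'processing' status that the polling filter deliberately excludes:

```sql
-- Claim a batch atomically: once this transaction commits, the rows are
-- marked 'processing' and no other worker's poll will see them.
UPDATE webhook_events
SET status = 'processing', last_attempted_at = NOW()
WHERE id IN (
    SELECT id FROM webhook_events
    WHERE status IN ('pending', 'retrying')
    ORDER BY received_at
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING id, source, payload, attempt_count;
```

One caveat with an explicit 'processing' status: a worker that crashes mid-batch leaves rows stuck there, so you'd also want a periodic sweep that returns stale 'processing' rows to 'pending'.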

Another thing: your HTTP handler that receives the webhook from the provider should only INSERT the raw event and return 200. No business logic, no calls to other services, just write to the outbox and respond. The worker handles everything else asynchronously. This decouples your webhook ingestion latency from your processing latency, which matters because if your payment processor's retry window is 5 seconds and your business logic occasionally takes 3 seconds, you're going to generate spurious retries that fill your event table with duplicates you then have to deduplicate. Fast ingestion, slow processing, strict separation.

Error handling in the worker should be loud. If a handler returns an error, log the full error with the event ID, source, and attempt count. If you're using something like Loki or just structured JSON logs to stdout, make sure event_id is a top-level field so you can grep for it. The observability story for a Postgres outbox is genuinely better than most queue systems because your queue IS your database, and your database has a query interface that your monitoring tools already know how to talk to. A simple query like SELECT status, COUNT(*) FROM webhook_events WHERE received_at > NOW() - INTERVAL '1 hour' GROUP BY status gives you a live health dashboard in about 4 seconds.

when you actually do need a real queue

None of this means RabbitMQ and SQS are bad. They're solving real problems at scale. If you're processing 50,000 events per minute from multiple high-volume sources and you need strict ordering guarantees per partition, fan-out to multiple independent consumers, or sub-second processing SLAs with complex routing, then yes, Kafka or a managed queue service is probably the right tool.

The threshold I've landed on from building a variety of systems at steezr is roughly 5,000 events per minute as the inflection point where the operational complexity of a dedicated broker starts to pay for itself. Below that, a Postgres outbox is faster to implement, easier to debug, and has zero additional infrastructure cost if you're already on Postgres, which you almost certainly are.

Also consider your team's operational maturity. If you don't have someone who's run RabbitMQ in production and knows what a queue length of 400k with zero consumers means at midnight, you're going to have a bad time. Postgres has table bloat and vacuum tuning, sure, but those are problems your team likely already has some muscle memory around. Introducing a new stateful service with its own failure modes, its own alerting requirements, and its own deployment pipeline is a real cost that doesn't show up on the architecture diagram.

Start with Postgres. Instrument it well. If you hit the limits, you'll know exactly what the limits are because you'll have data, not guesses, and migrating the outbox pattern to an actual queue is a reasonably clean lift because your interfaces are already defined. The inverse, discovering you didn't need the queue and now need to rip it out, is considerably messier.
