by yurisagalov on 8/3/22, 4:17 PM with 23 comments
by Diggsey on 8/3/22, 5:13 PM
Timestamps don't solve the issue, and neither do "thin payloads": the receiver has no idea how long to wait before it can assume the order is settled, and a problem on the sender side could cause logic errors for all of your clients.
Most of these problems are solved if the receiver doesn't process the webhook immediately, but instead queues it internally. You don't have issues with the queue being stalled due to one bad webhook, because there is no event-specific processing happening on the receiver (other than perhaps ignoring some events). The queue can still be stalled if there is a wider problem, but as soon as the problem is resolved, the system can catch up on those queued webhooks, and synchronization integrity is maintained.
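A minimal sketch of that receiver pattern, assuming a Flask endpoint and SQLite as the internal queue (the route and table name here are made up for illustration):

    import json
    import sqlite3

    from flask import Flask, request

    app = Flask(__name__)
    db = sqlite3.connect("webhooks.db", check_same_thread=False)
    db.execute(
        "CREATE TABLE IF NOT EXISTS webhook_queue ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  body TEXT NOT NULL,"
        "  processed INTEGER NOT NULL DEFAULT 0)"
    )

    @app.route("/webhooks", methods=["POST"])
    def receive_webhook():
        # No event-specific work here: persist and acknowledge, so one bad
        # event can never stall delivery of the events behind it.
        db.execute("INSERT INTO webhook_queue (body) VALUES (?)",
                   (request.get_data(as_text=True),))
        db.commit()
        return "", 202

    def drain_queue(handle_event):
        # Separate worker: processes events in arrival order and can catch
        # up after an outage, preserving synchronization integrity.
        rows = db.execute("SELECT id, body FROM webhook_queue"
                          " WHERE processed = 0 ORDER BY id").fetchall()
        for row_id, body in rows:
            handle_event(json.loads(body))
            db.execute("UPDATE webhook_queue SET processed = 1 WHERE id = ?",
                       (row_id,))
            db.commit()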
Having said all that, if I were to design a new system I would go with a pull-based system instead. In this system, the client would request a range (start time, max count) of events via an HTTP request, and the response would include the "end time" that can be used in the next query. A "webhook" would contain an empty payload, and would simply indicate that the queue had become non-empty - this could be omitted entirely if realtime updates are not required, instead having the client poll.
The advantages of this approach are that it's easy for consumers to "replay" a set of events if they accidentally lose them, and it's also a lot more efficient, since many events can be sent per request (we gain some of this benefit at the moment by supporting "batch" webhooks containing multiple events, but it requires opt-in from the client.) Additionally, it allows webhooks to be versioned more easily, since you can have versioned endpoints for fetching events, and it also allows you to have an arbitrary number of consumers of the same set of events with no additional complexity.
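A sketch of what that pull-based consumer could look like; the /events query parameters (start, max_count) and response fields (events, end_time) follow the design in the comment and are not a real API:

    import time
    import requests

    def consume(base_url, cursor, handle_event):
        while True:
            resp = requests.get(f"{base_url}/events",
                                params={"start": cursor, "max_count": 100},
                                timeout=10)
            resp.raise_for_status()
            page = resp.json()
            for event in page["events"]:
                handle_event(event)
            # The returned end time seeds the next query, so a consumer that
            # loses state can replay by simply rewinding its cursor.
            cursor = page["end_time"]
            if not page["events"]:
                time.sleep(5)  # queue is empty; poll (or wait for the ping)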
by emadda on 8/3/22, 5:02 PM
I used /events to apply writes from Stripe to a local database for this reason in the tdog CLI:
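For reference, a hedged sketch of that pattern against Stripe's real /v1/events endpoint; the cursor handling and the apply_to_db helper are hypothetical, and tdog's actual implementation will differ:

    import stripe

    stripe.api_key = "sk_test_..."

    def sync_new_events(db, newest_seen_id, apply_to_db):
        # /v1/events is sorted newest-first; ending_before pages toward
        # newer events, i.e. everything created after our cursor.
        page = stripe.Event.list(limit=100, ending_before=newest_seen_id)
        for event in reversed(page.data):  # apply oldest-first
            apply_to_db(db, event.type, event.data.object)
            newest_seen_id = event.id
        return newest_seen_id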
by spiffytech on 8/3/22, 7:59 PM
Senders could guarantee ordering by only sending webhook n+1 after the HTTP request for webhook n completes, rather than sending them concurrently or in arbitrary order. For efficiency, perhaps only guarantee ordering for hooks related to each resource rather than all of a customer's hooks.
Or, include a monotonic counter in the webhook so the recipient can tell when it would apply an old state on top of a new one.
What the recipient does when they receive the webhook is up to them (delays, parallelism, etc.), but at least they'd know the correct event order.
The author raises a good point about what to do in the face of errors, but I'd vastly prefer to handle special behavior upon recipient error (stall, dead letter queue) to the current Stripe reality of "things come in out of order, and we don't give you the info needed to reassemble the order on your end".
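A sketch of the counter check described above; the payload field names and the storage interface are assumptions, not anything Stripe provides today:

    def apply_webhook(store, payload):
        resource_id = payload["resource_id"]
        incoming = payload["counter"]  # sender's per-resource monotonic counter
        current = store.get_counter(resource_id)  # hypothetical; None if unseen
        if current is not None and incoming <= current:
            # A stale or duplicate event arrived after a newer one:
            # drop it rather than clobbering newer state.
            return
        store.save_state(resource_id, payload["state"], counter=incoming)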
by spiffytech on 8/3/22, 8:13 PM
That's non-trivial engineering to foist upon every recipient of your webhooks.
I like the idea of the /events pull-based endpoint, which keeps engineering much simpler for the recipient: https://blog.sequin.io/events-not-webhooks/
by rektide on 8/3/22, 9:27 PM
Not webhook specific, but I spent a couple hours today figuring out that some of our calls to internal services look like they open, send, and are being processed, but the target server sometimes doesn't even see the request for a full 8s. The call itself was not the problem; the service just hadn't started until long after the data was all sent.
by topspin on 8/3/22, 6:32 PM
by davidgu on 8/3/22, 11:14 PM
P.S. Svix is great, super happy customer here :)
by Spivak on 8/3/22, 5:19 PM
You can even keep your existing webhook code by providing a synchronous bridge to Kafka that just sends them in order but waits for the 200 before sending the next one. Boom, now you are guaranteed the events are recorded and processed in order.
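A sketch of that bridge, assuming kafka-python; the topic, group, and endpoint are hypothetical:

    import time

    import requests
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("webhook-events",
                             bootstrap_servers="localhost:9092",
                             group_id="webhook-bridge",
                             enable_auto_commit=False)  # commit only on success

    for message in consumer:
        while True:  # stall the partition until this event is delivered
            try:
                resp = requests.post("https://example.com/webhooks",
                                     data=message.value,
                                     headers={"Content-Type": "application/json"},
                                     timeout=10)
                if resp.status_code == 200:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)  # retry the same event; never skip ahead
        consumer.commit()  # record progress only after the 200

Disabling auto-commit is the key design choice: the consumer's offset only advances after a confirmed delivery, so a crash mid-delivery replays the event rather than dropping it.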