by yurisagalov on 8/3/22, 4:17 PM with 23 comments
by Diggsey on 8/3/22, 5:13 PM
Timestamps don't solve the issue, and neither do "thin payloads": the receiver has no idea how long to wait before it can assume the order is settled, and a problem on the sender side could cause logic errors for all of your clients.
Most of these problems are solved if the receiver doesn't process the webhook immediately, but instead queues it internally. You don't have issues with the queue being stalled due to one bad webhook, because there is no event-specific processing happening on the receiver (other than perhaps ignoring some events). The queue can still be stalled if there is a wider problem, but as soon as the problem is resolved, the system can catch up on those queued webhooks, and synchronization integrity is maintained.
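A minimal sketch of that receiver pattern, assuming a Flask endpoint and SQLite as the internal queue (the route and table name here are made up for illustration):

    import json
    import sqlite3

    from flask import Flask, request

    app = Flask(__name__)
    db = sqlite3.connect("webhooks.db", check_same_thread=False)
    db.execute(
        "CREATE TABLE IF NOT EXISTS webhook_queue ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  body TEXT NOT NULL,"
        "  processed INTEGER NOT NULL DEFAULT 0)"
    )

    @app.route("/webhooks", methods=["POST"])
    def receive_webhook():
        # No event-specific work here: persist and acknowledge, so one bad
        # event can never stall delivery of the events behind it.
        db.execute("INSERT INTO webhook_queue (body) VALUES (?)",
                   (request.get_data(as_text=True),))
        db.commit()
        return "", 202

    def drain_queue(handle_event):
        # Separate worker: processes events in arrival order and can catch
        # up after an outage, preserving synchronization integrity.
        rows = db.execute("SELECT id, body FROM webhook_queue"
                          " WHERE processed = 0 ORDER BY id").fetchall()
        for row_id, body in rows:
            handle_event(json.loads(body))
            db.execute("UPDATE webhook_queue SET processed = 1 WHERE id = ?",
                       (row_id,))
            db.commit()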
Having said all that, if I were to design a new system I would go with a pull-based system instead. In this system, the client would request a range (start time, max count) of events via an HTTP request, and the response would include the "end time" that can be used in the next query. A "webhook" would contain an empty payload, and would simply indicate that the queue had become non-empty - this could be omitted entirely if realtime updates are not required, instead having the client poll.
The advantages of this approach are that it's easy for consumers to "replay" a set of events if they accidentally lose them, and it's also a lot more efficient, since many events can be sent per request (we gain some of this benefit at the moment by supporting "batch" webhooks containing multiple events, but it requires opt-in from the client.) Additionally, it allows webhooks to be versioned more easily, since you can have versioned endpoints for fetching events, and it also allows you to have an arbitrary number of consumers of the same set of events with no additional complexity.
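A sketch of what that pull-based consumer could look like; the /events query parameters (start, max_count) and response fields (events, end_time) follow the design in the comment and are not a real API:

    import time
    import requests

    def consume(base_url, cursor, handle_event):
        while True:
            resp = requests.get(f"{base_url}/events",
                                params={"start": cursor, "max_count": 100},
                                timeout=10)
            resp.raise_for_status()
            page = resp.json()
            for event in page["events"]:
                handle_event(event)
            # The returned end time seeds the next query, so a consumer that
            # loses state can replay by simply rewinding its cursor.
            cursor = page["end_time"]
            if not page["events"]:
                time.sleep(5)  # queue is empty; poll (or wait for the ping)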
by emadda on 8/3/22, 5:02 PM
I used /events to apply writes from Stripe to a local database for this reason in the tdog CLI:
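For reference, a hedged sketch of that pattern against Stripe's real /v1/events endpoint; the cursor handling and the apply_to_db helper are hypothetical, and tdog's actual implementation will differ:

    import stripe

    stripe.api_key = "sk_test_..."

    def sync_new_events(db, newest_seen_id, apply_to_db):
        # /v1/events is sorted newest-first; ending_before pages toward
        # newer events, i.e. everything created after our cursor.
        page = stripe.Event.list(limit=100, ending_before=newest_seen_id)
        for event in reversed(page.data):  # apply oldest-first
            apply_to_db(db, event.type, event.data.object)
            newest_seen_id = event.id
        return newest_seen_id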
by spiffytech on 8/3/22, 7:59 PM
Senders could guarantee ordering by only sending webhook n+1 after the HTTP request for webhook n completes, rather than sending them concurrently or in arbitrary order. For efficiency, perhaps only guarantee ordering for hooks related to each resource rather than all of a customer's hooks.
Or, include a monotonic counter in the webhook so the recipient can tell when it would apply an old state on top of a new one.
What the recipient does when they receive the webhook is up to them (delays, parallelism, etc.), but at least they'd know the correct event order.
The author raises a good point about what to do in the face of errors, but I'd vastly prefer to handle special behavior upon recipient error (stall, dead letter queue) to the current Stripe reality of "things come in out of order, and we don't give you the info needed to reassemble the order on your end".
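A sketch of the counter check described above; the payload field names and the storage interface are assumptions, not anything Stripe provides today:

    def apply_webhook(store, payload):
        resource_id = payload["resource_id"]
        incoming = payload["counter"]  # sender's per-resource monotonic counter
        current = store.get_counter(resource_id)  # hypothetical; None if unseen
        if current is not None and incoming <= current:
            # A stale or duplicate event arrived after a newer one:
            # drop it rather than clobbering newer state.
            return
        store.save_state(resource_id, payload["state"], counter=incoming)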
by spiffytech on 8/3/22, 8:13 PM
That's non-trivial engineering to foist upon every recipient of your webhooks.
I like the idea of the /events pull-based endpoint, which keeps engineering much simpler for the recipient: https://blog.sequin.io/events-not-webhooks/
by rektide on 8/3/22, 9:27 PM
Not webhook specific, but I spent a couple hours today figuring out that some of our calls to internal services look like they open, send, and are being processed, but the target server sometimes doesn't even see the request for a full 8s. The call itself was not the problem; the service just hadn't started until long after the data was all sent.
by topspin on 8/3/22, 6:32 PM
by davidgu on 8/3/22, 11:14 PM
P.S. Svix is great, super happy customer here :)
by Spivak on 8/3/22, 5:19 PM
You can even keep your existing webhook code by providing a synchronous bridge to Kafka that just sends them in order but waits for the 200 before sending the next one. Boom, now you are guaranteed the events are recorded and processed in order.
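A sketch of that bridge, assuming kafka-python; the topic, group, and endpoint are hypothetical:

    import time

    import requests
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("webhook-events",
                             bootstrap_servers="localhost:9092",
                             group_id="webhook-bridge",
                             enable_auto_commit=False)  # commit only on success

    for message in consumer:
        while True:  # stall the partition until this event is delivered
            try:
                resp = requests.post("https://example.com/webhooks",
                                     data=message.value,
                                     headers={"Content-Type": "application/json"},
                                     timeout=10)
                if resp.status_code == 200:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)  # retry the same event; never skip ahead
        consumer.commit()  # record progress only after the 200

Disabling auto-commit is the key design choice: the consumer's offset only advances after a confirmed delivery, so a crash mid-delivery replays the event rather than dropping it.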