by enether on 5/15/25, 11:43 AM with 97 comments
Message queues are usually a core part of any distributed architecture, and the options are endless: Kafka, RabbitMQ, NATS, Redis Streams, SQS, ZeroMQ... and then there's the “just use Postgres” camp for simpler use cases.
I’m trying to make sense of the tradeoffs between:
- async fire-and-forget pub/sub vs. sync RPC-like point to point communication
- simple FIFO vs. priority queues and delay queues
- intelligent brokers (e.g. RabbitMQ, NATS with filters) vs. minimal brokers (e.g. Kafka’s client-driven model)
There's also a fair amount of ideology/emotional attachment - some folks root for underdogs written in their favorite programming language, others reflexively dismiss anything that's not "enterprise-grade". And of course, vendors are always in the mix trying to steer the conversation toward their own solution.
If you’ve built a production system in the last few years:
1. What queue did you choose?
2. What didn't work out?
3. Where did you regret adding complexity?
4. And if you stuck with a DB-based queue — did it scale?
I’d love to hear war stories, regrets, and opinions.
by speedgoose on 5/16/25, 8:35 AM
Mostly because it has been very reliable for years in production at a previous company, and doesn’t require babysitting. Its recent versions also has new features that make it is a descent alternative to Kafka if you don’t need to scale to the moon.
And the logo is a rabbit.
by KingOfCoders on 5/18/25, 6:02 AM
by adamcharnock on 5/18/25, 7:53 AM
In the case of a queue, you put an item in the queue, and then something removes it later. There is a single flow of items. They are put in. They are taken out.
In the case of a stream, you put an item in the queue, then it can be removed multiple times by any other process that cares to do so. This may be called 'fan out'.
This is an important distinction and really effects how one designs software that uses these systems. Queues work just fine for, say, background jobs. A user signs up, and you put a task in the 'send_registration_email' queue.[1]
However, what if some _other_ system then cares about user sign ups? Well, you have to add another queue, and the user sign-up code needs to be aware of it. For example, a 'add_user_to_crm' queue.
The result here is that choosing a queue early on leads to a tight-coupling of services down the road.
The alternative is to choose streams. In this case, instead of saying what _should_ happen, you say what _did_ happen (past tense). Here you replace 'send_registration_email' and 'add_user_to_crm' with a single stream called 'used_registered'. Each service that cares about this fact is then free to subscribe to that steam and get its own copy of the events (it does so via a 'consumer group', or something of a similar name).
This results in a more loosely coupled system, where you potentially also have access to an event history should you need it (if you configure your broker to keep the events around).
--
This is where Postgresql and SQS tend to fall down. I've yet to hear of an implementation of streams in Postgresql[2]. And SQS is inherently a queue.
I therefore normally reach for Redis Steams, but mostly because it is what I am familiar with.
Note: This line of thinking leads into Domain Driven Design, CQRS, and Event Sourcing. Each of which is interesting and certainly has useful things to offer, although I would advise against simply consuming any of them wholesale.
[1] Although this is my go-to example, I'm actually unconvinced that email sending should be done via a queue. Email is just a sequence of queues anyway.
[2] If you know of one please tell me!
by bilinguliar on 5/18/25, 6:11 AM
However, I have noticed that oftentimes devs are using queues where Workflow Engines would be a better fit.
If your message processing time is in tens of seconds – talk to your local Workflow Engine professional (:
by wordofx on 5/18/25, 6:14 AM
by lmm on 5/18/25, 6:43 AM
Kafka is a great tool with lots of very useful properties (not just queues, it can be your primary datastore), but it's not operationally simple. If you're going to use it you should fully commit to building your whole system on it and accept that you will need to invest in ops at least a little. It's not a good fit for a "side" feature on the edge of your system.
by mstaoru on 5/19/25, 2:36 PM
I didn't have anything but bad experiences with RabbitMQ, maybe I cannot "cook" it, but it would always go split-brain, or last issue I had, a part of clients connected to certain clustered nodes just stopped receiving messages. Cluster restart helped, but all logs and all metrics were green and clean. I try to avoid it if I can.
ZeroMQ is more like a building block for your applications. If you need something very special, it could be a good fit, but for a typical EDA-ish bus architecture Redis or Kafka/Redpanda are both very good.
by jolux on 5/18/25, 6:00 AM
We wanted replayability and multiple clients on the same topic, so we evaluated Kafka, but we determined it was too operationally complex for our needs. Persistence was also unnecessary as the data stream already had a separate archiving system and existing clients only needed about 24hr max of context. AWS Kinesis ended up being simpler for our needs and I have nothing but good things to say about it for the most part. Streaming client support in Elixir was not as good as Kafka but writing our own adapter wasn’t too hard.
by AznHisoka on 5/15/25, 6:59 PM
by vanbashan on 5/18/25, 6:41 AM
Performance is at least as good as Kafka.
For simpler workload, beanstalkd could be a good fit, either.
by crmd on 5/18/25, 6:26 AM
Architecture info: https://explore.fednow.org/resources/technical-overview-guid...
by j45 on 5/18/25, 1:25 PM
Database functions can remain independent of stack or programming changes.
Complexity comes on it's own, often little need to pile it in from the start to tie ones hands early for relatively simple solutions.
by matt_s on 5/18/25, 2:22 PM
Its not very complex and feels like we're running a lot of compute resources to just sync data between systems. Admittedly there isn't good separation of concerns so there is overlap that requires data syncs.
I've been looking at things like kafka, etc. thinking there might be some magic there that makes us use less compute or makes data syncs a little easier to deal with but wonder what scale of data throughput is a tipping point where a service like that is really needed. If it turns out its just a different service but same timeliness of data sync and similar compute resources I struggle with what benefits might be provided.
I'd love for almost like a levels.fyi style site where people could anonymously report things like this for the tech stacks being used, throughput of data, amount of compute in play, and ratings/comments on their overall solution ("would do again", "don't recommend", "overkill", "resume filler"). It feels much like other areas of technology where a use case comes out of a huge company and RDD (resume driven development) takes hold and now there are people out there doing the equivalent of souping up a 1997 honda accord like its a racecar but its only driving grandma to her appointments.
by lolc on 5/19/25, 11:48 PM
by MyOutfitIsVague on 5/18/25, 6:16 AM
by micvbang on 5/18/25, 9:42 AM
by Jemaclus on 5/16/25, 3:59 PM
For smaller projects of "job queues," I tend to use Amazon SQS or RabbitMQ.
But just for clarity, Kafka is not really a message queue -- it's a persistent structured log that can be used as a message queue. More specifically, you can replay messages by resetting the offset. In a queue, the idea is once you pop an item off the queue, it's no longer in the queue and therefore is gone once it's consumed, but with Kafka, you're leaving the message where it is and moving an offset instead. This means, for example, that you can have many many clients read from the same topic without issue.
SQS and other MQs don't have that persistence -- once you consume the message and ack, the message disappears and you can't "replay it" via the queue system. You have to re-submit the message to process it. This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.
There are pros and cons to either mechanism, and there's significant overlap in the usage of the two systems, but they are designed to serve different purposes.
The analogy I tend to use is that Kafka is like reading a book. You read a page, you turn the page. But if you get confused, you can flip back and reread a previous page. An MQ like RabbitMQ or Sidekiq is more like the line at the grocery store: once the customer pays, they walk out and they're gone. You can't go back and re-process their cart.
Again, pros and cons to both approaches.
"What didn't work out?" -- I've learned in my career that, in general, I really like replayability, so Kafka is typically my first choice, unless I know that re-creating the messages are trivial, in which case I am more inclined to lean toward RabbitMQ or SQS. I've been bitten several times by MQs where I can't easily recreate the queue, and I lose critical messages.
"Where did you regret adding complexity?" -- Again, smaller systems that are just "job queues" (versus service-to-service async communication) don't need a whole lot of complexity. So I've learned that if it's a small system, go with an MQ first (any of them are fine), and go with Kafka only if you start scaling beyond a single simple system.
"And if you stuck with a DB-based queue -- did it scale?" -- I've done this in the past. It scales until it doesn't. Given my experience with MQs and Kafka, I feel it's a trivial amount of work to set up an MQ/Kafka, and I don't get anything extra by using a DB-based queue. I personally would avoid these, unless you have a compelling reason to use it (eg, your DB isn't huge, and you can save money).
by austin-cheney on 5/16/25, 8:26 AM
by coolcase on 5/18/25, 9:39 PM
2. Nothing. It all worked out.
3. Nowhere. Generally used them for queue-y things.
4. Not done this. Even back in 2000s when queues weren't so well known they'd be a queue-like system. Polling FTP for example!
by dmazin on 5/18/25, 5:56 AM
by csomar on 5/18/25, 6:47 AM
by stephenr on 5/18/25, 6:40 AM
For those unfamiliar, it's a Lua library that gets executed in Redis using one of the various language bindings (which are essentially wrappers around calling the Lua methods).
With our multi-node redis setup it seems to be quite reliable.
by a_void_sky on 5/15/25, 11:51 AM
by clark-kent on 5/15/25, 7:44 PM
by z3ugma on 5/18/25, 5:43 AM
by yesnomaybe on 5/18/25, 12:39 PM
I found it hard to shift mentally from MSK and its even triggers back to regular consumer spun up in containers etc. but that also it rather MSK than Kafka.
I am currently swapping out the whole pub/sub layer to MongoDB change streams, which I have found to be working really well. For queuing it attempts to lock on read so I can scale consumers with retry / backoff etc. Broadcast is simple and without locking, auto delete in Mongo.
I will have to see how it really scales and I'm sure I'm trading one problem for another but, it will definitely help to remove a moving part. Overall, app is rather low volume with the occasional spike. I would have stayed with Kafka were there be let's say >100rpm on the core functions.
by tacostakohashi on 5/18/25, 2:24 AM
by MichaelMoser123 on 5/18/25, 6:19 AM
by kabes on 5/18/25, 6:40 AM
by nop_slide on 5/15/25, 4:30 PM
by ok1984 on 5/15/25, 7:47 PM
by DonHopkins on 5/18/25, 7:27 AM
by smittywerben on 5/18/25, 3:29 PM
RabbitMQ is neat out of the box. But I went with ZeroMQ at the time.
ZeroMQ is cool but during current year I'd only use it to learn from their excellent documentation. Coming from Python, it taught me about Berkeley sockets and the process of building cross-language messaging patterns. After a few projects, it's like realizing I didn't need ZeroMQ to begin with I could make my own! If ZeroMQ's Hintjens were still with us I'd still be using it.
It's like the documented incremental process of designing a messaging queue to fit your problem domain, plus a thin wrapper easing some of lower level socket nastiness. At least that's my experience using it over the years. Me talking about it won't do it enough justice.
NATS does the lower level socket wrapper part very nicely. It's a but more modern too. Golang's designed to be like a slightly nicer C syntax, so it would make sense that it's high performance and sturdy. So it's similar to ZeroMQ there.
I'm not sure if either persist to disk out of the box. So either of these are going to be simpler and faster than Kafka.
The DB people are probably trying too hard to cater to the queues. Ideally I'd have normalized the data and modeled the relations such transactions don't lock up the whole table. Then I started questioning why I needed a queue at all when databases (sans SQLite which is fast enough as is) are made for pooling access to a database.
Kafka supports pipelining to a relational database but this part is where you kind of have to be experienced to not footgun and I'm not at that level. I think using it as a queue in that you're short-circuiting it from the relational database pipeline is non-standard for Kafka. I suspect that's where a lot of the Kafka hate is from. I could understand if the distributed transactions part is hell but at that point it's like why'd you skip the database then? Trying to get that free lunch I assume.
I have an alternative. Try inserting everything into a SQLite file. Running into concurrency issues? Use a second SQLite file. Two computers? send it over the network. More issues? Since it's SQL just switch to a real database that will pool the clients. Or switch to five of them. SQL is sorta cool that way. I assume that would avoid the reimplementing half of the JVM to sync across computers where you get Oracle Java showing up to sell you their database halfway into making your galactic scale software or the whatever.
I must be stressed today. Thanks for asking.
by atombender on 5/18/25, 11:47 PM
For example: Message queues are good for work that must be done in strict order where you want to deal with one message at a time. They aren't such a great fit for large batch movement of data, like logs or high volume events, because having a per-message acknowledgement state requires a lot of round trips over the network that simply isn't needed; you want to treat the entire bulk of the flow to carve out big chunks of it, because CPUs wnd networks and disks are more efficient when doing the same operation over large amounts of data in one go.
If you are executing "tasks" (like image processing, ML inference, webhooks), ordering by insertion order might not be the right choice, either. Sometimes you want to coalesce (dedupe by key). Sometimes you want to ensure the processing for a key (e.g. a customer ID) is done in the same process and not randomly distributed over all your workers. Sometimes you want delivery to be strictly sequential, requiring an exclusive worker rather than massively parallel fan-out. And so on.
Where I work, we use a mix of things depending on the application. I am a big fan of NATS. It's not itself a message queue, but its primitives can be combined to handle all sorts of behaviors. Core NATS is more like ephemeral pub/sub, while Jetstream gives you durable, highly available Kafka-like streams.
I like combining queues with database state. Use the queue as an efficient way to order items (like jobs or events) for massively scalable distribution, and use the database to store the current state of things.
For example, imagine you're delivering webhook messages. We first store the message in the database with the state "pending", then write an event to the queue about it. The worker receives the event, double-checks its state is still "pending", then executes it. If delivered, mark as "done" and ack the message. Otherwise, mark as "failed" and create a new queue message to retry. This way, you have durable state in a solid database, and the queue is an efficient way to coordinate the workers. (There's a bit more work here to ensure consistency, but this is the gist of it.)
Core NATS is fantastic as a communication primitive between ephemeral processes. You can use it for RPC, for lightweight broadcasts (e.g. reload config everywhere), even for things like leases or caching or similar. Jetstream is like Kafka but more flexible; for example, each message has a wildcard subject that can be filtered on, so different consumers can very efficiently filter a big, commingled stream by interest. In Jetstream streams, messages have per-consumer ack/nack state in addition to a position, so you're not limited to Kafka's linear "position". Overall, a superb data model, and very easy to manage as infra.
One weak point with NATS is a maximum message size of 10MB. This means that you sometimes have to invent your own chunking if your application needs to send larger payloads. Doing this opens up some cans of worms, so I honestly wouldn't recommend it. For large batch stuff, Redpanda is a better option.
by mlhpdx on 5/18/25, 6:01 AM
by catkitcourt on 5/18/25, 5:51 AM
by varbhat on 5/15/25, 7:14 PM
by revskill on 5/16/25, 10:30 AM