by galeaspablo on 10/4/24, 7:49 PM with 10 comments
by deniscoady on 10/5/24, 2:09 AM
I've worked with Apache Kafka at massive (50+ Gbps) scale. It's a proper nightmare. When it breaks, it breaks fast and violently.
But the problem is that Apache Kafka (and more modern Kafka-compatible alternatives like Redpanda <- obligatory mention) fills a need for a durable streaming log that other systems cannot. The access patterns, requirements, use cases, ecosystem, etc., are different from those of traditional databases and call for a proper streaming solution.
Streaming from a traditional database is more or less a solved problem. Why not just use a managed Kafka provider with change data capture (CDC) capability if you don't want to deal with Kafka yourself? At least then you get to use all of the tools in the vibrant Kafka ecosystem.
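To make that concrete, here's a minimal sketch of the consuming side of a CDC pipeline in plain Java: a connector such as Debezium writes one change-event topic per table, and any ordinary consumer can tail it. The topic name "pg.public.orders", the broker address, and the String deserialization are illustrative assumptions; real event payloads depend on how the connector is configured.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CdcTail {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            props.put("group.id", "cdc-tail");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Debezium-style topic naming is <server>.<schema>.<table>;
                // "pg.public.orders" is a made-up example.
                consumer.subscribe(List.of("pg.public.orders"));
                while (true) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        // Each value is a change event carrying before/after
                        // images of the changed row.
                        System.out.printf("%s -> %s%n", r.key(), r.value());
                    }
                }
            }
        }
    }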
by Sphax on 10/5/24, 12:44 PM
Configuration complexity: there are a couple of things we had to tune over the years, mainly around the log cleaner once we started leveraging compacted topics, but other than that it's pretty much the default config. Is it optimal? No, but it's fast enough. Hardware choice in my opinion is not really an issue: we started on HDDs and switched to SSDs later on, and the cluster kept working just fine with the same configuration.
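For anyone who hasn't used them: a compacted topic retains only the latest record per key, and the log cleaner settings control how aggressively superseded values get scrubbed. A minimal AdminClient sketch in Java; the topic name "user-profiles", the partition/replica counts, and the 0.3 dirty ratio are placeholders for illustration, not recommendations.

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("user-profiles", 12, (short) 3)
                    .configs(Map.of(
                        // Keep only the latest record per key.
                        "cleanup.policy", "compact",
                        // How "dirty" a log must be before the cleaner runs;
                        // lower means more aggressive cleaning at the cost of
                        // extra I/O. 0.3 is an arbitrary example value.
                        "min.cleanable.dirty.ratio", "0.3"));
                admin.createTopics(Set.of(topic)).all().get();
            }
        }
    }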
Scaling I'll grant can be a pain. We had to scale our clusters for two main reasons. 1) More services want to use Kafka, so there are more topics and more data to serve. This is not that hard to scale: just add brokers for more capacity. 2) You need more partitions for a topic; we had to do this a couple of times over the years, and it's annoying because the default tooling for data redistribution is bad. We ended up using a third-party tool (today Cruise Control does this nicely).
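Adding partitions itself is a one-liner against the admin API; it's the existing data that doesn't move. A minimal sketch (the topic name "events" and the target count of 24 are made up) that also shows why reassignment tooling like Cruise Control exists:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class GrowPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            try (AdminClient admin = AdminClient.create(props)) {
                // Raise the partition count of "events" to 24. Existing data is
                // NOT redistributed: old records stay where they are and keys
                // now hash to different partitions, which is why you reach for
                // reassignment tools like Cruise Control afterwards.
                admin.createPartitions(
                        Map.of("events", NewPartitions.increaseTo(24)))
                     .all().get();
            }
        }
    }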
Maintenance: yes, you need to monitor your stuff, just like any other system you deploy on your own hardware. Thankfully monitoring Kafka is not _that_ hard; there are ready-made solutions to export the JMX monitoring data. We've used Prometheus (prometheus-jmx-exporter and node_exporter) almost since the beginning and it works fine. We're still using ZooKeeper, though thankfully that's no longer necessary; I will say our ZooKeeper clusters have been rock solid over the years.
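For context, prometheus-jmx-exporter runs as a javaagent inside the broker JVM and scrapes its JMX MBeans; you can read the same data directly from plain Java. A minimal sketch, assuming the broker exposes JMX on localhost:9999 (the port is an assumption; the MBean shown is one of Kafka's standard broker metrics):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxPeek {
        public static void main(String[] args) throws Exception {
            // Connect to a broker whose JMX port is 9999 (hypothetical).
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // A standard broker metric: incoming message rate.
                ObjectName name = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
                Object rate = conn.getAttribute(name, "OneMinuteRate");
                System.out.println("MessagesInPerSec (1m rate): " + rate);
            }
        }
    }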
Development overheads: I really can't agree with that. Yes, the "main" ecosystem is Java-based, but it's not like librdkafka doesn't exist, and third-party libraries are not all "sub-par"; that's just a mischaracterization. We've used Go with sarama since 2014 and recently switched to franz-go: both work great. You do need to properly evaluate your options, though (but that's part of your job). With that said, if I were to start from scratch I would absolutely suggest starting with Kafka Streams, even if your team doesn't have Java experience (learning Java isn't that hard), just because it makes building a data pipeline super straightforward and handles a lot of the complexities mentioned.
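To show what "super straightforward" looks like, here's a minimal Kafka Streams sketch: read a topic, transform, write to another. The topic names "events" and "events-normalized" and the uppercase transform are placeholders; the point is that Streams handles consumer-group rebalancing, retries, and state for you.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class PipelineSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pipeline-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Drop empty records and normalize the rest (placeholder logic).
            KStream<String, String> events = builder.stream("events");
            events.filter((key, value) -> value != null && !value.isEmpty())
                  .mapValues(String::toUpperCase)
                  .to("events-normalized");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // Close cleanly on shutdown so offsets and state are committed.
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }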
by taylodl on 10/4/24, 9:26 PM