from Hacker News

Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion

by ambrood on 8/15/24, 5:16 PM with 31 comments

tl;dr we built an embeddable stream processing engine in Rust using Apache DataFusion. Check us out at https://github.com/probably-nothing-labs/denormalized

Hey HN,

We’d like to showcase a very early version of our embeddable stream processing engine called Denormalized. The rise of DuckDB has made it abundantly clear that, even for many terabyte-scale workloads, a single-node system outshines previous-generation distributed query engines such as Spark and Snowflake in both performance and cost.

Many of the workloads DuckDB is now used for were considered “big data” in the previous generation, but no longer. In streaming especially, this mismatch is more acute. A streaming system is designed to incrementally process large amounts of data over a period of time. Even at the upper end of scale, production stream processing use cases rarely compute over more than tens of gigabytes of data at any given time.

Even so, standard stream processing solutions such as Flink require spinning up a distributed JVM cluster to compute against even the simplest of event streams. To that end, we’re building Denormalized to be embeddable in your applications and to scale up to hundreds of thousands of events per second, with a Flink-like dataflow API. While we currently only support Rust, we plan to add Python and TypeScript bindings soon.

We’re built atop the DataFusion and Arrow ecosystems and currently support streaming joins as well as windowed aggregations on Kafka topics.
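For readers new to the term, here is a minimal sketch of what a windowed aggregation does: events are folded one at a time into fixed-size (tumbling) time windows, and a window's result is emitted once a watermark passes its end. This is plain standard-library Rust written for illustration only; it is not Denormalized's API, and the names `Event`, `Agg`, `TumblingWindow`, `push`, and `flush` are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical event: a key, an event time in milliseconds, and a value.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub key: String,
    pub ts_ms: u64,
    pub value: f64,
}

/// Running aggregate for one (window, key) pair.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Agg {
    pub count: u64,
    pub sum: f64,
}

/// Incrementally maintained tumbling-window aggregation (illustrative only).
pub struct TumblingWindow {
    size_ms: u64,
    // (window start, key) -> aggregate accumulated so far
    state: BTreeMap<(u64, String), Agg>,
}

impl TumblingWindow {
    pub fn new(size_ms: u64) -> Self {
        assert!(size_ms > 0, "window size must be positive");
        Self { size_ms, state: BTreeMap::new() }
    }

    /// Fold one event into the window its timestamp falls in.
    pub fn push(&mut self, ev: &Event) {
        // Align the timestamp down to the start of its window.
        let start = ev.ts_ms - ev.ts_ms % self.size_ms;
        let agg = self.state.entry((start, ev.key.clone())).or_default();
        agg.count += 1;
        agg.sum += ev.value;
    }

    /// Emit and drop every window that ends at or before `watermark_ms`.
    pub fn flush(&mut self, watermark_ms: u64) -> Vec<(u64, String, Agg)> {
        let size = self.size_ms;
        // Collect the closed keys first so the map can be mutated afterwards.
        let closed_keys: Vec<(u64, String)> = self
            .state
            .keys()
            .filter(|(start, _)| *start + size <= watermark_ms)
            .cloned()
            .collect();

        let mut closed = Vec::with_capacity(closed_keys.len());
        for k in closed_keys {
            if let Some(agg) = self.state.remove(&k) {
                closed.push((k.0, k.1, agg));
            }
        }
        closed
    }
}
```

In this sketch, state is kept only for windows that are still open, which is why the working set of a streaming job can stay small even when the total volume processed over time is large. A real engine would also have to handle late events, multiple partitions, and checkpointing, none of which is shown here.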

Please check out our repo at: https://github.com/probably-nothing-labs/denormalized

We’d love to hear your feedback.

  • by dman on 8/15/24, 7:12 PM

    This looks super interesting. I built https://github.com/finos/perspective in a past life but have been out of the streaming analytics game for some time. Nice to see single-machine efficiency be a focus. Will give this a try and post feedback on GitHub.
  • by emgeee on 8/15/24, 7:34 PM

    Other founder here -- we've been working on this now for several months and have had a lot of fun building on top of arrow and datafusion
  • by theLiminator on 8/15/24, 11:05 PM

    Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.

    Ideally, you'd support an api similar to Polars (which I have found to be the nicest thus far).

    It'd also be important/useful to support Python udfs (think numpy/jax/etc.).

    It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.

  • by j-pb on 8/16/24, 6:33 AM

    I'd be curious to know what your thoughts on differential/timely dataflow are. Superficially it seems that it might be possible to integrate the existing Rust infrastructure from those libraries with DataFusion and Arrow, which could give you quite a few operators for free, and provide your users with the very nice incremental query/streaming-as-view-maintenance model.
  • by ethegwo on 8/15/24, 5:54 PM

    Neat, founder of https://tonbo.io/ here. I am excited to see someone bring stream processing to DataFusion. We are working on an Arrow-native embedded DB and plan to support DataFusion in the next release; we’re interested in building the streaming feature on Denormalized.
  • by shrisukhani on 8/15/24, 8:42 PM

    Interesting. What use cases are you guys targeting with this?
  • by stereosky on 8/16/24, 8:45 AM

    Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment
  • by eXpl0it3r on 8/16/24, 12:09 PM

    For someone not deep in the topic, what is a "Streaming Processing Engine"?

    All the descriptions of Denormalized use the term, so if you don't know it, it's kind of impossible to understand what Denormalized is or what it is trying to solve.

  • by nonlogical on 8/16/24, 8:58 AM

    This looks totally awesome! Easy to set up, memory-efficient, streaming, real-time data aggregation, compilable to a single self-contained binary: that is a dream come true.

    Bookmarked for future projects!

  • by ztratar on 8/15/24, 7:28 PM

    Will be excited to see the TypeScript bindings once they're out. We may be able to use this to handle some of our workloads at Embra.

    Will reach out! Congrats on the ship.

  • by drawnwren on 8/15/24, 8:36 PM

    What differentiates you from, e.g., Arroyo and Fluvio?
  • by franciscojarceo on 8/15/24, 7:11 PM

    Can't wait for the Python SDK!
  • by lhnz on 8/15/24, 9:38 PM

    Do you have plans to make the data sources pluggable instead of being Kafka-specific?
  • by akshay2881 on 8/15/24, 8:41 PM

    Nice! How feature-complete is this compared with current industry standards like Flink?
  • by rNULLED on 8/16/24, 4:53 AM

    Looks cool! I’ll try it out for my ambitious project :)