from Hacker News

Building Databases over a Weekend

by ambrood on 11/21/24, 1:16 AM with 13 comments

by 01HNNWZ0MV43FF on 11/21/24, 3:31 AM
> In this post we take you on a walkthrough on how you can use DataFusion
Thought it was gonna be a "build your own SQLite" or something.
by Gepsens on 11/21/24, 4:10 AM
I remember that 2 years ago someone proposed adding stream processing to DataFusion, and PRs followed. But IMO stream processing is an entirely different beast, though some people could use DataFusion's SQL engine for it. There are Rust projects like Arroyo.
by maximus93 on 11/21/24, 11:09 AM
Great discussion here! At AI Squared, we have also been exploring the evolving landscape of stream processing and SQL engines. While batch engines like DataFusion excel at handling static data, we recognize the challenges around integrating streaming capabilities and infrastructure seamlessly.
Our focus has been on simplifying data activation pipelines with tools like Multiwoven, which aims to bridge the gap between static and dynamic data needs by supporting connectors for both traditional databases and real-time platforms like Kafka. However, the need for more embedded, developer-friendly streaming solutions is clear, and it’s exciting to see the progress in projects like Arroyo, Materialize, and ClickHouse.
For us, the balance lies in usability and flexibility—how can we empower teams to embed robust data capabilities (whether streaming or batch) into their workflows without overloading on infrastructure complexity? As this ecosystem evolves, we’re optimistic about collaborating and contributing to solutions that make streaming SQL as accessible as traditional SQL.
Looking forward to seeing how this space develops—and kudos to the teams pushing boundaries! https://github.com/Multiwoven/multiwoven/
by dangoodmanUT on 11/21/24, 4:07 AM
This post feels like it's skipping over a lot of code that could be included.
by alamb on 11/21/24, 2:55 PM
BTW, here is a fun exercise that takes this idea to the extreme. Who can build a custom file format that gets the best ClickHouse performance (on DataFusion):
https://github.com/apache/datafusion/issues/13448
Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.