from Hacker News

Feldera Incremental Compute Engine

by gzel on 9/29/24, 8:03 AM with 55 comments

  • by arn3n on 9/29/24, 2:48 PM

    If you don’t want to change your whole stack, ClickHouse’s Materialized Views do something extraordinarily similar: computations are run on inserts to the source table in an online/streaming manner. I’m curious how this solution compares in its set of features/guarantees.
  • by rebanevapustus on 9/29/24, 5:45 PM

    Big fan of Feldera here.

    I would advise everybody to steer clear of anything that isn't Feldera or Materialize. Nobody aside from these guys has an IVM product that is grounded in proper theory.

    If you are interested in trying out the theory (DBSP) underneath Feldera, but in Python, then check this out: https://github.com/brurucy/pydbsp

    It works with pandas, polars...anything.

  • by YmiYugy on 10/2/24, 11:22 PM

    I tried the demo and it looks quite promising. Right now it seems to be focused on handling a few queries over high-throughput streams. I wonder if it could also work for the following scenario: it behaves almost like a normal SQL database, i.e. most data is cold on disk, queries are low latency, there is no need to predefine them, it is ACID compliant, etc. Except you can subscribe to queries. That means the initial response needs to be as fast as in traditional databases, and the database needs to be able to scale to a large number of concurrent subscriptions. If this were possible, it could alleviate the common problem that web apps that need to keep their UI up to date have to constantly poll the database, creating a lot of overhead and making anything but trivial queries a non-starter.
  • by jitl on 9/29/24, 3:15 PM

    I’ve been following the Feldera/DBSP/Differential Datalog team for a while and am happy to see y’all stable-ish with your own venture and settling in a model more approachable than DDlog for most developers :)

    This seems much more adoptable to me in my org than DDlog was, even if I really liked DDlog much more than SQL :-(

  • by jacques_chester on 9/29/24, 4:42 PM

    I remember seeing a VMware-internal presentation on the DDlog work which led to Feldera and being absolutely blown away. They took a stream processing problem that had grown to an hours-deep backlog and reduced it to sub-second processing times. Lalith & co are the real deal.
  • by qazxcvbnm on 9/29/24, 3:24 PM

    Incredible… I hadn’t even noticed, and people found the holy grail and open-sourced it!

    By the way, I was wondering about a related question. Do streaming engines typically store a copy of the data streamed to them? For instance, if I had a view to get the maximum value of a table, and the maximum value was removed, the streaming engine surely needs to get the next value from somewhere. It seems clear that the streaming engine needs at least its own snapshot of the data to have a consistent state of the computation, but duplicating the persisted data seems somewhat wasteful.
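
    The MAX example above can be made concrete with a minimal Python sketch. It is a generic illustration, not how Feldera or any particular engine implements it; the class name `IncrementalMax` and its methods are hypothetical. It shows why the view has to keep its own record of the live values: once the current maximum is deleted, the next one can only come from retained state.

```python
import heapq
from collections import Counter


class IncrementalMax:
    """Maintains MAX(value) under inserts and deletes (hypothetical sketch)."""

    def __init__(self):
        self.weights = Counter()  # value -> number of live copies
        self.heap = []            # max-heap via negated values, cleaned lazily

    def apply(self, value, weight):
        """Apply one change: weight=+1 inserts a copy, weight=-1 deletes one."""
        if self.weights[value] + weight < 0:
            raise ValueError("cannot delete a value that is not present")
        self.weights[value] += weight
        if weight > 0:
            heapq.heappush(self.heap, -value)

    def current(self):
        """Return the current maximum, or None if the input is empty."""
        # Discard heap entries whose value has been fully deleted.
        while self.heap and self.weights[-self.heap[0]] <= 0:
            heapq.heappop(self.heap)
        return -self.heap[0] if self.heap else None


view = IncrementalMax()
for v in (3, 7, 5):
    view.apply(v, +1)
max_before = view.current()   # 7
view.apply(7, -1)             # the maximum is removed
max_after = view.current()    # falls back to 5, found in retained state
```

    In this sketch the retained state grows with the number of distinct live values, which is the duplication the question is about. Whether an engine can avoid it by reading back from the source database is not something this sketch addresses.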

  • by cube2222 on 9/29/24, 10:02 AM

    This looks extremely cool. This is basically incremental view maintenance in databases, a problem that almost everybody (I think) has when using SQL databases and wanting to do some derived views for more performant access patterns. Importantly, they seem to support a wide breadth of SQL operators, support spilling computation state to disk, and it's open-source! Interestingly, it compiles queries to Rust, so an approach similar to Redshift (which compiles queries to C++ programs).

    There's already a bunch of tools in this area:

    1. Materialize[0], which afaik is more big-data oriented, and doesn't pipe the results back to your database, instead storing results in S3 and serving them.

    2. Epsio[1], which I've never used, seems to be very similar to this product, but is closed-source only.

    3. When building OctoSQL[2], this capability was also important to me and it was designed from ground up to support it. Though in practice in a tool like OctoSQL it's pretty useless (was a fun problem to solve though).

    There's some things I'm curious about:

    - Does it handle queries that involve complex combinations of ordering with limits in subqueries? If due to a change in an underlying table a top-n row is added, resulting in moving other rows around (and removing the current n'th) will the subsequent query parts behave as though the order was maintained when computing it, or will it fall apart (imagine a select with limit from a select with bigger limit)?

    - Is it internally consistent[3]? They say it's "strongly consistent" and "It also guarantees that the state of the views always corresponds to what you'd get if you ran the queries in a batch system for the same input." so I think the answer is yes, but this one's really important.
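
    To make the nested-limit question above concrete, here is a minimal Python sketch of what "behaving as though the order was maintained" would mean. It is not Feldera's implementation; `IncrementalTopN` and `feed` are hypothetical names. Each operator keeps every row it has received, and emits its top-n change as a delta (row -> +1/-1), so a downstream limit sees the evicted row leave and the new row arrive.

```python
import bisect
from collections import Counter


class IncrementalTopN:
    """ORDER BY value DESC LIMIT n, maintained under single-row changes."""

    def __init__(self, n):
        assert n >= 1
        self.n = n
        self.rows = []  # all received rows, sorted ascending

    def top(self):
        return self.rows[-self.n:]

    def apply(self, row, weight):
        """Apply weight=+1 (insert) or -1 (delete); return the output delta."""
        before = Counter(self.top())
        if weight == 1:
            bisect.insort(self.rows, row)
        elif weight == -1:
            self.rows.remove(row)
        else:
            raise ValueError("this sketch only handles single-row changes")
        after = Counter(self.top())
        return {r: after[r] - before[r]
                for r in set(before) | set(after)
                if after[r] != before[r]}


inner = IncrementalTopN(3)  # select ... limit 3
outer = IncrementalTopN(2)  # select ... from (inner) limit 2


def feed(row, weight):
    # Propagate the inner operator's delta into the outer operator.
    for r, w in inner.apply(row, weight).items():
        outer.apply(r, w)


for value in (1, 5, 3, 4):
    feed(value, +1)
top_after_inserts = outer.top()   # [4, 5]
feed(5, -1)                       # deleting 5 lets 1 re-enter the inner top 3
top_after_delete = outer.top()    # [3, 4]
```

    The sketch recomputes the top-n slice on every change for clarity; the point is only that the result matches a batch evaluation of the same nested query after each update.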

    Either way, will have to play with this, and dig into the paper (the link in the repo doesn't work, here's an arXiv link[4]). Wishing the creators good luck, this looks great!

    [0]: https://materialize.com

    [1]: https://www.epsio.io

    [2]: https://github.com/cube2222/octosql

    [3]: https://www.scattered-thoughts.net/writing/internal-consiste...

    [4]: https://arxiv.org/pdf/2203.16684

  • by ZiliangXK on 9/29/24, 6:47 PM

    Timeplus Proton OSS https://github.com/timeplus-io/proton does a similar thing, but with powerful historical query processing as well.
  • by jonstewart on 9/29/24, 1:42 PM

    How does it compare to Materialize/differential dataflow?
  • by seungwoolee518 on 9/29/24, 9:43 AM

    When I first saw the title, I thought: "one of the OSes, minus the 'l', is introducing a new incremental compute engine?"

    Anyway, it was very impressive.

  • by bbminner on 9/29/24, 3:16 PM

    I wonder what guarantees can be made wrt resource consumption. I suppose it'd be reasonable to assume that in most (all?) cases an update is cheaper than a recompute in terms of CPU cycles, but what about RAM? Intuitively it seems like there must be cases that would force you to store an unbounded amount of data indefinitely in RAM.
  • by shuaiboi on 9/29/24, 6:51 PM

    Would something like DBSP support spreadsheet-style computations? Most of the financial world is stuck behind spreadsheets, and the entire process of productionizing spreadsheets is broken:

    * Engineers don't have time to understand the spreadsheet logic and translate everything into an incremental version for production.

    * Analysts don't understand the challenges with stream processing.

    * SQL is still too awkward of a language for finance.

    * Excel is a batch environment, which makes it hard to codify it as a streaming calculation.

    If I understand correctly, your paper implies that as long as there is a way to describe spreadsheets as a ZSet, some incremental version of the program can be derived? Spreadsheets are pretty close to a relational table, but it would be a ZSet algebra on cells, not rows, similar to functional reactive programming. So DBSP on cells would be incremental UDFs, not just UDAFs?

    thoughts??
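
    The "ZSet on cells" idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DBSP or Feldera API: a Z-set is modeled as a dict from a row to an integer weight, each spreadsheet cell is encoded as a hypothetical `(cell_name, value)` row, and the helper names `zadd`, `double_column_a`, and `weighted_sum` are made up for the example. The property it relies on is that a linear operator Q satisfies Q(a + b) = Q(a) + Q(b), so applying Q to just the change yields the change in the output.

```python
from collections import Counter


def zadd(a, b):
    """Add two Z-sets (row -> integer weight), dropping zero weights."""
    out = Counter(a)
    for row, w in b.items():
        out[row] += w
    return {row: w for row, w in out.items() if w != 0}


def double_column_a(z):
    """Linear operator: keep column-A cells and double their values."""
    out = {}
    for (cell, value), w in z.items():
        if cell.startswith("A"):
            key = (cell, value * 2)
            out[key] = out.get(key, 0) + w
    return out


def weighted_sum(z):
    """Linear aggregate: SUM over all cells, counting each row by its weight."""
    return sum(value * w for (_, value), w in z.items())


# Initial sheet: every cell is a row with weight +1.
sheet = {("A1", 10): 1, ("A2", 5): 1, ("B1", 7): 1}
view = double_column_a(sheet)
total = weighted_sum(sheet)

# Editing A1 from 10 to 12 is a delta: retract the old row, insert the new one.
delta = {("A1", 10): -1, ("A1", 12): 1}

# Incremental update: run the same operators on the delta only.
view = zadd(view, double_column_a(delta))
total += weighted_sum(delta)
sheet = zadd(sheet, delta)
```

    After the update, `view` and `total` equal what a full recomputation over the edited sheet would give. Formulas that reference other cells would need joins or recursion on top of this, which the sketch does not attempt.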

  • by Nelkins on 9/30/24, 1:58 PM

    Does anybody have a good resource to learn about the differences between things like Feldera, Materialize, Adapton, and other developments in the incremental computation space? Where are the experts hanging out? What are they reading?
  • by Nelkins on 9/29/24, 12:58 PM

    I would love it if something like this exposed C bindings so that every language with an FFI could use the library. I’d love to be able to define pipelines and queries in .NET instead of having to use SQL.
  • by faangguyindia on 9/29/24, 5:09 PM

    We just use BigQuery and call it a day.

    BigQuery figured this out long ago and built it on top of Bigtable.