by gzel on 9/29/24, 8:03 AM with 55 comments
by arn3n on 9/29/24, 2:48 PM
by rebanevapustus on 9/29/24, 5:45 PM
I would advise everybody to stay clear of anything that isn't Feldera or Materialize. Nobody aside from these guys has an IVM product that is grounded in proper theory.
If you are interested in trying out the theory (DBSP) underneath Feldera, but in Python, then check this out: https://github.com/brurucy/pydbsp
It works with pandas, polars...anything.
by YmiYugy on 10/2/24, 11:22 PM
by jitl on 9/29/24, 3:15 PM
This seems much more adoptable to me in my org than DDlog was, even if I really liked DDlog much more than SQL :-(
by jacques_chester on 9/29/24, 4:42 PM
by qazxcvbnm on 9/29/24, 3:24 PM
By the way, I was wondering about a related question. Do streaming engines typically store a copy of the data streamed to them? For instance, if I had a view to get the maximum value of a table, and the maximum value was removed, the streaming engine surely needs to get the next value from somewhere. It seems clear that the streaming engine needs at least its own snapshot of the data to have a consistent state of the computation, but duplicating the persisted data seems somewhat wasteful.
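To make the MAX example concrete, here is a minimal Python sketch of why such a view needs its own state. It is an illustration only, not how Feldera or any other engine implements it; the class name `IncrementalMax` and the +1/-1 weight convention are assumptions made for this example.

```python
from collections import Counter


class IncrementalMax:
    """Hypothetical MAX view over a multiset, updated by inserts and deletes."""

    def __init__(self):
        # value -> multiplicity. This is the view's own copy of the column:
        # without it, deleting the current maximum leaves nothing to fall back on.
        self.counts = Counter()

    def apply(self, value, weight):
        """Apply one change: weight=+1 is an insert, weight=-1 is a delete."""
        new = self.counts[value] + weight
        if new < 0:
            raise ValueError(f"deleting {value!r}, which is not present")
        if new == 0:
            del self.counts[value]
        else:
            self.counts[value] = new

    def current(self):
        """Return the maximum of the retained values, or None if empty."""
        # A linear scan keeps the sketch short; a real engine would more
        # plausibly keep an ordered index here.
        return max(self.counts) if self.counts else None


view = IncrementalMax()
for v in (3, 7, 5):
    view.apply(v, +1)
print(view.current())  # maximum with 3, 7, 5 present

view.apply(7, -1)      # the maximum is deleted
print(view.current())  # next value comes from the retained state
```

The retained state is what answers the "next value" question. Whether that state duplicates the source table or shares storage with it depends on the engine.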
by cube2222 on 9/29/24, 10:02 AM
There's already a bunch of tools in this area:
1. Materialize[0], which afaik is more big-data oriented, and doesn't pipe the results back to your database, instead storing results in S3 and serving them.
2. Epsio[1], which I've never used, seems to be very similar to this product, but is closed-source only.
3. When building OctoSQL[2], this capability was also important to me and it was designed from the ground up to support it. Though in practice in a tool like OctoSQL it's pretty useless (was a fun problem to solve though).
There's some things I'm curious about:
- Does it handle queries that involve complex combinations of ordering with limits in subqueries? If due to a change in an underlying table a top-n row is added, resulting in moving other rows around (and removing the current n'th) will the subsequent query parts behave as though the order was maintained when computing it, or will it fall apart (imagine a select with limit from a select with bigger limit)?
- Is it internally consistent[3]? They say it's "strongly consistent" and "It also guarantees that the state of the views always corresponds to what you'd get if you ran the queries in a batch system for the same input." so I think the answer is yes, but this one's really important.
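To illustrate what the first question is asking, here is a small Python sketch of a top-n operator that emits its output change as retractions and insertions. It is an assumption-laden toy, not Feldera's implementation: the `TopN` class, the single-row +1/-1 changes, and the dict-of-weights delta format are all made up for this example.

```python
import bisect
from collections import Counter


class TopN:
    """Hypothetical top-n view (largest n values) that reports output deltas."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.rows = []  # the full input, kept sorted ascending

    def _top(self):
        # The current output as a multiset: the n largest rows.
        return Counter(self.rows[-self.n:])

    def apply(self, value, weight):
        """Apply one insert (+1) or delete (-1); return the change to the output."""
        before = self._top()
        if weight == 1:
            bisect.insort(self.rows, value)
        elif weight == -1:
            i = bisect.bisect_left(self.rows, value)
            if i == len(self.rows) or self.rows[i] != value:
                raise ValueError(f"deleting {value!r}, which is not present")
            self.rows.pop(i)
        else:
            raise ValueError("this sketch only handles weights +1 and -1")
        after = self._top()
        # Output delta: positive weight = row enters the top-n, negative = row leaves.
        return {
            v: after[v] - before[v]
            for v in set(before) | set(after)
            if after[v] != before[v]
        }


top2 = TopN(2)
top2.apply(1, +1)
top2.apply(5, +1)
print(top2.apply(3, +1))  # 3 enters the top-2 and 1 is retracted
print(top2.apply(5, -1))  # 5 leaves and 1 re-enters
```

A downstream operator (say, an outer limit over this one) that consumes these deltas sees the old n'th row explicitly retracted, which is the behavior the question is probing for.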
Either way, will have to play with this, and dig into the paper (the link in the repo doesn't work, here's an arXiv link[4]). Wishing the creators good luck, this looks great!
[1]: https://www.epsio.io
[2]: https://github.com/cube2222/octosql
[3]: https://www.scattered-thoughts.net/writing/internal-consiste...
by ZiliangXK on 9/29/24, 6:47 PM
by jonstewart on 9/29/24, 1:42 PM
by seungwoolee518 on 9/29/24, 9:43 AM
Anyway, it was very impressive.
by bbminner on 9/29/24, 3:16 PM
by shuaiboi on 9/29/24, 6:51 PM
* Engineers don't have time to understand the spreadsheet logic and translate everything into an incremental version for production.
* Analysts don't understand the challenges with stream processing.
* SQL is still too awkward of a language for finance.
* Excel is a batch environment, which makes it hard to codify it as a streaming calculation.
If I understand correctly, your paper implies that as long as there is a way to describe spreadsheets as a Z-set, some incremental version of the program can be derived? Spreadsheets are pretty close to a relational table, but it would be a Z-set algebra on cells, not rows, similar to functional reactive programming. So DBSP on cells would be incremental UDFs, not just UDAFs?
thoughts??
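For concreteness, here is what "a Z-set on cells" could look like in plain Python. This is a sketch under assumptions, not pydbsp or Feldera code: a Z-set is modeled as a dict from `(cell, value)` pairs to integer weights, and the helper names `zset_add` and `weighted_sum` are hypothetical.

```python
from collections import defaultdict


def zset_add(a, b):
    """Add two Z-sets pointwise, dropping elements whose weight becomes 0."""
    out = defaultdict(int)
    for z in (a, b):
        for element, weight in z.items():
            out[element] += weight
    return {element: w for element, w in out.items() if w != 0}


def weighted_sum(z):
    """SUM over cell values. It is linear, so it can be applied to a delta alone."""
    return sum(value * weight for (_cell, value), weight in z.items())


# The sheet as a Z-set of (cell, value) pairs, each with weight +1.
sheet = {("A1", 10): 1, ("A2", 20): 1, ("A3", 30): 1}
total = weighted_sum(sheet)

# Editing A1 from 10 to 12 is a retraction plus an insertion.
delta = {("A1", 10): -1, ("A1", 12): 1}

# Incremental update: only the delta is processed, not the whole sheet.
total += weighted_sum(delta)
sheet = zset_add(sheet, delta)

# The incremental result matches a full recomputation.
assert total == weighted_sum(sheet)
print(total)
```

This only covers a linear aggregate over cells. Whether arbitrary cell formulas incrementalize as neatly is exactly the open question above.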
by Nelkins on 9/30/24, 1:58 PM
by Nelkins on 9/29/24, 12:58 PM
by faangguyindia on 9/29/24, 5:09 PM
BigQuery figured this out long ago and built it on top of Bigtable.