from Hacker News

Apache DataFusion

by thebuilderjr on 1/12/25, 6:53 PM with 47 comments

  • by kristjansson on 1/16/25, 2:28 AM

    Of interest and relevance: This past semester, Andy Pavlo's DB seminar at CMU explored a number of projects under the heading 'Database Building Blocks', starting with DataFusion and several of its applications. Take a listen!

    https://www.youtube.com/playlist?list=PLSE8ODhjZXjZc2AdXq_Lc...

  • by jamesblonde on 1/16/25, 12:31 AM

    There is a Cambrian explosion in data processing engines: DataFusion, Polars, DuckDB, Feldera, Pathway, and more than I can remember.

    It reminds me of 15 years ago, when JDBC/ODBC was the common interface to data. Then, as data volumes grew, specialized databases became viable: graph, document, JSON, key-value, etc.

    I don't see SQL and Spark hammers keeping their ETL monopolies for much longer.

  • by krapht on 1/13/25, 12:53 AM

    I feel like I'm not the target audience for this. When I have large data, I write SQL queries directly and run them against the database. There's no improving performance when you have to go out to the DB anyway; might as well have it run the query too. And the server ops and DB admins certainly have far more money to spend on making the DB fast than I do on my antivirus-laden corporate laptop.
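    A minimal sketch of that workflow, with Python's stdlib sqlite3 standing in for the remote corporate DB (the table and column names are invented for illustration): push the aggregation into SQL so only the small result set crosses the wire.

    ```python
    import sqlite3

    # An in-memory sqlite3 database stands in for the remote DB here.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )

    # Let the database do the work; the client only receives the aggregate.
    rows = con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    print(rows)  # [('east', 12.5), ('west', 5.0)]
    ```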

    When I have small data that fits on my laptop, Pandas is good enough.

    Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, of that 10%, you can solve 9% by dropping down to NumPy and picking a better algorithm...
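    A hedged sketch of that "drop down to NumPy" move (the data is invented): a grouped sum that crawls as a row-by-row Python loop becomes a single vectorized call.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    keys = rng.integers(0, 100, size=1_000_000)  # integer group labels
    vals = rng.random(1_000_000)

    # Row-by-row style, roughly what a slow pandas .apply does:
    #   sums = {}
    #   for k, v in zip(keys, vals):
    #       sums[k] = sums.get(k, 0.0) + v

    # Vectorized: one pass in C, summing vals into a bucket per key.
    sums = np.bincount(keys, weights=vals, minlength=100)
    ```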

  • by netcraft on 1/16/25, 1:55 AM

    Why would this be useful over DuckDB? (earnest question)
  • by bionhoward on 1/13/25, 2:53 AM

    How does this compare/contrast to polars? Seems pretty similar, anybody tried both?
  • by pickinrust on 1/16/25, 11:54 AM

    Highly recommend this video for a deeper dive; it's an actual example in practice: https://www.youtube.com/watch?v=VLAvZw0ZEwI&list=PLSE8ODhjZX... Enjoy!
  • by theLiminator on 1/16/25, 10:23 PM

    I've done some testing of polars, duckdb, and datafusion.

    Anecdotally, these are my experiences:

    DuckDB (last used maybe 7-8 months ago):

    - Very nice for very fast local queries (against Parquet files; I ignored their homegrown file format)

    - Most pleasant cli

    - Seems to have the best out of core experience

    - As far as I can tell, seems to be closest to state of the art in terms of algorithms/overall design, though honestly everyone is within spitting distance of each other

    - Spark API seems exciting

    DataFusion (last used 1.5 years ago):

    - Most pleasant to build/extend on top of (in Rust)

    - Is to OLAP DBMSs what LLVM is to compilers (stole this quote from Andrew Lamb)

    - Could be wrong, but in terms of core engineering discipline they are the most rigorous/thoughtful (no shade thrown to the other libraries, which are all awesome libraries/tools too)

    - Seems to be the most foundational to many other tools (and is most ubiquitously embedded)

    - Their Python dataframe-centric workflow isn't as nice as Polars' (this is rapidly improving, afaict)

    - Docs are lagging behind Polars'

    - Very exciting future (ray datafusion, improvements to python bindings, ballista, datafusion-comet)

    Polars (last used this week):

    - The most pleasant api by far for a programmatic user

    - Pretty good interop with python ecosystem

    - Rust crate is a second class citizen

    - Python is a first class citizen

    - Probably the best for advanced ETL use cases

    - Fastest library for querying hive partitioned parquet data in an object store

    - Wide end-user adoption (less so as a query engine)

    - Moves very fast (I do get more bugs/regressions in polars version to version, but on the flip side, they move fast to fix issues and release very often)

    - Exciting distributed cloud solution coming (though it's proprietary)

    - New streaming engine based on morsel-driven parallelism (same architecture as DuckDB, afaict?) should greatly improve Polars' out-of-core capabilities

    - Much nicer to test/compose/build reusable queries/functions on top of than SQL-based ETL tools

    - Error messages/debuggability/observability are still immature
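    To make "pleasant programmatic API" concrete, here's a hedged sketch of a lazy Polars query built from composable expressions (the data and column names are invented; nothing executes until `collect()`, so the optimizer sees the whole plan):

    ```python
    import polars as pl

    lf = pl.LazyFrame({
        "region": ["east", "west", "east"],
        "amount": [10.0, 5.0, 2.5],
    })

    result = (
        lf.group_by("region")
          .agg(pl.col("amount").sum().alias("total"))  # expressions compose like functions
          .sort("region")
          .collect()  # the query runs here, as one optimized plan
    )
    print(result)
    ```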

    All three are awesome tools. The OLAP space is really heating up.

    Things I still see lacking in the OLAP end-user space:

    - Unified batch/streaming dataframe-centric workflows; nothing is truly high-throughput, low-latency, pleasant to use, mature, and robust. I've only really seen Arroyo and RisingWave, and neither seems mature or usable enough yet.

    - Nothing is quite at the robustness level of something like sqlite

    - Despite native query engines, datalake implementations are mostly lagging behind their Java equivalents (Iceberg/Delta)

    Some questions for other users:

    - I'm curious whether anyone uses Ibis in prod; I found it wasn't very usable as an end user