by dmlorenzetti on 1/9/23, 3:36 PM with 47 comments
by RobinL on 1/10/23, 12:23 PM
In doing so, I'm implicitly using Arrow - e.g. with Duckdb, AWS Athena and so on. The list of tools using Arrow is long! https://arrow.apache.org/powered_by/
Another interesting development since I wrote this is DuckDB.
DuckDB offers a compute engine with great performance against parquet files and other formats. Probably similar performance to Arrow. It's interesting they opted to write their own compute engine rather than use Arrow's - but I believe this is partly because Arrow was immature when they were starting out. I mention it because, as far as I know, there's not yet an easy SQL interface to Arrow from Python.
Nonetheless, DuckDB still uses Arrow for some of its other features: https://duckdb.org/2021/12/03/duck-arrow.html
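For anyone wanting to try this from Python, here's a minimal sketch of using DuckDB as a SQL layer over parquet and Arrow (the file name and column names are placeholders of my own, not from the article):

    import duckdb

    con = duckdb.connect()

    # Run SQL directly against a parquet file, no import step needed;
    # 'events.parquet' and 'category' are illustrative
    con.execute(
        "SELECT category, count(*) AS n FROM 'events.parquet' GROUP BY category"
    )

    # Results can be fetched as an Arrow table (or a pandas DataFrame, etc.)
    arrow_table = con.fetch_arrow_table()

    # DuckDB's replacement scans also let you query an in-memory Arrow table
    # by referring to the Python variable name in SQL
    print(con.execute("SELECT count(*) FROM arrow_table").fetchall())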
Arrow also has a SQL query engine: https://arrow.apache.org/blog/2019/02/04/datafusion-donation...
I might be wrong about this - but in my experience, it feels like there's more consensus around the Arrow format, as opposed to the compute side.
Going forward, I see parquet continuing on its path to becoming a de facto standard for storing and sharing bulk data. I'm particularly excited about new tools that allow you to process it in the browser. I've written more about this just yesterday: https://www.robinlinacre.com/parquet_api/, discussion: https://news.ycombinator.com/item?id=34310695.
by kajika91 on 1/10/23, 12:01 PM
I have used Arrow and even made a humble contribution to the Go binding, but I don't like pretending it is so much better than other solutions. It is not a silver bullet; probably its biggest advantage is the zero-copy goal when converting data between different frameworks' objects. Depending on how the data is used, a columnar layout can be better, but not always.
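To illustrate the zero-copy point, a small pyarrow/NumPy sketch (my own example, not from the comment above):

    import numpy as np
    import pyarrow as pa

    # Build an Arrow array from a NumPy array of primitives
    arr = pa.array(np.arange(1_000_000, dtype=np.int64))

    # For primitive types with no nulls, Arrow can hand its buffer back
    # to NumPy without copying; this raises if a copy would be required
    np_view = arr.to_numpy(zero_copy_only=True)
    print(np_view[:5])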
by Lyngbakr on 1/10/23, 11:18 AM
by alamb on 1/10/23, 6:26 PM
by agumonkey on 1/10/23, 12:08 PM
ps: a tiny video to explain storage layout optimizations https://yewtu.be/watch?v=dPb2ZXnt2_U
by gizmodo59 on 1/10/23, 1:34 PM
by flakiness on 1/10/23, 12:33 PM
https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...
by hermitcrab on 1/10/23, 9:58 PM
by d_burfoot on 1/10/23, 1:04 PM
I have been burned so many times by amateur-hour software engineering failures from the Apache world that it's very hard for me to ever willingly adopt anything from that brand again. Just put it in gzipped JSON or TSV, and if there's a performance penalty, it's better to pay a bit more for cloud compute than to hate your job because of some nonsense dependency issue caused by an org.apache library failing to follow proper versioning guidelines.
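For reference, the "boring" route needs nothing beyond the Python standard library (the file name and fields below are made up):

    import gzip
    import json

    rows = [{"id": i, "value": i * i} for i in range(1000)]

    # Newline-delimited JSON, gzip-compressed: no Apache dependencies required
    with gzip.open("rows.jsonl.gz", "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Reading it back
    with gzip.open("rows.jsonl.gz", "rt", encoding="utf-8") as f:
        restored = [json.loads(line) for line in f]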
by rr888 on 1/10/23, 12:58 PM
by kordlessagain on 1/10/23, 2:29 PM
by mjburgess on 1/10/23, 11:05 AM
If someone has a code example to this effect, I'd be grateful.
I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".
I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after running into some obstacles generating the data.
It would be nice to have an example like this to show developers (and engineers) who have become accustomed to today's extreme levels of CPU abuse that modern laptops really are supercomputers.
It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't obvious has, in my view, a lot to do with the state of OS/browser/app design & performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.
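For what it's worth, a rough sketch of the kind of thing I was attempting — here substituting DuckDB's range() table function for SQLite, since generating the rows lazily sidesteps the data-generation obstacle entirely:

    import time
    import duckdb

    con = duckdb.connect()

    start = time.time()
    # range() produces the rows lazily, so there's no billion-row table to load first
    row = con.execute(
        "SELECT count(*), sum(i), avg(i) FROM range(1000000000) AS t(i)"
    ).fetchone()
    elapsed = time.time() - start

    print(row, f"{elapsed:.1f}s")  # a recent laptop gets through this in seconds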
by amayui on 1/10/23, 9:16 PM