from Hacker News

Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheets

by houqp on 9/24/22, 3:59 PM with 62 comments

  • by ebfe1 on 9/24/22, 9:41 PM

    This is cool... It reminded me of several similar tools that have popped up on HN every now and then, so I did a quick search:

    clickhouse-local - https://news.ycombinator.com/item?id=22457767

    q - https://news.ycombinator.com/item?id=27423276

    textql - https://news.ycombinator.com/item?id=16781294

    simpql - https://news.ycombinator.com/item?id=25791207

    We need a benchmark, I think... ;)
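
    A quick-and-dirty benchmark could run the same query through hyperfine (a sketch; sales.csv is made up and the per-tool flags are from memory, so double-check each CLI):

    hyperfine \
        'textql -header -sql "select count(*) from sales" sales.csv' \
        'q -H -d, "select count(*) from sales.csv"' \
        'octosql "SELECT COUNT(*) FROM sales.csv"'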

  • by mmastrac on 9/24/22, 5:16 PM

    The one thing everyone here is missing so far is that it's a Rust binary, distributed on PyPI. That's brilliant.
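
    Which means getting started is just a pip install (a sketch going from the README at the time; the package name, flags, and sample CSV path may differ by version):

    pip install roapi
    roapi --table 'uk_cities=test_data/uk_cities_with_headers.csv'

    # then, from another shell, query over HTTP:
    curl -X POST -d 'SELECT city, lat, lng FROM uk_cities LIMIT 2' localhost:8080/api/sql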
  • by gavinray on 9/24/22, 7:43 PM

    1) roapi is built with some wicked cool tech

    2) the author once answered some questions I posted on Datafusion, so they're cool in my book

    Those are my anecdotes.

  • by playingalong on 9/24/22, 8:18 PM

    Bye bye jq and your awful query syntax.
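
    For comparison, the same grouping in jq and in SQL (a sketch; royalties.json with {currency, royalty} records is made up, and jq's -s first slurps the newline-delimited records into an array):

    jq -s 'group_by(.currency) | map({currency: .[0].currency, total: (map(.royalty) | add)})' royalties.json
    octosql "SELECT currency, SUM(royalty) FROM royalties.json GROUP BY currency"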
  • by henrydark on 9/24/22, 8:15 PM

    It is pretty cool. py-spy (also a Rust binary shipped via PyPI) has been doing this for a few years:

    https://github.com/benfred/py-spy

  • by tootie on 9/24/22, 5:21 PM

    AWS Athena offers something similar. You can build tables on top of structured text files (like log files) in S3 and run SQL queries against them.
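
    Roughly like so (bucket names and schema are hypothetical; the SerDe is Athena's stock JSON one):

    aws athena start-query-execution \
        --result-configuration OutputLocation=s3://my-athena-results/ \
        --query-string "
          CREATE EXTERNAL TABLE logs (ts string, level string, msg string)
          ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
          LOCATION 's3://my-log-bucket/logs/'"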
  • by johnnunn on 9/25/22, 4:03 AM

    I have a use case where my company's application logs are shipped to S3 in a directory structure such as application/timestamp(one_hour)_logs.parquet. We want to build a simple developer-focused UI where we can query a given application for a time range, retrieve the S3 blobs in that range, and brute-force search them for the desired string. I see that roapi offers a REST interface over a fixed set of files, but I would like to dynamically glob newer files. Are there alternatives that could be used? Thanks
  • by cube2222 on 9/24/22, 4:39 PM

    This looks really cool! Especially since it uses DataFusion underneath, it's probably blazingly fast.

    If you like this, I recommend taking a look at OctoSQL[0], which I'm the author of.

    It's plenty fast and easier to add new data sources for as external plugins.

    It can also handle endless streams of data natively, so you can do running groupings on e.g. tailed JSON logs, as in the sketch below.

    Additionally, it's able to push down predicates to the database below, so if you're selecting 10 rows from a 1 billion row table, it'll just get those 10 rows instead of getting them all and filtering in memory.
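
    For a flavor, something like this (file and field names are made up; the exact stdin-source syntax is an assumption, see the README):

    # one-shot aggregation over a JSON file
    octosql "SELECT status, COUNT(*) FROM requests.json GROUP BY status"

    # running grouping over an endless stream of tailed JSON logs
    # (stdin.json as the piped-input source is assumed here)
    tail -f requests.json | octosql "SELECT status, COUNT(*) FROM stdin.json GROUP BY status"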

    [0]: https://github.com/cube2222/octosql

  • by smugma on 9/25/22, 12:16 AM

    SQL on CSV (using preinstalled Mac tools) previously linked on HN: https://til.simonwillison.net/sqlite/one-line-csv-operations

    e.g.

    sqlite3 :memory: -cmd '.mode csv' -cmd '.import royalties.csv Royalty' -cmd '.mode column' \
        'SELECT SUM(Royalty), Currency FROM Royalty GROUP BY Currency'
  • by bachmeier on 9/24/22, 4:57 PM

    As I commented in a recent similar discussion, these tools can't be used for UPDATE or INSERT. As useful as querying might be, it's terribly misleading to claim to "run SQL" if you can't change the data, since that's such a critical part of a SQL database.
  • by skybrian on 9/24/22, 9:05 PM

    Looks like it also supports SQLite for input, but not for output. That might be a nice addition.
  • by the_optimist on 9/25/22, 1:04 AM

    What’s the memory handling behavior here? Are CSVs read at query time or at startup? What about Arrow? If read at startup, is compression applied?
  • by theGnuMe on 9/25/22, 4:26 AM

    This is really cool and redefines ETL pipelines.
  • by whimsicalism on 9/24/22, 9:07 PM

    Trino can do this as well.
  • by Kalanos on 9/25/22, 12:16 AM

    Is there a Pythonic API for scripting (not the command line)? I was looking for a JSON query tool and couldn't find one.