by nullbytesmatter on 4/18/22, 5:02 PM with 16 comments
by mattewong on 4/18/22, 10:09 PM
1. Split into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory. 2. Sort each shard. You could use something like `zsv` (`sql` subcommand, https://github.com/liquidaty/zsv) or `xsv` (`sort` subcommand, https://github.com/BurntSushi/xsv) for this. 3. Merge-sort all shards, at the same time, into a single output table, de-duping and/or merging in the process. (If you have too many shards for your operating system to allow that many files open at once, you might need an intermediate merge pass to consolidate shards.)
For #3, I don't know of a tool that does this -- one probably exists for simple de-duping, but it may be harder if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
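The three steps above can be sketched in a few lines of Python -- `heapq.merge` does the k-way merge of pre-sorted shards, and duplicates are dropped on the fly. This is only a toy sketch: rows compare as plain strings, shards live in memory rather than in temp files, and the `shard_size` value is a stand-in for "whatever fits comfortably in RAM".

```python
# Sketch of the shard -> sort -> merge-dedupe pipeline. In a real run each
# shard would be an on-disk file produced by `split` and sorted separately;
# here in-memory lists stand in for those files.
import heapq
import itertools

def sort_dedupe(lines, shard_size=2):
    # 1. split the input into shards small enough for memory
    shards = []
    it = iter(lines)
    while True:
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        shards.append(sorted(shard))  # 2. sort each shard independently
    # 3. k-way merge all sorted shards at once, de-duping in the process
    out, prev = [], object()
    for row in heapq.merge(*shards):
        if row != prev:
            out.append(row)
        prev = row
    return out

print(sort_dedupe(["b", "a", "c", "a", "b"]))  # ['a', 'b', 'c']
```

For merging on a key column instead of whole rows, you would pass `key=` to `sorted` and `heapq.merge` and compare keys rather than full rows in the de-dupe step.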
by tored on 4/18/22, 10:04 PM
From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for the multiple steps. After that you can export the data directly to CSV.
https://www.mysqltutorial.org/import-csv-file-mysql-table/
https://www.mysqltutorial.org/mysql-export-table-to-csv/
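The import / SELECT DISTINCT / export flow from the linked tutorials can be sketched as follows -- using Python's bundled `sqlite3` as a self-contained stand-in for MySQL (in MySQL you would use `LOAD DATA INFILE` for the import and `SELECT ... INTO OUTFILE` for the export, as the tutorials show); the table name `t` and the toy CSV are made up for illustration:

```python
# Sketch: load a CSV into a table, de-dupe with SELECT DISTINCT, export
# back to CSV. sqlite3 stands in for MySQL so the example runs anywhere.
import csv
import io
import sqlite3

raw = "id,name\n1,alice\n2,bob\n1,alice\n"  # toy CSV with one duplicate row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# import: skip the header line, insert the data rows
body = raw.split("\n", 1)[1]
conn.executemany("INSERT INTO t VALUES (?, ?)", csv.reader(io.StringIO(body)))

# de-dupe (an intermediate view could hold this step in a longer pipeline)
rows = conn.execute("SELECT DISTINCT id, name FROM t ORDER BY id").fetchall()

# export the de-duped result as CSV
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())  # 1,alice\n2,bob\n
```

The same `SELECT DISTINCT` works in MySQL; for merge logic beyond plain de-duping you would reach for `GROUP BY` with aggregate functions instead.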
by Minor49er on 4/18/22, 5:18 PM
by modinfo on 4/18/22, 5:20 PM
by zaik on 4/18/22, 5:38 PM
by imichael on 4/18/22, 5:57 PM
by mattewong on 4/18/22, 9:43 PM
by yuppie_scum on 4/18/22, 10:23 PM
by wizwit999 on 4/18/22, 5:46 PM
by IronWolve on 4/18/22, 6:33 PM