by nullbytesmatter on 4/18/22, 5:02 PM with 16 comments
by mattewong on 4/18/22, 10:09 PM
1. Split into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory. 2. Sort each shard. You could use something like `zsv` (`sql` subcommand, https://github.com/liquidaty/zsv) or `xsv` (`sort` subcommand, https://github.com/BurntSushi/xsv) for this. 3. Merge-sort all shards, at the same time, into a single output table, de-duping and/or merging in the process. (If you have too many shards for your operating system to allow that many files open at once, you might need an intermediate merge pass to consolidate shards.)
For #3, I don't know of a tool that does this -- one probably exists for simple de-duping, but it may be harder if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
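The three steps above can be sketched in a few lines of Python -- `heapq.merge` does the k-way merge of pre-sorted shards, and duplicates are dropped on the fly. This is only a toy sketch: rows compare as plain strings, shards live in memory rather than in temp files, and the `shard_size` value is a stand-in for "whatever fits comfortably in RAM".

```python
# Sketch of the shard -> sort -> merge-dedupe pipeline. In a real run each
# shard would be an on-disk file produced by `split` and sorted separately;
# here in-memory lists stand in for those files.
import heapq
import itertools

def sort_dedupe(lines, shard_size=2):
    # 1. split the input into shards small enough for memory
    shards = []
    it = iter(lines)
    while True:
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        shards.append(sorted(shard))  # 2. sort each shard independently
    # 3. k-way merge all sorted shards at once, de-duping in the process
    out, prev = [], object()
    for row in heapq.merge(*shards):
        if row != prev:
            out.append(row)
        prev = row
    return out

print(sort_dedupe(["b", "a", "c", "a", "b"]))  # ['a', 'b', 'c']
```

For merging on a key column instead of whole rows, you would pass `key=` to `sorted` and `heapq.merge` and compare keys rather than full rows in the de-dupe step.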
by tored on 4/18/22, 10:04 PM
From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for the multiple steps. After that you can export the data directly to CSV.
https://www.mysqltutorial.org/import-csv-file-mysql-table/
https://www.mysqltutorial.org/mysql-export-table-to-csv/
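The import / SELECT DISTINCT / export flow from the linked tutorials can be sketched as follows -- using Python's bundled `sqlite3` as a self-contained stand-in for MySQL (in MySQL you would use `LOAD DATA INFILE` for the import and `SELECT ... INTO OUTFILE` for the export, as the tutorials show); the table name `t` and the toy CSV are made up for illustration:

```python
# Sketch: load a CSV into a table, de-dupe with SELECT DISTINCT, export
# back to CSV. sqlite3 stands in for MySQL so the example runs anywhere.
import csv
import io
import sqlite3

raw = "id,name\n1,alice\n2,bob\n1,alice\n"  # toy CSV with one duplicate row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# import: skip the header line, insert the data rows
body = raw.split("\n", 1)[1]
conn.executemany("INSERT INTO t VALUES (?, ?)", csv.reader(io.StringIO(body)))

# de-dupe (an intermediate view could hold this step in a longer pipeline)
rows = conn.execute("SELECT DISTINCT id, name FROM t ORDER BY id").fetchall()

# export the de-duped result as CSV
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())  # 1,alice\n2,bob\n
```

The same `SELECT DISTINCT` works in MySQL; for merge logic beyond plain de-duping you would reach for `GROUP BY` with aggregate functions instead.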
by Minor49er on 4/18/22, 5:18 PM
by modinfo on 4/18/22, 5:20 PM
by zaik on 4/18/22, 5:38 PM
by imichael on 4/18/22, 5:57 PM
by mattewong on 4/18/22, 9:43 PM
by yuppie_scum on 4/18/22, 10:23 PM
by wizwit999 on 4/18/22, 5:46 PM
by IronWolve on 4/18/22, 6:33 PM