by laxentasken on 10/21/18, 2:21 PM with 4 comments
The data follows the relational model, but just updating one field after dumping it into PostgreSQL takes quite some time (doing an update on a join), and I'm not sure this is the most effective tool/way for this kind of work. The only queries that will be run are updates, or inserts/appends of new data to existing tables (e.g. older files).
Do you have any suggestions to look into for a workload like this?
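For concreteness, here is a minimal sketch of the update-on-join workload described above, in Python with psycopg2: bulk-load the CSV into a staging table with COPY, then run a single set-based UPDATE ... FROM. The table and column names (accounts, staging_accounts, id, balance), the file name, and the connection string are assumptions for illustration, not details from the post.

    # Hypothetical sketch: COPY a CSV into a staging table, then one
    # set-based update-on-join. All names below are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # connection string is an assumption
    cur = conn.cursor()

    # Staging table for the incoming file; the target 'accounts' table is
    # assumed to already exist.
    cur.execute("CREATE TEMP TABLE staging_accounts (id bigint, balance numeric)")

    # Bulk-load the new file in one pass.
    with open("new_data.csv") as f:
        cur.copy_expert(
            "COPY staging_accounts (id, balance) FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

    # One set-based UPDATE ... FROM join instead of row-by-row updates.
    cur.execute(
        """
        UPDATE accounts AS a
        SET balance = s.balance
        FROM staging_accounts AS s
        WHERE a.id = s.id
        """
    )
    conn.commit()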
by cypherdtraitor on 10/21/18, 2:34 PM
1. Rip all CSV data into SQLite or another tabular database
2. Do all data manipulation by shifting information between the database and memory. Ideally you pull entire columns at a time. 95% of your runtime is going to be spent pulling and pushing data, so minimize the number of calls wherever possible.
3. Export the database to CSV and ship it back to the customer. (A rough sketch of these three steps follows below.)
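A rough sketch of those three steps, assuming Python's built-in sqlite3 and csv modules; the file names, the table name (records), the columns (id, price), and the price adjustment are invented for illustration.

    import csv
    import sqlite3

    conn = sqlite3.connect("work.db")

    # 1. Rip the CSV into SQLite.
    with open("input.csv", newline="") as f:
        rows = csv.reader(f)
        header = next(rows)  # e.g. ["id", "price"]
        conn.execute("CREATE TABLE records (id INTEGER, price REAL)")
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()

    # 2. Pull whole columns at once, transform in memory, push back in one
    #    batch -- few round trips to the database.
    updated = [(price * 1.1, rec_id)
               for rec_id, price in conn.execute("SELECT id, price FROM records")]
    conn.executemany("UPDATE records SET price = ? WHERE id = ?", updated)
    conn.commit()

    # 3. Export the database back to CSV for the customer.
    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(conn.execute("SELECT id, price FROM records"))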
If you use a particular language a lot, it is worth writing a text scanner that uses low-level APIs to read large CSV files quickly. I usually pipe a million characters at a time, submit most of them to the database, then duct-tape the last few characters to the next million that I pull.
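One way that chunk-and-carry reading might look in Python (the chunk size and the process_lines callback are placeholders): read about a million characters, hand everything up to the last newline to the loader, and prepend the leftover partial line to the next chunk.

    # Read ~chunk_size characters at a time; everything up to the last
    # newline goes to process_lines, the tail is carried into the next chunk.
    def read_in_chunks(path, chunk_size=1_000_000, process_lines=print):
        leftover = ""
        with open(path, "r", newline="") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunk = leftover + chunk
                cut = chunk.rfind("\n")
                if cut == -1:          # no complete line yet, keep accumulating
                    leftover = chunk
                    continue
                process_lines(chunk[:cut + 1])
                leftover = chunk[cut + 1:]
        if leftover:
            process_lines(leftover)    # final line without a trailing newline

One caveat: cutting on raw newlines assumes no quoted CSV fields contain embedded newlines; if they can, the split point has to be chosen more carefully.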
by jondegenhardt on 10/21/18, 8:24 PM
by geophile on 10/21/18, 3:07 PM