from Hacker News

Ask HN: How to efficiently import Excel/CSV files

by leahcim on 11/13/20, 11:34 PM with 11 comments

I have this recurring problem at my startup that I'd like to solve.

Our clients give us an Excel/file and I need to 1) match the columns 2) clean up the data 3) return errors (if any) 4) import each row into my database

It's very difficult given that so many things could go wrong. I have already invested a lot of time in building a file parser with column matcher but the results aren't there yet.

Do you know of any API / widget that would allow me to offer this on my website with minimal code?

  • by gus_massa on 11/14/20, 12:46 AM

    There is an old post by patio11 https://www.kalzumeus.com/2015/01/28/design-and-implementati... (HN https://news.ycombinator.com/item?id=8960280 | 107 points | Jan 28, 2015 | 35 comments) Perhaps it has something relevant.
  • by shyn3 on 11/14/20, 2:39 PM

    Check out PowerShell. In my experience, it handles CSV files the best. Also you can open it up using the Excel module, so you can do pretty much anything you need to the file.

    You will always get false positives, so what I do is dump the PS results into another database, then run a comparison, and then import. It slows you down a bit but I have noticed, every few months, I'll get 1 or 2 new errors due to data sanitization.

  • by fuzzfactor on 11/14/20, 6:10 PM

    I would want to make sure your working version of Excel & its settings open the file exactly as your clients' do.

    This could be high-touch but it could also build more client trust and pay off more than other non-automatable efforts.

    Then first actually open each file in Excel.

    I would imagine many clients would expect you to be more advanced at Excel than they are anyway, so do not fall behind.

    That way you and the client always start on the same page, and you have the full power of Excel at your disposal at least like they do.

    Matching the columns is no longer on your plate, fait accompli.

    And a bad XLSx can be caught by Excel without being subject to further incorrect processing.

    Maybe the data will be cleaner or more well-confirmed to begin with, and there might be fewer errors to return already.

    The spreadsheet itself is the database for that one file, in your client's familiar format, an opened XLSx.

    Rather than parsing, you could then consider reading the data directly from the cells using a straightforward macro which creates a binary or text file directly converting to your desired database format, or also an intermediate CSV if desired.

    Excel's built-in macro language should still be capable of accomplishing this, or I would expect you could still call any proper x86 code in other languages as an Excel macro or subroutine if desired too.

    It should then be possible for a powershell script to accomplish this structured file conversion with a single click from a menu, on any target XLSx or folder of XLSx's that conform to the clients' layout, which you have encoded into the custom macro that converts to your proprietary file structure.

    If this is considered efficient while accommodating the overhead of Windows and Excel together only you can decide.

    I used a common office PC for this process so I could encode modern data for use by antique scientific instruments, then sent the files to the equipment in their own obsolete format using the COM port.

    Only decoding the incoming proprietary data stream from the antique gear required a parser.

  • by sangli on 11/14/20, 4:54 AM

    I've used opencsv java library on many occasions. It's pretty neat.

    http://opencsv.sourceforge.net/

  • by st1x7 on 11/14/20, 10:42 AM

    Don't write the parser yourself. Whatever language you're using very likely already has one that's much better than what you can build and maintain by yourself.
  • by sloaken on 11/14/20, 10:04 AM

    I import and export to Excel all the time. I never do any fancy tricks, like exporting format or formulas. I use C# and there are libraries available to make life easy.
  • by poletopole on 11/14/20, 8:02 PM

    I’m not sure I understand the problem space here, but consider AWK. It really changed how I think about any problem involving tabular data parsing.
  • by YaBa on 11/14/20, 6:45 PM

    https://datafetcher.io/

    Python + Pandas would be my choice to develop something.

  • by thomasthomas on 11/17/20, 2:39 PM

    1. Come to an agreed upon format with clients 2. SSIS
  • by op03 on 11/14/20, 11:28 AM

    pandas read_csv
  • by su1000 on 11/14/20, 12:40 AM

    What language you are using?