from Hacker News

Leveraging SIMD: Splitting CSV Files at 3Gb/S

by __exit__ on 12/16/21, 10:09 AM with 41 comments

  • by zwegner on 12/16/21, 10:37 AM

    Pretty similar article from very recently: https://nullprogram.com/blog/2021/12/04/

    Discussion: https://news.ycombinator.com/item?id=29439403

    The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.

    But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)
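
    For anyone who hasn't seen the trick: a carry-less multiply of the quote bitmask by an all-ones constant computes a prefix XOR, which marks every byte that sits inside a quoted field. The sketch below is only an illustration of that idea, not the article's code. It emulates the multiply with shift-and-XOR doubling on plain Python integers instead of real CLMUL/SSE intrinsics, the names `prefix_xor`, `byte_mask` and `row_ends` are made up for this example, and it assumes RFC 4180 style quoting, where a doubled quote toggles the state twice and cancels out. It also walks the input in fixed-size blocks, so the offset always stays a multiple of the block size.

```python
BLOCK = 64  # bytes per block, standing in for one 64-bit mask register
MASK = (1 << BLOCK) - 1


def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of the input.

    This is what a carry-less multiply by an all-ones constant computes.
    """
    shift = 1
    while shift < BLOCK:
        bits ^= (bits << shift) & MASK
        shift *= 2
    return bits


def byte_mask(block: bytes, byte: int) -> int:
    """Bitmask with bit i set where block[i] == byte (SIMD compare + movemask)."""
    mask = 0
    for i, b in enumerate(block):
        if b == byte:
            mask |= 1 << i
    return mask


def row_ends(data: bytes) -> list[int]:
    """Offsets of the newlines that are not inside a quoted field."""
    ends = []
    in_quote = 0  # 0 or MASK: quote state carried in from the previous block
    for base in range(0, len(data), BLOCK):
        block = data[base:base + BLOCK]
        quotes = byte_mask(block, ord('"'))
        newlines = byte_mask(block, ord("\n"))
        # Bits set here are bytes inside a quoted field (opening quote included).
        inside = prefix_xor(quotes) ^ in_quote
        live = newlines & ~inside & MASK
        while live:
            low = live & -live  # isolate the lowest set bit
            ends.append(base + low.bit_length() - 1)
            live ^= low
        # A set top bit means the block ended inside a quoted field.
        in_quote = MASK if (inside >> (BLOCK - 1)) & 1 else 0
    return ends
```

    For example, `row_ends(b'a,"x\ny",b\nc,d\n')` skips the newline at offset 4 because it falls between the two quotes.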

  • by jagrsw on 12/16/21, 11:18 AM

    Not sure how the author of this entry on HN managed to change the original title from

    gigabytes per second

    to

    gigabits per siemens

    :)

  • by mattewong on 12/18/21, 7:32 AM

    Stay tuned for a SIMD-powered CSV parser library and standalone utility about to drop this weekend. It's alpha, but tests show it to be faster than anything else we could get our hands on.

  • by liuliu on 12/16/21, 5:23 PM

    Splitting a CSV file into chunks and processing them independently isn't necessarily wrong (although some implementations out there, which I won't name, do get it wrong because they guess). The trick, however, requires scanning the file twice: https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
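
    A rough sketch of how a two-scan scheme can avoid guessing (my reading of the general idea, not necessarily the exact method in the linked post): the first scan looks at each chunk in isolation, counting quotes and remembering the first newline seen after an even and after an odd number of them. A cheap sequential step then uses the running quote parity to decide which of those newlines really ends a row. The function name `split_points` is hypothetical, and RFC 4180 style quoting is assumed.

```python
QUOTE, NEWLINE = ord('"'), ord("\n")


def split_points(data: bytes, n_chunks: int) -> list[int]:
    """Row-aligned offsets at which data can be cut into independent pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    starts = range(0, len(data), size)

    # Scan 1: independent per chunk, so it could run in parallel.
    summaries = []
    for start in starts:
        quotes = 0
        first_nl = [None, None]  # indexed by quote parity within the chunk
        for i in range(start, min(start + size, len(data))):
            if data[i] == QUOTE:
                quotes += 1
            elif data[i] == NEWLINE and first_nl[quotes & 1] is None:
                first_nl[quotes & 1] = i
        summaries.append((quotes, first_nl))

    # Sequential step: parity is 1 when a chunk starts inside a quoted field.
    # A newline is a real row end when its local parity equals that parity.
    cuts = [0]
    parity = 0
    for start, (quotes, first_nl) in zip(starts, summaries):
        nl = first_nl[parity]
        if start > 0 and nl is not None and nl + 1 < len(data):
            cuts.append(nl + 1)
        parity = (parity + quotes) & 1
    return cuts
```

    The second scan would then parse each `data[cuts[i]:cuts[i + 1]]` slice independently, with the last slice running to the end of the data.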

    Nice article otherwise!

  • by michaelg7x on 12/16/21, 6:44 PM

    Presumably this solves the same kind of delimiter-finding problem as Hyperscan? https://news.ycombinator.com/item?id=19270199

  • by Tuna-Fish on 12/16/21, 11:18 AM

    Why is the unit in the title messed up?

  • by rwmj on 12/16/21, 10:46 AM

    Nice, but I'm afraid real-world CSVs are a lot more complicated than described, so don't use this code in production.
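
    The comment doesn't say which complications are meant. As one small illustration (made-up sample data, with Python's standard `csv` module as the reference parser), here is a record with CRLF line endings, a doubled quote, and a line break inside a quoted field, compared against naively treating every newline as a row boundary:

```python
import csv
import io

# Made-up sample: header, one record with an embedded newline, one plain record.
raw = 'id,comment\r\n1,"said ""hi""\nthen left"\r\n2,plain\r\n'

# Naive approach: every line break is assumed to end a row.
naive_rows = raw.splitlines()

# Reference parser: honors quoting, so the embedded newline stays in the field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows), len(parsed_rows))
```

    The naive split sees one more row than the parser does, and the doubled quote only collapses to a single quote once a real parser handles the field.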