from Hacker News

Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

by akarambir on 10/20/18, 11:28 AM with 51 comments

  • by the_clarence on 10/20/18, 4:26 PM

    I see a lot of applications trying to take advantage of SIMD, but what happens when you try to run them on systems that don't support these instructions? My guess is that you need to write multiple files targeting different instruction sets and then figure out at runtime, with cpuid, which one to use. But isn't that cumbersome, and a way to inflate a codebase dramatically?
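
The runtime dispatch described in this comment can be sketched in a single C file. This is a minimal illustration, not the article's code: it assumes GCC or Clang on x86, where `__builtin_cpu_supports` and the `target("avx2")` function attribute are available, and it falls back to the scalar path everywhere else. The names `is_ascii`, `is_ascii_scalar`, `is_ascii_avx2`, and `pick_is_ascii` are hypothetical, and an ASCII check stands in for full UTF-8 validation to keep the sketch short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: GCC or Clang on x86. Other targets use only the scalar path. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

typedef bool (*is_ascii_fn)(const unsigned char *, size_t);

/* Portable fallback: OR all bytes together and test the high bit once. */
static bool is_ascii_scalar(const unsigned char *p, size_t n) {
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++) acc |= p[i];
    return acc < 0x80;
}

#ifdef HAVE_X86_DISPATCH
/* Compiled with AVX2 enabled for this function only, so the rest of the
   file still builds for (and runs on) CPUs without AVX2. */
__attribute__((target("avx2")))
static bool is_ascii_avx2(const unsigned char *p, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= n; i += 32)
        acc = _mm256_or_si256(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    /* movemask collects the high bit of each of the 32 accumulated bytes. */
    if (_mm256_movemask_epi8(acc) != 0) return false;
    return is_ascii_scalar(p + i, n - i); /* tail of fewer than 32 bytes */
}
#endif

/* Ask the CPU once which implementation it can run. */
static is_ascii_fn pick_is_ascii(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return is_ascii_avx2;
#endif
    return is_ascii_scalar;
}

/* Public entry point. The cached pointer is not synchronized; a real
   library would initialize it in a thread-safe way. */
bool is_ascii(const unsigned char *p, size_t n) {
    static is_ascii_fn impl = NULL;
    if (!impl) impl = pick_is_ascii();
    return impl(p, n);
}
```

Callers only ever see `is_ascii`; the per-instruction-set variants stay private to the file, which keeps the duplication local even though each variant still has to be written and tested separately.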
  • by bradleyjg on 10/20/18, 1:55 PM

    Under the new string model in Java > 8, a fairly frequent workflow is:

    1) get external string

    2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding

    3) validate the byte stream

    4) figure out if the code points in the incoming string can be represented in Latin-1

    5) instantiate a Java string using either the Latin-1 encoder or the UTF-16 encoder

    I know some or all of these steps are done using HotSpot intrinsics, and then the JIT/VM does inlining, folding and so on, but I wonder how fast a custom assembly function that does all of these steps at once could be.
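
As a rough scalar sketch of fusing steps 3 and 4 above into one pass, the C function below validates a UTF-8 byte stream and, in the same loop, records whether every code point fits in Latin-1 (U+0000 to U+00FF). It is not how HotSpot does it and makes no attempt at SIMD; the `text_class` enum and `classify_utf8` name are hypothetical, and the encoding detection of step 2 is assumed to have already picked UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: which string representation the caller could use. */
typedef enum { TEXT_INVALID, TEXT_LATIN1, TEXT_WIDE } text_class;

text_class classify_utf8(const unsigned char *s, size_t n) {
    bool latin1 = true;
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        uint32_t cp, min;
        size_t len;
        if (b < 0x80) { i++; continue; }                 /* ASCII fast path */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return TEXT_INVALID;    /* stray continuation byte or 0xF8..0xFF */

        if (n - i < len) return TEXT_INVALID;            /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return TEXT_INVALID; /* not 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return TEXT_INVALID;               /* overlong encoding */
        if (cp > 0x10FFFF) return TEXT_INVALID;          /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return TEXT_INVALID; /* surrogate */

        if (cp > 0xFF) latin1 = false;  /* step 4: needs the UTF-16 form */
        i += len;
    }
    return latin1 ? TEXT_LATIN1 : TEXT_WIDE;
}
```

For example, the two bytes `C3 A9` (é, U+00E9) classify as `TEXT_LATIN1`, the three bytes `E2 82 AC` (€, U+20AC) as `TEXT_WIDE`, and the overlong pair `C0 80` as `TEXT_INVALID`.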

  • by jwilk on 10/20/18, 3:48 PM

  • by kissiel on 10/20/18, 1:29 PM

    I wonder about the Joules per byte. AFAIK AVX units are quite expensive energy-wise.
  • by akarambir on 10/20/18, 11:52 AM

    What do Linux utilities like sed and awk use for text manipulation? They were very slow when I was changing a few table names in a SQL file.