from Hacker News

Decoding UTF8 with parallel extract

by g0xA52A2A on 5/5/24, 9:28 AM with 20 comments

  • by nwellnhof on 5/5/24, 5:13 PM

    In my experiments, anything using lookup tables was slower than a naive, branching decoder on real-world data. Reading from a lookup table in L1 cache has a latency of ~4 cycles, which is prohibitive for the simple case of mostly ASCII bytes. You can easily achieve more than 1.5 GB/s with a naive decoder, while all the "smarter" approaches are capped at ~800 MB/s.
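
    A minimal sketch of such a naive, branching decoder (illustrative code, not the benchmarked implementation; all names are made up):

        #include <stddef.h>
        #include <stdint.h>

        /* Decode one UTF-8 sequence starting at s (s < end): write the
           code point to *cp and return the bytes consumed, or 0 on
           invalid input. The hot path is a single branch for ASCII. */
        static size_t decode_one(const uint8_t *s, const uint8_t *end,
                                 uint32_t *cp)
        {
            uint8_t b = s[0];
            if (b < 0x80) {             /* ASCII: by far the common case */
                *cp = b;
                return 1;
            }
            size_t len = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 0;
            if (len == 0 || (size_t)(end - s) < len)
                return 0;               /* stray continuation or truncated */
            uint32_t c = b & (0x7Fu >> len);  /* payload bits of lead byte */
            for (size_t i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80)    /* must be 10xxxxxx */
                    return 0;
                c = (c << 6) | (s[i] & 0x3F);
            }
            /* reject overlong encodings, surrogates, out-of-range values */
            if (c < (len == 2 ? 0x80u : len == 3 ? 0x800u : 0x10000u)
                || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
                return 0;
            *cp = c;
            return len;
        }
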
  • by pbsd on 5/5/24, 9:00 PM

    The overlong lookup can also be written without a memory lookup as

        0x10000U >> ((0x1531U >> (i*5)) & 31);
    
    On most current x86 chips this has a latency of 3 cycles -- LEA+SHR+SHR -- which is better than an L1 cache hit almost everywhere.
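
    Evaluating the expression for each i (assuming i counts the continuation bytes of the sequence, i.e. length minus one) reproduces the overlong minimums:

        #include <stdio.h>

        /* Prints 0, 0x80, 0x800, 0x10000 for i = 0..3: the smallest
           code point that legitimately needs i continuation bytes. */
        int main(void)
        {
            for (unsigned i = 0; i < 4; i++)
                printf("i=%u -> 0x%X\n",
                       i, 0x10000U >> ((0x1531U >> (i * 5)) & 31));
            return 0;
        }
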
  • by clausecker on 5/6/24, 10:42 AM

    This looks very similar to the approach we recently used to transcode UTF-8 into UTF-16 using AVX-512: https://arxiv.org/pdf/2212.05098

    It's part of simdutf.

  • by masfuerte on 5/6/24, 3:59 PM

    The code is careful not to read past the end of the buffer, but it doesn't explicitly check that there are enough bytes available for the current multibyte character. However, this "end of buffer in middle of character" error is caught later by the check for valid continuation bytes. I thought that was quite neat.
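
    A sketch of how that works out (illustrative, not the article's code): if bytes past the end of the buffer are fetched as 0, a sequence cut off by the buffer boundary fails the same 10xxxxxx continuation check as any other invalid byte.

        #include <stddef.h>
        #include <stdint.h>

        /* Fetch byte i of the sequence, substituting 0 past the end of
           the buffer so we never read out of bounds. */
        static uint8_t byte_at(const uint8_t *s, const uint8_t *end, size_t i)
        {
            return i < (size_t)(end - s) ? s[i] : 0;
        }

        /* Valid continuation bytes have the form 10xxxxxx; the 0 bytes
           produced by truncation fail this test, so "end of buffer in
           middle of character" is rejected with no extra length check. */
        static int continuations_ok(const uint8_t *s, const uint8_t *end,
                                    size_t len)
        {
            for (size_t i = 1; i < len; i++)
                if ((byte_at(s, end, i) & 0xC0) != 0x80)
                    return 0;
            return 1;
        }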