by mdb31 on 5/1/22, 1:12 PM with 85 comments
by Andoryuuta on 5/1/22, 3:09 PM
https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
by watmough on 5/1/22, 5:16 PM
I just got through doing some work with vectorization.
On the simplest workload I have (splitting a 3 MB text file into lines and writing a pointer to each line into an array), GCC will not vectorize the naive loop, though I guess ICC might.
With simple vectorization to AVX-512 (64 unsigned chars per vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, a little better than a 10x speedup, still on just one core, which keeps things simple.
I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector Class Library. It's super easy.
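A minimal sketch of that kind of line-break scan, written with raw AVX-512BW intrinsics rather than VCL so it stands alone; the function name, buffer handling, and output format are illustrative assumptions, not the commenter's actual code:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative only: store a pointer to the start of every line of `text`
// into `out` and return how many lines were found. Requires AVX-512BW and
// assumes `out` is large enough.
size_t find_lines_avx512(const char *text, size_t len, const char **out) {
    size_t count = 0;
    out[count++] = text;                           // the first line starts at offset 0
    const __m512i newlines = _mm512_set1_epi8('\n');
    size_t i = 0;
    for (; i + 64 <= len; i += 64) {
        __m512i block = _mm512_loadu_si512(text + i);
        uint64_t mask = _mm512_cmpeq_epi8_mask(block, newlines);
        while (mask) {                             // one iteration per '\n' in this block
            size_t bit = (size_t)_tzcnt_u64(mask);
            out[count++] = text + i + bit + 1;     // next line begins just after the '\n'
            mask &= mask - 1;                      // clear the lowest set bit
        }
    }
    for (; i < len; i++)                           // scalar tail for the last <64 bytes
        if (text[i] == '\n')
            out[count++] = text + i + 1;
    return count;
}

The pattern is the usual one: compare a 64-byte block against '\n' to get a 64-bit mask, then walk the set bits with tzcnt instead of branching on every byte.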
by mdb31 on 5/1/22, 1:17 PM
Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making the existing REP prefixes a lot faster?
by brrrrrm on 5/1/22, 9:29 PM
I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:
__m512i spaces = _mm512_set1_epi8(' ');
size_t pos = 0;  // write cursor for the compacted output
size_t i = 0;
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
    // 4 input regs, 4 output regs; you can actually do up to 8 because there are 8 mask registers
    __m512i in0 = _mm512_loadu_si512(bytes + i);
    __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
    __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
    __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
    // mark the bytes strictly greater than ' ' (the ones to keep)
    __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
    __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
    __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
    __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
    // compress the kept bytes to the front of each register (VBMI2)
    auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
    auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
    auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
    auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
    _mm512_storeu_si512(bytes + pos, reg0);
    pos += _popcnt64(mask0);
    _mm512_storeu_si512(bytes + pos, reg1);
    pos += _popcnt64(mask1);
    _mm512_storeu_si512(bytes + pos, reg2);
    pos += _popcnt64(mask2);
    _mm512_storeu_si512(bytes + pos, reg3);
    pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
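// (Not from the original post: one possible scalar tail, using the same
// predicate as the masks above and the same bytes/howmany/pos/i variables.)
for (; i < howmany; i++) {
    if ((signed char)bytes[i] > ' ')   // keep bytes strictly greater than ' ', as above
        bytes[pos++] = bytes[i];
}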
You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
by bertr4nd on 5/1/22, 7:11 PM
I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.
by GICodeWarrior on 5/2/22, 2:35 AM
https://ark.intel.com/content/www/us/en/ark/search/featurefi...
The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has quite a good catalog.
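For code that has to decide at run time (rather than looking a part up in a catalog), here is a minimal sketch using GCC/Clang's __builtin_cpu_supports; whether the "avx512vbmi2" feature string is accepted depends on your compiler version, so treat that as an assumption to verify:

#include <cstdio>

int main() {
    // VBMI2 is the extension that provides the byte-level compress
    // (_mm512_maskz_compress_epi8) used elsewhere in this thread.
    if (__builtin_cpu_supports("avx512vbmi2"))
        std::puts("AVX-512 VBMI2 available: take the compressed-store path");
    else
        std::puts("No AVX-512 VBMI2: fall back to AVX2/SSE or scalar code");
    return 0;
}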