by mdb31 on 5/1/22, 1:12 PM with 85 comments
by Andoryuuta on 5/1/22, 3:09 PM
https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...
by watmough on 5/1/22, 5:16 PM
I just got through doing some work with vectorization.
On the simplest workload I have (splitting a 3 MB text file into lines and writing a pointer to each line into an array), GCC will not vectorize the naive loop, though I guess ICC might.
With simple vectorization to AVX-512 (64 unsigned chars per vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, a little better than a 10x speedup, still on just one core, which keeps things simple.
I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector Class Library. It's super easy.
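A minimal sketch of that kind of line-break scan, written with raw AVX-512BW intrinsics rather than VCL so it stands alone; the function name, buffer handling, and output format are illustrative assumptions, not the commenter's actual code:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative only: store a pointer to the start of every line of `text`
// into `out` and return how many lines were found. Requires AVX-512BW and
// assumes `out` is large enough.
size_t find_lines_avx512(const char *text, size_t len, const char **out) {
    size_t count = 0;
    out[count++] = text;                           // the first line starts at offset 0
    const __m512i newlines = _mm512_set1_epi8('\n');
    size_t i = 0;
    for (; i + 64 <= len; i += 64) {
        __m512i block = _mm512_loadu_si512(text + i);
        uint64_t mask = _mm512_cmpeq_epi8_mask(block, newlines);
        while (mask) {                             // one iteration per '\n' in this block
            size_t bit = (size_t)_tzcnt_u64(mask);
            out[count++] = text + i + bit + 1;     // next line begins just after the '\n'
            mask &= mask - 1;                      // clear the lowest set bit
        }
    }
    for (; i < len; i++)                           // scalar tail for the last <64 bytes
        if (text[i] == '\n')
            out[count++] = text + i + 1;
    return count;
}

The pattern is the usual one: compare a 64-byte block against '\n' to get a 64-bit mask, then walk the set bits with tzcnt instead of branching on every byte.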
by mdb31 on 5/1/22, 1:17 PM
Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making the existing REP prefixes a lot faster?
by brrrrrm on 5/1/22, 9:29 PM
I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:
__m512i spaces = _mm512_set1_epi8(' ');
size_t pos = 0;  // write cursor for the compacted output
size_t i = 0;
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
    // 4 input regs, 4 output regs; you can actually do up to 8 because there are 8 mask registers
    __m512i in0 = _mm512_loadu_si512(bytes + i);
    __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
    __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
    __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
    // mark the bytes strictly greater than ' ' (the ones to keep)
    __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
    __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
    __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
    __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
    // compress the kept bytes to the front of each register (VBMI2)
    auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
    auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
    auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
    auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
    _mm512_storeu_si512(bytes + pos, reg0);
    pos += _popcnt64(mask0);
    _mm512_storeu_si512(bytes + pos, reg1);
    pos += _popcnt64(mask1);
    _mm512_storeu_si512(bytes + pos, reg2);
    pos += _popcnt64(mask2);
    _mm512_storeu_si512(bytes + pos, reg3);
    pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
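// (Not from the original post: one possible scalar tail, using the same
// predicate as the masks above and the same bytes/howmany/pos/i variables.)
for (; i < howmany; i++) {
    if ((signed char)bytes[i] > ' ')   // keep bytes strictly greater than ' ', as above
        bytes[pos++] = bytes[i];
}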
You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
by bertr4nd on 5/1/22, 7:11 PM
I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.
by GICodeWarrior on 5/2/22, 2:35 AM
https://ark.intel.com/content/www/us/en/ark/search/featurefi...
The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has quite a good catalog.
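For code that has to decide at run time (rather than looking a part up in a catalog), here is a minimal sketch using GCC/Clang's __builtin_cpu_supports; whether the "avx512vbmi2" feature string is accepted depends on your compiler version, so treat that as an assumption to verify:

#include <cstdio>

int main() {
    // VBMI2 is the extension that provides the byte-level compress
    // (_mm512_maskz_compress_epi8) used elsewhere in this thread.
    if (__builtin_cpu_supports("avx512vbmi2"))
        std::puts("AVX-512 VBMI2 available: take the compressed-store path");
    else
        std::puts("No AVX-512 VBMI2: fall back to AVX2/SSE or scalar code");
    return 0;
}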