by thecompilr on 6/4/22, 9:16 PM
I'm the co-author of one of the papers referenced in the blogpost (Fast Quicksort Implementation Using AVX Instructions). We did write the AVX512 code back in 2015; we just had nowhere to run it, at least publicly. The paper also very explicitly says that the lookup tables can instead be replaced by the AVX512 compress instructions. The code for that paper is available at https://github.com/vkrasnov/avx_qsort
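A rough sketch of what replacing the shuffle lookup tables with a compress instruction looks like (my own illustration, not the paper's or the repo's code; partition_block and its two-pointer convention are made up here and assume non-overlapping buffers):

    // Partition one block of 16 int32 values around a pivot using an
    // AVX-512 compress store instead of a permutation lookup table.
    #include <immintrin.h>
    #include <cstdint>

    static inline void partition_block(const int32_t* src, int32_t pivot,
                                       int32_t** left, int32_t** right) {
        __m512i   data = _mm512_loadu_si512(src);
        __m512i   piv  = _mm512_set1_epi32(pivot);
        __mmask16 lo   = _mm512_cmplt_epi32_mask(data, piv);  // lanes < pivot
        int       nlo  = _mm_popcnt_u32(lo);                  // how many go left
        _mm512_mask_compressstoreu_epi32(*left, lo, data);    // pack small lanes
        *left  += nlo;
        *right -= 16 - nlo;
        _mm512_mask_compressstoreu_epi32(*right, (__mmask16)~lo, data);  // the rest
    }

(Compile with -mavx512f; the popcount may need -mpopcnt on some toolchains.)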
by jart on 6/4/22, 5:26 PM
by carbocation on 6/4/22, 9:52 PM
by gww on 6/4/22, 6:25 PM
I am always amazed at the algorithms re-implemented using SIMD. One of my favorites is the Striped Smith-Waterman approach used for sequence alignment. Does anyone have any good resources on learning to use SIMD? I've found it hard to take the "plunge".
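Not a resource recommendation, but the usual way in is less scary than it looks. A minimal sketch of the basic pattern (my own toy example, nothing to do with Smith-Waterman): load a full register, operate on all lanes at once, store, then finish the leftovers with a scalar tail.

    #include <immintrin.h>
    #include <cstddef>

    // a[i] += b[i] for all i: 8 floats at a time with AVX, scalar tail at the end.
    void add_arrays(float* a, const float* b, std::size_t n) {
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i) a[i] += b[i];
    }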
by janwas on 6/4/22, 6:12 PM
Author here, happy to discuss.
by orlp on 6/4/22, 5:21 PM
SIMD based sorting isn't new, e.g. djbsort:
https://sorting.cr.yp.to/. I find it strange the paper doesn't cite it or compare to it at all.
EDIT: to be fair it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort, https://arxiv.org/abs/1704.08579. A comparison would still be nice though.
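For anyone wondering what these SIMD sorters share under the hood: djbsort and Bramas' work are built on sorting networks, whose basic operation is a branch-free compare-exchange done with vector min/max. A tiny sketch of that building block (my own example, not code from either project) -- one min/max pair sorts four independent columns at once:

    #include <smmintrin.h>  // SSE4.1, compile with -msse4.1
    #include <cstdio>

    // Per-lane compare-exchange: after the call, a holds the smaller and
    // b the larger element of each of the four lanes.
    static inline void compare_exchange(__m128i& a, __m128i& b) {
        __m128i lo = _mm_min_epi32(a, b);
        __m128i hi = _mm_max_epi32(a, b);
        a = lo;
        b = hi;
    }

    int main() {
        // Three rows of four lanes: each column is an independent 3-element array.
        __m128i r0 = _mm_setr_epi32(3, 9, 4, 7);
        __m128i r1 = _mm_setr_epi32(1, 8, 6, 2);
        __m128i r2 = _mm_setr_epi32(5, 0, 2, 9);
        // 3-element sorting network: comparators (0,1), (1,2), (0,1).
        compare_exchange(r0, r1);
        compare_exchange(r1, r2);
        compare_exchange(r0, r1);
        int out[12];
        _mm_storeu_si128((__m128i*)&out[0], r0);
        _mm_storeu_si128((__m128i*)&out[4], r1);
        _mm_storeu_si128((__m128i*)&out[8], r2);
        for (int c = 0; c < 4; ++c)
            std::printf("column %d: %d %d %d\n", c, out[c], out[c + 4], out[c + 8]);
    }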
by VHRanger on 6/4/22, 5:09 PM
I love work like this and SIMD-JSON, which vectorize problems that "should be" sequential to find massive performance gains.
by DeathArrow on 6/4/22, 7:57 PM
by jerryjerryjerry on 6/4/22, 7:08 PM
This is great, and it can definitely improve sort performance in database and big data projects. I can immediately imagine it being a perfect match for my project, the columnar processing engine TiFlash (
https://github.com/pingcap/tiflash).
by cbetti on 6/4/22, 5:10 PM
Not having played with SIMD much myself, does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?
In other words, do you have to be more careful when integrating these wide operators to preserve some resources for other operations?
by MrYellowP on 6/4/22, 8:07 PM
Man, I was thinking about this problem and came up with some weird solutions.
I actually thought all this stuff had already been solved?
Is there a point in still trying to improve the efficiency of sorting?
I can do that! I'd love such a challenge if there was a point to it!
by slackerIII on 6/4/22, 5:15 PM
What would it take for something like this to make it into Postgres?
by olliej on 6/4/22, 6:38 PM
Re-implementing stuff with SIMD is always amazing to me. I have done stuff with SIMD before: 4-element float vector operations and basic arithmetic on arrays of floats.
Those are things that are super simple, once you get past the scary mystique of SIMD.
It’s stuff like this that should be getting all the witch burning :D
by skybrian on 6/4/22, 7:30 PM
Will browsers start using this code for typed array sorting? Maybe they already do?
by BeeOnRope on 6/4/22, 9:05 PM
Are there instructions for building & reproducing the performance results?
by wly_cdgr on 6/4/22, 7:02 PM
SpaceChem players have known this for years
by aaaaaaaaaaab on 6/4/22, 10:27 PM
This works only with integers, right? Then radix sort is still gonna be faster on sufficiently large inputs…
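For reference, this is the kind of baseline that comparison is about -- a plain scalar LSD radix sort for 32-bit keys (my own untuned sketch, not from the paper or its benchmarks):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Least-significant-digit radix sort: one byte per pass, four passes total.
    void radix_sort_u32(std::vector<std::uint32_t>& v) {
        std::vector<std::uint32_t> buf(v.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::size_t count[257] = {0};
            for (std::uint32_t x : v) ++count[((x >> shift) & 0xFF) + 1];
            for (int d = 0; d < 256; ++d) count[d + 1] += count[d];  // prefix sums
            for (std::uint32_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
            v.swap(buf);
        }
    }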
by sylware on 6/5/22, 11:29 AM
This code is worth a clean assembly port.
by diamondlovesyou on 6/5/22, 12:33 AM
If P=NP, it's only going to be possible because of vectorization.
by alecco on 6/4/22, 5:09 PM
> By comparison, the standard library reaches 58/128/117 MB/s on the same CPU, so we have managed a 9-19x speedup depending on the type of numbers.
That's pretty disingenuous. The standard implementation is known to be terribly slow because of the constraints it has to satisfy. They should compare against the current state of the art.
by unnouinceput on 6/4/22, 5:22 PM
Quote: "Today we're sharing open source code that can sort arrays of numbers about ten times as fast as the C++ std::sort...." and proceeds to explain about SIMD instructions.
Well, to me sounds that C++ std::sort is simply lacking an optimization which can very easy be implemented in next iteration of the compiler/linker. I had high hopes this would be a really breakthrough algorithm, not lack of a simple optimization using some instruction set that compiler/linker didn't implement. If they want, CPU vendors would make an even more awesome and faster instruction set, let's say SIMD2, which would leave this one in the dust.