from Hacker News

Using SIMD for Parallel Processing in Rust

by nbrempel on 7/1/24, 4:29 PM with 42 comments

  • by oconnor663 on 7/1/24, 6:25 PM

    There are a lot of factors that go into how fast a hash function is, but the case we're showing in the big red chart at https://github.com/BLAKE3-team/BLAKE3 is almost entirely driven by SIMD. It's a huge deal.
  • by ww520 on 7/1/24, 9:37 PM

    Zig actually has a very nice abstraction for SIMD in the form of vector programming. The size of the vector is agnostic to the underlying cpu architecture. The compiler or LLVM will generate code for using SIMD128, 256, or 512 registers. And you are just programming straight vectors.
  • by thomashabets2 on 7/1/24, 6:14 PM

    The portable SIMD is quite nice. We can't really trust a "sufficiently smart compiler" to make the best SIMD decisions, since it may not see through what you're actually doing.

    https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and code at https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr... showed me getting 3x faster using portable SIMD, on my first attempt.
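
    A minimal sketch of why the compiler may not get there on its own, assuming a plain float sum (this is not the zipbrute code). Float addition is not associative, so the optimizer generally will not reorder a single-accumulator sum into lanes. Writing the lanes out by hand states that the reordering is acceptable. `std::simd` was still behind the nightly-only `portable_simd` feature when this thread was written, so the sketch uses plain arrays on stable Rust; `LANES`, `sum_scalar` and `sum_lanes` are hypothetical names.

```rust
/// Lane count for this sketch; 8 is an arbitrary choice (8 x f32 = 256 bits).
const LANES: usize = 8;

/// Naive sum: one accumulator, strictly left to right.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Lane-wise sum: LANES independent accumulators, combined at the end.
/// This changes the order of the additions, which is exactly the
/// reordering the compiler is not allowed to pick by itself for floats.
fn sum_lanes(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let chunks = xs.chunks_exact(LANES);
    let tail = chunks.remainder();
    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Horizontal reduction of the lanes, then the leftover elements.
    let mut total: f32 = acc.iter().sum();
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    let xs: Vec<f32> = (1..=1000).map(|i| i as f32).collect();
    println!("scalar = {}, lanes = {}", sum_scalar(&xs), sum_lanes(&xs));
}
```

    Whether the inner loop actually becomes vector instructions still depends on the compiler version, optimization level and target, so it is worth checking the generated assembly.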

  • by nbrempel on 7/1/24, 6:15 PM

    Thanks for reading everyone. I’ve gotten some feedback over on Reddit as well that the example is not effectively showing the benefits of SIMD. I plan on revising this.

    One of my goals in writing these articles is to learn, so feedback is more than welcome!

  • by eachro on 7/1/24, 5:57 PM

    It's cool that SIMD primitives exist in Rust's std lib. I've wanted to mess around a bit more with SIMD in Python, but I don't think native support exists, or you have to go down to C/C++ bindings to actually mess around with it (last I checked at least; please correct me if I'm wrong).
  • by anonymousDan on 7/1/24, 6:50 PM

    The interesting question for me is whether Rust makes it easier for the compiler to extract SIMD parallelism automatically given the restrictions imposed by its type system.
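
    One property that plausibly helps here: a `&mut [T]` and a `&[T]` are guaranteed not to overlap, so the optimizer does not need a runtime alias check before vectorizing a loop over them (in C the same guarantee needs `restrict`). A minimal sketch, with `add_in_place` as a hypothetical name; whether it is actually vectorized depends on the compiler version, flags and target.

```rust
/// Adds `src` into `dst` element-wise.
/// `dst` and `src` cannot alias because one is `&mut` and the other is `&`.
/// Iterating with `zip` also avoids per-element bounds checks.
fn add_in_place(dst: &mut [i32], src: &[i32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        // wrapping_add keeps the loop free of overflow panics in debug builds.
        *d = d.wrapping_add(*s);
    }
}

fn main() {
    let mut dst = vec![1, 2, 3, 4];
    let src = vec![10, 20, 30, 40];
    add_in_place(&mut dst, &src);
    println!("{:?}", dst);
}
```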
  • by IshKebab on 7/1/24, 8:56 PM

    Minor nit: RISC-V Vector isn't SIMD. It's actually like ARM's Scalable Vector Extension. Unlike traditional SIMD, the code is agnostic to the register width, and different hardware can run the same code with different widths.

    There is also a traditional SIMD extension (P I think?) but it isn't finished. Most focus has been on the vector extension.

    I am wondering how and if Rust will support these vector processing extensions.

  • by brundolf on 7/2/24, 5:33 AM

    std::simd is a delight. I'd never done SIMD before in any language, and it was very easy and natural (and safe!) to introduce to my code, and just automatically works cross-platform. Can't recommend it enough
  • by neonsunset on 7/1/24, 7:52 PM

    If you like SIMD and would like to dabble in it, I can strongly recommend trying it out in C# via its platform-agnostic SIMD abstraction. It is very accessible especially if you already know a little bit of C or C++, and compiles to very competent codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD and, in .NET 9, SVE1/2:

    https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

    Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:

    https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

    Other examples:

    CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

    Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

    Default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods like here where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...
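
    For readers unfamiliar with the technique, here is a rough sketch of the usual idea behind UTF-8 code point counting, written in Rust to match the article's language. It is not taken from U8String or simdutf: count every byte that is not a continuation byte (`10xxxxxx`). It assumes the input is valid UTF-8; `CHUNK`, `is_lead` and `count_code_points` are hypothetical names. The fixed-size chunks with a small per-chunk counter are the shape that maps onto vector compare-and-accumulate.

```rust
/// Bytes per chunk in this sketch; 32 is an arbitrary choice.
const CHUNK: usize = 32;

/// True for any byte that starts a code point, i.e. is not `10xxxxxx`.
fn is_lead(b: u8) -> bool {
    (b & 0xC0) != 0x80
}

/// Counts code points in `bytes`, assuming the input is valid UTF-8.
fn count_code_points(bytes: &[u8]) -> usize {
    let chunks = bytes.chunks_exact(CHUNK);
    let tail = chunks.remainder();
    let mut total = 0usize;
    for chunk in chunks {
        // At most CHUNK (32) lead bytes per chunk, so a u8 counter cannot overflow.
        let mut n = 0u8;
        for &b in chunk {
            n += is_lead(b) as u8;
        }
        total += n as usize;
    }
    // Leftover bytes that did not fill a whole chunk.
    total + tail.iter().filter(|&&b| is_lead(b)).count()
}

fn main() {
    let s = "héllo, wörld 🦀";
    println!("{} vs {}", count_code_points(s.as_bytes()), s.chars().count());
}
```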

    There are more advanced scenarios. Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...

    Note that practically none of these need to reach out to platform-specific intrinsics (except for replacing movemask emulation with an efficient ARM64 alternative); they use the same path for all platforms, varied by vector width rather than by specific ISA.