from Hacker News

SIMD-friendly algorithms for substring searching (2016)

by Rendello on 6/14/25, 3:31 AM with 35 comments

by burntsushi on 6/14/25, 12:15 PM
The "AVX2 (generic)" approach is roughly what ripgrep uses (via Rust's `regex` crate) to accelerate most searches. Even something like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck `Sherlock` out of the regex and search that.
The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...
The main difference with the algorithm presented here is that instead of always using the first and last bytes of the needle, a heuristic is used to try to pick two bytes that occur less frequently according to a background frequency distribution.
It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:
```
    $ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
    Engine                       Version  Geometric mean of speed ratios  Benchmark count
    ------                       -------  ------------------------------  ---------------
    rust/memchr/memmem/prebuilt  2.7.1    1.07                            57
    rust/memchr/memmem/oneshot   2.7.1    1.39                            54
    libc/memmem/oneshot          unknown  3.15                            54
    rust/memchr/memmem/twoway    2.7.1    5.26                            54
```
The "prebuilt" benchmarks also demonstrate the API deficiencies of routines like `memmem`. Because of their API, they need to rebuild the state necessary to execute a search for the needle given. But it is often the case that the needle is invariant across repeated calls with different haystacks. So you end up doing that repeated work for no gain.
by ncruces on 6/14/25, 7:15 AM
I implemented these in my attempt to SIMD optimize Wasm/WASI libc.
https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc
For known haystack lengths, and large needles, I found it useful to augment the approach with Quick Search.
https://igm.univ-mlv.fr/~lecroq/string/node19.html
by mwsherman on 6/14/25, 5:34 PM
C# has implemented some SIMD for IndexOf: https://github.com/dotnet/runtime/pull/63285
by clauderoux on 6/14/25, 7:46 AM
I implemented many different algorithms for searching and splitting strings using SMID methods as well: https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I used a different algorithm than the ones presented here.
by klaussilveira on 6/14/25, 4:21 PM
Reminds me of https://github.com/nikneym/hparse
Which uses SIMD algos for fast HTTP parsing.
by eska on 6/14/25, 11:22 AM
The swar algorithm has UB because it casts the 1 byte aligned data to 8 byte aligned. Perhaps it suffers some performance issues from unaligned loads?
by sylware on 6/14/25, 10:11 AM
What is the string sizes thresold for efficiency?
Because the SIMD (aligned, size computations, etc) setup code is usually more expensive that byte-oriented basic search for "small" strings.
Yes, it depends heavily on the hardware architecture.
by rurban on 6/14/25, 9:25 AM
We know that the libc strstr performs horrible, but musl is fast and state of the art.
Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos
by mgaunard on 6/14/25, 9:50 AM
missing (2018) ?
by azhenley on 6/14/25, 1:23 PM
Now if I can just use SIMD directly from Python without calling out to another language.