from Hacker News

SymSpell: 1M times faster spelling correction

by mci on 3/6/22, 10:15 AM with 32 comments

by LordGrey on 3/6/22, 2:51 PM
What they are calling "Symmetric Delete" seems to be the same as an older concept called "deletion neighborhoods". The latter term was coined in an academic paper by Thomas Bocek, Ela Hunt, and Burkhard Stiller from the University of Zurich, titled "Fast Similarity Search in Large Dictionaries"[1]. The work described there was expanded in a paper by Daniel Karch, Dennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology, titled "Improved Fast Similarity Search in Dictionaries"[2]. Both papers deal with efficiently searching for similar string values, given a query string.
LookupCompound and WordSegmentation, algorithms built on Symmetric Delete/Deletion Neighborhoods, are pretty interesting.
[1] https://fastss.csg.uzh.ch/ifi-2007.02.pdf [2] https://arxiv.org/abs/1008.1191v2
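For anyone unfamiliar with the idea, here is a minimal Python sketch of a deletion-neighborhood index. It is only an illustration, not SymSpell's or FastSS's actual code, and the names `deletes`, `build_index`, and `candidates` are made up for this example. Every dictionary word is indexed under all strings reachable by deleting up to `max_distance` characters, and a query is looked up through its own deletions.

```python
from itertools import combinations


def deletes(word, max_distance):
    """Return word plus every string made by deleting up to max_distance chars."""
    results = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        # Choose which character positions to keep; the rest are deleted.
        for keep in combinations(range(len(word)), len(word) - k):
            results.add("".join(word[i] for i in keep))
    return results


def build_index(words, max_distance):
    """Map each deletion variant to the dictionary words that produce it."""
    index = {}
    for w in words:
        for d in deletes(w, max_distance):
            index.setdefault(d, set()).add(w)
    return index


def candidates(index, query, max_distance):
    """Dictionary words sharing at least one deletion variant with the query."""
    found = set()
    for d in deletes(query, max_distance):
        found |= index.get(d, set())
    return found


index = build_index(["hello", "help", "world"], max_distance=1)
print(candidates(index, "helo", 1))
print(candidates(index, "wrold", 1))
```

The result is only a candidate set: two strings can share a deletion variant while being further apart than `max_distance`, so a real implementation would still verify each candidate with an actual edit-distance computation.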
by kevincox on 3/6/22, 1:33 PM
This seems very focused on spelling. What I find is key for a good spell correction system is how the words sound. For example, I find that Firefox often can't find the word I meant for its suggestion list, but pasting the misspelt word into Google gets the right result 99% of the time as the one provided option.
I wonder how difficult it would be to adapt this to work on sounds or other frequent typos and misspelling sources instead of just characters. It seems it should be possible if you can define a decent "normalization" function.
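To make the "normalization" idea concrete, here is a rough Python sketch under stated assumptions. The function `phonetic_key` is hypothetical and deliberately crude (loosely inspired by Soundex-style consonant groups, not a faithful implementation of any real phonetic algorithm): it drops vowels and h/w/y and maps similar-sounding consonants to the same digit. Words are then indexed by that key instead of by their spelling.

```python
from collections import defaultdict

# Assumed consonant groups; letters not listed (vowels, h, w, y) are dropped.
_GROUPS = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def phonetic_key(word):
    """Crude sound-based key: consonant group codes, repeated codes collapsed."""
    key = []
    for ch in word.lower():
        code = _GROUPS.get(ch)
        if code is not None and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)


def build_sound_index(words):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(set)
    for w in words:
        index[phonetic_key(w)].add(w)
    return index


def sound_alikes(index, query):
    """Dictionary words whose key equals the query's key."""
    return index.get(phonetic_key(query), set())


index = build_sound_index(["Viktor", "Patric", "phone"])
print(sound_alikes(index, "Victor"))
print(sound_alikes(index, "fone"))
```

In principle the same keys could be fed into a delete-based index rather than an exact-match dictionary, so that near-misses in sound are tolerated as well; whether that works well in practice depends entirely on the quality of the normalization.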
by cb321 on 3/6/22, 6:54 PM
As jamra correctly points out in a sibling comment, the entry point to this (which gets a lot of traction on HN) is indeed attacking a strawman tutorial-written-on-an-airplane-in-Python algorithm. So, the 1M speed-up is very over-hyped.
That said, the technique is not wholly without merit, but does carry certain "average-worst case" trade-offs related to latency in the memory/storage system because of SymSpell's reliance upon large hash tables. For details see https://github.com/c-blake/suggest
EDIT: Also - I am unaware of other implementations even saving the large, slow-to-compute index. The others I am aware of seem to rebuild the large index every time which seems kind of lame. EDIT2 - I guess there is a recent Rust one that is persistent as well as the "mmap & go" Nim one. Still, what should be standard is very rare.
by jamra on 3/6/22, 3:42 PM
This seems very suspicious to me. They’re comparing performance to a tutorial blog post that is extremely inefficient.
How about comparing it to a Levenshtein automaton or another state-of-the-art approach?
by injidup on 3/6/22, 9:28 PM
Slightly OT, but I have been trying to look up friends' names using Android Auto speech recognition whilst driving. I have two Austrian friends, "Viktor" and "Patric". If I say "Hey Google, call Viktor", Google says "Sorry, you have no Victor in your phone book." Same with Patric: "Sorry, you have no Patrick in your phone book." I'm surprised that not even basic scoring is done when looking up names in the phone book, with the most likely one offered.
*EDIT* I just found a solution to this problem. https://support.google.com/assistant/thread/559644?hl=en&msg... You can supply a phonetic name and this helps google match. This seems a bit low tech though.
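As an illustration of what "basic scoring" could look like, here is a small Python sketch using the standard library's `difflib.get_close_matches`. This is only an assumption about one possible approach, not how Google Assistant works; `best_contact` and the contact list are invented for the example.

```python
import difflib


def best_contact(heard, contacts, cutoff=0.6):
    """Return the contact most similar to the recognized name, or None.

    cutoff is the minimum similarity ratio (0..1) a contact must reach.
    The comparison is case-sensitive.
    """
    matches = difflib.get_close_matches(heard, contacts, n=1, cutoff=cutoff)
    return matches[0] if matches else None


contacts = ["Viktor", "Patric", "Anna"]
print(best_contact("Victor", contacts))
print(best_contact("Patrick", contacts))
```

With a scorer like this, the recognized "Victor" would fall back to the closest phone book entry instead of failing outright when there is no exact match.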
by danielscrubs on 3/6/22, 12:48 PM
It was a bit fun comparing the Haskell version and the C# version:
https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/S...
https://github.com/cbeav/symspell/blob/master/src/SymSpell.h...
by nicoburns on 3/6/22, 4:01 PM
I'd love to see better open source spell checking. The state-of-the-art spell checkers (in, say, MS Word or Google Search) are excellent and more than good enough for my needs. And yet the spell checkers in browsers (e.g. Chrome and Firefox) and other apps tend to be terrible, only catching very basic cases and often not being able to suggest the correct word.
Does anyone have any insight into what's holding this back?
by tgv on 3/6/22, 1:20 PM
So, this is time vs memory?
by wodenokoto on 3/6/22, 1:13 PM
It says it does fuzzy search as well. Wonder how it compares to fzf
by tootie on 3/6/22, 2:51 PM
Is this how modern spell checkers work? I assumed they were more heuristic at this point. For example, Google's "did you mean" is based on mapping common misspellings to what people actually clicked on.
by amelius on 3/6/22, 6:36 PM
Can this be used to match DNA sequences?
Or sounds (like Shazam does)?