by g0xA52A2A on 4/14/24, 7:55 AM with 46 comments
by HarHarVeryFunny on 4/16/24, 12:48 AM
Consider the normal naive way to search for a string in a document. First you search for the next instance of the first letter of your search string, then try to match the entire string at that position. If the match fails then you advance your search position by one character and repeat (looking for next occurrence of first letter, etc).
The Knuth-Morris-Pratt algorithm improves upon this naive approach by usually advancing by more than one character after a failed match, thereby speeding up the search. It does this by taking advantage of its knowledge of the search string and the position at which the match failed.
To get the idea, consider searching for the string "explosion" .. first we find the next "e", then try matching the rest of the word. Say we match "explo" then fail (perhaps the document had "explode"), so now we want to start over and find the next "e" in the document... What KMP would do here is note that the document matched the first 5 letters "explo", none of which (other than 1st letter) are an "e", so it can advance by 5 characters, not just 1, to start looking for the next "e".
The amounts it can advance at each failed match position are pre-calculated to be efficient.
by vladimirralev on 4/16/24, 6:01 AM
And I can certainly implement the Z algorithm in a few mins while I struggle to implement the KMP off the top of my head.
Edit: perhaps a bit better article on this https://www.geeksforgeeks.org/z-algorithm-linear-time-patter...
by Terr_ on 4/16/24, 3:50 AM
Like KMP, it involves a preliminary step of analyzing the search term(s), and from them it builds a directed graph representing rules for what to do while consuming the stream of document-characters.
To find all occurrences, it's something like O(document_length + all_target_words_combined_length + number_of_hits) .
by mehulashah on 4/14/24, 1:10 PM
by hzay on 4/16/24, 6:35 AM
I abandoned this project without completing KMP. If anyone is interested in this sort of algorithm explanations, I'd love to collaborate / finish this project.
by andrewp123 on 4/16/24, 5:45 AM
You want to grow the window as big as possible to match the substring. The data structure you naturally come up with to do checks efficiently here is the one you use in KMP.
by tome on 4/16/24, 7:14 AM
https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...
And mehulashah's comment that the thread claims was posted 7 hours ago was also posted two days ago, according to Algolia
https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...
I'm aware of the second chance pool, but I don't think it can be that
by dreamcompiler on 4/16/24, 4:33 AM
I haven't implemented KMP but I might try after reading this.
by jkuria on 4/16/24, 5:26 AM
by mrbonner on 4/16/24, 2:08 PM
by amelius on 4/16/24, 1:30 PM
And for these related problems:
- fastest regexp search
- fastest search, matching minimum edit distance
by rugina on 4/16/24, 10:12 AM
by mehulashah on 4/16/24, 3:10 AM