from Hacker News

Knuth–Morris–Pratt illustrated

by g0xA52A2A on 4/14/24, 7:55 AM with 46 comments

  • by HarHarVeryFunny on 4/16/24, 12:48 AM

    The explanation is a bit lengthy, but the idea is simple.

    Consider the normal naive way to search for a string in a document. First you search for the next instance of the first letter of your search string, then try to match the entire string at that position. If the match fails then you advance your search position by one character and repeat (looking for next occurrence of first letter, etc).

    The Knuth-Morris-Pratt algorithm improves upon this naive approach by usually advancing by more than one character after a failed match, thereby speeding up the search. It does this by taking advantage of its knowledge of the search string and the position at which the match failed.

    To get the idea, consider searching for the string "explosion" .. first we find the next "e", then try matching the rest of the word. Say we match "explo" then fail (perhaps the document had "explode"), so now we want to start over and find the next "e" in the document... What KMP would do here is note that the document matched the first 5 letters "explo", none of which (other than 1st letter) are an "e", so it can advance by 5 characters, not just 1, to start looking for the next "e".

    The amounts it can advance at each failed match position are pre-calculated to be efficient.

  • by vladimirralev on 4/16/24, 6:01 AM

    Shockingly, there is a related algorithm that is widely considered much easier to understand and implement. https://cp-algorithms.com/string/z-function.html

    And I can certainly implement the Z algorithm in a few mins while I struggle to implement the KMP off the top of my head.

    Edit: perhaps a bit better article on this https://www.geeksforgeeks.org/z-algorithm-linear-time-patter...

  • by Terr_ on 4/16/24, 3:50 AM

    Another fun one is the Aho-Corasick algorithm [0] which is great (i.e. more efficient) when you have multiple different target strings you want to find and you don't want to do multiple scans of the document.

    Like KMP, it involves a preliminary step of analyzing the search term(s), and from them it builds a directed graph representing rules for what to do while consuming the stream of document-characters.

    To find all occurrences, it's something like O(document_length + all_target_words_combined_length + number_of_hits) .

    [0] https://en.wikipedia.org/wiki/Aho-Corasick_algorithm

  • by mehulashah on 4/14/24, 1:10 PM

    This starts off well, but the Haskell makes it hard for me to understand. Perhaps I should learn Haskell?
  • by hzay on 4/16/24, 6:35 AM

    Nice. I tried visualizing this algorithm here - https://visuallyexplain.pages.dev/kmp/algorithm-(wip) . The sidebar has a bunch of completed algos -- DFS, BFS, binary heap queries, dijkstra's, union find, topological sort, quicksort, etc.

    I abandoned this project without completing KMP. If anyone is interested in this sort of algorithm explanations, I'd love to collaborate / finish this project.

  • by andrewp123 on 4/16/24, 5:45 AM

    It’s really easy to come up with KMP. Just write an algorithm that searches for a substring p in string s. Make your algorithm efficient by sliding two pointers down s to find the biggest possible match, often called “sliding window”.

    You want to grow the window as big as possible to match the substring. The data structure you naturally come up with to do checks efficiently here is the one you use in KMP.

  • by tome on 4/16/24, 7:14 AM

    Is this post showing a bug in HN? It says it was posted 10 hours ago, but Algolia says it was posted 2 days ago

    https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...

    And mehulashah's comment that the thread claims was posted 7 hours ago was also posted two days ago, according to Algolia

    https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...

    I'm aware of the second chance pool, but I don't think it can be that

    https://news.ycombinator.com/item?id=26998308

  • by dreamcompiler on 4/16/24, 4:33 AM

    Good article. I've implemented Boyer-Moore several times and it's ridiculously fast on typical English text, especially with long search strings.

    I haven't implemented KMP but I might try after reading this.

  • by jkuria on 4/16/24, 5:26 AM

    Rabin-Karp algorithm is also a fun one.

    https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

  • by mrbonner on 4/16/24, 2:08 PM

    And some interviewers ask us to come up with it in a 45-minute coding session!
  • by amelius on 4/16/24, 1:30 PM

    What's the fastest known algorithm?

    And for these related problems:

    - fastest regexp search

    - fastest search, matching minimum edit distance

  • by rugina on 4/16/24, 10:12 AM

    We, humankind managed to get a good optimisation for this problem by using spaces between words. When trying these algorithms for searching a word in a string of text, I was surprised how little they could improve vs just skipping to the next word.
  • by mehulashah on 4/16/24, 3:10 AM

    Sounds like I’m learning Haskell soon.