Monday, 28 June 2010
Aho-Corasick (multiple pattern match) in PHP
My initial solution was to split the text into an array of substrings and use array_diff to remove those in my blacklist array. I don't know the asymptotic cost of array_diff algorythm, but using it with plenty of items (featuring long titles, summaries and descriptions) soon appeared a bit too expensive.
Recently I studied a lot of pattern matching algorythms, including the well known "Aho-Corasick" which checks occourrences of multiple patterns "at once" inside a text of length n in O(M+n), being M the total number of characters composing the patterns. Useful for virus signature detection and bioinformatics, I did not find its implementation anywhere, so I did my own!
Download the code
As you could see inside the code, this algorythm is cool because one can precalculate delta/phi transition tables and cache them, brutally speeding up computations on many different texts made of many terms.
If you find this code useful, or if you have any kind of suggestion to improve it, please let me know. :)