When I started programming Mergifier, my "RSS feeds dispenser", I knew I was going to need some kind of terms filtering. It's useless to take into account the most common words, for they don't bring any information.
My initial solution was to split the text into an array of substrings and use array_diff to remove those in my blacklist array. I don't know the asymptotic cost of array_diff algorythm, but using it with plenty of items (featuring long titles, summaries and descriptions) soon appeared a bit too expensive.
Recently I studied a lot of pattern matching algorythms, including the well known "Aho-Corasick" which checks occourrences of multiple patterns "at once" inside a text of length n in O(M+n), being M the total number of characters composing the patterns. Useful for virus signature detection and bioinformatics, I did not find its implementation anywhere, so I did my own!
Download the code
As you could see inside the code, this algorythm is cool because one can precalculate delta/phi transition tables and cache them, brutally speeding up computations on many different texts made of many terms.
If you find this code useful, or if you have any kind of suggestion to improve it, please let me know. :)
3 comments:
The download link doesnt work..
Hi!
I have implemented AhoCorasick PHP Extension.
Checkout here:
https://github.com/ph4r05/php_aho_corasick
It is interface to Aho-Corasick C library in MultiFast project ( http://sourceforge.net/projects/multifast/?source=dlp )
Since it is implemented as PHP extension in C it is quite fast.
Hello,
I have implemented Aho-Corasick search algorithm as PHP Extension, written in C.
Checkout here:
https://github.com/ph4r05/php_aho_corasick
It is interface to Aho-Corasick C library from MultiFast project ( http://sourceforge.net/projects/multifast/?source=dlp ).
Since it is written in C as extension it is pretty fast. Maybe it will be useful for somebody.
Post a Comment