Monday 7 June 2010

Mergifier: a Bayesian RSS filter

RSS is a popular format for publishing information. XML-based, and therefore machine-readable, it was conceived as a "container" for news from frequently updated web pages. Personally, I love using iGoogle as an aggregator to keep, on a single screen, lots of feeds that keep me constantly updated about interesting things happening around the world.

We introduced this technology on Bicocca's website early on, to spread interesting news in an aggregated way; we placed RSS feeds on pages featuring news, events and so forth.

We recently discussed this system and concluded that it is not optimal. We think it would be more effective to route all this information into single feeds aimed at specific audiences, instead of fragmenting everything into many small, rarely updated feeds. Stefano has partially mitigated the problem by creating a map of the feeds, but we agreed it is still sub-optimal: we would prefer something able to gather everything in a single channel and "skim" the news according to the reader's interests.

A similar project is Yahoo! Pipes, which I discovered during the knowledge representation course. YP allows, among other things, "summing up" the contributions of different feeds into a single one, but it cannot classify the outgoing items (or, at least, I can't find anything like that :) ). Reading some posts on the project's board, it seems I'm not the only one lamenting this gap. Not a problem, because in the meantime I have reinvented the wheel.

"Mergifier", that's the name I gave to the software, draws from an arbitrary number of RSS feeds (and ATOM  as well) and integrates them in a big cauldron. The administrator - that is us, at the moment - is allowed to create as many categories as he wants and every item can be classified in one or more of those. Naturally, I'm not talking about a trivial table, because we don't want to classify every entry by hand. That's why I used a bayesian filter.

More ham for everyone

The idea comes from the Bayesian spam filtering used in software like Mozilla Thunderbird: whenever you receive an unwanted message, you classify it as "junk". The client stores word frequencies and uses that information to estimate the probability that future e-mails are spam.

The probability estimate is quite simple:

p = (p1 · p2 · … · pn) / (p1 · p2 · … · pn + (1 − p1)(1 − p2) · … · (1 − pn))

where p1...pn are the individual probabilities that each word appears in a "junk e-mail". How can we determine those? With the following formula:

Pr(S|W) = Pr(W|S) · Pr(S) / (Pr(W|S) · Pr(S) + Pr(W|H) · Pr(H))

where
  • Pr(S|W) is the probability that an e-mail containing the word is spam
  • Pr(W|S) is the chance of finding the word in a junk e-mail
  • Pr(S) is the "a priori" probability that an e-mail is spam
  • Pr(W|H) is the chance of finding the word in a good e-mail
  • Pr(H) is the "a priori" probability that an e-mail is ham (that is what "not spam" e-mail is called!)

The idea was to replace the concept of spam with our own categories, and to compute those probabilities "online" by fetching items from the feeds and classifying them ourselves. Naturally, the most common words are blacklisted in order to reduce the load on the database.
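
In practice this amounts to keeping per-category word counters, updating them at every manual classification, and skipping the blacklisted words. A rough sketch, reusing the two functions above; the class and names are hypothetical:

    from collections import defaultdict

    # Blacklist of very common words ("stop words"); a tiny illustrative set
    STOPWORDS = set(["the", "a", "of", "and", "to", "in"])

    class CategoryFilter:
        """One 'belongs vs. everything else' filter per category."""

        def __init__(self):
            self.inside = defaultdict(int)    # word -> items of this category containing it
            self.outside = defaultdict(int)   # word -> other items containing it
            self.inside_total = 0
            self.outside_total = 0

        def train(self, words, belongs):
            """Update the counters when the administrator classifies an item."""
            for word in set(words):           # count items, not occurrences
                if word in STOPWORDS:
                    continue                  # the blacklist keeps the database small
                if belongs:
                    self.inside[word] += 1
                else:
                    self.outside[word] += 1
            if belongs:
                self.inside_total += 1
            else:
                self.outside_total += 1

        def score(self, words):
            """Estimate the probability that a new item belongs to the category."""
            probs = []
            for word in set(words):
                if word in STOPWORDS:
                    continue
                if self.inside[word] == 0 and self.outside[word] == 0:
                    continue                  # never seen: no evidence either way
                p = word_spam_probability(self.inside[word],
                                          max(self.inside_total, 1),
                                          self.outside[word],
                                          max(self.outside_total, 1))
                probs.append(min(max(p, 0.01), 0.99))   # clamp away 0 and 1
            return combined_probability(probs) if probs else 0.5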

This method works surprisingly well; we're testing the results, aiming to release an alpha in the next few weeks :) .
