We introduced this technology on Bicocca's website early on, as a way to distribute news in an aggregated form: we placed RSS feeds on the pages featuring news, events and so forth.
We recently discussed this system and concluded that it is not optimal. We think it would be more effective to route all this information into a few feeds aimed at particular audiences, instead of fragmenting everything into many small, rarely updated feeds. Stefano has partially mitigated the problem by creating a map of the feeds, but we agreed that it is still sub-optimal: we would prefer something able to gather everything into a single channel and "skim" the news according to the reader's interests.
A similar project is Yahoo! Pipes, which I discovered during the knowledge representation course. Among other things, YP lets you "sum up" the contributions of different feeds into a single one, but it cannot classify the outgoing items (or, at least, I couldn't find anything like that :) ). Reading some posts on the project's board, it seems I'm not the only one lamenting this shortcoming. Not a problem, because in the meantime I have reinvented the wheel.
"Mergifier", that's the name I gave to the software, draws from an arbitrary number of RSS feeds (and ATOM as well) and integrates them in a big cauldron. The administrator - that is us, at the moment - is allowed to create as many categories as he wants and every item can be classified in one or more of those. Naturally, I'm not talking about a trivial table, because we don't want to classify every entry by hand. That's why I used a bayesian filter.
More ham for everyone
The idea comes from the Bayesian spam filtering used in software like Mozilla Thunderbird: whenever you receive an unwanted message, you classify it as "junk". The client stores word frequencies and uses that information to estimate the probability that future e-mails are spam.
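The "stores word frequencies" step can be sketched roughly like this (a simplified illustration that counts each word once per message; Thunderbird's real implementation is more elaborate):

```python
import re
from collections import Counter

spam_counts = Counter()  # word -> number of junk messages containing it
ham_counts = Counter()   # word -> number of good messages containing it
n_spam = 0               # total junk messages seen
n_ham = 0                # total good messages seen

def tokenize(text):
    """Extract the set of distinct words in a message."""
    return set(re.findall(r"[a-z']+", text.lower()))

def train(text, is_spam):
    """Update the frequency tables when the user marks a message as junk (or not)."""
    global n_spam, n_ham
    words = tokenize(text)
    if is_spam:
        spam_counts.update(words)
        n_spam += 1
    else:
        ham_counts.update(words)
        n_ham += 1
```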
The probability estimation is quite simple:

p = (p1 · p2 · … · pn) / (p1 · p2 · … · pn + (1 − p1) · (1 − p2) · … · (1 − pn))

where p1 … pn are the individual probabilities that each word appears in a "junk e-mail". How can we determine those? By using the following formula:

Pr(S|W) = (Pr(W|S) · Pr(S)) / (Pr(W|S) · Pr(S) + Pr(W|H) · Pr(H))
where
- Pr(S|W) is the probability that an e-mail containing the word W is spam
- Pr(W|S) is the probability of finding the word W in a spam e-mail
- Pr(S) is the a priori probability that an e-mail is spam
- Pr(W|H) is the probability of finding the word W in a legitimate e-mail
- Pr(H) is the a priori probability that an e-mail is ham (that's what "non-spam" e-mail is called!)
This method works surprisingly well; we're testing the results in order to release an alpha in the next few weeks :) .