Emergent: June 2010

Monday, 28 June 2010

L'usabilità e la scelta del corso di studi

Come sempre in questo periodo, stiamo lavorando alla pubblicazione dell'offerta didattica del prossimo anno accademico. Poichè il nostro ateneo ha una certa grandezza, l'offerta complessiva è decisamente vasta, organizzata in otto facoltà per un totale di circa 70 corsi di laurea differenti, tra triennali, magistrali, UNA specialistica residua (scienze infermieristiche ed ostetriche), "a ciclo unico" e l'ultima vecchio ordinamento (formazione primaria). Non è semplice, per chi visita il sito, orientarsi in questa selva di corsi, e ogni anno si da fondo alla creatività cercando l'organizzazione migliore.

Inoltre, per ragioni storiche, gli studenti tendono a confondere il concetto di "corsi di laurea" e "facoltà". Dai test di usabilità condotti da Stefano lo scorso inverno è emerso che tantissimi visitatori si perdono seguendo i link alle facoltà, credendo di arrivare alle informazioni di un corso di laurea e finendo invece a navigare sui siti satellite. Quest'anno abbiamo riorganizzato completamente la navigazione e tenteremo di frenare questa emorrargia, anche con un motore di ricerca dei corsi di laurea un po' smart.

Conosci il tuo nemico

Partendo dal presupposto che gli studenti non sono tenuti a conoscere l'organizzazione interna dell'Ateneo, abbiamo cercato di rendergli la vita semplice proponendo un motore di ricerca basato sull'elaborazione del testo inserito. Ho cercato di andare oltre il triviale pattern match, perché come detto auspichiamo una certa "tolleranza semantica".

Dunque, uno studente troverà il CdL in "informatica" sia cercando la parola stessa che inserendo "laurea in informatica" o "facoltà di informatica", ma anche cercando altre parole correlate. In sostanza, ho associato ad ogni CdL un insieme di termini che potrebbero essere inseriti dall'utente alla ricerca di quello specifico corso. Naturalmente, le parole di poca rilevanza vengono scartate automaticamente. La ricerca avviene tramite javascript asincrono e quindi in tempo reale senza caricamento.

Una parte interessante è che la complessità insita nell'identificazione del CdL - come sapere a che facoltà afferisce e come si chiama esattamente - viene del tutto mascherata: inserendo brandelli di parole viene generata una lista che potrebbe addirittura suggerire nuovi percorsi agli immatricolandi.

Aho-Corasick (multiple pattern match) in PHP

When I started programming Mergifier, my "RSS feeds dispenser", I knew I was going to need some kind of terms filtering. It's useless to take into account the most common words, for they don't bring any information.

My initial solution was to split the text into an array of substrings and use array_diff to remove those in my blacklist array. I don't know the asymptotic cost of array_diff algorythm, but using it with plenty of items (featuring long titles, summaries and descriptions) soon appeared a bit too expensive.

Recently I studied a lot of pattern matching algorythms, including the well known "Aho-Corasick" which checks occourrences of multiple patterns "at once" inside a text of length n in O(M+n), being M the total number of characters composing the patterns. Useful for virus signature detection and bioinformatics, I did not find its implementation anywhere, so I did my own!

Download the code

As you could see inside the code, this algorythm is cool because one can precalculate delta/phi transition tables and cache them, brutally speeding up computations on many different texts made of many terms.

If you find this code useful, or if you have any kind of suggestion to improve it, please let me know. :)

Friday, 11 June 2010

Mergifier: prima alfa

Mergifier è entrato in funzione! Attualmente si occupa di classificare le notizie di interesse per studenti, docenti e dipendenti. Questi sono i link per sottoscrivere i feed ATOM:

Il progetto è in una fase embrionale e potrebbe manifestare qualche malfunzionamento; aiutaci segnalando bug o eventuali suggerimenti via email o sul canale facebook! :)

Monday, 7 June 2010

Mergifier: a bayesian RSS filter

The RSS feed is a popular format used to publish information. XML based, therefore machine-readable, it was thought as a "container" of news from frequently updated web pages. Personally, I love using iGoogle as aggregator to keep, in a single screen, lot of feeds that keep me constantly updated about intersting facts happening around the world.

We introduced early this technology on Bicocca's website, to spread interesting facts in an aggregate way; we placed RSS feeds in pages featuring news, events and so forth.

We recently talked about this system and we did not find it optimal. We think that it could be more effective to route all this informations into single feeds intended to particular audience, instead of fragmenting everything into many rarely-changed small feeds. Stefano has partially mitigated the problem by creating a feed's map, but we agreed that is sub-optimal: we still prefer something able to gather everything in a single channel and "skim" the news according to the reader's interest.

A similar project is Yahoo! Pipes, that I discovered during the course of knowledge representation. YP allows, among other, to "sum up" contribution of different feeds into a single one but cannot classify the outgoing items (or, at least, I can't find anything like that :) ). Reading some post from the project's board, it seems I'm not the only one lamenting this lack. Not a problem, because in the meanwhile I rediscovered the wheel.

"Mergifier", that's the name I gave to the software, draws from an arbitrary number of RSS feeds (and ATOM as well) and integrates them in a big cauldron. The administrator - that is us, at the moment - is allowed to create as many categories as he wants and every item can be classified in one or more of those. Naturally, I'm not talking about a trivial table, because we don't want to classify every entry by hand. That's why I used a bayesian filter.

More ham for everyone

The idea comes from the bayesian spam filtering used in software like Mozilla Thunderbird: whenever you receive an unwanted letter, you classify it as "junk". The client stores words' frequencies and use those informations to determine the probability of future e-mail's to be spam.

The probability estimation is quite simple:

where p1...Pn is the individual probability of a word to be present in a "junk e-mail". How can we determine that? By using the following formula:

where

Pr(S|W) is the probability of e-mail to be spam
Pr(W|S) is the chance of finding the word in a junk e-mail
Pr(S) is the "a priori" probability of a mail to be spam
Pr(W|H) is the chance of finding the word in a good email
Pr(H) is the "a priori" probability of a mail to be ham (that is the way "not spam" e-mail are called!)

The idea was to replace the concept of spam with our own categories, and compute those probabilities "online" by fetching from the feeds and letting us classify them. Naturally, the most common words are blacklisted in order to lower the work of database.

This method works stunningly fine; we're testing the results in order to release an alpha during the next weeks :) .