Friday, 6 May 2011

On the bioinformatics or: computer science's pillars


One of the most interesting facets of bioinformatics is that pushes computer science towards its limit. When you deal with huge data sets, a poorly implemented algorithm or the choice of the wrong data structure leads to inevitable failure.

Screenshot above shows some information about the costs of a piece of software I'm developing for prof. Besozzi, that executes a slight variation of Aho-Corasick on saccharomyces cerevisiae's genome, looking for a bunch of relevant nucleotides sequences. The data-set isn't impressive, just 36MB, and loading it into main memory takes about 10MB. The remaining 128 megabytes are due to the tables produced by algorithm and the huge vectors that store the results (>7 million of matches).

This being the ratio, one can easily foresee that bigger data sets (eg.: the human genome) could easily get out of hand, requiring more resources than one can afford! I'm more than satisfied in terms of time complexity (it's a linear algorithm, therefore optimal), but some further optimization could (has) to be done on the spatial side.

No comments: