Asa Ben-Hur and Douglas Brutlag
Protein function prediction, i.e., the classification of protein sequences according to their biological function, is an important task in bioinformatics. In this paper we show that the presence of sequence motifs (elements that are conserved across different proteins) is a highly discriminative feature for protein function prediction. This agrees with the biological view of motifs as the building blocks of protein sequences. We demonstrate the approach on the problem of classifying enzymes. Since enzymes contain thousands of motifs, we perform feature selection on this data and find that most enzyme classes can be classified using a handful of motifs, yielding accurate and highly interpretable classifiers. We compare our method to one based on BLAST, which is the accepted method for measuring sequence similarity between proteins.
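As a rough illustration of the idea of motif-presence features followed by feature selection, the following minimal Python sketch may help. It is not the method used in this paper: the motif patterns, the toy sequences, the labels, and the helper names `motif_features` and `rank_motifs` are all hypothetical, and the ranking score (the absolute difference in motif frequency between the two classes) is a simple filter chosen only for illustration.

```python
import re

# Made-up motifs written as regular expressions (hypothetical, not from the paper).
MOTIFS = {
    "m1": r"G.GK[ST]",
    "m2": r"C..C",
    "m3": r"HE..H",
}

def motif_features(sequence, motifs=MOTIFS):
    """Map each motif name to 1 if the motif occurs in the sequence, else 0."""
    return {name: int(re.search(pattern, sequence) is not None)
            for name, pattern in motifs.items()}

def rank_motifs(sequences, labels, motifs=MOTIFS):
    """Rank motifs by |P(motif | class 1) - P(motif | class 0)|.

    This is an assumed, simple filter score, not the paper's feature selection.
    """
    pos = [motif_features(s, motifs) for s, y in zip(sequences, labels) if y == 1]
    neg = [motif_features(s, motifs) for s, y in zip(sequences, labels) if y == 0]
    scores = {}
    for name in motifs:
        p_pos = sum(f[name] for f in pos) / len(pos)
        p_neg = sum(f[name] for f in neg) / len(neg)
        scores[name] = abs(p_pos - p_neg)
    # Highest-scoring (most discriminative) motifs first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy sequences (made up): label 1 = member of the class, 0 = non-member.
sequences = ["AAGAGKTLL", "MGSGKSCAAC", "ACAACHEAAH", "LLHEGGHCPPC"]
labels = [1, 1, 0, 0]
ranking = rank_motifs(sequences, labels)
```

Keeping only the top few entries of `ranking` mirrors, in spirit, the observation that a handful of motifs can suffice for a class, and the retained motifs remain directly readable as patterns.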