[ workshop page | workshop FAQ | challenge page | challenge FAQ | learning object FAQ | results ]
When everything else fails,Agnostic Learning vs. Prior Knowledge Challenge, Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin Cawley, In proceedings IJCNN 2007, Orlando, Florida, August 2007.
ask for additional domain knowledge…
Analysis of the IJCNN 2007 Agnostic Learning vs. Prior Knowledge Challenge, Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin Cawley, Neural Network special anniversary issue, in press. [Earlier draft]
Hand on Pattern Recognition, challenges in data representation, model selection, and performance prediction. Book in preparation. Isabelle Guyon, Gavin Cawley, Gideon Dror, and Amir Saffari Editors.
We summarize the results of the AL vs. PK challenge, whose goal was to assess the added value of Prior/domain Knowledge (PK) in machine learning, compared to using as a standard learning machine (a "black box") trained on data pre-formatted with simple-minded features (called Agnostic Learning or AL). The challenge, which durated 10 months (from Oct. 1st 2006 to Aug. 1st, 2007), is now over. About 50 participants made 1070 entries and 13 competed in each track for the final ranking (90 entries in the PK track and 167 in the AL track). The data are available from the workshop page and the website of the challenge and , which is still open for post-challenge submission.
The challenge had two tracks, using the same five datasets from various domains:
- Agnostic learning: Preprocessed datasets in a nice “feature-based” representation, but no knowledge about the identity of the features.
- Prior knowledge: Raw data, sometimes not in a feature-based representation. Information given about the nature and structure of the data.
The final ranking was based on the balanced error rate (BER) on the test set. The BER is the average of the error rate of the positive class and the error rate of the negative class. The Area Under the ROC Curve (AUC) was also computed, but not used for scoring. To obtain the overall ranking we averaged the ranks of participants in each track after normalizing them by the number of entries. The number of submissions was unlimited, but only the five last “complete” submissions for each entrant in either track were included in the final ranking. For details, see:
Table 1: ALvsPK challenge datasets.
The datasets of this challenge were used in our previous challenge on "performance prediction", we refer the reader ot its analysis for details. In this previous challenge, the datasets were formatted similarly as in the "agnostic track" and the training/validation/test sets were in the same proportions, but the data split was different.
Figure 1: Learning curves. We show the evolution of the best test BER as a function of time for the 2006 and the 2007 challenges, using the same datasets. (a) Performance prediction challenge. Solid lines = test set BER. Dashed lines = validation set BER (note: validation set labels were released one month before the challenge ended). (b) AL vs. PK challenge. Solid lines = AL track. Dashed lines = PK track.
For the first few weeks of the challenge, the agnostic track (AL) results kept leading (see Figure 1). The learning curves all crossed at a certain point, except for ADA. After approximately 150 days the PK performance asymptote was reached. The asymptotic performances are reported at the top of Table 2. In contrast, last year, using the same data as the AL track, the competitors attained almost their best performances in about 60 days and kept slightly improving afterwards.
In Figure 1, we show the test BER distribution for all entries. There were about 60% more submissions in the AL track than in the PK track. This indicates that the “prior knowledge” track was harder to enter. However, the participants who did enter the PK track performed significantly better on average than those who entered the AL track, on all datasets except for HIVA. To quantify this observation we ran a Wilcoxon rank sum test on the difference between the median values of the two tracks (Table 2). We also performed paired comparisons for entrants who entered both tracks, using their last 5 submissions. In Table 2, a “+” indicates that the entrant performed best in the
PK track and a “-” indicates the opposite. We see that the entrants who entered both tracks did not always succeed in getting better results in the PK track. The pvalues of the sign test does not reveal any significant dominance of PK over AL or vice versa in that respect (all are between 0.25 and 0.5). However, for HIVA and NOVA the participants who entered both tracks failed to get better results in the PK track. We conclude that, while on average PK seems to win over AL, success is uneven and depends both on the domain and on the individuals’ expertise.
Table 2: Result statistics. The first 4 lines are computed from the test set BER over all entries, including reference entries made by the organizers. The indicate that PK wins over AL in most cases. The fifth line indicates that the median values of the 2 previous lines are significantly different (better for PK), except for HIVA. In the remainder of the table, a + indicates that in the 5 last entries of each participant having entered both tracks the best prior knowledge entry gets better results than the best agnostic entry. PK is better than AL for most entrants having entered both tracks for ADA, GINA and SYLVA, but the opposite is true for HIVA and NOVA (requiring more domain knwoledge). The last lign tests the significance of the fraction of times PK is better than AL or vice versa (no significance found because too few participants entered both tracks).
In Figure 2, we show the test BER distribution for all entries. There were about 60% more submissions in the AL track than in the PK track. This indicates that the “prior knowledge” track was harder to enter. However, the participants who did enter the PK track performed significantly better on average than those who entered the AL track, on all datasets except for HIVA.
Figure 2: Histograms of BER performances. The thin vertical black line indicates the best entry counting toward the final ranking (among the five last of every entrant). (a) Agnostic learning track. (b) Prior knowledge track. Note that for NOVA, the best entries are not among the last ones submitted!
To quantify this observation we ran a Wilcoxon rank sum test on the difference between the median values of the two tracks (Table 2). We also performed paired comparisons for entrants who entered both tracks, using their last 5 submissions. In Table 2, a “+” indicates that the entrant performed best in the PK track and a “-” indicates the opposite. We see that the entrants who entered both tracks did not always succeed in getting better results in the PK track. The pvalues of the sign test does not reveal any significant dominance of PK over AL or vice versa in that respect (all are between 0.25 and 0.5). However, for HIVA and NOVA the participants who entered both tracks failed to get better results in the PK track. We conclude that, while on average PK seems to win over AL, success is uneven and depends both on the domain and on the individuals’ expertise.
Here are the final result tables as of August 1st, 2007, which determined the prizes.
Table 3: Agnostic learning track
Table 4: Prior knowledge track
The prize winners are indicated in yellow. The best average BER is held by Reference (Gavin Cawley): try to outperform him by making post-challenge entries! Louis Duclos-Gosselin is second on ADA with Neural Network13, and S. Joshua Swamidass (Baldi Lab, UCI) second on HIVA, but they are not entered in the table because they did not submit complete entries. The overall entry ranking is performed with the overall score (average rank over all datasets, using the test BER for ranking). The best performing complete entry may not contain all the best performing entries on the individual datasets.
From the results, we noticed that it is possible to do very well without prior knowledge and using prior knowledge skillfully is not that obvious:
We performed intermediate rankings using the test set (but not revealing the test set performances), see Table 5:
- December 1st, 2006 (for NIPS 06): Results of the model selection game. [Slides].
- March 1st , 2007: Competition deadline ranking. [Slides]
The March 1st results prompted us to extend the competition deadline because the participants in the "prior knowledge track" are still making progress, as indicated by the learning curves. The best results did not significantly improve, but some entrants made significant improvements in their methods.
Table 5: Milestone results on the test set (only revealed at the end of the challenge). The letter "A" indicates the AL track and the letter "P" the PK track.
Winning agnostic learning methods
The winner of the “agnostic learning” track is Roman Lutz, who also won the Performance Prediction Challenge (IJCNN06), using boosting techniques. Gavin Cawley, who joined the organization team and was co-winner of the previous challenge, made a reference entry using a kernel method called LSSVMs, which slightly outperforms that of Lutz. The improvements he made can partly be attributed to the introduction of an ARD kernel, which automatically downweighs the least relevant features and to a Bayesian regularization at the second level of inference. The second best entrant is the Intel group, also using boosting methods. The next best ranking entrants include Juha Reunanen and Hugo Jair Escalante, who have both been using CLOP models provided by the organizers and have proposed innovative search strategies for model selection: Escalante is using a biologically inspired particle swarm technique and Reunanen a cross-indexing method to make cross-validation more computationally efficient. Other top ranking participants in the AL track include Vladimir Nikulin and Joerg Wichard who both experimented with several ensemble methods, Erinija Pranckeviciene who performed a study of linear programming SVM methods (another kernel method), and Marc Boullé who introduced a new data grid method.
How to use prior knowledge?
We summarize the strategies employed by the participants to exploit prior knowledge on the various datasets.
ADA: the marketing application
The task of ADA is to discover high revenue people from census data, presented in the form of a two-class classification problem. The raw data from the census bureau is known as the Adult database in the UCI machine-learning repository. The 14 original attributes (features) represent age, workclass, education, education, marital status, occupation, native country, etc. and include continuous, binary and categorical features. The PK track had access to the original features and their descriptions. The AL track had access to a preprocessed numeric representation of the features, with a simple disjunctive coding of categorical variables, but the identity of the features was not revealed. We expected that the participants of the AL vs. PK challenge could gain in performance by optimizing the coding of the input features. Strategies adopted by the participants included using a thermometer code for ordinal variables (Gavin Cawley) and optimally grouping values for categorical variables (Marc Boullé). Boullé also optimally discretized continuous variables, which make them suitable for a naïve Bayes classifier. However, the advantage of using prior knowledge for ADA was marginal. The overall winner on ADA is in the agnostic track (Roman Lutz), and the entrants who entered both tracks and performed better using prior knowledge do not have results statistically significantly better. We conclude that optimally coding the variables may be not so crucial and that good performance can be obtained with a simple coding and a state-of-the-art classifier.
GINA: the handwriting recognition application
The task of GINA is handwritten digit recognition, the raw data is known as the MNIST dataset. For the “agnostic learning” track we chose the problem of separating two-digit odd numbers from two-digit even numbers. Only the unit digit is informative for this task, therefore at least 1/2 of the features are distractors. Additionally, the pixels that are almost always blank were removed and the pixel order was randomized to hide the meaning of the features. For the “prior knowledge” track, only the informative digit was provided in the original pixel map representation. In the PK track the identities of the digits (0 to 9) were provided for training, in addition to the binary target values (odd vs. even number). Since the prior knowledge track data consists of pixel maps, we expected the participants to perform image pre-processing steps such as noise filtering, smoothing, de-skewing, and feature extraction (points, loops, corners) and/or use kernels or architectures exploiting geometric invariance by small translation, rotation, and other affine transformations, which have proved to work well on this dataset. Yet, the participants to the PK track adopted very simple strategies, not involving a lot of domain knowledge. Some just relied on the performance boost obtained by the removal of the distractor features (Vladimir Nikulin, Marc Boullé, Juha Reunanen). Others exploited the knowledge of the individual class labels and created multi-class of hierarchical classifiers (Vojtech Franc, Gavin Cawley). Only the reference entries of Gavin Cawley (which obtained the best BER of 0.0192) included domain knowledge by using an RBF kernels with tunable receptive fields to smooth the pixel maps. In the future, it would be interesting to assess the methods of Simard et al on this data to see whether further improvements are obtained by exploiting geometrical invariances. The agnostic track data was significantly harder to analyze because of the hidden class heterogeneity and the presence of feature distractors. The best GINA final entry was therefore on the PK track and all four ranked entrants who entered both tracks obtained better results in the PK track. Further, the differences in performance are all statistically significant.
HIVA: the drug discovery application
The task of HIVA is to predict which compounds are active against the AIDS HIV infection. The original data from the NCI has 3 classes (active, moderately active, and inactive). We brought it back to a two-class classification problem (active & moderately active vs. inactive), but we provided the original labels for the “prior knowledge” track. The compounds are represented by their 3d molecular structure for the “prior knowledge” track (in SD format). For the “agnostic track” we represented the input data as vector of 2000 sparse binary variables. The variables represent properties of the molecule inferred from its structure by the ChemTK software package (version 4.1.1, Sage Informatics LLC). The problem is therefore to relate structure to activity (a QSAR - quantitative structure-activity relationship problem) to screen new compounds before actually testing them (a HTS - high-throughput screening problem). Note that in such applications the BER is not the best metric to assess performances since the real goal is to identify correctly the compounds most likely to be effective (belonging to the positive class). We resorted to using the BER to make comparisons easier across datasets. The raw data was not supplied in a convenient feature representation, which made it impossible to enter the PK track using agnostic learning methods, using off-the-shelf machine learning packages. The winner in HIVA (Chloé-Agathe Azencott of the Pierre Baldi Laboratory at UCI) is a specialist in this kind of dataset, on which she is working towards her PhD. She devised her own set of low level features, yielding a “molecular fingerprint” representation, which outperformed the ChemTK features used on the agnostic track. Her winning entry has a test BER of 0.2693, which is significantly better than the test BER of the best ranked AL entry of 0.2827 (error bar 0.0068). The results on HIVA are quite interesting because most agnostic learning entrants did not even attempt to enter the prior knowledge track and the entrants that did submit models for both tracks failed to obtain better results in the PK track. One of them working in an institute of pharmacology reported that too much domain knowledge is sometimes detrimental; experts in his institute advised against using molecular fingerprints, which ended up as the winning technique.
NOVA: the text classification application
The data of NOVA come from the 20-Newsgroup dataset. Each text to classify represents a message that was posted to one or several USENET newsgroups. The raw data is provided in the form of text files for the “prior knowledge” track. The preprocessed data for the “agnostic learning” track is a sparse binary representation using a bag-of-words with a vocabulary of approximately 17000 words (the features are simply frequencies of words in text). The original task is a 20-class classification problem but we grouped the classes into two categories (politics and religion vs. others) to make it a two-class problem. The original class labels were available for training in the PK track but not in the AL track. As the raw data consist of texts of variable length it was not possible to enter the PK track for NOVA without performing a significant pre-processing. All PK entrants in the NOVA track used a bag-ofwords representation, similar to the one provided in the agnostic track. Standard tricks were used, including stemming. Gavin Cawley used the additional idea of correcting the emails with an automated spell checker and Jorge Sueiras used a Thesaurus. No entrant who entered both tracks outperformed their AL entry with their PK entry in their last ranked entries, including the winner! This is interesting because the best PK entries made throughout the challenge significantly outperform the best AL entries (BER difference of 0.0089 for an error bar of 0.0018), see also Figure 1. Hence in this case, the PK entrants overfitted and were unable to select among their PK entries those, which would perform best on test data. This is not so surprising because the validation set on NOVA is quite small (175 examples). Even though the bag-of-words representation is known to be state-of-the-art for this kind of applications, it would be interesting to compare it with more sophisticated representations. To our knowledge, the best results on the 20 Newsgroup data were obtained by the method of distributional clustering by Ron Bekkerman.
SYLVA: the ecology application
The task of SYLVA is to classify forest cover types. The forest cover type for 30 x 30 meter cells was obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. We converted this into a two-class classification problem (classifying Ponderosa pine vs. everything else). The input vector for the “agnostic learning“ track consists of 216 input variables. Each pattern is composed of 4 records: 2 true records matching the target and 2 records picked at random. Thus 1/2 of the features are distractors. The “prior knowledge” track data is identical to the “agnostic learning” track data, except that the distractors are removed and the meaning of the features is revealed. For that track, the identifiers in the original forest cover dataset are revealed for the training set. As the raw data was already in a feature vector representation, this task was essentially testing the ability of the participants in the AL track to perform well in the presence of distractor features. The PK track winner (Roman Lutz) in his Doubleboost algorithm exploited the fact that each pattern was made of two records of the same pattern to train a classifier with twice as many training examples. Specifically, a new dataset was constructed by putting the second half of the data (variables 55 to 108) below the first half (variables 1 to 54). The new dataset is of dimension 2n times 54 (instead of n times 108). This new dataset is used for fitting the base learner (tree) of his boosting algorithm. The output of the base learner is averaged over the two records belonging to the same pattern. This strategy can be related to the neural network architectures using “shared weights”, whereby at training time, the weights trained on parts of the pattern having similar properties are constrained to be identical. This reduced the number of free parameters of the classifier.
For the first few month of the challenge, AL lead over PK, showing that the development of good AL classifiers is considerably faster. As of March 1st 2007, PK was leading over AL on four out of five datasets. We extended the challenge five more month, but the best performances did not significant improve during that time period. On datasets not requiring real expert domain knowledge (ADA, GINA, SYLVA), the participants entering both track obtained better results in the PK track, using a special-purpose coding of the inputs and/or the outputs, exploiting the knowledge of which features were uninformative, and using “shared weights” for redundantfeatures. For two datasets (HIVA and NOVA) the raw data was not in a feature representation and required some domain knowledge to preprocess data. The winning data representations consist in low level features (“molecular fingerprints” and “bag of words”). From the analysis of this challenge, we conclude that agnostic learning methods are very powerful. They quickly yield (in 40 to 60 days) to performances, which are near the best achievable performances. General-purpose techniques for exploiting prior knowledge in the encoding of inputs or outputs or the design of the learning machine architecture (e.g. via shared weights) may provide an additional performance boost, but exploiting real domain knowledge is both hard and time consuming. The net result of using domain knowledge rather using than low level features and relying on agnostic learning may actually be to worsen results, as experienced by some entrants. This fact seems to be a recurrent theme in machine learning publications and the results of our challenge confirm it. Future work includes incorporating the best identified methods in our challenge toolkit CLOP. The challenge web site remains open for post-challenge submissions.
We are very grateful our sponsors and the all our collaborators:
This project is supported by the National Science Fundation under Grant N0. ECCS-0736687. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.