Description of the datasets:
We published a technical memorandum describing the datasets, how we
created them, and some baseline results. The datasets are available from
the website of the challenge, where each dataset is described. Briefly:
- REGED is a dataset generated by a simulator of gene expression data,
which was trained on real DNA microarray data. The target variable is
"lung cancer". Hence, the task is to discover genes that trigger the
disease or are a consequence of it. The manipulations simulate the effect
of agents such as drugs. For REGED1, the list of manipulated variables is
provided, but not for REGED2. REGED has 999 features, of which the Markov
blanket contains 2 direct causes, 13 direct effects and 6 spouses in
REGED0, but only 2 direct causes, 6 direct effects and 4 spouses in
REGED1, and 2 direct causes in REGED2.
- SIDO consists of real data from a drug discovery problem. The variables
represent molecular descriptors of pharmaceutical compounds, whose
activity on the HIV virus must be determined (the target variable).
Knowing which molecular features are causes of activity would be of great
help to chemical engineers designing new compounds. To test the efficacy
of causal discovery algorithms, artificial "distractor" variables (called
"probes") were added, which are "non-causes" of the target. All the probes
are manipulated in the test sets SIDO1 and SIDO2. The probes must be
filtered out to get a good causal discovery score and good prediction
performance on test data. SIDO has 4932 features, of which 1644 are real
features and 3288 are probes.
- CINA is also a real dataset. The problem is to predict the revenue
level of people from census data (marital status, years of study,
gender, etc.). As a causal discovery problem, the task is to find causes
that might influence revenue. As for SIDO, artificial variables (probes)
were added. CINA has 132 features, of which 44 are real features and 88
are probes; all probes are manipulated in CINA1 and CINA2.
- MARTI is a noisy version of REGED. Correlated noise
was added to simulate measurement artifacts and introduce spurious
relationships between variables. This dataset illustrates that without
proper calibration/normalization of data, causal discovery algorithms
may yield wrong causal structures. MARTI has 1024 features and the same
causal graph as REGED. However, 25 calibrant variables were added to help
remove the noise.
We searched for the entries that obtained the best Tscore
among all the Reference entries, which were made with the knowledge
of the true Markov blanket (Fscore=1). With a few exceptions, the
Tscore reached is not statistically significantly different from the
best among all entries, regardless of Fscore (see this table). The
interesting exceptions are SIDO0 and CINA1, where the best Tscore is
reached by entries having Fscore~0.5, showing the robustness of these
classifiers to a large number of distractors. The SIDO0 entry uses linear
ridge regression and the CINA1 entry uses naive Bayes.
In Figure 1, we show the value of the best test set prediction
score (Tscore) for predicting the target variable obtained by participants,
as a function of the number of days in the challenge. The Tscore
used here is the area under the ROC curve. The datasets REGED
and SIDO were available from the beginning of the challenge, while
CINA was introduced a month later and MARTI mid-way. We show as a thin
line the performance level of the best Reference entry made by the organizers
(using knowledge of causal relationships not available to the participants).
We see that this value is nearly reached by the participants by the end
of the challenge for all 3 variants of REGED and MARTI, whereas some
improvement could still be gained on SIDO1 and 2 and CINA1 and 2. Progress
is made in steps.
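For readers unfamiliar with the metric, the following is a minimal sketch of an area-under-the-ROC-curve computation, written as the fraction of (positive, negative) pairs that are ranked correctly. The function name `auc` and the convention that label 1 marks the positive class are assumptions made for illustration; this is not the organizers' scoring code.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pair-counting identity.

    labels: 1 marks a positive example, any other value a negative one
            (an assumed convention for this sketch).
    scores: larger values mean "more likely positive".
    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A classifier that gives every example the same score gets 0.5 under this definition, and a perfect ranking gets 1.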
Figure 1: Learning curves. The thin horizontal lines indicate
the best achievable performance, knowing the causal relationships (solid
line for set 0, dashed line for set 1, and dotted line for set 2).
In Figure 2, the distribution of Tscores is represented
by histograms of all the complete entries made throughout the challenge.
The dashed vertical line represents the best Reference entry made by
the organizers (using knowledge of causal relationships not available
to the participants). The solid vertical line represents the best final
entry made by the participants (only one final entry was allowed per
participant). We note that the participants did not always select their
best entries as final submission, which indicates that the feedback
on performance they were getting online was not sufficient to perform
model selection. Still, it may have biased the results, particularly
on version 0 of the datasets, for which there is not a wide spread in the
results of the top-ranking submissions. Hence, falling out of the top 25%
was a strong indication of performance loss. We see that for version 0
of the datasets, the Reference performances are reached. Hence, the online
feedback may have had the effect of pumping performance up. However,
version 0 is the only one in which training data and test data have the
same distribution. Hence, cross-validation gives very strong feedback.
Additionally, many more features are predictive in version 0 since, due
to manipulations, a lot of features are made irrelevant in versions 1 and
2. We see no indication that "pumping" occurred for versions 1 and 2. To
further detect any potential "pumping" effect, we analyzed the progress
of individual participants (see individual results).
We notice that for version 0, the top 2 quartiles are very close to one
another. Towards the end of the challenge, the entries oscillate over and
under the top quartile line, giving very strong feedback to the participants.
Also, as the number of entries grows, the top quartile values drift
up, so participants who entered late in the game did not succeed in catching
up and benefiting from this feedback.
Figure 2: Histograms of Tscores.
Analysis of results by dataset:
In Figures 3, 4, 5, and 6, we show the individual participant
results for their last complete entry. The entrants who qualified
for the final ranking (i.e., having a complete entry, disclosing
their name, providing their feature set, providing a fact sheet,
and cooperating with verification tests) have their rank indicated
in parentheses. We also show on the graph the last complete entry of
participants who did not comply with all the requirements or did not compete
for the prizes for other reasons (they are referred to by their pseudonym
or initials). According to the instructions, the participants could
provide sorted or unsorted lists of features. Participants having supplied
a sorted list of features have one star after their name. They also had the
option of varying the number of features used, following nested subsets
of features derived from their ranking. The participants who supplied
a table of results corresponding to using nested subsets of features
have 2 stars after their name.
We show 3 graphs:
- Tscore: The test set AUC for the prediction
of the target variable.
- Fscore: The AUC for the prediction of which
features belong to the Markov blanket.
- Fnum: The number of features used.
These scores are described in more detail on the Evaluation
page. If result tables are provided, the graph shows the best
Tscore and the corresponding Fnum. We indicate error bars for Tscore
and Fscore. However, for Tscore, they are smaller than the symbols,
so they cannot be seen.
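To make the Fscore more concrete, here is a rough sketch of how an AUC over features could be computed from a submitted feature list. The scoring convention (listed features rank above unlisted ones, earlier positions in a sorted list rank higher, and all entries of an unsorted list share one score) and the name `feature_auc` are assumptions made for illustration; the Evaluation description remains the authoritative definition.

```python
def feature_auc(n_features, selected, markov_blanket, is_sorted=True):
    """AUC for separating Markov blanket members from all other features.

    n_features: total number of features, numbered 1..n_features.
    selected: submitted feature list (most relevant first when sorted).
    markov_blanket: true Markov blanket members (the "positives").
    """
    mb = set(markov_blanket)
    # Unlisted features get the lowest score (assumed convention).
    score = {f: 0.0 for f in range(1, n_features + 1)}
    for rank, f in enumerate(selected):
        # Sorted list: earlier is higher. Unsorted list: one shared score.
        score[f] = float(len(selected) - rank) if is_sorted else 1.0
    pos = [s for f, s in score.items() if f in mb]
    neg = [s for f, s in score.items() if f not in mb]
    if not pos or not neg:
        raise ValueError("need both Markov blanket and other features")
    # Count correctly ordered (member, non-member) pairs; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under these assumptions, an unsorted list of all features, or an empty list, puts every feature at the same score and therefore yields 0.5.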
We observe some correlation between Fscore and Tscore,
but some notable exceptions:
- Some participants provided an unsorted list of all
features or no list at all. They get an Fscore of 0.5. This does not
necessarily indicate that they performed no feature selection: Some
ensemble methods use multiple feature subsets and we did not have
provisions in our format for reporting such cases.
- Some participants performed good causal discovery
(high Fscore) but did not get good prediction performance (low Tscore).
This could be attributed to a bad classification algorithm and needs
to be further investigated case by case. For REGED2, for instance, many
people discovered the causes of the target exactly. Yet they do not
necessarily predict the target well.
- The Fscore gives equal importance to good features
rightly selected and bad features rightly discarded. Hence for REGED2
where there are very few good features, most people who got them all
get a very high Fscore. The proportion of good features in the selected
feature set (so-called "precision") correlates a little better with Tscore,
particularly when the good feature set is taken as all relatives instead
of just Markov blanket members.
- Interestingly, some participants obtained a good Tscore
with a very bad Fscore. For the test sets labeled 0, this is not
surprising since using all the features often yields better performance
than using feature selection: most good regularized classifiers
are robust against the presence of irrelevant features. For the test
sets labeled 1 and 2, some manipulated variables should introduce
a significant disturbance, which should give an advantage to people
who got rid of them.
Figure 3: Scores of participants for REGED. Circle=REGED0,
Full circle=REGED1, Star=REGED2.
Figure 4: Scores of participants for SIDO. Circle=SIDO0,
Full circle=SIDO1, Star=SIDO2.
Figure 5: Scores of participants for CINA. Circle=CINA0,
Full circle=CINA1, Star=CINA2.
Figure 6: Scores of participants for MARTI. Circle=MARTI0,
Full circle=MARTI1, Star=MARTI2.
Further analysis of correlation between causation and prediction
These results prompted us to investigate the correlation between
Tscore and various measures evaluating the feature set to see whether
a better correlation between causation and prediction could be obtained.
We provide a detailed analysis and many more graphs in Appendix A.
We ended up creating a new Fscore, which correlates better with
the Tscore, based on the information retrieval notions of precision and
recall. We define a precision, which assesses the fraction of causally
relevant features in the feature set selected
(precision=num_good_selected/num_selected), and a recall, which measures
the fraction of all causally relevant features that are selected
(recall=num_good_selected/num_good). For REGED and MARTI,
the new Fscore is the Fmeasure=2*precision*recall/(precision+recall),
while for the datasets SIDO and CINA we simply use the precision (because
recall is not a meaningful quantity for real data with probes since
we do not know which variables are relevant). This new Fscore correlates
well with Tscore, as shown in Figure 7.
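The precision, recall, and Fmeasure used for the new Fscore can be sketched as follows. The recall denominator (the total number of causally relevant features, `good` below) follows the standard information retrieval definition. The function names, the choice to return 0 when a denominator is empty, and the way the "good" set is supplied by the caller are assumptions made for illustration.

```python
def precision_recall_fmeasure(selected, good):
    """Return (precision, recall, Fmeasure) for a selected feature set.

    selected: features chosen by the participant.
    good: causally relevant features (supplied by the caller).
    """
    selected, good = set(selected), set(good)
    num_good_selected = len(selected & good)
    # Empty denominators are mapped to 0.0 (an assumption of this sketch).
    precision = num_good_selected / len(selected) if selected else 0.0
    recall = num_good_selected / len(good) if good else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, fmeasure


def new_fscore(dataset, selected, good):
    """Fmeasure for REGED and MARTI, plain precision for SIDO and CINA."""
    precision, _, fmeasure = precision_recall_fmeasure(selected, good)
    if dataset.upper().startswith(("REGED", "MARTI")):
        return fmeasure
    return precision
```

For example, selecting features {1, 2, 3, 4} when the relevant ones are {1, 2, 5} gives precision 2/4, recall 2/3, and Fmeasure 4/7.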
In addition, in Appendix B
we performed another analysis, which aims at correcting the bias introduced
by picking the best performance on the test set for people who returned
a series of predictions on nested subsets of features. We performed paired
comparisons for identical feature numbers. In the corresponding figure,
we show how this affects performance.
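The idea of comparing entries only at identical feature numbers can be illustrated with a short sketch. It assumes each result table is available as a mapping from number of features (Fnum) to Tscore; this representation and the name `paired_differences` are hypothetical, and the actual procedure is the one described in Appendix B.

```python
def paired_differences(table_a, table_b):
    """Compare two result tables at the feature numbers they share.

    table_a, table_b: dicts mapping Fnum -> Tscore (assumed representation).
    Returns a list of (Fnum, Tscore_a - Tscore_b), sorted by Fnum.
    """
    common = sorted(set(table_a) & set(table_b))
    return [(fnum, table_a[fnum] - table_b[fnum]) for fnum in common]
```

Because both entries are evaluated at the same subset sizes, neither benefits from having picked its best-scoring subset size on the test set.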
Finally, we are presently conducting complementary experiments to
factor out the influence of the classifier, by training the same classifiers
on the feature sets of the participants.
Individual participant results:
(including graphs and fact sheets)
Chen Chu An
L.E.B & Y.T.
H. Jair Escalante
E. Mwebaze & J.
J. Yin & Z. Geng
Full Result Tables:
Tables of results are available in HTML format
and text format. In addition, we provide tables of ranked results.
Verifications: Letter sent to top entrants
As one of the top-ranking participants, you will
have to cooperate with the organizers to reproduce your results
in order to qualify for the prizes. The outcome of the tests (or
the absence of tests) will be published. To that end, we expect you to:
1) Elaborate on your method in the fact sheet
- What specific feature selection methods did you try?
- What else did you try besides the method you submitted
last? What do you think was a critical element of success compared
to other things you tried?
- In what do the models for the versions 0, 1, and
2 of the various tasks differ?
- Did you rely on the quartile information available
on the web site for model selection or did you use another scheme?
- In the result table you submitted, did you use
nested subsets of features from the slist you submitted?
- Did you use any knowledge derived from the test
set to make your submissions, including simple statistics and visual
examination of the data?
2) Upload by May 15 your code to
password: yinwen4wcci (change your password)
Go to "All files" and click "upload file"
** IMPORTANT: the code should be strictly standalone and
respect the following guidelines:
- Provide both executables (Windows and/or Linux)
and source code for TWO SEPARATE training and test modules, called
yourname_train and yourname_test.
- The two modules should be strictly standalone and
include all necessary libraries compiled in and have no command
- The modules should regularly output to STDOUT a
status of progress.
- The modules should regularly save partial results
so if they need to be interrupted, they can be restarted from intermediate
- The two modules will read and write to disk files
* a configuration file "config.txt"
including the path to directories DATA_DIR where data are, MODEL_DIR
where the models are, and OUTPUT_DIR where outputs should be written
* the data in the challenge standard
formats to be read from DATA_DIR
* files "dataname[num]_manip.txt"
specifying which features are manipulated in the test data to be
read from DATA_DIR
* a file "marti0_calib.txt", special
for MARTI, containing the calibrant numbers
* the models in a format you can
freely choose to be written to MODEL_DIR
* the prediction results (including
feature lists and output predictions) in the same format as challenge
submissions to be written to OUTPUT_DIR
A) The training program will:
- Read from the current directory the configuration
file "config.txt", indicating DATA_DIR and MODEL_DIR (keyword
followed by path name, separated by newline)
- Read from DATA_DIR the training data (in standard
format; since all 3 training sets are the same for a given task,
we will use only version 0; only data from one task will be present
in DATA_DIR) and the files "dataname[num]_manip.txt" indicating
the list of manipulated variables, see self explanatory format
- Produce models for test sets 0, 1, and 2 saved
To save time, since much of the training may be similar
for the versions 0, 1, and 2, the training module should process
everything at once and output models for all 3 versions. The training
module should save feature sets or feature rankings in MODEL_DIR.
If nested subsets of features are used, the training module should
produce models for all subsets. Therefore, it is important that the
module can be restarted if it gets interrupted and all intermediate results
B) The test program will:
- Read from the current directory the configuration
file "config.txt", indicating DATA_DIR, MODEL_DIR, and OUTPUT_DIR
(keyword followed by path name, separated by newline)
- Read the test data from DATA_DIR (in standard
format). Test sets for all 3 versions of a given task will be found
in that directory and should all be processed.
- Load the corresponding model(s) from MODEL_DIR
- Output the submission as it was made to the website
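As a rough illustration of the guidelines above, the sketch below reads a `config.txt`, reports progress on STDOUT, and saves partial results so that an interrupted run can resume. It assumes that each configuration line has the form `KEYWORD path` and that partial results fit in a JSON file; the helper names `read_config` and `run_resumable` are hypothetical, and nothing here is a required implementation.

```python
import json
import os


def read_config(path="config.txt"):
    """Parse lines of the assumed form 'KEYWORD path' into a dict."""
    config = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split(None, 1)  # keyword, then the rest of the line
            if len(parts) == 2:
                config[parts[0]] = parts[1].strip()
    return config


def run_resumable(steps, work, checkpoint_path):
    """Run work(step) for each step, skipping steps already checkpointed.

    steps: list of string identifiers (for example one per test set version).
    work: function returning a JSON-serializable result for a step.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    for i, step in enumerate(steps, start=1):
        if step not in done:
            done[step] = work(step)
            # Write to a temporary file first so that an interruption
            # cannot leave a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as handle:
                json.dump(done, handle)
            os.replace(tmp_path, checkpoint_path)
        print(f"progress: {i}/{len(steps)} ({step})", flush=True)
    return done
```

A training module could call `read_config()` once, then pass the versions to process and a checkpoint file located under MODEL_DIR to `run_resumable`.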
Special Matlab instructions:
- Provide 2 Matlab functions
A) [yourname]_train(data_name, train_data_dir, model_dir,
This function should read from train_data_dir
and should output models to model_dir. Optionally, log_dir
can be used to log messages about the status/completion of learning.
B) [yourname]_test(data_name, test_data_dir,
This function should read from test_data_dir
and reload models from model_dir. Then it should produce
predictions to output_dir.
- We will run your code on the original datasets
- Do not pass the full dataset archive as an argument
to the program, pass a directory name.
train_data_dir and test_data_dir will be 2 separate
directories to make sure we do not accidentally load test data during
training.