Causation and Prediction Challenge Analysis



The causation and prediction challenge started on December 15, 2007 and ended on April 30, 2008. Four tasks were proposed from various application domains. Each dataset included a training set drawn from a "natural" distribution and three test sets: one from the same distribution as the training set (labeled 0) and two corresponding to data drawn while an external agent was manipulating certain variables. The goal of the challenge was to predict as well as possible a binary target variable, whose value on test data was withheld. In addition, the participants were asked to provide the list of variables (features) they used to make predictions. The tasks were designed such that knowledge of the causal relationships between the predictive variables and the target should be helpful for making predictions. Causation and prediction are tied because manipulated variables that are not direct causes of the target may be more harmful than useful for making predictions. In the extreme case when all variables are manipulated, only the direct causes are predictive of the target. The effectiveness of causal discovery is assessed with a score, which measures how well the selected features coincide with the Markov blanket of the target for the test set distribution. Part of our analysis is to observe under which conditions this score is correlated with prediction accuracy.


Design and Analysis of the Causation and Prediction Challenge
Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander Statnikov; JMLR W&CP 3:1-33, 2008.


Left: Gavin Cawley and Yin-Wen Chang receiving their award from Isabelle Guyon. Right: Jianxin Yin receiving his award from Isabelle Guyon.

Description of the datasets:

We published a technical memorandum describing the datasets, how we created them, and some baseline results.
The data are available from the website of the challenge, where each dataset is described.

Reference entries:

We searched for the entries that obtained the best Tscore among all the Reference entries, which were made with knowledge of the true Markov blanket (Fscore=1). With a few exceptions, the Tscore reached is not statistically significantly different from the best among all entries, regardless of Fscore (see this table). The interesting exceptions are SIDO0 and CINA1, where the best Tscore is reached by entries having Fscore~0.5, showing the robustness of these classifiers to a large number of distractors. The SIDO0 entry uses linear ridge regression and the CINA1 entry uses naive Bayes.

Learning Curves:

In Figure 1, we show the value of the best test set prediction score (Tscore) for predicting the target variable obtained by participants, as a function of the number of days into the challenge. The Tscore used here is the area under the ROC curve. The datasets REGED and SIDO were introduced at the beginning of the challenge, while CINA was introduced a month later and MARTI mid-way. We show as a thin line the performance level of the best Reference entry made by the organizers (using knowledge of causal relationships not available to the participants). We see that this value is closely reached by the participants by the end of the challenge for all three variants of REGED and MARTI, whereas some improvement could still be gained on SIDO1 and 2 and CINA1 and 2. Progress was made in steps.
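As an illustrative sketch (not the organizers' actual scoring code), the area under the ROC curve can be computed from the target labels and real-valued predictions via the Wilcoxon rank-sum formulation:

```python
def auc(y_true, scores):
    """Area under the ROC curve via the Wilcoxon rank-sum statistic.

    Over all (positive, negative) example pairs, counts the fraction of
    pairs where the positive example receives the higher score (ties
    count 0.5).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking of the test examples yields an AUC of 1, while random predictions yield about 0.5.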


Figure 1: Learning curves. The thin horizontal lines indicate the best achievable performance, knowing the causal relationships (solid line for set 0, dashed line for set 1, and dotted line for set 2).

Performance Histograms:

In Figure 2, the distribution of Tscores is represented by histograms of all the complete entries made throughout the challenge. The dashed vertical line represents the best Reference entry made by the organizers (using knowledge of causal relationships not available to the participants). The solid vertical line represents the best final entry made by the participants (only one final entry was allowed per participant). We note that the participants did not always select their best entries as final submissions, which indicates that the feed-back on performance they were getting on-line was not sufficient to perform model selection. Still, this feed-back may have biased the results, particularly on version 0 of the datasets, for which there is not a wide spread in the results of the top-ranking submissions; hence, falling out of the top 25% was a strong indication of performance loss. We see that for version 0 of the datasets, the Reference performances are reached, so the on-line feed-back may have had the effect of pumping performance up. However, version 0 is the only one in which training data and test data have the same distribution, so cross-validation already gives very strong feed-back. Additionally, many more features are predictive in version 0 since, due to manipulations, a lot of features are made irrelevant in versions 1 and 2. We see no indication that "pumping" occurred for versions 1 and 2. To further detect any potential "pumping" effect, we analyzed the progress of individual participants (see individual results). We notice that for version 0, the top two quartiles are very close to one another: towards the end of the challenge, the entries oscillate over and under the top-quartile line, giving very strong feed-back to the participants. Also, as the number of entries grew, the top-quartile values drifted up, so participants who entered late in the game did not succeed in catching up and benefiting from this feed-back.


Figure 2: Histograms of Tscores. 

Analysis of results by dataset:

In Figures 3, 4, 5, and 6, we show the individual participant results for their last complete entry. The entrants who qualified for the final ranking (i.e., having a complete entry, disclosing their name, providing their feature set, providing a fact sheet, and cooperating with verification tests) have their rank indicated in parentheses. We also show on the graphs the last complete entry of participants who did not comply with all the requirements or did not compete for the prizes for other reasons (they are referred to by a pseudonym made of their initials). According to the instructions, the participants could provide sorted or unsorted lists of features. Participants who supplied a sorted list of features have one star after their name. They also had the option of varying the number of features used, following nested subsets of features derived from their ranking. The participants who supplied a table of results corresponding to using nested subsets of features have two stars after their name.
We show three graphs. The scores are described in more detail on the Evaluation page. When result tables were provided, the graph shows the best Tscore and the corresponding Fnum. We indicate error bars for Tscore and Fscore; however, for Tscore they are smaller than the symbols, so they cannot be seen.

We observe some correlation between Fscore and Tscore, with some notable exceptions.

Figure 3: Scores of participants for REGED. Circle=REGED0, Full circle=REGED1, Star=REGED2.

Figure 4: Scores of participants for SIDO. Circle=SIDO0, Full circle=SIDO1, Star=SIDO2.

Figure 5: Scores of participants for CINA. Circle=CINA0, Full circle=CINA1, Star=CINA2.

Figure 6: Scores of participants for MARTI. Circle=MARTI0, Full circle=MARTI1, Star=MARTI2.

Further analysis of correlation between causation and prediction

These results prompted us to investigate the correlation between Tscore and various measures evaluating the feature set to see whether a better correlation between causation and prediction could be obtained.
We provide a detailed analysis and many more graphs in Appendix A.
We ended up creating a new Fscore, which correlates better with the Tscore, based on the information retrieval notions of precision and recall. We define a precision, which assesses the fraction of causally relevant features in the selected feature set (precision=num_good_selected/num_selected), and a recall, which measures the fraction of all causally relevant features retrieved (recall=num_good_selected/num_good). For REGED and MARTI, the new Fscore is the Fmeasure=2*precision*recall/(precision+recall), while for the datasets SIDO and CINA we simply use the precision (because recall is not a meaningful quantity for real data with probes, since we do not know which variables are relevant). This new Fscore correlates well with Tscore, as shown in Figure 7.
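The new Fscore computation can be sketched as follows (an illustrative re-implementation, not the official scoring script; `selected` and `good` are hypothetical sets of feature indices):

```python
def new_fscore(selected, good):
    """Precision/recall-based Fscore for a selected feature set.

    selected: feature indices chosen by a participant.
    good: causally relevant features (known for artificial datasets).
    Returns (precision, recall, f_measure); for SIDO and CINA only the
    precision would be used, as discussed above.
    """
    selected, good = set(selected), set(good)
    num_good_selected = len(selected & good)
    precision = num_good_selected / len(selected) if selected else 0.0
    recall = num_good_selected / len(good) if good else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure
```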

In addition, in Appendix B, we performed another analysis, which aims at correcting the bias introduced by picking the best performance on the test set for participants who returned a series of predictions on nested subsets of features. We performed paired comparisons for identical feature numbers. In Figure 8, we show how this affects performance.

Finally, we are presently conducting complementary experiments to factor out the influence of the classifier, by training the same classifiers on the feature sets of the participants.


Figure 7: Performance of participants averaged over all datasets (all 4 tasks and all test set versions).
The new Fscore is plotted as a function of the relative Tscore: (Tscore-0.5)/(MaxTscore-0.5).

Figure 8: Performance of participants averaged over all datasets (all 4 tasks and all test set versions).
Here we show the fraction of times one participant is better than another one in pairwise comparisons for identical numbers of features.

Individual participant results: 

(including graphs and fact sheets)

Chen Chu An
Alexander Borisov
L.E.B & Y.T.
CaMML Team
J.G. Castellano
Gavin Cawley
Yin-Wen Chang
Louis Duclos-Gosselin
H. Jair Escalante
Cristian Grozea
Nistor Grozavu
Jinzhu Jia
Jianming Jin
H.A. Jen
E. Mwebaze & J. Quinn
Vladimir Nikulin
Alexey Polovinkin
Florin Popescu
Marius Popescu
Mehreen Saeed
J. Yin & Z. Geng Gr.
Ching-Wei Wang
Wu Zhili

Full Result Tables:

Tables of results are available in HTML format and text format. In addition we provide tables of ranked results.

Verifications: Letter sent to top entrants

As one of the top ranking participants, you will have to cooperate with the organizers to reproduce your results in order to qualify for the prizes. The outcome of the tests (or the absence of tests) will be published. To that end, we expect you to:
1) Elaborate on your method in the fact sheet:
- What specific feature selection methods did you try?
- What else did you try besides the method you submitted last? What do you think was a critical element of success compared to other things you tried?
- In what ways do the models for versions 0, 1, and 2 of the various tasks differ?
- Did you rely on the quartile information available on the web site for model selection or did you use another scheme?
- In the result table you submitted, did you use nested subsets of features from the slist you submitted?
- Did you use any knowledge derived from the test set to make your submissions, including simple statistics and visual examination of the data?
2) Upload your code by May 15 to:
login: ywc
password: yinwen4wcci (change your password)
Go to "All files" and click "upload file".
** IMPORTANT: the code should be strictly standalone and respect the following guidelines:
- Provide both executables (Windows and/or Linux) and source code for TWO SEPARATE training and test modules, called yourname_train and yourname_test.
- The two modules should be strictly standalone, include all necessary libraries compiled in, and take no command-line arguments.
- The modules should regularly output to STDOUT a status of progress.
- The modules should regularly save partial results so that, if they are interrupted, they can be restarted from intermediate results.
- The two modules will read and write to disk files including:
    * a configuration file "config.txt" including the path to directories DATA_DIR where data are, MODEL_DIR where the models are, and OUTPUT_DIR where outputs should be written
    * the data in the challenge standard formats to be read from DATA_DIR
    * files "dataname[num]_manip.txt" specifying which features are manipulated in the test data to be read from DATA_DIR
    * a file "marti0_calib.txt", special for MARTI, containing the calibrant numbers
    * the models in a format you can freely choose to be written to MODEL_DIR
    * the prediction results (including feature lists and output predictions) in the same format as challenge submissions to be written to OUTPUT_DIR
A) The training program will:
- Read from the current directory the configuration file "config.txt", indicating DATA_DIR and MODEL_DIR (keyword followed by path name, separated by newline)
- Read from DATA_DIR the training data (in standard format; since all 3 training sets are the same for a given task, we will use only version 0; only data from one task will be present in DATA_DIR) and the files "dataname[num]_manip.txt" indicating the list of manipulated variables (see the self-explanatory format attached).
- Produce models for test sets 0, 1, and 2, saved in MODEL_DIR
To save time, since much of the training may be similar for versions 0, 1, and 2, the training module should process everything at once and output models for all three versions. The training module should save feature sets or feature rankings in MODEL_DIR. If nested subsets of features are used, the training module should produce models for all subsets. It is therefore important that the module can be restarted if it gets interrupted and that all intermediate results are stored.
B) The test program will:
- Read from the current directory the configuration file "config.txt", indicating DATA_DIR, MODEL_DIR, and OUTPUT_DIR (keyword followed by path name, separated by newline)
- Read the test data from DATA_DIR (in standard format). Test sets for all 3 versions of a given task will be found in that directory and should all be processed.
- Load the corresponding model(s) from MODEL_DIR
- Output the submission as it was made to the website in OUTPUT_DIR

Special Matlab instructions:

- Provide 2 Matlab functions
A) [yourname]_train(data_name, train_data_dir, model_dir, log_dir)
This function should read from train_data_dir
 * [data_name]
 * [data_name]_train.targets
 * [data_name]_manip.txt
and should output models to model_dir. Optionally, log_dir can be used to log messages about the status/completion of learning.
B) [yourname]_test(data_name, test_data_dir, model_dir, output_dir)
This function should read from test_data_dir
 * [data_name]
and reload models from model_dir. Then it should produce predictions to output_dir.

- We will run your code on the original datasets and variants.
- Do not pass the full dataset archive as an argument to the program; pass a directory name.
- train_data_dir and test_data_dir will be two separate directories, to make sure we do not accidentally load test data during training.

Results of verifications

The following competitors were asked to provide their code, and the verifications were performed to satisfaction: Yin-Wen Chang, Gavin Cawley, Jianxin Yin. These entrants were awarded prizes for their achievements. Other entrants supplied their code and we performed partial verifications, but full verifications were not performed since they did not qualify for prizes.
The verification results were published in:
Design and Analysis of the Causation and Prediction Challenge
Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander Statnikov; JMLR W&CP 3:1-33, 2008.

Post-challenge tests

We are presently conducting several post-challenge tests to draw stronger conclusions on the problems of the competition. The draft of the WCCI challenge analysis paper is available for review by the participants only.

T1: Influence of the classifier [done]:
The following competitors and organizers were asked to provide classification results using their own classification method for the feature sets of all ranked entrants:
- Marc Boullé (Naive Bayes)
- Alexander Borisov (ensembles of tree classifiers)
- Gavin Cawley (linear ridge regression)
- Yin-Wen Chang (SVM)
The results are shown in Appendix A (bottom). The correlation between causation and prediction does not significantly change on average over all datasets and tasks, although there are some differences for individual entrants and individual datasets and tasks. The best correlation between prediction and causation was obtained by Alexander Borisov (0.88).

T2: Preprocessing for the MARTI dataset [done]:
In order to compare methods more easily, the following competitors and organizers have been asked to provide preprocessed data for the MARTI datasets:
- Laura Brown (the data is in the Matlab format that was used for the original datasets; RAR files are posted for the MARTI0, 1, and 2 sets, split to be less than 25MB in order to upload to the system. The new datasets can be loaded in the same manner as the original data in Matlab. Note that the check sums for the files were not adjusted, which will throw an error if using the sample code function "create_data_struct".)
- Gavin Cawley
- Yin-Wen Chang (each feature of the testing data should be scaled to the range [-1,1] to be used together with this preprocessing)
- Isabelle Guyon
- Jianxin Yin
Other competitors are welcome to also supply their preprocessed data.

T3: Causal discovery [partially done]:
These tests aim at understanding how well the causal discovery algorithms figure out the local causal neighborhood of the target variable and all causes, and at comparing the predictive power of feature sets based on the local causal neighborhood with that of more encompassing feature sets.
The following competitors and organizers were asked to participate:
- Laura Brown
- Gavin Cawley
- Ernest Mwebaze
- Alexander Statnikov
- Jianxin Yin

and produce for each dataset (but only for version 0):
T3A: a depth 3 oriented graph structure,
T3B: a graph of all causes (direct or indirect).

Depth 1 relatives: parents and children.
Depth 2 relatives: grand-parents, grand-children, spouses, siblings.
Depth 3 relatives: great-grand-parents, great-grand-children, uncles/aunts, nieces/nephews, children-in-law and parents-in-law.

Note: task T3A was completed as part of the Pot-Luck challenge and is referred to as the LOCANET task.

The format is a text file containing the list of parents of the features of interest. Two files per dataset should be produced: <dataname>_feat.localgraph (for the depth 3 neighborhood) and <dataname>_feat.causalgraph (for the graph of all ancestors of the target). The target is numbered 0. All other features are numbered with their column number in the data tables.

Example lucas0_feat.localgraph

0: 1 5
1: 3 4
2: 1  
6: 5
8: 6 9
9: 0 11
11: 0 10

Results will be provided only for version 0 of the datasets.

T4: Feature selection [on hold]:
For comparison, the following competitors and organizers were asked to turn in a single optimal feature subset (rather than a feature ranking) for each task, using their feature selection method:
- Marc Boullé
- Alexander Borisov
- Gavin Cawley
- Yin-Wen Chang
- Vladimir Nikulin

The feature sets are submitted via the [dataname]_feat.slist or [dataname]_feat.ulist format of the challenge.

- Are the feature sets stable under data resampling? Provide also, if possible, 10 versions of the feature sets using 90% of the training data.
- Do feature selection algorithms select features from the local neighborhood? Are there features farther away that are more predictive? For SIDO and CINA, probes are non-causes and are not manipulated. All causes are always predictive. In the case of CINA, see whether feature coding has a significant influence; perhaps compare with raw data.

T5: Assessment of the probe method [current]:
We produced new versions of REGED and MARTI (REGEDP and MARTIP), which include probes. These datasets aim at studying the probe method. The artificial datasets REGED and MARTI represent in this case real data for which the data generative process is unknown. We added artificial variables (probes), which are all non-causes. We asked the same people as in the previous section to participate in tests on these new datasets:
- T5A: Provide the local structure of the target (<dataname>_feat.localgraph), for version 0 of REGEDP and MARTIP: causal discovery people only.
- T5B: Provide all causes of the target (<dataname>_feat.causalgraph), for version 0 of REGEDP and MARTIP: causal discovery people only.
- T5C: Provide predictive feature subsets (<dataname>_feat.slist or <dataname>_feat.ulist), for version 0 of REGEDP and MARTIP: feature selection people only.

- Are statistics computed with the probe method good estimators of the true statistics (e.g. precision, MBAUC)?
- A possible strategy on manipulated data (all probes manipulated) is to keep only causes of the target. Do some methods also retain some consequences among the true variables? Does this give them an advantage? (Possibly they also include probes, but this may not penalize them as much as they profit from the addition of true variables.)
- Do the performances on manipulated data with probes assess how well people performed causal discovery?

T6: Optimality of the MB [waiting for results of T3, T4, and T5]:
These tests should be performed on all datasets and all test set versions. They aim at testing the optimality of the Markov blanket (and the "manipulated MB") estimated by causal discovery methods as a feature set, in cases where the estimation is not perfect, and at comparing it to strategies that include more features.

The participants in these tests may include:
- Gavin Cawley
- Vladimir Nikulin
- Yin-Wen Chang
- Isabelle Guyon
- Alexander Statnikov

- T6A: Baseline. Select random subsets of features of varying sizes (1, 2, 4, ..., 2^n) and compute prediction performances using a reference classifier for all datasets and all test set versions. Perform 100 trials in each case. This test aims at providing baseline performances for random subsets of given sizes.
- T6B: Using a reference classifier and the results of causal structure discovery (T3), compute predictions for nested feature subsets built as follows, for all datasets and all test set versions:
       PC, MB, depth 2, depth 3, depth 3 + all causes
Compare with "optimal strategies" in manipulated test sets:
- In manipulated data for which the manipulated variables are known (REGED1 and MARTI1), exclude the manipulated children, and the manipulated spouses when all their children are manipulated.
- In REGED2 and MARTI2, exclude all non-direct causes.
- In SIDO and CINA 1 and 2, exclude all non-causes.
- T6C: Same as T6B for REGED and MARTI using the true structures.
- T6D: Same as T6C but add at random true good relatives to the true MB to see whether increasing the space dimensionality with true good features might be detrimental.
- T6E: Same as T6D but add at random some proportion of true good and true bad features to the true MB.
- T6F: Test random subsets made of certain fractions of good and bad features.
- T6G: Redo all the previous tests, varying the number of training examples.
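The baseline test T6A can be sketched as follows, assuming a caller-supplied `evaluate` function that trains the reference classifier on a given feature subset and returns its prediction score (all names here are hypothetical, for illustration only):

```python
import random

def random_subset_baseline(n_features, sizes, evaluate, n_trials=100, seed=0):
    """Sketch of test T6A: mean score of a reference classifier trained
    on random feature subsets of each size.

    evaluate: caller-supplied function mapping a list of feature column
    indices to a prediction score (e.g. a test-set AUC).
    Returns a dict mapping subset size to the mean score over trials.
    """
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        scores = [evaluate(rng.sample(range(n_features), size))
                  for _ in range(n_trials)]
        results[size] = sum(scores) / n_trials
    return results
```

Comparing these baseline curves with the scores of causally selected subsets of the same size (T6B) isolates the benefit of causal discovery from that of subset size alone.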

- Do the Reference submissions already answer some of our questions: does the optimum number of features correspond to the MB?
- Can we derive formulas to estimate the effect of selecting a certain number of "bad" variables, and how this is compensated by adding variables that are not in the causal neighborhood?

T7: Coding problem:
In the dataset CINA proposed in the challenge, the real variables included categorical variables, which were coded as binary variables. We now provide the raw data CINAR for analysis, to see whether this coding has adverse effects.

Summary table of planned experiments:

Marc Boullé
Alexander Borisov
Laura Brown
Gavin Cawley
Yin-Wen Chang
Isabelle Guyon
Ernest Mwebaze
Vladimir Nikulin
Alexander Statnikov
Jianxin Yin & Chanzang Wang
Robert Tillman