Causation and Prediction Challenge

Appendix B: Pairwise Comparisons


Further analysis of participant performance: pairwise comparisons

The results of the causation and prediction challenge were biased in favor of participants who returned tables of results. This section corrects for this bias by making pairwise comparisons of participant results at an equal number of features.

According to the rules of the challenge, participants could return prediction results for nested subsets of features of size 1, 2, 4, 8, ... N. The best Tscore (test set AUC) among all these predictions was used to rank each entry. In this way, we encouraged participants to perform feature ranking and return result tables, from which we could make graphs of performance as a function of the number of features.
By analyzing these graphs, we made the following observations:
The last two observations made us worry that the results did not give enough credit to methods that performed actual causal discovery but no feature ranking, and gave too much credit to feature ranking methods that did not perform causal discovery. To correct for this bias, we made pairwise comparisons between participants. We consider 3 cases:
We then used as new scores the fraction of times a given participant obtained better results than another, with respect to Tscore and our new Fscore (see Figures B1 to B5).
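
As an illustration of this kind of scoring, the following Python sketch compares participants only at feature counts that both of them reported, and returns for each participant the fraction of comparisons won. It is a minimal sketch under stated assumptions, not the procedure actually used for the challenge: the function name `pairwise_win_fraction`, the dictionary layout, the half-credit rule for ties, and the example scores are all hypothetical, and the distinction between the cases mentioned above is not modeled.

```python
from itertools import permutations


def pairwise_win_fraction(results):
    """Fraction of pairwise comparisons each participant wins.

    `results` maps a participant name to {number_of_features: score}.
    Two participants are compared only at feature counts both reported.
    """
    wins = {name: 0.0 for name in results}
    totals = {name: 0 for name in results}
    for a, b in permutations(results, 2):
        shared_sizes = set(results[a]) & set(results[b])
        for n in shared_sizes:
            totals[a] += 1
            if results[a][n] > results[b][n]:
                wins[a] += 1.0
            elif results[a][n] == results[b][n]:
                wins[a] += 0.5  # assumption: a tie counts as half a win
    # Participants with no comparable result get NaN rather than a score.
    return {
        name: (wins[name] / totals[name]) if totals[name] else float("nan")
        for name in results
    }


# Hypothetical scores (for example Tscore) indexed by number of features.
example = {
    "A": {1: 0.70, 2: 0.80, 4: 0.85},
    "B": {1: 0.65, 2: 0.82, 4: 0.80},
    "C": {4: 0.90},  # a single result at one feature count, no table
}
fractions = pairwise_win_fraction(example)
```

In this toy example, participant C reports a single result yet is still compared with A and B at the one feature count they share, which is the point of comparing at an equal number of features.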


The correlations between Fscore and Tscore are enhanced considerably with this pairwise comparison (see Table B1). The analysis of the graphs indicates that the team of Jianxin Yin and Prof. Geng did best with respect to both causal discovery and target prediction performance. However, Vladimir Nikulin and Marc Boullé closely match their prediction performance while using plain feature selection. An examination of the entries suggests that Vladimir Nikulin may have been significantly influenced by the feedback from the quartiles provided during the challenge. Marc Boullé, in contrast, made only one submission and used methods that make independence assumptions between features. Nonetheless, on REGED, the team of Laura Brown and Ioannis Tsamardinos (LEB&YT) comes ahead, and on SIDO and MARTI the team of Jianxin Yin and Prof. Geng comes ahead with respect to both Tscore and Fscore. It is only on CINA that feature selection methods achieve strong prediction performance without causal discovery.

Table B1: Correlation between Fscore and Tscore. We show the Pearson correlation coefficient, with the p-value in parentheses. All correlations except the two with p-value 0.09 (SIDO and CINA, Set 0) are significant with 95% confidence.

Dataset  Set 0          Set 1          Set 2          Mean
REGED    0.55 (0.007)   0.55 (0.006)   0.63 (0.001)   0.62 (0.001)
SIDO     0.36 (0.09)    0.64 (0.001)   0.60 (0.002)   0.70 (0.0002)
CINA     0.36 (0.09)    0.69 (0.0003)  0.65 (0.0007)  0.72 (0.0001)
MARTI    0.94 (1e-11)   0.96 (1e-13)   0.90 (4e-9)    0.95 (2e-12)
Mean     0.64 (0.001)   0.75 (3e-5)    0.73 (7e-5)    0.70 (3e-11)
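
For readers who want to compute this kind of statistic on their own score lists, the Pearson coefficient can be sketched in plain Python as follows. This is a generic textbook computation, not the code used to produce Table B1; the function names and the toy inputs are hypothetical. The p-value is not computed here: under the usual null hypothesis of no correlation it is derived from the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, which the second function returns.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 points")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical Fscore and Tscore values for four participants.
fscores = [1, 2, 3, 4]
tscores = [1, 3, 2, 4]
r = pearson_r(fscores, tscores)
t = t_statistic(r, len(fscores))
```

If SciPy is available, `scipy.stats.pearsonr(x, y)` returns both the coefficient and a p-value and is the more convenient choice.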

Warning: Our method of scoring introduced an undesirable bias. We recommend against using the best of several performances computed on test data in a challenge in which the test data is not drawn from the same distribution as the training data, because this amounts to injecting important information about the test distribution into the results. This is particularly important when comparing causal discovery methods: if many methods are tried, then because of variance in the results, some methods not performing causal discovery may perform surprisingly well.
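
The variance effect described in this warning can be illustrated with a small simulation. The sketch below is purely illustrative and uses made-up numbers: it assumes that every method has the same true score and that each test-set measurement adds independent Gaussian noise, then compares a single measurement with the best of `k` measurements. It covers only the variance part of the argument, not the change of distribution between training and test data.

```python
import random
import statistics


def best_of_k_bias(true_score=0.70, noise_sd=0.02, k=10, trials=2000, seed=0):
    """Mean of one noisy score versus mean of the best of k noisy scores.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    single, best = [], []
    for _ in range(trials):
        scores = [rng.gauss(true_score, noise_sd) for _ in range(k)]
        single.append(scores[0])   # what one honest submission would report
        best.append(max(scores))   # what best-of-k selection would report
    return statistics.mean(single), statistics.mean(best)


mean_single, mean_best = best_of_k_bias()
print(f"single: {mean_single:.3f}  best of 10: {mean_best:.3f}")
```

Although every simulated method is equally good, the best-of-`k` figure comes out higher than the single-measurement figure, and the gap grows with `k` and with the noise level.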

Figure B1: Performance of participants for test sets 0. Left: for each dataset. Right: averaged over datasets.

Figure B2: Performance of participants for test sets 1. Left: for each dataset. Right: averaged over datasets.

Figure B3: Performance of participants for test sets 2. Left: for each dataset. Right: averaged over datasets.

Figure B4: Performance of participants averaged over sets 0, 1, and 2 for the 4 tasks (red=REGED, green=SIDO, blue=CINA, black=MARTI).

Figure B5: Performance of participants averaged over sets 0, 1, and 2 and over all 4 tasks.