
Causation and Prediction Challenge

Appendix A: Correlation analysis


Further analysis of correlation between causation and prediction

The results of the causation and prediction challenge showed a poor correlation between the Fscore we had defined to assess causal feature relevance and the Tscore measuring prediction accuracy.

We ended up creating a new Fscore, which correlates better with the Tscore. Consider the quantities called "precision" and "recall" in information retrieval. The recall is the "sensitivity" (the fraction of success on the positive class, i.e. the fraction of relevant features retrieved relative to the total number of relevant features). The precision is the fraction of relevant features among the features selected (i.e. 1-FDR, where FDR is the "false discovery rate"). We found that both precision and recall correlate with the Tscore, but precision correlates more. Recall is not a good metric for SIDO and CINA (real data + probes) because we must approximate the set of "relevant" features by the set of "true" features, among which some might be irrelevant. Conversely, the fraction of true features in the set of features selected is a good proxy for the real precision, which cannot be computed directly. We also tried the Fmeasure = 2*precision*recall/(precision+recall).

Computing these scores required defining a set of good (relevant) features and a set of bad (irrelevant) features. This is a difficult exercise since all features have "some" degree of relevance. To partially remedy that problem, we computed 3 versions of precision and recall, using as the set of relevant features either set1 = Markov blanket, set2 = MB + causes & effects, or set3 = all relatives. The resulting scores are weighted averages of these three versions, using weights 3, 2, 1, to emphasize the features that should be most relevant. Based on such average precision and recall, we define the new Fscore as the precision for SIDO and CINA and the Fmeasure for REGED and MARTI.
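To make the definition concrete, the following is a minimal sketch of the computation described above (the function names and data layout are illustrative assumptions, not the actual scoring code used for the challenge):

import numpy as np

def precision_recall(selected, relevant):
    """Precision and recall of a selected feature set with respect to a reference set."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)
    precision = tp / len(selected) if selected else 0.0   # precision = 1 - FDR
    recall = tp / len(relevant) if relevant else 0.0      # recall = sensitivity
    return precision, recall

def new_fscore(selected, markov_blanket, mb_causes_effects, all_relatives, dataset):
    """Weighted-average precision/recall over the three reference sets (weights 3, 2, 1),
    then precision for the real datasets (SIDO, CINA) and the F-measure for the
    artificial ones (REGED, MARTI)."""
    ref_sets = [markov_blanket, mb_causes_effects, all_relatives]
    weights = np.array([3.0, 2.0, 1.0])
    pr = np.array([precision_recall(selected, ref) for ref in ref_sets])
    precision = np.average(pr[:, 0], weights=weights)
    recall = np.average(pr[:, 1], weights=weights)
    if dataset in ("SIDO", "CINA"):
        return precision                  # recall is unreliable when the "true" features only approximate relevance
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # F-measure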

In the following plots, we show the new Fscore as a function of the Tscore. The horizontal line indicates the fraction of relevant features in the entire feature set. Hence, it can be seen that many entrants do better than selecting features at random, since their feature sets are significantly enriched in relevant features. The Tscore is normalized as max(0, (Tscore-0.5)/(MaxTscore-0.5)) to indicate performance relative to the best achievable performance. Our observations are summarized in Table A1 and Figures A1-A5.

           Set 0            Set 1            Set 2            Mean
REGED       0.34 (0.2)       0.10 (0.7)       0.65 (0.006)     0.58 (0.02)
SIDO        0.31 (0.2)       0.38 (0.2)       0.44 (0.1)       0.56 (0.01)
CINA       -0.30 (0.3)      -0.00 (1)         0.52 (0.05)      0.47 (0.05)
MARTI       0.66 (0.02)      0.94 (7e-006)    0.67 (0.02)      0.87 (0.0002)
Mean        0.90 (2e-008)    0.83 (2e-006)    0.76 (4e-005)    0.84 (2e-018)

Table A1: Correlation between the new Fscore and the Tscore (p-values in parentheses). The mean corresponds to the correlation of the scores averaged over the datasets.
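For reference, the normalization used on the horizontal axis of the figures and the kind of correlation reported in Table A1 could be computed along the following lines (an illustrative sketch; Pearson correlation is assumed here, which the reported p-values suggest but the text does not state explicitly):

from scipy.stats import pearsonr

def relative_tscore(tscore, max_tscore):
    """Normalize the Tscore relative to the best achievable performance (0.5 = chance level)."""
    return max(0.0, (tscore - 0.5) / (max_tscore - 0.5))

def fscore_tscore_correlation(fscores, tscores, max_tscore):
    """Correlation coefficient and p-value between the new Fscore and the relative Tscore
    across all entries for a given dataset and test set version."""
    rel = [relative_tscore(t, max_tscore) for t in tscores]
    return pearsonr(fscores, rel)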


Figure A1: Test sets 0 -- New Fscore as a function of relative Tscore. Right: results for all 4 tasks. Left: results averaged over all 4 tasks.

Figure A2: Test sets 1 -- New Fscore as a function of relative Tscore. Right: results for all 4 tasks. Left: results averaged over all 4 tasks.

Figure A3: Test sets 2 -- New Fscore as a function of relative Tscore. Right: results for all 4 tasks. Left: results averaged over all 4 tasks.

Figure A4: Results averaged over sets 0, 1, and 2 for the 4 tasks (red=REGED, green=SIDO, blue=CINA, black=MARTI).


Figure A5: Average scores -- Performance of participants averaged over all datasets.

We performed complementary tests by asking several challenge participants to run all the feature sets selected by the ranked participants with the same classifiers (Figure A6). We expected the correlation between Fscore and Tscore to become more significant by reducing the classifier variability. However, we did not observe a gain in correlation on average (correlations were better for REGED and worse for SIDO). We show below the averaged graphs (over all datasets and test set variants), indicating some performance variation depending on the classifier. For instance, Alexei Polovinkin and the CAMML team benefited from a boost in Tscore, but Gavin Cawley and Vladimir Nikulin did best using their own original classifiers (note that Gavin Cawley used a single classifier in this test, but a large ensemble of classifiers in his final entry). In contrast, the features of Yin-Wen Chang do well with all multivariate classifiers, and significantly better than others with linear ridge regression and SVM.
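The protocol of this complementary test can be sketched as follows (purely illustrative: the common classifier, the data layout, and the use of AUC as the Tscore are assumptions on our part, consistent with the 0.5 chance-level baseline used in the normalization above, but not the participants' actual code):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def tscores_with_fixed_classifier(X_train, y_train, X_test, y_test,
                                  feature_subsets, make_classifier=LinearSVC):
    """Re-evaluate each entrant's feature subset with one common classifier, so that
    differences in Tscore reflect feature selection rather than the classifier."""
    scores = {}
    for entrant, features in feature_subsets.items():
        clf = make_classifier()
        clf.fit(X_train[:, features], y_train)
        decision = clf.decision_function(X_test[:, features])
        scores[entrant] = roc_auc_score(y_test, decision)   # Tscore taken here as the test AUC
    return scores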



Figure A6: Average scores -- Performance of participants averaged over all datasets when the entrants' feature sets are tested with the same classifier. From left to right and top to bottom: ensemble of tree classifiers (courtesy of A. Borisov, Intel), linear ridge regression (courtesy of Gavin Cawley, University of East Anglia, UK), linear SVM (courtesy of Yin-Wen Chang, National Taiwan University), naive Bayes classifier (courtesy of Marc Boullé, France Telecom).