Further analysis of correlation between causation and prediction
The
results
of the causation and prediction challenge showed a poor correlation between
the Fscore we had defined to assess causal feature relevance and the Tscore
measuring prediction accuracy.
We ended up creating a new Fscore, which correlates better with the Tscore.
Consider the quatities called in information retrieval "precision" and "recall".
The recall is the "sensitivity" (fraction of success of the positive class,
i.e. the fraction of relevant features retrieved relatively to to the total
number of relevant features). The precision is the fraction of relevant
features in the features selected (i.e. 1-FDR, where FD is the "false discovery
rate"). We found that both precision and recall correlate with Tscore, but
precision correlates more. Recall is not a good metric for SIDO and CINA
(real data + probes) because we must approximate the set of "relevant" features
by the set of "true" features, among which some might be irrelevant. Conversely
the fraction of true features in the set of features selected is a good proxi
for the real precision, which cannot be computed directly. We also tried
the Fmeasure=2*precision*recall/(precision+recall). Computing these scores
required defining a set of good (relevant) features and a set of bad (irrelevant
features). This is a difficult exercise sine all features have "some" degree
of relevancy. To partially remedy that problem, we computed 3 versions of
precision and recall, using as set of relevant features, set1="Markov blanket"
or set2=MB+causes&effects or set3=all_relatives. The resulting scores
is a weighted average using weights 3,2,1, to emphasize more the features,
which should be more relevant. Based on such average precision and recall,
we define the new Fscore as the precision for SIDO and CINA and the Fmeasure
for REGED and MARTI.
In the following plots, we show the new Fscore, as a function of
Tscore. The horizontal line indicates the fraction of relevant
features in the entire feature set. Hence, it can be seen that many entrants
do better then selecting features at random since their feature set is
significantly enriched in relevant features. The Tscore is normalized
as follows: max(0, (Tscore-0.5)/(MaxTscore-0.5)) to indicate performance
relative to the best achievable performance. Our observations include:
- We see that for test sets numbered 0, some of the top ranking
entries have feature sets with precision no better than the entire feature
set (some actually used the entire feature set). See for instance the
entries of Gavin Cawley.
- Perhaps more surprisingly, some entries in for test sets
1 and 2 have poor Fscore and yet a good Tscore. This is particularly
interesting because, in the case of test sets 0, the same irrelevant
features are both in training and test data, so the classifiers can train
to ignore them. For test sets 1 and 2, there are more irrelevant features
in test data than in training data, due to the manipulations.
- The correlations for test sets 2 are more important than
for test sets 1 because of their design. For REGED and MARTI, test set
1 has fewer members of the original Markov blanket that are manipulated
then test set 2. For SIDO and CINA, all probes are manipulated, but the
choice of values in test set 1 corresponds to a simle randomization, while
in test set 2 the manipulations are made in an adversarial way to penalize
more heavily people who select wrong features.
- Going back to the table
of reference entries, we see that, if the Markov blanket (or the
"manipulated" system) is known, using it gives an advantage. However finding
it from training data is hard and training data must also be used to train
the predictive model. As a result, the best performing entries often use
many more features than the size of the Markov blanket, including a large
number of relatives, partially compensating from erroneously selected irrelevant
features.
- To get rid of some of the variance, we averaged over several
test sets the new Fscore and the Tscore to obtain the last plot, which
demonstrates that there is overall good correlation between causation
and prediction. We show below the whole table of corelation values between
new Fscore and Tscore:
|
Set 0
|
Set 1
|
Set
2
|
Mean
|
REGED
|
0.34
( 0.2)
|
0.10
( 0.7)
|
0.65
(0.006)
|
0.58
( 0.02)
|
SIDO
|
0.31
( 0.2)
|
0.38
( 0.2)
|
0.44
( 0.1)
|
0.56
( 0.01)
|
CINA
|
-0.30 (
0.3)
|
-0.00 (
1)
|
0.52
( 0.05)
|
0.47
( 0.05)
|
MARTI
|
0.66
( 0.02)
|
0.94
(7e-006)
|
0.67
( 0.02)
|
0.87
(0.0002)
|
Mean
|
0.90
(2e-008)
|
0.83
(2e-006)
|
0.76
(4e-005)
|
0.84
(2e-018)
|
Table A1 : Correlation between New Fscore
and Tscore. The mean corresponds the the correlation of scores averaged
over several datasets.