Causality Challenge
Estimate causal direction in noisy background




Simulated data


Motivation: Noninvasive electrophysiological measurements such as EEG and MEG record, to a large extent, unknown superpositions of very many sources. Any relation observed between channels is dominated by meaningless mixtures of mainly independent sources. The question is how to detect and properly interpret true interactions in the presence of such strong confounders.

[Download] data here.

To read the data into MATLAB, type
fid=fopen('simuldata.bin','r');
data=reshape(fread(fid,'float'),6000,2,1000); % time points x sensors x examples
fclose(fid);

The data consist of 1000 examples of bivariate data, each with 6000 time points. Each example is a superposition of a signal (of interest) and noise. The signal is generated by a unidirectional bivariate AR model of order 10 with otherwise random AR parameters and uniformly distributed input. The noise consists of three independent sources, generated by three univariate AR models with random parameters and uniformly distributed input, which were instantaneously mixed into the two sensors with a random mixing matrix. The relative strength of noise and signal was set randomly. The data were generated with this [Matlab code]. (Of course, the seeds for the random number generators chosen for the challenge data are confidential.)

The task is to estimate the direction of the interaction of the signal. A submitted result is a vector of 1000 numbers, each having the value 1, -1, or 0. Here, 1 means the direction is from the first to the second sensor, -1 means the direction is from the second to the first sensor, and 0 means "I don't know".

For every example, either 1 or -1 is correct. The most important point is how the score is counted: you get +1 point for each correct answer, -10 points for each wrong answer, and 0 points for each 0 in the result vector. With this counting, confidence in the result becomes part of the evaluation. It is strongly recommended to assess, for each example, the evidence for a specific finding.
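The counting rule can be sketched in a few lines of MATLAB. This is only an illustration: the helper name score_fn and the toy vectors are made up here and are not part of the challenge material. The last line works out the break-even confidence implied by the rule: answering pays off on average only if the probability p of being correct satisfies p*1 - (1-p)*10 > 0, i.e. p > 10/11 (about 0.91).

```matlab
% Hypothetical helper illustrating the scoring rule described above.
% 'truth' holds the correct directions (1 or -1); 'answer' holds the
% submitted values (1, -1, or 0 for "I don't know").
score_fn = @(answer, truth) sum(answer == truth) ...
                            - 10 * sum(answer ~= 0 & answer ~= truth);

% Toy example with made-up values (not challenge data).
truth  = [ 1 -1  1  1 -1];
answer = [ 1 -1  0 -1 -1];          % three correct, one abstention, one wrong
score  = score_fn(answer, truth);   % 3*(+1) + 1*0 + 1*(-10)

% Break-even confidence: below this probability of being correct,
% answering 0 has the higher expected score.
p_min = 10 / 11;
```

In this toy case a single wrong answer outweighs three correct ones, which is why abstaining on weak evidence is usually the better choice.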


Real EEG data for 10 subjects


Download the data here [subject1] [subject2] [subject3] [subject4] [subject5] [subject6] [subject7] [subject8] [subject9] [subject10]

To read the data of, e.g., the first subject into MATLAB, type
fid=fopen('sub1.bin','r');
data=reshape(fread(fid,'float'),[],19); % time points x channels
fclose(fid);

Each data set is an eyes-closed EEG measurement of one subject, recorded with 19 channels placed according to the standard 10-20 system. The sampling rate is 256 Hz. If you divide a data set into blocks of 4 seconds (i.e. 1024 data points), each block is a continuous measurement that has been cleaned of apparent artefacts.
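A minimal sketch of cutting a recording into such 4-second blocks follows. It uses a synthetic stand-in for the data matrix so that it runs on its own; the variable names (blockLen, nBlocks, blocks) are illustrative, and dropping an incomplete trailing block is an assumption made here, not something the data description specifies.

```matlab
% Synthetic stand-in for one subject's recording (not real EEG):
% entry (r, c) equals r + 10000*(c-1), so every sample is identifiable.
nSamples = 2500; nChan = 19;
data = repmat((1:nSamples)', 1, nChan) + 10000 * repmat(0:nChan-1, nSamples, 1);

blockLen = 1024;                             % 4 s at 256 Hz
nBlocks  = floor(size(data, 1) / blockLen);  % assumption: drop incomplete tail
data     = data(1:nBlocks * blockLen, :);

% Reshape to time x block x channel, then reorder to time x channel x block.
blocks = permute(reshape(data, blockLen, nBlocks, nChan), [1 3 2]);
```

With this layout, blocks(:,:,k) is the k-th continuous 1024 x 19 segment, so spectral estimates can be computed per block and then averaged without windows spanning block boundaries.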

The data all have a strong signal at around 10 Hz, called the alpha rhythm, predominantly in occipital regions (i.e. the back part of the brain). The 10 subjects were selected from a total of 88 subjects according to an estimated signal-to-noise ratio. The data were provided by Tom Brismar from the Karolinska Institute in Stockholm. Any reference to subject names or IDs was removed.

The challenge is to estimate the causal direction of the alpha rhythm for these data sets as an average across all 10 subjects. The result must be a single 19x19 matrix, say C. The matrix element C_ij must reflect the 'strength' of the causal drive from channel i to channel j. Please do not set non-significant results to zero or reduce the result to binary numbers; the resulting figures can be difficult to interpret. The precise meaning of the term 'strength' varies across methods. Furthermore, methods differ in whether the causal drive they report is direct or indirect. We leave these things to the participant, who should give a short explanation of what the result means.

Since the ground truth is not known, we only collect all results and send back a visualisation of each result. With the permission of the authors, we put the respective figure plus a comment by the authors on the net. The purpose is to compare different methods on the same data and to discuss the results. Both the amount and the quality of the data are very high, and hence we can expect reasonable estimates from many different methods.

Here is a warning: in our experience there is large variability across subjects. Therefore, one cannot expect consistent results across all subjects. Also, EEG data are typically very noisy at very low frequencies (below 1 Hz). Make sure to avoid artefacts caused by slow drifts.
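One possible way to reduce slow drifts, shown here only as a sketch and not as a prescribed preprocessing step, is to remove a linear trend from every channel within each 4-second block. The example uses synthetic data (a 10 Hz oscillation plus a linear drift) laid out as time x channel x block; that layout and all variable names are assumptions made for illustration. Other choices, such as high-pass filtering, are equally conceivable.

```matlab
% Synthetic stand-in for epoched data (time x channel x block), not real EEG:
% a 10 Hz oscillation plus a slow linear drift in every channel and block.
fs = 256; blockLen = 1024; nChan = 19; nBlocks = 3;
t = (0:blockLen-1)' / fs;
x = sin(2*pi*10*t) + 5*t;                  % oscillation + drift
blocks = repmat(x, [1, nChan, nBlocks]);

% Remove the best-fitting straight line from each channel of each block.
% detrend works column-wise, so flatten channels and blocks into columns.
flat  = detrend(reshape(blocks, blockLen, []));
clean = reshape(flat, blockLen, nChan, nBlocks);
```

After this step each column has zero mean and no linear trend, while the 10 Hz oscillation is left essentially unchanged.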

For illustration we show our own result for these data sets using the [Phase Slope Index]:

[Figure: Imaginary Coherence, 10 Hz, Rest]
(Download the software to create such figures here.) Here, each small circle shows the flow from one channel to all other channels. Positive values (red) mean sending and negative values (blue) mean receiving information. The values denote relative temporal delay in a pseudo-z-score sense: absolute values larger than 2 are significant at the single-subject level without correction for multiple comparisons. The method (in this form) does not distinguish between direct and indirect interaction. The interpretation would be that frontal channels (top panels in the figure) send information to channels in the back.


For questions:

Dr. Guido Nolte
Fraunhofer FIRST
Kekuléstr. 7
12489 Berlin, Germany

email: guido.nolte"at"first.fraunhofer.de
Tel: +49 (30) 6392-1861
Fax: +49 (30) 6392-1879