2.1.1. dodiscover.ci.CMITest

class dodiscover.ci.CMITest(k=0.2, transform='rank', n_jobs=-1, n_shuffle_nbrs=5, n_shuffle=100, random_seed=None)

Conditional mutual information independence test.

Implements the conditional independence test using conditional mutual information proposed in [1].

Parameters:
k : float, optional

Number of nearest neighbors for each sample point. If the value is smaller than 1, it is interpreted as a fraction of the number of samples. By default 0.2.

transform : str, optional

Transformation applied to the data before testing. Can be ‘rank’, ‘uniform’, or ‘standardize’. By default ‘rank’, which converts the data to ranks.

n_jobs : int, optional

The number of CPUs to use, by default -1, which corresponds to using all CPUs available.

n_shuffle_nbrs : int, optional

Number of nearest-neighbors within the Z covariates for shuffling, by default 5.

n_shuffle : int, optional

The number of times to shuffle the dataset to generate the null distribution. By default 100.

random_seed : int, optional

The seed passed to np.random.default_rng to initialize the random number generator, by default None.
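The following is a minimal, standard-library-only sketch of how the k and transform parameters might be interpreted. It is not the library's code: the helper names (resolve_k, rank_transform, uniform_transform, standardize_transform) are hypothetical, the truncation rule for a fractional k is an assumption, and the behaviour shown for ‘uniform’ and ‘standardize’ is inferred from the option names only.

```python
import statistics


def resolve_k(k, n_samples):
    """Hypothetical helper: turn the `k` argument into a neighbor count.

    Assumption: a value below 1 is read as a fraction of n_samples and
    truncated to an integer, with a floor of 1. The library may round
    differently.
    """
    if k < 1:
        return max(1, int(k * n_samples))
    return int(k)


def rank_transform(values):
    """Replace each value by its 1-based rank (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def uniform_transform(values):
    """Assumed meaning of 'uniform': map values to (0, 1] via rank / n."""
    n = len(values)
    return [r / n for r in rank_transform(values)]


def standardize_transform(values):
    """Assumed meaning of 'standardize': zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Dispatch table keyed by the option names listed for `transform`.
TRANSFORMS = {
    "rank": rank_transform,
    "uniform": uniform_transform,
    "standardize": standardize_transform,
}

data = [2.5, -1.0, 0.3, 4.1]
k_neighbors = resolve_k(0.5, len(data))  # fraction of the sample size
ranked = TRANSFORMS["rank"](data)
```

A rank or uniform transform makes the nearest-neighbor distances insensitive to the scale and marginal shape of each variable, which is one plausible reason for ‘rank’ being the default.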

Notes

Conditional mutual information (CMI) is defined as:

\[I(X;Y|Z) = \iiint p(z) p(x,y|z) \log \frac{ p(x,y|z)}{p(x|z)\cdot p(y |z)} \,dx dy dz\]

CMI is equal to 0 if and only if \(X \perp Y | Z\), which makes it a general measure of conditional dependence. The estimator for CMI proposed in [1] is a k-nearest-neighbor based estimator:

\[\widehat{I}(X;Y|Z) = \psi (k) + \frac{1}{T} \sum_{t=1}^T (\psi(k_{Z,t}) - \psi(k_{XZ,t}) - \psi(k_{YZ,t}))\]

where \(\psi\) is the digamma function (see scipy.special.digamma) and \(T\) is the number of samples. \(k\) determines the size of the hyper-cubes around each (high-dimensional) sample point, and \(k_{Z,t}\), \(k_{XZ,t}\), \(k_{YZ,t}\) are the numbers of neighbors of sample \(t\) in the respective subspaces. \(k\) can be viewed as a density smoothing parameter (although it is data-adaptive, unlike fixed-bandwidth estimators). For large \(k\), the underlying dependencies are more smoothed and the CMI estimate has a larger bias but a lower variance, and low variance matters more for significance testing. Note that the estimated CMI values can be slightly negative even though CMI is a non-negative quantity.

The estimator implemented here assumes the data is continuous.
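To make the formula above concrete, here is a brute-force, standard-library-only sketch of the k-nearest-neighbor estimator for scalar X, Y and Z. It is an illustration under stated assumptions, not the library's implementation: the names digamma_int, max_norm and knn_cmi are hypothetical, distances use the maximum norm (matching the hyper-cube description), and the subspace counts are assumed to include the sample point itself and to use a strict inequality.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def max_norm(a, b):
    """Maximum-norm distance, so that neighborhoods are hyper-cubes."""
    return max(abs(u - v) for u, v in zip(a, b))


def knn_cmi(x, y, z, k):
    """Brute-force kNN estimate of I(X;Y|Z) for scalar samples (O(T^2))."""
    n = len(x)
    xyz = list(zip(x, y, z))
    xz = list(zip(x, z))
    yz = list(zip(y, z))
    zz = [(v,) for v in z]
    total = 0.0
    for t in range(n):
        # Distance from sample t to its k-th nearest neighbor in the joint space.
        dists = sorted(max_norm(xyz[t], xyz[s]) for s in range(n) if s != t)
        eps = dists[k - 1]
        # Count samples strictly inside the eps-cube in each subspace.
        # Assumed convention: sample t itself is included in each count.
        k_xz = sum(max_norm(xz[t], p) < eps for p in xz)
        k_yz = sum(max_norm(yz[t], p) < eps for p in yz)
        k_z = sum(max_norm(zz[t], p) < eps for p in zz)
        total += digamma_int(k_z) - digamma_int(k_xz) - digamma_int(k_yz)
    return digamma_int(k) + total / n


rng = random.Random(0)
n = 200
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
# y_indep depends on z only, so X and Y are independent given Z.
y_indep = [zi + rng.gauss(0, 1) for zi in z]
# y_dep depends on x directly, so X and Y remain dependent given Z.
y_dep = [xi + 0.3 * rng.gauss(0, 1) for xi in x]

cmi_indep = knn_cmi(x, y_indep, z, k=10)
cmi_dep = knn_cmi(x, y_dep, z, k=10)
```

With data like this, cmi_indep should land near zero (possibly slightly negative) while cmi_dep is clearly positive. A practical implementation would replace the quadratic loops with a spatial index such as a k-d tree.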

References

Methods

test(df, x_vars, y_vars[, z_covariates])

Abstract method for all conditional independence tests.

test(df, x_vars, y_vars, z_covariates=None)

Abstract method for all conditional independence tests.

Parameters:
df : pd.DataFrame

The dataframe containing the dataset.

x_vars : Set of column

A set of columns in df.

y_vars : Set of column

A set of columns in df.

z_covariates : Set, optional

A set of columns in df, by default None. If None, then the test should run a standard independence test.

Returns:
Tuple[float, float]

The test statistic and the p-value.
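The n_shuffle_nbrs and n_shuffle parameters suggest that the p-value comes from a local permutation scheme: X is shuffled only among samples with similar Z, and the statistic is recomputed n_shuffle times to form the null distribution. The sketch below illustrates that idea under several assumptions. All function names are hypothetical, Z is one-dimensional, neighbors are drawn with replacement (a simplification), the absolute covariance is a stand-in statistic rather than the CMI estimator, and the p-value is the plain fraction of null statistics at least as large as the observed one.

```python
import random


def neighbor_indices(z, n_nbrs):
    """For each sample i, the indices of the n_nbrs samples closest to z[i].

    Sample i itself is at distance 0 and is therefore part of its own list.
    """
    n = len(z)
    return [
        sorted(range(n), key=lambda j: abs(z[j] - z[i]))[:n_nbrs]
        for i in range(n)
    ]


def local_shuffle(x, neighbors, rng):
    """Replace each x[i] by the x-value of a random neighbor of i in Z."""
    return [x[rng.choice(nbrs)] for nbrs in neighbors]


def abs_covariance(x, y):
    """Stand-in test statistic (NOT the CMI estimator): |sample covariance|."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)


def local_permutation_pvalue(x, y, z, statistic, n_shuffle_nbrs, n_shuffle, rng):
    observed = statistic(x, y)
    neighbors = neighbor_indices(z, n_shuffle_nbrs)  # computed once, reused
    null_stats = [
        statistic(local_shuffle(x, neighbors, rng), y) for _ in range(n_shuffle)
    ]
    # Fraction of null statistics at least as large as the observed one.
    pvalue = sum(s >= observed for s in null_stats) / n_shuffle
    return observed, pvalue


rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(200)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [xi + 0.1 * rng.gauss(0, 1) for xi in x]  # depends on x beyond z

stat, pvalue = local_permutation_pvalue(
    x, y, z, abs_covariance, n_shuffle_nbrs=5, n_shuffle=100, rng=rng
)
```

Shuffling within Z-neighborhoods keeps the relationship between X and Z roughly intact while breaking any remaining link between X and Y, which is what the null hypothesis of conditional independence requires. With n_shuffle=100, the smallest nonzero p-value this scheme can report is 0.01.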