2.1.1. dodiscover.ci.CMITest
- class dodiscover.ci.CMITest(k=0.2, transform='rank', n_jobs=-1, n_shuffle_nbrs=5, n_shuffle=100, random_seed=None)
Conditional mutual information independence test.
Implements the conditional independence test using conditional mutual information proposed in [1].
- Parameters:
- k (float, optional): Number of nearest neighbors for each sample point. If the value is smaller than 1, it is interpreted as a fraction of the number of samples. By default 0.2.
- transform (str, optional): Transformation applied to the data before testing. Can be ‘rank’, ‘uniform’ or ‘standardize’. By default ‘rank’, which converts the data to ranks.
- n_jobs (int, optional): The number of CPUs to use. By default -1, which uses all available CPUs.
- n_shuffle_nbrs (int, optional): Number of nearest neighbors within the Z covariates used for shuffling. By default 5.
- n_shuffle (int, optional): The number of times to shuffle the dataset to generate the null distribution. By default 100.
- random_seed (int, optional): The random seed passed to np.random.default_rng.
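The following is a minimal sketch of how the `k` and `transform` parameters described above could be interpreted. The helper names `resolve_k` and `apply_transform` are hypothetical and are not part of dodiscover; the rounding rule for a fractional `k` and the exact meaning of ‘uniform’ (empirical CDF values) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import rankdata


def resolve_k(k, n_samples):
    """Turn the ``k`` argument into a neighbor count (assumed rule)."""
    if k < 1:
        # Fractional k: interpret it as a share of the samples, at least 1.
        return max(1, int(k * n_samples))
    return int(k)


def apply_transform(data, transform="rank"):
    """Transform each column of a 2-D array independently (illustrative)."""
    data = np.asarray(data, dtype=float)
    if transform == "rank":
        # Replace values by their ranks 1..n within each column.
        return rankdata(data, axis=0)
    if transform == "uniform":
        # Assumed meaning: empirical CDF values in (0, 1] per column.
        return rankdata(data, axis=0) / data.shape[0]
    if transform == "standardize":
        # Zero mean and unit variance per column.
        return (data - data.mean(axis=0)) / data.std(axis=0)
    raise ValueError(f"Unknown transform: {transform!r}")
```

Under these assumptions, `resolve_k(0.5, 100)` yields a neighbor count of half the samples, while an integer such as `k=5` is used unchanged.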
Notes
Conditional mutual information (CMI) is defined as:
\[I(X;Y|Z) = \iiint p(z) p(x,y|z) \log \frac{p(x,y|z)}{p(x|z) \cdot p(y|z)} \,dx\,dy\,dz\]
When \(X \perp Y | Z\), the CMI is equal to 0. Hence, CMI is a general measure of conditional dependence. The estimator for CMI proposed in [1] is a k-nearest-neighbor based estimator:
\[\widehat{I}(X;Y|Z) = \psi(k) + \frac{1}{T} \sum_{t=1}^T \left(\psi(k_{Z,t}) - \psi(k_{XZ,t}) - \psi(k_{YZ,t})\right)\]
where \(\psi\) is the digamma function (see scipy.special.digamma) and \(T\) is the number of samples. \(k\) determines the size of the hyper-cubes around each (high-dimensional) sample point, and \(k_{Z,t}, k_{XZ,t}, k_{YZ,t}\) are the numbers of neighbors of sample \(t\) in the respective subspaces. \(k\) can be viewed as a density smoothing parameter (although it is data-adaptive, unlike fixed-bandwidth estimators). For large \(k\), the underlying dependencies are more smoothed and the CMI estimate has a larger bias but a lower variance; the lower variance matters more for significance testing. Note that the estimated CMI values can be slightly negative even though CMI is a non-negative quantity.
The estimator implemented here assumes the data is continuous.
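To make the estimator formula concrete, here is a minimal sketch of a k-nearest-neighbor CMI estimate built on scipy. It is not the dodiscover implementation: the function name `knn_cmi`, the use of `scipy.spatial.cKDTree`, the max-norm distance, and the convention of counting each sample among its own neighbors are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma


def knn_cmi(x, y, z, k=5):
    """Estimate I(X;Y|Z) from 2-D arrays of shape (T, dim) (illustrative)."""
    xyz = np.hstack([x, y, z])
    # Max-norm distance to the k-th neighbor in the joint space; query k + 1
    # points because each point is returned as its own nearest neighbor.
    dist, _ = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)
    # Shrink the radius slightly so only points strictly inside are counted.
    eps = np.nextafter(dist[:, -1], 0)

    def count_neighbors(sub):
        # Number of points within eps of each sample in a subspace
        # (the sample itself is included, so every count is at least 1).
        tree = cKDTree(sub)
        return tree.query_ball_point(sub, r=eps, p=np.inf, return_length=True)

    k_z = count_neighbors(z)
    k_xz = count_neighbors(np.hstack([x, z]))
    k_yz = count_neighbors(np.hstack([y, z]))
    return digamma(k) + np.mean(digamma(k_z) - digamma(k_xz) - digamma(k_yz))


# Synthetic continuous data: Z drives X; Y is either driven by Z only
# (conditionally independent of X) or by X directly (conditionally dependent).
rng = np.random.default_rng(0)
T = 1000
z = rng.normal(size=(T, 1))
x = z + rng.normal(size=(T, 1))
y_indep = z + rng.normal(size=(T, 1))
y_dep = x + 0.1 * rng.normal(size=(T, 1))

cmi_indep = knn_cmi(x, y_indep, z, k=10)
cmi_dep = knn_cmi(x, y_dep, z, k=10)
print(f"independent given Z: {cmi_indep:.3f}, dependent given Z: {cmi_dep:.3f}")
```

In this sketch the estimate for the conditionally independent pair should be close to zero (and may be slightly negative), while the dependent pair should give a clearly positive value.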
References
Methods
- test(df, x_vars, y_vars[, z_covariates]): Abstract method for all conditional independence tests.
- test(df, x_vars, y_vars, z_covariates=None)
Abstract method for all conditional independence tests.
- Parameters:
- df (pd.DataFrame): The dataframe containing the dataset.
- x_vars (Set of column): A column in df.
- y_vars (Set of column): A column in df.
- z_covariates (Set, optional): A set of columns in df, by default None. If None, then the test should run a standard independence test.
- Returns: