dowhy.gcm.independence_test package

Submodules

dowhy.gcm.independence_test.generalised_cov_measure module

dowhy.gcm.independence_test.generalised_cov_measure.generalised_cov_based(X: ndarray, Y: ndarray, Z: ndarray | None = None, prediction_model_X: AssignmentQuality | Callable[[], PredictionModel] = AssignmentQuality.BETTER, prediction_model_Y: AssignmentQuality | Callable[[], PredictionModel] = AssignmentQuality.BETTER)

(Conditional) independence test based on the Generalised Covariance Measure.

Note:
  • Currently, only univariate and continuous X and Y are supported.

  • Residuals are based on the training data.

  • The relationships need to be non-deterministic, i.e., the residuals cannot be constant!

For more details, see R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48(3), 2020.

Parameters:
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • prediction_model_X – Either a model class that will be used as prediction model for regressing X on Z (e.g., a linear regressor) or an AssignmentQuality for automatically selecting a model.

  • prediction_model_Y – Either a model class that will be used as prediction model for regressing Y on Z (e.g., a linear regressor) or an AssignmentQuality for automatically selecting a model.

Returns:

The p-value for the null hypothesis that X and Y are independent (given Z).
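The idea behind the Generalised Covariance Measure can be sketched with plain NumPy. The snippet below is only an illustration under stated assumptions, not the dowhy implementation: the helper name gcm_p_value is hypothetical, and ordinary least squares stands in for the prediction models that regress X on Z and Y on Z. It forms the residual products, normalises their mean, and reads a two-sided p-value from the standard normal distribution.

```python
import math

import numpy as np


def gcm_p_value(x, y, z=None):
    """Sketch of a GCM-style test; OLS regressions are an assumption."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.shape[0]

    if z is None:
        # Without Z, the "regression" reduces to subtracting the mean.
        res_x, res_y = x - x.mean(), y - y.mean()
    else:
        z = np.asarray(z, dtype=float).reshape(n, -1)
        design = np.column_stack([np.ones(n), z])  # intercept + Z
        res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]

    products = res_x * res_y
    # Normalised mean of the residual products. The division fails for
    # constant residuals, which is why deterministic relationships are excluded.
    statistic = math.sqrt(n) * products.mean() / products.std()
    # Two-sided p-value under a standard normal null distribution.
    return math.erfc(abs(statistic) / math.sqrt(2))


rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + rng.normal(size=2000)
y_indep = z + rng.normal(size=2000)      # independent of X given Z
y_dep = x + 0.5 * rng.normal(size=2000)  # still depends on X given Z

p_indep = gcm_p_value(x, y_indep, z)
p_dep = gcm_p_value(x, y_dep, z)
print(p_indep, p_dep)
```

In this toy setup the second p-value should be very close to zero, because Y is built directly from X.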

dowhy.gcm.independence_test.kernel module

dowhy.gcm.independence_test.kernel.approx_kernel_based(X: ndarray, Y: ndarray, Z: ndarray | None = None, num_random_features_X: int = 50, num_random_features_Y: int = 50, num_random_features_Z: int = 50, num_permutations: int = 100, approx_kernel: Callable[[ndarray], ndarray] = <function approximate_rbf_kernel_features>, scale_data: bool = False, use_bootstrap: bool = True, bootstrap_num_runs: int = 10, bootstrap_num_samples: int = 1000, bootstrap_n_jobs: int | None = None, p_value_adjust_func: Callable[[ndarray | List[float]], float] = <function merge_p_values_average>) -> float

Implementation of the Randomized Conditional Independence Test (RCIT). The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed:

  • If Z is given: RCIT is used as the conditional independence test.

  • If Z is not given: RIT is used as the pairwise independence test.

Note:
  • The data can be multivariate, i.e. the given input matrices can have multiple columns.

  • Categorical data need to be represented as strings.

  • It is possible to apply a different kernel to each column in the matrices. For instance, an RBF kernel for the first dimension in X and a delta kernel for the second.

Based on the work:

Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference 7.1 (2019).

Parameters:
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • num_random_features_X – Number of features sampled from the approximated kernel map for X.

  • num_random_features_Y – Number of features sampled from the approximated kernel map for Y.

  • num_random_features_Z – Number of features sampled from the approximated kernel map for Z.

  • num_permutations – Number of permutations for estimating the test statistic.

  • approx_kernel – The approximated kernel map. The expected input is an n x d numpy array and the output is expected to be an n x k numpy array with k << d. By default, the Nystroem method with an RBF kernel is used.

  • scale_data – If set to True, the data will be standardized. If set to False, the data is taken as it is. Standardizing the data helps in identifying weak dependencies. If one is only interested in stronger ones, consider setting this to False.

  • use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.

  • bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).

  • bootstrap_num_samples – Maximum number of used samples per bootstrap run.

  • bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

Returns:

The p-value for the null hypothesis that X and Y are independent (given Z).
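To make the roles of the random features and the permutations concrete, here is a minimal pairwise sketch. It is not RIT/RCIT as described by Strobl et al. and not the dowhy implementation: it swaps in random Fourier features instead of the Nystroem map, covers only the case without Z, and all function names are hypothetical. The statistic is the scaled squared Frobenius norm of the cross-covariance between the two feature maps, and the p-value comes from shuffling the rows of one of them.

```python
import numpy as np


def random_fourier_features(data, num_features, rng):
    """Random Fourier features approximating an RBF kernel (unit bandwidth)."""
    data = np.asarray(data, dtype=float).reshape(len(data), -1)
    data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardise columns
    weights = rng.normal(size=(data.shape[1], num_features))
    offsets = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(data @ weights + offsets)


def feature_covariance_statistic(fx, fy):
    """n times the squared Frobenius norm of the feature cross-covariance."""
    n = fx.shape[0]
    cross_cov = (fx - fx.mean(axis=0)).T @ (fy - fy.mean(axis=0)) / n
    return n * np.sum(cross_cov**2)


def permutation_independence_test(x, y, num_features=20, num_permutations=100, seed=0):
    """Pairwise test: permutation p-value for the feature-covariance statistic."""
    rng = np.random.default_rng(seed)
    fx = random_fourier_features(x, num_features, rng)
    fy = random_fourier_features(y, num_features, rng)
    observed = feature_covariance_statistic(fx, fy)
    # Shuffling the rows of fy breaks any dependence between X and Y.
    null_stats = np.array([
        feature_covariance_statistic(fx, fy[rng.permutation(len(fy))])
        for _ in range(num_permutations)
    ])
    return (1 + np.sum(null_stats >= observed)) / (1 + num_permutations)


rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x**2 + 0.1 * rng.normal(size=500)  # non-linear dependence on X

p_dependent = permutation_independence_test(x, y)
print(p_dependent)
```

With 100 permutations the smallest attainable p-value is 1/101, so more permutations give a finer resolution at a higher cost.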

dowhy.gcm.independence_test.kernel.kernel_based(X: ndarray, Y: ndarray, Z: ndarray | None = None, use_bootstrap: bool = False, bootstrap_num_runs: int = 10, max_num_samples_run: int = 2000, bootstrap_n_jobs: int | None = None, p_value_adjust_func: Callable[[ndarray | List[float]], float] = <function merge_p_values_average>, **kwargs) -> float

Prepares the data and runs a kernel (conditional) independence test. The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed.

Here, we utilize the implementations of the cmu-phil/causal-learn package.

  • If Z is given: KCI from cmu-phil/causal-learn is used as the conditional independence test.

  • If Z is not given: KCI from cmu-phil/causal-learn is used as the pairwise independence test.

Note:
  • The data can be multivariate, i.e. the given input matrices can have multiple columns.

  • Categorical data need to be represented as strings.

Based on the work:
  • K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.

  • A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.

For more information about configuring the kernel independence test, see:
  • cmu-phil/causal-learn (if Z is not given)

  • cmu-phil/causal-learn (if Z is given)

Parameters:
  • X – Data matrix for observations from X.

  • Y – Data matrix for observations from Y.

  • Z – Optional data matrix for observations from Z. This is the conditional variable.

  • use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.

  • bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).

  • max_num_samples_run – Maximum number of samples used in an evaluation. If use_bootstrap is True, then different samples but at most max_num_samples_run are used.

  • bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

Returns:

The p-value for the null hypothesis that X and Y are independent (given Z).
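The interplay of use_bootstrap, bootstrap_num_runs and p_value_adjust_func can be sketched as follows. Everything here is hypothetical and only illustrates the contract "several p-values in, one p-value out": the two mergers are examples (the default merge_p_values_average in dowhy may work differently), and a simple Fisher z-test for linear correlation stands in for the kernel test.

```python
import math

import numpy as np


def merge_by_average(p_values):
    """Hypothetical merger: plain mean of the per-run p-values."""
    return float(np.mean(p_values))


def merge_by_bonferroni(p_values):
    """Hypothetical merger: Bonferroni-corrected minimum, capped at 1."""
    p_values = np.asarray(p_values, dtype=float)
    return float(min(1.0, p_values.min() * p_values.size))


def fisher_z_p_value(a, b):
    """Stand-in test: two-sided Fisher z-test for zero linear correlation."""
    r = np.corrcoef(a, b)[0, 1]
    z = np.arctanh(r) * math.sqrt(len(a) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def bootstrap_p_value(test, x, y, num_runs, num_samples, merge, seed=0):
    """Runs `test` on random row subsets and merges the resulting p-values."""
    rng = np.random.default_rng(seed)
    n = len(x)
    p_values = []
    for _ in range(num_runs):
        idx = rng.choice(n, size=min(num_samples, n), replace=False)
        p_values.append(test(x[idx], y[idx]))
    return merge(p_values)


rng = np.random.default_rng(0)
x = rng.normal(size=3000)
y = 0.3 * x + rng.normal(size=3000)

merged_avg = bootstrap_p_value(fisher_z_p_value, x, y, 5, 500, merge_by_average)
merged_bonf = bootstrap_p_value(fisher_z_p_value, x, y, 5, 500, merge_by_bonferroni)
print(merged_avg, merged_bonf)
```

Averaging is lenient towards a single small p-value, while the Bonferroni-style merger reacts to the smallest one; which behaviour is appropriate depends on how conservative the test should be.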

dowhy.gcm.independence_test.kernel_operation module

dowhy.gcm.independence_test.kernel_operation.apply_delta_kernel(X: ndarray) -> ndarray

Applies the delta kernel, i.e. the kernel value is 1 if two entries are equal and 0 otherwise.

Parameters:

X – Input data.

Returns:

The outcome of the delta-kernel, a binary distance matrix.
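As a rough illustration of what a delta kernel computes, the following NumPy sketch compares every entry of a single categorical column with every other entry. The helper name delta_kernel is hypothetical and the snippet is not the dowhy implementation.

```python
import numpy as np


def delta_kernel(values):
    """Binary matrix with 1.0 where two entries are equal and 0.0 otherwise."""
    values = np.asarray(values).ravel()
    # Broadcasting compares each entry with every other entry.
    return (values[:, None] == values[None, :]).astype(float)


labels = np.array(["a", "b", "a"])
kernel = delta_kernel(labels)
print(kernel)
# [[1. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 1.]]
```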

dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel(X: ndarray, precision: float | None = None) -> ndarray

Estimates the RBF (Gaussian) kernel for the given input data.

Parameters:
  • X – Input data.

  • precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.

Returns:

The outcome of applying an RBF (Gaussian) kernel to the data.
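A minimal NumPy sketch of an RBF kernel with an optional precision is given below. Two things are assumptions rather than facts about dowhy: the parameterisation exp(-precision * squared distance), and the median heuristic used when no precision is passed. The helper name rbf_kernel is hypothetical.

```python
import numpy as np


def rbf_kernel(data, precision=None):
    """RBF kernel matrix exp(-precision * ||x_i - x_j||^2)."""
    data = np.asarray(data, dtype=float).reshape(len(data), -1)
    diffs = data[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs**2, axis=2)
    if precision is None:
        # Assumption: median heuristic over the off-diagonal squared distances.
        off_diagonal = sq_dists[~np.eye(len(data), dtype=bool)]
        precision = 1.0 / np.median(off_diagonal)
    return np.exp(-precision * sq_dists)


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 2))
kernel_matrix = rbf_kernel(points)
print(kernel_matrix.shape)
```

The resulting matrix is symmetric with ones on the diagonal; a larger precision makes the kernel decay faster with distance.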

dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel_with_adaptive_precision(X: ndarray) -> ndarray

Estimates the RBF (Gaussian) kernel for the given input data. Here, each column is scaled by an individual precision parameter which is automatically inferred from the data.

Parameters:

X – Input data.

Returns:

The outcome of applying an RBF (Gaussian) kernel to the data.

dowhy.gcm.independence_test.kernel_operation.approximate_delta_kernel_features(X: ndarray, num_random_components: int) -> ndarray

Applies the Nystroem method to create an NxD (D << N) approximated delta kernel map using a subset of the data, where N is the number of samples in X and D the number of components. The delta kernel gives 1 if two entries are equal and 0 otherwise.

Parameters:
  • X – Input data.

  • num_random_components – Number of components D for the approximated kernel map.

Returns:

An NxD approximated delta kernel map, where N is the number of samples in X and D the number of components.

dowhy.gcm.independence_test.kernel_operation.approximate_rbf_kernel_features(X: ndarray, num_random_components: int, precision: float | None = None) -> ndarray

Applies the Nystroem method to create an NxD (D << N) approximated RBF kernel map using a subset of the data, where N is the number of samples in X and D the number of components.

Parameters:
  • X – Input data.

  • num_random_components – Number of components D for the approximated kernel map.

  • precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.

Returns:

An NxD approximated RBF kernel map, where N is the number of samples in X and D the number of components.
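The Nystroem idea can be sketched in a few lines of NumPy: pick D landmark rows, evaluate the kernel between all rows and the landmarks, and whiten with the inverse square root of the landmark kernel matrix. This is an illustrative sketch with hypothetical helper names and a fixed precision, not the dowhy or scikit-learn implementation.

```python
import numpy as np


def rbf(a, b, precision):
    """RBF kernel between the rows of a and the rows of b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-precision * sq_dists)


def nystroem_features(data, num_components, precision=1.0, seed=0):
    """NxD feature map whose inner products approximate the RBF kernel."""
    data = np.asarray(data, dtype=float).reshape(len(data), -1)
    rng = np.random.default_rng(seed)
    # Random subset of the rows serves as landmark points.
    landmarks = data[rng.choice(len(data), size=num_components, replace=False)]
    k_mm = rbf(landmarks, landmarks, precision)  # D x D
    k_nm = rbf(data, landmarks, precision)       # N x D
    # Inverse square root of k_mm via its eigendecomposition,
    # dropping numerically zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(k_mm)
    keep = eigvals > 1e-10
    inv_sqrt = (eigvecs[:, keep] / np.sqrt(eigvals[keep])) @ eigvecs[:, keep].T
    return k_nm @ inv_sqrt


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
features = nystroem_features(data, num_components=20)

approx_kernel = features @ features.T
exact_kernel = rbf(data, data, 1.0)
print(features.shape, np.abs(approx_kernel - exact_kernel).max())
```

The printed maximum error shrinks as the number of components grows; with all rows used as landmarks the kernel matrix is recovered up to numerical precision.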

dowhy.gcm.independence_test.regression module

Regression based (conditional) independence test. Testing independence via regression, i.e. if a variable has information about another variable, then they are dependent.

dowhy.gcm.independence_test.regression.regression_based(X: ndarray, Y: ndarray, Z: ndarray | None = None, max_num_components_all_inputs: int = 40, k_folds: int = 3, p_value_adjust_func: Callable[[ndarray | List[float]], float] = <function merge_p_values_average>, max_samples_per_fold: int = -1, n_jobs: int | None = None) -> float

The main idea is that if X and Y are dependent, then X should help in predicting Y. If there is no dependency, then X should not help. When Z is given, the idea remains the same, but here X and Y are conditionally independent given Z if X does not help in predicting Y when Z is known. That is, X has no additional information about Y given Z. In the pairwise case (Z is not given), the performance (in terms of squared error) of predicting Y based on X is compared with that of predicting Y by returning its mean (the best estimator without any inputs). Note that categorical inputs are transformed via encoders.

Here, we use the sklearn.kernel_approximation.Nystroem approach to approximate a kernel map of the inputs that serves as new input features. These new features make it possible to model complex non-linear relationships. In the case of categorical data, we first apply an encoding and then map it into the kernel feature space. Afterwards, we use linear regression as a prediction model based on the non-linear input features. The idea is then to apply an F-test to see whether the additional input features significantly help in predicting the target.

This test is motivated by Granger causality, the approx_kernel_based test and the following paper:

K. Chalupka, P. Perona, F. Eberhardt. Fast Conditional Independence Test for Vector Variables with Large Sample Sizes. arXiv:1804.02747, 2018.

Parameters:
  • X – Input data for X.

  • Y – Input data for Y.

  • Z – Input data for Z. The set of variables to (optionally) condition on.

  • max_num_components_all_inputs – Maximum number of kernel features when combining X and Z. If Z is not given, it will be replaced with an empty array. If Z is given, half of the number is used to generate features for Z. Note that the actual number of components is 1/10 of the number of samples, but at most max_num_components_all_inputs.

  • num_target_components_factor – The factor indicates how many components are used for the target variable. That is, the number of components is num_target_components_factor times the dimension of the target.

  • k_folds – Number of folds for training and test set. This equals the number of estimated p-values, which get adjusted by the p_value_adjust_func.

  • p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.

  • max_samples_per_fold – Maximum number of samples used per fold for training and testing. If -1, it uses all data.

  • n_jobs – Number of parallel jobs for the evaluation of the folds.

Returns:

The p-value for the null hypothesis that X and Y are independent given Z. If Z is not given, then for the hypothesis that X and Y are independent.
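The comparison of a model with and without X can be sketched as a classical nested-model F-test. The snippet below is a simplified illustration, not the dowhy implementation: it uses the raw columns instead of Nystroem features, evaluates in-sample instead of over k folds, and relies on scipy.stats.f for the F distribution. The helper name nested_f_test is hypothetical.

```python
import numpy as np
from scipy import stats


def nested_f_test(features_restricted, features_full, target):
    """F-test: do the extra columns in features_full reduce the squared error?"""
    n = len(target)

    def rss_and_rank(features):
        design = np.column_stack([np.ones(n), features])  # add an intercept
        coef, _, rank, _ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        return float(residuals @ residuals), rank

    rss_r, rank_r = rss_and_rank(features_restricted)
    rss_f, rank_f = rss_and_rank(features_full)
    df_num, df_den = rank_f - rank_r, n - rank_f
    f_stat = ((rss_r - rss_f) / df_num) / (rss_f / df_den)
    return float(stats.f.sf(f_stat, df_num, df_den))


rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y_indep = z + rng.normal(size=n)          # X adds nothing once Z is known
y_dep = z + 0.5 * x + rng.normal(size=n)  # X still helps given Z

# Conditional case: restricted model uses Z, full model uses Z and X.
p_indep = nested_f_test(z[:, None], np.column_stack([z, x]), y_indep)
p_dep = nested_f_test(z[:, None], np.column_stack([z, x]), y_dep)
# Pairwise case: the restricted model is the mean of Y (intercept only).
p_pairwise = nested_f_test(np.empty((n, 0)), x[:, None], y_dep)
print(p_indep, p_dep, p_pairwise)
```

In the pairwise call the restricted model has no inputs, which matches the description above of comparing against predicting Y by its mean.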

Module contents

dowhy.gcm.independence_test.independence_test(X, Y, conditioned_on=None, method='kernel', **kwargs)

Performs a (conditional) independence test. Four methods for (conditional) independence tests are supported at the moment:

  • kernel: Kernel-based (conditional) independence test.

      K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.

      A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.

    Here, we utilize the implementations of the cmu-phil/causal-learn package.

  • approx_kernel: Approximate kernel-based (conditional) independence test.

      E. V. Strobl, K. Zhang, S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 2019.

  • regression: Regression-based (conditional) independence test using an F-test. See regression_based() for more details.

  • gcm: (Conditional) independence test based on the Generalised Covariance Measure. See generalised_cov_based() for more details.

        R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48(3), 2020.

Parameters:
  • X – Observations of X.

  • Y – Observations of Y.

  • conditioned_on – Observations of the conditioning variable if a conditional independence test should be performed. By default (None), an unconditional independence test is carried out.

  • method – Method for the (conditional) independence test. The choices are:

      • kernel (default): kernel_based() (conditional) independence test.

      • approx_kernel: approx_kernel_based() (conditional) independence test.

      • regression: regression_based() (conditional) independence test.

      • gcm: generalised_cov_based() (conditional) independence test.

    For more information about these methods, see above.

Returns:

The p-value of the (conditional) independence test. (Conditional) independence is the null hypothesis.