3.2.1. dodiscover.cd.KernelCDTest

class dodiscover.cd.KernelCDTest(distance_metric='euclidean', metric='rbf', l2=None, kwidth_x=None, kwidth_y=None, null_reps=1000, n_jobs=None, propensity_model=None, propensity_est=None, random_state=None)

Kernel conditional discrepancy test among conditional distributions.

Tests the equality of conditional distributions using a kernel approach outlined in [1].

Parameters:

distance_metric (str, optional): Distance metric to use, by default “euclidean”. For others, see DistanceMetric for the list of supported metrics.
metric (str, optional): Kernel metric, by default “rbf”. For others, see pairwise for the supported kernel metrics.
l2 (float | tuple of float, optional): The l2 regularization applied when inverting the kernel matrices of ‘x’ and ‘y’ respectively, by default None. If a single number is given, the same l2 regularization is applied when inverting both matrices. If None, a default regularization is computed by choosing the value that minimizes the upper bound on the mean squared prediction error.
kwidth_x (float, optional): Kernel width among the X variables, by default None, in which case it is estimated using the median l2 distance between the X variables.
kwidth_y (float, optional): Kernel width among the Y variables, by default None, in which case it is estimated using the median l2 distance between the Y variables.
null_reps (int, optional): Number of times to sample the null distribution, by default 1000.
n_jobs (int, optional): Number of jobs to run computations in parallel, by default None.
propensity_model (callable, optional): The propensity model used to estimate propensity scores among the groups. If None (default), sklearn.linear_model.LogisticRegression is used. The propensity_model passed in must implement a predict_proba method. See https://scikit-learn.org/stable/glossary.html#term-predict_proba for more information.
propensity_est (array-like of shape (n_samples, n_groups), optional): The propensity estimates for each group. The number of groups must match the cardinality of the group_col in the data passed to the test function. If None (default), a propensity model is built using the argument in propensity_model.
random_state (int, optional): Random seed, by default None.
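The median-distance default described for kwidth_x and kwidth_y can be illustrated with a small standard-library sketch. It assumes the median is taken over Euclidean distances between all distinct pairs of samples, and it assumes the RBF parameterization exp(-d^2 / (2 * width^2)); neither detail is confirmed by this page, and the helper names `median_l2_distance` and `rbf_kernel_matrix` are hypothetical, not part of dodiscover.

```python
import math
from itertools import combinations
from statistics import median


def median_l2_distance(rows):
    """Median Euclidean distance over all distinct pairs of rows (assumed heuristic)."""
    return median(math.dist(a, b) for a, b in combinations(rows, 2))


def rbf_kernel_matrix(rows, width):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * width^2)) (assumed parameterization)."""
    return [
        [math.exp(-math.dist(a, b) ** 2 / (2.0 * width ** 2)) for b in rows]
        for a in rows
    ]


# Toy covariates: four samples with one feature each.
X = [(0.0,), (1.0,), (3.0,), (6.0,)]
kwidth_x = median_l2_distance(X)  # pairwise distances are 1, 2, 3, 3, 5, 6
K = rbf_kernel_matrix(X, kwidth_x)
```

Passing an explicit kwidth_x or kwidth_y to the constructor would bypass whatever estimate the class computes on its own.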

Notes

Currently, only testing between two groups is supported. Therefore, df[group_col] must contain only binary indicators, and propensity_est must contain exactly two columns.
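The two-group restriction can be checked up front before calling the test. The helper below is a hypothetical pre-flight check written for illustration, not a dodiscover function; it assumes the group indicators are the integers 0 and 1 and that propensity estimates are given as one two-element row per sample.

```python
def check_two_group_inputs(group_values, propensity_est=None):
    """Raise ValueError unless the inputs describe exactly two groups coded 0/1.

    Hypothetical helper for illustration; not part of dodiscover.
    """
    labels = set(group_values)
    if not labels <= {0, 1}:
        raise ValueError(f"group column must contain only 0 and 1, got {labels!r}")
    if propensity_est is not None:
        if len(propensity_est) != len(group_values):
            raise ValueError("propensity_est needs one row per sample")
        if any(len(row) != 2 for row in propensity_est):
            raise ValueError("propensity_est must have exactly two columns")


# Valid inputs pass silently.
groups = [0, 1, 1, 0]
check_two_group_inputs(groups, [[0.6, 0.4]] * len(groups))
```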

References

Methods

`compute_null`(e_hat, X, Y[, null_reps, ...])	Estimate null distribution using propensity weights.
`test`(df, group_col, y_vars, x_vars)	Compute conditional discrepancy test.

compute_null(e_hat, X, Y, null_reps=1000, random_state=None)

Estimate null distribution using propensity weights.

Parameters:

e_hat (array-like of shape (n_samples,)): The predicted propensity score for group_ind == 1.
X (array-like of shape (n_samples, n_features_x)): The X (covariates) array.
Y (array-like of shape (n_samples, n_features_y)): The Y (outcomes) array.
null_reps (int, optional): Number of times to sample the null, by default 1000.
random_state (int, optional): Random generator or random seed, by default None.

Returns:

null_dist (array-like of shape (null_reps,)): The null distribution of test statistics, one value per null sample.
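One common way to build a null distribution from propensity scores is to redraw each sample's group label as a Bernoulli draw with probability e_hat and recompute the statistic on the redrawn labels. The sketch below assumes that scheme; this page does not confirm it is what compute_null does. The names `sample_null` and `mean_difference` are hypothetical, and the toy statistic ignores X and stands in for the kernel statistic only to keep the example short.

```python
import random


def mean_difference(Y, groups):
    """Toy statistic: |mean(Y | group 1) - mean(Y | group 0)|; 0.0 if a group is empty."""
    y1 = [y for y, g in zip(Y, groups) if g == 1]
    y0 = [y for y, g in zip(Y, groups) if g == 0]
    if not y1 or not y0:
        return 0.0
    return abs(sum(y1) / len(y1) - sum(y0) / len(y0))


def sample_null(e_hat, Y, statistic, null_reps=1000, random_state=None):
    """Resample group labels from Bernoulli(e_hat) and recompute the statistic each time."""
    rng = random.Random(random_state)
    null_dist = []
    for _ in range(null_reps):
        # Each sample joins group 1 with its own propensity score.
        groups = [1 if rng.random() < p else 0 for p in e_hat]
        null_dist.append(statistic(Y, groups))
    return null_dist


e_hat = [0.2, 0.5, 0.8, 0.5, 0.3, 0.7]
Y = [1.0, 2.0, 0.5, 1.5, 2.5, 1.0]
null_dist = sample_null(e_hat, Y, mean_difference, null_reps=200, random_state=0)
```

Under this scheme the result has one entry per repetition, so its length is null_reps regardless of the number of samples.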

test(df, group_col, y_vars, x_vars)

Compute conditional discrepancy test.

Tests the null hypothesis \(P(Y | X, group) = P(Y | X)\), that is, whether Y is conditionally independent of the group indicator given X.

Equivalently, the test asks whether \(P_i(Y|X) = P_j(Y|X)\), where \(P_i(.)\) and \(P_j(.)\) denote the distributions of the different groups or environments indicated by the group_col.

Parameters:

df (pd.DataFrame): The dataframe containing the dataset.
group_col (column): A column in df that indicates, with a ‘0’ or ‘1’, which group (distribution) each sample belongs to.
y_vars (set of columns): The columns in df that make up the Y (outcome) variables.
x_vars (set of columns, optional): The columns in df that make up the X (covariate) variables.

Returns:

Tuple[float, float]: The test statistic and the p-value.
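Given an observed statistic and a sampled null distribution, a p-value is typically the fraction of null statistics at least as large as the observed one. The sketch below uses the add-one correction, which keeps the p-value strictly positive; whether KernelCDTest applies this correction is not stated on this page, so treat the formula as an assumption. The helper name `permutation_pvalue` is hypothetical.

```python
def permutation_pvalue(stat, null_dist):
    """Fraction of null statistics >= stat, with an add-one correction (assumed convention)."""
    exceed = sum(1 for s in null_dist if s >= stat)
    return (1 + exceed) / (1 + len(null_dist))


# Toy null distribution: two of the five values are at least 0.35.
null_dist = [0.1, 0.4, 0.2, 0.9, 0.3]
pvalue = permutation_pvalue(0.35, null_dist)
```

A small p-value suggests rejecting the null hypothesis that the conditional distributions of Y given X are equal across the two groups.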