3.3.1. dodiscover.cd.BregmanCDTest

class dodiscover.cd.BregmanCDTest(metric='rbf', distance_metric='euclidean', kwidth=None, null_reps=1000, n_jobs=None, propensity_model=None, propensity_est=None, random_state=None)

Bregman divergence conditional discrepancy test.

Tests the equality of conditional distributions using the kernel approach to estimating Bregman divergences outlined in [1].

Parameters:

metric (str, optional): The kernel metric, by default 'rbf'.
distance_metric (str, optional): The distance metric, by default 'euclidean'.
kwidth (float, optional): The width of the kernel, by default None. If None, the width is estimated using the default procedure in dodiscover.ci.kernel_utils.compute_kernel().
null_reps (int, optional): Number of times to sample the null distribution, by default 1000.
n_jobs (int, optional): Number of CPUs to use, by default None.
propensity_model (callable, optional): The propensity model used to estimate propensity scores among the groups. If None (default), sklearn.linear_model.LogisticRegression is used. The propensity_model passed in must implement a predict_proba method. See https://scikit-learn.org/stable/glossary.html#term-predict_proba for more information.
propensity_est (array-like of shape (n_samples, n_groups), optional): The propensity estimates for each group. Must match the cardinality of the group_col in the data passed to the test function. If None (default), a propensity model is built using the propensity_model argument.
random_state (int, optional): Random seed, by default None.

Notes

Currently, only tests between two groups are supported. Therefore, df[group_col] must contain only binary indicators, and propensity_est must contain exactly two columns.
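The two-group restriction above can be checked before calling the test. The following is a minimal sketch of such a pre-check; the helper name check_two_group_inputs is hypothetical and is not part of dodiscover, and it accepts plain Python sequences rather than a DataFrame.

```python
def check_two_group_inputs(groups, propensity_est=None):
    """Hypothetical pre-check (not part of dodiscover).

    groups: sequence of group labels, one per sample.
    propensity_est: optional sequence of rows, one per sample.
    """
    labels = set(groups)
    # Only binary indicators are allowed in the group column.
    if not labels <= {0, 1}:
        bad = sorted(labels, key=repr)
        raise ValueError(f"group labels must be 0 or 1, got {bad}")
    if propensity_est is not None:
        # One row of propensity estimates per sample.
        if len(propensity_est) != len(groups):
            raise ValueError("propensity_est needs one row per sample")
        # Exactly two columns, one per group.
        for row in propensity_est:
            if len(row) != 2:
                raise ValueError("propensity_est must have exactly two columns")
    return True
```

For a DataFrame, the labels could be passed as df[group_col].tolist().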

Methods

compute_null(e_hat, X, Y[, null_reps, ...])	Estimate null distribution using propensity weights.
test(df, group_col, y_vars, x_vars)	Compute conditional discrepancy test.

compute_null(e_hat, X, Y, null_reps=1000, random_state=None)

Estimate null distribution using propensity weights.

Parameters:

e_hat (array-like of shape (n_samples,)): The predicted propensity score for group_ind == 1.
X (array-like of shape (n_samples, n_features_x)): The X (covariates) array.
Y (array-like of shape (n_samples, n_features_y)): The Y (outcomes) array.
null_reps (int, optional): Number of times to sample the null distribution, by default 1000.
random_state (int, optional): Random generator, or random seed, by default None.

Returns:

null_dist (array-like of shape (n_samples,)): The null distribution of test statistics.
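One common way to build a null distribution from propensity scores is to resample each sample's group label with probability e_hat and recompute the statistic on the relabeled data. The sketch below illustrates that idea only; it is not presented as the library's implementation. The function names are hypothetical, and the absolute difference in group means stands in for the kernel-based Bregman statistic.

```python
import random


def mean_difference(y, groups):
    """Placeholder statistic: |mean(y | group 1) - mean(y | group 0)|."""
    y1 = [v for v, g in zip(y, groups) if g == 1]
    y0 = [v for v, g in zip(y, groups) if g == 0]
    if not y1 or not y0:
        # One group is empty, so no difference can be computed.
        return 0.0
    return abs(sum(y1) / len(y1) - sum(y0) / len(y0))


def sketch_null_distribution(e_hat, y, statistic, null_reps=1000, random_state=None):
    """Hypothetical sketch: one statistic per resampled labeling."""
    rng = random.Random(random_state)
    null_dist = []
    for _ in range(null_reps):
        # Sample i is assigned to group 1 with probability e_hat[i].
        groups = [1 if rng.random() < p else 0 for p in e_hat]
        null_dist.append(statistic(y, groups))
    return null_dist
```

Because the labels are drawn from the propensity scores rather than shuffled uniformly, the resampled datasets keep the dependence between the group and X while breaking any dependence between the group and Y given X.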

test(df, group_col, y_vars, x_vars)

Compute conditional discrepancy test.

Tests the null hypothesis \(P(Y | X, group) = P(Y | X)\), that is, whether Y is conditionally independent of the group indicator (which denotes the distribution) given X.

Another way of viewing this test is testing whether or not \(P_i(Y|X) = P_j(Y|X)\), where \(P_i(.)\) and \(P_j(.)\) denote distributions from different groups or environments denoted by the group_col.
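That the two formulations agree follows from the law of total probability. Writing \(G\) for the binary group indicator (a symbol introduced here for brevity):

\[
P(Y \mid X) = \sum_{g \in \{0, 1\}} P(G = g \mid X) \, P(Y \mid X, G = g)
\]

If \(P(Y \mid X, G = 0) = P(Y \mid X, G = 1)\), then, because the weights \(P(G = g \mid X)\) sum to one, both conditionals equal \(P(Y \mid X)\). Conversely, if each \(P(Y \mid X, G = g)\) equals \(P(Y \mid X)\), the two conditionals equal each other.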

Parameters:

df (pd.DataFrame): The dataframe containing the dataset.
group_col (column): A column in df that indicates which group (distribution) each sample belongs to, with a 0 or 1.
y_vars (set of columns): The columns in df to use as the Y (outcome) variables.
x_vars (set of columns, optional): The columns in df to use as the X (covariate) variables.

Returns:

Tuple[float, float]: The test statistic and the p-value.
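A p-value of this kind is typically obtained by comparing the observed statistic against the sampled null distribution. The sketch below uses the standard add-one Monte Carlo estimate; the function name is hypothetical, and the exact formula used inside test is an assumption rather than something this page states.

```python
def permutation_pvalue(stat, null_dist):
    """Hypothetical helper: add-one Monte Carlo p-value.

    Counts the null statistics at least as large as the observed one.
    The +1 terms keep the estimate strictly above zero.
    """
    n_extreme = sum(1 for s in null_dist if s >= stat)
    return (1 + n_extreme) / (1 + len(null_dist))
```

With this estimate, the smallest attainable p-value is 1 / (1 + null_reps), so a larger null_reps gives finer resolution at the cost of more computation.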