3.2.1. dodiscover.cd.KernelCDTest

class dodiscover.cd.KernelCDTest(distance_metric='euclidean', metric='rbf', l2=None, kwidth_x=None, kwidth_y=None, null_reps=1000, n_jobs=None, propensity_model=None, propensity_est=None, random_state=None)

Kernel conditional discrepancy test among conditional distributions.

Tests the equality of conditional distributions using a kernel approach outlined in [1].

Parameters:
distance_metric : str, optional

Distance metric to use, by default “euclidean”. For others, see DistanceMetric for supported list of metrics.

metric : str, optional

Kernel metric, by default “rbf”. For others, see pairwise for supported kernel metrics.

l2 : float | tuple of float, optional

The l2 regularization to apply for inverting the kernel matrices of ‘x’ and ‘y’ respectively, by default None. If a single number, then the same l2 regularization will be applied to inverting both matrices. If None, then a default regularization will be computed that chooses the value that minimizes the upper bound on the mean squared prediction error.

kwidth_x : float, optional

Kernel width for the X variables, by default None. If None, the width is estimated using the median l2 distance between the X variables.

kwidth_y : float, optional

Kernel width for the Y variables, by default None. If None, the width is estimated using the median l2 distance between the Y variables.

null_reps : int, optional

Number of times to sample the null distribution, by default 1000.

n_jobs : int, optional

Number of jobs to run computations in parallel, by default None.

propensity_model : callable, optional

The propensity model used to estimate propensity scores among the groups. If None (default), sklearn.linear_model.LogisticRegression is used. The propensity_model passed in must implement a predict_proba method. See https://scikit-learn.org/stable/glossary.html#term-predict_proba for more information.

propensity_est : array_like of shape (n_samples, n_groups), optional

The propensity estimates for each group. The number of columns must match the cardinality of the group_col in the data passed to the test function. If None (default), a propensity model is built using the argument in propensity_model.

random_state : int, optional

Random seed, by default None.
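
As a rough illustration of the default kernel-width estimate described for kwidth_x and kwidth_y, the sketch below computes the median pairwise Euclidean (l2) distance and plugs it into an RBF kernel. The helper names and the exp(-d^2 / (2 * width^2)) parametrization are assumptions made for this example; the library may handle the diagonal or convert the width to a kernel parameter differently.

```python
import numpy as np


def median_l2_distance(X):
    """Median Euclidean distance over all distinct pairs of rows in X.

    Hypothetical helper, not part of dodiscover.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise differences via broadcasting: shape (n, n, n_features).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep only the strict upper triangle so zero self-distances are excluded.
    upper = np.triu_indices(X.shape[0], k=1)
    return float(np.median(dists[upper]))


def rbf_kernel_matrix(X, width):
    """RBF kernel matrix with K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2)).

    The parametrization is an assumption for illustration only.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))


X = np.array([[0.0], [1.0], [3.0]])
width = median_l2_distance(X)
K = rbf_kernel_matrix(X, width)
```

A width computed this way could also be passed explicitly as kwidth_x or kwidth_y when a fixed, reproducible bandwidth is preferred over the internal estimate.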

Notes

Currently, only testing between two groups is supported. Therefore df[group_col] must contain only binary indicators and propensity_est must contain exactly two columns.
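
The two-group constraint can be checked before calling the test. The sketch below is one possible way to build a two-column propensity array from scores for group 1 and to validate it against a binary group indicator. Both helper functions are hypothetical, and the column order (column 0 for group 0, column 1 for group 1) is an assumption that mirrors the usual predict_proba convention.

```python
import numpy as np


def stack_binary_propensities(e_hat):
    """Build an (n_samples, 2) array from P(group == 1) scores.

    Hypothetical helper; column 0 is assumed to hold P(group == 0).
    """
    e_hat = np.asarray(e_hat, dtype=float).ravel()
    if np.any((e_hat < 0.0) | (e_hat > 1.0)):
        raise ValueError("propensity scores must lie in [0, 1]")
    return np.column_stack([1.0 - e_hat, e_hat])


def check_two_group_inputs(group, propensity_est):
    """Raise ValueError unless the inputs describe exactly two groups."""
    group = np.asarray(group)
    propensity_est = np.asarray(propensity_est, dtype=float)
    # The group indicator may only contain the labels 0 and 1.
    if not np.all(np.isin(np.unique(group), [0, 1])):
        raise ValueError("group indicator must contain only 0 and 1")
    # One row per sample and one column per group.
    if propensity_est.shape != (group.shape[0], 2):
        raise ValueError("propensity_est must have shape (n_samples, 2)")
    # Each row should be a probability distribution over the two groups.
    if not np.allclose(propensity_est.sum(axis=1), 1.0):
        raise ValueError("each row of propensity_est must sum to 1")


group = np.array([0, 1, 1, 0])
propensity_est = stack_binary_propensities([0.2, 0.7, 0.9, 0.4])
check_two_group_inputs(group, propensity_est)
```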

References

Methods

compute_null(e_hat, X, Y[, null_reps, ...])

Estimate null distribution using propensity weights.

test(df, group_col, y_vars, x_vars)

Compute conditional discrepancy test.

compute_null(e_hat, X, Y, null_reps=1000, random_state=None)

Estimate null distribution using propensity weights.

Parameters:
e_hat : array-like of shape (n_samples,)

The predicted propensity score for group_ind == 1.

X : array-like of shape (n_samples, n_features_x)

The X (covariates) array.

Y : array-like of shape (n_samples, n_features_y)

The Y (outcomes) array.

null_reps : int, optional

Number of times to sample null, by default 1000.

random_state : int, optional

Random generator, or random seed, by default None.

Returns:
null_dist : array-like of shape (null_reps,)

The null distribution of test statistics.
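
One way to picture sampling a null distribution from propensity weights is to redraw each sample's group label as a Bernoulli variable with success probability e_hat, once per repetition, and then recompute the test statistic on each relabeled dataset. The sketch below shows only the relabeling step. The function name is hypothetical, and whether the library resamples labels in exactly this way is an assumption.

```python
import numpy as np


def sample_null_group_labels(e_hat, null_reps=1000, random_state=None):
    """Draw null_reps sets of group labels with P(label == 1) = e_hat.

    Hypothetical helper illustrating propensity-based resampling.
    Returns an integer array of shape (null_reps, n_samples).
    """
    rng = np.random.default_rng(random_state)
    e_hat = np.asarray(e_hat, dtype=float).ravel()
    # e_hat broadcasts along the last axis, so column i uses probability e_hat[i].
    return rng.binomial(1, e_hat, size=(null_reps, e_hat.shape[0]))


e_hat = np.array([0.0, 1.0, 0.5])
labels = sample_null_group_labels(e_hat, null_reps=200, random_state=0)
```

Each row of labels would then stand in for the observed group indicator when recomputing the statistic, giving one null statistic per repetition.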

test(df, group_col, y_vars, x_vars)

Compute conditional discrepancy test.

Tests the null hypothesis: \(P(Y | X, group) = P(Y | X)\), where we are trying to determine if Y is (conditionally) independent from the group denoting the distribution, given X.

Another way of viewing this test is testing whether or not \(P_i(Y|X) = P_j(Y|X)\), where \(P_i(.)\) and \(P_j(.)\) denote distributions from different groups or environments denoted by the group_col.

Parameters:
df : pd.DataFrame

The dataframe containing the dataset.

y_vars : Set of column

The set of columns in df to use as the Y (outcome) variables.

group_col : column

A column in df that indicates which group (distribution) each sample belongs to, encoded as ‘0’ or ‘1’.

x_vars : Set of column, optional

The set of columns in df to use as the X (covariate) variables.

Returns:
Tuple[float, float]

Test statistic and pvalue.
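
For intuition about how a test statistic and a sampled null distribution combine into a p-value, the sketch below uses a common one-sided Monte Carlo estimate with a +1 correction. The function name is hypothetical, and the exact formula used by the library is an assumption and may differ.

```python
import numpy as np


def monte_carlo_pvalue(stat, null_dist):
    """Fraction of null statistics at least as large as stat, with a +1 correction.

    Hypothetical helper; the correction keeps the estimate strictly positive.
    """
    null_dist = np.asarray(null_dist, dtype=float).ravel()
    n_extreme = np.sum(null_dist >= stat)
    return float((1.0 + n_extreme) / (1.0 + null_dist.size))


pvalue = monte_carlo_pvalue(2.0, [0.0, 1.0, 2.0, 3.0])
```

With null_reps repetitions, the smallest p-value this estimate can produce is 1 / (null_reps + 1), which is one reason to raise null_reps when small significance levels matter.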