3.2.1. dodiscover.cd.KernelCDTest
- class dodiscover.cd.KernelCDTest(distance_metric='euclidean', metric='rbf', l2=None, kwidth_x=None, kwidth_y=None, null_reps=1000, n_jobs=None, propensity_model=None, propensity_est=None, random_state=None)
Kernel conditional discrepancy test among conditional distributions.
Tests the equality of conditional distributions using a kernel approach outlined in [1].
- Parameters:
- distance_metric : str, optional
Distance metric to use, by default “euclidean”. For others, see DistanceMetric for the list of supported metrics.
- metric : str, optional
Kernel metric, by default “rbf”. For others, see pairwise for supported kernel metrics.
- l2 : float | tuple of float, optional
The l2 regularization to apply when inverting the kernel matrices of ‘x’ and ‘y’ respectively, by default None. If a single number, the same l2 regularization is applied when inverting both matrices. If None, a default regularization is computed by choosing the value that minimizes the upper bound on the mean squared prediction error.
- kwidth_x : float, optional
Kernel width among X variables, by default None, in which case it is estimated using the median l2 distance between the X variables.
- kwidth_y : float, optional
Kernel width among Y variables, by default None, in which case it is estimated using the median l2 distance between the Y variables.
- null_reps : int, optional
Number of times to sample the null distribution, by default 1000.
- n_jobs : int, optional
Number of jobs to run computations in parallel, by default None.
- propensity_model : callable, optional
The propensity model used to estimate propensity scores among the groups. If None (default), sklearn.linear_model.LogisticRegression is used. The propensity_model passed in must implement a predict_proba method in order to be used. See https://scikit-learn.org/stable/glossary.html#term-predict_proba for more information.
- propensity_est : array_like of shape (n_samples, n_groups,), optional
The propensity estimates for each group. Must match the cardinality of the group_col in the data passed to the test function. If None (default), a propensity model is built using the propensity_model argument.
- random_state : int, optional
Random seed, by default None.
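The default for kwidth_x and kwidth_y is described as the median l2 distance between samples. A minimal sketch of that heuristic, using only the standard library, might look as follows. The helper name median_pairwise_distance is hypothetical and not part of dodiscover, and how the library turns this distance into a kernel bandwidth is not stated in this section.

```python
import math
import statistics
from itertools import combinations


def median_pairwise_distance(rows):
    """Median Euclidean (l2) distance over all distinct pairs of rows.

    Hypothetical helper for illustration; not part of dodiscover.
    """
    if len(rows) < 2:
        raise ValueError("need at least two samples to form a pair")
    distances = [math.dist(a, b) for a, b in combinations(rows, 2)]
    return statistics.median(distances)


# Three 2-D samples whose pairwise distances are 5, 10 and 5,
# so the median distance is 5.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
kwidth_x = median_pairwise_distance(X)
print(kwidth_x)
```

The sketch enumerates all pairs, which costs O(n^2) distance evaluations; for large samples a subsample of pairs is a common shortcut.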
Notes
Currently, only tests between two groups are supported. Therefore, df[group_col] must contain only binary indicators and propensity_est must contain only two columns.
References
Methods
compute_null(e_hat, X, Y[, null_reps, ...]): Estimate null distribution using propensity weights.
test(df, group_col, y_vars, x_vars): Compute conditional discrepancy test.
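The propensity_model parameter only requires an object that exposes predict_proba. As a rough illustration of that contract, the toy class below predicts the observed share of group 1 for every sample. The class name ConstantPropensityModel is hypothetical, and the assumption that the model is fitted through a scikit-learn style fit(X, y) call is mine, not something this page states.

```python
class ConstantPropensityModel:
    """Toy propensity model: predicts the observed share of group 1 for everyone.

    Hypothetical class for illustration. It follows the scikit-learn
    convention of fit(X, y) and predict_proba(X); the exact calls made by
    KernelCDTest are an assumption here.
    """

    def fit(self, X, y):
        # y holds the binary group indicators (0 or 1).
        labels = list(y)
        if not labels or any(label not in (0, 1) for label in labels):
            raise ValueError("y must be non-empty and contain only 0 or 1")
        self.p_group1_ = sum(labels) / len(labels)
        return self

    def predict_proba(self, X):
        # One row per sample, one column per group: [P(group=0), P(group=1)].
        p1 = self.p_group1_
        return [[1.0 - p1, p1] for _ in X]


X = [[0.1], [0.4], [0.7], [0.9]]
groups = [0, 1, 1, 1]
model = ConstantPropensityModel().fit(X, groups)
propensity_est = model.predict_proba(X)
print(propensity_est[0])  # [0.25, 0.75]
```

The result has one row per sample and two columns, which matches the two-group restriction in the Notes. A realistic model would condition on X and return a NumPy array rather than nested lists.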
- compute_null(e_hat, X, Y, null_reps=1000, random_state=None)
Estimate null distribution using propensity weights.
- Parameters:
- e_hat : array-like of shape (n_samples,)
The predicted propensity score for group_ind == 1.
- X : array-like of shape (n_samples, n_features_x)
The X (covariates) array.
- Y : array-like of shape (n_samples, n_features_y)
The Y (outcomes) array.
- null_reps : int, optional
Number of times to sample the null distribution, by default 1000.
- random_state : int, optional
Random generator, or random seed, by default None.
- Returns:
- null_dist : array-like of shape (n_samples,)
The null distribution of test statistics.
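One common way to build a null distribution from propensity scores is to redraw each sample's group label as a Bernoulli variable with probability e_hat and recompute the statistic on every draw. The sketch below illustrates that idea only. Whether compute_null resamples labels in exactly this way is an assumption, the function names are hypothetical, and the kernel statistic is replaced by a simple difference in group means to keep the example short.

```python
import random


def mean_difference(y, groups):
    """Toy statistic: absolute difference in mean outcome between groups."""
    y1 = [v for v, g in zip(y, groups) if g == 1]
    y0 = [v for v, g in zip(y, groups) if g == 0]
    if not y1 or not y0:
        return 0.0  # degenerate draw: one group is empty
    return abs(sum(y1) / len(y1) - sum(y0) / len(y0))


def sketch_null_distribution(e_hat, y, null_reps=1000, random_state=None):
    """Resample group labels from the propensity scores and recompute the statistic."""
    rng = random.Random(random_state)
    null_dist = []
    for _ in range(null_reps):
        # Each sample joins group 1 with probability e_hat[i].
        groups = [1 if rng.random() < p else 0 for p in e_hat]
        null_dist.append(mean_difference(y, groups))
    return null_dist


e_hat = [0.2, 0.5, 0.8, 0.5, 0.3, 0.7]
y = [1.0, 2.0, 3.0, 2.5, 1.5, 2.8]
observed_groups = [0, 0, 1, 1, 0, 1]

null_dist = sketch_null_distribution(e_hat, y, null_reps=200, random_state=0)
observed = mean_difference(y, observed_groups)
# Share of null draws at least as extreme as the observed statistic.
p_value = sum(stat >= observed for stat in null_dist) / len(null_dist)
print(len(null_dist), p_value)
```

Note that this sketch returns one statistic per repetition, and the p-value is the fraction of resampled statistics that reach the observed one.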
- test(df, group_col, y_vars, x_vars)
Compute conditional discrepancy test.
Tests the null hypothesis: \(P(Y | X, group) = P(Y | X)\), where we are trying to determine if Y is (conditionally) independent of the group denoting the distribution, given X.
Another way of viewing this test is testing whether or not \(P_i(Y|X) = P_j(Y|X)\), where \(P_i(.)\) and \(P_j(.)\) denote distributions from different groups or environments denoted by the group_col.
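To make the two readings of the hypothesis concrete, the sketch below simulates rows with columns x, y and group. With group_shift=0 the conditional distribution of y given x is the same in both groups, so the null holds; with a nonzero shift it differs. The function names, the column names and the linear data-generating model are all hypothetical choices for illustration.

```python
import random
import statistics


def simulate(n, group_shift, seed):
    """Rows with columns x, y and group; y depends on group only via group_shift."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(0.0, 1.0)
        y = 2.0 * x + group_shift * group + rng.gauss(0.0, 1.0)
        rows.append({"x": x, "y": y, "group": group})
    return rows


def residual_gap(rows):
    """Between-group difference in the mean of y - 2x (y with the effect of x removed)."""
    res1 = [r["y"] - 2.0 * r["x"] for r in rows if r["group"] == 1]
    res0 = [r["y"] - 2.0 * r["x"] for r in rows if r["group"] == 0]
    return statistics.fmean(res1) - statistics.fmean(res0)


rows_null = simulate(400, group_shift=0.0, seed=0)  # P_0(Y|X) = P_1(Y|X)
rows_alt = simulate(400, group_shift=3.0, seed=0)   # P_0(Y|X) != P_1(Y|X)

print(round(residual_gap(rows_null), 2), round(residual_gap(rows_alt), 2))
```

Rows in this form could be turned into a dataframe and passed to test with group_col="group", y_vars={"y"} and x_vars={"x"}; that call is not shown here because the return value is not documented in this section.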
- Parameters:
- df : pd.DataFrame
The dataframe containing the dataset.
- y_vars : Set of column
A column in df.
- group_col : column
A column in df that indicates which group (distribution) each sample belongs to, with a ‘0’ or ‘1’.
- x_vars : Set of column, optional
A column in df.
- Returns: