1.3.1. pywhy_stats.independence.power_divergence#

Independence test among categorical variables using power-divergence tests.

Works on categorical random variables. Depending on the method parameter, one can compute a wide variety of categorical hypothesis tests.

Categorical data is data that can be divided into discrete groups. Unlike continuous data, there is no agreed-upon way to represent categorical data numerically. For example, the color of an object can be represented as “red”, “blue”, “green”, etc., but also as 1, 2, 3, or even as [1.2, 2.2, 3.2], with each number mapped to a color.

If str-type data is passed in, it is converted to int type using sklearn.preprocessing.LabelEncoder. All columns of the data passed in must be of the same type; otherwise it is impossible to infer the intended encoding.

Encoding categorical data numerically is common practice in machine learning and statistics. There are many strategies, and most of them are not implemented here. For categorical encoding strategies, see https://github.com/scikit-learn-contrib/category_encoders.
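As a minimal sketch of the string-to-int conversion described above, sklearn.preprocessing.LabelEncoder sorts the distinct labels and maps them to 0, 1, 2, ...:

```python
# Sketch of the str-to-int conversion the library performs internally,
# using sklearn.preprocessing.LabelEncoder: distinct labels are sorted,
# then mapped to consecutive integers starting at 0.
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "red"]
encoded = LabelEncoder().fit_transform(colors)
print(list(encoded))  # [2, 0, 1, 2] because "blue" < "green" < "red"
```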

1.3.1.1. Examples#

>>> import pywhy_stats as ps
>>> res = ps.categorical.ind([1, 2, 3], [4, 5, 6])
>>> print(res.pvalue)
1.0

Functions

condind(X, Y, condition_on[, method, ...])

Perform an independence test using power divergence test.

ind(X, Y[, method, num_categories_allowed, ...])

Perform an independence test using power divergence test.

condind(X, Y, condition_on, method='cressie-read', num_categories_allowed=10, on_error='raise')[source]#

Perform an independence test using power divergence test.

The null hypothesis for the test is that X is independent of Y given condition_on. Many frequency-comparison statistics (e.g. the chi-squared test, the G-test) belong to the power-divergence family and are special cases of this test.

Parameters:
X : array_like of shape (n_samples,)

The first node variable.

Y : array_like of shape (n_samples,)

The second node variable.

condition_on : array_like of shape (n_samples, n_variables)

The conditioning set.

method : float or str

The lambda parameter for the power_divergence statistic. Some values of method correspond to other well-known tests:

“pearson” (lambda = 1): Chi-squared test
“log-likelihood” (lambda = 0): G-test or log-likelihood ratio test
“freeman-tukey” (lambda = -1/2): Freeman-Tukey statistic
“mod-log-likelihood” (lambda = -1): Modified log-likelihood
“neyman” (lambda = -2): Neyman’s statistic
“cressie-read” (lambda = 2/3): The value recommended in the paper [1]

num_categories_allowed : int

The maximum number of categories allowed in the input variables. The default of 10 is chosen to error out on a large number of categories.

on_error : str

What to do when there are not enough samples in the data, i.e. when a cell of the contingency table has 0 samples. If ‘raise’, raise an error. If ‘warn’, log a warning and skip the test. If ‘ignore’, skip the test silently.

Returns:
statistic : float

The test statistic.

pvalue : float

The p-value of the test.
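To make the null hypothesis concrete, the following is a minimal sketch (not pywhy_stats’ implementation) of the kind of computation a conditional test over categorical data performs: build an X-vs-Y contingency table within each stratum of the conditioning variable, sum the per-stratum chi-squared statistics and degrees of freedom, and compare against the chi-squared null distribution.

```python
# Hedged sketch of a stratified chi-squared test of X ⫫ Y | Z for
# categorical data; this illustrates the idea, not the library's code.
import numpy as np
from scipy.stats import chi2, chi2_contingency

rng = np.random.default_rng(0)
n = 500
z = rng.integers(0, 2, size=n)  # conditioning variable (2 categories)
x = rng.integers(0, 3, size=n)  # independent of y given z
y = rng.integers(0, 3, size=n)

stat, dof = 0.0, 0
for level in np.unique(z):
    mask = z == level
    # Contingency table of X vs Y within this stratum of Z.
    table = np.zeros((3, 3))
    for xi, yi in zip(x[mask], y[mask]):
        table[xi, yi] += 1
    s, _, d, _ = chi2_contingency(table, correction=False)
    stat += s
    dof += d
pvalue = chi2.sf(stat, dof)  # p-value for the null X ⫫ Y | Z
```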

ind(X, Y, method='cressie-read', num_categories_allowed=10, on_error='raise')[source]#

Perform an independence test using power divergence test.

The null hypothesis for the test is that X is independent of Y. Many frequency-comparison statistics (e.g. the chi-squared test, the G-test) belong to the power-divergence family and are special cases of this test.

Parameters:
X : array_like of shape (n_samples,)

The first node variable.

Y : array_like of shape (n_samples,)

The second node variable.

method : float or str

The lambda parameter for the power_divergence statistic. Some values of method correspond to other well-known tests:

“pearson” (lambda = 1): Chi-squared test
“log-likelihood” (lambda = 0): G-test or log-likelihood ratio test
“freeman-tukey” (lambda = -1/2): Freeman-Tukey statistic
“mod-log-likelihood” (lambda = -1): Modified log-likelihood
“neyman” (lambda = -2): Neyman’s statistic
“cressie-read” (lambda = 2/3): The value recommended in the paper [1]

num_categories_allowed : int

The maximum number of categories allowed in the input variables. The default of 10 is chosen to error out on a large number of categories.

on_error : str

What to do when there are not enough samples in the data, i.e. when a cell of the contingency table has 0 samples. If ‘raise’, raise an error. If ‘warn’, log a warning and skip the test. If ‘ignore’, skip the test silently.

Returns:
statistic : float

The test statistic.

pvalue : float

The p-value of the test.
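The named methods above map onto the lambda_ parameter of scipy.stats.power_divergence, which exposes the same statistic family. A small sketch showing that “pearson” (lambda = 1) reproduces the classic chi-squared goodness-of-fit statistic, while “cressie-read” uses lambda = 2/3:

```python
# Sketch relating the method names to scipy.stats.power_divergence's
# lambda_ parameter; the expected frequencies default to uniform.
from scipy.stats import chisquare, power_divergence

observed = [16, 18, 16, 14, 12, 12]  # observed category counts
stat_pearson, _ = power_divergence(observed, lambda_="pearson")
stat_chi2, _ = chisquare(observed)  # identical to lambda_="pearson"
stat_cr, _ = power_divergence(observed, lambda_="cressie-read")
```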

References