1.3.1. pywhy_stats.independence.power_divergence#
Independence test among categorical variables using power-divergence tests.
Works on categorical random variables. Based on the method parameter, one can compute a wide variety of different categorical hypothesis tests.
Categorical data is a type of data that can be divided into discrete groups. Compared to continuous data, there is no agreed-upon way to represent categorical data numerically. For example, we can represent the color of an object as “red”, “blue”, “green”, etc., but we can also represent it as 1, 2, 3 mapping to those colors, or even as [1.2, 2.2, 3.2] mapping to those colors.
If str type data is passed in, it is converted to int type using sklearn.preprocessing.LabelEncoder. All columns of the data passed in must be of the same type; otherwise it is impossible to infer what you want to do.
Encoding categorical data numerically is a common practice in machine learning and statistics. There are many strategies, and we do not implement most of them. For categorical encoding strategies, see https://github.com/scikit-learn-contrib/category_encoders.
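As a small illustration of this integer encoding (not part of the library itself), sklearn.preprocessing.LabelEncoder can be applied directly to a column of string labels; note that it assigns integer codes in alphabetical order of the labels:

>>> import numpy as np
>>> from sklearn.preprocessing import LabelEncoder
>>> colors = np.array(["red", "blue", "green", "blue", "red"])
>>> encoder = LabelEncoder()
>>> encoder.fit_transform(colors)  # "blue" -> 0, "green" -> 1, "red" -> 2
array([2, 0, 1, 0, 2])
>>> encoder.classes_
array(['blue', 'green', 'red'], dtype='<U5')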
1.3.1.1. Examples#
>>> import pywhy_stats as ps
>>> res = ps.categorical.ind([1, 2, 3], [4, 5, 6])
>>> print(res.pvalue)
1.0
Functions
- condind(X, Y, condition_on, ...): Perform an independence test using power divergence test.
- ind(X, Y, ...): Perform an independence test using power divergence test.
- condind(X, Y, condition_on, method='cressie-read', num_categories_allowed=10, on_error='raise')[source]#
Perform an independence test using power divergence test.
The null hypothesis for the test is that X is independent of Y given condition_on. Many frequency-comparison-based statistics (e.g., the chi-square test, the G-test) belong to the power divergence family and are special cases of this test. A usage sketch is given after the parameter descriptions below.
- Parameters:
- X : array_like of shape (n_samples,)
The first node variable.
- Y : array_like of shape (n_samples,)
The second node variable.
- condition_on : array_like of shape (n_samples, n_variables)
The conditioning set.
- method : float or str
The lambda parameter for the power_divergence statistic. Some values of method correspond to other well-known tests, as shown in the table below:
method                  lambda   Test
“pearson”               1        Chi-squared test
“log-likelihood”        0        G-test or log-likelihood
“freeman-tukey”         -1/2     Freeman-Tukey statistic
“mod-log-likelihood”    -1       Modified log-likelihood
“neyman”                -2       Neyman’s statistic
“cressie-read”          2/3      The value recommended in the paper [1]
- num_categories_allowed : int
The maximum number of categories allowed in the input variables. The default of 10 is chosen to error out on a large number of categories.
- on_error : str
What to do when there are not enough samples in the data, i.e., when a cell of the contingency table has 0 samples. If ‘raise’, then raise an error. If ‘warn’, then log a warning and skip the test. If ‘ignore’, then skip the test without a warning.
- Returns:
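A minimal usage sketch for condind on synthetic categorical data, assuming the module is importable under the path in the section title and that the returned result exposes a pvalue attribute as in the module-level example above. Here X and Y both depend on the conditioning variable Z but are independent of each other given Z, and every contingency-table cell is populated so the default on_error='raise' is not triggered:

>>> import numpy as np
>>> from pywhy_stats.independence import power_divergence
>>> rng = np.random.default_rng(0)
>>> n = 2000
>>> Z = rng.integers(0, 3, size=(n, 1))  # conditioning variable with 3 categories
>>> noise_x = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])
>>> noise_y = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])
>>> X = (Z[:, 0] + noise_x) % 3  # depends on Z
>>> Y = (Z[:, 0] + noise_y) % 3  # depends on Z, but independent of X given Z
>>> res = power_divergence.condind(X, Y, condition_on=Z, method="cressie-read")
>>> print(res.pvalue)  # expected to be large, since X is independent of Y given Z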
- ind(X, Y, method='cressie-read', num_categories_allowed=10, on_error='raise')[source]#
Perform an independence test using power divergence test.
The null hypothesis for the test is that X is independent of Y. Many frequency-comparison-based statistics (e.g., the chi-square test, the G-test) belong to the power divergence family and are special cases of this test. A usage sketch is given after the parameter descriptions below.
- Parameters:
- X : array_like of shape (n_samples,)
The first node variable.
- Y : array_like of shape (n_samples,)
The second node variable.
- method : float or str
The lambda parameter for the power_divergence statistic. Some values of method correspond to other well-known tests, as shown in the table below:

method                  lambda   Test
“pearson”               1        Chi-squared test
“log-likelihood”        0        G-test or log-likelihood
“freeman-tukey”         -1/2     Freeman-Tukey statistic
“mod-log-likelihood”    -1       Modified log-likelihood
“neyman”                -2       Neyman’s statistic
“cressie-read”          2/3      The value recommended in the paper [1]
- num_categories_allowed : int
The maximum number of categories allowed in the input variables. The default of 10 is chosen to error out on a large number of categories.
- on_error : str
What to do when there are not enough samples in the data, i.e., when a cell of the contingency table has 0 samples. If ‘raise’, then raise an error. If ‘warn’, then log a warning and skip the test. If ‘ignore’, then skip the test without a warning.
- Returns:
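A corresponding sketch for the marginal test, under the same assumptions about the import path and the pvalue attribute; it runs clearly dependent data through two values of method from the table above:

>>> import numpy as np
>>> from pywhy_stats.independence import power_divergence
>>> rng = np.random.default_rng(1)
>>> n = 1000
>>> X = rng.integers(0, 4, size=n)  # 4 categories, uniform
>>> Y = (X + rng.choice(4, size=n, p=[0.7, 0.1, 0.1, 0.1])) % 4  # Y strongly depends on X
>>> res_cr = power_divergence.ind(X, Y)  # default method='cressie-read' (lambda = 2/3)
>>> res_g = power_divergence.ind(X, Y, method="log-likelihood")  # G-test (lambda = 0)
>>> print(res_cr.pvalue, res_g.pvalue)  # both expected to be near 0, since Y depends on X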
References

[1] Cressie, N. and Read, T. R. C. (1984). “Multinomial goodness-of-fit tests.” Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440-464.