dowhy.gcm.util package#
Submodules#
dowhy.gcm.util.catboost_encoder module#
- class dowhy.gcm.util.catboost_encoder.CatBoostEncoder(p: float = 1, alpha: float | None = None)[source]#
Bases: object
Implements the method proposed in “CatBoost: gradient boosting with categorical features support”, Dorogush et al. (2018).
The CatBoost encoder is a target encoder for categorical features. This implementation follows Eq. (1) in https://arxiv.org/pdf/1810.11363.pdf.
- Parameters:
p – The p parameter in the equation. This weights the impact of the given alpha.
alpha – The alpha parameter in the equation. If None is given, the global mean is used, as suggested in “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems”, Micci-Barreca (2001).
- fit(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) None [source]#
Fits the CatBoost encoder following Eq. (1) in https://arxiv.org/pdf/1810.11363.pdf.
- Parameters:
X – Input categorical data.
Y – Target data (continuous or categorical).
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.
- fit_transform(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) ndarray [source]#
- Parameters:
X – Input categorical data.
Y – Target data (continuous or categorical).
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.
- Returns:
CatBoost-encoded inputs based on the given Y.
- transform(X: ndarray, Y: ndarray | None = None, use_alpha_when_unique: bool = True) ndarray [source]#
Applies the CatBoost encoder to the data.
- Parameters:
X – Input categorical data.
Y – If target data is given, this data is used instead of the fitted data.
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.
- Returns:
CatBoost-encoded inputs. If Y is given, each row is treated as if it had a time index, and only the previously observed rows are used to estimate its encoding. If Y is not given, the previously fitted average for each category is used. This can be seen as treating the whole training data set as past observations.
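The “time index” idea described above can be illustrated with a minimal pure-Python sketch. It is not the CatBoostEncoder implementation: the function name ordered_target_encode and its arguments prior and prior_weight are hypothetical, and reading them as the roles of alpha and p is an assumption based only on the parameter descriptions above. The use_alpha_when_unique behavior is not modeled.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row from the targets of earlier rows in the same category.

    Hypothetical helper for illustration only; `prior` and `prior_weight`
    are assumed stand-ins for the roles of alpha and p.
    """
    sums = defaultdict(float)   # running sum of targets seen per category
    counts = defaultdict(int)   # running count of rows seen per category
    encoded = []
    for category, target in zip(categories, targets):
        # Only rows observed *before* the current one contribute.
        value = (sums[category] + prior_weight * prior) / (counts[category] + prior_weight)
        encoded.append(value)
        sums[category] += target
        counts[category] += 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1.0, 0.0, 0.0, 1.0, 1.0]
global_mean = sum(ys) / len(ys)  # used as the prior, as when alpha is None
codes = ordered_target_encode(cats, ys, prior=global_mean)
```

The first occurrence of each category falls back to the prior alone, because no earlier rows of that category exist yet.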
dowhy.gcm.util.general module#
- dowhy.gcm.util.general.apply_catboost_encoding(X: ndarray, catboost_encoder_map: Dict[int, CatBoostEncoder], Y: ndarray | None = None) ndarray [source]#
- dowhy.gcm.util.general.apply_one_hot_encoding(X: ndarray, one_hot_encoder_map: Dict[int, OneHotEncoder]) ndarray [source]#
- dowhy.gcm.util.general.auto_apply_encoders(X: ndarray, encoder_map: Dict[int, OneHotEncoder | CatBoostEncoder], Y: ndarray | None = None) ndarray [source]#
- dowhy.gcm.util.general.auto_fit_encoders(X: ndarray, Y: ndarray | None = None, catboost_threshold: int = 7) Dict[int, OneHotEncoder | CatBoostEncoder] [source]#
- dowhy.gcm.util.general.fit_catboost_encoders(X: ndarray, Y: ndarray) Dict[int, CatBoostEncoder] [source]#
- dowhy.gcm.util.general.fit_one_hot_encoders(X: ndarray) Dict[int, OneHotEncoder] [source]#
Fits one-hot encoders to each categorical column in X. A categorical input needs to be a string, i.e. a categorical column consists only of strings.
- Parameters:
X – Input data matrix.
- Returns:
Dictionary that maps a column index to a scikit-learn OneHotEncoder.
- dowhy.gcm.util.general.has_categorical(X: ndarray) bool [source]#
Checks if any of the given columns is categorical, i.e. either a string or a boolean. If at least one column is categorical, this method returns True. Alternatively, consider is_categorical for checking if all columns are categorical.
Note: A np matrix with mixed data types might internally convert numeric columns to strings and vice versa. To ensure that the given data keeps its original data type, consider converting/initializing it with the dtype ‘object’. For instance: np.array([[1, 'True', '0', 0.2], [3, 'False', '1', 2.3]], dtype=object)
- Parameters:
X – Input array to check if any column is categorical.
- Returns:
True if any column of the input is categorical, False otherwise.
- dowhy.gcm.util.general.is_categorical(X: ndarray) bool [source]#
Checks if all of the given columns are categorical, i.e. either a string or a boolean. This method returns True only if all of the columns are categorical. Alternatively, consider has_categorical for checking if any of the columns is categorical.
Note: A np matrix with mixed data types might internally convert numeric columns to strings and vice versa. To ensure that the given data keeps its original data type, consider converting/initializing it with the dtype ‘object’. For instance: np.array([[1, 'True', '0', 0.2], [3, 'False', '1', 2.3]], dtype=object)
- Parameters:
X – Input array to check if all columns are categorical.
- Returns:
True if all columns of the input are categorical, False otherwise.
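The dtype note for has_categorical and is_categorical can be seen directly in NumPy. The sketch below is only an illustration: column_is_categorical is a hypothetical helper that applies the “string or boolean” rule stated above, not the library's own check.

```python
import numpy as np

rows = [[1, "True", "0", 0.2], [3, "False", "1", 2.3]]

# Without an explicit dtype, NumPy promotes every entry to a string.
mixed = np.array(rows)

# With dtype=object, each entry keeps its original Python type.
preserved = np.array(rows, dtype=object)


def column_is_categorical(column):
    """Hypothetical check: every value in the column is a str or a bool."""
    return all(isinstance(value, (str, bool)) for value in column)


flags = [column_is_categorical(preserved[:, i]) for i in range(preserved.shape[1])]
any_categorical = any(flags)   # the question has_categorical answers
all_categorical = all(flags)   # the question is_categorical answers

# On the promoted array, the numeric columns now look like strings too.
flags_mixed = [column_is_categorical(mixed[:, i]) for i in range(mixed.shape[1])]
```

With dtype=object only the two string columns are flagged, whereas on the promoted array every column would be flagged.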
- dowhy.gcm.util.general.is_discrete(X: ndarray) bool [source]#
Checks if all values in the given array are discrete.
- Parameters:
X – Input array to check.
- Returns:
True if all values in the input are discrete, False otherwise.
- dowhy.gcm.util.general.means_difference(randomized_predictions: ndarray, baseline_values: ndarray) ndarray [source]#
- dowhy.gcm.util.general.set_random_seed(random_seed: int) None [source]#
Sets random seed in numpy and the random module.
- Parameters:
random_seed – Random seed for the numpy and random modules.
- Returns:
None
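A minimal sketch of what seeding both modules amounts to, assuming the usual random.seed and np.random.seed calls; seed_everything is a hypothetical name and this is not necessarily how set_random_seed is written.

```python
import random

import numpy as np


def seed_everything(seed):
    """Hypothetical helper: seed Python's random module and NumPy's global state."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything(0)
first = (random.random(), np.random.rand())

seed_everything(0)
second = (random.random(), np.random.rand())  # same draws as `first`
```

Note that np.random.seed only affects NumPy's legacy global generator; Generator objects created with np.random.default_rng carry their own state.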
- dowhy.gcm.util.general.setdiff2d(ar1: ndarray, ar2: ndarray, assume_unique: bool = False) ndarray [source]#
This method generalizes numpy’s setdiff1d to 2D, i.e., it compares vectors of arbitrary length. See https://numpy.org/doc/stable/reference/generated/numpy.setdiff1d.html for more details.
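As an illustration of a row-wise set difference, here is a small sketch under the assumption that rows are compared as whole vectors. The helper setdiff_rows is hypothetical, is not the library's implementation, and does not model the assume_unique flag.

```python
import numpy as np


def setdiff_rows(ar1, ar2):
    """Hypothetical helper: unique rows of ar1 that do not occur in ar2."""
    ar1 = np.unique(np.asarray(ar1), axis=0)  # sorted, de-duplicated rows
    seen = {tuple(row) for row in np.asarray(ar2).tolist()}
    kept = [row for row in ar1.tolist() if tuple(row) not in seen]
    # reshape keeps the result 2D even when no rows remain
    return np.array(kept, dtype=ar1.dtype).reshape(-1, ar1.shape[1])


a = np.array([[1, 2], [3, 4], [5, 6], [1, 2]])
b = np.array([[3, 4], [7, 8]])
diff = setdiff_rows(a, b)
```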
- dowhy.gcm.util.general.shape_into_2d(*args)[source]#
If necessary, shapes the numpy inputs into 2D matrices.
- Example:
array([1, 2, 3]) -> array([[1], [2], [3]])
2 -> array([[2]])
- Parameters:
args – The function expects numpy arrays as inputs and returns a reshaped (2D) version of them (if necessary).
- Returns:
Reshaped versions of the input numpy arrays. For instance, given 1D inputs X, Y and Z, shape_into_2d(X, Y, Z) reshapes them into 2D and returns them. If an input is already 2D, it is returned unmodified.
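The reshaping rule from the example above can be sketched with plain NumPy. The helper to_2d is hypothetical; in particular, returning a list and leaving inputs with more than two dimensions untouched are assumptions of this sketch, not documented behavior of shape_into_2d.

```python
import numpy as np


def to_2d(*arrays):
    """Hypothetical helper: turn scalars and 1D arrays into 2D column form."""
    reshaped = []
    for arr in arrays:
        arr = np.asarray(arr)
        if arr.ndim == 0:
            arr = arr.reshape(1, 1)    # scalar -> 1x1 matrix
        elif arr.ndim == 1:
            arr = arr.reshape(-1, 1)   # vector -> single column
        reshaped.append(arr)           # 2D inputs pass through unchanged
    return reshaped


col, scalar, matrix = to_2d(np.array([1, 2, 3]), 2, np.ones((2, 2)))
```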
dowhy.gcm.util.plotting module#
- dowhy.gcm.util.plotting.bar_plot(values: Dict[str, float], uncertainties: Dict[str, Tuple[float, float]] | None = None, ylabel: str = '', filename: str | None = None, display_plot: bool = True, figure_size: List[int] | None = None, bar_width: float = 0.8, xticks: List[str] | None = None, xticks_rotation: int = 90, sort_names: bool = True) None [source]#
Deprecated, please use dowhy.utils.plotting.bar_plot() instead.
- dowhy.gcm.util.plotting.plot(causal_graph: Graph, causal_strengths: Dict[Tuple[Any, Any], float] | None = None, colors: Dict[Any | Tuple[Any, Any], str] | None = None, filename: str | None = None, display_plot: bool = True, figure_size: List[int] | None = None, **kwargs) None [source]#
Deprecated, please use dowhy.utils.plotting.plot() instead.