dowhy.gcm.util package

Submodules

dowhy.gcm.util.catboost_encoder module

class dowhy.gcm.util.catboost_encoder.CatBoostEncoder(p: float = 1, alpha: Optional[float] = None)

Bases: object

Implements the method proposed in “CatBoost: gradient boosting with categorical features support”, Dorogush et al. (2018).

The CatBoost encoder is a target encoder for categorical features. This implementation follows Eq. (1) in https://arxiv.org/pdf/1810.11363.pdf.


Parameters:

p – The p parameter in the equation. This weights the impact of the given alpha.
alpha – The alpha parameter in the equation. If None is given, the global mean is used, as suggested in “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems”, Micci-Barreca (2001).

fit(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) → None

Fits the CatBoost encoder following Eq. (1) in https://arxiv.org/pdf/1810.11363.pdf.

Parameters:

X – Input categorical data.
Y – Target data (continuous or categorical).
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.

fit_transform(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) → ndarray

Parameters:

X – Input categorical data.
Y – Target data (continuous or categorical).
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.

Returns:

CatBoost encoded inputs based on the given Y.

transform(X: ndarray, Y: Optional[ndarray] = None, use_alpha_when_unique: bool = True) → ndarray

Applies the CatBoost encoder to the data.

Parameters:

X – Input categorical data.
Y – If target data is given, this data is used instead of the fitted data.
use_alpha_when_unique – If True, uses the alpha value when a category appears exactly once.

Returns:

CatBoost encoded inputs. If Y is given, each row is treated as if it had a time index, and only the previously observed rows are used to estimate its encoding. If Y is not given, the previously fitted average for each category is used, which can be seen as using the whole training data set as past observations.
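As a rough illustration of the "time index" idea described for transform, the following NumPy-only sketch encodes each row with a smoothed mean of the targets seen in earlier rows of the same category. The function name ordered_target_encode is hypothetical, and the formula (sum + p * alpha) / (count + p) is an assumed reading of Eq. (1) with p as the weight and alpha as the prior. It is not taken from the dowhy source and does not model use_alpha_when_unique.

```python
import numpy as np


def ordered_target_encode(x, y, p=1.0, alpha=None):
    """Hypothetical sketch of an ordered target encoding (not dowhy's code).

    Each row is encoded using only the targets of earlier rows that share
    its category, smoothed towards the prior `alpha` with weight `p`.
    """
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    if alpha is None:
        # Assumption: fall back to the global target mean as the prior.
        alpha = y.mean()

    running_sum = {}    # category -> sum of targets seen so far
    running_count = {}  # category -> number of rows seen so far
    encoded = np.empty(len(x), dtype=float)

    for i, (category, target) in enumerate(zip(x, y)):
        s = running_sum.get(category, 0.0)
        c = running_count.get(category, 0)
        encoded[i] = (s + p * alpha) / (c + p)
        # Only now does the current row become part of the "past".
        running_sum[category] = s + target
        running_count[category] = c + 1

    return encoded


example_encoding = ordered_target_encode(["a", "b", "a", "a"], [1, 0, 0, 1])
```

In this sketch the first occurrence of a category has no history, so its encoding equals the prior alpha. Later occurrences move towards the running category mean.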

dowhy.gcm.util.general module

Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.

dowhy.gcm.util.general.apply_catboost_encoding(X: ndarray, catboost_encoder_map: Dict[int, CatBoostEncoder], Y: Optional[ndarray] = None) → ndarray

dowhy.gcm.util.general.apply_one_hot_encoding(X: ndarray, one_hot_encoder_map: Dict[int, OneHotEncoder]) → ndarray

dowhy.gcm.util.general.auto_apply_encoders(X: ndarray, encoder_map: Dict[int, Union[OneHotEncoder, CatBoostEncoder]], Y: Optional[ndarray] = None) → ndarray

dowhy.gcm.util.general.auto_fit_encoders(X: ndarray, Y: Optional[ndarray] = None, catboost_threshold: int = 7) → Dict[int, Union[OneHotEncoder, CatBoostEncoder]]

dowhy.gcm.util.general.fit_catboost_encoders(X: ndarray, Y: ndarray) → Dict[int, CatBoostEncoder]

dowhy.gcm.util.general.fit_one_hot_encoders(X: ndarray) → Dict[int, OneHotEncoder]

Fits one-hot encoders to each categorical column in X. A column is treated as categorical only if it consists entirely of strings.

Parameters:

X – Input data matrix.

Returns:

Dictionary that maps a column index to a scikit-learn OneHotEncoder.
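To make the "column index to encoder" mapping concrete, here is a minimal sketch that fits one scikit-learn OneHotEncoder per all-string column. The helper name fit_string_column_encoders and the handle_unknown="ignore" setting are assumptions made for illustration. The actual dowhy function may configure its encoders differently.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def fit_string_column_encoders(X):
    """Hypothetical helper: one OneHotEncoder per column made only of strings."""
    encoders = {}
    for col in range(X.shape[1]):
        if all(isinstance(value, str) for value in X[:, col]):
            # Keep the column 2D, since scikit-learn encoders expect a matrix.
            encoders[col] = OneHotEncoder(handle_unknown="ignore").fit(X[:, [col]])
    return encoders


# dtype=object keeps the numeric column numeric instead of turning it into strings.
X = np.array([["a", 1.0], ["b", 2.0], ["a", 3.0]], dtype=object)
encoders = fit_string_column_encoders(X)
encoded_first_column = encoders[0].transform(X[:, [0]]).toarray()
```

Only column 0 receives an encoder here, because column 1 holds floats.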

dowhy.gcm.util.general.geometric_median(x: ndarray) → ndarray

dowhy.gcm.util.general.has_categorical(X: ndarray) → bool

Checks if any of the given columns are categorical, i.e. either a string or a boolean. If any of the columns is categorical, this method will return True. Alternatively, consider is_categorical for checking if all columns are categorical.

Note: A NumPy array with mixed data types might internally convert numeric columns to strings and vice versa. To ensure that the given data keeps its original data types, consider converting or initializing it with the dtype 'object'. For instance: np.array([[1, 'True', '0', 0.2], [3, 'False', '1', 2.3]], dtype=object)

Parameters:

X – Input array to check if any column is categorical.

Returns:

True if at least one column of the input is categorical, False otherwise.
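The dtype note can be checked directly in NumPy. The short sketch below builds the same rows with and without dtype=object and records the element types of the first row. It uses only NumPy and does not call any dowhy function.

```python
import numpy as np

rows = [[1, "True", "0", 0.2], [3, "False", "1", 2.3]]

# Without an explicit dtype, NumPy picks one common type: everything becomes a string.
coerced = np.array(rows)

# With dtype=object, every element keeps its original Python type.
preserved = np.array(rows, dtype=object)

coerced_is_all_str = all(isinstance(value, str) for value in coerced.ravel())
preserved_types = [type(value).__name__ for value in preserved[0]]
```

With the coerced array, a check for string columns would see every column as categorical, which is why the object dtype is recommended for mixed data.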

dowhy.gcm.util.general.is_categorical(X: ndarray) → bool

Checks if all of the given columns are categorical, i.e. either a string or a boolean. Only if all of the columns are categorical, this method will return True. Alternatively, consider has_categorical for checking if any of the columns is categorical.

Parameters:

X – Input array to check if all columns are categorical.

Returns:

True if all columns of the input are categorical, False otherwise.

dowhy.gcm.util.general.means_difference(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.general.set_random_seed(random_seed: int) → None

Sets the random seed in numpy and the random module.

Parameters:

random_seed – Random seed for the numpy and random modules.

Returns:

None
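A minimal sketch of what seeding both libraries amounts to, assuming the global NumPy seed (np.random.seed) is what is meant. The helper name seed_numpy_and_random is hypothetical and stands in for the documented function.

```python
import random

import numpy as np


def seed_numpy_and_random(seed: int) -> None:
    """Hypothetical stand-in: seed Python's random module and NumPy's global RNG."""
    random.seed(seed)
    np.random.seed(seed)


seed_numpy_and_random(0)
first_draw = (random.random(), np.random.rand())

seed_numpy_and_random(0)
second_draw = (random.random(), np.random.rand())
```

Re-seeding with the same value reproduces the same draws. Generators created with np.random.default_rng() keep their own state and are not affected by the global NumPy seed.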

dowhy.gcm.util.general.setdiff2d(ar1: ndarray, ar2: ndarray, assume_unique: bool = False) → ndarray

This method generalizes numpy’s setdiff1d to 2D, i.e. it compares vectors of arbitrary length. See https://numpy.org/doc/stable/reference/generated/numpy.setdiff1d.html for more details.
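The idea of a row-wise set difference can be sketched in plain NumPy. The helper row_setdiff below is hypothetical and assumes semantics mirroring setdiff1d: the result holds the unique rows of the first array that do not occur in the second, in sorted order. Whether setdiff2d orders or deduplicates its output in exactly this way is not stated in this documentation.

```python
import numpy as np


def row_setdiff(ar1, ar2):
    """Hypothetical sketch: unique rows of ar1 that do not appear in ar2."""
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    rows_to_exclude = {tuple(row) for row in ar2.tolist()}
    # np.unique with axis=0 deduplicates and lexicographically sorts the rows.
    kept = [row for row in np.unique(ar1, axis=0).tolist() if tuple(row) not in rows_to_exclude]
    # The reshape keeps the column count even when no row survives.
    return np.array(kept).reshape(-1, ar1.shape[1])


difference = row_setdiff([[1, 2], [3, 4], [1, 2], [5, 6]], [[3, 4]])
```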

dowhy.gcm.util.general.shape_into_2d(*args)

If necessary, shapes the numpy inputs into 2D matrices.

Example:

array([1, 2, 3]) -> array([[1], [2], [3]])
2 -> array([[2]])

Parameters:

args – The function expects numpy arrays as inputs and returns a reshaped (2D) version of them (if necessary).

Returns:

Reshaped versions of the input numpy arrays. For instance, given 1D inputs X, Y and Z, shape_into_2d(X, Y, Z) reshapes them into 2D and returns them. If an input is already 2D, it is returned unmodified.
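The reshaping rule from the example can be sketched for a single input with plain NumPy. The helper to_2d is hypothetical and only mirrors the documented behaviour (scalar to a 1x1 matrix, 1D vector to a column, 2D input unchanged). It does not handle multiple arguments the way shape_into_2d(*args) does.

```python
import numpy as np


def to_2d(value):
    """Hypothetical single-input sketch of the documented reshaping rule."""
    arr = np.asarray(value)
    if arr.ndim == 0:
        return arr.reshape(1, 1)   # scalar -> 1x1 matrix
    if arr.ndim == 1:
        return arr.reshape(-1, 1)  # vector -> single column
    return arr                     # 2D input is passed through as is


column = to_2d(np.array([1, 2, 3]))
scalar = to_2d(2)
matrix = to_2d(np.array([[1, 2], [3, 4]]))
```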

dowhy.gcm.util.general.variance_of_deviations(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.general.variance_of_matching_values(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.plotting module

dowhy.gcm.util.plotting.bar_plot(values: Dict[str, float], uncertainties: Optional[Dict[str, Tuple[float, float]]] = None, ylabel: str = '', filename: Optional[str] = None, display_plot: bool = True, figure_size: Optional[List[int]] = None, bar_width: float = 0.8, xticks: Optional[List[str]] = None, xticks_rotation: int = 90, sort_names: bool = True) → None

Deprecated, please use dowhy.utils.plotting.bar_plot() instead.

dowhy.gcm.util.plotting.plot(causal_graph: Graph, causal_strengths: Optional[Dict[Tuple[Any, Any], float]] = None, colors: Optional[Dict[Union[Any, Tuple[Any, Any]], str]] = None, filename: Optional[str] = None, display_plot: bool = True, figure_size: Optional[List[int]] = None, **kwargs) → None

Deprecated, please use dowhy.utils.plotting.plot() instead.

dowhy.gcm.util.plotting.plot_adjacency_matrix(adjacency_matrix: DataFrame, is_directed: bool, filename: Optional[str] = None, display_plot: bool = True) → None

Deprecated, please use dowhy.utils.plotting.plot_adjacency_matrix() instead.
