dowhy.gcm.util package

Submodules

dowhy.gcm.util.catboost_encoder module

class dowhy.gcm.util.catboost_encoder.CatBoostEncoder(p: float = 1, alpha: Optional[float] = None)

Bases: object

Implements the method proposed in “CatBoost: gradient boosting with categorical features support”, Dorogush et al. (2018).

The Catboost encoder is a target encoder for categorical features. In this implementation we follow Eq. (1) in https://arxiv.org/pdf/1810.11363.pdf.

Parameters:
  • p – The p parameter in the equation. This weights the impact of the given alpha.

  • alpha – Alpha parameter in the equation. If None is given, the global mean is used, as suggested in “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems”, Micci-Barreca (2001).

fit(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) → None

Fits the Catboost encoder following https://arxiv.org/pdf/1810.11363.pdf Eq. (1).

Parameters:
  • X – Input categorical data.

  • Y – Target data (continuous or categorical).

  • use_alpha_when_unique – If True, uses the alpha value when a category only appears exactly once.

fit_transform(X: ndarray, Y: ndarray, use_alpha_when_unique: bool = True) → ndarray

Parameters:
  • X – Input categorical data.

  • Y – Target data (continuous or categorical).

  • use_alpha_when_unique – If True, uses the alpha value when a category only appears exactly once.

Returns:

Catboost encoded inputs based on the given Y.

transform(X: ndarray, Y: Optional[ndarray] = None, use_alpha_when_unique: bool = True) → ndarray

Applies the Catboost encoder to the data.

Parameters:
  • X – Input categorical data.

  • Y – If target data is given, this data is used instead of the fitted data.

  • use_alpha_when_unique – If True, uses the alpha value when a category only appears exactly once.

Returns:

Catboost encoded inputs. If Y is given, each row is treated as if it had a time index, and only the previously observed rows are used to estimate its encoding. If Y is not given, the previously fitted average for each category is used. This can be seen as using the whole training data set as past observations.
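
The "time index" idea can be illustrated with a small standalone sketch. It does not call dowhy; the helper name ordered_target_encode is hypothetical, and it assumes the encoded value has the smoothed form (sum of earlier targets + p * alpha) / (count of earlier rows + p), with p as the weight and alpha as the prior, as the parameter descriptions above suggest. The use_alpha_when_unique option is not modelled here.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, p=1.0, alpha=None):
    """Hypothetical helper: encode each row from the rows that precede it.

    Assumption: encoded = (sum of earlier targets + p * alpha) / (count + p).
    """
    if alpha is None:
        # Fall back to the global mean of the targets as the prior.
        alpha = sum(targets) / len(targets)

    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running number of rows per category
    encoded = []
    for category, target in zip(categories, targets):
        # Only statistics from earlier rows are used for the current row.
        encoded.append((sums[category] + p * alpha) / (counts[category] + p))
        sums[category] += target
        counts[category] += 1
    return encoded


encoded = ordered_target_encode(
    ["a", "b", "a", "a"], [1.0, 0.0, 3.0, 2.0], p=1.0, alpha=1.5
)
```

The first occurrence of a category has no history, so under this assumed formula its encoding equals the prior alpha. Later occurrences move toward the running mean of that category's targets.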

dowhy.gcm.util.general module

Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.

dowhy.gcm.util.general.apply_catboost_encoding(X: ndarray, catboost_encoder_map: Dict[int, CatBoostEncoder], Y: Optional[ndarray] = None) → ndarray

dowhy.gcm.util.general.apply_one_hot_encoding(X: ndarray, one_hot_encoder_map: Dict[int, OneHotEncoder]) → ndarray

dowhy.gcm.util.general.auto_apply_encoders(X: ndarray, encoder_map: Dict[int, Union[OneHotEncoder, CatBoostEncoder]], Y: Optional[ndarray] = None) → ndarray

dowhy.gcm.util.general.auto_fit_encoders(X: ndarray, Y: Optional[ndarray] = None, catboost_threshold: int = 7) → Dict[int, Union[OneHotEncoder, CatBoostEncoder]]

dowhy.gcm.util.general.fit_catboost_encoders(X: ndarray, Y: ndarray) → Dict[int, CatBoostEncoder]

dowhy.gcm.util.general.fit_one_hot_encoders(X: ndarray) → Dict[int, OneHotEncoder]

Fits one-hot encoders to each categorical column in X. A categorical input needs to be a string, i.e. a categorical column consists only of strings.

Parameters:

X – Input data matrix.

Returns:

Dictionary that maps a column index to a scikit-learn OneHotEncoder.

dowhy.gcm.util.general.geometric_median(x: ndarray) → ndarray

dowhy.gcm.util.general.has_categorical(X: ndarray) → bool

Checks if any of the given columns are categorical, i.e. either a string or a boolean. If any of the columns is categorical, this method will return True. Alternatively, consider is_categorical for checking if all columns are categorical.

Note: A NumPy array with mixed data types might internally convert numeric columns to strings and vice versa. To ensure that the given data keeps its original data types, consider converting/initializing it with the dtype 'object'. For instance: np.array([[1, 'True', '0', 0.2], [3, 'False', '1', 2.3]], dtype=object)

Parameters:

X – Input array to check if any column is categorical.

Returns:

True if at least one column of the input is categorical, False otherwise.

dowhy.gcm.util.general.is_categorical(X: ndarray) → bool

Checks if all of the given columns are categorical, i.e. either a string or a boolean. This method returns True only if all of the columns are categorical. Alternatively, consider has_categorical for checking if any of the columns is categorical.

Note: A NumPy array with mixed data types might internally convert numeric columns to strings and vice versa. To ensure that the given data keeps its original data types, consider converting/initializing it with the dtype 'object'. For instance: np.array([[1, 'True', '0', 0.2], [3, 'False', '1', 2.3]], dtype=object)

Parameters:

X – Input array to check if all columns are categorical.

Returns:

True if all columns of the input are categorical, False otherwise.
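
The dtype note for has_categorical and is_categorical can be made concrete with a short NumPy-only sketch. It does not call dowhy; column_is_categorical is a hypothetical stand-in that treats a column as categorical when every entry is a string or a boolean, which is an assumption based on the descriptions above and not the library's actual implementation.

```python
import numpy as np

mixed = [[1, "True", "0", 0.2], [3, "False", "1", 2.3]]

# Without an explicit dtype, NumPy promotes every entry to a string.
coerced = np.array(mixed)

# With dtype=object, each entry keeps its original Python type.
kept = np.array(mixed, dtype=object)


def column_is_categorical(column):
    """Hypothetical check: every entry is a string or a boolean."""
    return all(isinstance(value, (str, bool, np.bool_)) for value in column)


flags = [column_is_categorical(kept[:, i]) for i in range(kept.shape[1])]
any_categorical = any(flags)   # the "has" style question
all_categorical = all(flags)   # the "is" style question

# On the coerced array every column looks categorical, which is misleading.
coerced_flags = [
    column_is_categorical(coerced[:, i]) for i in range(coerced.shape[1])
]
```

With dtype=object, only the two string columns are flagged, so the "any" question is answered with True and the "all" question with False. On the coerced array all four columns are flagged, which is why the note recommends dtype=object.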

dowhy.gcm.util.general.means_difference(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.general.set_random_seed(random_seed: int) → None

Sets random seed in numpy and the random module.

Parameters:

random_seed – Random seed for the numpy and random modules.

Returns:

None
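
As a rough illustration of what seeding both modules achieves, the sketch below uses a hypothetical helper seed_both. It assumes the seeding goes through the global random.seed and np.random.seed functions, which the description above does not confirm. Generators created with np.random.default_rng are independent of np.random.seed and would need their own seed.

```python
import random

import numpy as np


def seed_both(seed):
    """Hypothetical helper: seed Python's random module and NumPy's global RNG."""
    random.seed(seed)
    np.random.seed(seed)


seed_both(0)
first = (random.random(), np.random.rand(3))

# Re-seeding with the same value reproduces the same draws.
seed_both(0)
second = (random.random(), np.random.rand(3))
```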

dowhy.gcm.util.general.setdiff2d(ar1: ndarray, ar2: ndarray, assume_unique: bool = False) → ndarray

This method generalizes numpy’s setdiff1d to 2D, i.e., it compares vectors of arbitrary length. See https://numpy.org/doc/stable/reference/generated/numpy.setdiff1d.html for more details.
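
The idea of a row-wise set difference can be sketched without dowhy as follows. The helper rows_not_in is hypothetical and only illustrates the concept: it keeps the rows of the first array that do not occur in the second. The ordering and duplicate handling of the library function may differ from this sketch, and assume_unique is not modelled.

```python
import numpy as np


def rows_not_in(ar1, ar2):
    """Hypothetical helper: rows of ar1 that do not appear as rows of ar2."""
    seen = {tuple(row) for row in ar2}
    kept = [row for row in ar1 if tuple(row) not in seen]
    # The reshape keeps the result 2D even when no rows remain.
    return np.array(kept).reshape(-1, ar1.shape[1])


ar1 = np.array([[1, 2], [3, 4], [5, 6]])
ar2 = np.array([[3, 4], [7, 8]])
difference = rows_not_in(ar1, ar2)
empty = rows_not_in(ar2, ar2)
```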

dowhy.gcm.util.general.shape_into_2d(*args)

If necessary, shapes the numpy inputs into 2D matrices.

Example:

array([1, 2, 3]) -> array([[1], [2], [3]])
2 -> array([[2]])

Parameters:

args – The function expects numpy arrays as inputs and returns a reshaped (2D) version of them (if necessary).

Returns:

Reshaped versions of the input numpy arrays. For instance, given 1D inputs X, Y and Z, shape_into_2d(X, Y, Z) reshapes them into 2D and returns them. An input that is already 2D is returned unmodified.
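
The reshaping rule described above can be reproduced in plain NumPy. The helper to_2d below is a hypothetical stand-in, not the library function. It always returns a list, which is an assumption about the return form, and it only handles scalars, 1D and 2D inputs.

```python
import numpy as np


def to_2d(*arrays):
    """Hypothetical helper: turn scalars and 1D arrays into column matrices."""
    shaped = []
    for array in arrays:
        array = np.asarray(array)
        if array.ndim < 2:
            # A scalar becomes shape (1, 1); a 1D array of length n becomes (n, 1).
            array = array.reshape(-1, 1)
        shaped.append(array)
    return shaped


X, scalar, already_2d = to_2d(np.array([1, 2, 3]), 2, np.zeros((2, 2)))
```

Here X has shape (3, 1), the scalar becomes a 1x1 matrix, and the 2x2 input passes through with its shape unchanged.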

dowhy.gcm.util.general.variance_of_deviations(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.general.variance_of_matching_values(randomized_predictions: ndarray, baseline_values: ndarray) → ndarray

dowhy.gcm.util.plotting module

dowhy.gcm.util.plotting.bar_plot(values: Dict[str, float], uncertainties: Optional[Dict[str, Tuple[float, float]]] = None, ylabel: str = '', filename: Optional[str] = None, display_plot: bool = True, figure_size: Optional[List[int]] = None, bar_width: float = 0.8, xticks: Optional[List[str]] = None, xticks_rotation: int = 90, sort_names: bool = True) → None

Deprecated, please use dowhy.utils.plotting.bar_plot() instead.

dowhy.gcm.util.plotting.plot(causal_graph: Graph, causal_strengths: Optional[Dict[Tuple[Any, Any], float]] = None, colors: Optional[Dict[Union[Any, Tuple[Any, Any]], str]] = None, filename: Optional[str] = None, display_plot: bool = True, figure_size: Optional[List[int]] = None, **kwargs) → None

Deprecated, please use dowhy.utils.plotting.plot() instead.

dowhy.gcm.util.plotting.plot_adjacency_matrix(adjacency_matrix: DataFrame, is_directed: bool, filename: Optional[str] = None, display_plot: bool = True) → None

Deprecated, please use dowhy.utils.plotting.plot_adjacency_matrix() instead.

Module contents