{ "cells": [ { "cell_type": "markdown", "id": "b075c6f9-2474-449f-8191-a45d77b6d3f8", "metadata": {}, "source": [ "# Decomposing the Gender Wage Gap\n", "\n", "In this notebook, we investigate how much of the gender wage gap in real census data can be attributed to differences in education or occupation between women and men. We use the multiply-robust causal change attribution method from the following paper:\n", ">Quintas-Martínez, V., Bahadori, M. T., Santiago, E., Mu, J. and Heckerman, D. \"Multiply-Robust Causal Change Attribution,\" *Proceedings of the 41 st International Conference on Machine Learning*, Vienna, Austria. PMLR 235, 2024." ] }, { "cell_type": "markdown", "id": "ad74db4b-2c96-4d97-86e6-ef0a5dc6ff69", "metadata": {}, "source": [ "### Read and prepare data\n", "\n", "We will be working with data from the Current Population Survey (CPS) 2015. After applying the same sample restrictions as [Chernozhukov et al. (2018)](https://arxiv.org/abs/1512.05635), the resulting sample contains 18,137 male and 14,382 female employed individuals." ] }, { "cell_type": "code", "execution_count": null, "id": "6e926b16-37e2-4921-9afb-1c99db9c9feb", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read and prepare data:\n", "df = pd.read_csv('./cps2015.csv')\n", "\n", "# LightGBM works best with integer encoding for categorical variables:\n", "educ_int = {'lhs' : 0, 'hsg' : 1, 'sc' : 2, 'cg' : 3, 'ad' : 4}\n", "df['education'] = pd.from_dummies(df[['lhs', 'hsg', 'sc', 'cg', 'ad']])\n", "df['education'] = df['education'].replace(educ_int)\n", "\n", "df = df[['female', 'education', 'occ2', 'wage', 'weight']]\n", "df.columns = ['female', 'education', 'occupation', 'wage', 'weight']\n", "df[['education', 'occupation']] = df[['education', 'occupation']].astype('category')\n", "\n", "# We want to explain the change Male -> Female:\n", "data_old, data_new = df.loc[df.female == 0], df.loc[df.female == 1]" ] }, { "cell_type": "markdown", "id": "8c4f1dd7-0f59-4a78-ab30-6e8ab80d11f7", "metadata": {}, "source": [ "### Causal Model\n", "\n", "To use our causal change attribution method, we need a causal model linking the outcome (wage) and explanatory variables. (In practice, the method only requires knowing a *causal ordering*.) We will work with the following DAG:\n", "\n", "\n", "\n", "The DAG above implies the following causal Markov factorization of the distribution of the data:\n", "$$P(\\mathtt{wage} \\mid \\mathtt{occup}, \\mathtt{educ}) \\times P(\\mathtt{occup} \\mid \\mathtt{educ}) \\times P(\\mathtt{educ})$$\n", "Each of the components of this factorization (the distribution of each node in the DAG given its direct causes) is called a *causal mechanism*. Differences in the marginal distribution of the wage between men and women could be due to differences in each causal mechanism. 
Our goal is to disentangle the contribution of each mechanism to the total change.\n", "\n", "In the code below, we define a ```dowhy.gcm``` causal model:" ] }, { "cell_type": "code", "execution_count": null, "id": "2ca5daf9-be47-4c12-bd02-bb5b01014e0c", "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "import dowhy.gcm as gcm\n", "\n", "dag = nx.DiGraph([('education', 'occupation'), ('occupation', 'wage'), ('education', 'wage')])\n", "causal_model = gcm.ProbabilisticCausalModel(dag)" ] }, { "cell_type": "markdown", "id": "10562621-f29c-463d-97b3-8f073a53be86", "metadata": {}, "source": [ "### Implementation with ```dowhy.gcm.distribution_change_robust```\n", "\n", "First, we show how to compute a causal change attribution measure (Shapley values) using the ```dowhy.gcm.distribution_change_robust``` function.\n", "\n", "The multiply-robust causal change attribution method is based on a combination of *regression* and *re-weighting* approaches. In the regression approach, we learn the dependence between a node and its parents in one sample, and then use the data from the other sample to shift the distribution of that node. In the re-weighting approach, we average the data, giving more weight to those observations that closely resemble the target distribution.\n", "\n", "By default, ```dowhy.gcm.distribution_change_robust``` uses linear and logistic regression to learn the regression function and the weights. Here, since our dataset is quite large, we will use the more flexible algorithms ```HistGradientBoostingRegressor``` and ```HistGradientBoostingClassifier``` instead.\n", "\n", "We also use ```IsotonicRegression``` on a held-out calibration sample to calibrate the probabilities that make up the weights for the re-weighting approach. This is optional, but it has been shown to improve the performance of the method in simulations.\n", "\n", "Finally, since our dataset is large, we will use sample splitting (rather than cross-fitting). That is, we will randomly split our data into a training set, where we learn the regression and weights, and an evaluation set, where we use the regression and weights to compute the final attribution measures."
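, "\n", "\n", "To build some intuition for the re-weighting step, the sketch below illustrates the basic density-ratio idea: a classifier that predicts which sample an observation belongs to can be turned into weights that make one sample resemble the other. This is only a simplified illustration (the function name is made up, and ```X_old``` / ```X_new``` stand for numeric feature arrays from the two samples); it is not the code that ```distribution_change_robust``` runs internally, which constructs mechanism-specific weights and calibrates them as described above.\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", "\n", "def illustrative_density_ratio_weights(X_old, X_new):\n", "    # Stack both samples and label which sample each row comes from.\n", "    X = np.vstack([X_old, X_new])\n", "    t = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_new))])\n", "    clf = HistGradientBoostingClassifier(random_state=0).fit(X, t)\n", "    # P(new sample | X) evaluated on the old sample, clipped away from 0 and 1.\n", "    p_new = np.clip(clf.predict_proba(X_old)[:, 1], 1e-6, 1 - 1e-6)\n", "    # The odds approximate the density ratio P_new(X) / P_old(X) up to a constant;\n", "    # rescaling by the sample sizes removes that constant.\n", "    return (p_new / (1 - p_new)) * len(X_old) / len(X_new)\n", "```"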
] }, { "cell_type": "code", "execution_count": null, "id": "39dcb294-25df-4358-81cb-3ee5a41e0419", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier\n", "from sklearn.isotonic import IsotonicRegression\n", "from dowhy.gcm.ml.classification import SklearnClassificationModelWeighted\n", "from dowhy.gcm.ml.regression import SklearnRegressionModelWeighted\n", "from dowhy.gcm.util.general import auto_apply_encoders, auto_fit_encoders, shape_into_2d\n", "\n", "def make_custom_regressor():\n", "    return SklearnRegressionModelWeighted(HistGradientBoostingRegressor(random_state = 0))\n", "\n", "def make_custom_classifier():\n", "    return SklearnClassificationModelWeighted(HistGradientBoostingClassifier(random_state = 0))\n", "\n", "def make_custom_calibrator():\n", "    return SklearnRegressionModelWeighted(IsotonicRegression(out_of_bounds = 'clip'))\n", "\n", "gcm.distribution_change_robust(\n", "    causal_model, data_old, data_new, 'wage', sample_weight = 'weight',\n", "    xfit = False, calib_size = 0.2,\n", "    regressor = make_custom_regressor,\n", "    classifier = make_custom_classifier,\n", "    calibrator = make_custom_calibrator)" ] }, { "cell_type": "markdown", "id": "34e901d2-34aa-4cf0-a426-5e41eebf9d40", "metadata": {}, "source": [ "The function returns a dictionary, where each value is the Shapley Value causal attribution measure for the causal mechanism corresponding to the key. (See the research paper for a formal definition of Shapley Values.) We give a more in-depth interpretation of the results in the final section of this notebook." ] }, { "cell_type": "markdown", "id": "47a109db-74a7-4826-9a80-1afcafb073ee", "metadata": {}, "source": [ "### Manual Implementation (Advanced)\n", "\n", "Second, we show how to implement the method more directly with the class ```dowhy.gcm.distribution_change_robust.ThetaC```, which allows for more advanced capabilities, including computing standard errors via the Gaussian multiplier bootstrap (see below)."
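, "\n", "\n", "As a quick reminder of the notation (see the paper for the formal definitions, which this summary only sketches): each causal mechanism is either kept at its *old* (male) version or switched to its *new* (female) version, and for a set $C$ of switched mechanisms, $\\theta^C$ denotes the mean wage under the resulting hybrid distribution. In the code below, these sets are labelled by binary strings in the causal order $(\\mathtt{educ}, \\mathtt{occup}, \\mathtt{wage})$, so that, for example, ```'100'``` corresponds to switching only the education mechanism. The Shapley value of mechanism $k$ averages its marginal contributions over all sets $C$ of the remaining mechanisms:\n", "$$\\mathrm{SV}_k = \\sum_{C \\subseteq \\{1,2,3\\} \\setminus \\{k\\}} \\frac{1}{3\\binom{2}{|C|}} \\left(\\theta^{C \\cup \\{k\\}} - \\theta^{C}\\right),$$\n", "which is the combination that the ```compute_attr_measure``` helper below evaluates from the estimated scores."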
] }, { "cell_type": "code", "execution_count": null, "id": "f7febdc3-8c1e-46cf-816b-4a62251d641b", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold, train_test_split\n", "\n", "# Split data into train and test set:\n", "X = df[['education', 'occupation']].values\n", "y = df['wage'].values\n", "T = df['female'].values\n", "w = df['weight'].values\n", "\n", "# To get the same train-test split:\n", "kf = StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)\n", "train_index, test_index = next(kf.split(X, T))\n", "\n", "X_train, X_eval, y_train, y_eval, T_train, T_eval = X[train_index], X[test_index], y[train_index], y[test_index], T[train_index], T[test_index]\n", "w_train, w_eval = w[train_index], w[test_index]\n", "\n", "X_calib, X_train, _, y_train, T_calib, T_train, w_calib, w_train = train_test_split(\n", "    X_train, y_train, T_train, w_train, train_size = 0.2, stratify = T_train, random_state = 0)" ] }, { "cell_type": "code", "execution_count": null, "id": "5f9e1bad-ae6e-432d-bbb8-1f65ab7bbab6", "metadata": { "tags": [] }, "outputs": [], "source": [ "import itertools\n", "from dowhy.gcm.distribution_change_robust import ThetaC\n", "import numpy as np\n", "from math import comb\n", "\n", "# All combinations of 0s and 1s, needed for Shapley Values:\n", "all_combos = [list(i) for i in itertools.product([0, 1], repeat=3)]\n", "all_combos_minus1 = [list(i) for i in itertools.product([0, 1], repeat=2)]\n", "\n", "# Dictionary to store the multiply-robust scores, will be used later for bootstrap:\n", "scores = {}\n", "\n", "# Here we compute the theta^c parameters that make up the Shapley Values (see paper):\n", "for C in all_combos:\n", "    scores[''.join(str(x) for x in C)] = ThetaC(C).est_scores(\n", "        X_eval,\n", "        y_eval,\n", "        T_eval,\n", "        X_train,\n", "        y_train,\n", "        T_train,\n", "        w_eval=w_eval,\n", "        w_train=w_train,\n", "        X_calib=X_calib,\n", "        T_calib=T_calib,\n", "        w_calib=w_calib,\n", "        regressor = make_custom_regressor,\n", "        classifier = make_custom_classifier,\n", "        calibrator = make_custom_calibrator)\n", "\n", "w_sort = np.concatenate((w_eval[T_eval==0], w_eval[T_eval==1])) # Order weights in same way as scores\n", "\n", "# This function combines the theta^c parameters to obtain Shapley values:\n", "def compute_attr_measure(res_dict, path=False):\n", "    # Alternative to Shapley Value: along-a-causal-path (see paper)\n", "    if path:\n", "        path = np.zeros(3)\n", "        path[0] = np.average(res_dict['100'], weights=w_sort) - np.average(res_dict['000'], weights=w_sort)\n", "        path[1] = np.average(res_dict['110'], weights=w_sort) - np.average(res_dict['100'], weights=w_sort)\n", "        path[2] = np.average(res_dict['111'], weights=w_sort) - np.average(res_dict['110'], weights=w_sort)\n", "        return path\n", "\n", "    # Shapley values:\n", "    else:\n", "        shap = np.zeros(3)\n", "        for k in range(3):\n", "            sv = 0.0\n", "            for C in all_combos_minus1:\n", "                C1 = np.insert(C, k, True)\n", "                C0 = np.insert(C, k, False)\n", "                chg = (np.average(res_dict[''.join(map(lambda x : str(int(x)), C1))], weights=w_sort) -\n", "                       np.average(res_dict[''.join(map(lambda x : str(int(x)), C0))], weights=w_sort))\n", "                sv += chg/(3*comb(2, np.sum(C)))\n", "            shap[k] = sv\n", "        return shap\n", "\n", "shap = compute_attr_measure(scores, path=False)\n", "shap # Should coincide with the above" ] }, { "cell_type": "markdown", "id": "feb423b1-52ef-4e5b-a3be-589cbe70bbc8", "metadata": {}, "source": [ "Thanks to the multiple robustness property, we can compute
standard errors using a quick form of the bootstrap, which does not require re-estimating the regression and weights at each bootstrap iteration. Instead, we simply re-weight the estimated scores using i.i.d. Normal(0, 1) multiplier draws." ] }, { "cell_type": "code", "execution_count": null, "id": "e6d0b582-20c6-476b-a07c-9d336ccda3fa", "metadata": { "tags": [] }, "outputs": [], "source": [ "w_sort = np.concatenate((w_eval[T_eval==0], w_eval[T_eval==1])) # Order weights in same way as scores\n", "\n", "def mult_boot(res_dict, Nsim=1000, path=False):\n", "    thetas = np.zeros((8, Nsim))\n", "    attr = np.zeros((3, Nsim))\n", "    for s in range(Nsim):\n", "        np.random.seed(s)\n", "        new_scores = {}\n", "        for k, x in res_dict.items():\n", "            # Perturb each score vector around its weighted mean with N(0, 1) multipliers:\n", "            new_scores[k] = x + np.random.normal(0,1, X_eval.shape[0])*(x - np.average(x, weights=w_sort))\n", "        thetas[:, s] = np.average(np.array([x for k, x in new_scores.items()]), axis=1, weights=w_sort)\n", "        attr[:, s] = compute_attr_measure(new_scores, path)\n", "    return np.std(attr, axis=1)\n", "\n", "shap_se = mult_boot(scores, path=False)\n", "shap_se" ] }, { "cell_type": "markdown", "id": "2cb64bab-fcf3-4a32-ad8b-16da0b26536c", "metadata": {}, "source": [ "We can present the results visually in a graph, as in the paper:" ] }, { "cell_type": "code", "execution_count": null, "id": "95ec0312-84c2-4405-a823-740c98482090", "metadata": { "tags": [] }, "outputs": [], "source": [ "from scipy.stats import norm\n", "from statsmodels.stats.weightstats import DescrStatsW\n", "from matplotlib.patches import Rectangle\n", "import matplotlib.pyplot as plt\n", "\n", "# Significance stars:\n", "def star(est, se):\n", "    if np.abs(est/se)