Impact of 401(k) eligibility on net financial assets#

In this case study, we will use real-world data from 401(k) analysis to explain how graphical causal models can be used to estimate average treatment effect (ATE) and conditional ATE (CATE).

Background#

In the early 1980s, the United States government introduced several tax deferred savings options for employees in an effort to increase individual saving for retirement. One popular option is the 401(k) plan, which allows employees to contribute a portion of their wages to individual accounts. The goal here is to understand the effect of 401(k) eligibility on net financial assets (which is a sum of 401(k) balances and non-401(k) assets) considering heterogeneity due to individual’s characteristics (income in particular).

Since 401(k) plans are provided by employers, only employees of companies that offer those plans are eligible for participation. As such, we are dealing with a non-randomized study. Several factors (e.g. education, preference for saving) affect 401(k) eligibility as well as net financial assets.

Data#

We consider a sample from the Survey of Income and Program Participation in 1991. The sample consists of households where the reference individual is 25-64 years old, and at least one individual is employed but no one is self-employed. There are records of 9915 households in the sample. For each household, 44 variables are recorded that include the eligibility of the household reference person for the 401(k) plan (the treatment), net financial assets (the outcome), and other covariates, such as age, income, family size, education, marital status, etc. We consider 16 covariates in particular.

We summarise the variables used for this case study in the table below.

Variable Name

Type

Details

e401

Treatment

eligibility for the 401(k) plan

net_tfa

Outcome

net financial assets (in USD)

age

Covariate

Age

inc

Covariate

income (in USD)

fsize

Covariate

family size

educ

Covariate

education (in years)

male

Covariate

is a male?

db

Covariate

defined benefit pension

marr

Covariate

married?

twoearn

Covariate

two earners

pira

Covariate

participation in IRA

hown

Covariate

home owner?

hval

Covariate

home value (in USD)

hequity

Covariate

home equity (in USD)

hmort

Covariate

home mortgage (in USD)

nohs

Covariate

no high-school? (one-hot encoded)

hs

Covariate

high-school? (one-hot encoded)

smcol

Covariate

some-college? (one-hot encoded)

The dataset is publicly available online from the `hdm <https://rdrr.io/cran/hdm/man/pension.html>`__ R package. For more details about the data set, we refer the interested reader to the following paper:

  1. Chernohukov, C. Hansen (2004). The impact of 401(k) participation on the wealth distribution: An instrumental quantile regression analysis. The Review of Economic and Statistics 86 (3), 735–751.

Let’s load and analyse the data first.

[1]:
import pandas as pd
df = pd.read_csv("pension.csv")
[2]:
df.head()
[2]:
ira a401 hval hmort hequity nifa net_nifa tfa net_tfa tfa_he ... i3 i4 i5 i6 i7 a1 a2 a3 a4 a5
0 0 0 69000 60150 8850 100 -3300 100 -3300 5550 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1 0 0 78000 20000 58000 61010 61010 61010 61010 119010 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1800 0 200000 15900 184100 7549 7049 9349 8849 192949 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0 0 0 0 0 2487 -6013 2487 -6013 -6013 ... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 0 0 300000 90000 210000 10625 -2375 10625 -2375 207625 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 44 columns

Effect of 401(k) Eligibility on Net Financial Assets, Conditioned on Income#

First we construct a causal graph of 401(k) plan eligibility (the treatment \(T\)), net financial assets (the outcome \(Y\)), control variables \(W\) we adjust for assuming that they block all back-door paths between \(Y\) and \(T\), and income \(X\) (the covariate of interest based on which we want to study the heterogeneity of treatment effect).

[3]:
import networkx as nx
import dowhy.gcm as gcm

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
              "marr","twoearn","pira","hown","hval",
              "hequity","hmort","nohs","hs","smcol"]

edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])

# To ensure that the treatment is considered as a categorical variable, we convert the values explicitly to strings.
df = df.astype({treatment_var: str})

causal_graph = nx.DiGraph(edges)
[4]:
from dowhy.utils import plot

plot(causal_graph, figure_size=[20, 20])
../_images/example_notebooks_gcm_401k_analysis_7_0.png

Here we created a simplified graph where there are no interactions between covariates (i.e. nodes in \(X \cup W\)). Most likely, that is not the case in practice. However, as we take joint samples of the covariates—directly from the observed data—later to estimate CATEs, we can ignore their interactions.

Before we assign causal models to variables, let’s plot their histograms to get an idea on the distribution of variables.

[5]:
import matplotlib.pyplot as plt

cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()
../_images/example_notebooks_gcm_401k_analysis_9_0.png

We observe that real-valued variables do not follow well-known parametric distributions like Gaussian. Therefore, we fit empirical distributions whenever those variables do not have parents, which is also suitable for categorical variables.

Let’s assign the causal mechanisms to each node explicitly. For the treatment variable, we assign a classifier functional causal model (FCM) with a random forest classifier. For the outcome variable, we assign an additive noise model with random forest regression as a function and an empirical distribution for the noise. We assign empirical distributions to other variables as they do not have parents in the causal graph.

[6]:
causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())

If we don’t have prior knowledge or are not familiar with the statistical implications, we can also assign causal mechanisms automatically using a heuristic:

gcm.auto.assign_causal_mechanisms(causal_model, df)

With that, we can now fit the learn the causal models from data.

[7]:
gcm.fit(causal_model, df)
Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00,  2.64it/s]

Before computing CATE, we first divide households into equi-width bins of income percentiles. This allows us to study the impact on various income groups.

[8]:
import numpy as np

percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize

groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges)+1), groups))

Now we can compute CATE. To this end, we perform a randomised intervention on the treatment variable in the fitted causal graph, draw samples from the interventional distribution, group observations by the income group, and then compute the treatment effect in each group.

[9]:
np.random.seed(47)

def estimate_cate():
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result

group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)
Estimating bootstrap interval...: 100%|██████████| 100/100 [00:44<00:00,  2.27it/s]
{'ate': 6441.310755116158, '0%-20%': 3857.2640553348688, '20%-40%': 3156.748142545304, '40%-60%': 5588.30517050229, '60%-80%': 7915.379276020617, '80%-100%': 11517.07598704482}
{'ate': array([4906.32695767, 8138.9820634 ]), '0%-20%': array([2517.58236656, 5297.25607696]), '20%-40%': array([1176.91431346, 5369.69403251]), '40%-60%': array([3322.18732087, 8022.22142553]), '60%-80%': array([ 5110.81424832, 10840.05426858]), '80%-100%': array([ 5446.70651369, 18548.6666534 ])}

The average treatment effect of 401(k) eligibility on net financial assets is positive as indicated by the corresponding confidence interval. Now, let’s plot CATEs of various income groups to get a clear picture.

[10]:
fig = plt.figure(figsize=(8,4))
for x, group in enumerate(groups):
    ci = group_to_ci[group]
    plt.plot((x, x), (ci[0], ci[1]), 'ro-', color='orange')
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()
/tmp/ipykernel_4331/1344202636.py:4: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro-" (-> color='r'). The keyword argument will take precedence.
  plt.plot((x, x), (ci[0], ci[1]), 'ro-', color='orange')
../_images/example_notebooks_gcm_401k_analysis_20_1.png

The impact increases as one moves from lower to higher income groups. This result seems to be consistent with the resource constraints of the different income groups.