Demo for the DoWhy causal API
We show a simple example of adding a causal extension to any dataframe: importing `dowhy.api` registers a `causal` accessor on `pandas.DataFrame`, whose `do` method draws samples from an interventional distribution.
[1]:
import dowhy.datasets
import dowhy.api
import numpy as np
import pandas as pd
from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise; without it the variance of Y|X,Z is zero and MCMC fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'
# Column names generated by the dataset helper.
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
|     | W0        | v0    | y         |
|-----|-----------|-------|-----------|
| 0   | -1.085634 | False | -4.084557 |
| 1   | -1.418167 | False | -4.159834 |
| 2   | -1.060908 | False | -1.876372 |
| 3   | -0.040109 | True  | 4.739495  |
| 4   | -1.135199 | False | -2.126015 |
| ... | ...       | ...   | ...       |
| 995 | 0.167694  | True  | 6.002419  |
| 996 | -1.342202 | False | -5.061231 |
| 997 | -1.115815 | False | -4.549248 |
| 998 | 0.224101  | False | 2.319660  |
| 999 | -0.559218 | False | -1.882604 |

1000 rows × 3 columns
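Besides the dataframe, the returned dict records which generated columns play which causal role; a quick sketch (using only keys already referenced in this notebook):

```python
# The dataset helper also reports the generated variable names.
print(data['treatment_name'], data['outcome_name'], data['common_causes_names'])
```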
[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
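These two calls pass the generating graph instead of an explicit common_causes list, so the adjustment set is identified from the graph. As a sketch, the DOT string can be inspected directly:

```python
# The graph string produced by linear_dataset and passed to do() above.
print(data['dot_graph'])
```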
[6]:
cdf_0
[6]:
|     | W0        | v0    | y         | propensity_score | weight    |
|-----|-----------|-------|-----------|------------------|-----------|
| 0   | -0.550093 | False | -1.532310 | 0.756892         | 1.321192  |
| 1   | -0.722684 | False | -1.846021 | 0.810431         | 1.233912  |
| 2   | -1.996264 | False | -6.116133 | 0.977962         | 1.022534  |
| 3   | -0.525880 | False | -0.314164 | 0.748614         | 1.335802  |
| 4   | -0.668037 | False | -3.078179 | 0.794523         | 1.258616  |
| ... | ...       | ...   | ...       | ...              | ...       |
| 995 | -2.188685 | False | -5.719837 | 0.984423         | 1.015824  |
| 996 | 0.607096  | False | 0.420697  | 0.270844         | 3.692155  |
| 997 | -0.310138 | False | -0.762520 | 0.667047         | 1.499146  |
| 998 | -1.141617 | False | -4.197842 | 0.902252         | 1.108338  |
| 999 | 0.757894  | False | 1.761406  | 0.219703         | 4.551601  |

1000 rows × 5 columns
[7]:
cdf_1
[7]:
|     | W0        | v0   | y        | propensity_score | weight    |
|-----|-----------|------|----------|------------------|-----------|
| 0   | -0.568209 | True | 3.579572 | 0.237036         | 4.218773  |
| 1   | -0.506941 | True | 3.209480 | 0.257991         | 3.876099  |
| 2   | 0.701984  | True | 7.612798 | 0.762181         | 1.312024  |
| 3   | -0.486111 | True | 4.661744 | 0.265385         | 3.768110  |
| 4   | 0.228094  | True | 5.046634 | 0.572979         | 1.745264  |
| ... | ...       | ...  | ...      | ...              | ...       |
| 995 | -0.472540 | True | 2.449277 | 0.270274         | 3.699946  |
| 996 | -1.012103 | True | 1.651103 | 0.120835         | 8.275773  |
| 997 | -1.472749 | True | 0.877186 | 0.055678         | 17.960356 |
| 998 | -0.103875 | True | 3.710211 | 0.421676         | 2.371490  |
| 999 | -0.375534 | True | 5.820682 | 0.306825         | 3.259191  |

1000 rows × 5 columns
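As the two tables suggest, each sampled row's weight is the reciprocal of its propensity_score, the estimated probability of the treatment value that was set. A minimal sanity-check sketch, assuming cdf_0 and cdf_1 from the cells above:

```python
import numpy as np

# Sketch: weights produced by the weighting method are inverse propensities.
assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])
```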
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal data frame, along with its 95% confidence interval.
[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 5.45773159896272$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.247766786137291$
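For reference, the two cells above compute the mean of the row-wise differences between the two interventional samples and its normal-approximation 95% half-width:

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{(1)} - y_i^{(0)}\right), \qquad \hat{\tau} \pm 1.96\,\frac{s}{\sqrt{n}},$$

where $y_i^{(1)}$ and $y_i^{(0)}$ are the outcome columns of cdf_1 and cdf_0, $s$ is the sample standard deviation of the differences, and $n = 1000$.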
Comparing to the estimate from OLS.
[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
| | | | |
|---|---|---|---|
| Dep. Variable:    | y                | R-squared (uncentered):      | 0.955     |
| Model:            | OLS              | Adj. R-squared (uncentered): | 0.955     |
| Method:           | Least Squares    | F-statistic:                 | 1.062e+04 |
| Date:             | Thu, 03 Aug 2023 | Prob (F-statistic):          | 0.00      |
| Time:             | 18:29:21         | Log-Likelihood:              | -1426.4   |
| No. Observations: | 1000             | AIC:                         | 2857.     |
| Df Residuals:     | 998              | BIC:                         | 2867.     |
| Df Model:         | 2                |                              |           |
| Covariance Type:  | nonrobust        |                              |           |
|    | coef   | std err | t      | P>\|t\| | [0.025 | 0.975] |
|----|--------|---------|--------|---------|--------|--------|
| x1 | 2.7773 | 0.030   | 93.411 | 0.000   | 2.719  | 2.836  |
| x2 | 5.0002 | 0.054   | 91.951 | 0.000   | 4.894  | 5.107  |
| | | | |
|---|---|---|---|
| Omnibus:       | 2.397  | Durbin-Watson:    | 1.975 |
| Prob(Omnibus): | 0.302  | Jarque-Bera (JB): | 2.464 |
| Skew:          | -0.105 | Prob(JB):         | 0.292 |
| Kurtosis:      | 2.878  | Cond. No.         | 1.89  |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
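As a small follow-up sketch (not part of the original notebook), the 95% interval for the treatment coefficient can be read off the fitted result for a direct comparison with the weighting-based interval above:

```python
# Index 1 corresponds to x2, the treatment column in the design matrix.
ols_effect = result.params[1]
ci_low, ci_high = result.conf_int()[1]
print(f"OLS treatment effect: {ols_effect:.4f} (95% CI [{ci_low:.3f}, {ci_high:.3f}])")
```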