Demo for the DoWhy causal API
We show a simple example of adding a causal extension to any pandas DataFrame: importing dowhy.api registers a causal accessor on every DataFrame.
[1]:
import dowhy.datasets
import dowhy.api
import numpy as np
import pandas as pd
from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to y; without it the variance of y given v0 and W0 is zero and MCMC fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
|     | W0        | v0    | y         |
|---|---|---|---|
| 0   | -1.012777 | False | -3.042986 |
| 1   | 0.289858  | False | 1.112130  |
| 2   | 1.347723  | True  | 8.063621  |
| 3   | -0.464752 | False | -0.189419 |
| 4   | 0.121820  | False | -0.439537 |
| ... | ...       | ...   | ...       |
| 995 | -1.797895 | False | -4.532083 |
| 996 | -1.601010 | False | -4.809766 |
| 997 | -0.639639 | False | -1.936961 |
| 998 | -2.256174 | False | -5.319564 |
| 999 | -0.270505 | False | -0.054673 |
1000 rows × 3 columns
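Because W0 is a common cause of both v0 and y by construction, a naive comparison of group means on the raw data is confounded and need not recover the true effect. As a point of reference before applying the do operator, that naive estimate can be computed directly (a quick sketch using only names defined above):

# Naive (confounded) comparison of outcome means on the raw data.
# This mixes the treatment effect with the influence of W0.
naive = df.groupby(treatment)[outcome].mean()
print(naive[True] - naive[False])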
[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
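The do operation returns an ordinary pandas.DataFrame sampled from the interventional distribution (see the cdf_0 and cdf_1 outputs below), so instead of plotting you can read the two interventional means off directly. A minimal sketch:

# do() returns a plain DataFrame, so standard pandas operations apply.
cdf = df.causal.do(x=treatment,
                   variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                   outcome=outcome,
                   common_causes=[common_cause],
                   proceed_when_unidentifiable=True)
print(cdf.groupby(treatment)[outcome].mean())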
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
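The 'weighting' method reweights samples by inverse propensity scores (the propensity_score and weight columns in the outputs below). As a rough sketch of the same idea, not DoWhy's internal implementation, the estimate can be assembled by hand with a scikit-learn propensity model (scikit-learn is an extra dependency here):

from sklearn.linear_model import LogisticRegression

# Fit a propensity model P(v0 = True | W0) on the observed data.
ps_model = LogisticRegression().fit(df[[common_cause]], df[treatment])
ps = ps_model.predict_proba(df[[common_cause]])[:, 1]

# Horvitz-Thompson estimate of E[y | do(v0=1)] - E[y | do(v0=0)].
t = df[treatment].astype(float)
ate_ipw = np.mean(t * df[outcome] / ps) - np.mean((1 - t) * df[outcome] / (1 - ps))
print(ate_ipw)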
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
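These two calls identify the effect from the causal graph rather than an explicit common_causes list. The generated graph encodes W0 as a common cause of v0 and y, and can be inspected directly (the exact node labels come from the dataset, so the commented example below is only indicative):

# Inspect the graph generated by linear_dataset; it should look
# roughly like: digraph { W0 -> v0; W0 -> y; v0 -> y; }
print(data['dot_graph'])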
[6]:
cdf_0
[6]:
|     | W0        | v0    | y         | propensity_score | weight   |
|---|---|---|---|---|---|
| 0   | -0.304858 | False | -1.448515 | 0.532390         | 1.878323 |
| 1   | -0.625992 | False | -2.489382 | 0.594945         | 1.680829 |
| 2   | 0.918621  | False | 2.341392  | 0.301398         | 3.317870 |
| 3   | -0.650677 | False | -1.249433 | 0.599654         | 1.667629 |
| 4   | 0.232127  | False | 0.969079  | 0.426496         | 2.344690 |
| ... | ...       | ...   | ...       | ...              | ...      |
| 995 | -0.486277 | False | -1.746970 | 0.567984         | 1.760612 |
| 996 | -0.994063 | False | -0.910734 | 0.662929         | 1.508456 |
| 997 | -0.600021 | False | -0.240069 | 0.589971         | 1.694998 |
| 998 | -1.467687 | False | -3.689877 | 0.741164         | 1.349230 |
| 999 | 0.139491  | False | -0.503729 | 0.444556         | 2.249433 |
1000 rows × 5 columns
[7]:
cdf_1
[7]:
|     | W0        | v0   | y         | propensity_score | weight   |
|---|---|---|---|---|---|
| 0   | -0.540636 | True | 3.159776  | 0.421469         | 2.372652 |
| 1   | -0.517413 | True | 4.373290  | 0.425967         | 2.347601 |
| 2   | -1.011418 | True | 3.547779  | 0.334002         | 2.993996 |
| 3   | -1.705678 | True | -0.387705 | 0.224299         | 4.458329 |
| 4   | -1.809141 | True | -0.779102 | 0.210345         | 4.754091 |
| ... | ...       | ...  | ...       | ...              | ...      |
| 995 | -0.207755 | True | 5.120488  | 0.486822         | 2.054139 |
| 996 | -0.597498 | True | 3.878264  | 0.410513         | 2.435976 |
| 997 | 0.806819  | True | 7.124728  | 0.679609         | 1.471435 |
| 998 | -1.339661 | True | 3.927840  | 0.278788         | 3.586958 |
| 999 | -2.158812 | True | -0.987639 | 0.167956         | 5.953941 |
1000 rows × 5 columns
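In the samples shown above, the weight column is simply the inverse of the propensity score for the intervened treatment value, which can be sanity-checked directly:

# weight should equal 1 / propensity_score in both interventional samples.
assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])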
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal data frame, along with a 95% confidence interval.
[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.00995911537052
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.214614822056212
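Putting the two cells together into a single readable estimate (a small convenience sketch):

diff = cdf_1['y'] - cdf_0['y']
ate = diff.mean()
ci = 1.96 * diff.std() / np.sqrt(len(df))
print(f"ATE = {ate:.3f} +/- {ci:.3f}")  # close to the true effect beta = 5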
Comparing to the estimate from OLS.
[10]:
# Note: no constant term is included, so statsmodels reports uncentered R-squared.
model = OLS(np.asarray(df[outcome]),
            np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
| Dep. Variable:    | y                | R-squared (uncentered):      | 0.938   |
|---|---|---|---|
| Model:            | OLS              | Adj. R-squared (uncentered): | 0.938   |
| Method:           | Least Squares    | F-statistic:                 | 7593.   |
| Date:             | Tue, 05 Sep 2023 | Prob (F-statistic):          | 0.00    |
| Time:             | 05:11:13         | Log-Likelihood:              | -1418.2 |
| No. Observations: | 1000             | AIC:                         | 2840.   |
| Df Residuals:     | 998              | BIC:                         | 2850.   |
| Df Model:         | 2                |                              |         |
| Covariance Type:  | nonrobust        |                              |         |

|    | coef   | std err | t       | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 2.3582 | 0.028   | 85.636  | 0.000   | 2.304  | 2.412  |
| x2 | 5.0888 | 0.050   | 101.283 | 0.000   | 4.990  | 5.187  |

| Omnibus:       | 5.587  | Durbin-Watson:    | 2.044  |
|---|---|---|---|
| Prob(Omnibus): | 0.061  | Jarque-Bera (JB): | 5.476  |
| Skew:          | -0.177 | Prob(JB):         | 0.0647 |
| Kurtosis:      | 3.080  | Cond. No.         | 1.87   |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
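For a direct comparison, the treatment coefficient can be read off the fitted result; the treatment was passed as the second column, so it appears as x2 in the summary. Both the OLS estimate (about 5.09) and the interventional estimate above (about 5.01 +/- 0.21) are consistent with the true effect beta=5 used to generate the data:

# x2 is the treatment column (it was passed second to OLS above).
ols_ate = result.params[1]
ols_ci_low, ols_ci_high = result.conf_int()[1]
print(f"OLS estimate: {ols_ate:.3f}, 95% CI [{ols_ci_low:.3f}, {ols_ci_high:.3f}]")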