Demo for the DoWhy causal API
We show a simple example of adding a causal extension to any pandas DataFrame.
[1]:
import dowhy.datasets
import dowhy.api
import numpy as np
import pandas as pd
from statsmodels.api import OLS
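Importing `dowhy.api` is what attaches the `.causal` accessor to every pandas DataFrame. As a minimal sketch of the underlying mechanism (this uses pandas' standard accessor-registration API; the hypothetical `demo` accessor below is for illustration only, not DoWhy's actual implementation):

```python
import pandas as pd

# Hypothetical example: registering a "demo" accessor makes df.demo
# available on every DataFrame, just as importing dowhy.api makes
# df.causal available.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def n_rows(self):
        # Any method defined here is reachable as df.demo.n_rows().
        return len(self._obj)
```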
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to the outcome. Without noise, the variance of Y|X,Z is zero and MCMC sampling fails.
# data['dot_graph'] = 'digraph { v -> y; X0 -> v; X0 -> y; }'
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
|     | W0        | v0    | y         |
|-----|-----------|-------|-----------|
| 0   | -0.436905 | False | 1.746834  |
| 1   | 0.276758  | True  | 6.017127  |
| 2   | -0.035913 | False | 0.990615  |
| 3   | 0.109910  | False | 1.095498  |
| 4   | -1.136348 | False | -0.984085 |
| ... | ...       | ...   | ...       |
| 995 | -0.623168 | False | -1.233430 |
| 996 | 0.748328  | True  | 5.313026  |
| 997 | -0.194007 | True  | 4.264915  |
| 998 | -0.555842 | False | 1.552783  |
| 999 | 0.702504  | False | 1.165711  |

1000 rows × 3 columns
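Since W0 is a common cause of both v0 and y, a naive comparison of group means is confounded. As a quick reference point (not part of the original demo), the naive estimate can be computed directly; with `beta=5`, it will generally be biased away from the true effect:

```python
# Naive (confounded) difference in group means. W0 drives both
# treatment and outcome, so this differs from the true effect beta = 5.
treated = df[df[treatment] == True]
control = df[df[treatment] == False]
naive_estimate = treated[outcome].mean() - control[outcome].mean()
print(naive_estimate)
```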
[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot: xlabel='v0'>

*(bar chart of the mean of y for v0 = False vs. True)*
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot: xlabel='v0'>

*(bar chart of the mean of y for v0 = False vs. True)*
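`do` returns a new DataFrame sampled from the interventional distribution P(y | do(v0)); with `method='weighting'`, the sampling is driven by inverse propensity weights. For intuition, a hand-rolled inverse-propensity-weighting estimate might look like the sketch below, assuming a logistic-regression propensity model (an illustration only; DoWhy's `'weighting'` method may differ in its details):

```python
from sklearn.linear_model import LogisticRegression

# Sketch of inverse propensity weighting under an assumed
# logistic-regression propensity model.
ps = (LogisticRegression()
      .fit(df[[common_cause]], df[treatment])
      .predict_proba(df[[common_cause]])[:, 1])  # P(treatment = 1 | W0)
t = df[treatment].astype(float)
y = df[outcome]
ipw_ate = (t * y / ps).mean() - ((1 - t) * y / (1 - ps)).mean()
print(ipw_ate)
```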
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
[6]:
cdf_0
[6]:
|     | W0        | v0    | y         | propensity_score | weight   |
|-----|-----------|-------|-----------|------------------|----------|
| 0   | -2.451867 | False | -5.228110 | 0.972026         | 1.028779 |
| 1   | 0.367597  | False | 0.576006  | 0.373936         | 2.674253 |
| 2   | 0.491363  | False | 2.175727  | 0.333202         | 3.001187 |
| 3   | 0.711783  | False | 0.566149  | 0.266703         | 3.749491 |
| 4   | -1.658345 | False | -3.368804 | 0.917168         | 1.090313 |
| ... | ...       | ...   | ...       | ...              | ...      |
| 995 | 0.991758  | False | 2.705305  | 0.195459         | 5.116168 |
| 996 | -1.007565 | False | -0.140347 | 0.812533         | 1.230720 |
| 997 | 0.107398  | False | 0.149837  | 0.464967         | 2.150689 |
| 998 | 0.221108  | False | -0.007530 | 0.424521         | 2.355598 |
| 999 | 1.491955  | False | 0.366618  | 0.105664         | 9.463981 |

1000 rows × 5 columns
[7]:
cdf_1
[7]:
|     | W0        | v0   | y        | propensity_score | weight   |
|-----|-----------|------|----------|------------------|----------|
| 0   | -0.006548 | True | 6.710813 | 0.494035         | 2.024147 |
| 1   | -0.741454 | True | 3.937414 | 0.252934         | 3.953604 |
| 2   | -0.498823 | True | 4.745848 | 0.324462         | 3.082022 |
| 3   | 0.218096  | True | 3.273484 | 0.574418         | 1.740891 |
| 4   | 0.651958  | True | 6.479414 | 0.716099         | 1.396454 |
| ... | ...       | ...  | ...      | ...              | ...      |
| 995 | -0.498823 | True | 4.745848 | 0.324462         | 3.082022 |
| 996 | 0.628752  | True | 6.779816 | 0.709251         | 1.409938 |
| 997 | -0.394105 | True | 4.843123 | 0.358377         | 2.790360 |
| 998 | 0.628752  | True | 6.779816 | 0.709251         | 1.409938 |
| 999 | 0.112639  | True | 5.710984 | 0.536911         | 1.862505 |

1000 rows × 5 columns
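In the returned frames, `weight` looks like the inverse of `propensity_score` for the realized treatment (e.g. 1/0.972026 ≈ 1.028779 in the first row of cdf_0). This is an observation about the rows above, not documented behavior, but it can be checked directly:

```python
# Sanity check of the apparent relationship weight == 1 / propensity_score.
np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
```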
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal DataFrame and compute a 95% confidence interval.
[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
4.97688077947066
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.135312498086271
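The interval above uses a normal approximation to the paired differences of do-samples. A nonparametric bootstrap over the two sets of do-samples is a simple alternative check (an illustration, not the notebook's original method):

```python
# Bootstrap the difference in means of the two sets of do-samples.
rng = np.random.default_rng(0)
y1 = cdf_1['y'].to_numpy()
y0 = cdf_0['y'].to_numpy()
boot = np.array([
    rng.choice(y1, size=len(y1)).mean() - rng.choice(y0, size=len(y0)).mean()
    for _ in range(1000)
])
print(np.percentile(boot, [2.5, 97.5]))
```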
Now we compare this to the estimate from OLS.
[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
| Dep. Variable:    | y                | R-squared (uncentered):      | 0.960     |
|-------------------|------------------|------------------------------|-----------|
| Model:            | OLS              | Adj. R-squared (uncentered): | 0.960     |
| Method:           | Least Squares    | F-statistic:                 | 1.201e+04 |
| Date:             | Sat, 17 Dec 2022 | Prob (F-statistic):          | 0.00      |
| Time:             | 06:38:19         | Log-Likelihood:              | -1360.8   |
| No. Observations: | 1000             | AIC:                         | 2726.     |
| Df Residuals:     | 998              | BIC:                         | 2735.     |
| Df Model:         | 2                |                              |           |
| Covariance Type:  | nonrobust        |                              |           |
|    | coef   | std err | t       | P>\|t\| | [0.025 | 0.975] |
|----|--------|---------|---------|---------|--------|--------|
| x1 | 1.1106 | 0.036   | 31.135  | 0.000   | 1.041  | 1.181  |
| x2 | 5.1407 | 0.047   | 109.376 | 0.000   | 5.048  | 5.233  |
| Omnibus:       | 0.779  | Durbin-Watson:    | 1.922 |
|----------------|--------|-------------------|-------|
| Prob(Omnibus): | 0.678  | Jarque-Bera (JB): | 0.700 |
| Skew:          | -0.062 | Prob(JB):         | 0.705 |
| Kurtosis:      | 3.040  | Cond. No.         | 1.99  |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
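The regression above is fit without an intercept, which is why statsmodels reports uncentered R-squared. A variant with a constant term (a hedged sketch, not part of the original notebook) would be:

```python
import statsmodels.api as sm

# Same regression with an explicit intercept; the coefficient on the
# treatment column should remain close to the true effect of 5.
X = sm.add_constant(np.asarray(df[[common_cause, treatment]], dtype=np.float64))
print(sm.OLS(np.asarray(df[outcome]), X).fit().params)
```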