Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any pandas DataFrame.

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
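
Importing dowhy.api is what enables the extension: as a side effect, the import registers a causal accessor on every pandas DataFrame, so nothing from the module is referenced directly. A minimal sketch (not part of the original notebook) of that side effect:

import pandas as pd
import dowhy.api  # imported only to register the .causal accessor

frame = pd.DataFrame({'x': [0, 1]})
print(hasattr(frame, 'causal'))  # True once dowhy.api has been imported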
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise: without it, Var(Y | X, Z) is zero and the MCMC do-sampler fails.
# data['dot_graph'] = 'digraph { v0 -> y; W0 -> v0; W0 -> y; }'  # optional override; the generator returns this structure

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
           W0     v0         y
0   -1.012777  False -3.042986
1    0.289858  False  1.112130
2    1.347723   True  8.063621
3   -0.464752  False -0.189419
4    0.121820  False -0.439537
..        ...    ...       ...
995 -1.797895  False -4.532083
996 -1.601010  False -4.809766
997 -0.639639  False -1.936961
998 -2.256174  False -5.319564
999 -0.270505  False -0.054673

1000 rows × 3 columns
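
Here W0 is the common cause, v0 the binary treatment, and y the outcome, generated with a true effect of beta = 5. Because W0 drives both v0 and y, a naive difference in group means is confounded. A quick hedged check (not in the original notebook) of that bias before any adjustment:

# Naive difference in means; confounded by W0, so expect it to be biased
# away from the true beta = 5 (upward in this draw, where W0 raises both v0 and y).
naive = df[df[treatment]][outcome].mean() - df[~df[treatment]][outcome].mean()
print(naive)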

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
[Figure: bar chart of mean y by treatment level v0 (../_images/example_notebooks_dowhy_causal_api_3_1.png)]
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
[Figure: bar chart of mean y under do(v0 = 1) (../_images/example_notebooks_dowhy_causal_api_4_1.png)]
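
When the treatment is passed by name, as in the previous cell, do samples an intervention value for v0 rather than fixing one, which is why that result covers both arms; passing x={treatment: 1} here pins the intervention to a single value, and method='weighting' selects the propensity-weighting do-sampler. A hedged sketch (same names as above) of the matching do(v0 = 0) call for the other arm:

df.causal.do(x={treatment: 0},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True)[outcome].mean()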
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
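
Passing dot_graph delegates identification to DoWhy: instead of being handed common_causes, it reads the backdoor adjustment set (just W0 here) off the graph. A hedged sketch of the same identification step through the CausalModel API, which the accessor builds on:

from dowhy import CausalModel

model = CausalModel(data=df, treatment=treatment, outcome=outcome,
                    graph=data['dot_graph'])
print(model.identify_effect(proceed_when_unidentifiable=True))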

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score    weight
0   -0.304858  False -1.448515          0.532390  1.878323
1   -0.625992  False -2.489382          0.594945  1.680829
2    0.918621  False  2.341392          0.301398  3.317870
3   -0.650677  False -1.249433          0.599654  1.667629
4    0.232127  False  0.969079          0.426496  2.344690
..        ...    ...       ...               ...       ...
995 -0.486277  False -1.746970          0.567984  1.760612
996 -0.994063  False -0.910734          0.662929  1.508456
997 -0.600021  False -0.240069          0.589971  1.694998
998 -1.467687  False -3.689877          0.741164  1.349230
999  0.139491  False -0.503729          0.444556  2.249433

1000 rows × 5 columns

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0   -0.540636  True  3.159776          0.421469  2.372652
1   -0.517413  True  4.373290          0.425967  2.347601
2   -1.011418  True  3.547779          0.334002  2.993996
3   -1.705678  True -0.387705          0.224299  4.458329
4   -1.809141  True -0.779102          0.210345  4.754091
..        ...   ...       ...               ...       ...
995 -0.207755  True  5.120488          0.486822  2.054139
996 -0.597498  True  3.878264          0.410513  2.435976
997  0.806819  True  7.124728          0.679609  1.471435
998 -1.339661  True  3.927840          0.278788  3.586958
999 -2.158812  True -0.987639          0.167956  5.953941

1000 rows × 5 columns
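
The two extra columns record the estimated propensity of each row's treatment value given W0, and its inverse, the weight used to reweight the sample. As a hedged sanity check (not in the original notebook), a Horvitz-Thompson-style estimate built the same way recovers the effect directly from the observed data:

# Inverse-propensity-weighted ATE, roughly what the 'weighting' do-sampler
# does internally. LogisticRegression is an illustrative propensity model
# here, not necessarily the one DoWhy uses.
from sklearn.linear_model import LogisticRegression

ps = LogisticRegression().fit(df[[common_cause]], df[treatment])
p1 = ps.predict_proba(df[[common_cause]])[:, 1]   # P(v0 = 1 | W0)
t = df[treatment].astype(float)
ate = np.mean(t * df[outcome] / p1 - (1 - t) * df[outcome] / (1 - p1))
print(ate)  # should land near beta = 5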

Comparing the estimate to Linear Regression

First, estimate the effect using the causal data frames, together with an approximate 95% confidence interval. Since the rows of cdf_1 and cdf_0 are independent interventional samples, the standard deviation of their row-wise differences already combines both variances, so 1.96·sd/√n below is the familiar two-sample normal-approximation interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.00995911537052
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.214614822056212

Comparing to the estimate from OLS. The regression includes W0 and the treatment but no constant, which is why statsmodels reports the uncentered R² below.

[10]:
# Columns in order: x1 = W0 (common cause), x2 = v0 (treatment).
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.938
Model:                            OLS   Adj. R-squared (uncentered):              0.938
Method:                 Least Squares   F-statistic:                              7593.
Date:                Tue, 05 Sep 2023   Prob (F-statistic):                        0.00
Time:                        05:11:13   Log-Likelihood:                         -1418.2
No. Observations:                1000   AIC:                                      2840.
Df Residuals:                     998   BIC:                                      2850.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.3582      0.028     85.636      0.000       2.304       2.412
x2             5.0888      0.050    101.283      0.000       4.990       5.187
==============================================================================
Omnibus:                        5.587   Durbin-Watson:                   2.044
Prob(Omnibus):                  0.061   Jarque-Bera (JB):                5.476
Skew:                          -0.177   Prob(JB):                       0.0647
Kurtosis:                       3.080   Cond. No.                         1.87
==============================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
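
Both estimates recover the true effect used to generate the data (beta = 5): the do-sampler gives 5.01 with an interval of about ±0.21, and OLS puts the treatment coefficient x2 at 5.0888 with [4.990, 5.187]. A hedged one-liner for reading that coefficient out of the fitted result:

result.params[1]  # coefficient on the treatment column v0; ~5.09 for this draw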