Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any dataframe.

[1]:
import dowhy.datasets
import dowhy.api
from dowhy.graph import build_graph_from_str

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments = 0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df)) # Adding noise to data. Without noise, the variance in Y|X, Z is zero, and mcmc fails.
nx_graph = build_graph_from_str(data["dot_graph"])

treatment= data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
W0 v0 y
0 -0.330622 False 0.887729
1 -0.691849 False -1.118670
2 0.193915 False -0.966624
3 -0.535931 True 2.642023
4 2.334936 True 7.962068
... ... ... ...
995 0.339981 False -0.948675
996 -0.919134 False 0.069790
997 -1.140944 False -0.716580
998 0.352853 True 4.586724
999 -0.983083 True 6.705248

1000 rows × 3 columns

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
            ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
../_images/example_notebooks_dowhy_causal_api_3_1.png
[4]:
df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause]
              ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
../_images/example_notebooks_dowhy_causal_api_4_1.png
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )

[6]:
cdf_0
[6]:
W0 v0 y propensity_score weight
0 -0.416171 False -1.103170 0.587029 1.703494
1 1.020597 False 1.610446 0.256853 3.893284
2 0.289512 False 1.261112 0.415118 2.408955
3 -0.807504 False -1.508775 0.676307 1.478619
4 -0.127578 False 0.523608 0.516908 1.934579
... ... ... ... ... ...
995 0.193915 False -0.966624 0.438129 2.282434
996 -0.663361 False -1.209901 0.644508 1.551571
997 0.507649 False -0.926918 0.364117 2.746373
998 1.261393 False 0.781840 0.214268 4.667049
999 -1.187408 False -1.447086 0.752270 1.329309

1000 rows × 5 columns

[7]:
cdf_1
[7]:
W0 v0 y propensity_score weight
0 -1.096630 True 1.579037 0.264752 3.777119
1 0.883869 True 6.774463 0.716633 1.395414
2 -1.335124 True 2.590650 0.221639 4.511850
3 -1.064736 True 4.180466 0.270908 3.691296
4 -0.712749 True 2.574438 0.344435 2.903306
... ... ... ... ... ...
995 -0.394069 True 4.852584 0.418255 2.390889
996 -2.109866 True -0.132723 0.117258 8.528192
997 1.342959 True 7.022089 0.798937 1.251663
998 0.263952 True 3.750980 0.578761 1.727828
999 -1.284152 True 2.070227 0.230414 4.340014

1000 rows × 5 columns

Comparing the estimate to Linear Regression

First, estimating the effect using the causal data frame, and the 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 4.86323421622536$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.141690908498881$

Comparing to the estimate from OLS.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
OLS Regression Results
Dep. Variable: y R-squared (uncentered): 0.921
Model: OLS Adj. R-squared (uncentered): 0.921
Method: Least Squares F-statistic: 5840.
Date: Wed, 06 Dec 2023 Prob (F-statistic): 0.00
Time: 08:43:32 Log-Likelihood: -1436.2
No. Observations: 1000 AIC: 2876.
Df Residuals: 998 BIC: 2886.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 1.1002 0.028 39.374 0.000 1.045 1.155
x2 4.9739 0.050 99.704 0.000 4.876 5.072
Omnibus: 0.374 Durbin-Watson: 1.968
Prob(Omnibus): 0.829 Jarque-Bera (JB): 0.380
Skew: 0.047 Prob(JB): 0.827
Kurtosis: 2.984 Cond. No. 1.79


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.