Demo for the DoWhy causal API#

We show a simple example of adding a causal extension to any pandas DataFrame.

[1]:
import dowhy.datasets
import dowhy.api
from dowhy.graph import build_graph_from_str

import numpy as np
import pandas as pd

from statsmodels.api import OLS
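
Importing dowhy.api is what attaches the causal accessor to pandas DataFrames (via pandas' accessor-registration mechanism), so no further setup is needed. A minimal sanity check, assuming the accessor is registered on the DataFrame class in the usual way:

# After `import dowhy.api`, the accessor should be registered on the
# DataFrame class itself, so every DataFrame exposes `.causal`.
assert hasattr(pd.DataFrame, 'causal')
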
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments=0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to the data. Without noise, the variance in Y|X, Z is zero, and MCMC fails.
nx_graph = build_graph_from_str(data["dot_graph"])

# Variable names generated by the dataset
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
            W0     v0         y
0    -1.092232  False -1.467122
1    -0.553488   True  5.092333
2    -2.721292  False  0.557821
3    -0.860433  False -0.864048
4    -1.078330   True  3.316149
..         ...    ...       ...
995  -1.934879  False -0.464891
996  -0.240524  False -0.196877
997  -1.365115  False  1.737968
998  -1.312072  False -0.211613
999  -2.285569  False -2.284761

1000 rows × 3 columns
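
Since W0 is a common cause of both v0 and y, a naive comparison of raw group means is confounded; the do-operation below adjusts for it. For reference, a quick sketch of the naive estimate:

# Naive (confounded) difference of raw group means; with a common cause
# present, this generally deviates from the true effect of 5.
naive = df.groupby(treatment)[outcome].mean()
print(naive[True] - naive[False])
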

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
            ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
[Figure: bar plot of the mean outcome y for each value of the treatment v0]
[4]:
df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause]
              ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
[Figure: bar plot of the mean outcome y under the intervention do(v0 = 1)]
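
Here x={treatment: 1} requests the point intervention do(v0 = 1), and method='weighting' selects the propensity-weighting do-sampler. As a rough cross-check, here is a minimal hand-rolled inverse-propensity-weighting estimate of the same effect (a sketch, not DoWhy's internals; it assumes a logistic model for P(v0 | W0) and uses scikit-learn for brevity):

# Hand-rolled IPW estimate of the average treatment effect.
from sklearn.linear_model import LogisticRegression

ps_model = LogisticRegression().fit(df[[common_cause]], df[treatment])
ps = ps_model.predict_proba(df[[common_cause]])[:, 1]  # P(v0=True | W0)
t = df[treatment].astype(float)
ipw_ate = np.mean(t * df[outcome] / ps) - np.mean((1 - t) * df[outcome] / (1 - ps))
print(ipw_ate)  # should land near the true effect of 5
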
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )

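Passing graph=nx_graph lets the do-sampler read the adjustment set off the graph, so common_causes no longer needs to be spelled out. Assuming build_graph_from_str returns a networkx.DiGraph (as its use above suggests), the parsed graph can be inspected directly:

# The generated graph should have W0 as a common cause of v0 and y.
print(list(nx_graph.edges()))  # expected: [('W0', 'v0'), ('W0', 'y'), ('v0', 'y')]
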
[6]:
cdf_0
[6]:
            W0     v0         y  propensity_score    weight
0    -0.762795  False -1.106836          0.741823  1.348030
1    -0.639424  False -1.569639          0.713176  1.402178
2     0.270371  False  0.446674          0.461197  2.168270
3    -1.889006  False  0.264200          0.914941  1.092967
4    -0.732039  False  0.578589          0.734859  1.360806
..         ...    ...       ...               ...       ...
995  -2.200991  False  1.705738          0.939414  1.064493
996   0.338915  False -0.224639          0.441305  2.266005
997  -0.556969  False -0.198539          0.693007  1.442988
998  -1.732231  False  0.170568          0.899507  1.111720
999  -0.504828  False -0.317645          0.679853  1.470905

1000 rows × 5 columns

[7]:
cdf_1
[7]:
            W0    v0         y  propensity_score    weight
0    -0.687302  True  3.268349          0.275483  3.629985
1    -1.452541  True  5.436494          0.134246  7.449006
2    -0.978979  True  4.749862          0.212678  4.701941
3    -0.397370  True  5.544810          0.348158  2.872257
4     0.406560  True  5.071772          0.578143  1.729677
..         ...   ...       ...               ...       ...
995  -0.943522  True  5.675177          0.219720  4.551243
996  -1.128276  True  6.212309          0.184846  5.409895
997   0.130132  True  6.075339          0.497787  2.008891
998   0.092802  True  4.321617          0.486851  2.054015
999  -0.357319  True  4.093820          0.358886  2.786397

1000 rows × 5 columns
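
The extra columns record the estimated propensity scores and the resulting sampling weights. Consistent with inverse propensity weighting, each weight in the rows above is the reciprocal of its propensity score; a quick check:

# weight should equal 1 / propensity_score (up to floating point).
print(np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score']))
print(np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score']))
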

Comparing the estimate to Linear Regression#

First, we estimate the effect using the causal data frame, along with its 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 5.00588480884916$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.0885944212046903$
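
Putting the two numbers together, the interval mean ± 1.96·s/√n comfortably covers the true effect beta=5 used to generate the data. A quick sketch using the quantities above:

# 95% confidence interval for the ATE; the data were generated with beta=5.
effect = cdf_1['y'] - cdf_0['y']
ate = effect.mean()
half_width = 1.96 * effect.std() / np.sqrt(len(df))
print(ate - half_width, ate + half_width)  # roughly (4.92, 5.09)
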

Next, we compare this to the estimate from OLS.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.906
Model:                            OLS   Adj. R-squared (uncentered):              0.906
Method:                 Least Squares   F-statistic:                              4820.
Date:                Mon, 21 Apr 2025   Prob (F-statistic):                        0.00
Time:                        07:38:51   Log-Likelihood:                         -1406.3
No. Observations:                1000   AIC:                                      2817.
Df Residuals:                     998   BIC:                                      2826.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0705      0.029      2.444      0.015       0.014       0.127
x2             4.9838      0.051     97.384      0.000       4.883       5.084
==============================================================================
Omnibus:                        2.579   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.275   Jarque-Bera (JB):                2.490
Skew:                           0.075   Prob(JB):                        0.288
Kurtosis:                       2.808   Cond. No.                         1.79
==============================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
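
For a direct numeric comparison with the causal estimate above, the treatment coefficient and its 95% interval can be pulled out of the fitted results (standard statsmodels attributes; the treatment is the second regressor, x2):

# Treatment effect as estimated by OLS: coefficient on x2 (= v0) and
# its 95% confidence interval, matching the summary table above.
print(result.params[1])      # ~4.98
print(result.conf_int()[1])  # ~[4.88, 5.08]
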