Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any dataframe: importing dowhy.api registers a causal accessor on pandas DataFrames, so df.causal.do() becomes available everywhere.

[1]:
import dowhy.datasets
import dowhy.api
from dowhy.graph import build_graph_from_str

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments=0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
# Add noise to the outcome. Without noise, the variance of Y | X, Z is zero and MCMC fails.
df['y'] = df['y'] + np.random.normal(size=len(df))
nx_graph = build_graph_from_str(data["dot_graph"])

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
           W0     v0         y
0   -1.352371  False -2.800845
1    1.840101   True  5.312718
2    1.462809   True  5.135988
3   -1.268049  False -0.090203
4   -1.062236  False  1.062694
..        ...    ...       ...
995  0.278762  False  0.058603
996  1.429788   True  9.476343
997 -2.136987  False -2.413670
998 -2.070323  False -1.066857
999 -0.415496   True  3.013789

[1000 rows x 3 columns]

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
            ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<Axes: xlabel='v0'>
[figure: bar chart of the mean outcome y for each value of the treatment v0]
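
The same interventional sample can be summarized numerically instead of plotted. As a quick sketch (not part of the original notebook), the difference between the two group means approximates the average treatment effect and should be close to the true beta=5:

do_df = df.causal.do(x=treatment,
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     common_causes=[common_cause])
# Mean outcome under do(v0=True) minus mean outcome under do(v0=False)
group_means = do_df.groupby(treatment)[outcome].mean()
print(group_means[True] - group_means[False])  # expected to be near beta=5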
[4]:
df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause]
              ).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<Axes: xlabel='v0'>
[figure: bar chart of the mean outcome y under do(v0=1), estimated by weighting]
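
With method='weighting', the do-operation resamples rows using inverse propensity weights. As a sanity-check sketch (assuming the returned frame carries the propensity_score and weight columns shown in the outputs below), each weight should be the reciprocal of the corresponding propensity score:

do_w = df.causal.do(x={treatment: 1},
                    variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                    outcome=outcome,
                    method='weighting',
                    common_causes=[common_cause])
# e.g. propensity 0.2215 -> weight 4.5139, as in the tables below
print(np.allclose(do_w['weight'], 1.0 / do_w['propensity_score']))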
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              graph=nx_graph
              )
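
These two calls pass the causal graph itself rather than an explicit list of common causes, so DoWhy identifies the adjustment set from the graph. A small sketch (a hypothetical check, assuming build_graph_from_str returns a networkx digraph) to confirm the confounding structure:

# The generated graph should contain W0 -> v0, W0 -> y, and v0 -> y
print(sorted(nx_graph.edges()))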

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score    weight
0    2.404146  False  3.487173          0.221540  4.513856
1   -0.119822  False  0.453772          0.542849  1.842132
2   -0.614196  False  0.697349          0.611026  1.636591
3   -0.064320  False  1.124111          0.535044  1.869006
4   -0.137764  False -1.384972          0.545368  1.833623
..        ...    ...       ...               ...       ...
995 -0.776681  False -1.094488          0.632648  1.580659
996  0.397365  False  0.620665          0.469811  2.128515
997  0.713727  False  2.663563          0.425568  2.349803
998  1.595413  False  2.558598          0.310243  3.223276
999  1.712484  False  2.700458          0.296245  3.375582

[1000 rows x 5 columns]

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0   -0.607244  True  3.065915          0.389909  2.564699
1   -0.777414  True  4.223191          0.367256  2.722898
2   -2.025207  True  2.215251          0.222658  4.491188
3   -1.701738  True  1.773618          0.255943  3.907120
4   -0.578541  True  5.247765          0.393781  2.539485
..        ...   ...       ...               ...       ...
995  0.044282  True  5.576950          0.480276  2.082137
996 -2.149224  True  1.854231          0.210746  4.745046
997 -0.777414  True  4.223191          0.367256  2.722898
998  1.066502  True  5.893722          0.622371  1.606758
999 -0.543172  True  3.854451          0.398569  2.508973

[1000 rows x 5 columns]
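
Each returned frame is a sample from the corresponding interventional distribution, so the treatment column is constant. A quick sketch to verify:

# cdf_1 fixes v0=True everywhere; cdf_0 fixes v0=False everywhere
print(cdf_1[treatment].all(), (~cdf_0[treatment]).all())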

Comparing the estimate to linear regression

First, we estimate the effect using the causal data frame, along with a 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.03222846738008
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.125764811665961
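
Putting the two numbers together, a small convenience sketch for reporting the interval; it should cover the true effect beta=5:

# Point estimate and half-width of the 95% confidence interval
effect = (cdf_1['y'] - cdf_0['y']).mean()
half_width = 1.96 * (cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
print(f"ATE = {effect:.3f} +/- {half_width:.3f}")  # roughly 5.032 +/- 0.126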

Next, we compare to the estimate from OLS.

[10]:
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.929
Model:                            OLS   Adj. R-squared (uncentered):              0.928
Method:                 Least Squares   F-statistic:                              6487.
Date:                Mon, 25 Dec 2023   Prob (F-statistic):                        0.00
Time:                        07:24:16   Log-Likelihood:                         -1372.1
No. Observations:                1000   AIC:                                      2748.
Df Residuals:                     998   BIC:                                      2758.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.0182      0.028     36.171      0.000       0.963       1.073
x2             5.0284      0.046    108.991      0.000       4.938       5.119
==============================================================================
Omnibus:                        2.186   Durbin-Watson:                   1.932
Prob(Omnibus):                  0.335   Jarque-Bera (JB):                2.056
Skew:                           0.045   Prob(JB):                        0.358
Kurtosis:                       2.797   Cond. No.                        1.64
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
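
The OLS coefficient on the treatment column (x2 above, about 5.03) agrees with the do-operator estimate and with the true beta=5. OLS recovers the effect here because the data-generating process is linear; the do-operator estimate makes no such parametric assumption. As a final sketch, the fitted coefficients can be read off directly:

# result.params holds the coefficients in column order: [common_cause, treatment]
print(result.params)  # the second entry should be close to 5.03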