Demo for the DoWhy causal API
We show a simple example of adding a causal extension to any pandas DataFrame: importing dowhy.api attaches a causal accessor to DataFrames, which makes the do-operation available as df.causal.do(...).
[1]:
import dowhy.datasets
import dowhy.api
import numpy as np
import pandas as pd
from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to the outcome; without it, the variance of Y|X,Z is zero and MCMC fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'
treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
| | W0 | v0 | y |
|---|---|---|---|
| 0 | 0.615341 | False | -0.216582 |
| 1 | 1.547448 | True | 3.791882 |
| 2 | -1.220144 | True | 4.319986 |
| 3 | -0.167506 | True | 6.049082 |
| 4 | 2.950866 | True | 4.455041 |
| ... | ... | ... | ... |
| 995 | 0.978907 | True | 6.964400 |
| 996 | 2.768105 | True | 4.851054 |
| 997 | 0.657292 | True | 7.053682 |
| 998 | 0.462598 | False | 0.075282 |
| 999 | 0.544589 | True | 3.206106 |

1000 rows × 3 columns
[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot: xlabel='v0'>
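The do-operation above returns a new DataFrame sampled from the interventional distribution, so grouping by the treatment and averaging estimates the interventional means of $y$. A minimal sketch that reads those means off numerically rather than plotting them (same arguments as the cell above; the resampled values vary from run to run):

```python
# Hypothetical follow-up cell: reuse the do-sample to print the
# interventional means of y under each treatment value.
cdf = df.causal.do(x=treatment,
                   variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                   outcome=outcome,
                   common_causes=[common_cause],
                   proceed_when_unidentifiable=True)
print(cdf.groupby(treatment)[outcome].mean())
```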
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot: xlabel='v0'>
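Unlike cell [3], where passing the bare column name x=treatment yields an intervention whose value is sampled (so both treatment levels appear in the result), x={treatment: 1} fixes the intervention to $do(v_0 = 1)$; method='weighting' selects the propensity-weighting sampler explicitly.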
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
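These two calls pass the causal graph via dot_graph rather than listing common_causes explicitly, so the adjustment set is read off the graph. No method is specified; judging by the propensity_score and weight columns in the frames below, the same weighting sampler is used by default.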
[6]:
cdf_0
[6]:
| | W0 | v0 | y | propensity_score | weight |
|---|---|---|---|---|---|
| 0 | 0.067560 | False | 1.911379 | 0.466482 | 2.143707 |
| 1 | 0.823216 | False | -1.139482 | 0.298868 | 3.345962 |
| 2 | 1.662468 | False | 0.487500 | 0.161029 | 6.210075 |
| 3 | 1.219407 | False | -0.734138 | 0.226293 | 4.419058 |
| 4 | 3.170608 | False | 0.352740 | 0.043754 | 22.854844 |
| ... | ... | ... | ... | ... | ... |
| 995 | 2.290390 | False | 0.137876 | 0.095559 | 10.464770 |
| 996 | 1.037646 | False | 0.709049 | 0.257967 | 3.876460 |
| 997 | -0.134015 | False | 1.177118 | 0.514338 | 1.944246 |
| 998 | 0.438583 | False | -2.028024 | 0.380597 | 2.627451 |
| 999 | -0.018765 | False | -1.847223 | 0.486952 | 2.053589 |

1000 rows × 5 columns
[7]:
cdf_1
[7]:
| | W0 | v0 | y | propensity_score | weight |
|---|---|---|---|---|---|
| 0 | 0.151752 | True | 5.144520 | 0.553375 | 1.807092 |
| 1 | -0.361275 | True | 5.457729 | 0.432065 | 2.314466 |
| 2 | -0.516995 | True | 4.685548 | 0.396163 | 2.524211 |
| 3 | -0.151791 | True | 7.376667 | 0.481441 | 2.077096 |
| 4 | 0.533555 | True | 4.762663 | 0.640448 | 1.561408 |
| ... | ... | ... | ... | ... | ... |
| 995 | 2.328699 | True | 5.015506 | 0.907543 | 1.101876 |
| 996 | 0.922553 | True | 6.789926 | 0.720539 | 1.387849 |
| 997 | 0.890339 | True | 5.813846 | 0.714331 | 1.399912 |
| 998 | -0.380071 | True | 4.574668 | 0.427686 | 2.338166 |
| 999 | 1.219794 | True | 3.804645 | 0.773772 | 1.292371 |

1000 rows × 5 columns
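The weighting sampler attaches each unit's estimated propensity score and the resulting inverse-probability weight to the returned frame. If, as the rows above suggest, the weight is simply the reciprocal of the propensity of the intervened-on treatment value, then the following check (a sketch, using the np imported earlier) should print True for both frames:

```python
# Sanity check: under inverse-probability weighting we expect
# weight == 1 / propensity_score for the value fixed by the do-operation.
print(np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score']))
print(np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score']))
```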
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal data frame and compute a 95% confidence interval.
[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
$\displaystyle 5.18094679145088$
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
$\displaystyle 0.0932881847772126$
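This is the half-width of the usual normal-approximation 95% interval for a mean difference:

$$\widehat{\mathrm{ATE}} \pm 1.96\,\frac{\hat{\sigma}}{\sqrt{n}}, \qquad \hat{\sigma} = \operatorname{std}\!\left(y_{do(v_0=1)} - y_{do(v_0=0)}\right),\quad n = 1000.$$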
Next, we compare to the estimate from OLS.
[10]:
# Regress the outcome on the common cause and the treatment (no intercept term).
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
| Dep. Variable: | y | R-squared (uncentered): | 0.945 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.945 |
| Method: | Least Squares | F-statistic: | 8643. |
| Date: | Tue, 06 Dec 2022 | Prob (F-statistic): | 0.00 |
| Time: | 09:38:55 | Log-Likelihood: | -1450.1 |
| No. Observations: | 1000 | AIC: | 2904. |
| Df Residuals: | 998 | BIC: | 2914. |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 0.2364 | 0.035 | 6.773 | 0.000 | 0.168 | 0.305 |
| x2 | 5.0849 | 0.053 | 95.597 | 0.000 | 4.981 | 5.189 |
| Omnibus: | 0.226 | Durbin-Watson: | 1.922 |
|---|---|---|---|
| Prob(Omnibus): | 0.893 | Jarque-Bera (JB): | 0.264 |
| Skew: | 0.035 | Prob(JB): | 0.876 |
| Kurtosis: | 2.962 | Cond. No. | 2.46 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
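In the regression above, x1 is the common cause W0 and x2 is the treatment v0 (the column order passed to OLS). The OLS treatment estimate of 5.0849 (95% CI [4.981, 5.189]) and the causal estimate of roughly 5.18 ± 0.09 are both close to the true effect beta=5 used to generate the data. To pull the treatment coefficient out directly:

```python
# x2 is the second regressor (the treatment), so its fitted
# coefficient is result.params[1]; compare with beta = 5.
print(result.params[1])
```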