Basic Example for Graphical Causal Model-Based Intervention
Step 1: Modeling cause-effect relationships as a structural causal model (SCM)
The first step is to model the cause-effect relationships between the variables relevant to our use case. We do that in the form of a causal graph. A causal graph is a directed acyclic graph (DAG) in which an edge X→Y means that X causes Y. Statistically, a causal graph encodes the conditional independence relations between variables. We can create causal graphs with the NetworkX library. In the snippet below, we create the chain X→Y→Z:
[1]:
import networkx as nx
causal_graph = nx.DiGraph([('X', 'Y'), ('Y', 'Z')])
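Before building on the graph, it can be worth checking that it really is acyclic and that each node has the parents we expect. The following is a minimal sketch using standard NetworkX calls; the graph is rebuilt so that the snippet runs on its own, and the variable names `is_dag`, `parents`, and `order` are chosen here for illustration:

```python
import networkx as nx

# Same chain as in the cell above: X -> Y -> Z
causal_graph = nx.DiGraph([('X', 'Y'), ('Y', 'Z')])

# A causal graph must not contain directed cycles.
is_dag = nx.is_directed_acyclic_graph(causal_graph)

# The parents of a node are its direct causes; root nodes have none.
parents = {node: sorted(causal_graph.predecessors(node)) for node in causal_graph.nodes}

# A topological order lists every cause before its effects.
order = list(nx.topological_sort(causal_graph))

print(is_dag, parents, order)
```

For larger graphs, a check like this tends to catch a reversed or missing edge early, before any mechanism is fitted to the wrong parent set.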
To answer causal questions using causal graphs, we also have to know the nature of the underlying data-generating process of the variables. A causal graph by itself, being a diagram, does not carry any information about the data-generating process. To introduce this data-generating process, we use an SCM that’s built on top of our causal graph:
[2]:
from dowhy import gcm
causal_model = gcm.StructuralCausalModel(causal_graph)
At this point we would normally load our dataset. For this introduction, we generate some synthetic data instead. The API takes data in the form of Pandas DataFrames:
[3]:
import numpy as np
import pandas as pd
X = np.random.normal(loc=0, scale=1, size=1000)
Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
data.head()
[3]:
| | X | Y | Z |
|---|---|---|---|
| 0 | -0.966829 | -1.081750 | -3.308715 |
| 1 | -1.531191 | -3.809681 | -13.216438 |
| 2 | 1.526233 | 2.653096 | 8.126527 |
| 3 | 1.182747 | 1.714520 | 4.881003 |
| 4 | 0.827234 | 2.410748 | 8.201742 |
Note how the columns X, Y, Z correspond to our nodes X, Y, Z in the graph constructed above. We can also see how the values of X influence the values of Y and how the values of Y influence the values of Z in that data set.
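The chain X→Y→Z implies that X and Z are dependent, yet become independent once Y is known. The sketch below illustrates this on freshly generated data with the same coefficients as the synthetic data. It is only a rough check under stated assumptions: the helper `residual` is a name introduced here, the sample size is increased to reduce sampling noise, and partial correlation stands in for conditional independence, which is reasonable only because this example is linear with Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # larger than the 1000 rows used in the notebook, only to reduce sampling noise

# Same linear chain as the synthetic data: X -> Y -> Z
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)
Z = 3 * Y + rng.normal(size=n)

def residual(target, predictor):
    """Return the part of `target` left over after a least-squares line on `predictor`."""
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (slope * predictor + intercept)

# Marginally, X and Z are strongly correlated ...
marginal_corr = np.corrcoef(X, Z)[0, 1]

# ... but after removing what Y explains, X carries no extra information about Z.
partial_corr = np.corrcoef(residual(X, Y), residual(Z, Y))[0, 1]

print(f"corr(X, Z)     = {marginal_corr:.3f}")
print(f"corr(X, Z | Y) = {partial_corr:.3f}")
```

The first value should be clearly positive and the second close to zero, which is the statistical footprint of the graph: all influence of X on Z passes through Y.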
The causal model created above now allows us to assign a causal mechanism to each node in the form of a functional causal model. These mechanisms can either be assigned manually, for instance when prior knowledge about certain causal relationships is available, or they can be assigned automatically using the auto module. For the latter, we simply call:
[4]:
gcm.auto.assign_causal_mechanisms(causal_model, data)
If we want more control over the assigned mechanisms, we can also assign them manually. For instance, we can assign an empirical distribution to the root node X and linear additive noise models to the nodes Y and Z:
[5]:
causal_model.set_causal_mechanism('X', gcm.EmpiricalDistribution())
causal_model.set_causal_mechanism('Y', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
causal_model.set_causal_mechanism('Z', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
In the real world, data comes as an opaque stream of values, and we typically don’t know how one variable influences another. Graphical causal models can help us reconstruct these causal relationships from the data, even though we didn’t know them beforehand.
Step 2: Fitting the SCM to the data
With the data at hand and the graph constructed earlier, we can now train the SCM using fit:
[6]:
gcm.fit(causal_model, data)
Fitting causal mechanism of node Z: 100%|██████████| 3/3 [00:00<00:00, 565.24it/s]
Fitting means that we learn the generative model of each variable in the SCM from the data.
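To make "learning a generative model" more concrete, here is a hand-rolled sketch of what fitting could look like for a root node and for a linear additive noise model. It is not DoWhy's implementation; the functions `fit_linear_anm` and `sample_linear_anm` are hypothetical helpers written for this illustration, and only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2 * X + rng.normal(size=1000)

def fit_linear_anm(parent, child):
    """Fit child = slope * parent + intercept + noise by least squares."""
    slope, intercept = np.polyfit(parent, child, deg=1)
    # The residuals are kept as an empirical sample of the noise term.
    noise = child - (slope * parent + intercept)
    return slope, intercept, noise

def sample_linear_anm(model, parent_values, rng):
    """Generate child values: linear prediction plus noise resampled from the residuals."""
    slope, intercept, noise = model
    return slope * parent_values + intercept + rng.choice(noise, size=len(parent_values))

# Non-root node: learn the function and the noise from the data.
model_Y = fit_linear_anm(X, Y)

# Root node: an empirical distribution simply resamples the observed values.
new_X = rng.choice(X, size=1000)
new_Y = sample_linear_anm(model_Y, new_X, rng)

print(f"fitted slope for X -> Y: {model_Y[0]:.2f}")
```

The fitted slope should land near the true coefficient 2 used to generate the data. Once every node has such a fitted mechanism, new samples can be generated by walking the graph from causes to effects.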
Step 3: Answering a causal query based on the SCM
The last step, answering a causal question, is our actual goal. For example, we could ask:
What will happen to the variable Z if I intervene on Y?
This can be done via the interventional_samples function. Here’s how:
[7]:
samples = gcm.interventional_samples(causal_model,
                                     {'Y': lambda y: 2.34},
                                     num_samples_to_draw=1000)
samples.head()
[7]:
| | X | Y | Z |
|---|---|---|---|
| 0 | 0.286891 | 2.34 | 7.766979 |
| 1 | -0.171419 | 2.34 | 6.334866 |
| 2 | 0.894578 | 2.34 | 7.268432 |
| 3 | 1.755129 | 2.34 | 8.133178 |
| 4 | -0.695842 | 2.34 | 6.380478 |
This intervention says: “I’ll ignore any causal effects of X on Y, and set every value of Y to 2.34.” The distribution of X therefore remains unchanged, Y is fixed at 2.34, and Z responds according to its causal mechanism.
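Because the data in this example is synthetic, the same intervention can also be simulated by hand from the true data-generating process, which gives a reference point for the sampled values. The sketch below assumes the ground-truth coefficients 2 and 3 from the data-generation cell rather than any fitted model, and the names `X_do`, `Y_do`, and `Z_do` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observational world: Y listens to X, and Z listens to Y.
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)
Z = 3 * Y + rng.normal(size=n)

# Interventional world, do(Y := 2.34): the mechanism of Y is replaced by a constant,
# X keeps its own mechanism, and Z is regenerated from the new Y.
X_do = rng.normal(size=n)
Y_do = np.full(n, 2.34)
Z_do = 3 * Y_do + rng.normal(size=n)

print(f"mean of Z, observational:  {Z.mean():.2f}")
print(f"mean of Z, do(Y := 2.34):  {Z_do.mean():.2f}")
print(f"corr(X, Z), do(Y := 2.34): {np.corrcoef(X_do, Z_do)[0, 1]:.2f}")
```

Under the intervention, Z should centre around 3 × 2.34 = 7.02, and X and Z should be essentially uncorrelated, because cutting the edge X→Y removes the only path from X to Z.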