Basic Example for generating samples from a GCM
A graphical causal model (GCM) describes the data generation process of the modeled variables. Therefore, after we fit a GCM, we can also generate completely new samples from it and, thus, can see it as data generator for synthetic data based on the underlying models. Generating new samples can generally be done by sorting the nodes in topological order, randomly sample from root-nodes and then propagate the data through the graph by evaluating the downstream causal mechanisms with randomly
sampled noise. The dowhy.gcm
package provides a simple helper function that does this automatically and, by this, offers a simple API to draw samples from a GCM.
Lets take a look at the following example:
[1]:
import numpy as np, pandas as pd
X = np.random.normal(loc=0, scale=1, size=1000)
Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
data.head()
[1]:
X | Y | Z | |
---|---|---|---|
0 | 0.744900 | 1.775248 | 6.764601 |
1 | 0.224922 | 1.275160 | 3.555186 |
2 | 1.957029 | 4.506293 | 13.400790 |
3 | -0.723889 | -1.468503 | -2.538700 |
4 | 0.559098 | 1.021694 | 3.500748 |
Similar as in the introduction, we generate data for the simple linear DAG X→Y→Z. Lets define the GCM and fit it to the data:
[2]:
import networkx as nx
import dowhy.gcm as gcm
causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))
gcm.auto.assign_causal_mechanisms(causal_model, data) # Automatically assigns additive noise models to non-root nodes
gcm.fit(causal_model, data)
Fitting causal mechanism of node Z: 100%|██████████| 3/3 [00:00<00:00, 424.02it/s]
We now learned the generative models of the variables, based on the defined causal graph and the additive noise model assumption. To generate new samples from this model, we can now simply call:
[3]:
generated_data = gcm.draw_samples(causal_model, num_samples=1000)
generated_data.head()
[3]:
X | Y | Z | |
---|---|---|---|
0 | -0.707750 | -2.036199 | -6.808362 |
1 | -0.616463 | -1.495279 | -4.858298 |
2 | -1.334419 | -2.755878 | -6.347658 |
3 | 0.384883 | 1.536390 | 5.148656 |
4 | -0.423618 | -2.546439 | -6.899461 |
If our modeling assumptions are correct, the generated data should now resemble the observed data distribution, i.e. the generated samples correspond to the joint distribution we defined for our example data at the beginning. One way to make sure of this is to estimate the KL-divergence between observed and generated distribution. For this, we can make use of the evaluation module:
[4]:
print(gcm.evaluate_causal_model(causal_model, data, evaluate_causal_mechanisms=False, evaluate_invertibility_assumptions=False))
Test permutations of given graph: 100%|██████████| 6/6 [00:00<00:00, 18.14it/s]
Evaluated and the overall average KL divergence between generated and observed distribution and the graph structure. The results are as follows:
==== Evaluation of Generated Distribution ====
The overall average KL divergence between the generated and observed distribution is 0.0
The estimated KL divergence indicates an overall very good representation of the data distribution.
==== Evaluation of the Causal Graph Structure ====
+-------------------------------------------------------------------------------------------------------+
| Falsificaton Summary |
+-------------------------------------------------------------------------------------------------------+
| The given DAG is not informative because 2 / 6 of the permutations lie in the Markov |
| equivalence class of the given DAG (p-value: 0.33). |
| The given DAG violates 0/1 LMCs and is better than 66.7% of the permuted DAGs (p-value: 0.33). |
| Based on the provided significance level (0.2) and because the DAG is not informative, |
| we do not reject the DAG. |
+-------------------------------------------------------------------------------------------------------+
==== NOTE ====
Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms.
All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.
This confirms that the generated distribution is close to the observed one.
While the evaluation provides us insights toward the causal graph structure as well, we cannot confirm the graph structure, only reject it if we find inconsistencies between the dependencies of the observed structure and what the graph represents. In our case, we do not reject the DAG, but there are other equivalent DAGs that would not be rejected as well. To see this, consider the example above - X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y).