Demystifying Causality: An Introduction in Causal Inference and Applications. Part 4.

IvanGor
9 min read · Jul 2, 2023

In previous posts, we discussed identifying causal models and calculating ATEs using causal graphs. However, what if we don’t have access to a causal graph? In such situations, causal discovery methods can be incredibly helpful. These techniques offer an exciting way to gain insights into data structures and causal relationships directly from the data itself.

We previously mentioned the Markov assumption. To refresh your memory, here’s the Global Markov Assumption:

Given that P is Markov with respect to G (that is, P satisfies the local Markov assumption), if X and Y are d-separated in G conditioned on Z, then X and Y are independent in P conditioned on Z. This can be written as:

X ⊥_G Y | Z  ⟹  X ⊥_P Y | Z

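When the causal graph is unknown, we cannot check d-separation in G (the premise of the theorem), but we can test conditional independence in P (its conclusion) on data. To reason from independencies in the data back to the graph, we need the converse implication, which is the Faithfulness assumption:

X ⊥_P Y | Z  ⟹  X ⊥_G Y | Z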

This assumption is quite strong and may not hold true in some structural models. Let’s consider an example from Brady Neal:

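In this case, we can choose parameters alpha, beta, gamma, and delta such that the effects along the different paths from A to D cancel exactly, which makes A independent of D in the data. However, as you can see in the graph, A and D are not d-separated, so the independence found in the data does not reflect the graph and faithfulness is violated.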
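To make the cancellation concrete, here is a minimal simulation sketch. It assumes a diamond-shaped graph (A → B → D and A → C → D) with linear Gaussian mechanisms. The assignment of alpha, beta, gamma, and delta to particular edges and their numeric values are illustrative assumptions, not read off the original figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical edge weights, chosen so that the two directed paths cancel:
# alpha * delta + beta * gamma = 1 * 2 + 2 * (-1) = 0
alpha, beta, gamma, delta = 1.0, 2.0, -1.0, 2.0

A = rng.normal(size=n)
B = alpha * A + rng.normal(size=n)               # A -> B
C = beta * A + rng.normal(size=n)                # A -> C
D = delta * B + gamma * C + rng.normal(size=n)   # B -> D and C -> D

corr_AB = np.corrcoef(A, B)[0, 1]
corr_AD = np.corrcoef(A, D)[0, 1]
print(f"corr(A, B) = {corr_AB:.3f}")   # clearly non-zero: A causes B
print(f"corr(A, D) = {corr_AD:.3f}")   # close to zero although A causes D
```

A discovery method that trusts faithfulness would read the near-zero correlation between A and D as "no connection", which is exactly the failure mode described above.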

In addition to the previously mentioned assumptions, we also need to consider Causal Sufficiency:

There are no unobserved confounders for any of the variables in the graph.

Moreover, we assume that the graph is Acyclic.

One challenge in causal discovery is the inability to distinguish certain graphs based solely on the data. Such graphs are called Markov equivalent. For instance, you may recall structures like chains and forks from our previous posts:

These three graphs, the two chains (X1 → X2 → X3 and X1 ← X2 ← X3) and the fork (X1 ← X2 → X3), are Markov equivalent. In all three, the data shows that X1 and X3 are independent when conditioned on X2, while all of the variables are pairwise dependent without conditioning. In situations like this, the only information we can extract is a so-called skeleton:

Thus, we can pin down a graph completely from conditional independencies only when its Markov equivalence class contains that single graph, meaning no other graph implies the same set of independencies.
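The equivalence of chains and forks is easy to check numerically. The sketch below assumes linear Gaussian mechanisms with unit coefficients (an illustrative choice) and uses a partial correlation as the conditional independence measure; partial_corr is a small helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

def simulate_chain():
    # X1 -> X2 -> X3
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(size=n)
    x3 = x2 + rng.normal(size=n)
    return x1, x2, x3

def simulate_fork():
    # X1 <- X2 -> X3
    x2 = rng.normal(size=n)
    x1 = x2 + rng.normal(size=n)
    x3 = x2 + rng.normal(size=n)
    return x1, x2, x3

results = {}
for name, simulate in [("chain", simulate_chain), ("fork", simulate_fork)]:
    x1, x2, x3 = simulate()
    results[name] = (np.corrcoef(x1, x3)[0, 1], partial_corr(x1, x3, x2))
    print(f"{name}: corr(X1, X3) = {results[name][0]:.3f}, "
          f"partial corr given X2 = {results[name][1]:.3f}")
```

Both structures show the same pattern (dependent marginally, independent given X2), so independence tests alone cannot tell them apart.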

We can often benefit from immoralities in causal discovery. As you may recall, an immorality consists of two parents that share a child (a collider) and are not directly connected to each other. The two parents are marginally independent, but conditioning on the collider opens the path between them and makes them conditionally dependent. No chain or fork produces this pattern, so we can confidently orient both edges toward the collider from the data, as long as we keep the three assumptions in mind.

Immoralities and skeletons serve as our primary tools for causal discovery in the following algorithm, known as the PC algorithm:

PC Algorithm:

  1. Build the complete graph
  2. Identify the skeleton
  3. Identify and orient the immoralities
  4. Orient qualifying edges that are incident on colliders

Consider the following graph, with edges A → C, B → C, C → D, and C → E:

Suppose we only observe the values of the five variables. Let’s first build the complete graph:

The next step is to identify the skeleton. To do this, we need to eliminate unnecessary edges from the graph. Examining pairwise dependencies with an empty conditioning set, we find that A and B are independent. This allows us to eliminate the A-B link. Next, we increase the conditioning set to one variable and test the remaining pairs again. Conditional on C, all other pairs (BE, BD, AD, AE, DE) become independent, enabling us to eliminate five more links, whereas A and B become dependent. Since A and B are both linked to C, are not linked to each other, and C was not needed to separate them, A-C-B forms an immorality with collider C. We obtain the following graph:

The final step is to orient the edges D-C and E-C. In this case, it’s quite straightforward. If either edge pointed into C, it would create another immorality at C (for example, A → C ← D), and we would have detected it in the data. Since A-C-B is the only immorality, the edges D-C and E-C must point away from C, giving C → D and C → E. This allows us to infer the causal graph.
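The walkthrough above can be sketched in code. This is a simplified illustration rather than a production implementation: it assumes linear Gaussian data, uses a Fisher z-test on partial correlations as the independence test, and applies only the single orientation rule from step 4. The function names, the coefficients in the simulated data, and the significance level are all assumptions made for this example.

```python
import itertools
import math
import numpy as np

def independent(data, i, j, cond, alpha=0.001):
    """Fisher z-test on the partial correlation of columns i and j given cond."""
    cols = [i, j, *cond]
    precision = np.linalg.inv(np.corrcoef(data[:, cols], rowvar=False))
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = float(np.clip(r, -0.999999, 0.999999))
    z = math.atanh(r) * math.sqrt(data.shape[0] - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha

def pc_sketch(data, alpha=0.001):
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}  # step 1: complete graph
    sepset = {}
    size = 0
    # Step 2: drop an edge once some conditioning set makes its ends independent.
    while any(len(adj[i]) - 1 >= size for i in adj):
        for i in range(p):
            for j in sorted(adj[i]):
                for cond in itertools.combinations(sorted(adj[i] - {j}), size):
                    if independent(data, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(cond)
                        break
        size += 1
    # Step 3: i - k - j with i, j non-adjacent and k outside sepset(i, j)
    # is an immorality, so orient i -> k <- j.
    directed = set()
    for k in range(p):
        for i, j in itertools.combinations(sorted(adj[k]), 2):
            if j not in adj[i] and k not in sepset[frozenset((i, j))]:
                directed |= {(i, k), (j, k)}
    # Step 4: if i -> k, k - m is undirected, and i, m are non-adjacent,
    # orient k -> m (otherwise a new immorality would appear at k).
    changed = True
    while changed:
        changed = False
        for i, k in sorted(directed):
            for m in sorted(adj[k]):
                undirected = (k, m) not in directed and (m, k) not in directed
                if undirected and m not in adj[i]:
                    directed.add((k, m))
                    changed = True
    return adj, directed

# Simulated data from A -> C <- B, C -> D, C -> E (coefficients are arbitrary).
rng = np.random.default_rng(0)
n = 5_000
A = rng.normal(size=n)
B = rng.normal(size=n)
C = A + B + rng.normal(size=n)
D = 0.8 * C + rng.normal(size=n)
E = -0.7 * C + rng.normal(size=n)
names = ["A", "B", "C", "D", "E"]
data = np.column_stack([A, B, C, D, E])

adj, directed = pc_sketch(data)
skeleton = sorted({tuple(sorted((names[i], names[j]))) for i in adj for j in adj[i]})
edges = sorted((names[i], names[j]) for i, j in directed)
print("skeleton:", skeleton)
print("oriented:", edges)
```

With this much data the sketch should recover the four edges and orient them as in the walkthrough, although any finite-sample independence test can occasionally err. A full PC implementation also applies further orientation rules that are omitted here.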

Let’s consider a practical example and try to build a causal graph from raw data. We will generate data from the graph we saw in the previous post:

The data generation process is as follows:

import numpy as np
import pandas as pd

np.random.seed(11)

alpha = 0.5
beta = -5
delta = 0.4
gamma = -0.7

# Z confounds T and Y, M mediates T -> Y, and C is a collider of T and Y
Z = np.random.normal(loc=1, size=500)
T = alpha * Z + np.random.normal(loc=2, size=500)
M = beta * T + np.random.normal(size=500)
Y = delta * Z + gamma * M + np.random.normal(size=500)
C = T + Y + np.random.normal(size=500)

df = pd.DataFrame(np.stack((Z, T, M, Y, C), axis=1),
                  columns=['Z', 'T', 'M', 'Y', 'C'])

To perform causal discovery, we will use the CausalNex library. Note that CausalNex does not implement the PC algorithm we discussed earlier; its structure learner is NOTEARS, a score-based method that treats structure learning as a continuous optimization problem with an acyclicity constraint. Let’s import the necessary libraries and perform the analysis.

import networkx as nx
import matplotlib.pyplot as plt
from causalnex.structure.notears import from_pandas_lasso

# Learning the causal graph with NOTEARS (L1 penalty switched off via beta=0)
sm = from_pandas_lasso(df, beta=0, w_threshold=0.3)

The from_pandas_lasso method runs NOTEARS with an L1 (lasso) penalty of strength beta on the edge weights. Setting beta to zero switches the penalty off, which leaves plain NOTEARS, and w_threshold=0.3 drops every edge whose absolute weight is below 0.3. The higher the threshold, the more links will be pruned. Let’s visualize the final graph:

# Converting the StructureModel to a networkx graph
G = nx.DiGraph(sm.edges)

# Visualizing the graph
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray',
        node_size=3000, font_size=18, font_weight='bold')
plt.show()

As we can see, that’s exactly our graph! Now, we can estimate the causal effect of T on Y using the DoWhy library (which employs methods familiar to those who have read the previous posts).

from dowhy import CausalModel

# Write the networkx graph to a GML file, a format CausalModel can read
nx.write_gml(G, 'test.gml')

# Create the DoWhy causal model
model = CausalModel(
    data=df,
    treatment='T',
    outcome='Y',
    graph='test.gml'
)

Now, DoWhy allows us to explicitly show the assumptions and methods and then estimate the effect:

# Identifying the causal effect
identified_estimand = model.identify_effect()
print(identified_estimand)

This outputs:

Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
────(Expectation(Y|Z))
d[T]
Estimand assumption 1, Unconfoundedness: If U→{T} and U→Y then P(Y|T,Z,U) = P(Y|T,Z)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
Estimand expression:
Expectation(Derivative(Y, [M])*Derivative([M], [T]))
Estimand assumption 1, Full-mediation: M intercepts (blocks) all directed paths from T to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{T} and U→{M} then P(M|T,U) = P(M|T)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{M} and U→Y then P(Y|M, T, U) = P(Y|M, T)

The output shows that both backdoor and frontdoor approaches can be applied to this graph. Let’s try:

# Estimating the causal effect
estimate = model.estimate_effect(identified_estimand,
                                 method_name='backdoor.linear_regression',
                                 confidence_intervals=True)

print("Causal Estimate:", estimate.value)
print("Confidence Intervals: ", estimate.get_confidence_intervals())

This provides:

Causal Estimate: 3.5563634704585154
Confidence Intervals: [[3.45063871 3.66208824]]

The result is very close to what we obtained for that graph in our previous post. The frontdoor method yields the same outcome.
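Because every mechanism in this example is linear, both estimands can be sanity-checked with plain least squares. The sketch below is a hand-rolled check and not DoWhy's implementation. It regenerates data with the same coefficients but a different random generator and a larger sample (both assumptions), so the numbers will not match the output above digit for digit.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000
alpha, beta, delta, gamma = 0.5, -5, 0.4, -0.7

Z = rng.normal(loc=1, size=n)
T = alpha * Z + rng.normal(loc=2, size=n)
M = beta * T + rng.normal(size=n)
Y = delta * Z + gamma * M + rng.normal(size=n)

def ols(y, *regressors):
    """Least-squares coefficients of y on the regressors (intercept dropped)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(design, y, rcond=None)[0][1:]

# Backdoor: adjust for the confounder Z and read off the coefficient on T.
backdoor_effect = ols(Y, T, Z)[0]

# Frontdoor: (effect of T on M) times (effect of M on Y, holding T fixed).
t_on_m = ols(M, T)[0]
m_on_y = ols(Y, M, T)[0]
frontdoor_effect = t_on_m * m_on_y

print("true effect (beta * gamma):", beta * gamma)
print("backdoor estimate:", backdoor_effect)
print("frontdoor estimate:", frontdoor_effect)
```

The true effect of T on Y runs entirely through M, so it equals beta * gamma = 3.5, and both estimates should land close to that value.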

DoWhy also offers another useful tool — the ability to test results for robustness (referred to as the refutation step). We will use several refutation methods:

  1. Random Common Cause: This method adds a randomly generated covariate to the dataset as an extra common cause and re-estimates the causal effect. Because the added variable is independent of everything else, the estimate should barely change; a large shift would indicate that the estimator is unstable.
  2. Placebo Treatment Refuter: This method replaces the original treatment variable with a randomly generated placebo variable (unrelated to the outcome) and re-estimates the causal effect. If the new estimate is close to zero, it suggests that the original estimate is not merely a result of spurious correlations.
  3. Data Subset Refuter: This method re-estimates the causal effect on random subsets of the data. If the new estimates are similar to the original one, it indicates that the original estimate is robust to variations in the data.
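The logic of the three refuters can be illustrated without DoWhy. The following is a simplified, hand-rolled sketch on freshly simulated data with the same coefficients as before. The helper effect_of_first, the number of repetitions, and the 80% subset size are assumptions made for this example; DoWhy's own refuters differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(loc=1, size=n)
T = 0.5 * Z + rng.normal(loc=2, size=n)
M = -5 * T + rng.normal(size=n)
Y = 0.4 * Z - 0.7 * M + rng.normal(size=n)

def effect_of_first(y, *regressors):
    """OLS coefficient on the first regressor, with an intercept included."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

original = effect_of_first(Y, T, Z)

# 1. Random common cause: add an independent noise column as an extra covariate.
with_random_cause = effect_of_first(Y, T, Z, rng.normal(size=n))

# 2. Placebo treatment: shuffle T so it carries no information about Y.
placebo = np.mean([effect_of_first(Y, rng.permutation(T), Z) for _ in range(200)])

# 3. Data subset: re-estimate on random 80% subsets of the rows.
subset_estimates = []
for _ in range(200):
    rows = rng.choice(n, size=int(0.8 * n), replace=False)
    subset_estimates.append(effect_of_first(Y[rows], T[rows], Z[rows]))
subset = np.mean(subset_estimates)

print("original estimate:", original)
print("with random common cause:", with_random_cause)
print("placebo treatment (mean):", placebo)
print("data subsets (mean):", subset)
```

A healthy estimate stays put under the first and third checks and collapses toward zero under the placebo.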

Let’s examine the results:

refutation = model.refute_estimate(identified_estimand, estimate,
                                   method_name='random_common_cause')
print(refutation)

This outputs:

Refute: Add a random common cause
Estimated effect:3.5563634704585154
New effect:3.5564561606343172
p value:0.5

The hypothesis tested here is that the new effect after refutation significantly differs from the previous estimated effect. The null hypothesis is that they are not different. With a p-value of 0.5 we cannot reject the null, so the estimate passes this check.

We obtain similar results with the other refuters. This gives us confidence that our results are robust.

Now for a theoretical bonus.

The Theorem of Markov Completeness states:

If we have multinomial distributions or linear Gaussian structural equations, we can only identify a graph up to its Markov equivalence class.

This implies that, in these settings, no method based on conditional independencies can tell a fork from a chain or distinguish X → Y from Y → X. Any method that claims to do better must rely on additional assumptions. So what happens if we step outside the multinomial and linear Gaussian settings?

Consider a linear non-Gaussian structural equation:

Y := f(X) + U,  X ⊥ U

where f is a linear function, and U is a non-Gaussian random variable. In this case, we have a theorem which states:

In the linear non-Gaussian setting, if the true Structural Causal Model (SCM) is:

Y := f(X) + U,  X ⊥ U

then, there does not exist an SCM in the reverse direction:

X := g(Y) + U',  Y ⊥ U'

that can generate data consistent with P(x, y).

The proof is quite straightforward. I encourage you to explore it in Brady Neal’s book.

Example: the linear non-Gaussian case

Let’s examine a simple example involving two connected nodes A and B (A→B). From the Markov completeness result, we know that conditional independence tests alone can only determine whether the two are connected or not; in other words, they recover the skeleton and not the direction of the edge. Here, both A and the noise in B are non-Gaussian:

np.random.seed(11)

A = 3 + np.random.uniform(-1, 1, size=500)
B = A + np.random.standard_t(5, size=500)
df = pd.DataFrame(np.stack((A, B), axis=1),
                  columns=['A', 'B'])

Let’s visualize the dependency:

And with the axes swapped:

From these plots, it is difficult to determine which variable depends on the other. Both regressions have significant coefficients, as expected:

However, if we examine the distribution of residuals from these regressions by covariate, we observe the following:

We can see that the left one (which shows the residuals from the causal model, B regressed on A) has residuals that are independent of the covariate A. In contrast, the right one exhibits severe heteroscedasticity: the residual is uncorrelated with B by construction, yet we cannot consider the residual and B as independent.
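The same asymmetry can be quantified with a crude dependence score. The sketch below regresses one variable on the other and correlates the squared residual with the squared, centered regressor; for independent quantities this score should be near zero. It is an illustration under stated assumptions, not a formal independence test: it uses uniform noise instead of the Student-t noise above so that the simple score is stable, and it adds a linear Gaussian pair for comparison. The helper residual_dependence is written for this example.

```python
import numpy as np

def residual_dependence(cause, effect):
    """Regress effect on cause; return corr(residual**2, centered cause**2)."""
    slope, intercept = np.polyfit(cause, effect, deg=1)
    residual = effect - (slope * cause + intercept)
    centered = cause - cause.mean()
    return np.corrcoef(residual**2, centered**2)[0, 1]

rng = np.random.default_rng(11)
n = 500

# Linear non-Gaussian model: uniform cause and uniform noise (illustrative choice).
A_u = 3 + rng.uniform(-1, 1, size=n)
B_u = A_u + rng.uniform(-1, 1, size=n)

# Linear Gaussian model with the same structure, for comparison.
A_g = 3 + rng.normal(size=n)
B_g = A_g + rng.normal(size=n)

scores = {
    "uniform, A -> B": residual_dependence(A_u, B_u),
    "uniform, B -> A": residual_dependence(B_u, A_u),
    "gaussian, A -> B": residual_dependence(A_g, B_g),
    "gaussian, B -> A": residual_dependence(B_g, A_g),
}
for label, score in scores.items():
    print(f"{label}: {score:+.3f}")
```

In the non-Gaussian pair, only the causal direction should give a score near zero, whereas in the Gaussian pair both directions do, which is the Markov completeness result in action.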

In conclusion, causal discovery is a powerful tool in the field of causal inference, enabling us to uncover causal relationships directly from the data when we don’t have access to a causal graph. By leveraging assumptions such as Faithfulness, Causal Sufficiency, and Acyclicity, we can apply algorithms like the PC algorithm to identify skeletons and immoralities, which in turn help us to infer causal structures.

In this post, we explored a practical example using synthetic data and showed how causal discovery can lead to accurate causal graphs. We also discussed the limitations of the methods in the context of linear Gaussian and multinomial distributions, as well as the possibility of identifying causal directions when working with linear non-Gaussian structural equations.

While causal discovery techniques can provide valuable insights, it’s essential to remember that the accuracy of these methods relies on the validity of the assumptions made. Furthermore, additional domain knowledge or expert input may be necessary to fully interpret and understand the results. Overall, causal discovery is an exciting and promising area of research that continues to advance our understanding of causal relationships and improve our ability to make data-driven decisions.

Stay with me on this journey into the world of causal inference: we will look into instrumental variables, CATEs, and more in future posts!

Written by IvanGor

Senior Data Scientist @Careem
