Demystifying Causality: An Introduction to Causal Inference and Applications. Part 3.

IvanGor
10 min read · May 16, 2023


Thank you for joining me on this journey through causal inference. Today, we’ll explore some formal operators, such as the do operator, and discuss which variables you should condition on in your model to determine the causal effect.

Let’s begin with the do operator, an intervention operator represented as:

P(Y = y | do(T = t))

This expression denotes the probability distribution of our target Y when intervention T=t occurs. We refer to this distribution as the interventional distribution, in contrast to the observational distribution. The key difference between merely conditioning on t and using the intervention operator is that with the do operator, we assume we have taken the entire sample and set T=t for each element. In contrast, when conditioning on t, we only consider a portion of the population where T=t.
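To make this distinction concrete, here is a small simulation (a toy structural model of my own, not from the post) where conditioning and intervening give different answers because of a confounder Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Observational world: Z confounds both T and Y
Z = rng.normal(size=n)
T = Z + rng.normal(size=n)
Y = T + 2 * Z + rng.normal(size=n)

# Conditioning: keep only the part of the population where T is near 2
mask = np.abs(T - 2) < 0.05
obs_mean = Y[mask].mean()      # E[Y | T = 2] is about 4, since E[Z | T = 2] = 1

# Intervening: set T = 2 for the WHOLE sample; Z is left untouched
T_do = np.full(n, 2.0)
Y_do = T_do + 2 * Z + rng.normal(size=n)
do_mean = Y_do.mean()          # E[Y | do(T = 2)] is about 2

print(obs_mean, do_mean)
```

The gap between the two numbers is exactly the confounding association that flows through Z when we merely condition.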

So, using the do-operator, we can express the Average Treatment Effect (ATE) as:

ATE = E[Y | do(T = 1)] - E[Y | do(T = 0)]

An important question to consider is how the interventional operator influences the causal graphs we examined in our previous post. It turns out that during an intervention, we cut all the links leading to the node where the intervention takes place:

Left picture — before the intervention. Right picture — after the intervention T=t.

This concept is formally known as the Modularity Assumption: if we intervene on a set of nodes S, all nodes not in that set keep their original mechanisms, and every node within S (e.g., T in the example above) satisfies P(T = t | pa_T) = 1. In other words, under the intervention, the probability that T equals t is one.

This assumption enables us to factorize the interventional distribution:

P(x_1, ..., x_n | do(S = s)) = ∏_{i: X_i ∉ S} P(x_i | pa_i)

Here, x represents all the variables excluding those in S. If x is consistent with the intervention, we obtain the factorized distribution shown above; otherwise, we get P = 0.

The reason we need all of this is to estimate the Average Treatment Effect (ATE). Let’s look at it once again:

We still need to transition from interventional values to statistical ones. Can we simply go from P(y|do(T=1)) to P(y|t)? As it turns out, we can. This is known as the Backdoor Adjustment, which states that:

Given that the modularity assumption holds and a set of nodes W satisfies the backdoor criterion and positivity (link), we can identify the causal effect of T on Y:

P(y | do(t)) = Σ_w P(y | t, w) P(w)

The backdoor criterion is defined as follows:

A set of variables W satisfies the backdoor criterion relative to T and Y if:

  1. W blocks all backdoor paths from T to Y
  2. W does not contain any descendants of T.

Backdoor paths are the paths through which confounding associations flow (see the previous post). A path is blocked if it contains at least one non-collider that is conditioned on, or at least one collider that is neither conditioned on nor has a conditioned-on descendant.

For example, let’s continue with our graph from before the intervention:

In this case, it suffices to block the path T-Z_3-Y while leaving T-C-Y alone. Therefore, we should control for Z_3 and avoid controlling for the collider C; otherwise, a backdoor path remains open. One might argue that Z_3 is also a collider, which is true, but it is a collider on a different path (Z_1-Z_3-Z_2), through which no association flows from T to Y.

So, if we have a set of nodes W satisfying the backdoor criterion, positivity, and the modularity assumption, we can transform the ATE:

ATE = E_W[ E[Y | T = 1, W] - E[Y | T = 0, W] ]

Hence, it is possible to estimate the Average Treatment Effect (ATE)! Simply control for a set W that satisfies the backdoor criterion, keep the other assumptions in mind, and everything should work out.

Another approach to represent a causal graph is through a set of structural equations. For instance, we can express the graph above as follows:

We also attach a noise variable U to each node. In the case of an intervention (say, T = t), we simply replace the equation for T with T := t. This notation will be useful later.
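As a sketch of this idea (the graph and coefficients here are illustrative, not the ones from the post), a set of structural equations can be coded as a sequence of assignments, and do(T = t) just swaps out one of them:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_scm(n, do_T=None):
    """Sample from a toy SCM with Z -> T -> Y and Z -> Y.

    An intervention do(T = t) simply replaces the structural
    equation for T with the constant assignment T := t.
    """
    Z = rng.normal(size=n)                   # Z := U_Z
    if do_T is None:
        T = 0.5 * Z + rng.normal(size=n)     # T := 0.5*Z + U_T
    else:
        T = np.full(n, float(do_T))          # T := t  (equation replaced)
    Y = 2.0 * T + Z + rng.normal(size=n)     # Y := 2*T + Z + U_Y
    return Z, T, Y

# ATE of T on Y via two interventions: E[Y|do(T=1)] - E[Y|do(T=0)], about 2
_, _, y1 = sample_scm(100_000, do_T=1)
_, _, y0 = sample_scm(100_000, do_T=0)
ate = y1.mean() - y0.mean()
print(ate)
```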

Example

Now let’s consider an example that slightly modifies the previous graph:

Let’s generate synthetic data for that graph:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(11)

alpha = 0.5
beta = -5
delta = 0.4
gamma = -0.7

Z = np.random.normal(loc=1, size=500)
T = alpha * Z + np.random.normal(loc=2, size=500)
M = beta * T + np.random.normal(size=500)
Y = delta * Z + gamma * M + np.random.normal(size=500)
C = T + Y + np.random.normal(size=500)

df = pd.DataFrame(np.stack((Z, T, M, Y, C), axis=1),
                  columns=['Z', 'T', 'M', 'Y', 'C'])

Let’s find the effect of T on Y. We know that the true causal effect is equal to β*γ = 3.5. We can try to use regression as we did before. Let’s apply a quite common approach in Machine Learning — just add all the features into the model:

print(smf.ols('Y~T+Z+M+C', data = df).fit().summary())

As you can see, we have nothing close to 3.5. Moreover, we have a significant coefficient for T equal to -0.3977, which indicates a huge bias. But now we know what caused that bias! We controlled for the collider and the mediator. Let’s exclude all the variables except T and try once more:
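The snippet for this regression appears to have been lost in formatting; here is a sketch consistent with the surrounding code (the data is regenerated so the block runs on its own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the same synthetic data as above
np.random.seed(11)
alpha, beta, delta, gamma = 0.5, -5, 0.4, -0.7
Z = np.random.normal(loc=1, size=500)
T = alpha * Z + np.random.normal(loc=2, size=500)
M = beta * T + np.random.normal(size=500)
Y = delta * Z + gamma * M + np.random.normal(size=500)
C = T + Y + np.random.normal(size=500)
df = pd.DataFrame({'Z': Z, 'T': T, 'M': M, 'Y': Y, 'C': C})

# Regress Y on T alone: no collider/mediator bias, but the Z backdoor stays open
model = smf.ols('Y ~ T', data=df).fit()
print(model.summary())
```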

Here we did not block the backdoor path through Z. Nevertheless, the result is much better (although it is still biased, and with a different generative process the bias could be larger).

Finally, let us block this backdoor path according to the theory:

print(smf.ols('Y~T+Z', data = df).fit().summary())

At last! We have an unbiased estimate of the Average Treatment Effect (ATE) of T on Y!

The process works in the following way: in our model, we assume that there are no interactions between T and Z, and they enter the model independently. However, if we wish to introduce interactions between T and Z or incorporate some nonlinear relationships between them, we must keep our ATE formula for adjustment in mind. This will allow us to find the Average Treatment Effect (ATE) as an expectation. In the context of our example, when we introduce this adjustment, we get:

ATE = E_Z[ E[Y | T = 1, Z] - E[Y | T = 0, Z] ]

Then we can cut Z into percentile bins, so that the expectations are easy to take, and calculate the ATE as:

ATE ≈ (1/K) Σ_k ( E[Y | T = 1, Z_k] - E[Y | T = 0, Z_k] )

where Z_k is the mean value of Z within the k-th percentile bin. Using this approach, we arrive at our desired ATE, and by employing bootstrapping we can build a distribution of the estimate for confidence intervals.
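A minimal sketch of this percentile-binning estimator (the number of bins and the helper column Z_bin are my choices, not from the post; the data is regenerated so the block is self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the same synthetic data as above
np.random.seed(11)
alpha, beta, delta, gamma = 0.5, -5, 0.4, -0.7
Z = np.random.normal(loc=1, size=500)
T = alpha * Z + np.random.normal(loc=2, size=500)
M = beta * T + np.random.normal(size=500)
Y = delta * Z + gamma * M + np.random.normal(size=500)
df = pd.DataFrame({'Z': Z, 'T': T, 'Y': Y})

# Cut Z into percentile bins; within each bin Z is roughly constant,
# so a per-bin regression of Y on T approximates E[Y | T, Z] locally
df['Z_bin'] = pd.qcut(df['Z'], q=10)
effects = []
for _, g in df.groupby('Z_bin', observed=True):
    effects.append(smf.ols('Y ~ T', data=g).fit().params['T'])

ate = np.mean(effects)   # average the per-bin effects over the distribution of Z
print(ate)
```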

Let’s slightly modify our process now. Suppose we know that the variable Z exists, but it is unobservable, and as a result, we cannot control for it. What should be our strategy in this situation? Here, we can take advantage of the frontdoor criterion to guide our approach.

A set of variables M satisfies the frontdoor criterion relative to T and Y if the following is true:

  1. M completely mediates the effect of T on Y (all causal paths from T to Y go through M)
  2. There is no unblocked backdoor path from T to M
  3. All backdoor paths from M to Y are blocked by T.

If we have a set M satisfying this criterion, then the frontdoor adjustment can be applied:

If (T, M, Y) satisfy the frontdoor criterion and we have positivity, then:

P(y | do(t)) = Σ_m P(m | t) Σ_{t'} P(y | m, t') P(t')

Thus, we can identify the causal effect of t on y as represented in the formula.

The intuition behind the formula can be broken down into three steps:

  1. We identify the causal effect of T on M, leveraging the backdoor adjustment. For the T-M dependency, all backdoor paths are blocked because Y and C are colliders on the potential paths:

P(m | do(t)) = P(m | t)

2. We identify the effect of M on Y. Given that T blocks the backdoor path from M to Y (the M-T-Z-Y route), and C already blocks its path as a collider, we need to adjust only for T:

P(y | do(m)) = Σ_t P(y | m, t) P(t)

3. Finally, we merge the insights from the first two steps:

P(y | do(t)) = Σ_m P(m | t) Σ_{t'} P(y | m, t') P(t')

Now, let’s examine if the frontdoor adjustment will enable us to achieve the same results of estimated causal effect of T on Y as we did in the previous section.

Firstly, we could transition to probabilities and compute the frontdoor sum directly. However, this calculation isn’t straightforward when dealing with a continuous variable. An alternative approach is to calculate the ATE from the frontdoor criterion through regressions. We’ll call the effect of T on Y beta and presume our variable dependencies are linear:

As we’re hunting for the ATE, we’re interested in a certain coefficient, beta hat, which is:

We recognize this as an Ordinary Least Squares (OLS) estimator. But in this situation, we understand that the error term and treatment aren’t independent. Let’s substitute our linear equations into the estimator to see what we end up with:

By substituting M and grouping covariances, we get:

Since there are no unblocked backdoor paths from T to M, we can set those covariances to zero, and phi is equal to zero because there is no direct effect of T on Y (M completely mediates it). From this, we can infer that the ATE is the product of the two remaining coefficients.

This offers us a straightforward methodology. We need to estimate two regressions (for M and Y, as in the previous model) and multiply the necessary coefficients.

Let’s see if this approach works as expected:

Firstly, let’s consider the results of the regressions, and then move on to the basic approach which involves categorizing our variables.
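The regression snippet itself is not shown in the post; here is a sketch that should reproduce the quoted coefficients (the data is regenerated with the same seed so the block runs standalone):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the same synthetic data as above
np.random.seed(11)
alpha, beta, delta, gamma = 0.5, -5, 0.4, -0.7
Z = np.random.normal(loc=1, size=500)
T = alpha * Z + np.random.normal(loc=2, size=500)
M = beta * T + np.random.normal(size=500)
Y = delta * Z + gamma * M + np.random.normal(size=500)
df = pd.DataFrame({'T': T, 'M': M, 'Y': Y})   # Z is treated as unobserved

# Frontdoor via two regressions:
# 1) effect of T on M (no backdoor path from T to M)
beta_hat = smf.ols('M ~ T', data=df).fit().params['T']
# 2) effect of M on Y, adjusting for T to block the backdoor M <- T <- Z -> Y
gamma_hat = smf.ols('Y ~ M + T', data=df).fit().params['M']

print(beta_hat * gamma_hat)   # close to the true effect beta*gamma = 3.5
```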

The multiplication of -5.03 and -0.7028 yields 3.535. We’ve achieved our goal! Confidence intervals can be created using the bootstrap method.

Now let’s consider an approach involving binary variables on a graph with the same structure (note that here C plays the role of the unobserved confounder and Z the collider):

import numpy as np
import pandas as pd

np.random.seed(0)

# Number of observations
n = 1000

# Define parameters for the linear functions
a_coef = 0.2
b_coef = 0.6
d_coef = 0.3
e_coef = 0.4

# Generate the confounder C from a Bernoulli distribution
C = np.random.binomial(1, 0.5, n)

# Generate T and M as Bernoulli variables with probabilities
# influenced by C and T, respectively
p_T = 0.25 + 0.5 * C
T = np.random.binomial(1, p_T)

p_M = 0.25 + 0.5 * T
M = np.random.binomial(1, p_M)

# Calculate probabilities for Y and Z
p_Y = a_coef * C + b_coef * M
Y = np.random.binomial(1, p_Y)

p_Z = d_coef * T + e_coef * Y
Z = np.random.binomial(1, p_Z)

dataset = pd.DataFrame({'C': C, 'T': T, 'M': M, 'Y': Y, 'Z': Z})

This script will generate a set of Bernoulli variables. Here, the causal effect of T on Y is equal to 0.3 (which is 0.6*0.5).

Let’s try the direct approach first, using conditional probabilities from the frontdoor adjustment:

# Estimate P(M = 1 | T)
P_M_given_T = dataset.groupby('T')['M'].mean()

# Estimate P(Y = 1 | M, T') using groupby and mean
P_Y_given_M_Tprime = dataset.groupby(['M', 'T'])['Y'].mean()

# Estimate P(T' = 1)
P_Tprime = dataset['T'].mean()

# Calculate P(Y = 1 | do(T)) using the front-door adjustment
P_Y_do_T = np.zeros(2)
for t_val in [0, 1]:
    for m_val in [0, 1]:
        for t_prime_val in [0, 1]:
            P_Y_do_T[t_val] += (
                P_Y_given_M_Tprime.loc[m_val, t_prime_val]
                * (m_val * P_M_given_T[t_val] + (1 - m_val) * (1 - P_M_given_T[t_val]))
                * (t_prime_val * P_Tprime + (1 - t_prime_val) * (1 - P_Tprime))
            )

causal_effect_T_on_Y_frontdoor = P_Y_do_T[1] - P_Y_do_T[0]

The result is 0.297. As usual, we can construct confidence intervals using bootstrap.

The regression approach, just like in the previous example, also provides us with an estimate of 0.3.
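The regression code for the binary case is not shown either; here is a sketch using linear probability models via OLS (the names b_TM and b_MY are mine, and the same Bernoulli data is regenerated so the block is self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the same Bernoulli data as above
np.random.seed(0)
n = 1000
C = np.random.binomial(1, 0.5, n)
T = np.random.binomial(1, 0.25 + 0.5 * C)
M = np.random.binomial(1, 0.25 + 0.5 * T)
Y = np.random.binomial(1, 0.2 * C + 0.6 * M)
dataset = pd.DataFrame({'C': C, 'T': T, 'M': M, 'Y': Y})

# Frontdoor via two linear probability regressions:
# effect of T on M, then effect of M on Y adjusting for T
b_TM = smf.ols('M ~ T', data=dataset).fit().params['T']     # about 0.5
b_MY = smf.ols('Y ~ M + T', data=dataset).fit().params['M'] # about 0.6

print(b_TM * b_MY)   # about 0.3
```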

So, what can we conclude after reading this post? The theory is far from useless. Backdoor adjustment enables us to identify the effect of treatment T on the target Y, which paves the way for estimating the ATE using straightforward statistical techniques, such as regression. The frontdoor adjustment gives us even more: it lets us estimate the ATE even when a confounder is unobserved.

In simpler terms, by understanding the causal relationships between variables and properly accounting for them, we can make more accurate predictions and inferences about the effects of different treatments. This knowledge helps us make better decisions and avoid potential pitfalls in our analyses. So, always remember to consider the underlying causal structure when working with data!

Stay tuned! In our upcoming posts, we’ll delve into topics such as causal discovery and instrumental variables.

Additional links: https://www.canr.msu.edu/afre/events/Bellemare and Bloem (2020).pdf


Written by IvanGor

Senior Data Scientist @Careem
