Demystifying Causality: An Introduction to Causal Inference and Applications. Part 1.
This post is the first in a series of blog posts covering methods of causal inference, starting with the basics and moving towards SOTA methods and approaches. My posts will include practical illustrations of the concepts and methods in Python. This first post is based on Brady Neal's book. Let's start with some basics.
What is it about causal inference?
You may have heard the phrase "correlation doesn't mean causation". That is true, though it would be better to say "association doesn't mean causation" to avoid being stuck with the linearity assumption that is often implicit when we talk about correlation. You can find a large number of examples on the Internet. Here are a couple of them (including one from Neal's Introduction to Causal Inference).
The first example is about a researcher who wants to find a connection between two events: sleeping in shoes and having a headache. The data show that those who slept in their shoes quite often have a headache in the morning. We can find this correlation in the data, and it would be quite obvious to us. What we may not claim is that there is a causal relationship.
In the picture above, one can see that there is a third factor, a confounder, which affects both sleeping in shoes and the headache: the fact that the person in question drank alcohol the night before. After the party, the person comes home and falls asleep in their shoes. Now let's generate some data. We will use the following data generating process (DGP):
import numpy as np
import pandas as pd

N_OBSERVATIONS = 1000

np.random.seed(42)
random_1 = np.random.uniform(size=N_OBSERVATIONS)
random_2 = np.random.uniform(size=N_OBSERVATIONS)
random_3 = np.random.uniform(size=N_OBSERVATIONS)

# About 25% of people drank alcohol last night
alcohol_taken = (random_1 > 0.75) * 1
# Sleeping in shoes is rare on its own but much more likely after alcohol
shoes_on = ((random_2 > 0.95) * 1 + alcohol_taken * (random_2 > 0.6) >= 1) * 1
# A headache is rare on its own but much more likely after alcohol;
# note that shoes_on does not enter this equation at all
headache = ((random_3 > 0.9) * 1 + alcohol_taken * (random_3 > 0.2) >= 1) * 1

df = pd.DataFrame(np.stack([alcohol_taken, shoes_on, headache]).T,
                  columns=['alcohol', 'shoes', 'headache'])
Note that in the data generation process the shoes and headache columns are created independently of each other, but both of them are associated with alcohol consumption.
Analyzing this data while ignoring alcohol consumption, we get the following results:
Well, the result looks significant: in these two groups the difference in the share of people with a headache is about 0.4, and the correlation between the two variables is 0.34. If we fit a regression, it also shows a positive, statistically significant coefficient:
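Here is a minimal sketch of this naive analysis (it assumes the df generated above and uses statsmodels for the regression):

import statsmodels.formula.api as smf

# Difference in the share of headaches between the two groups
naive_diff = (df.loc[df['shoes'] == 1, 'headache'].mean()
              - df.loc[df['shoes'] == 0, 'headache'].mean())
print(naive_diff)
# Correlation between shoes and headache
print(df['shoes'].corr(df['headache']))
# Naive logistic regression of headache on shoes only
print(smf.logit('headache ~ shoes', data=df).fit().summary())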
At the same time, in the code we see that these two variables were generated independently. What we faced is omitted variable bias, one of the types of endogeneity. On the causal graph (the kind depicted in Figure 1) we have a "backdoor path" from shoes to headache through alcohol. I will write about the terminology of causal DAGs in one of the next posts. In order to eliminate this bias, we should control for the variable that was omitted from our first regression.
After controlling for this variable (or, equivalently, closing the backdoor path) we obtain an unbiased estimate. We see that the headache no longer depends on shoes; it depends on alcohol consumption.
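The adjusted regression is a one-liner (a sketch, reusing df and the smf import from above):

# Control for the confounder by adding alcohol to the regression;
# the coefficient on shoes should now be statistically insignificant
print(smf.logit('headache ~ alcohol + shoes', data=df).fit().summary())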
Many of us know that a number of problems in statistical reasoning and machine learning can be solved by adding more data. That is not true for causal problems. No matter how much data you add, if you haven't specified your causal model correctly, you will get a biased answer. But what could go wrong? In our example, maybe nothing. But there are plenty of examples in which an incorrect analysis of causal relationships may lead to wrong decisions. For example:
A wrong analysis of this data may lead to an intentional decrease of US crude oil imports in order to decrease deaths in collisions with railway trains. Of course, you may say that these events are obviously independent, but it is not always so easy to see. You may also face a problem where you need to build a prediction model, and a model that does not take causality into consideration may produce wrong predictions.
Observational data quite frequently contain confounders, and treatment is very rarely randomized in such data. Let's introduce some formal definitions that we are going to use further.
Let's stick with our example and consider the shoes variable as the treatment (T): T=1 if a person sleeps in their shoes, T=0 if not. The fact of a headache we will call the outcome and denote as Y: Y=1 if a person has a headache in the morning, Y=0 otherwise. X is our covariate (alcohol intake): it is equal to 1 if a person took alcohol, 0 if not.
Now let's make a step towards causality. We call Y(1) and Y(0) the potential outcomes. For a specific person i we may write Y_i(1) and Y_i(0). Then the individual treatment effect (ITE) is defined as:

tau_i = Y_i(1) - Y_i(0)
Here we run into the fundamental problem of causal inference: we can never observe both potential outcomes at the same time. A person is either treated or untreated, which means that we can never observe the causal effect tau_i directly. The realization of the potential outcome that we do observe is called the factual outcome; the unobserved variant is called the counterfactual. For example, if we have seen Y_i(1), then it is the factual outcome and Y_i(0) is the counterfactual.
One might say that it is still possible to get at least the average treatment effect (ATE) by simply averaging the results in the treatment (T=1) and control (T=0) groups. We saw earlier in our example what results that may give us. But why? Let's consider the formal notation. What we actually computed in Figure 2 is the associational difference:

E[Y | T=1] - E[Y | T=0]
However, the ATE is defined as:

ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]
Now let’s consider in which case these quantities are equal.
The assumptions that let us convert the causal difference tau into the associational difference above are the following:
- Ignorability/Exchangeability
- Identifiability. The causal quantity (tau) is identifiable if we can compute it from a purely statistical quantity.
Ignorability/exchangeability means that the choice of treatment is independent of the outcome. A good example of this is a randomized controlled trial, or simply an A/B test. If we assign treatment randomly to independent, similar groups, we can say that ignorability holds. In other words, we could swap the treatment assignment between the groups, and the results of such an experiment would be the same (in terms of the ATE).
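As a quick illustration, here is a sketch in which we reuse the DGP above but assign shoes completely at random (the 0.7 threshold is an arbitrary choice), so that ignorability holds by construction. The naive group comparison then lands near the true causal effect of zero:

# Randomize the treatment independently of alcohol
rng = np.random.default_rng(0)
df_rct = df.assign(shoes=(rng.uniform(size=len(df)) > 0.7) * 1)
naive_diff_rct = (df_rct.loc[df_rct['shoes'] == 1, 'headache'].mean()
                  - df_rct.loc[df_rct['shoes'] == 0, 'headache'].mean())
print(naive_diff_rct)  # close to zero: shoes have no causal effect on headache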
Identifiability means that we can identify the causal effect, i.e. reduce a causal expression to a statistical expression. Every calculation of a causal expression can be viewed as consisting of two steps: identification and then estimation. For our example with tau and the associational difference above, we may write:

tau = E[Y(1)] - E[Y(0)]
    = E[Y(1) | T=1] - E[Y(0) | T=0]    (step 1)
    = E[Y | T=1] - E[Y | T=0]          (step 2)
Here step 1 holds due to exchangeability, and step 2 due to identifiability.
In our example involving shoes and headaches, we can see that the ignorability assumption has been violated. The treatment assignment is not independent as we know that alcohol is a factor that influences both shoes and headaches.
So we know one more factor, alcohol, and now we want to include it in our calculations. What guarantees that, conditional on alcohol consumption, we will be able to estimate the effect of shoes on a headache? Actually, a few more assumptions:
- Conditional exchangeability / Unconfoundedness
- Positivity / Overlap / Common support
- No interference
- Consistency
A few comments:
Unconfoundedness means that treatment assignment is independent of experiment results conditional on X. Thus, the two groups may have different probabilities of assignment to treatment due to some covariate X, but after controlling for this, we obtain unbiased results, as we saw in regressions above.
Positivity means that different values of T occur for each value of X. Thus, in each subgroup we have both some treated and some control observations (we will check this for our data right after these comments).
No interference means that no interference occurs between different observations. Thus, there are no network effects, and all the outputs are independent of each other.
Consistency means that the potential outcome under treatment value t is equal to the observed outcome whenever the treatment actually received is t.
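Positivity is easy to verify in our synthetic data; a minimal sketch, assuming the df from above:

# Positivity check: each stratum of X (alcohol) should contain
# both treated (shoes=1) and control (shoes=0) observations
print(pd.crosstab(df['alcohol'], df['shoes']))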
After checking all of these assumptions, we can proceed to calculate our ATE.
The identification of the ATE then proceeds as follows:

E[Y(1) - Y(0)]
= E[Y(1)] - E[Y(0)]                             (1) linearity of E
= E_X[E[Y(1) | X] - E[Y(0) | X]]                (2) law of iterated expectations
= E_X[E[Y(1) | T=1, X] - E[Y(0) | T=0, X]]      (3) unconfoundedness, positivity
= E_X[E[Y | T=1, X] - E[Y | T=0, X]]            (4) consistency

All these assumptions together lead to the identifiability of the ATE.
Let's estimate the ATE for our synthetic example simply using the last expression above and check whether the result is comparable to our regression example. The point estimate here is:
# E[Y | T=1, X=1] - E[Y | T=0, X=1]
e_x1 = (df.loc[(df['alcohol'] == 1) & (df['shoes'] == 1), 'headache'].mean()
        - df.loc[(df['alcohol'] == 1) & (df['shoes'] == 0), 'headache'].mean())
# E[Y | T=1, X=0] - E[Y | T=0, X=0]
e_x0 = (df.loc[(df['alcohol'] == 0) & (df['shoes'] == 1), 'headache'].mean()
        - df.loc[(df['alcohol'] == 0) & (df['shoes'] == 0), 'headache'].mean())
# Weight the per-stratum differences by P(X=1) and P(X=0)
alco = df['alcohol'].mean()
print(e_x1 * alco + e_x0 * (1 - alco))
It is equal to 0.0439. What about the statistical significance of this result? Let's use the bootstrap. It gives us the following result:
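Here is a minimal bootstrap sketch for this estimate (1,000 resamples is an arbitrary choice; the exact proportion varies slightly with the resampling):

# Recompute the adjustment-formula estimate on bootstrap resamples of the data
def ate_adjustment(data):
    e_x1 = (data.loc[(data['alcohol'] == 1) & (data['shoes'] == 1), 'headache'].mean()
            - data.loc[(data['alcohol'] == 1) & (data['shoes'] == 0), 'headache'].mean())
    e_x0 = (data.loc[(data['alcohol'] == 0) & (data['shoes'] == 1), 'headache'].mean()
            - data.loc[(data['alcohol'] == 0) & (data['shoes'] == 0), 'headache'].mean())
    alco = data['alcohol'].mean()
    return e_x1 * alco + e_x0 * (1 - alco)

boot_ates = np.array([ate_adjustment(df.sample(frac=1, replace=True))
                      for _ in range(1000)])
# Share of bootstrap estimates below zero
print((boot_ates < 0).mean())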
The proportion of bootstrap estimates to the left of 0 is 0.164, which means we cannot reject the null hypothesis that the ATE is equal to 0 at a significance level of, say, 0.05. We obtained the same qualitative result from our regression analysis: earlier we fitted a logistic regression model that showed an insignificant coefficient on shoes. We may also check the results explicitly.
# Fit a logistic regression of headache on both alcohol and shoes
model = smf.logit('headache ~ alcohol + shoes', data=df).fit()
# Predict the outcome for everyone as if treated (shoes=1)...
dat_shoes = df.copy()
dat_shoes['shoes'] = 1
# ...and as if untreated (shoes=0)
dat_noshoes = df.copy()
dat_noshoes['shoes'] = 0
# The ATE estimate is the average difference in predictions
print((model.predict(dat_shoes) - model.predict(dat_noshoes)).mean())
This gives us an ATE estimate equal to 0.018. The bootstrap gives us the following distribution of results:
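This is the same resampling procedure as before, applied to the regression-based estimate (a sketch under the same assumptions):

# Recompute the regression-based ATE estimate on bootstrap resamples
def ate_logit(data):
    m = smf.logit('headache ~ alcohol + shoes', data=data).fit(disp=0)
    d1, d0 = data.copy(), data.copy()
    d1['shoes'] = 1
    d0['shoes'] = 0
    return (m.predict(d1) - m.predict(d0)).mean()

boot_ates = np.array([ate_logit(df.sample(frac=1, replace=True))
                      for _ in range(1000)])
# Share of bootstrap estimates below zero
print((boot_ates < 0).mean())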
Here approximately 0.29 of the bootstrap estimates lie to the left of zero, so the conclusion is the same: the ATE cannot be distinguished from zero.
In conclusion, this text introduces the basics of causal inference, emphasizing the importance of understanding and accounting for confounders in analyzing relationships between variables. The post demonstrates through a synthetic example that correlation does not imply causation, and showcases how confounders can lead to biased results if not properly controlled for. It also introduces essential causal inference assumptions, such as unconfoundedness, positivity, no interference, and consistency, which must be satisfied to estimate the Average Treatment Effect (ATE) accurately. By controlling for confounders and adhering to these assumptions, researchers can avoid incorrect conclusions and improve the validity of their findings.
In the next post, we will become acquainted with graphical models and explore their potential applications in causal inference. Stay tuned!