May 10

Why estimating causal effects isn’t easy, even if we had infinite, perfect data

Jamilla Cooiman, Founder Causal Academy

Many people working in the data industry believe that data holds all the answers, and that these answers are hidden in the form of complex patterns within the data. This belief has shaped how most data scientists approach their work today. The focus is often on building increasingly complex machine learning models to detect those patterns in hopes that the answers we look for are revealed in the most accurate way possible.

When predictive accuracy falls short, a large part of the problem is usually blamed on missing variables, low data quality, or limited observations. The common belief is that collecting more and better data will naturally lead to better results. So, if we had access to infinite, perfectly measured data, many would expect that finding the right answers would become relatively straightforward.

And to some extent, that belief is correct, as long as the goal is prediction.

If we are simply trying to estimate an outcome variable Y as accurately as possible, then more (and higher-quality) data generally makes that easier. Machine learning models rely on patterns in the data, and with more variables and observations, these patterns become more extensive and clearer and therefore easier to model. In that setting, having infinite, perfect data would allow us to build models that get extremely close to the true conditional expectation function that maps inputs to the outcome.

But not all questions we aim to answer with Data Science revolve around predictions. In fact, most of our important business questions revolve around Causal Inference. We don’t just want to predict an outcome; we want to know how we can change it. We want to estimate the causal effect of changing one variable, say T, on another variable, say Y. These are the kinds of questions that help businesses make better decisions. How would changing an airline’s ticket prices affect demand? If a hotel started allowing pets, how would that affect the number of bookings? What’s the impact of switching from traditional ads to influencer marketing on a brand’s monthly revenue?

But here’s the crux: even with infinite, perfect (observational) data, estimating causal effects is still not easy. This may sound counterintuitive. If we can’t uncover causality in a perfect-data world, then when can we?

In this post, I’ll explain why causal effect estimation remains fundamentally difficult, even under the best possible data conditions.

How Causal Thinking Often Goes in Practice

Most people understand that the ‘patterns’ in observational data are associations, and association is not causation. They are aware that these associations can arise for many reasons, one of the most common being a common cause of the variables of interest.

A classic example is that older individuals tend to take more medication and also tend to experience more health problems. This creates a strong association between medication use and poor health outcomes. But this does not necessarily mean that medication causes poor health. In this case, age is a common cause that drives both variables, creating a non-causal and therefore biased association between them.

It is also widely understood that if we control for such common causes (hold them fixed), we can remove the bias they introduce. In the example above, if we compare medication use and health outcomes among people of the same age, we eliminate the influence of age. The idea that adjusting for the right variables can remove bias forms the foundation of causal inference using observational data.
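To make this concrete, here is a minimal simulation sketch of the medication example. Everything in it is an assumption made for illustration: the coefficients, the age range, the noise levels, and the helper name ols_coefs are invented, and medication is deliberately given no effect on health problems at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers are made up):
# age drives both medication use and health problems,
# while medication has NO causal effect on health problems.
age = rng.uniform(20, 90, size=n)
medication = 0.05 * age + rng.normal(0, 1, size=n)
health_problems = 0.10 * age + rng.normal(0, 1, size=n)


def ols_coefs(target, *regressors):
    """Least-squares slopes of target on the regressors (intercept fitted but not returned)."""
    X = np.column_stack([np.ones_like(target), *regressors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta[1:]


# Naive comparison: medication vs. health problems, ignoring age.
naive = ols_coefs(health_problems, medication)[0]

# Adjusted comparison: same regression, but holding age fixed.
adjusted = ols_coefs(health_problems, medication, age)[0]

print(f"medication slope without adjusting for age: {naive:.3f}")
print(f"medication slope after adjusting for age:   {adjusted:.3f}")
```

Under these assumptions the unadjusted slope comes out clearly positive, while the age-adjusted slope is close to zero, which matches the true (zero) effect built into the simulation.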

From this fundamental idea, a common line of reasoning follows. If conditioning on relevant variables removes bias, then adding as many variables as possible to a model should remove as much bias as possible. The logic continues: if we had access to infinite data on every variable in the world, and we included all of them in our model, we could fully eliminate any source of bias. In that case, any remaining association between T and Y would reflect the causal effect of interest.

Why This Logic Is Flawed

The idea that we can simply remove all bias by conditioning on every available variable assumes that adding variables to a model can never do harm. It assumes that conditioning always removes bias and never produces it. But this assumption is incorrect.

In reality, conditioning on certain variables can actually introduce new bias into our analysis. This happens when we condition on the wrong types of variables. Using Causal Graph terminology, these ‘wrong types’ can for example be colliders, or descendants of Y.

A collider can, for example, be a variable that is caused by both variables under study (T and Y). Conditioning on a collider can create a biased association between these variables even when there is no causal relationship between them. If we leave the collider unconditioned, no such association appears.

To understand this, consider the following simple example.

Imagine rolling two fair dice. Let T be the value of die 1, and Y the value of die 2. These two values are completely independent: knowing the outcome of one die tells you nothing about the other. So there is no causal relationship between T and Y.

Now, define a third variable Z as the sum of T and Y. That is, Z = T + Y.

T and Y are independent overall. But if we condition on Z, for example by only looking at dice rolls where the sum is 5, then knowing the value of one die tells you exactly what the other is. If T = 1, then Y must be 4. If T = 3, then Y must be 2. Within this subset there is a negative association between T and Y, even though they were independent to begin with. By conditioning on Z, we have introduced a biased association that wasn’t there before.
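The dice example is small enough to check exactly by listing all 36 equally likely rolls. The sketch below does that with exact fractions; the helper name covariance and the variable names are my own choices for illustration.

```python
from fractions import Fraction
from itertools import product


def covariance(pairs):
    """Exact covariance of (t, y) over a list of equally likely pairs."""
    n = len(pairs)
    mean_t = Fraction(sum(t for t, _ in pairs), n)
    mean_y = Fraction(sum(y for _, y in pairs), n)
    return sum((t - mean_t) * (y - mean_y) for t, y in pairs) / n


# All 36 outcomes of two fair dice: T is die 1, Y is die 2.
all_rolls = list(product(range(1, 7), repeat=2))

# Conditioning on the collider Z = T + Y by keeping only rolls with Z = 5.
rolls_sum_5 = [(t, y) for t, y in all_rolls if t + y == 5]

cov_all = covariance(all_rolls)
cov_sum_5 = covariance(rolls_sum_5)

print(cov_all)    # 0
print(cov_sum_5)  # -5/4
```

Across all rolls the covariance is exactly zero. Among the four rolls that sum to 5, it is negative, because Y is fully determined as 5 - T.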

Why Infinite, Perfect Data Doesn’t Help

So why doesn’t having infinite, perfect data solve this problem?

The reason is simple but fundamental: even perfect data won’t tell us which variables we should or shouldn’t adjust for. And this is exactly where the main challenge of causal inference with observational data lies. The problem is not the quality or quantity of the data, but our lack of knowledge about the causal mechanisms that produced it, which we call the Data Generating Process (DGP).

In prediction tasks, we don’t need to understand how the data was generated. As long as we have enough data and good features, a model can learn to accurately map inputs to outcomes, regardless of the underlying causal structure.

But causal inference is different. Here, we care a lot about how the data was generated, because knowledge of the DGP would tell us which variables are common causes, which are mediators that transmit causal effects, which are colliders, and so on. More specifically, knowing the DGP would tell us which variables we can and can’t control for to obtain unbiased causal effect estimates. None of this information will ever be fully revealed by observational data.

Without this understanding, even the best data and the most accurate predictive models can still give us biased and misleading results.
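A small simulation sketch can illustrate how better prediction and a worse causal estimate can go together. The setup is hypothetical: I assume T has a true effect of 1.0 on Y, and that Z is caused by both T and Y (so Z is a collider and also a descendant of Y). The noise distributions, the sample size, and the helper name fit_ols are illustrative assumptions, not part of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
TRUE_EFFECT = 1.0  # assumed causal effect of T on Y in this made-up DGP

t = rng.normal(size=n)
y = TRUE_EFFECT * t + rng.normal(size=n)
z = t + y + rng.normal(size=n)  # collider: caused by both T and Y


def fit_ols(target, *regressors):
    """Return (slopes, R^2) of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones_like(target), *regressors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ beta
    r2 = 1.0 - residuals.var() / target.var()
    return beta[1:], r2


# Model 1: regress Y on T only.
coefs_t_only, r2_t_only = fit_ols(y, t)

# Model 2: also "control for" the collider Z.
coefs_with_z, r2_with_z = fit_ols(y, t, z)

print(f"Y ~ T:     T slope = {coefs_t_only[0]:.3f}, R^2 = {r2_t_only:.3f}")
print(f"Y ~ T + Z: T slope = {coefs_with_z[0]:.3f}, R^2 = {r2_with_z:.3f}")
```

In this setup the model that includes Z fits Y noticeably better, yet its coefficient on T moves away from the assumed true effect of 1.0 and ends up near zero. The data alone give no warning of this; only knowing that Z is a collider tells us to leave it out.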

Conclusions

We don’t know the DGP, and so to perform causal inference with observational data we must make assumptions about it. Making these assumptions accurately is where the main difficulty lies, and this problem persists no matter how much data we have.

As a result, causal inference is fundamentally a theory-driven task. It relies on reasoning about how the world works, not just learning patterns from data. This makes it very different from the more familiar data-driven approach we use in predictive modeling.
