- May 10
Moving from Predictive to Causal Thinking in Data Science
- Jamilla Cooiman, Founder Causal Academy
Most data scientists enter the field with a strong belief in the power of data. And rightly so. We’ve seen how large datasets can train powerful models, uncover complex patterns, and enable predictions that feel almost magical. But when we shift from prediction to causal inference, many of our instincts need to be re-evaluated.
Causal inference isn’t just some new tool that you easily add to your toolbox. It requires a different way of thinking. That’s also why introductory material on causal inference can feel unusually theoretical or even philosophical at times. Before you can estimate a causal effect, you need to understand fully what this actually means, what assumptions you will rely on, and how this should be approached. This isn’t just academic; it has direct consequences for how we approach problems in practice.
And that’s where the real challenge begins. Because applying causal inference isn’t just about learning new techniques. It’s about changing how we frame problems, how we use data, and how we judge the success of our models.
In this blog post, I’ll walk through three fundamental mindset shifts that many data scientists need to make before they can start modeling causally, even if they’re already highly skilled with data.
From Data-First to Design-First Thinking
Most data science workflows are built around the idea of letting the data speak. We’re used to tackling a business problem by jumping straight into exploration: we visualize variables, look for trends, and let those patterns guide our modeling choices. With causal inference, this isn’t the way to go.
Just like in any data science project, causal inference begins with a business question. In this case, the question is typically causal in nature, because if it’s not, there’s usually no need for causal inference, and we can continue with standard predictive modeling instead. The challenge is that even when the question is causal, we need to think carefully about whether it is well-defined, and whether the techniques we plan to use are suitable for answering it.
For example, asking “What would happen to churn if we offer a discount?” is a different causal question from “Given that this customer has churned, would this also have happened if we had sent a discount some weeks ago?”
The first one is about understanding what happens if we intervene now, and the second is about understanding what would have happened if we had intervened in the past. Both questions are causal, but they require different strategies to answer. The second is more complex and relies on stronger assumptions and more advanced techniques, so not every causal inference technique that suits the first question can answer the second. Understanding this distinction is the first step toward avoiding an analysis that could never answer our query of interest in the first place. If you want to learn more about this, I’d suggest exploring the Ladder of Causation by Pearl.
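To make the distinction concrete, the two questions can be written in the notation commonly used with the Ladder of Causation. This is only a sketch: the symbols are my own labels for this example, with Y = 1 meaning the customer churns and T = 1 meaning the customer receives the discount.

```latex
% Intervention: how likely is churn if we offer the discount?
P\big(Y = 1 \mid \mathrm{do}(T = 1)\big)

% Counterfactual: for a customer who got no discount (T = 0) and churned (Y = 1),
% how likely is it that they would still have churned had we sent the discount?
P\big(Y_{T=1} = 1 \mid T = 0,\; Y = 1\big)
```

The second expression conditions on what actually happened to this specific customer while asking about a world in which the treatment was different, which is why it needs stronger assumptions than the first.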
Most often the questions will be about understanding what happens if we intervene now, and these questions can generally only be answered using either experimental data or observational data combined with assumptions.
If we can run a randomized controlled trial (RCT), that’s almost always the most reliable route. But in most real-world situations, randomization isn’t possible. That means we need to rely on observational data. In this case, it’s helpful to ask: if we could design the perfect RCT, what would it look like? This thought exercise forces us to think clearly about the treatment, the outcome, and the context.
Next, we need to formulate the assumptions required to perform causal inference with observational data. These are generally assumptions about which variables may or may not cause each other, and they form the basis for understanding what biases might exist between treatment and outcome in our observational data. These assumptions are the foundation of the entire causal analysis. The validity of our conclusions depends on how accurate and reasonable they are, so a lot of attention goes into mapping them out as carefully as we can. This process is often supported by data, but it is guided by domain knowledge: our understanding of the system we’re studying.
So with causal inference, we don’t jump straight into the data and let it guide us. Instead, we emphasize understanding the business problem and structuring the analysis so that it can actually answer the question of interest. This is the essence of design-first thinking. It’s not about blindly trusting the data to “speak.” It’s about using external knowledge and structured reasoning to set the stage, and then letting the data play its part in that context.
From As Much Data as Possible to the Right Data
In predictive modeling, more data often leads to better models. More variables provide more chances to improve fit, reduce error, and capture complex patterns in the data. But causal inference doesn’t reward quantity in the same way. In fact, more variables can sometimes make things worse.
Causal inference isn’t about detecting as many relationships as possible. It is about removing bias. That shifts the goal entirely. Instead of including every variable we can get our hands on to build the most flexible model, the focus becomes identifying the right variables — the ones that allow us to isolate the causal relationship we care about, without introducing new sources of bias.
Take confounders as an example. These are variables that affect both the treatment and the outcome. If we fail to control for them, our estimate of the causal effect will be biased. But it works the other way too. If we include a collider (a variable that is caused by both the treatment and the outcome), we actually introduce bias by conditioning on it.
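A small simulation can illustrate both failure modes. This is a minimal sketch on synthetic data, not a recipe for real projects: the variable names, the coefficients, and the linear data-generating process are all assumptions chosen for illustration, and the true effect of the treatment on the outcome is set to 1.0 by construction.

```python
import numpy as np

def ols_coef(y, *columns):
    """Return OLS coefficients of y on the given columns (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0  # assumed causal effect of treatment on outcome

# Scenario 1: z is a confounder, it drives both the treatment and the outcome.
z = rng.normal(size=n)
t = 0.8 * z + rng.normal(size=n)
y = true_effect * t + 1.5 * z + rng.normal(size=n)

naive = ols_coef(y, t)[1]        # confounder omitted: biased upward
adjusted = ols_coef(y, t, z)[1]  # confounder included: close to 1.0

# Scenario 2: no confounding, but c is a collider caused by both t2 and y2.
t2 = rng.normal(size=n)
y2 = true_effect * t2 + rng.normal(size=n)
c = t2 + y2 + rng.normal(size=n)

clean = ols_coef(y2, t2)[1]             # collider left out: close to 1.0
with_collider = ols_coef(y2, t2, c)[1]  # collider included: biased toward 0

print(f"confounder omitted:  {naive:.2f}")
print(f"confounder adjusted: {adjusted:.2f}")
print(f"collider left out:   {clean:.2f}")
print(f"collider included:   {with_collider:.2f}")
```

Under these assumed coefficients, omitting the confounder inflates the estimate to roughly 1.7, while adding the collider pushes an otherwise unbiased estimate to roughly 0. In both scenarios the regression itself runs without complaint, so nothing in the output warns us that the estimate is wrong.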
So causal inference requires a different mindset. Instead of thinking “let’s throw everything in and let the model decide,” we have to be selective. Even if we have hundreds of variables, missing a single important confounder can ruin the analysis. At the same time, including the wrong kind of variable can introduce bias that wasn’t there to begin with.
It’s not about more data. It’s about the right data. And a lot of the causal inference process is focused on figuring out exactly what that means in the context of the question we’re trying to answer.
From Validation Metrics to Assumption Checking and Sensitivity Analysis
Data scientists are trained to care about model performance. We track metrics like accuracy, mean squared error, AUC, and R-squared. We tune hyperparameters, run cross-validation, and iterate until the model performs well on the test set. But when it comes to causal inference, none of this guarantees that we’re estimating a causal effect correctly.
That’s because causal inference isn’t evaluated by predictive accuracy. It’s evaluated by whether we’ve successfully removed bias. And that depends entirely on whether the assumptions behind our analysis are valid. This can feel unintuitive. A model with weak predictive performance might still produce an accurate causal estimate. And a model with excellent predictive performance might still produce a completely biased one.
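Here is a minimal sketch of that mismatch, again on synthetic data with an assumed true effect of 1.0. Model A uses only the treatment and predicts poorly because the outcome is noisy. Model B also includes a hypothetical variable `m` that is a downstream consequence of the outcome, which makes prediction easy but distorts the treatment coefficient. All names and numbers are illustrative assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients with the intercept first."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """R-squared of a fitted linear model evaluated on the given data."""
    X1 = np.column_stack([np.ones(len(y)), X])
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 100_000
true_effect = 1.0  # assumed causal effect of treatment on outcome

t = rng.normal(size=n)                               # treatment, unconfounded by construction
y = true_effect * t + rng.normal(scale=3.0, size=n)  # very noisy outcome
m = y + rng.normal(scale=0.5, size=n)                # consequence of the outcome

# Simple train/test split to mimic a standard predictive workflow.
train, test = slice(0, n // 2), slice(n // 2, n)

X_a = t[:, None]                # Model A: treatment only
X_b = np.column_stack([t, m])   # Model B: treatment plus the downstream variable

coef_a = fit_ols(X_a[train], y[train])
coef_b = fit_ols(X_b[train], y[train])

r2_a = r_squared(coef_a, X_a[test], y[test])
r2_b = r_squared(coef_b, X_b[test], y[test])
effect_a, effect_b = coef_a[1], coef_b[1]

print(f"Model A: test R^2 = {r2_a:.2f}, estimated effect = {effect_a:.2f}")
print(f"Model B: test R^2 = {r2_b:.2f}, estimated effect = {effect_b:.2f}")
```

With these assumed settings, Model A has a test R-squared near 0.1 yet recovers an effect close to 1.0, while Model B has a test R-squared above 0.9 and an estimated effect close to 0. Picking the model by test-set performance would select the one with the wrong causal answer.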
With causal inference, our attention has to shift. Instead of fixating on performance metrics, we need to fixate on the validity of our assumptions. Have we correctly identified the variables that introduce bias? Have we adjusted for them properly? Are we confident that the relevant variables have been measured accurately? And what about variables we haven’t observed? Could unmeasured confounders be distorting our conclusions?
In predictive modeling, we can evaluate performance by comparing predictions to actual outcomes. If the model says a customer will churn and they do, that’s a correct prediction. We get clear, direct feedback. But in causal inference, we don’t have that luxury. We are estimating causal effects, and for any given customer we only ever observe the outcome under the treatment they actually received, so there is no record of the ‘true causal effect’ to compare our estimate against.
This means we need to get comfortable with tools like sensitivity analysis. This involves checking how sensitive our causal estimates are to potential violations of our assumptions. For example, if there were an unobserved confounder, how strong would it need to be to overturn our conclusion? These kinds of checks don’t give us definitive answers, but they help us understand how robust our analysis really is.
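One simple way to get a feel for this is a grid-based check using the omitted-variable-bias formula for linear models: the naive estimate equals the true effect plus `gamma * delta`, where `gamma` is the effect of the unobserved confounder on the outcome and `delta` is the slope of that confounder regressed on the treatment. The sketch below is a deliberately simplified illustration, not a full sensitivity analysis method. The naive estimate of 0.50 and the range of strengths are hypothetical values, and the function name `adjusted_effect` is my own.

```python
import numpy as np

def adjusted_effect(naive_effect, gamma, delta):
    """Effect estimate after removing the bias of a hypothetical confounder U.

    gamma: assumed effect of U on the outcome (per unit of U)
    delta: assumed slope of U regressed on the treatment
    Uses the linear omitted-variable-bias relation: naive = true + gamma * delta.
    """
    return naive_effect - gamma * delta

naive_effect = 0.50                  # hypothetical estimate from our analysis
strengths = np.linspace(0.0, 1.0, 6)  # hypothetical grid: 0.0, 0.2, ..., 1.0

# Rows vary gamma, columns vary delta.
grid = np.array([[adjusted_effect(naive_effect, g, d) for d in strengths]
                 for g in strengths])

# Cells where the hypothetical confounder would fully explain away the effect.
explained_away = grid <= 0

print(np.round(grid, 2))
print("Combinations that overturn the conclusion:", int(explained_away.sum()))
```

Reading the grid answers the question in practical terms: the conclusion only flips when the product of the two strengths reaches the size of the naive estimate, and domain knowledge then has to judge whether a confounder that strong is plausible.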
This lack of clear-cut validation is uncomfortable for most people. It’s a very different mindset from prediction tasks, where performance is straightforward to track. But if you want to answer causal questions and you can’t run a randomized experiment, this is the reality you have to work with. You can either limit yourself to predictive questions, or you can accept that causal inference involves more uncertainty and start learning how to manage it. That means getting familiar with the assumptions your analysis relies on and using every available tool to assess how strong your conclusions really are.
Conclusion
Causal inference challenges many of the habits we’ve developed in data science. It asks us to step away from data-first workflows, to value relevance over volume, and to prioritize mapping out assumptions over clear-cut validation metrics. These shifts aren’t always easy, especially when they go against the instincts built from years of working with predictive models. But they are necessary.