This is a playbook for statisticians on how to use advanced machine learning safely when answering questions like “Does this drug really reduce the risk of death or relapse over time?” It combines causal inference math with survival analysis so that researchers can get more reliable answers from complex clinical data without fooling themselves.
In drug development and epidemiology, time‑to‑event questions (e.g., time to disease progression, death, relapse, or an adverse event) are central. Standard survival models such as Cox regression often break down when there are many covariates, non‑linear relationships, or time‑varying treatments and confounders. This paper provides guidelines for using Targeted Maximum Likelihood Estimation (TMLE) together with flexible machine learning algorithms to estimate causal effects on time‑to‑event outcomes in a way that remains statistically valid and robust under complex data conditions.
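To make the TMLE idea concrete, the following is a minimal sketch and not the paper's procedure. It makes strong simplifying assumptions: the time‑to‑event outcome is reduced to "event by a fixed horizon" with no censoring, treatment is a single binary decision at baseline, and plain logistic regressions (fitted with NumPy) stand in for the ensemble learners a real analysis would use. The function names (`simulate`, `fit_logistic`, `tmle_risk_difference`) and the simulated data are hypothetical and exist only for illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def bound(p, lo=1e-6):
    # Keep probabilities away from 0 and 1 so logit() stays finite.
    return np.clip(p, lo, 1.0 - lo)

def fit_logistic(X, y, offset=None, n_iter=50):
    """Logistic regression via Newton-Raphson; returns the coefficients."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        # Tiny ridge term only for numerical stability of the solve.
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return beta

def simulate(n=5000, seed=0):
    """Hypothetical data: W confounders, A treatment, Y event by the horizon."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 2))
    A = rng.binomial(1, expit(0.5 * W[:, 0] - 0.5 * W[:, 1]))
    Y = rng.binomial(1, expit(-0.5 - 1.0 * A + 0.8 * W[:, 0] + 0.5 * W[:, 1]))
    return W, A, Y

def tmle_risk_difference(W, A, Y):
    """TMLE of E[Y(1)] - E[Y(0)] for a binary outcome and binary treatment."""
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)

    # Step 1: initial outcome regression Q(A, W) = P(Y = 1 | A, W).
    b_q = fit_logistic(np.column_stack([ones, A, W]), Y)
    q_a = bound(expit(np.column_stack([ones, A, W]) @ b_q))
    q_1 = bound(expit(np.column_stack([ones, ones, W]) @ b_q))
    q_0 = bound(expit(np.column_stack([ones, zeros, W]) @ b_q))

    # Step 2: propensity score g(W) = P(A = 1 | W), truncated for stability.
    x_g = np.column_stack([ones, W])
    g = np.clip(expit(x_g @ fit_logistic(x_g, A)), 0.01, 0.99)

    # Step 3: clever covariate for the risk difference.
    h = A / g - (1 - A) / (1 - g)

    # Step 4: targeting step, a one-parameter logistic fluctuation
    # of the initial fit with logit(Q) as a fixed offset.
    eps = fit_logistic(h[:, None], Y, offset=logit(q_a))[0]

    # Step 5: updated counterfactual predictions and plug-in estimate.
    q_1s = expit(logit(q_1) + eps / g)
    q_0s = expit(logit(q_0) - eps / (1 - g))
    psi = float(np.mean(q_1s - q_0s))

    # Influence-curve-based standard error.
    q_as = np.where(A == 1, q_1s, q_0s)
    ic = h * (Y - q_as) + (q_1s - q_0s) - psi
    se = float(np.sqrt(np.var(ic) / n))
    return psi, se
```

Calling `tmle_risk_difference(*simulate())` returns a point estimate of the risk difference and an influence‑curve‑based standard error. A realistic survival application would additionally model the censoring mechanism, handle time‑varying treatments and confounders, and replace the two logistic fits with cross‑validated ensembles, all of which this sketch leaves out.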
Methodological know‑how and implementation expertise in TMLE plus survival analysis using ML, together with access to rich longitudinal clinical or real‑world data, can form a defensible capability that is hard for less advanced organizations to replicate quickly.
Classical-ML (Scikit/XGBoost)
Time-Series DB
High (Custom Models/Infra)
Computational cost and complexity of fitting ensemble ML models (Super Learner) within TMLE for large, high‑dimensional longitudinal datasets, plus the need for specialized statistical expertise to correctly specify the causal assumptions and assess how plausible they are.
Early Majority
Focuses specifically on combining TMLE with machine learning for causal estimation in time‑to‑event (survival) outcomes, offering step‑by‑step best‑practice guidance. This is more specialized than generic causal inference or generic survival modeling, and is tailored to the kinds of longitudinal and censored data structures common in pharma and epidemiology.