I first encountered TMLE—sometimes spelled out as targeted maximum likelihood estimation or targeted minimum loss-based estimation—about twelve years ago when Mark van der Laan, one of the original developers who literally wrote the book, gave a talk at NYU. It sounded very cool and seemed quite revolutionary and important, but it was challenging to follow all of the details. Following that talk, I tried to tackle some of the literature, but quickly found it a challenge to penetrate. What tripped me up was not so much the algorithmic complexity (though there was certainly plenty of that) as the language, the terminology, and the underlying math.
Recently, I inherited a project from a colleague who had proposed using TMLE to analyze a cluster randomized trial using a stepped-wedge design. To decide whether we would continue with this plan, I needed to revisit the literature to see if I could make more progress this time around. There are certainly more tutorials available, as well as improved software and documentation, so it is much easier to get up and running to generate estimates. I was even able to build a model for the stepped-wedge design (which I hope to share on the blog at some point soon). Beyond this, I really wanted to get a deeper understanding of the mathematical model that underlies the method without getting too far into the weeds (and proofs).
I think I have made some progress in developing my understanding, and I wanted to write it down here for my future self (and for anyone else who wants to follow along). This post is definitely not a tutorial, nor is it a literature review. It’s my attempt to record my understanding of the conceptual ideas that underpin the TMLE algorithm.
What problem is TMLE trying to solve?
At its core, TMLE addresses a tension that appears throughout modern causal inference and semiparametric statistics. On one hand, we often want to estimate parameters that depend on complex features of the data-generating process, such as causal effects that require modeling both outcomes and treatment assignment mechanisms. On the other hand, more and more we rely on flexible machine learning methods to estimate these components, which can require large amounts of data before their estimates stabilize and may be difficult to analyze using classical statistical theory.
Classical statistical inference typically relies on relatively simple, structured models where bias, variance, and uncertainty can be derived directly from model assumptions. Flexible machine learning models break many of those assumptions. They can be highly adaptive, involve many tuning choices, and may not correspond to smooth parametric likelihoods. As a result, while they can produce excellent predictions, their statistical behavior is often difficult to characterize analytically.
TMLE provides a framework for resolving this tension. Rather than building estimators directly from outcome or treatment models, which are scientifically important but statistically treated as nuisance components, TMLE constructs estimators whose statistical behavior is governed by a known influence function. Once an estimator has the correct influence function, its large-sample bias, variance, and uncertainty can be characterized using general asymptotic theory, even when nuisance components are estimated flexibly.
What follows is a description of four key elements that, for me, have started to make TMLE make sense: (1) viewing parameters as functionals of probability distributions, (2) describing sampling as small perturbations of the data-generating distribution, (3) linearizing parameters using influence functions, and (4) constructing estimators that are orthogonal to nuisance estimation error.
Estimators as functionals
An estimator, such as a mean or causal effect, is not merely a formula applied to data, but can be thought of more formally as a functional, a mapping that takes a probability distribution \(P\) in a family of distributions \(\mathcal{P}\) and returns a number \(T(P)\): \[ T : \mathcal{P} \to \mathbb{R} \] Here \(\mathcal{P}\) just denotes the collection of probability distributions under consideration. Writing \(T : \mathcal{P} \to \mathbb{R}\) emphasizes that a statistical parameter is not a property of a particular data set, but of the data-generating distribution. The data set enters only through the empirical distribution \(P_n\).
From this perspective, the key question becomes how \(T\) changes when the data-generating distribution changes slightly—for example, by comparing \(T(P_1)\) with \(T(P_2)\), where \(P_1, P_2 \in \mathcal{P}\).
As an example, the mean is a functional:
\[T(P) = \int z\ dP(z).\] If \(P_0\) is the true data-generating distribution, then the target parameter is \(T(P_0)\).
\(P_n\) is the empirical distribution (i.e., the distribution induced by the observed sample), \[P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i},\] and the estimator is simply \(T(P_n)\). The notation \(\delta_z\) denotes a point mass at \(z\): a probability distribution that assigns probability 1 to the single value \(z\). Writing the empirical distribution as an average of point masses simply formalizes the idea that the observed data place equal weight on each sampled point. The empirical distribution is therefore a discrete approximation to the true data-generating distribution \(P_0\), placing mass \(1/n\) on each observed value.
Ultimately, estimation is about quantifying how far our estimate \(\hat{\theta}\) based on the empirical distribution deviates from the target parameter \(\theta_0\) defined by the true distribution:
\[\hat{\theta} - \theta_0 = T(P_n) - T(P_0).\] Written this way, estimation error is simply the difference between evaluating the same functional at two nearby distributions: the true distribution \(P_0\) and its empirical approximation \(P_n\). Everything relies on understanding how \(T(P)\) changes when \(P\) changes from \(P_0\) to \(P_n\).
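To make this concrete, here is a minimal numerical sketch of the mean functional. The choice of \(P_0\) (a Normal(2, 1) distribution) and the sample size are purely illustrative, my own choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(123)

# True data-generating distribution P0: Normal(mu = 2, sd = 1),
# so the target parameter is T(P0) = 2 (illustrative choice).
theta_0 = 2.0

# The empirical distribution P_n places mass 1/n on each observation,
# so evaluating the mean functional at P_n is just the sample mean.
z = rng.normal(loc=theta_0, scale=1.0, size=10_000)
theta_hat = z.mean()          # T(P_n)

# Estimation error: the same functional evaluated at two nearby distributions.
error = theta_hat - theta_0   # T(P_n) - T(P_0)
print(theta_hat, error)
```

With 10,000 observations the error is small, but never exactly zero; the rest of the post is about characterizing precisely how it behaves.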
Sampling as a perturbation
The empirical distribution \(P_n\) differs from \(P_0\) by many small deviations. As we just saw, each observation contributes a point mass of size \(1/n\). It is tempting to write this difference as \[P_n − P_0,\] but that isn’t really all that helpful. The object \(P_n − P_0\) is not itself a probability distribution but a signed measure that records where the empirical distribution over- or under-represents the truth. One way to conceptualize \(P_n − P_0\) is as a balance sheet: where the empirical distribution places more mass than the truth, the difference is positive; where it places less mass, the difference is negative. The total of these differences always sums to zero, but the pattern of positive and negative deviations determines how functionals of the distribution fluctuate.
\(P_n − P_0\) is not used on its own. It only appears through its action on functions. For any function \(f\),
\[(P_n −P_0)f \equiv \int f(z) \ dP_n(z) − \int f(z) \ dP_0(z).\] This quantity measures how much the empirical average of \(f(Z)\) deviates from its population mean.
In this context, we can think of \(P_n − P_0\) heuristically as a sum of many small perturbations to the true distribution: regions of the sample space where the empirical distribution places slightly too much mass contribute positively, and regions where it places too little mass contribute negatively. The total deviation integrates to zero, but its interaction with specific functions can be nonzero.
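A small sketch of \((P_n - P_0)f\) in action, using the illustrative choices \(P_0 = \text{Normal}(0, 1)\) and \(f(z) = z^2\), so that the population mean \(E_{P_0}[f(Z)] = 1\) is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(7)

# P0 = Normal(0, 1) and f(z) = z^2, so E_{P0}[f(Z)] = Var(Z) = 1 exactly.
n = 50_000
z = rng.normal(size=n)
f = z**2

empirical = f.mean()    # integral of f dP_n: the empirical average
population = 1.0        # integral of f dP_0: known in closed form here

# (P_n - P_0)f: how much the empirical average deviates from the mean.
fluctuation = empirical - population
print(fluctuation)
```

The signed measure \(P_n - P_0\) never appears by itself in the code; it only shows up through this difference of integrals, exactly as in the definition above.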
This brings us a little closer to the question we ended with in the last section: how sensitive is the functional \(T(P)\) to small changes in the underlying distribution? To answer this question, we need a precise way to describe an infinitesimal change in a probability distribution, one that preserves total probability mass and allows differentiation. Hampel’s contamination model provides exactly such a construction.
Hampel’s contamination model and the influence function
Answering that question requires an understanding of the influence function. After reading about TMLE for a few weeks, I stumbled on Frank Hampel’s important paper, written in the early 1970s, which provides a nice explanation of the influence function in the context of robustness.
In that paper, Hampel formalized what is meant by a “small change” in a distribution by introducing a specific directional perturbation of the data-generating distribution. In other words, this construction provides a way to define derivatives of statistical parameters with respect to the underlying distribution itself.
As before, if \(P_0\) is the true distribution, and \(\delta_z\) denotes the point mass at \(z\), then we can consider a perturbed distribution that is a mixture of the two:
\[P_{\epsilon} = (1−\epsilon)P_0 + \epsilon \delta_z,\ \ \ 0< \epsilon < 1.\] This doesn’t correspond to adding a full new data point. Instead, it represents adding an infinitesimal amount of probability mass at \(z\), defining a smooth path through the space of distributions along which derivatives can be taken.
The influence function of \(T\) at \(P_0\) is \(\phi_{P_0}\), defined as \[ \phi_{P_0}(z) \equiv \left. \frac{d}{d\epsilon} T(P_\epsilon) \right|_{\epsilon = 0} = \left. \frac{d}{d\epsilon} T\!\left((1-\epsilon)P_0 + \epsilon \delta_z\right) \right|_{\epsilon = 0}. \] In words, the influence function measures the first-order effect on \(T(P)\) of “nudging” the distribution at point \(z\).
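For the mean functional, this derivative can be checked numerically. Since \(T(P_\epsilon) = (1-\epsilon)\mu_0 + \epsilon z\), the influence function should be \(\phi_{P_0}(z) = z - \mu_0\). The sketch below (with illustrative values of \(\mu_0\) and \(z\) that I picked for the example) confirms this by finite differences:

```python
# For the mean functional T(P) = E_P[Z], the contaminated distribution
# P_eps = (1 - eps) P0 + eps * delta_z has mean
#     T(P_eps) = (1 - eps) * mu0 + eps * z,
# so the influence function should be phi(z) = z - mu0.
mu0 = 2.0   # T(P0); illustrative value
z = 5.0     # the point at which we nudge the distribution

def T_eps(eps: float) -> float:
    """Mean of the contaminated mixture (1 - eps) P0 + eps * delta_z."""
    return (1 - eps) * mu0 + eps * z

# Finite-difference approximation to d/d_eps T(P_eps) at eps = 0.
eps = 1e-6
phi_numeric = (T_eps(eps) - T_eps(0.0)) / eps

print(phi_numeric, z - mu0)   # both approximately 3.0
```

Nudging the distribution toward \(z = 5\) pulls the mean upward at rate \(z - \mu_0 = 3\), which is exactly what the influence function of the mean reports.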
The goal, however, is not to compute \(T(P_n)\) exactly, but to approximate how far it is from \(T(P_0)\). This difference \[T(P_n) − T(P_0)\] is the estimation error. Describing its behavior is what allows us to understand bias, variability, and ultimately statistical uncertainty.
Because \(P_n\) converges to \(P_0\) as the sample size grows, the difference between \(P_n\) and \(P_0\) becomes smaller. Rather than approximating \(T(P_n)\) directly, we approximate the change in the functional as we move from \(P_0\) to \(P_n\).
This is analogous to ordinary calculus, where we approximate the change \(f(x+h)−f(x)\) by a linear term involving the derivative of \(f\) at \(x\). Here, the role of the “increment” \(h\) is played by the signed difference \(P_n−P_0\), and the role of the derivative is played by the influence function. The influence function provides the best linear approximation to how \(T(P)\) responds to small perturbations of the underlying distribution.
This leads directly to a first-order (linear) expansion of the estimation error \(T(P_n) − T(P_0)\).
Linearization via the influence function
If \(T\) is sufficiently smooth, the first-order expansion is:
\[ T(P_n) − T(P_0) = (P_n−P_0)\phi_{P_0} + R_n, \] where the leading term is linear in the perturbation \(P_n − P_0\), and \(R_n\) collects higher-order terms. This expansion should be read as a functional Taylor expansion: the influence function \(\phi_{P_0}\) plays the role of a derivative, and \((P_n−P_0)\phi_{P_0}\) is the linear approximation to the change in \(T\) induced by replacing \(P_0\) with its empirical approximation \(P_n\).
Earlier, I defined \((P_n - P_0)f\) for a general function \(f\). Applying this definition when \(f\) is the influence function \(\phi_{P_0}\) gives \[ (P_n−P_0)\phi_{P_0} \equiv \int \phi_{P_0}(z) \ dP_n(z) − \int \phi_{P_0}(z) \ dP_0(z) = \frac{1}{n} \sum_{i=1}^n \phi_{P_0}(Z_i) - E_{P_0}[\phi_{P_0}(Z)]. \] The second term vanishes because influence functions are defined to satisfy the centering condition \[ E_{P_0}[\phi_{P_0}(Z)] = 0. \] \(R_n\) is a remainder term and plays a crucial role in asymptotic theory. If \[ R_n = o_p(n^{-1/2}), \] then multiplying both sides of the expansion by \(\sqrt{n}\) yields \[ \sqrt{n}\big(T(P_n) - T(P_0)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \phi_{P_0}(Z_i) + o_p(1). \] The appearance of the \(\sqrt{n}\) scaling reflects a universal feature of averaging noisy data. Each observation carries some randomness, and when we average \(n\) independent observations, the noise partly cancels out. The variability of the average shrinks as the sample grows, with the standard deviation decreasing at a rate proportional to \(1/\sqrt{n}\). Multiplying the estimation error by \(\sqrt{n}\) puts it on a scale where the random fluctuations remain visible rather than collapsing toward zero. This is why \(\sqrt{n}(\hat{\theta} - \theta_0)\) often settles into a stable, typically Gaussian or normal, distribution in large samples.
Because the influence function has mean zero, the central limit theorem implies that the right-hand side converges in distribution to a normal random variable. This is what ultimately allows us to characterize the sampling variability of the estimator. TMLE is designed so that its estimation error admits exactly this kind of \(\sqrt{n}\)-scaled linear expansion, even when parts of the model are estimated using flexible machine learning.
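For the mean functional, the remainder \(R_n\) is exactly zero, which makes it a convenient test case for the \(\sqrt{n}\)-scaled expansion. A quick Monte Carlo sketch, with illustrative parameters of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(42)

# For the mean, T(P_n) - T(P_0) = (1/n) * sum of phi(Z_i) with phi(z) = z - mu0
# and R_n = 0, so the sqrt(n)-scaled error should look Normal(0, Var(phi)).
mu0, sigma = 2.0, 1.0     # illustrative P0 = Normal(2, 1), so Var(phi) = 1
n, reps = 400, 5_000

# Draw `reps` samples of size n and compute sqrt(n) * (T(P_n) - T(P_0)).
z = rng.normal(mu0, sigma, size=(reps, n))
scaled_errors = np.sqrt(n) * (z.mean(axis=1) - mu0)

# Mean should be near 0 and variance near Var(phi) = sigma^2 = 1.
print(scaled_errors.mean(), scaled_errors.var())
```

Unscaled, the errors would collapse toward zero as \(n\) grows; multiplying by \(\sqrt{n}\) keeps the random fluctuation on a stable scale, where the central limit theorem takes over.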
Orthogonality under estimation of the influence function
Up to this point, we’ve assumed that \(\phi_{P_0}\) is known, which is never going to be the case. In most problems of interest, particularly causal inference, the influence function depends on additional unknown components of the data-generating process, such as conditional outcome models or treatment assignment mechanisms. These are often called nuisance functions: they are not themselves the primary target of inference but they are required in order to construct the influence function. In practice, they need to be estimated from the data, often using flexible statistical or machine learning methods.
Once nuisance functions are estimated, we replace the true influence function with an estimated one. The leading term then becomes \((P_n - P_0)\phi_{\hat P}\) and can be expanded as \[ (P_n - P_0)\phi_{\hat P} = (P_n - P_0)\phi_{P_0} + (P_n - P_0)(\phi_{\hat P} - \phi_{P_0}). \] The first term is what we described above, and has order \(n^{-1/2}\). The second term in the expansion reflects error from estimating the nuisance functions. If nuisance errors enter the expansion at first order, they contaminate the leading behavior of the estimator. In particular, even small misspecifications in the nuisance models may introduce bias that does not disappear in large samples.
To understand what “first order” and “second order” mean here, it helps to think in terms of convergence rates.
The empirical fluctuation term (with the unobserved true influence function) satisfies \[ (P_n - P_0)\phi_{P_0} = O_p(n^{-1/2}), \] which determines the usual \(n^{−1/2}\) convergence and asymptotic normality of the estimator.
Now suppose the nuisance estimators converge at rate \(r_n\), meaning \[ \Vert \phi_{\hat{P}} - \phi_{P_0}\Vert = O_p(r_n). \] Then the additional term behaves roughly like \[ \big(P_n−P_0\big)\big(\phi_{\hat{P}} − \phi_{P_0}\big) = O_p\big(n^{−1/2}r_n\big). \] If the influence function is not constructed to neutralize nuisance error, this term can be of the same order as the leading term. In that case, estimation error in the nuisance models directly affects the behavior of the estimator.
Orthogonality prevents this from happening. When the influence function is constructed to be orthogonal to nuisance perturbations, the first-order effect of nuisance error disappears. This really means that the estimator has zero slope with respect to small perturbations in the nuisance functions: tiny changes in those auxiliary models do not immediately move the target parameter. Under orthogonality, the nuisance contribution is second order, of magnitude \(r_n^2\). If \(r_n\) converges faster than \(n^{-1/4}\), then \(r_n^2\) shrinks faster than \(n^{-1/2}\) and does not affect the limiting distribution.
Importantly, “shrinking fast enough” does not mean the nuisance models must be perfectly specified or highly accurate in finite samples. It only requires that their estimation error decreases with sample size at a modest rate. Orthogonality relaxes the burden on nuisance estimation: instead of requiring parametric \(n^{−1/2}\) accuracy, it typically suffices that nuisance estimators converge at rate faster than \(n^{−1/4}\). This is a much weaker requirement and allows the use of flexible machine learning methods.
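To put rough numbers on this, here is a small sketch comparing the nuisance rate \(r_n\) and its square \(r_n^2\) against the \(n^{-1/2}\) scale of the leading term, under the hypothetical assumption that the nuisance estimators converge at exactly the boundary rate \(r_n = n^{-1/4}\):

```python
# Rate arithmetic for the nuisance terms, assuming (hypothetically) that
# the nuisance estimators converge at the boundary rate r_n = n^(-1/4).
# What matters is whether the squared rate r_n^2 shrinks at least as
# fast as n^(-1/2), the scale of the leading term.
for n in [100, 10_000, 1_000_000]:
    target_rate = n ** -0.5   # n^(-1/2): scale of the leading term
    r_n = n ** -0.25          # assumed nuisance convergence rate
    r_n_sq = r_n ** 2         # squared rate, relevant under orthogonality
    print(f"n={n:>9,}  n^-1/2={target_rate:.6f}  "
          f"r_n={r_n:.6f}  r_n^2={r_n_sq:.6f}")
```

At every \(n\), the raw rate \(r_n\) is far larger than \(n^{-1/2}\), which is why first-order nuisance error is fatal; squaring it brings it exactly down to the \(n^{-1/2}\) boundary, so any rate even slightly faster than \(n^{-1/4}\) renders the nuisance contribution negligible.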
This orthogonality allows flexible, slower-converging nuisance estimators to be used without contaminating the primary asymptotic behavior of the target parameter. More on how that influence function is constructed will hopefully follow in the next post that looks more closely at TMLE.
Next steps
In the next post, I’ll take the same approach to walk through the main elements of the TMLE algorithm, focusing on how they operationalize the ideas developed here.
UPDATE: I followed up this post with what is essentially an addendum to explore this notion of orthogonality using simulation.
References:
Van der Laan, Mark J., and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Vol. 4. New York: Springer, 2011.
Hampel, Frank R. “The influence curve and its role in robust estimation.” Journal of the American Statistical Association 69, no. 346 (1974): 383–393.
Support:
This work was supported by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.