Getting to the bottom of TMLE: forcing the target to behave

In the last couple of posts (starting here), I’ve tried to unpack some of the ideas that sit underneath TMLE: viewing parameters as functionals of a distribution, thinking about sampling as a perturbation, and understanding how influence functions describe the leading behavior of estimation error. In the second post, I showed through simulation how errors in nuisance estimation can interact with sampling variability, but typically have a smaller effect than the main sampling fluctuation itself. This brings us to the central idea behind TMLE.

(This series is not meant to be a tutorial, just a set of notes I put together while trying to wrap my head around this important tool. If you are looking for a more comprehensive introduction, there is an excellent collection of videos and tutorials available here.)

If we knew the true exposure and outcome models, the empirical mean of the influence function would already be (approximately) zero, and the estimator would differ from the truth only because of sampling variability.

In practice, we estimate those nuisance models without full knowledge. Even small mistakes can disturb the balance that keeps the influence function centered. When that happens, the final estimate can be pulled away from the truth not just by sampling noise, but by imperfections in the nuisance models.

In this simple continuous-outcome setup, TMLE makes a targeted adjustment to the initial outcome fit so that the empirical mean of the estimated influence function is driven to zero. After this adjustment, the remaining discrepancy behaves like ordinary sampling noise rather than model-driven bias. Rather than simply improving the nuisance fits themselves, TMLE tries to correct the behavior of the target parameter.

To see how this plays out in a causal setting, we first need to be explicit about the parameter we’re trying to estimate.

A brief causal grounding

In causal inference, the parameters we care about are usually contrasts between potential outcomes. For a binary treatment \(A\), each unit has two potential outcomes: \[ Y(1),\ Y(0), \] representing the outcomes that would be observed under treatment and control. A common causal target is the average treatment effect: \[ \psi_0 = E[Y(1) - Y(0)]. \] (In this post, I move from the generic functional notation \(T(P)\) used earlier to the specific causal parameter \(\psi(P)\), the average treatment effect.) Because we observe only one of these for each person, estimating this quantity requires assumptions (consistency, exchangeability, and positivity) that allow us to express it as a functional of the observed data distribution.

TMLE operates entirely within this observed-data framework. It does not try to recover individual counterfactuals. Instead, it constructs an estimator whose statistical behavior is aligned with the influence function of the causal parameter.

Why nuisance models matter

In an ideal world, we would observe both potential outcomes and simply average their differences. In reality, we only observe one outcome per person, so we rely on models for the outcome and treatment mechanism to fill in the missing structure. These nuisance models help identify the causal effect, but if they are imperfect (and more likely than not, they will be), their errors can bias the final estimate.

TMLE begins with these initial nuisance estimates but makes a small, carefully chosen adjustment so that their errors interact rather than accumulate. The targeting step is chosen so that the empirical average of the estimated influence function equals zero, the centering property discussed earlier.

In this way, TMLE does not attempt to perfectly reconstruct missing counterfactuals. Instead, it realigns the estimate so that it responds primarily to genuine sampling noise rather than to quirks of the nuisance models.

The ATE and its efficient influence function

Let \(Z = (X, A, Y)\) denote observed data: baseline covariates \(X\), binary treatment \(A \in \{0,1\}\), and outcome \(Y\).

We can define the outcome regression (\(Q\)) and treatment mechanism (\(g\)) as \[Q_a(x) = E[Y \mid A = a, X = x], \ \ \ g(x) = P(A = 1 \mid X = x).\] The ATE can be written as the functional \[ \psi(P) = E_P\big[Q_1(X) - Q_0(X)\big], \] with the target parameter \(\psi_0 = \psi(P_0)\) under the true distribution \(P_0\).
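To make the notation concrete, here is a minimal Python sketch of the plug-in functional \(P_n\big[Q_1(X) - Q_0(X)\big]\). The simulated data-generating process and the fitted \(Q\) are invented for illustration; to keep things transparent, the "fit" below is just the true conditional mean, so the plug-in recovers the true ATE of 0.5 exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)                       # baseline covariate
g = 1 / (1 + np.exp(-0.4 * X))               # true propensity P(A=1|X)
A = rng.binomial(1, g)                       # treatment
Y = 1.0 + 0.5 * A + 0.8 * X + rng.normal(scale=0.5, size=n)  # true ATE = 0.5

# Hypothetical outcome-regression fit Q_hat(a, x); here it equals the truth
def Q_hat(a, x):
    return 1.0 + 0.5 * a + 0.8 * x

# Plug-in estimate of the ATE: psi = P_n[Q_1(X) - Q_0(X)]
psi_plugin = np.mean(Q_hat(1, X) - Q_hat(0, X))
```

In practice \(\hat Q\) would come from a regression fit, and the plug-in would inherit that fit's errors; that is exactly the gap the rest of the post addresses.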

Conceptually, the influence function comes from the same perturbation-and-differentiation process discussed in the initial post: we slightly perturb the underlying distribution and examine the component of the resulting change in the ATE that dominates when the perturbation is small. It turns out that this dominant component can be written as the perturbation \(P_n - P_0\) acting on a particular function of the data: \[\psi(P_n) - \psi(P_0) \approx (P_n - P_0)\phi_{P_0}.\] Here \(\phi_{P_0}\) is the efficient influence function, and \((P_n - P_0)\phi_{P_0}\) simply means the difference between the empirical and population averages of \(\phi_{P_0}(Z)\).

The version shown below, which I am not explicitly deriving, is the efficient influence function for the ATE: \[ \phi_P(Z) = \big( Q_1(X) - Q_0(X) - \psi(P)\big) + \frac{A}{g(X)} \big(Y - Q_1(X)\big) - \frac{1-A}{1 - g(X)} \big(Y - Q_0(X)\big). \] A key element of the EIF is its structure: it combines a plug-in piece involving the conditional mean outcomes and a residual-correction piece weighted by the propensity score.
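Here is the same formula as a short Python sketch, evaluated at the true nuisances on a made-up simulated dataset (names and data-generating process are my own). With the true \(Q\) and \(g\) plugged in, the empirical mean of \(\phi\) should be near zero, up to ordinary sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
g = 1 / (1 + np.exp(-0.4 * X))               # true propensity
A = rng.binomial(1, g)
Y = 1.0 + 0.5 * A + 0.8 * X + rng.normal(scale=0.5, size=n)  # true ATE = 0.5

def Q(a, x):                                  # true outcome regression
    return 1.0 + 0.5 * a + 0.8 * x

psi = np.mean(Q(1, X) - Q(0, X))             # plug-in ATE

# EIF: plug-in piece + propensity-weighted residual pieces
phi = (Q(1, X) - Q(0, X) - psi) \
    + A / g * (Y - Q(1, X)) \
    - (1 - A) / (1 - g) * (Y - Q(0, X))

# With the true nuisances, this is O(n^{-1/2}) sampling noise around zero
phi_bar = np.mean(phi)
```

Substituting estimated nuisances for the true ones in this snippet is exactly what breaks the centering discussed next.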

It turns out that if we plug in the true outcome and treatment models, then the EIF is centered under the true distribution: \[ E_{P_0}[\phi_{P_0}(Z)] = 0. \] But if we plug in estimated nuisances, the empirical mean typically won’t be zero:

\[P_n \phi_{\hat{P}} = \frac{1}{n} \sum_{i=1}^{n} \phi_{\hat{P}}(Z_i) \ne 0.\] This matters, because the ideal first-order expansion behaves like \[\psi(P_n) - \psi(P_0) \approx (P_n - P_0)\phi_{P_0}.\] That approximation only behaves as expected if the influence function balances out in the sample (i.e., its average is zero). When \(P_n \phi_{\hat P} \neq 0\), that balance is broken, and the estimator drifts away from its clean first-order description.

TMLE restores the balance by slightly adjusting the nuisance fits until the empirical mean of the estimated influence function is brought back to zero.

Evaluating the EIF at the initial fit

Let \(\hat P^0\) denote the observed-data distribution indexed by the initial nuisance estimates \((\hat Q^0,\hat g,\hat\psi^0)\). The estimated EIF at that initial fit is \[ \phi_{\hat P^0}(Z) = \big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big) + \frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big) - \frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big), \] where I’m using \(\hat Q^0_a(X)\) as shorthand for \(\hat Q^0(a,X)\).

Now compute its empirical mean: \[ P_n \phi_{\hat P^0} = \frac{1}{n}\sum_{i=1}^n \phi_{\hat P^0}(Z_i). \] If this equals zero, you’re already in great shape: your estimator behaves (to first order) like the ideal one with a centered EIF. If it doesn’t equal zero, TMLE does not throw away the nuisance fits. Instead it “tilts” them just enough to remove the imbalance.

Bring in the clever covariate

This raises the question of what that tilt should look like. From above, the EIF evaluated at the initial estimates is \[ \phi_{\hat P^0}(Z) = \big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big) + \frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big) - \frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big). \] Focusing on the part involving the observed outcome \(Y\): \[ \frac{A}{\hat g(X)}\big(Y - \hat Q_1^0(X)\big) - \frac{1-A}{1-\hat g(X)}\big(Y - \hat Q_0^0(X)\big), \] we can rewrite this in a slightly simpler form. Because only one of these terms is active for any individual (depending on whether \(A=1\) or \(A=0\)), the two elements can be combined into a single expression: \[ \left( \frac{A}{\hat g(X)} - \frac{1-A}{1-\hat g(X)} \right) \big(Y - \hat Q^0(A,X)\big). \] This motivates the definition of the clever covariate \[ H_{\hat g}(A,X) = \frac{A}{\hat g(X)} - \frac{1-A}{1-\hat g(X)}. \] With this notation, the outcome-dependent part of the EIF becomes \[ H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big). \] Now the EIF can be written more compactly as \[ \phi_{\hat P^0}(Z) = \big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big) + H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big). \] This decomposition makes the source of the imbalance easier to see. The plug-in estimator is \[ \hat\psi^0 = P_n\big(\hat Q_1^0(X) - \hat Q_0^0(X)\big), \] so by construction, \[ P_n\big(\hat Q_1^0(X) - \hat Q_0^0(X) - \hat\psi^0\big) = 0. \] That means any imbalance in the empirical EIF must come entirely from \[ P_n\left[ H_{\hat g}(A,X)\big(Y - \hat Q^0(A,X)\big) \right]. \] So the only part of the EIF we can directly influence is the residual \(Y - \hat Q^0(A,X)\). If we move \(\hat Q^0\) slightly until this residual imbalance disappears, we can bring the empirical EIF back into balance and better target the parameter.
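As a quick sanity check on the "only one term is active" rewriting above, here is a short sketch (on made-up simulated data) confirming that the single-expression clever covariate agrees with the piecewise two-term form:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=n)
g_hat = 1 / (1 + np.exp(-0.4 * X))   # an assumed fitted propensity
A = rng.binomial(1, g_hat)

# Clever covariate: combined single-expression form
H = A / g_hat - (1 - A) / (1 - g_hat)

# Piecewise form: +1/g_hat for treated units, -1/(1-g_hat) for controls
H_piecewise = np.where(A == 1, 1 / g_hat, -1 / (1 - g_hat))
```

For each unit, exactly one of the two terms in `H` is nonzero, so the two arrays match element by element.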

The fluctuation step

TMLE does not refit the outcome model from scratch. Instead, it introduces a one-dimensional update that adjusts the initial regression just enough to remove the imbalance in the empirical influence-function equation: \[ \hat Q^\epsilon(A,X) = \hat Q^0(A,X) + \epsilon H_{\hat g}(A,X). \] The parameter \(\epsilon\) controls how much we tilt the regression. We estimate \(\epsilon\) using the observed outcomes \(Y_i\); for a continuous outcome, this is a least-squares fit. The normal equation for this regression is \[ \sum_{i=1}^n H_{\hat g}(A_i, X_i) \big( Y_i - \hat Q^{\epsilon}(A_i,X_i) \big) = 0, \] which, after dividing both sides by \(n\), is equivalent to \[ P_n \Big[ H_{\hat g}(A,X)\big(Y-\hat Q^{\epsilon}(A,X)\big) \Big] = 0. \] Define the updated regression \(\hat Q^*\) by plugging in the estimated \(\hat\epsilon\): \[ \hat Q^*(A,X) = \hat Q^{\hat\epsilon}(A,X). \] Once we have \(\hat Q^*\), we update the ATE estimate to get the TMLE estimate using the usual plug-in formula \[ \hat\psi^* = \frac{1}{n}\sum_{i=1}^n \big( \hat Q^*(1,X_i) - \hat Q^*(0,X_i) \big). \] The updated estimated influence function becomes \[ \phi_{\hat P^*}(Z) = \underbrace{\big(\hat Q^*_1(X) - \hat Q^*_0(X) - \hat\psi^*\big)}_{\text{plug-in}} + \underbrace{H_{\hat g}(A,X)\big(Y - \hat Q^*(A,X)\big)}_{\text{weighted residual error}}. \] The plug-in construction guarantees that the first term has empirical mean zero, while the normal equation above ensures that the residual term also has empirical mean zero. As a result, \[ P_n \phi_{\hat P^*} \approx 0. \] In other words, the targeting step removes the residual imbalance in the efficient influence function within the observed sample, and the behavior of the estimator matches the ideal first-order expansion.
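Putting the pieces together, here is a minimal sketch of the fluctuation step, with a deliberately biased initial outcome fit and a correctly specified propensity. All names and the simulated setup are my own; the closed-form \(\hat\epsilon\) is just the no-intercept least-squares solution to the normal equation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=n)
g_true = 1 / (1 + np.exp(-0.4 * X))
A = rng.binomial(1, g_true)
Y = 1.0 + 0.5 * A + 0.8 * X + rng.normal(scale=0.5, size=n)  # true ATE = 0.5

# Initial fits: a deliberately biased outcome regression, a correct propensity
def Q0_hat(a, x):
    return 0.9 + 0.3 * a + 0.8 * x           # wrong intercept and effect
g_hat = g_true

# Clever covariate at the observed treatment
H = A / g_hat - (1 - A) / (1 - g_hat)

# epsilon: no-intercept least-squares regression of residuals on H
eps = np.sum(H * (Y - Q0_hat(A, X))) / np.sum(H ** 2)

# Targeted update; H(1,X) = 1/g_hat and H(0,X) = -1/(1-g_hat)
Q1_star = Q0_hat(1, X) + eps / g_hat
Q0_star = Q0_hat(0, X) - eps / (1 - g_hat)
psi_star = np.mean(Q1_star - Q0_star)        # TMLE plug-in estimate

# The normal equation now holds in-sample: this is ~0 by construction
score = np.mean(H * (Y - (Q0_hat(A, X) + eps * H)))
```

Note that the naive plug-in from the biased `Q0_hat` would center near 0.3, while the targeted `psi_star` moves back toward the truth because the propensity is correct, a preview of the double robustness discussed below.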

Returning to the nuisance interaction

In the earlier posts, I tried to argue that influence-function–based estimators behave well when the interaction term \[ (P_n - P_0)(\phi_{\hat P} - \phi_{P_0}), \] becomes small relative to the main sampling fluctuation. When that happens, the estimator behaves as if the true influence function were known. In the previous post, we explored this interaction through simulation and saw that it can shrink toward zero, though it may still be quite variable in finite samples when nuisance models are estimated imperfectly. The targeting step in TMLE is designed to enforce the empirical influence-function equation in the observed sample, which helps ensure that any remaining discrepancy appears only in the higher-order remainder.

To see how targeting helps achieve this, start from the identity \[ (P_n - P_0)(\phi_{\hat P^*} - \phi_{P_0}) = P_n(\phi_{\hat P^*} - \phi_{P_0}) - P_0(\phi_{\hat P^*} - \phi_{P_0}). \] Expanding the first term gives \[ P_n(\phi_{\hat P^*} - \phi_{P_0}) = P_n\phi_{\hat P^*} - P_n\phi_{P_0}. \] The targeting step enforces \[ P_n \phi_{\hat P^*} \approx 0, \] so this term reduces to \[ P_n(\phi_{\hat P^*} - \phi_{P_0}) \approx -\, P_n \phi_{P_0}. \] The quantity \(P_n \phi_{P_0}\) is simply the empirical average of the true influence function, which fluctuates at order \(n^{-1/2}\) due to sampling variability.

The second term, \[ P_0(\phi_{\hat P^*} - \phi_{P_0}), \] reflects how far the targeted influence function is from the true one in the population. Its magnitude is largely determined by the accuracy of the nuisance estimates.

A key feature of the influence function is that first-order errors in either nuisance model cancel out. What remains behaves roughly like the product of the errors in the outcome regression and the propensity score model. As those nuisance estimates improve, this interaction shrinks and becomes negligible relative to the \(n^{-1/2}\) sampling fluctuation. Targeting removes the leading imbalance caused by nuisance estimation in the observed sample. What remains is dominated by the usual sampling fluctuation \((P_n-P_0)\phi_{P_0}\), with nuisance errors entering only through a smaller interaction term.

A quick word on cross-fitting

Everything above can be done with or without cross-fitting. But when \(\hat Q^0\) and \(\hat g\) are estimated using flexible machine-learning methods, cross-fitting helps ensure that the empirical EIF equation behaves the way the theory expects.

Without cross-fitting, the same observations both train the nuisance models and evaluate the influence function. Cross-fitting separates those roles, so each observation is evaluated using nuisance estimates that were learned from other data. This avoids the feedback loop that can otherwise distort the EIF centering condition and helps the usual influence-function asymptotics show up more clearly in finite samples.
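Here is a minimal 2-fold cross-fitting sketch (again on made-up simulated data): each observation's nuisance predictions come from models trained on the other fold, and the targeting step is then applied to the out-of-fold predictions. The small Newton-Raphson logistic routine stands in for whatever propensity learner you would actually use.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=n)
g_true = 1 / (1 + np.exp(-0.4 * X))
A = rng.binomial(1, g_true)
Y = 1.0 + 0.5 * A + 0.8 * X + rng.normal(scale=0.5, size=n)  # true ATE = 0.5

def fit_logistic(D, t, iters=25):
    # Newton-Raphson for a simple logistic regression (propensity model)
    b = np.zeros(D.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-D @ b))
        W = p * (1 - p)
        b += np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (t - p))
    return b

# 2-fold cross-fitting: nuisances for each fold are trained on the other fold
folds = np.arange(n) % 2
Q1, Q0, QA, ghat = (np.empty(n) for _ in range(4))
for k in (0, 1):
    tr, te = folds != k, folds == k
    # Outcome regression: OLS of Y on (1, A, X), fit on the training fold
    Dy = np.column_stack([np.ones(tr.sum()), A[tr], X[tr]])
    beta, *_ = np.linalg.lstsq(Dy, Y[tr], rcond=None)
    Q1[te] = beta[0] + beta[1] + beta[2] * X[te]
    Q0[te] = beta[0] + beta[2] * X[te]
    QA[te] = np.where(A[te] == 1, Q1[te], Q0[te])
    # Propensity model: logistic regression of A on (1, X)
    bg = fit_logistic(np.column_stack([np.ones(tr.sum()), X[tr]]), A[tr])
    ghat[te] = 1 / (1 + np.exp(-(bg[0] + bg[1] * X[te])))

# Targeting step applied to the out-of-fold predictions
H = A / ghat - (1 - A) / (1 - ghat)
eps = np.sum(H * (Y - QA)) / np.sum(H ** 2)
psi_cf = np.mean((Q1 + eps / ghat) - (Q0 - eps / (1 - ghat)))
```

With simple parametric learners the gain is modest; cross-fitting matters most when the nuisances come from flexible machine-learning fits that can overfit their own training data.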

Where the “double robustness” shows up

TMLE also inherits a key robustness property from the structure of the influence function. Roughly speaking, the estimator remains consistent if either the outcome regression or the propensity model is estimated correctly.

Nuisance errors enter the estimator multiplicatively rather than additively. If the outcome regression has error \(e_Q\) and the propensity model has error \(e_g\), the leading bias behaves roughly like the product \(e_Q \times e_g\). If either model is correct the bias disappears, and even when both are imperfect their interaction can still be small.
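The product structure can be seen in a small experiment (simulated setup of my own): break one nuisance model at a time and check that the targeted estimate stays near the true ATE of 0.5 as long as the other model is correct.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
X = rng.normal(size=n)
g_true = 1 / (1 + np.exp(-0.4 * X))
A = rng.binomial(1, g_true)
Y = 1.0 + 0.5 * A + 0.8 * X + rng.normal(scale=0.5, size=n)  # true ATE = 0.5

def tmle(Q1, Q0, g):
    # One targeting step given candidate nuisance fits Q1, Q0, g
    QA = np.where(A == 1, Q1, Q0)
    H = A / g - (1 - A) / (1 - g)
    eps = np.sum(H * (Y - QA)) / np.sum(H ** 2)
    return np.mean(Q1 - Q0 + eps * (1 / g + 1 / (1 - g)))

# Correct outcome model, badly wrong (constant) propensity: stays near 0.5
psi_a = tmle(1.5 + 0.8 * X, 1.0 + 0.8 * X, np.full(n, 0.5))

# Wrong (constant) outcome model, correct propensity: also near 0.5
psi_b = tmle(np.full(n, 1.2), np.full(n, 1.2), g_true)
```

When both models are broken at once, the residual bias behaves roughly like the product of the two errors, which is the multiplicative structure described above.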

This multiplicative structure comes from the orthogonality built into the efficient influence function: first-order errors in either nuisance model cancel out, so nuisance mistakes only matter through their interaction.

In that sense, TMLE is not trying to perfectly estimate the nuisance models themselves. Instead, it adjusts them just enough so that the target parameter obeys the influence-function equation that governs its asymptotic behavior.

Next steps

I had hoped to include some simulations here to see the theory in action, but this post ended up longer than anticipated. As I did after the first post, I’ll follow up with another post that focuses on simulation examples illustrating the ideas developed here.

Reference:

Van der Laan, Mark J., and Sherri Rose. Targeted learning: causal inference for observational and experimental data. Vol. 4. New York: Springer, 2011.

Support:

This work was supported by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
