ouR data generation

Bayesian proportional hazards model for a stepped-wedge design

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 01 Apr 2025 00:00:00 +0000

We’ve finally reached the end of the road. This is the fifth and last post in a series building up to a Bayesian proportional hazards model for analyzing a stepped-wedge cluster-randomized trial. If you are just joining in, you may want to start at the beginning.

The model presented here integrates non-linear time trends and cluster-specific random effects—elements we’ve previously explored in isolation. There’s nothing fundamentally new in this post; it brings everything together. Given that the groundwork has already been laid, I’ll keep the commentary brief and focus on providing the code.

Simulating data from a stepped-wedge CRT

I’ll generate a single data set for 25 sites, each site enrolling study participants over a 30-month period. Sites will transition from control to intervention sequentially, with one new site starting each month. Each site will enroll 25 patients each month.

The outcome ($Y$) is the number of days to an event. The treatment ($A$) reduces the time to event. Survival times also depend on the enrollment month—an effect I’ve exaggerated for illustration. Additionally, each site $i$ has a site-specific effect $b_i \sim N(\mu=0, \sigma = 0.5)$, which influences the time to event among its participants.

Here are the libraries needed for the code shown here:

library(simstudy)
library(data.table)
library(splines)
library(splines2)
library(survival)
library(survminer)
library(coxme)
library(cmdstanr)

Definitions

def <- defData(varname = "b", formula = 0, variance = 0.5^2)

defS <-
  defSurv(
    varname = "eventTime",
    formula = 
      "..int + ..delta_f * A + ..beta_1 * k + ..beta_2 * k^2 + ..beta_3 * k^3 + b",
    shape = 0.30)  |>
  defSurv(varname = "censorTime", formula = -11.3, shape = 0.36)

Parameters

int <- -11.6
delta_f <-  0.80

beta_1 <-  0.05
beta_2 <-  -0.025
beta_3 <- 0.001
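
To get a feel for how strongly enrollment month enters the simulation under these parameters, the cubic term can be evaluated directly. This is a minimal base-R sketch, not part of the original simulation code: the names `k` and `time_effect` are mine, and I am assuming periods are numbered from 0 to 29.

```r
# Parameters copied from the simulation above
beta_1 <- 0.05
beta_2 <- -0.025
beta_3 <- 0.001

k <- 0:29   # assumed period numbering (0-based)

# Contribution of the enrollment period to the defSurv formula
time_effect <- beta_1 * k + beta_2 * k^2 + beta_3 * k^3

# Period with the lowest value, the minimum itself, and the final-period value
k_min <- k[which.min(time_effect)]
round(c(k_min = k_min, min = min(time_effect), last = time_effect[30]), 3)
```

The term dips to roughly -1.5 around period 16 and then rises to about 4.8 by period 29, which is the exaggerated non-linear trend the spline will need to capture.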

Data generation

set.seed(28271)

### Site level data

ds <- genData(25, def, id = "site")                 
ds <- addPeriods(ds, 30, "site", perName = "k") 

# Each site has a unique starting point, site 1 starts period 3, site 2 period 4, etc.

ds <- trtStepWedge(ds, "site", nWaves = 25,     
                   lenWaves = 1, startPer = 3, 
                   grpName = "A", perName = "k")

### Individual level data

dd <- genCluster(ds, "timeID", numIndsVar = 25, level1ID = "id") 
dd <- genSurv(dd, defS, timeName = "Y", censorName = "censorTime", digits = 0,
              eventName = "event", typeName = "eventType")

### Final observed data set

dd <- dd[, .(id, site, k, A, Y, event)]
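
As a sanity check on the design, the rollout schedule can be reproduced without simstudy. The sketch below uses base R only and assumes, following the comment in the code above, that periods are numbered from 0 and that site 1 crosses over in period 3, site 2 in period 4, and so on. The names `sched` and `n_trt` are illustrative.

```r
n_sites   <- 25
n_periods <- 30
start_per <- 3     # period in which the first site switches to intervention

# One row per site-period
sched <- expand.grid(k = 0:(n_periods - 1), site = 1:n_sites)

# Each site starts one period later than the previous one
sched$start <- start_per + (sched$site - 1)
sched$A <- as.integer(sched$k >= sched$start)

# Number of sites under intervention in each period
n_trt <- sapply(split(sched$A, sched$k), sum)
n_trt
```

Under these assumptions no site is treated in periods 0 to 2, all 25 are treated from period 27 onward, and the 750 site-periods with 25 participants each give 18,750 individuals, which matches the `n` reported in the coxme output further down.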

Here is a set of Kaplan-Meier plots for each site and enrollment period. When a site is in the intervention condition, the K-M curve is red. For simplicity, censoring is not shown, though about 20% of cases in this dataset are censored.

Model estimation

This model has quite a few components relative to the earlier models, but nothing is really new. There is a penalized spline for the effect of time and a random effect for each site. The primary parameter of interest is still $\beta$.

For completeness, here is the model specification:

\[ \log L(\beta) = \sum_{j=1}^{J} \left[ \sum_{i \in D_j} \left(\beta A_i + \sum_{m=1} ^ M \gamma_m X_{m_i} + b_{s[i]} \right) - \sum_{r=0}^{d_j-1} \log \left( \sum_{k \in R_j} \exp\left(\beta A_k + \sum_{m=1} ^ M \gamma_m X_{m_k} + b_{s[k]} \right) - r \cdot \bar{w}_j \right) \right] - \lambda \sum_{m=1}^{M} \left( Q^{(2)} \gamma \right)_m^2 \\ \]

where

$J$: number of unique event times
$M$: number of spline basis functions
$D_j$ is the set of individuals who experience an event at time $t_j$.
$R_j$ is the risk set at time $t_j$, including all individuals who are still at risk at that time.
$d_j$ is the number of events occurring at time $t_j$.
$r$ ranges from 0 to $d_j - 1$, iterating over the tied events.
$\bar{w}_j$ represents the average risk weight of individuals experiencing an event at $t_j$:

\[\bar{w}_j = \frac{1}{d_j} \sum_{i \in D_j} \exp\left(\beta A_i + \sum_{m=1} ^ M \gamma_m X_{m_i} + b_{s[i]} \right)\]

$A_i$: binary indicator for treatment
$X_{m_i}$: value of the $m^{\text{th}}$ spline basis function for the $i^{\text{th}}$ observation
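$Q^{(2)}$: the matrix of second derivatives of the spline basis functions (in the code below it is evaluated at each observation, so the penalty sums the squared elements of $Q^{(2)} \gamma$)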

The parameters of the model are

$\beta$: treatment coefficient
$\gamma_m$: spline coefficient for the $m^\text{th}$ spline basis function
$b_{s[i]}$: cluster-specific random effect, where $s[i]$ is the cluster of patient $i$
$\lambda$: the penalization term; this will not be estimated but provided by the user
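
To illustrate what the penalty term does, here is a small sketch that is separate from the model code. It swaps in `splineDesign()` from the base `splines` package in place of `dbs()` from splines2, and the knots, evaluation points, and coefficient vectors are all hypothetical. The idea is only to show that coefficients producing a straight line incur essentially no penalty, while coefficients that zig-zag are penalized.

```r
library(splines)  # ships with R; provides splineDesign()

# Hypothetical setup: cubic B-splines on [0, 29] with 5 interior knots
interior  <- c(5, 10, 15, 20, 25)
aug_knots <- c(rep(0, 4), interior, rep(29, 4))
x <- seq(0.5, 28.5, by = 1)   # evaluation points strictly inside the boundary

# Basis matrix and matrix of second derivatives of each basis function
B  <- splineDesign(aug_knots, x, ord = 4)
Q2 <- splineDesign(aug_knots, x, ord = 4, derivs = rep(2L, length(x)))

# Penalty in the same form as the Stan code: lambda * sum((Q2 * gamma)^2)
penalty <- function(gamma, lambda = 0.15) lambda * sum((Q2 %*% gamma)^2)

# Least-squares coefficients that reproduce a straight line (no curvature)
gamma_linear <- qr.solve(B, x)

# Coefficients that alternate in sign from one basis function to the next
gamma_wiggly <- rep(c(1, -1), length.out = ncol(B))

c(linear = penalty(gamma_linear), wiggly = penalty(gamma_wiggly))
```

Larger values of $\lambda$ push the fitted curve toward the straight-line case, while smaller values allow more curvature.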

The assumed prior distributions for $\beta$, the random effects, and the spline coefficients (as implemented in the Stan code below, where $\sigma_b$ is constrained to be positive) are:

\[ \begin{aligned} \beta &\sim N(0,1) \\ b_i &\sim N(0,\sigma_b) \\ \sigma_b &\sim N^{+}(0, 0.5) \\ \gamma_m &\sim N(0,2) \end{aligned} \]

And here is the implementation of the model in Stan:

stan_code <- 
"
data {
  
  int<lower=1> S;          // Number of clusters
  int<lower=1> K;          // Number of covariates
  
  int<lower=1> N_o;        // Number of uncensored observations
  array[N_o] int i_o;      // Index in data set of uncensored observations

  int<lower=1> N;          // Number of total observations
  matrix[N, K] x;          // Covariates for all observations
  array[N] int<lower=1,upper=S> s;          // Cluster
  
  // Spline-related data
  
  int<lower=1> Q;          // Number of basis functions
  matrix[N, Q] B;          // Spline basis matrix
  matrix[N, Q] Q2_spline;  // 2nd derivative for penalization
  real<lower=0> lambda;    // penalization term
  
  array[N] int index;

  int<lower=0> T;            // Number of records as ties
  int<lower=1> J;            // Number of groups of ties
  array[T] int t_grp;        // Indicating tie group
  array[T] int t_index;      // Index in data set
  vector[T] t_adj;           // Adjustment for ties (efron)
  
}

parameters {
  
  vector[K] beta;          // Fixed effects for covariates
  
  vector[S] b;             // Random effects
  real<lower=0> sigma_b;   // SD of random effect
  
  vector[Q] gamma;         // Spline coefficients
  
}

model {
  
  // Priors
  
  beta ~ normal(0, 1);
  
  // Random effects
  
  b ~ normal(0, sigma_b);
  sigma_b ~ normal(0, 0.5);

  
  // Spline coefficients prior
  
  gamma ~ normal(0, 2);
  
  // Penalization term for spline second derivative
  
  target += -lambda * sum(square(Q2_spline * gamma));
  
  // Compute cumulative sum of exp(theta) in log space (more efficient)
  
  vector[N] theta;
  vector[N] log_sum_exp_theta;
  vector[J] exp_theta_grp = rep_vector(0, J);
  
  int first_in_grp;
  
  // Calculate theta for each observation
  
  for (i in 1:N) {
    theta[i] = dot_product(x[i], beta) + dot_product(B[i], gamma) + b[s[i]];
  }
  
  // Compute log of the cumulative sum of exp(theta) from last to first observation
  
  log_sum_exp_theta = rep_vector(0.0, N);
  log_sum_exp_theta[N] = theta[N];
  
  for (i in tail(sort_indices_desc(index), N-1)) {
    log_sum_exp_theta[i] = log_sum_exp(theta[i], log_sum_exp_theta[i + 1]);
  }

   // Efron algorithm - adjusting cumulative sum for ties
  
  for (i in 1:T) {
    exp_theta_grp[t_grp[i]] += exp(theta[t_index[i]]);
  }

  for (i in 1:T) {
  
    if (t_adj[i] == 0) {
      first_in_grp = t_index[i];
    }

    log_sum_exp_theta[t_index[i]] =
      log( exp(log_sum_exp_theta[first_in_grp]) - t_adj[i] * exp_theta_grp[t_grp[i]]);
  }
  
  // Likelihood for uncensored observations

  for (n_o in 1:N_o) {
    target += theta[i_o[n_o]] - log_sum_exp_theta[i_o[n_o]];
  }
}
"

Compiling the model:

stan_model <- cmdstan_model(write_stan_file(stan_code))

Getting the data from R to Stan:

dx <- copy(dd)
setorder(dx, Y)
dx[, index := .I]

dx.obs <- dx[event == 1]
N_obs <- dx.obs[, .N]
i_obs <- dx.obs[, index]

N_all <- dx[, .N]
x_all <- data.frame(dx[, .(A)])
s_all <- dx[, site]

K <- ncol(x_all)                 # num covariates - in this case just A
S <- dx[, length(unique(site))]

# Spline-related info

n_knots <- 5
spline_degree <- 3
knot_dist <- 1/(n_knots + 1)
probs <- seq(knot_dist, 1 - knot_dist, by = knot_dist)
knots <- quantile(dx$k, probs = probs)
spline_basis <- bs(dx$k, knots = knots, degree = spline_degree, intercept = TRUE)
B <- as.matrix(spline_basis)

Q2 <- dbs(dx$k, knots = knots, degree = spline_degree, derivs = 2, intercept = TRUE)
Q2_spline <- as.matrix(Q2)

ties <- dx[, .N, keyby = Y][N>1, .(grp = .I, Y)]
ties <- merge(ties, dx, by = "Y")
ties <- ties[, order := 1:.N, keyby = grp][, .(grp, index)]
ties[, adj := 0:(.N-1)/.N, keyby = grp]

stan_data <- list(
  S = S,
  K = K,
  N_o = N_obs,
  i_o = i_obs,
  N = N_all,
  x = x_all,
  s = s_all,
  Q = ncol(B),
  B = B,
  Q2_spline = Q2_spline,
  lambda = 0.15,
  index = dx$index,
  T = nrow(ties),
  J = max(ties$grp),
  t_grp = ties$grp,
  t_index = ties$index,
  t_adj = ties$adj
)
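
Before sampling, it can be worth checking what the spline setup above produces. The sketch below mirrors the knot placement and the `bs()` call, but uses a hypothetical stand-in for `dx$k` (30 periods with 10 records each) so that it runs on its own. It also builds the quantile probabilities with `seq_len()`, a small variation of my own that gives the same five probabilities.

```r
library(splines)

# Stand-in for dx$k: 30 enrollment periods, 10 records each (hypothetical)
k <- rep(0:29, each = 10)

n_knots <- 5
spline_degree <- 3

# Interior knots at equally spaced quantiles, as in the code above
probs <- seq_len(n_knots) / (n_knots + 1)
knots <- quantile(k, probs = probs)

B <- bs(k, knots = knots, degree = spline_degree, intercept = TRUE)

dim(B)              # one row per record, n_knots + degree + 1 columns
range(rowSums(B))   # with intercept = TRUE, each row of the basis sums to 1
```

With five interior knots and a cubic basis, `Q = ncol(B)` is 9, which is the length of `gamma` in the Stan model.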

Now we sample from the posterior - you can see that it takes quite a while to run, at least on my 2020 MacBook Pro M1 with 8GB RAM:

fit_mcmc <- stan_model$sample(
  data = stan_data,
  seed = 1234,
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

## Running MCMC with 4 parallel chains...

## Chain 4 finished in 1847.8 seconds.
## Chain 1 finished in 2202.8 seconds.
## Chain 3 finished in 2311.8 seconds.
## Chain 2 finished in 2414.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 2194.3 seconds.
## Total execution time: 2415.3 seconds.

fit_mcmc$summary(variables = c("beta", "sigma_b"))

## # A tibble: 2 × 10
##   variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 beta[1]  0.815  0.815 0.0298 0.0298 0.767 0.865  1.00    3513.    5077.
## 2 sigma_b  0.543  0.535 0.0775 0.0739 0.432 0.683  1.00    3146.    5110.
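
For interpretation, the summary for `beta[1]` can be moved to the hazard ratio scale. This is simple arithmetic on the numbers printed above, not additional output from the model. Quantiles transform exactly under `exp()`, but the exponentiated posterior mean is only an approximation to the posterior mean of the hazard ratio.

```r
# Posterior summaries for beta[1] copied from the output above
post <- c(q5 = 0.767, mean = 0.815, q95 = 0.865)
delta_f <- 0.80   # value used to generate the data

hr_post <- exp(post)
hr_true <- exp(delta_f)

round(c(hr_post, true = hr_true), 2)
```

The hazard ratio used to generate the data (about 2.23) sits inside the interval between the 5th and 95th posterior percentiles.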

Estimating a “frequentist” random-effects model

After all that, it turns out you can just fit a frailty model with random effects for site and a spline for time period $k$ using the coxme package. This is obviously much simpler than everything I have presented here.

frailty_model <- coxme(Surv(Y, event) ~ A + ns(k, df = 3) + (1 | site), data = dd)
summary(frailty_model)

## Mixed effects coxme model
##  Formula: Surv(Y, event) ~ A + ns(k, df = 3) + (1 | site) 
##     Data: dd 
## 
##   events, n = 14989, 18750
## 
## Random effects:
##   group  variable        sd  variance
## 1  site Intercept 0.5306841 0.2816256
##                   Chisq    df p   AIC   BIC
## Integrated loglik 18038  5.00 0 18028 17990
##  Penalized loglik 18185 27.85 0 18129 17917
## 
## Fixed effects:
##                    coef exp(coef) se(coef)      z      p
## A               0.80966   2.24714  0.02959  27.36 <2e-16
## ns(k, df = 3)1 -2.71392   0.06628  0.04428 -61.29 <2e-16
## ns(k, df = 3)2  1.04004   2.82933  0.07851  13.25 <2e-16
## ns(k, df = 3)3  4.48430  88.61492  0.04729  94.83 <2e-16

However, the advantage of the Bayesian model is its flexibility. For example, if you wanted to include site-specific spline curves—analogous to site-specific time effects—you could extend the Bayesian approach to do so. The current Bayesian model implements a study-wide time spline, but incorporating site-specific splines would be a natural extension. I initially hoped to implement site-specific splines using the mgcv package, but the models did not converge. I am quite confident that a Bayesian extension would, though it would likely require substantial computing resources. If someone wants me to try that, I certainly could, but for now, I’ll stop here.

A Bayesian proportional hazards model for a cluster randomized trial

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 25 Mar 2025 00:00:00 +0000

In recent posts, I introduced a Bayesian approach to proportional hazards modeling and then extended it to incorporate a penalized spline. (There was also a third post on handling ties when multiple individuals share the same event time.) This post describes another extension: a random effect to account for clustering in a cluster randomized trial. With this in place, I’ll be ready to tackle the final step—building a model for analyzing a stepped-wedge cluster-randomized trial that incorporates both splines and site-specific random effects.

Simulating data with a cluster-specific random effect

Here are the R packages used in the post:

library(simstudy)
library(ggplot2)
library(data.table)
library(survival)
library(survminer)
library(cmdstanr)

The dataset simulates a cluster randomized trial where sites are randomized in a $1:1$ ratio to either the treatment group ($A=1$) or control group ($A=0$). Patients are affiliated with sites and receive the intervention based on site-level randomization. The time-to-event outcome, $Y$, is measured at the patient level and depends on both the site’s treatment assignment and unmeasured site effects:

defC <- 
  defData(varname = "b", formula = 0, variance = "..s2_b", dist = "normal") |>
  defData(varname = "A", formula = "1;1", dist = "trtAssign")

defS <-
  defSurv(
    varname = "timeEvent",
    formula = "-11.6 + ..delta_f * A + b",
    shape = 0.30
  )  |>
  defSurv(varname = "censorTime", formula = -11.3, shape = .40)

delta_f <- 0.7
s2_b <- 0.4^2
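
To put the size of the site effect in context, here is a quick base-R calculation (my own illustration, not part of the simulation). It treats `b` as acting on the same log-hazard scale as the treatment effect, which is how it enters the `defSurv` formula above.

```r
s2_b <- 0.4^2          # random-effect variance from the simulation above
sd_b <- sqrt(s2_b)

# Central 95% range of site effects on the log scale, then exponentiated
b_range  <- qnorm(c(0.025, 0.975), mean = 0, sd = sd_b)
hr_range <- exp(b_range)

round(hr_range, 2)
```

Under that assumption, the multiplicative site effects range from roughly 0.46 to 2.19, so the spread across sites is comparable in size to the treatment effect itself ($\exp(0.7) \approx 2.01$).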

I’ve generated a single data set of 50 sites, with 25 in each arm. Each site includes 100 patients. The plot below shows the site-specific Kaplan-Meier curves for each intervention arm.

set.seed(1821)

dc <- genData(50, defC, id = "site")
dd <- genCluster(dc, "site", numIndsVar = 100, level1ID = "id")
dd <- genSurv(dd, defS, timeName = "Y", censorName = "censorTime",
             eventName = "event", typeName = "eventType", keepEvents = TRUE)

Bayesian model

Since this is the fourth iteration of the Bayesian proportional hazards model I’ve been working on, it naturally builds directly on the approach from my previous three posts (here, here, and here). Now, the partial log likelihood is a function of the treatment effect and cluster-specific random effects, given by:

\[ \log L(\beta) = \sum_{j=1}^{J} \left[ \sum_{i \in D_j} \left(\beta A_i + b_{s[i]} \right) - \sum_{r=0}^{d_j-1} \log \left( \sum_{k \in R_j} \exp\left(\beta A_k + b_{s[k]} \right) - r \cdot \bar{w}_j \right) \right] \\ \]

where

$J$: number of unique event times
$D_j$ is the set of individuals who experience an event at time $t_j$.
$R_j$ is the risk set at time $t_j$, including all individuals who are still at risk at that time.
$d_j$ is the number of events occurring at time $t_j$.
$r$ ranges from 0 to $d_j - 1$, iterating over the tied events.
$\bar{w}_j$ represents the average risk weight of individuals experiencing an event at $t_j$:

\[\bar{w}_j = \frac{1}{d_j} \sum_{i \in D_j} \exp\left(\beta A_i + b_{s[i]} \right)\]

$A_i$: binary indicator for treatment for patient $i$.

The parameters of the model are

$\beta$: treatment coefficient
$b_{s[i]}$: cluster-specific random effect, where $s[i]$ is the cluster of patient $i$

The assumed prior distributions for $\beta$ and the random effects are:

\[ \begin{aligned} \beta &\sim N(0,4) \\ b_i &\sim N(0,\sigma_b) \\ \sigma_b &\sim t_{\text{student}}(df = 3, \mu=0, \sigma = 2) \end{aligned} \]

stan_code <- 
"
data {
  
  int<lower=0> S;                   // Number of clusters

  int<lower=0> K;                   // Number of covariates
  int<lower=0> N_o;                 // Number of uncensored observations
  array[N_o] int i_o;               // Index in data set

  int<lower=0> N;                   // Number of total observations
  matrix[N, K] x;                   // Covariates for all observations
  array[N] int<lower=1,upper=S> s;  // Cluster for each observation
  
  array[N] int index;
  
  int<lower=0> T;            // Number of records as ties
  int<lower=1> J;            // Number of groups of ties
  array[T] int t_grp;        // Indicating tie group
  array[T] int t_index;      // Index in data set
  vector[T] t_adj;           // Adjustment for ties (efron)

}

parameters {
  
  vector[K] beta;          // Fixed effects for covariates
  vector[S] b;             // Random effects
  real<lower=0> sigma_b;   // SD of random effect
  
}

model {
  
  // Prior
  
  beta ~ normal(0, 4);
  b ~ normal(0, sigma_b);
  sigma_b ~ student_t(3, 0, 2);
  
  // Calculate theta for each observation to be used in likelihood
  
  vector[N] theta;
  vector[N] log_sum_exp_theta;
  vector[J] exp_theta_grp = rep_vector(0, J);
  
  int first_in_grp;

  
  for (i in 1:N) {
    theta[i] = dot_product(x[i], beta) + b[s[i]];  
  }
  
  // Computing log of the cumulative sum of exp(theta) from last to first observation
  
  log_sum_exp_theta[N] = theta[N];
  
  for (i in tail(sort_indices_desc(index), N-1)) {
    log_sum_exp_theta[i] = log_sum_exp(theta[i], log_sum_exp_theta[i + 1]);
  }

  // Efron algorithm - adjusting cumulative sum for ties
  
  for (i in 1:T) {
    exp_theta_grp[t_grp[i]] += exp(theta[t_index[i]]);
  }

  for (i in 1:T) {
  
    if (t_adj[i] == 0) {
      first_in_grp = t_index[i];
    }

    log_sum_exp_theta[t_index[i]] =
      log( exp(log_sum_exp_theta[first_in_grp]) - t_adj[i] * exp_theta_grp[t_grp[i]]);
  }
  
  // Likelihood for uncensored observations

  for (n_o in 1:N_o) {
    target += theta[i_o[n_o]] - log_sum_exp_theta[i_o[n_o]];
  }
}
"

Getting the data ready to pass to Stan, compiling the Stan code, and sampling from the model using cmdstanr:

dx <- copy(dd)
setorder(dx, Y)
dx[, index := .I]

dx.obs <- dx[event == 1]
N_obs <- dx.obs[, .N]
i_obs <- dx.obs[, index]

N_all <- dx[, .N]
x_all <- data.frame(dx[, .(A)])
s_all <- dx[, site]

K <- ncol(x_all)                 # num covariates - in this case just A
S <- dx[, length(unique(site))]

ties <- dx[, .N, keyby = Y][N>1, .(grp = .I, Y)]
ties <- merge(ties, dx, by = "Y")
ties <- ties[, order := 1:.N, keyby = grp][, .(grp, index)]
ties[, adj := 0:(.N-1)/.N, keyby = grp]

stan_data <- list(
  S = S,
  K = K,
  N_o = N_obs,
  i_o = i_obs,
  N = N_all,
  x = x_all,
  s = s_all,
  index = dx$index,
  T = nrow(ties),
  J = max(ties$grp),
  t_grp = ties$grp,
  t_index = ties$index,
  t_adj = ties$adj
)

# compiling code

stan_model <- cmdstan_model(write_stan_file(stan_code))

# sampling from model

fit <- stan_model$sample(
  data = stan_data,
  seed = 1234, 
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

## Running MCMC with 4 parallel chains...

## Chain 3 finished in 39.0 seconds.
## Chain 2 finished in 39.1 seconds.
## Chain 1 finished in 39.5 seconds.
## Chain 4 finished in 39.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 39.3 seconds.
## Total execution time: 39.7 seconds.

The posterior mean for $\beta$, the treatment effect, is quite close to the “true” value of 0.70, as is the estimate of the standard deviation of the random effect (we used $sd = 0.4$ in the data generating process):

fit$summary(variables = c("beta", "sigma_b"))

## # A tibble: 2 × 10
##   variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 beta[1]  0.659  0.660 0.124  0.124  0.455 0.863  1.00    1529.    3265.
## 2 sigma_b  0.417  0.413 0.0472 0.0458 0.347 0.501  1.00   15493.   11171.

The final post in this series will include code to simulate data from a stepped-wedge cluster-randomized trial with a time-to-event outcome. This model will integrate both the spline and random effect components. I’m curious to see how well it performs, as the required computational resources could be substantial.

Accounting for ties in a Bayesian proportional hazards model

keith.goldfeld@nyumc.org (Keith Goldfeld) — Thu, 20 Mar 2025 00:00:00 +0000

Over my past few posts, I’ve been progressively building towards a Bayesian model for a stepped-wedge cluster randomized trial with a time-to-event outcome, where time will be modeled using a spline function. I started with a simple Cox proportional hazards model for a traditional RCT, ignoring time as a factor. In the next post, I introduced a nonlinear time effect. For the third post—one I initially thought was ready to publish—I extended the model to a cluster randomized trial without explicitly incorporating time. I was then working on the grand finale, the full model, when I ran into an issue: I couldn’t recover the effect-size parameter used to generate the data.

After an embarrassingly long debugging process, I finally realized the problem—many events shared the same event times, and my model failed to account for ties. This issue hadn’t been apparent in the earlier models, but the final version was particularly sensitive to it. So, I decided to step back and first implement a model that properly handles ties before moving ahead.

What’s the issue?

The fundamental issue is that the likelihood in the original model assumes event times are unique for each individual. This assumption is reasonable when time is measured in hours but becomes problematic when using days and even more so with weeks, especially if the study covers a broad time range.

When multiple individuals experience an event at the same recorded time (ties), the challenge is defining the appropriate “risk set”—the group still at risk at that time. Two commonly used approaches for handling ties are the Breslow and Efron methods.

The Breslow method takes a simple approach by assuming that all tied events share the same risk set. It treats them as if they happened sequentially but applies the same risk set to each event. This can work well when ties are rare but may introduce bias if they are frequent.
The Efron method refines this by adjusting the risk set dynamically. Instead of treating all tied events as occurring with full risk sets, it reduces the risk set incrementally as each event happens. This better approximates a scenario where events truly occur in close succession rather than simultaneously.

In practical terms, the Efron method provides a more accurate correction when ties are common, while Breslow remains a computationally simpler choice. Another option is the Exact method, which calculates the likelihood by considering all possible orderings of tied events. While this approach is the most precise, it is often computationally impractical for large datasets. Much of this is described nicely in Hertz-Picciotto and Rockhill, though the original methods are detailed by Efron and Breslow. Finally, these lecture notes by Patrick Breheny provide a nice explanation of algorithms for handling tied survival times.
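
A small numerical comparison may help make the differences concrete. The sketch below uses made-up risk weights for two individuals tied at a single event time and computes that time point's likelihood contribution three ways. The "exact" version here averages over the two possible orderings, which is one common reading of the exact approach and is my assumption rather than something specified above.

```r
# Hypothetical risk weights exp(theta) for two individuals tied at one event time
w_tied <- c(1, 2)

# Total risk weight of everyone still at risk (including the two tied cases)
S <- 10

num <- prod(w_tied)

# Breslow: both events use the full risk set
breslow <- num / (S * S)

# Efron: the second event removes the average tied weight from the risk set
efron <- num / (S * (S - mean(w_tied)))

# Averaging over the two possible orderings of the tied events
exact <- mean(c(num / (S * (S - w_tied[1])),
                num / (S * (S - w_tied[2]))))

round(c(breslow = breslow, efron = efron, exact = exact), 5)
```

In this toy case the Efron value lands much closer to the ordering-averaged value than the Breslow value does.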

Implementing the Efron method

Since the Efron method generally provides better estimates, I chose to incorporate this into the Bayesian model. The partial likelihood under this approach is

\[ \log L(\beta) = \sum_{j=1}^{J} \left[ \sum_{i \in D_j} \mathbf{x}_i^\top \beta - \sum_{r=0}^{d_j-1} \log \left( \sum_{k \in R_j} \exp(\mathbf{x}_k^\top \beta) - r \cdot \bar{w}_j \right) \right] \\ \]

$\mathbf{x}_i$: vector of covariates for individual $i$.
$J$: number of unique event times.
$D_j$ is the set of individuals who experience an event at time $t_j$.
$R_j$ is the risk set at time $t_j$, including all individuals who are still at risk at that time.
$d_j$ is the number of events occurring at time $t_j$.
$r$ ranges from 0 to $d_j - 1$, iterating over the tied events.
$\bar{w}_j$ represents the average risk weight of individuals experiencing an event at $t_j$:

\[\bar{w}_j = \frac{1}{d_j} \sum_{i \in D_j} \exp(\mathbf{x}_i^\top \beta)\]

The key idea here is that instead of treating all tied events as occurring simultaneously with the full risk set (as in Breslow’s method), Efron gradually reduces the risk set as each tied event is considered. This provides a more refined approximation of the true likelihood, particularly when ties are common.

In case a simple numerical example is helpful, here is a toy example of 15 individuals, where three share a common event time of 14 days (shown as $Y$), and two share an event time of 25 days. The critical feature of the Efron correction is that $\bar{w}$ is computed by averaging $\text{exp}(\theta)$ across the group experiencing the event at a given time.

Initially, the cumulative sum of $\text{exp}(\theta)$ for each individual in the group is identical. However, as each event in the tied group is processed sequentially, the risk set is dynamically updated, and the contribution from individuals who have already been accounted for is gradually reduced. This is reflected in the term $Z = U-V$, where $U$ represents the initial total risk weight and $V$ accounts for the incremental reduction as ties are processed.
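
The quantities $U$, $V$, and $Z$ can be computed in a few lines of base R. This is a sketch on a smaller, hypothetical data set than the one described above (8 individuals, all with observed events, three tied at day 14 and two at day 25). The values of `theta` are invented for illustration.

```r
# Hypothetical toy data, sorted by event time
d <- data.frame(
  Y     = c(5, 9, 14, 14, 14, 20, 25, 25),
  theta = c(0.4, 0, 0.4, 0, 0.4, 0, 0.4, 0)
)
w <- exp(d$theta)

# U: total risk weight of everyone still at risk at each record's event time
d$U <- sapply(d$Y, function(y) sum(w[d$Y >= y]))

# Within each tied group: position r = 0, 1, ..., d_j - 1 and mean weight w_bar
r     <- ave(w, d$Y, FUN = function(v) seq_along(v) - 1)
w_bar <- ave(w, d$Y, FUN = mean)

# V: Efron reduction r * w_bar; Z: adjusted denominator
d$V <- r * w_bar
d$Z <- d$U - d$V

# Partial log likelihood with the Efron correction
loglik <- sum(d$theta - log(d$Z))

print(round(d, 3))
loglik
```

For untied records $V = 0$ and $Z = U$. Within a tied group, $U$ is constant while $Z$ shrinks by $\bar{w}$ with each successive record.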

Simulating data

To generate data for testing the Bayesian model, I am simulating an RCT with 1,000 individuals randomized 1:1 to one of two groups (represented by $A$). The treatment effect is defined by the log hazard ratio $\delta_f$ (set below to $\log(1.5)$). The data includes censoring and ties.

We start with loading the necessary libraries:

library(simstudy)
library(data.table)
library(survival)
library(survminer)
library(cmdstanr)

Here is the simulation - definitions, parameters, and the data generation.

#### Data definitions

def <- defData(varname = "A", formula = "1;1", dist = "trtAssign")

defS <-
  defSurv(
    varname = "timeEvent",
    formula = "-11.6 + ..delta_f * A",
    shape = 0.30
  )  |>
  defSurv(varname = "censorTime", formula = -11.3, shape = .35)

#### Parameters

delta_f <- log(1.5)

#### Generate single data set

set.seed(7398)

dd <- genData(1000, def)
dd <- genSurv(dd, defS, timeName = "Y", censorName = "censorTime", digits = 0,
              eventName = "event", typeName = "eventType", keepEvents  = TRUE)

The Kaplan-Meier plot shows the two arms - red is intervention, and green is control, and the black crosses are the censoring times.

Of the 48 unique event times, 44 are shared by multiple individuals. This histogram shows the number of events at each time point:
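
Counts like these can be obtained with `table()`. The vectors below are hypothetical stand-ins for the `Y` and `event` columns of the simulated data, so the numbers will not match the ones quoted above.

```r
# Hypothetical event times in days and event indicators (1 = event, 0 = censored)
Y     <- c(3, 3, 5, 8, 8, 8, 12, 15, 15, 20)
event <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 0)

# Number of events at each distinct event time
events_per_time <- table(Y[event == 1])

c(unique_times = length(events_per_time),
  tied_times   = sum(events_per_time > 1))
```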

Stan model

This Stan code extends the survival model I described earlier, with much of the structure remaining unchanged. The binary search function from the previous model has been removed and replaced with an index field passed from R. New data fields related to ties are also required, which must be calculated in R and provided to the model. There are no additional parameters, and the likelihood adjustment for ties is handled in the “model” block.

stan_code <-
"
data {

  int<lower=0> N_o;        // Number of uncensored observations
  array[N_o] int i_o;      // Index in data set

  int<lower=0> N;          // Number of total observations
  vector[N] x;             // Covariates for all observations
  
  array[N] int index;
  
  int<lower=0> T;            // Number of records as ties
  int<lower=1> J;            // Number of groups of ties
  array[T] int t_grp;        // Indicating tie group
  array[T] int t_index;      // Index in data set
  vector[T] t_adj;           // Adjustment for ties (efron)
}

parameters {
  
  real beta;          // Fixed effects for covariates

}

model {
  
  // Prior
  
  beta ~ normal(0, 4);
  
  // Calculate theta for each observation to be used in likelihood
  
  vector[N] theta;
  vector[N] log_sum_exp_theta;
  vector[J] exp_theta_grp = rep_vector(0, J);
  
  int first_in_grp;
  
  for (i in 1:N) {
    theta[i] = x[i] * beta;  
  }

  // Computing log of the cumulative sum of exp(theta) from last to first observation
  
  log_sum_exp_theta[N] = theta[N];
  
  for (i in tail(sort_indices_desc(index), N-1)) {
    log_sum_exp_theta[i] = log_sum_exp(theta[i], log_sum_exp_theta[i + 1]);
  }
  
  // Efron algorithm - adjusting cumulative sum for ties
  
  for (i in 1:T) {
    exp_theta_grp[t_grp[i]] += exp(theta[t_index[i]]);
  }

  for (i in 1:T) {
  
    if (t_adj[i] == 0) {
      first_in_grp = t_index[i];
    }

    log_sum_exp_theta[t_index[i]] =
      log( exp(log_sum_exp_theta[first_in_grp]) - t_adj[i] * exp_theta_grp[t_grp[i]]);
  }
  
  // Likelihood for uncensored observations

  for (n_o in 1:N_o) {
    target += theta[i_o[n_o]] - log_sum_exp_theta[i_o[n_o]];
  }
  
}

generated quantities {
  real exp_beta = exp(beta);
}
"

Compiling the model

stan_model <- cmdstan_model(write_stan_file(stan_code))
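
The backward recursion in the model block above can be mimicked in R to see what it computes. This is a sketch with hypothetical values of `theta`. The helper `log_sum_exp()` is written here to mirror Stan's two-argument function of the same name and is not an R built-in.

```r
# Stable log(exp(a) + exp(b)), mirroring Stan's two-argument log_sum_exp()
log_sum_exp <- function(a, b) {
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Hypothetical linear predictors, ordered by increasing event time
theta <- c(0.2, -0.5, 1.0, 0.3, -1.2)
N <- length(theta)

# Recursion from the last observation back to the first, as in the model block
lse <- numeric(N)
lse[N] <- theta[N]
for (i in (N - 1):1) {
  lse[i] <- log_sum_exp(theta[i], lse[i + 1])
}

# Direct (less stable) version: log of the reverse cumulative sum of exp(theta)
direct <- log(rev(cumsum(rev(exp(theta)))))

all.equal(lse, direct)
```

Element `i` is the log of the total risk weight for records `i` through `N`. With tied times, later records in a tie group get a running total that omits the earlier tied records, which is why the Efron step in the Stan code resets each tied record to the total for the first record in its group before subtracting the adjustment.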

Preparing the data for Stan

dx <- copy(dd)
setorder(dx, Y)
dx[, index := .I]

dx.obs <- dx[event == 1]
N_obs <- dx.obs[, .N]
i_obs <- dx.obs[, index]

N_all <- dx[, .N]
x_all <- dx[, A]

ties <- dx[, .N, keyby = Y][N>1, .(grp = .I, Y)]
ties <- merge(ties, dx, by = "Y")
ties <- ties[, order := 1:.N, keyby = grp][, .(grp, index)]
ties[, adj := 0:(.N-1)/.N, keyby = grp]

stan_data <- list(
  N_o = N_obs,
  i_o = i_obs,
  N = N_all,
  x = x_all,
  index = dx$index,
  T = nrow(ties),
  J = max(ties$grp),
  t_grp = ties$grp,
  t_index = ties$index,
  t_adj = ties$adj
)

Fitting models

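First, I am estimating the Bayesian model. To save time, I’m using $optimize() to obtain a point estimate for beta from the Stan model rather than using MCMC to sample the full posterior distribution. (Because the optimizer maximizes the log density including the prior, this is strictly the posterior mode; with the diffuse normal(0, 4) prior it is essentially the MLE.) The estimated hazard ratio is right on target: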

fit_mle <- stan_model$optimize(data=stan_data, jacobian = FALSE)
fit_mle$draws(format="df")[1,c("beta", "exp_beta")]

## # A tibble: 1 × 2
##    beta exp_beta
##   <dbl>    <dbl>
## 1 0.401     1.49

For comparison, I fit a Cox proportional hazards model using a frequentist (non-Bayesian) approach. The log hazard ratio estimate matches that of the Bayesian model.

cox_model <- coxph(Surv(Y, event) ~ A , data = dd, ties = "efron")
broom::tidy(cox_model)  # assumed: the tibble below has the layout produced by broom::tidy()

## # A tibble: 1 × 5
##   term  estimate std.error statistic      p.value
##   <chr>    <dbl>     <dbl>     <dbl>        <dbl>
## 1 A        0.401    0.0715      5.62 0.0000000194

Of course, I really should conduct more extensive simulations to better understand the operating characteristics of the Bayesian model, particularly to assess how estimates behave when the ties are ignored. However, I’m eager to get back to the original program, moving next to a random effects model, and then completing the full model by combining the random effect with the spline. I’ll leave these additional simulations for you to explore.

References:

Breslow, N., 1974. Covariance analysis of censored survival data. Biometrics, pp.89-99.

Efron, B., 1977. The efficiency of Cox’s likelihood function for censored data. Journal of the American Statistical Association, 72(359), pp.557-565.

Hertz-Picciotto, I. and Rockhill, B., 1997. Validity and efficiency of approximation methods for tied survival times in Cox regression. Biometrics, pp.1151-1156.

A Bayesian proportional hazards model with a penalized spline

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 04 Mar 2025 00:00:00 +0000

In my previous post, I outlined a Bayesian approach to proportional hazards modeling. This post serves as an addendum, providing code to incorporate a spline to model a time-varying hazard ratio nonlinearly. In a second addendum to come, I will present a separate model with a site-specific random effect, essential for a cluster-randomized trial. These components lay the groundwork for analyzing a stepped-wedge cluster-randomized trial, where both splines and site-specific random effects will be integrated into a single model. I plan on describing this comprehensive model in a final post.

Simulating data with a time-varying hazard ratio

Here are the R packages used in the post:

library(simstudy)
library(ggplot2)
library(data.table)
library(survival)
library(survminer)
library(splines)
library(splines2)
library(cmdstanr)

The dataset simulates a randomized controlled trial in which patients are assigned either to the treatment group ($A=1$) or control group ($A=0$) in a $1:1$ ratio. Patients enroll over nine quarters, with the enrollment quarter denoted by $M$, $M \in \{0, \dots, 8 \}$. The time-to-event outcome, $Y$, depends on both treatment assignment and enrollment quarter. To introduce non-linearity, I define the relationship using a cubic function, with true parameters specified as follows:

defI <- 
  defData(varname = "A", formula = "1;1", dist = "trtAssign") |>
  defData(varname = "M", formula = "0;8", dist = "uniformInt")

defS <-
  defSurv(
    varname = "eventTime",
    formula = "..int + ..beta * A + ..alpha_1 * M + ..alpha_2 * M^2 + ..alpha_3 * M^3",
    shape = 0.30)  |>
  defSurv(varname = "censorTime", formula = -11.3, shape = 0.40)

# parameters

int <- -11.6      
beta <-  0.70
alpha_1 <-  0.10   
alpha_2 <-  0.40    
alpha_3 <- -0.05

I’ve generated a single data set of $640$ study participants, $320$ in each arm. The plot below shows the Kaplan-Meier curves by arm for each enrollment period.

set.seed(7368) # 7362

dd <- genData(640, defI)
dd <- genSurv(dd, defS, timeName = "Y", censorName = "censorTime",
  eventName = "event", typeName = "eventType", keepEvents = TRUE)

Bayesian model

This Bayesian proportional hazards model builds directly on the approach from my previous post. Since the effect of $M$ on $Y$ follows a non-linear pattern, I model this relationship using a spline to account for temporal variation in the hazard. The partial likelihood is a function of the treatment effect and spline basis function coefficients, given by:

\[ L(\beta,\mathbf{\gamma}) = \prod_{i=1}^{N} \left( \frac{\exp \left(\beta A_i + \sum_{m=1} ^ M \gamma_m X_{m_i} \right)} {\sum_{j \in R(t_i)} \exp\left(\beta A_j + \sum_{m=1} ^ M \gamma_m X_{m_j}\right) } \right)^{\delta_i} \] where:

$M$: number of spline basis functions (called Q in the Stan code below, and not to be confused with the enrollment quarter $M$)
$N$: number of observations (censored or not)
$A_i$: binary indicator for treatment
$X_{m_i}$: value of the $m^{\text{th}}$ spline basis function for the $i^{\text{th}}$ observation
$\delta_i$: event indicator ($\delta_i = 1$ if the event occurred, $\delta_i = 0$ if censored)
$\beta$: treatment coefficient
$\gamma_m$: spline coefficient for the $m^\text{th}$ spline basis function
$R(t_i)$: risk set at time $t_i$ (individuals still at risk just before $t_i$; censored individuals are included only if they are censored after $t_i$)

The spline component of the model is adapted from a model I described last year. In this formulation, time-to-event is modeled as a function of the vector $\mathbf{X_i}$ rather than the period itself. The number of basis functions is determined by the number of knots, with each segment of the curve estimated using B-spline basis functions. To minimize overfitting, we include a penalization term based on the second derivative of the B-spline basis functions. The strength of this penalization is controlled by a tuning parameter, $\lambda$, which is provided to the model.
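
To make the penalization term a little more concrete, here is a small sketch of the quantity $\lambda \sum_i (\mathbf{D}_i \boldsymbol{\gamma})^2$, where $\mathbf{D}_i$ is the row of second derivatives of the basis functions for observation $i$. This is only an illustration under some assumptions: it uses splineDesign from the base splines package rather than the bs and dbs calls used elsewhere in this post, the three interior knots are arbitrary, and spline_penalty is a hypothetical helper name.

```r
library(splines)

x <- 0:8                       # enrollment quarters
ord <- 4                       # cubic B-splines have order 4
interior <- c(2, 4, 6)         # arbitrary interior knots (assumption)
knots_full <- c(rep(min(x), ord), interior, rep(max(x), ord))

# basis functions and their second derivatives evaluated at each quarter
B  <- splineDesign(knots_full, x, ord = ord)
D2 <- splineDesign(knots_full, x, ord = ord, derivs = rep(2L, length(x)))

# hypothetical helper: the amount subtracted from the log posterior
spline_penalty <- function(gamma, lambda) {
  lambda * sum((D2 %*% gamma)^2)
}

set.seed(1)
gamma_flat   <- rep(1, ncol(B))   # constant curve, so no curvature
gamma_wiggly <- rnorm(ncol(B))    # arbitrary coefficients

spline_penalty(gamma_flat, lambda = 0.10)
spline_penalty(gamma_wiggly, lambda = 0.10)
```

A constant set of coefficients gives a penalty that is numerically zero, while erratic coefficients are penalized. Larger values of $\lambda$ pull the fitted curve toward a straight line.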

The Stan code, provided in full here, was explained in earlier posts. The principal difference from the previous post is the addition of the spline-related data and parameters, as well as the penalization term in the model:

stan_code <-
"
functions {

  // Binary search that returns the first index holding the target value

  int binary_search(vector v, real tar_val) {
    int low = 1;
    int high = num_elements(v);
    int result = -1;

    while (low <= high) {
      int mid = (low + high) %/% 2;
      if (v[mid] == tar_val) {
        result = mid; // Store the index
        high = mid - 1; // Look for earlier occurrences
      } else if (v[mid] < tar_val) {
        low = mid + 1;
      } else {
        high = mid - 1;
      }
    }
    return result;
  }
}

data {

  int<lower=0> K;          // Number of covariates
  int<lower=0> N_o;        // Number of uncensored observations
  vector[N_o] t_o;         // Event times (sorted in increasing order)

  int<lower=0> N;          // Number of total observations
  vector[N] t;             // Individual times (sorted in increasing order)
  matrix[N, K] x;          // Covariates for all observations

  // Spline-related data
  
  int<lower=1> Q;          // Number of basis functions
  matrix[N, Q] B;          // Spline basis matrix
  matrix[N, Q] D2_spline;  // 2nd derivative for penalization
  real lambda;             // penalization term
}

parameters {
  vector[K] beta;          // Fixed effects for covariates
  vector[Q] gamma;         // Spline coefficients
}

model {
  
  // Prior
  
  beta ~ normal(0, 4);
  
  // Spline coefficients prior
  
  gamma ~ normal(0, 4);
  
  // Penalization term for spline second derivative
  
  target += -lambda * sum(square(D2_spline * gamma));
  
  // Calculate theta for each observation to be used in likelihood
  
  vector[N] theta;
  vector[N] log_sum_exp_theta;
  
  for (i in 1:N) {
    theta[i] = dot_product(x[i], beta) + dot_product(B[i], gamma);  
  }
  
  // Compute log of cumulative sum of exp(theta) from last to first observation
  
  log_sum_exp_theta[N] = theta[N];
  
  for (i in tail(sort_indices_desc(t), N-1)) {
    log_sum_exp_theta[i] = log_sum_exp(theta[i], log_sum_exp_theta[i + 1]);
  }

  // Likelihood for uncensored observations
  
  for (n_o in 1:N_o) {
    int start_risk = binary_search(t, t_o[n_o]); // Use binary search
    
    real log_denom = log_sum_exp_theta[start_risk];
    target += theta[start_risk] - log_denom;
  }
}
"

To estimate the model, we need to get the data ready to pass to Stan, compile the Stan code, and then sample from the model using cmdstanr:

dx <- copy(dd)
setorder(dx, Y)

dx.obs <- dx[event == 1]
N_obs <- dx.obs[, .N]
t_obs <- dx.obs[, Y]

N_all <- dx[, .N]
t_all <- dx[, Y]
x_all <- data.frame(dx[, .(A)])

# Spline-related info

n_knots <- 5
spline_degree <- 3
knot_dist <- 1/(n_knots + 1)
probs <- seq(knot_dist, 1 - knot_dist, by = knot_dist)
knots <- quantile(dx$M, probs = probs)
spline_basis <- bs(dx$M, knots = knots, degree = spline_degree, intercept = TRUE)
B <- as.matrix(spline_basis)

D2 <- dbs(dx$M, knots = knots, degree = spline_degree, derivs = 2, intercept = TRUE)
D2_spline <- as.matrix(D2)

K <- ncol(x_all)             # num covariates - in this case just A

stan_data <- list(
  K = K,
  N_o = N_obs,
  t_o = t_obs,
  N = N_all,
  t = t_all,
  x = x_all,
  Q = ncol(B),
  B = B,
  D2_spline = D2_spline,
  lambda = 0.10
)

# compiling code

stan_model <- cmdstan_model(write_stan_file(stan_code))

# sampling from model

fit <- stan_model$sample(
  data = stan_data,
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  max_treedepth = 15,
  refresh = 0
)

## Running MCMC with 4 parallel chains...
## 
## Chain 4 finished in 64.1 seconds.
## Chain 3 finished in 64.5 seconds.
## Chain 2 finished in 65.2 seconds.
## Chain 1 finished in 70.6 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 66.1 seconds.
## Total execution time: 70.8 seconds.

The posterior mean (and median) for $\beta$, the treatment effect, are quite close to the “true” value of 0.70:

fit$summary(variables = c("beta", "gamma"))

## # A tibble: 10 × 10
##    variable   mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
##    <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
##  1 beta[1]   0.689  0.689 0.0844 0.0857  0.551 0.828  1.00    3664.    4002.
##  2 gamma[1] -1.75  -1.77  1.33   1.35   -3.91  0.468  1.00    1364.    1586.
##  3 gamma[2] -1.59  -1.60  1.33   1.35   -3.75  0.626  1.00    1360.    1551.
##  4 gamma[3] -1.22  -1.24  1.33   1.35   -3.39  0.978  1.00    1365.    1515.
##  5 gamma[4] -0.115 -0.127 1.33   1.35   -2.28  2.09   1.00    1361.    1576.
##  6 gamma[5]  1.97   1.95  1.34   1.35   -0.206 4.20   1.00    1366.    1581.
##  7 gamma[6]  2.63   2.61  1.33   1.34    0.452 4.84   1.00    1358.    1586.
##  8 gamma[7]  1.08   1.05  1.33   1.34   -1.08  3.28   1.00    1360.    1505.
##  9 gamma[8] -0.238 -0.260 1.33   1.34   -2.40  1.97   1.00    1355.    1543.
## 10 gamma[9] -0.914 -0.935 1.33   1.35   -3.07  1.30   1.00    1356.    1549.

The figure below shows the estimated spline and the 95% credible interval. The green line represents the posterior median log hazard ratio for each period (relative to the middle period, 4), with the shaded band indicating the corresponding credible interval. The purple points represent the log hazard ratios implied by the data generation process. For example, the log hazard ratio comparing period 1 to period 4 for both arms is:

\[ \begin{array}{c} (-11.6 + 0.70A +0.10\times1 + 0.40 \times 1^2 -0.05\times1^3) - (-11.6 + 0.70A +0.10\times4 + 0.40 \times 4^2 -0.05\times4^3) = \\ (0.10 + 0.40 - 0.05) - (0.10 \times 4 + 0.40 \times 16 - 0.05 \times 64 ) = \\ 0.45 - 3.60 = -3.15 \end{array} \]
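
The same arithmetic can be carried out for every enrollment period at once. This short sketch just re-uses the cubic coefficients from the data definitions; the intercept and the treatment term are left out because they cancel in the comparison with period 4.

```r
alpha_1 <- 0.10
alpha_2 <- 0.40
alpha_3 <- -0.05

M <- 0:8

# cubic part of the linear predictor for each period
lin_pred <- alpha_1 * M + alpha_2 * M^2 + alpha_3 * M^3

# log hazard ratio for each period relative to period 4
log_hr <- lin_pred - lin_pred[M == 4]

data.frame(M = M, log_hr = round(log_hr, 2))
```

The entry for period 1 reproduces the value of -3.15 derived above, and the entry for period 4 is zero by construction.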

It appears that the posterior median aligns quite well with the values used in the data generation process.

For the next post, I will present another scenario that includes random effects for a cluster randomized trial (but will not include splines).

Estimating a Bayesian proportional hazards model

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 11 Feb 2025 00:00:00 +0000

A recent conversation with a colleague about a large stepped-wedge cluster-randomized trial (SW-CRT) piqued my interest, because the primary outcome is time-to-event. This is not something I’ve seen before. A quick dive into the literature suggested that time-to-event outcomes are uncommon in SW-CRTs, and that the best analytic approach is not obvious. I was intrigued by how to analyze the data to estimate a hazard ratio while accounting for clustering and potential secular trends that might influence the time to the event.

Of course, my first thought was: How would I simulate data to explore different modeling approaches? And then: Could a Bayesian approach be useful here?

Generating data with clustering and a secular time trend turned out to be quite straightforward, and I’ll share that in a future post. Here, I’m focusing on my first attempts to implement a Bayesian model that might eventually accommodate cluster-level random effects and flexible secular trends using splines, something I explored more generally in an earlier post.

Below, I start by generating a simple set of time-to-event outcomes (without any clustering or time trends) and fit a traditional Cox proportional hazards model to serve as a comparison. I then fit a Bayesian proportional hazards model using Stan code drawn from the online guide. That model works fine, but it has a key limitation that I try to address, first partially successfully, and then pretty successfully. This post walks through all these steps.

Simulating an RCT with time-to-event outcomes

Here are the R packages that are used in this post:

library(simstudy)
library(data.table)
library(survival)
library(cmdstanr)

And here are the data definitions for a two-arm randomized controlled trial that is stratified by a variable $M$. Both the treatment $A$ and covariate $M$ are associated with the time-to-event outcome, as specified in defS. On average, the treatment $A$ speeds up the time-to-event, and $M$ slows things down. (In simstudy survival times are generated using a Weibull data generation process.)

defI <-
  defData(varname = "M", formula = 0.3, dist = "binary") |>
  defData(varname = "A", formula = "1;1", variance = "M", dist = "trtAssign")

defS <- 
  defSurv(
    varname = "timeEvent", 
    formula = "-11.6 + ..delta * A + ..beta_m * M",
    shape = 0.30)  |>
  defSurv(varname = "censorTime", formula = -11.3, shape = .35)

## Parameters

delta <- 1.5
beta_m <- -1.0
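
To give a rough sense of the time scale these parameters imply, here is a sketch of a Weibull-type generator of the kind referred to above. It is not simstudy's internal code: r_weibull_ph is a hypothetical helper, and I am assuming a scale of 1 and the form $T = (-\log(U) / \exp(f))^{\nu}$, where $f$ is the linear predictor from the formula and $\nu$ is the shape.

```r
set.seed(2025)

shape  <- 0.30
f_ctrl <- -11.6          # linear predictor for A = 0, M = 0
f_trt  <- -11.6 + 1.5    # linear predictor for A = 1, M = 0

# hypothetical helper (assumed form, scale fixed at 1)
r_weibull_ph <- function(n, f, shape) {
  u <- runif(n)
  (-log(u) / exp(f))^shape
}

# closed-form medians implied by the assumed generator
median_ctrl <- (log(2) / exp(f_ctrl))^shape
median_trt  <- (log(2) / exp(f_trt))^shape

sim_ctrl <- r_weibull_ph(1e5, f_ctrl, shape)

c(closed_form = median_ctrl, simulated = median(sim_ctrl), treated = median_trt)
```

Under these assumptions the median time is about 29 days in the control arm and noticeably shorter in the treatment arm, in line with the idea that $A$ speeds up the time-to-event.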

We are generating 1,000 independent observations:

set.seed(123)

dd <- genData(1000, defI)
dd <- genSurv(dd, defS, timeName = "tte", censorName = "censorTime", eventName = "event")
dd

## Key: <id>
## Index: <type>
##          id     M     A    tte event       type
##       <int> <int> <int>  <num> <num>     <char>
##    1:     1     0     0 26.974     1  timeEvent
##    2:     2     1     0 38.353     1  timeEvent
##    3:     3     0     0 32.836     1  timeEvent
##    4:     4     1     0 28.768     0 censorTime
##    5:     5     1     0 54.366     1  timeEvent
##   ---                                          
##  996:   996     1     1 11.012     0 censorTime
##  997:   997     0     0 15.420     1  timeEvent
##  998:   998     0     0 21.212     1  timeEvent
##  999:   999     1     0 41.153     0 censorTime
## 1000:  1000     0     1 25.659     1  timeEvent

Here is a Kaplan-Meier plot showing the “survival” times for each level of $M$ and each treatment arm:

Fitting a traditional Cox proportional hazards model, we can see that the log hazard ratio for treatment $A$ is greater than 0, suggesting that on average the time-to-event is shorter for those in the treatment arm. Likewise, those with $M=1$ have longer times-to-event and the log hazard ratio is less than zero:

cox_model <- coxph(Surv(tte, event) ~ A + M, data = dd)
summary(cox_model)

## Call:
## coxph(formula = Surv(tte, event) ~ A + M, data = dd)
## 
##   n= 1000, number of events= 821 
## 
##       coef exp(coef) se(coef)      z Pr(>|z|)    
## A  1.44309   4.23374  0.08018  18.00   <2e-16 ***
## M -0.92537   0.39638  0.08302 -11.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##   exp(coef) exp(-coef) lower .95 upper .95
## A    4.2337     0.2362    3.6181    4.9542
## M    0.3964     2.5228    0.3369    0.4664
## 
## Concordance= 0.695  (se = 0.009 )
## Likelihood ratio test= 415.5  on 2 df,   p=<2e-16
## Wald test            = 389  on 2 df,   p=<2e-16
## Score (logrank) test = 423.4  on 2 df,   p=<2e-16

First Bayes model

As I mentioned before, I turned to Stan documentation for the code that follows. I won’t go into the detailed derivation of the partial likelihood here since that is covered very nicely in the document. However, it is useful to see the final likelihood that is then reflected in the code.

The likelihood is written as follows (and note that the $j$’s represent only the cases with observed times, while $j'$ include censored cases):

\[ L(\boldsymbol{\beta})= \prod_{j=1}^{N^{obs}} \left( \frac{\exp(\mathbf{x}_j^\top \boldsymbol{\beta})}{\sum_{j'=j}^{N} \exp(\mathbf{x}_{j'}^\top \boldsymbol{\beta})}\right) \]

where:

$N^{obs}$ is the number of observed times,
$N$ is the total number of observations (including censored),
$\mathbf{x}_j$ is the vector of covariates for the $j$-th observation,
$\boldsymbol{\beta}$ is the vector of coefficients.

We want the log likelihood, which transforms the product to a sum of logs:

\[ \begin{aligned} \log Pr[\text{obs. fails ordered } 1, \dots N^{obs}|\mathbf{x}, \boldsymbol{\beta}] &= \sum_{j=1}^{N^{obs}} \log \left( \frac{\exp(\mathbf{x}_j^\top \boldsymbol{\beta})}{\sum_{j'=j}^{N} \exp(\mathbf{x}_{j'}^\top \boldsymbol{\beta})}\right) \\ \\ &= \sum_{j=1}^{N^{obs}} \left(\mathbf{x}_j^\top \boldsymbol{\beta} - \log \sum_{j'=j}^{N} \exp(\mathbf{x}_{j'}^\top \boldsymbol{\beta})\right) \end{aligned} \]

In Stan, the function log_sum_exp can be used to efficiently calculate

\[ \log \sum_{j'=j}^{N} \exp(\mathbf{x}_{j'}^\top \boldsymbol{\beta}). \]
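
The reason for working with a dedicated function rather than taking the log of a sum of exponentials directly is numerical stability. The R sketch below mimics the idea; the log_sum_exp defined here is my own stand-in written for illustration, not Stan's implementation, and the values of theta are arbitrary.

```r
# subtract the maximum before exponentiating, then add it back on the log scale
log_sum_exp <- function(v) {
  m <- max(v)
  m + log(sum(exp(v - m)))
}

theta <- c(1000, 1000.5, 999)   # arbitrary large linear predictors

log(sum(exp(theta)))   # naive version overflows
log_sum_exp(theta)     # stable version stays finite
```

With linear predictors this large, the naive calculation overflows to Inf, while the shifted version returns a finite value slightly above the largest element.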

This partial likelihood is implemented below in Stan. One confusing aspect (at least to me) is the way censoring is handled. Essentially, all event times for censored cases are assumed to occur after the last observed time. That is, all censored cases are part of the risk set for observed events, something I had not seen before. This is a pretty big assumption and has implications for data where the actual censoring times occur before the last observed event time.

The code might be a little confusing, because the data are delivered to Stan in reverse order. It is done this way to make calculation of the log likelihood more efficient. If you are trying to follow along with the code to see how it lines up with the equations above, keep this in mind.

stan_code <-
"
data {
  int<lower=0> K;          // num covariates

  int<lower=0> N;          // num uncensored obs
  vector[N] t;             // event time (non-strict decreasing)
  matrix[N, K] x;          // covariates for uncensored obs

  int N_c;                 // num censored obs
  real <lower=t[N]> t_c;   // censoring time
  matrix[N_c, K] x_c;      // covariates for censored obs
}

parameters {
  vector[K] beta;          // slopes (no intercept)
}

transformed parameters {
  vector[N] log_theta = x * beta;
  vector[N_c] log_theta_c = x_c * beta;
}

model {
  beta ~ normal(0, 4);
  
  real log_denom = log_sum_exp(log_theta_c);
  
  for (n in 1:N) {
    log_denom = log_sum_exp(log_denom, log_theta[n]);
    target += log_theta[n] - log_denom;   // log likelihood
  }
  
}
"

This code prepares the R data for Stan:

dd.o <- dd[event == 1]
setorder(dd.o, -tte)
x.o <- data.frame(dd.o[, .(A, M)])
N.o <- dd.o[, .N]
t.o <- dd.o[, tte]

dd.c <- dd[event == 0]
setorder(dd.c, -tte)
x.c <- data.frame(dd.c[, .(A, M)])
N.c <- dd.c[, .N]
t.c <- dd.c[, tte]

K <- ncol(x.o)          # num covariates

stan_data <- list(
  K = K,
  N = N.o,
  t = t.o,
  x = x.o,
  N_c = N.c,
  t_c = max(t.c),
  x_c = x.c
)

I’m using cmdstanr to interface with Stan. First we compile the Stan code.

stan_model <- cmdstan_model(write_stan_file(stan_code))

And then we fit the model. Even with 1,000 observations, the model estimates in just a couple of seconds on my laptop.

fit <- stan_model$sample(
  data = stan_data, 
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

## Running MCMC with 4 parallel chains...
## 
## Chain 3 finished in 1.7 seconds.
## Chain 1 finished in 1.9 seconds.
## Chain 4 finished in 1.8 seconds.
## Chain 2 finished in 1.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 1.8 seconds.
## Total execution time: 2.0 seconds.

Looking at the log hazard ratios, something seems awry. The Bayesian estimates are attenuated relative to the original Cox PH estimates, and this is not due to the prior distribution assumption. Rather, it is the result of assuming that all censored times are longer than the longest observed time-to-event. I’m not showing this here, but the attenuation does largely go away if there is no censoring.

fit$summary(variables = "beta")

## # A tibble: 2 × 10
##   variable   mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta[1]   1.03   1.03  0.0723 0.0720  0.916  1.15   1.00   14541.   11205.
## 2 beta[2]  -0.588 -0.588 0.0802 0.0802 -0.721 -0.455  1.00   13062.   10867.

Given this limitation, I decided to try to implement an algorithm that accommodates dynamic risk sets, effectively taking censored cases out of the analysis as soon as they are censored. This is what the coxph model estimated above does.

Second Bayes model

The partial likelihood for my alternative Cox proportional hazards model is given by:

\[ L(\boldsymbol{\beta}) = \prod_{i=1}^{N} \left( \frac{\exp(\mathbf{x}_i^\top \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{x}_j^\top \boldsymbol{\beta})} \right)^{\delta_i} \]

where:

$N$ is the number of observations (censored or not),
$\mathbf{x}_i$ is the vector of covariates for the $i$-th observation,
$\boldsymbol{\beta}$ is the vector of coefficients,
$t_i$ is the observed time for the $i$-th observation,
$R(t_i)$ is the risk set at time $t_i$ (the set of individuals still at risk just before time $t_i$), which includes censored cases only if they are censored after $t_i$,
$\delta_i$ is the event indicator ($\delta_i = 1$ if the event occurred, $\delta_i = 0$ if censored).
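
A toy example may help show how this risk set differs from the one implied by the first model. The four observations below are made up for illustration: two events (at times 2 and 5) and two censored cases (at times 3 and 7).

```r
time  <- c(2, 3, 5, 7)
event <- c(1, 0, 1, 0)

events <- which(event == 1)

# dynamic risk set: everyone (censored or not) with a time at or after the event
dynamic <- sapply(events, function(i) sum(time >= time[i]))

# first model: later events plus ALL censored cases, regardless of when censored
static <- sapply(events, function(i) {
  sum(event == 1 & time >= time[i]) + sum(event == 0)
})

data.frame(event_time = time[events], dynamic = dynamic, static = static)
```

For the event at time 5, the dynamic risk set has two members (the event itself and the case censored at 7), whereas the first model also keeps the case censored at time 3 in the denominator.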

The Stan code below implements the log likelihood that follows from this. In contrast to the first version, the data is passed to Stan in ascending order. The one major complication is that I needed to create a search function in order to define the “risk” set. (Actually, I asked DeepSeek to do this for me.) The fundamental difference between this version and the first is the calculation of the denominator in the log likelihood.

stan_code <-
"
functions {
  int binary_search(vector v, real tar_val) {
    int low = 1;
    int high = num_elements(v);
    int result = -1;

    while (low <= high) {
      int mid = (low + high) %/% 2;
      if (v[mid] == tar_val) {
        result = mid; // Store the index
        high = mid - 1; // Look for earlier occurrences
      } else if (v[mid] < tar_val) {
        low = mid + 1;
      } else {
        high = mid - 1;
      }
    }
    return result;
  }
}

data {
  int<lower=0> K;          // Number of covariates

  int<lower=0> N_o;        // Number of uncensored observations
  vector[N_o] t_o;         // Event times (sorted in increasing order)
  matrix[N_o, K] x_o;      // Covariates for uncensored observations

  int<lower=0> N;          // Number of total observations
  vector[N] t;             // Individual times
  matrix[N, K] x;          // Covariates for all observations
}

parameters {
  vector[K] beta;          // Fixed effects for covariates
}

model {
  
  // Prior

  beta ~ normal(0, 4);
  
  // Model

  vector[N] log_theta = x * beta;

  for (n_o in 1:N_o) {
    int start_risk = binary_search(t, t_o[n_o]); // Use binary search
    real log_denom = log_sum_exp(log_theta[start_risk:N]);
    target += log_theta[start_risk] - log_denom;
  }

}
"
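
For anyone who wants to check the search logic outside of Stan, here is a line-by-line R translation of the function. It is a sketch for experimentation only; binary_search_first is a hypothetical name and the vector of times is made up.

```r
# returns the first index of tar_val in an ascending vector v, or -1 if absent
binary_search_first <- function(v, tar_val) {
  low <- 1L
  high <- length(v)
  result <- -1L

  while (low <= high) {
    mid <- (low + high) %/% 2L
    if (v[mid] == tar_val) {
      result <- mid       # store the index
      high <- mid - 1L    # keep looking for earlier occurrences
    } else if (v[mid] < tar_val) {
      low <- mid + 1L
    } else {
      high <- mid - 1L
    }
  }
  result
}

t_all <- c(1.2, 3.4, 3.4, 3.4, 5.0, 7.1)   # ascending, with ties

binary_search_first(t_all, 3.4)   # first of the three tied positions
binary_search_first(t_all, 4.0)   # value not present
```

Because the search keeps moving left after a match, tied times resolve to the earliest position, so the risk set for a tied event starts at the first of the tied observations.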

Preparing the data is a little different. This time, I am passing the observed data and the full data, both in ascending order:

dx <- copy(dd)
setorder(dx, tte)

dx.o <- dx[event == 1]
x_o <- data.frame(dx.o[, .(A, M)])
N_o <- dx.o[, .N]
t_o <- dx.o[, tte]

x_all <- data.frame(dx[, .(A, M)])
N_all <- dx[, .N]
t_all <- dx[, tte]

K <- ncol(x_o)          # num covariates

stan_data <- list(
  K = K,
  N_o = N_o,
  t_o = t_o,
  x_o = x_o,
  N = N_all,
  t = t_all,
  x = x_all
)

stan_model <- cmdstan_model(write_stan_file(stan_code))

fit <- stan_model$sample(
  data = stan_data,
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

## Running MCMC with 4 parallel chains...
## 
## Chain 2 finished in 57.9 seconds.
## Chain 4 finished in 67.2 seconds.
## Chain 1 finished in 67.8 seconds.
## Chain 3 finished in 67.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 65.2 seconds.
## Total execution time: 68.1 seconds.

fit$summary(variables = "beta")

## # A tibble: 2 × 10
##   variable   mean median     sd    mad    q5    q95  rhat ess_bulk ess_tail
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta[1]   1.44   1.43  0.0801 0.0801  1.31  1.57   1.00   12604.   10436.
## 2 beta[2]  -0.925 -0.924 0.0816 0.0817 -1.06 -0.791  1.00   12492.   10436.

Two things to note about this model. First, it appears that estimates are spot on! They mirror the estimates from the coxph model using the survival package. That is encouraging. Second, the implementation is very inefficient, taking more than a minute to run! This is less encouraging, and does not bode well for a more complex model that incorporates random effects and splines.

Final Bayes model

This time I asked ChatGPT to see if it could make my code more efficient. (I’ve been comparing ChatGPT and DeepSeek; both have been pretty impressive.) It recognized that my initial brute-force approach was calculating each denominator anew for every observed event. This is highly inefficient, on the order of $O(N^2)$. (Unprompted, ChatGPT provided me with this information.) In the reconfigured algorithm, the denominators are pre-calculated in a single pass, starting with the last time point (censored or observed), similar to the first approach. This turns out to be much more efficient, on the order of $O(N)$.

stan_code <-
"
...

model {
  
  // Prior
  
  beta ~ normal(0, 4);
  
  // Likelihood
  
  vector[N] theta = x * beta;
  vector[N] log_sum_exp_theta;
  
  // Compute cumulative sum of exp(theta) in log space
  
  log_sum_exp_theta[N] = theta[N]; // Initialize the last element
  
  for (i in tail(sort_indices_desc(t), N-1)) {
    log_sum_exp_theta[i] = log_sum_exp(theta[i], log_sum_exp_theta[i + 1]);
  }

  for (n_o in 1:N_o) {
    int start_risk = binary_search(t, t_o[n_o]); // Use binary search
    real log_denom = log_sum_exp_theta[start_risk];
    target += theta[start_risk] - log_denom;
  }
}
"
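
The equivalence of the two approaches can be checked in plain R. The sketch below is not the Stan model; it evaluates the log partial likelihood at a single value of beta, once by recomputing each denominator from scratch and once by accumulating the denominators from the last observation backward. The data are made up (continuous times, so no ties), and all function names are hypothetical.

```r
set.seed(42)
n <- 200
x <- rbinom(n, 1, 0.5)
t_event <- rexp(n, rate = 0.10 * exp(0.5 * x))   # assumed event process
t_cens  <- rexp(n, rate = 0.05)                  # assumed censoring process
time  <- pmin(t_event, t_cens)
event <- as.integer(t_event <= t_cens)

# sort everything by time, ascending
ord <- order(time)
time <- time[ord]; x <- x[ord]; event <- event[ord]

log_sum_exp2 <- function(a, b) {
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# brute force: a fresh sum over the risk set for every event
log_pl_brute <- function(beta) {
  theta <- x * beta
  sum(sapply(which(event == 1), function(i) {
    theta[i] - log(sum(exp(theta[i:n])))
  }))
}

# single backward pass: each denominator re-uses the one after it
log_pl_cumulative <- function(beta) {
  theta <- x * beta
  lse <- numeric(n)
  lse[n] <- theta[n]
  for (i in (n - 1):1) lse[i] <- log_sum_exp2(theta[i], lse[i + 1])
  sum(theta[event == 1] - lse[event == 1])
}

c(brute = log_pl_brute(0.5), cumulative = log_pl_cumulative(0.5))
```

The two values agree up to floating-point error, which is what makes the backward pass a drop-in replacement for the slower calculation.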

The data requirements for this are the same as the second model, so no changes are needed there.

stan_model <- cmdstan_model(write_stan_file(stan_code))

fit <- stan_model$sample(
  data = stan_data,
  iter_warmup = 1000,
  iter_sampling = 4000,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

## Running MCMC with 4 parallel chains...
## 
## Chain 3 finished in 2.2 seconds.
## Chain 2 finished in 2.4 seconds.
## Chain 4 finished in 2.4 seconds.
## Chain 1 finished in 2.7 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 2.4 seconds.
## Total execution time: 2.8 seconds.

fit$summary(variables = "beta")

## # A tibble: 2 × 10
##   variable   mean median     sd    mad    q5    q95  rhat ess_bulk ess_tail
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta[1]   1.44   1.44  0.0800 0.0794  1.30  1.57   1.00   12036.   10499.
## 2 beta[2]  -0.925 -0.925 0.0831 0.0834 -1.06 -0.789  1.00   11887.   10386.

The model also works well, as the estimates are the same as those from the previous Bayesian model and the coxph model. More importantly, the computation time is reduced considerably, to about 3 seconds. My hope is that this final model is flexible enough to handle the extensions I need for the data structure that sparked this whole exploration.

Thinking about covariates in an analysis of an RCT

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 28 Jan 2025 00:00:00 +0000

I was recently discussing the analytic plan for a randomized controlled trial (RCT) with a clinical collaborator when she asked whether it’s appropriate to adjust for pre-specified baseline covariates. This question is so interesting because it touches on fundamental issues of inference, both causal and statistical. What is the target estimand in an RCT (that is, what effect are we actually measuring)? What do we hope to learn from the specific sample recruited for the trial (i.e., how can the findings be analyzed in a way that enhances generalizability)? What underlying assumptions about replicability, resampling, and uncertainty inform the arguments for and against covariate adjustment? These are big questions, which won’t necessarily be answered here, but need to be kept in mind when thinking about the merits of covariate adjustment.

Some researchers resist covariate adjustment in the primary analysis, concerned that it might complicate interpretability or limit transparency. Others might like the straightforward clarity and simplicity of the randomized comparison. But perhaps the biggest issue that people have with covariate adjustment is a longstanding concern that flexible modeling could turn into a fishing expedition—searching for covariates that yield the most favorable effect estimate.

After that conversation with my colleague, I revisited a 1994 paper by Stephen Senn, which argues that rather than checking for chance covariate imbalances before making adjustments, “the practical statistician will do well to establish beforehand a limited list of covariates deemed useful and fit them regardless. Such a strategy will usually lead to a gain in power, has no adverse effect on unconditional size and controls conditional size with respect to the covariates identified.” A subsequent paper by Pocock et al. reinforces this perspective. Although they note that “experience shows that for most clinical trials, analyses which adjust for baseline covariates are in close agreement with the simpler unadjusted treatment comparisons”, they argue adjusting for covariates can be justified if it helps: (1) achieve the most appropriate p-value for treatment differences, (2) provide unbiased estimates, and (3) improve precision.

Motivated by Pocock et al., I created some simulations to explore the operating characteristics of covariate adjustment that I’m sharing here. I’ve been distracted more recently with paper writing and manuscript editing, so I am happy to get back to a little R coding.

Simulations

To get things started, here are the R packages used in the simulations.

library(simstudy)
library(data.table)
library(stargazer)
library(parallel)

Data definitions

I am using two sets of data definitions here, splitting up the creation of baseline covariates ($x_1$ and $x_2$) and group assignment from the outcome $y$. We are assuming that $y$ has a Gaussian (normal) distribution.

def_c <- 
  defData(varname = "x1", formula = 0, variance = "..s_x1^2", dist = "normal") |>
  defData(varname = "x2", formula = 0, variance = "..s_x2^2", dist = "normal") |>
  defData(varname = "A", formula = "1;1", dist = "trtAssign")

def_y <- defDataAdd(
  varname = "y", 
  formula = "5 + ..delta * A + ..b1 * x1 + ..b2 * x2", 
  variance = "..s_y^2", 
  dist = "normal"
)

Initial parameters

Here are the parameters used in the data generation. $x_1$ is highly correlated with the outcome $y$, whereas $x_2$ is not. In the first set of simulations, we are assuming a true effect size $\delta = 5$.

s_x1 <- 8
s_x2 <- 9
s_y <- 12
b1 <- 1.50
b2 <- 0.0
delta <- 5
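
These parameters already tell us roughly what to expect from adjustment. The sketch below works out the implied within-arm correlation between $x_1$ and $y$ and the approximate standard error of the treatment effect with and without adjustment for $x_1$. It assumes 125 participants per arm (250 split 1:1) and ignores the small loss of degrees of freedom from estimating the extra coefficient.

```r
s_x1 <- 8
s_y  <- 12
b1   <- 1.50
n_arm <- 125   # assumption: 250 participants, randomized 1:1

# outcome variance within an arm when x1 is not adjusted for
var_y_within <- b1^2 * s_x1^2 + s_y^2

# implied within-arm correlation between x1 and y
rho_x1_y <- b1 * s_x1 / sqrt(var_y_within)

# approximate standard errors of the difference in means
se_unadj <- sqrt(var_y_within * 2 / n_arm)   # x1 left in the error term
se_adj   <- sqrt(s_y^2 * 2 / n_arm)          # x1 removed from the error term

round(c(rho = rho_x1_y, se_unadj = se_unadj, se_adj = se_adj), 3)
```

Under these assumptions the correlation is about 0.71, and adjusting for $x_1$ should shrink the standard error from roughly 2.15 to roughly 1.52.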

Single data set generation

L’Ecuyer’s Combined Multiple Recursive Generator (CMRG) random number generator is being used here, because the replication of the data (and the analyses) is done using a parallel process to speed things up a bit.

RNGkind("L'Ecuyer-CMRG")
set.seed(55)

dc <- genData(250, def_c) 
dd <- addColumns(def_y, dc)

head(dd)

## Key: <id>
##       id         x1          x2     A          y
##    <int>      <num>       <num> <int>      <num>
## 1:     1  3.4075117 -1.66327988     0  8.1461927
## 2:     2 -0.3040474 -5.60073657     0 -1.7577859
## 3:     3 -4.4460516  1.20189340     1 -1.6324336
## 4:     4  0.6834332  0.09974478     1  9.5938736
## 5:     5 -4.6324773  7.85745373     0 -0.4782144
## 6:     6 -6.5650815  0.49812462     0 -3.4082489

For this single data set, we can see that the means for $x_1$ within each group are slightly different, while the means for $x_2$ are more similar.

dd[, .(mu_x1 = mean(x1), mu_x2 = mean(x2)), keyby = A]

## Key: <A>
##        A      mu_x1       mu_x2
##    <int>      <num>       <num>
## 1:     0  0.3960784 -0.36510184
## 2:     1 -0.4821337  0.08564994

These differences are confirmed by calculating the standardized imbalance $Z_x$ and the standardized difference $d$. (The difference between $Z_x$ and $d$ is that $Z_x$ has an adjustment for the group sample sizes.)
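
Written out, with $\bar{x}_a$, $s^2_a$, and $n_a$ denoting the covariate mean, variance, and sample size in arm $a$ (this notation is mine, chosen to match what the function below computes), the two measures are

\[ d = \frac{\bar{x}_1 - \bar{x}_0}{s_p}, \qquad Z_x = \frac{\bar{x}_1 - \bar{x}_0}{s_p\sqrt{1/n_0 + 1/n_1}}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2}{n_0 + n_1 - 2}} \]

so that $Z_x = d \big/ \sqrt{1/n_0 + 1/n_1}$. With equal arms of 125, for example, $Z_x$ is about 7.9 times $d$.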

calc_diff <- function(dx, rx, v) {
  
  mean_diff <- dx[get(rx)==1, mean(get(v))] - dx[get(rx)==0, mean(get(v))]
  s_pooled <- sqrt(
    (dx[get(rx)==1, (.N - 1) * var(get(v))] + 
       dx[get(rx)==0, (.N - 1) * var(get(v))] ) / dx[, .N - 2])
  
  Z_x <- mean_diff / ( s_pooled * sqrt(1/dx[get(rx)==0, .N] + 1/dx[get(rx)==1, .N]) )
  d <- mean_diff / s_pooled
  
  return(list(Z_x = Z_x, d = d))
  
}

calc_diff(dd, "A", "x1")

## $Z_x
## [1] -0.9424664
## 
## $d
## [1] -0.1192136

calc_diff(dd, "A", "x2")

## $Z_x
## [1] 0.3862489
## 
## $d
## [1] 0.04885705

As designed, $x_1$ is strongly correlated with the outcome $y$, whereas $x_2$ is not.

dd[, .(rho_x1.y = cor(x1, y), rho_x2.y = cor(x2, y))]

##     rho_x1.y     rho_x2.y
##        <num>        <num>
## 1: 0.6807215 -0.003174757

Model estimation

We fit four models to this data: (1) no adjustment for the covariates, (2) adjusting for $x_1$ alone, (3) adjusting for $x_2$ alone, and (4) adjusting for both covariates.

model_1 <- lm(data = dd, formula = y ~ A)
model_2 <- lm(data = dd, formula = y ~ A + x1)
model_3 <- lm(data = dd, formula = y ~ A + x2)
model_4 <- lm(data = dd, formula = y ~ A + x1 + x2)

Two key takeaways emerge from this single data set. First, because $x_1$ is a confounder (albeit a weak one), failing to adjust for it leads to an underestimate of the treatment effect, which follows from the small negative correlation between $x_1$ and $A$. Second, because $x_1$ is so highly correlated with $y$, the models that adjust for $x_1$ have lower standard errors for the treatment effect estimate (around 2.0 for models 1 and 3, and closer to 1.5 for models 2 and 4).

## 
## ============================================
##            (1)      (2)      (3)      (4)   
## --------------------------------------------
## A         3.040   4.367***  3.044   4.367***
##          (2.039)  (1.480)  (2.044)  (1.484) 
##                                             
## x1                1.512***          1.512***
##                   (0.101)           (0.101) 
##                                             
## x2                          -0.010   0.001  
##                            (0.111)  (0.081) 
##                                             
## Constant 6.237*** 5.638*** 6.234*** 5.639***
##          (1.442)  (1.046)  (1.446)  (1.048) 
##                                             
## ============================================
## ============================================
##

Operating characteristics (based on replicated data sets)

In order to understand the relative merits of the different modeling approaches, we need to replicate multiple data sets under the same set of assumptions used to generate the single data set. We will generate 2000 data sets and estimate all four models for each data set. For each replication, we use the function est_ancova to calculate a one-sided p-value. We will keep track of the point estimate, the standard error estimate, and the p-value for each iteration.

est_ancova <- function(dx, vars) {

  formula <- as.formula(paste("y ~", paste(vars, collapse = " + ")))
  model <- lm(data = dx, formula = formula)
  
  coef_summary <- summary(model)$coefficients["A", ]
  t_stat <- coef_summary["t value"]
  
  p_value <- pt(t_stat, df = model$df.residual, lower.tail = FALSE)
  ests <- data.table(t(coef_summary[1:2]), p_value)
  setnames(ests, c("est", "se", "pval"))
  
  return(ests)

}
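
The one-sided p-value returned by est_ancova is the upper-tail probability of the t statistic, so it is not the same number that summary() reports for lm. A minimal sketch of the relationship, using a small made-up data set (the sample size, effect size, and object names here are assumptions for illustration, not values from the simulation):

```r
set.seed(123)

# Hypothetical two-arm data set with a positive treatment effect
n <- 200
A <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * A + rnorm(n)

fit <- lm(y ~ A)
cf <- summary(fit)$coefficients["A", ]

t_stat <- unname(cf["t value"])
p_two <- unname(cf["Pr(>|t|)"])   # two-sided p-value reported by summary()

# One-sided (upper-tail) p-value, computed the same way as in est_ancova
p_one <- pt(t_stat, df = fit$df.residual, lower.tail = FALSE)

# For a positive t statistic the one-sided p-value is half the two-sided one;
# for a negative t statistic it is one minus that half
p_check <- if (t_stat > 0) p_two / 2 else 1 - p_two / 2

c(p_one = p_one, p_check = p_check)
```

This is also why the replications compare the p-values to 0.025 rather than 0.05.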

replicate <- function() {
  
  dc <- genData(250, def_c) 
  dd <- addColumns(def_y, dc)
  
  est_1 <- est_ancova(dd, vars = "A")
  est_2 <- est_ancova(dd, vars = c("A", "x1"))
  est_3 <- est_ancova(dd, vars = c("A", "x2"))
  est_4 <- est_ancova(dd, vars = c("A", "x1", "x2"))
  
  return(list(est_1 = est_1, est_2 = est_2, est_3 = est_3, est_4 = est_4))
  
}

res <- mclapply(1:2000, function(x) replicate())

All four models yield relatively unbiased estimates, though the models that adjust for $x_1$ (the potential confounder) result in reduced bias relative to those that do not. However, the clear advantage of models 2 and 4 (those that adjust for $x_1$) is the reduced variance of the treatment effect estimator:

get.field <- function(x, field) {
  data.table(t(sapply(x, function(x) x[[field]]) ))
}

ests <- rbindlist(lapply(res, function(x) get.field(x, "est")))
sapply(ests, function(x) c(bias = mean(x) - delta, var = var(x)))

##            est_1       est_2       est_3         est_4
## bias -0.05513569 -0.00157402 -0.05263476 -0.0002470811
## var   4.51844704  2.33883113  4.55788040  2.3507163302

The reduction in variance translates directly to increased power for the models that adjust for $x_1$, from about 63% to 90%. This seems like a pretty good reason to adjust for a baseline covariate that (a) you collect, and (b) is highly correlated with the outcome.

pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))

##  est_1  est_2  est_3  est_4 
## 0.6255 0.9055 0.6275 0.9035
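
A rough normal approximation shows how the drop in variance maps to power. The sketch below plugs the simulated variances for models 1 and 2 into the usual one-sided power formula. The true effect delta_assumed is set to 5 purely as an assumption for illustration; substitute the value of $\delta$ used in the data generation.

```r
# Empirical variances of the treatment effect estimator (rounded from the
# simulation output for est_1 and est_2)
var_unadj <- 4.518   # model 1: no adjustment
var_adj   <- 2.339   # model 2: adjusting for x1

# ASSUMPTION: hypothetical true treatment effect, for illustration only
delta_assumed <- 5

# Approximate power of a one-sided test at level alpha,
# treating the estimator as normal with known variance v
approx_power <- function(delta, v, alpha = 0.025) {
  pnorm(delta / sqrt(v) - qnorm(1 - alpha))
}

power_unadj <- approx_power(delta_assumed, var_unadj)
power_adj   <- approx_power(delta_assumed, var_adj)

round(c(unadjusted = power_unadj, adjusted = power_adj), 3)
```

The approximation ignores the gap between the estimated and true standard errors, so it should not be expected to reproduce the simulated power exactly.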

Exploring Type 1 error rates

The flip side of statistical power is the Type 1 error - the probability of concluding that there is a treatment effect when in fact there is no treatment effect. We can assess this by setting $\delta = 0$ and running another large number of replications. If we do this 2,000 times by generating a completely new data set each time, we see that the observed error rates are close to 0.025 for all the models, though the models that adjust for $x_1$ are closer to the theoretical value.

delta <- 0

res <- mclapply(1:2000, function(x) replicate())

pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))

##  est_1  est_2  est_3  est_4 
## 0.0180 0.0265 0.0185 0.0285

Both Senn and Pocock et al. suggest that a key advantage of adjusting for baseline covariates is that it helps achieve the desired error rates, particularly for one-sided tests. Assuming that all possible RCTs are conducted with the same level of covariate imbalance, models that include baseline covariate adjustments will yield accurate error rates. In contrast, models that do not adjust for important (highly correlated) covariates will produce deflated error rates. This occurs primarily because the standard errors of the effect estimates are systematically overestimated, reducing the likelihood of ever rejecting the null hypothesis.

To mimic the requirement that the dataset is sampled conditional on a fixed level of covariate imbalance, we generate the baseline covariates and treatment assignment only once, while the outcome is generated anew for each dataset. Under this approach, covariates and treatment assignment are fixed and only the outcome for a particular unit varies across iterations. An alternative approach would be to generate a large number of datasets using the full randomization process—creating new covariate values, treatment assignments, and outcomes for each iteration. Data sets would only be analyzed if they match the pre-specified covariate imbalance level. Although this approach yields the same results as our chosen method (I confirmed with simulations not shown here), the sampling process appears overly artificial, further complicating the interpretation of the p-value.

replicate_2 <- function() {
  
  dd <- addColumns(def_y, dc)
  
  est_1 <- est_ancova(dd, vars = "A")
  est_2 <- est_ancova(dd, vars = c("A", "x1"))
  est_3 <- est_ancova(dd, vars = c("A", "x2"))
  est_4 <- est_ancova(dd, vars = c("A", "x1", "x2"))
  
  return(list(est_1 = est_1, est_2 = est_2, est_3 = est_3, est_4 = est_4))
  
}

dc <- genData(250, def_c) 

res <- mclapply(1:2000, function(x) replicate_2())

pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))

##  est_1  est_2  est_3  est_4 
## 0.0075 0.0285 0.0070 0.0280
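
The alternative mentioned above, fully re-randomizing and keeping only the data sets that show a given level of imbalance, can be sketched in base R. Everything in this sketch is an assumption made for illustration: the data-generating model, the target imbalance z_target, the tolerance z_tol, and the helper names gen_once and imbalance are hypothetical and are not the simulations referred to in the text.

```r
set.seed(2025)

# Hypothetical data generation: x1 is prognostic, A is randomized,
# and there is no treatment effect
gen_once <- function(n = 250) {
  x1 <- rnorm(n, 0, 5)
  A  <- rbinom(n, 1, 0.5)
  y  <- 5 + 1.5 * x1 + rnorm(n, 0, 8)
  data.frame(x1, A, y)
}

# Standardized imbalance Z_x of x1 between arms (pooled-SD version)
imbalance <- function(d) {
  n1 <- sum(d$A == 1)
  n0 <- sum(d$A == 0)
  s_pooled <- sqrt(((n1 - 1) * var(d$x1[d$A == 1]) +
                    (n0 - 1) * var(d$x1[d$A == 0])) / (n1 + n0 - 2))
  (mean(d$x1[d$A == 1]) - mean(d$x1[d$A == 0])) /
    (s_pooled * sqrt(1 / n0 + 1 / n1))
}

z_target <- -0.94   # assumed imbalance level to condition on
z_tol    <- 0.10    # assumed tolerance around the target

# Generate many data sets, keep only those close to the target imbalance
sims <- lapply(1:2000, function(i) gen_once())
z    <- sapply(sims, imbalance)
kept <- sims[abs(z - z_target) < z_tol]

# One-sided p-values from the unadjusted model, among retained data sets
p_unadj <- sapply(kept, function(d) {
  fit <- lm(y ~ A, data = d)
  pt(summary(fit)$coefficients["A", "t value"],
     df = fit$df.residual, lower.tail = FALSE)
})

length(kept)            # number of data sets that matched the target
mean(p_unadj < 0.025)   # conditional Type 1 error rate, unadjusted model
```

Only a small fraction of the generated data sets survive the filter, which is part of what makes this sampling scheme feel artificial.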

Causal inference methods for balancing

To me, the strongest argument against adjusting for baseline covariates in the analysis is the risk that investigators may appear overly eager to demonstrate the intervention’s success. Pre-specifying the analysis plan goes a long way toward alleviating such concerns. Additionally, alternative approaches from causal inference methods can further reduce reliance on outcome model assumptions. In particular, balancing methods such as inverse probability weighting (IPW) and overlap weights (OW) can address covariate imbalances while preserving the original estimand. These techniques re-weight the sample to create balanced pseudo-populations without directly modifying the outcome model, offering a viable alternative to regression-based adjustments. They have the advantage of separating the design model from the outcome model (since the exposure and outcome models are two distinct steps). The balancing can be done before looking at the outcome data - so no risk of fishing for results. I plan on sharing simulations using these approaches sometime in the future.
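
As a rough illustration of the weighting idea (not the simulations promised above), the sketch below computes inverse probability weights and overlap weights from a logistic propensity model that never touches the outcome, and then compares weighted covariate means across arms. The data-generating values and the helper wtd_diff are assumptions made for this example.

```r
set.seed(77)

# Hypothetical trial data: two baseline covariates and a randomized treatment
n  <- 250
x1 <- rnorm(n, 0, 5)
x2 <- rnorm(n, 0, 5)
A  <- rbinom(n, 1, 0.5)

# Propensity score model: uses only covariates and treatment, no outcome
ps_fit <- glm(A ~ x1 + x2, family = binomial(),
              control = glm.control(epsilon = 1e-12, maxit = 100))
e <- fitted(ps_fit)

# Inverse probability weights and overlap weights
w_ipw <- ifelse(A == 1, 1 / e, 1 / (1 - e))
w_ow  <- ifelse(A == 1, 1 - e, e)

# Weighted difference in covariate means between arms
wtd_diff <- function(x, w) {
  weighted.mean(x[A == 1], w[A == 1]) - weighted.mean(x[A == 0], w[A == 0])
}

rbind(
  unweighted = c(x1 = wtd_diff(x1, rep(1, n)), x2 = wtd_diff(x2, rep(1, n))),
  ipw        = c(x1 = wtd_diff(x1, w_ipw),     x2 = wtd_diff(x2, w_ipw)),
  overlap    = c(x1 = wtd_diff(x1, w_ow),      x2 = wtd_diff(x2, w_ow))
)
```

When the propensity model is a logistic regression, the overlap weights balance the means of the included covariates exactly (up to numerical precision), which follows from the score equations of the logistic fit.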

References:

Senn, Stephen. “Testing for baseline balance in clinical trials.” Statistics in medicine 13, no. 17 (1994): 1715-1726.

Pocock, Stuart J., Susan E. Assmann, Laura E. Enos, and Linda E. Kasten. “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems.” Statistics in medicine 21, no. 19 (2002): 2917-2930.

Can ChatGPT help construct non-trivial statistical models? An example with Bayesian "random" splines

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 08 Oct 2024 00:00:00 +0000

I’ve been curious to see how helpful ChatGPT can be for implementing relatively complicated models in R. About two years ago, I described a model for estimating a treatment effect in a cluster-randomized stepped wedge trial. We used a generalized additive model (GAM) with site-specific splines to account for general time trends, implemented using the mgcv package. I’ve been interested in exploring a Bayesian version of this model, but hadn’t found the time to try - until I happened to pose this simple question to ChatGPT:

I want to build an R Stan model with cluster specific random splines - is that possible?

And as ChatGPT does, it eagerly let me know that this would not be a problem:

Yes, it’s possible to build a model in R using Stan with cluster-specific random splines. This approach involves using a hierarchical model where each cluster has its own spline, but the splines share a common structure.

Here is a step-by-step outline to guide you through this process:

Data Preparation: Organize your data into clusters and ensure it’s in a suitable format for modeling.

Spline Basis Functions: Generate spline basis functions (e.g., B-splines) for your data. This can be done using packages like splines or mgcv in R.

Stan Model Code: Write the Stan code for the hierarchical spline model. The model will include cluster-specific random effects for the spline coefficients.

Data List for Stan: Create a list of data inputs required by the Stan model, including the spline basis functions and the data itself.

Run the Model: Use the rstan package to compile and fit the model.

Maybe I shouldn’t have been, but I was pretty impressed by the code that followed. What follows are the highlights of my interaction with ChatGPT.

Preliminaries

Before getting started, here are the libraries needed to run all of the following code:

library(ggplot2)
library(simstudy)
library(data.table)
library(mgcv)
library(splines)
library(splines2)
library(cmdstanr)
library(posterior)

Data generation

To explore different modeling options, I wanted a simple data generation process to create the simulated data set. ChatGPT suggested using the following non-linear function for $y_{ik}$, the outcome for individual $i$ in cluster $k$, based on predictor $x_{ik}$: \[ y_{ik} = \text{sin}(2\pi x_{ik}), \ \ \{ x \in \mathbb{R} \mid 0 \leq x \leq 1 \} \] The code was simple enough:

set.seed(2024)

n <- 1000
k <- 10  # number of clusters
x <- runif(n)
cluster <- sample(1:k, n, replace = TRUE)
y <- sin(2  * pi * x) + rnorm(n, sd = 0.35)
dd <- data.table(y, x, cluster)
dd$cluster <- factor(dd$cluster)

Although the data generation process suggested by ChatGPT was helpful, it had a significant shortcoming. I wanted to model cluster-specific spline curves, but the ChatGPT code generated the same curve for all clusters. To address this, I used the general formulation and added a cluster-specific effect $a_k$, which stretches the sin curve differently for each cluster: \[ y_{ik} = \text{sin}(2\pi a_k x_{ik}), \ \ \{ a \in \mathbb{R} \mid 0.6 \leq a \leq 1.4 \} \]

k <- 10  # number of clusters

defc <- defData(varname = "a", formula = "0.6;1.4", dist = "uniform")
defi <- 
  defDataAdd(varname = "x", formula = "0;1", dist = "uniform") |>
  defDataAdd(
    varname = "y", 
    formula =  "sin(2 * a * ..pi * x)", 
    variance = 0.35^2
  )

dc <- genData(k, defc, id = "cluster")
dd <- genCluster(dc, "cluster", 100, "id")
dd <- addColumns(defi, dd)
dd[, cluster := factor(cluster)]

Data modeling

The goal is to estimate cluster-specific curves that capture the relationship between $x$ and $y$ within each cluster. I am aiming for these curves to reflect the overall trend without overfitting the data; in other words, we want the estimated function to provide a smooth and interpretable representation of the relationship, balancing flexibility and simplicity.

Although the purpose of my conversation with ChatGPT was to get a Bayesian version of this random spline model, I started off by asking it to generate a generalized additive model (GAM) to provide a basis for comparison. This is what it came up with: \[ y_{ik} = \beta_0 + s_k(x_{ik}) + \epsilon_{ik}, \ \ \epsilon \sim N(0, \sigma_y) \]

where $s_k(x)$ is a smooth spline function of $x$. The estimated model can be used to provide predictions that can be plotted to describe the relationship between $x$ and $y$:

gam <- gamm(
  y ~ s(x) + s(x, cluster, bs = "fs", k = 8),
  data = dd, method="REML"
)

dd$g <- predict(gam$gam)

Bayesian spline model

The first Bayesian model that ChatGPT generated can be described using this notation:

Data

$N$: number of individuals
$K$: number of clusters
$M$: number of spline basis functions
$y_{ik}$: outcome for individual $i$ in cluster $k$, $i \in 1,\dots,N$, $k \in 1,\dots ,K$
$\boldsymbol{X} \in \mathbb{R}^{N \times M}$: matrix of spline basis function values

Parameters

$\boldsymbol{\beta_{k}} \in \mathbb{R}^M$: spline coefficients for cluster $k$ (a vector of length $M$ for each cluster)
$\sigma_y$: standard deviation of the observation noise
$\sigma_\beta$: prior standard deviation for the spline coefficients

Model

Likelihood:

\[ y_{ik} \sim N\left( \sum_{m=1}^M X_{im} \beta_{km}, \sigma_y \right), \ i \in 1,\dots, N, \ k \in 1, \dots, K\]

Priors:

\[ \begin{aligned} \boldsymbol{\beta_{k}} &\sim N(0, \sigma_{\beta} \boldsymbol{I_M}), \ \ k \in 1,\dots,K \\ \sigma_{y} &\sim N(0, 1), \ \ \sigma_y \gt 0 \\ \sigma_{\beta} &\sim N(0, 1), \ \ \sigma_{\beta} \gt 0 \end{aligned} \]

The Stan code provided by ChatGPT aligns with this description. As part of the model, I also requested code to generate outcome predictions for each observation, which is implemented in the generated quantities block. My goal was to plot the median of those predictions for each individual $i$ as a comparison to the GAM plot above.

data {
  int<lower=1> N;                          // number of observations
  int<lower=1> K;                          // number of clusters
  int<lower=1> M;                          // number of basis functions
  array[N] int<lower=1, upper=K> cluster;  // cluster ids
  matrix[N, M] X_spline;                   // basis function values
  vector[N] y;                             // response variable
}

parameters {
  matrix[K, M] beta;          // cluster-specific spline coefficients
  real<lower=0> sigma_y;      // observation noise
  real<lower=0> sigma_beta;   // prior standard deviation for beta
}

model {
  sigma_y ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);
  
  // Priors for beta
  
  for (k in 1:K) {
    beta[k] ~ normal(0, sigma_beta);
  }
  
  // Likelihood
  
  for (n in 1:N) {
    y[n] ~ normal(X_spline[n] * beta[cluster[n]]', sigma_y);
  }
}

generated quantities {
  
  vector[N] y_pred;                    // predicted outcome for each observation
  
  for (n in 1:N) {
    y_pred[n] = normal_rng(X_spline[n] * beta[cluster[n]]', sigma_y);
  }
}

Spline basis functions

In the likelihood, $y_i$ is modeled as a function of the vector $\boldsymbol{X_i}$ rather than the single measurement $x_i$. While I won’t delve deeply into spline estimation, I want to conceptually outline how this vector is constructed in the context of cubic splines.

We control the flexibility of the curve by specifying the number of knots. A unique curve is fitted between each pair of knots (as well as at the ends), with constraints ensuring smooth transitions between these curves. The estimation of these curves is performed using basis functions, specifically B-spline basis functions of $x$.

The number of basis functions is determined by the number of knots. For instance, the plot below illustrates the $M=9$ basis functions required for $K=5$ knots. Each basis function contributes an element to the vector $\boldsymbol{X}$ for each value of $x$. In the case of cubic splines, at most four basis functions can be non-zero between any two knots, as indicated by the intervals on the x-axis. Consequently, the vector $\boldsymbol{X}$ consists of the values of each basis function at a given point $x$, with at most four non-zero entries corresponding to the active basis functions. (As an example, in the plot below there is a vertical line at a single point $x$ that passes through four basis functions.)
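
These counts are easy to check directly. The sketch below builds a cubic B-spline basis with five evenly spaced interior knots on a grid of made-up $x$ values and confirms the number of basis functions and the number of non-zero entries in each row. The grid x_grid and the knot positions knots_demo are assumptions for illustration.

```r
library(splines)

# Hypothetical evaluation grid on [0, 1] and five evenly spaced interior knots
x_grid <- seq(0, 1, length.out = 201)
knots_demo <- (1:5) / 6

B <- bs(x_grid, knots = knots_demo, degree = 3, intercept = TRUE)
B_mat <- matrix(as.numeric(B), nrow = nrow(B))   # plain numeric matrix

# Number of basis functions: knots + degree + 1 (for the intercept)
n_basis <- ncol(B_mat)

# Number of non-zero basis functions at each value of x
n_nonzero <- rowSums(B_mat > 0)

# With intercept = TRUE, the basis functions sum to 1 at every x
row_totals <- rowSums(B_mat)

n_basis
max(n_nonzero)
range(row_totals)
```

Each row of B_mat is the vector $\boldsymbol{X}$ for one value of $x$, so the row-wise count of non-zero entries is the quantity described in the text.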

This example uses five knots (and so $M = 9$ basis functions) to introduce a slight overfitting of the data, which will allow me to apply another model in the next step that will further smooth the curves. (In a real-world setting, it may have made more sense to start out with fewer knots.) The bs function (in the splines package) computes the B-spline basis function values for each observed $x$.

n_knots <- 5
knot_dist <- 1/(n_knots + 1)
probs <- seq(knot_dist, 1 - knot_dist, by = knot_dist)
knots <- quantile(dd$x, probs = probs)
spline_basis <- bs(dd$x, knots = knots, degree = 3, intercept = TRUE)
X_spline <- as.matrix(spline_basis)

Data list for Stan

To fit the model, we need to create the data set that Stan will use to estimate the parameters.

stan_data <- list(
  N = nrow(dd),           # number of observations
  K = k,                  # number of clusters
  M = ncol(X_spline),     # number of basis functions
  cluster = dd$cluster,   # vector of cluster ids
  X_spline = X_spline,    # basis function values
  y = dd$y                # response variable
)

Run Stan model

ChatGPT provided code to estimate the model using the rstan package. However, I prefer using the cmdstanr package, which I find more stable and generally less finicky. From the plot, you can see that the estimation was quite good. However, the curves are a bit too wiggly, indicating the data may have been slightly overfit, particularly for clusters 1, 3, and 7.

mod <- cmdstan_model("code/spline.stan")

fit <- mod$sample(
  data = stan_data,
  chains = 4,
  iter_warmup = 500,
  iter_sampling = 2000,
  parallel_chains = 4,
  refresh = 0 # suppress progress updates while sampling
)

## Running MCMC with 4 parallel chains...

## Chain 2 finished in 5.4 seconds.
## Chain 1 finished in 5.6 seconds.
## Chain 3 finished in 5.6 seconds.
## Chain 4 finished in 5.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.6 seconds.
## Total execution time: 5.9 seconds.

draws <- as_draws_df(fit$draws())

ds <- summarize_draws(draws, .fun = median) |> data.table()
dd$np <- ds[substr(variable, 1, 3) == "y_p", 2]

Penalized spline

When I made my initial inquiry to ChatGPT, it provided only a single model and didn’t indicate that there might be alternatives. To elicit another option, I had to specifically ask. To smooth the estimate provided by the initial model (which admittedly I made too wiggly on purpose), I asked ChatGPT to provide a penalized Bayesian spline model, and it obliged.

The model is just an extension of the spline model, with an added penalization term that is based on the second derivative of the B-spline basis functions. We can strengthen or weaken the penalization using a tuning parameter $\lambda$ that is passed to the model as data. The Stan model block is unchanged from the original, except for the added penalization term (D2_spline and lambda also need to be declared in the data block).

model {

  sigma_y ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);

  // Priors for beta
  
  for (k in 1:K) {
    beta[k] ~ normal(0, sigma_beta);
  }
  
  //Penalization <---------------------------------------
  
  for (k in 1:K) {
    target += -lambda * sum(square(D2_spline * beta[k]'));
  }
  
  // Likelihood
  
  for (n in 1:N) {
    y[n] ~ normal(X_spline[n] * beta[cluster[n]]', sigma_y);
  }
}

The second derivatives of the B-spline basis functions are computed using the dbs function in the splines2 package. Like the matrix $\boldsymbol{X}$, $\boldsymbol{D_2}$ has dimensions $N \times M$. Both $\boldsymbol{D_2}$ and $\lambda$ are added to the data passed to Stan:

D2 <- dbs(dd$x, knots = knots, degree = 3, derivs = 2, intercept = TRUE)
D2_spline <- as.matrix(D2)

stan_data <- list(
  N = nrow(dd),
  K = k,
  M = ncol(X_spline),
  cluster = dd$cluster,
  X_spline = X_spline,
  D2_spline = D2_spline,
  y = dd$y,
  lambda = 0.00005
)

## Running MCMC with 4 parallel chains...

## Chain 2 finished in 16.5 seconds.
## Chain 1 finished in 16.6 seconds.
## Chain 3 finished in 16.7 seconds.
## Chain 4 finished in 16.7 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 16.6 seconds.
## Total execution time: 16.8 seconds.
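
To get a feel for what the penalization term measures, the sketch below evaluates $\lambda \sum (\boldsymbol{D_2}\boldsymbol{\beta})^2$ for two coefficient vectors. To stay within base R, it builds the second-derivative matrix with splines::splineDesign rather than splines2::dbs, and it uses evenly spaced knots on a made-up grid. Both of those choices, along with the helper name penalty, are assumptions for illustration.

```r
library(splines)

# Hypothetical grid and knots (stand-ins for the observed x and quantile knots)
x_grid <- seq(0.005, 0.995, length.out = 199)
inner_knots <- (1:5) / 6

# Full knot sequence for a cubic B-spline basis on [0, 1]:
# each boundary knot is repeated 4 times (order = degree + 1 = 4)
all_knots <- c(rep(0, 4), inner_knots, rep(1, 4))

# Second derivatives of the 9 basis functions at each grid point
D2_demo <- splineDesign(all_knots, x_grid, ord = 4,
                        derivs = rep(2L, length(x_grid)))

# Penalization term for a single cluster's coefficient vector
penalty <- function(beta, lambda = 0.00005) {
  lambda * sum((D2_demo %*% beta)^2)
}

beta_flat   <- rep(1, 9)                       # constant curve, no curvature
beta_wiggly <- rep(c(-1, 1), length.out = 9)   # alternating coefficients

c(flat = penalty(beta_flat), wiggly = penalty(beta_wiggly))
```

In the Stan model this quantity is subtracted from the log posterior, so coefficient vectors that produce more curvature are down-weighted, and increasing $\lambda$ strengthens that effect.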

The plot directly comparing the penalized Bayesian model with the initial Bayesian model (initial Bayesian model in blue) shows the impact of further smoothing.

A direct comparison between the GAM and penalized Bayesian models (GAM in green) suggests that there might be some differences in the estimation for at least several clusters, particularly those that change direction twice. The penalized Bayesian model appears to be smoothing more than the GAM:

I was also aware of a third version of the Bayesian spline model that uses a random-walk prior on the $\beta\text{'s}$ to induce smoothing. Unprompted, ChatGPT did not mention this. But, upon request it did give me code that I was able to implement successfully. I’ll leave it to you to explore this further on your own—or perhaps ask ChatGPT for assistance.
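
For reference, one common form of the random-walk idea is a first-order random walk, in which the independent priors on the spline coefficients are replaced by priors centered on the neighboring coefficient. This is written here as an assumption about the general approach, not as the specific model ChatGPT produced, and $\tau$ is a hypothetical smoothing standard deviation:

\[ \beta_{k1} \sim N(0, \sigma_\beta), \quad \beta_{km} \sim N(\beta_{k,m-1}, \tau), \ \ m \in 2,\dots,M, \ \ k \in 1,\dots,K \]

Smaller values of $\tau$ pull adjacent coefficients closer together, which yields smoother curves.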

Reference:

OpenAI. (2024). ChatGPT (September 30, Version) [Large language model]. https://chat.openai.com/

Support:

This work is supported within the National Institutes of Health (NIH) Health Care Systems Research Collaboratory by cooperative agreement UG3/UH3AT009844 from the National Institute on Aging. This work also received logistical and technical support from the NIH Collaboratory Coordinating Center through cooperative agreement U24AT009676. Support was also provided by the NIH National Center for Complementary and Integrative Health Administrative Supplement for Complementary Health Practitioner Research Experience through cooperative agreement UH3AT009844 and by the National Center for Complementary and Integrative Health of the National Institutes of Health under award number UH3AT009844. Work also supported by Memorial Sloan Kettering Cancer Center Support Grant/Core Grant P30CA008748. The author was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

An IV study design to estimate an effect size when randomization is not ethical

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 03 Sep 2024 00:00:00 +0000

An investigator I frequently consult with seeks to estimate the effect of a palliative care treatment protocol for patients nearing end-stage disease, compared to a more standard, though potentially overly burdensome, therapeutic approach. Ideally, we would conduct a two-arm randomized clinical trial (RCT) to create comparable groups and obtain an unbiased estimate of the intervention effect. However, in this case, it may be considered unethical to randomize patients to a non-standard protocol.

Alternatively, we could conduct an observational study, measuring outcomes for patients who choose one of the two protocols. This approach would yield unbiased estimates only if we could collect all relevant data to ensure the absence of unmeasured confounding. While such an analysis could be useful, it’s uncertain whether we could convincingly argue that the estimates are truly unbiased, though we might assess their sensitivity to unmeasured confounding.

A potential middle ground involves randomizing in a way that merely increases the likelihood that a patient will choose the palliative care protocol. If successful, this could allow us to estimate a specific causal effect—the Complier Average Treatment Effect (CATE). The CATE might offer valuable insights into the relative merits of the two approaches. The key advantage of this approach is that it enables an unbiased estimate of the CATE, even in the presence of unmeasured confounding.

My goal here is to introduce you to this instrumental variable (IV) design and present simulations that demonstrate its potential strengths. IV analysis is a widely used estimation method across various fields, including economics, sociology, and epidemiology.

Observational study with no unmeasured confounding

The directed acyclic graph (DAG) below represents an observational dataset with two predictors, $X$ and $U$, where only $X$ is measured. In this scenario, we can estimate the average treatment effect by fitting a regression model that adjusts for $X$, assuming $X$ is equally distributed across the palliative care and standard therapy groups. Although $U$, the unmeasured covariate, predicts the outcome, it does not predict the selected protocol. Therefore, $U$ is not a confounder and does not need to be included in the model to ensure an unbiased estimate.

The data definitions for the simulation follow from the DAG. In the simulation, the variable $b$ is the association between $U$ and the chosen protocol. In this first case, $b$ will be set to zero so that $U$ is only a predictor of the outcome, but not the protocol. In the data generation, we generate potential outcomes $Y^0$ and $Y^1$ for each individual. These are the outcomes we would observe for the same individual if they receive the standard therapy or the palliative care protocol, respectively. The causal effect at the individual level is $Y^1 - Y^0$. (I’ve discussed causal inference elsewhere - for example, here, here, and here. If you want a really good guide, one that even includes a chapter on IV, you can’t really do any better than the book Causal Inference: What If by Hernán and Robins, which is available online.) The observed outcome $Y$ is the potential outcome for the protocol actually selected. In this case (and all the others that follow), I assume a treatment effect of 2 (with some variation across individuals).

Here are the necessary libraries for all the simulations that follow and then the data definitions just described:

library(simstudy)
library(data.table)
library(broom)
library(AER)

def <-
  defData(varname = "X", formula = 0, variance = 1, dist = "normal") |>
  defData(varname = "U", formula = 0, variance = 1, dist = "normal") |>
  defData(varname = "P", formula = "-1 + .2*X + ..b*U", dist = "binary", link="logit") |>
  defData(varname = "Y0", formula = "X + U", variance = 16, dist = "normal") |>
  defData(varname = "Y1", formula = "2 + Y0", variance = 0.75, dist = "normal") |>
  defData(varname = "Y", formula = "(P==0)*Y0 + (P==1)*Y1", dist = "nonrandom")

I’m generating a large sample to reduce sample variability:

set.seed(9434)

b <- 0
dd <- genData(10000, def)
dd

## Key: <id>
##           id           X          U     P        Y0         Y1         Y
##        <int>       <num>      <num> <int>     <num>      <num>     <num>
##     1:     1  0.56455033 -0.3995340     0 -1.129070 -0.4144314 -1.129070
##     2:     2  0.47370692  0.8632443     0  1.198505  4.0695024  1.198505
##     3:     3  0.14111773 -0.5722770     1  2.893212  6.1998168  6.199817
##     4:     4 -0.33594059  0.7847092     0  2.796105  5.1121294  2.796105
##     5:     5 -1.58531504 -0.4670841     1 -4.371836 -2.2362033 -2.236203
##    ---                                                                  
##  9996:  9996 -1.31839289 -0.4082700     0  2.458730  5.4139270  2.458730
##  9997:  9997 -1.37300260  0.9126086     0  3.820652  5.3291210  3.820652
##  9998:  9998 -1.37026038 -0.4491965     0 -1.145322  0.3858421 -1.145322
##  9999:  9999 -2.31914806  0.4899357     0  1.070460  4.4608188  1.070460
## 10000: 10000 -0.04254774  0.6394997     1  2.434553  4.9220870  4.922087

The true average causal effect is close to two, as expected. This is something we cannot observe:

dd[, mean(Y1 - Y0)]

## [1] 1.984264

If we estimate a linear model adjusting for $X$, the parameter for $P$ should also be close to two, which it is:

tidy(lm(Y ~ P + X, data = dd))

## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -0.0324    0.0492    -0.658 5.10e-  1
## 2 P             2.03      0.0933    21.8   3.96e-103
## 3 X             0.957     0.0418    22.9   8.37e-113

Observational study with unmeasured confounding

The DAG below now has an arrow pointing from $U$ to the protocol, so $b$ is non-zero. If we are not able to measure $U$ and control for it in the model, we will get a biased estimate of the treatment effect.

In this case, the unobserved average causal effect is still very close to two.

set.seed(9434)

b <- 0.2
dd <- genData(10000, def)

dd[, mean(Y1 - Y0)]

## [1] 1.984264

However, if we fit a model without accounting for the unmeasured confounder $U$, we get a biased (upward) estimate for the effect of the palliative care protocol:

## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -0.0699    0.0492     -1.42 1.56e-  1
## 2 P             2.17      0.0933     23.3  1.48e-116
## 3 X             0.952     0.0419     22.8  7.88e-112

Adjusting for $U$ removes the bias, though this would not be possible if we could not measure $U$:

## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -0.0270    0.0480    -0.563 5.73e-  1
## 2 P             1.99      0.0911    21.8   2.46e-103
## 3 X             0.943     0.0407    23.1   1.79e-115
## 4 U             0.970     0.0411    23.6   4.96e-120

IV design

Since we don’t believe it’s possible to measure all potential confounders, nor do we think conducting a standard RCT is appropriate or feasible, we might want to consider a design that randomizes patients to a setting where they are more likely to choose palliative care over the standard protocol. By establishing a counseling program that encourages patients to opt for palliative care when appropriate, we can use the variation in treatment decisions to estimate a causal effect. Crucially, patients in both the counseling and non-counseling arms can choose either palliative care or the standard protocol; the expectation is that those in the counseling arm will be more likely to choose palliative care.

It’s important to be clear that our interest lies in the effect of palliative care versus the standard protocol, not in the counseling program itself. A direct comparison between the counseling and non-counseling groups would not provide the information we need, as each group is likely to include patients receiving both types of care. Therefore, such a comparison would be ambiguous. Moreover, if we compare palliative care and standard protocol within the counseling arm, we face the same issue of unmeasured confounding as in a purely observational study.

Additionally, comparing patients who opt for palliative care in the counseling arm with those who choose the standard protocol in the non-counseling arm could lead to problems. Some patients in the non-counseling arm may have chosen standard care regardless of their group assignment, making them poor counterfactuals for those in the counseling arm who opted for palliative care. Similarly, some patients in the counseling arm who chose palliative care might have done so even without counseling, so including them in a comparison with non-counseling patients who opted for standard care may not be meaningful.

This situation illustrates a taxonomy of patient behavior essential to instrumental variable (IV) analysis, known as principal stratification. In this framework, patients can be categorized as never-takers (those who choose standard care regardless of their assignment), always-takers (those who choose palliative care regardless), or compliers (those who choose standard care in the non-counseling arm but opt for palliative care in the counseling arm). There is also a fourth category, defiers (those who would choose palliative care only if they were not counseled), which we assume does not exist.

In IV analysis, we compare the compliers across arms, effectively sidestepping the never-takers and always-takers. Although we can’t definitively classify individual patients as compliers or never-takers, we assume that a patient in the non-counseling arm who chooses palliative care is an always-taker. Similarly, we assume that a patient in the counseling arm who opts for standard care would have done so even without counseling.

The instrument in IV analysis is the characteristic or assignment that induces different distributions of the exposures of interest—in this case, the assignment to counseling or not, with the exposure of interest being palliative care versus the standard protocol. The target estimand in IV analysis is the Complier Average Treatment Effect (CATE). CATE can be estimated using IV analysis under four key assumptions:

Monotonicity: There are never-takers, always-takers, and compliers, but no defiers.
Independence: The instrument must be uncorrelated with the error term of the IV model.
Exclusion Restriction: The instrument can have no direct effect on the outcome of interest; its effect must be solely through the exposure of interest.
Relevance: The instrument must be correlated with the exposure of interest, meaning the probability of choosing palliative care differs between the counseling and non-counseling arms.
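
Under these four assumptions, the CATE can be written as a simple ratio, often called the Wald estimator. As a sketch in the notation of the simulation that follows ($A$ for assignment to counseling, $P$ for receipt of palliative care, $Y$ for the outcome), the numerator is the effect of assignment on the outcome, and the denominator is the effect of assignment on the exposure, which under monotonicity equals the proportion of compliers:

```latex
\text{CATE} = \frac{E[Y \mid A = 1] - E[Y \mid A = 0]}{E[P \mid A = 1] - E[P \mid A = 0]}
```

The relevance assumption is what keeps the denominator away from zero; a weak instrument makes this ratio unstable.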

Here is the DAG that represents the last three assumptions (but not the monotonicity assumption):

The simulation for this scenario is a little more involved, because we need to generate outcomes depending on the stratum an individual falls in, and we need to generate the palliative care exposure differentially. The way I have implemented this is to generate $P^0$, the potential exposure assuming an individual is randomized not to receive counseling. The probability of opting for palliative care is very low, less than 10%:

def_p0 <-
  defData(varname = "A", formula = "1;1", dist = "trtAssign") |>
  defData(varname = "X", formula = 0, variance =1, dist = "normal") |>
  defData(varname = "U", formula = 0, variance =1, dist = "normal") |>
  defData(varname = "P0", formula = "-2.5 + 0.2*X + 0.2*U", dist = "binary", link="logit")

set.seed(7271)

dd <- genData(10000, def_p0)
dd

## Key: <id>
##           id     A           X          U    P0
##        <int> <int>       <num>      <num> <int>
##     1:     1     1 -0.05885615  1.6427735     0
##     2:     2     1 -1.30383586  0.3174614     0
##     3:     3     1  0.61268664 -1.5143240     0
##     4:     4     0  2.07869248  0.4865766     0
##     5:     5     1 -0.43060933 -1.3370887     0
##    ---                                         
##  9996:  9996     0 -0.88119577  0.3312608     0
##  9997:  9997     1 -0.90726045  0.8927394     0
##  9998:  9998     1 -0.69425529 -1.4237684     0
##  9999:  9999     0  1.37659232 -0.4990659     0
## 10000: 10000     0 -0.71196863  0.2136513     0

dd[, mean(P0)]

## [1] 0.084

Next we generate $P^1$, the potential exposure when randomized to counseling.

def_p1 <-
  defCondition(
    condition = "P0 == 0", 
    formula = "0.5 + X + U", 
    dist = "binary", 
    link = "logit"
  ) |>
  defCondition(
    condition = "P0 == 1", 
    formula = 1, 
    dist = "nonrandom"
  )

dd <- addCondition(def_p1, dd, newvar = "P1")
dd

## Key: <id>
##           id    P1     A           X          U    P0
##        <int> <num> <int>       <num>      <num> <int>
##     1:     1     1     1 -0.05885615  1.6427735     0
##     2:     2     1     1 -1.30383586  0.3174614     0
##     3:     3     0     1  0.61268664 -1.5143240     0
##     4:     4     1     0  2.07869248  0.4865766     0
##     5:     5     1     1 -0.43060933 -1.3370887     0
##    ---                                               
##  9996:  9996     1     0 -0.88119577  0.3312608     0
##  9997:  9997     0     1 -0.90726045  0.8927394     0
##  9998:  9998     0     1 -0.69425529 -1.4237684     0
##  9999:  9999     1     0  1.37659232 -0.4990659     0
## 10000: 10000     1     0 -0.71196863  0.2136513     0

The probability of choosing palliative care is much higher under counseling (and is in fact 100% when $P^0 = 1$):

dd[, mean(P1), keyby = P0]

## Key: <P0>
##       P0        V1
##    <int>     <num>
## 1:     0 0.5803493
## 2:     1 1.0000000
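
Because $P^1$ is forced to 1 whenever $P^0 = 1$, no one can be a defier, and each simulated patient falls into exactly one of the three remaining strata. The following is a minimal base R sketch (not the simstudy code above) that mimics the two exposure models and labels each stratum; the seed and the variable name stratum are assumptions made for illustration.

```r
set.seed(2025)  # arbitrary seed, chosen only for reproducibility
n <- 10000
X <- rnorm(n)   # measured covariate
U <- rnorm(n)   # unmeasured covariate

# potential exposure without counseling (palliative care is rare)
P0 <- rbinom(n, 1, plogis(-2.5 + 0.2 * X + 0.2 * U))

# potential exposure with counseling; always 1 when P0 is already 1
P1 <- ifelse(P0 == 1, 1L, rbinom(n, 1, plogis(0.5 + X + U)))

# label the principal strata implied by the pair (P0, P1)
stratum <- ifelse(P0 == 1, "always-taker",
           ifelse(P1 == 1, "complier", "never-taker"))

round(prop.table(table(stratum)), 3)
```

Monotonicity holds by construction here, since P1 is never smaller than P0.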

This next step is a little less intuitive. I’m first generating interim potential outcomes $Q^0$ (for $Y^0$) and $Q^1$ (for $Y^1$). $Q^0$ depends on $X$ and $U$, and $Q^1$ is centered around $2 + Q^0$. The actual potential outcomes $Y^0$ and $Y^1$ depend on the status of $P^0$ and $P^1$, respectively. If $P^0 = 0$ then $Y^0$ takes on the value of $Q^0$, but if $P^0 = 1$ then $Y^0$ takes on the value of $Q^1$. The same logic defines $Y^1$.

def_A <-
  defDataAdd(varname = "Q0", formula = "X + U", variance = 16, dist = "normal") |>
  defDataAdd(varname = "Q1", formula = "2 + Q0", variance = 0.75, dist = "normal") |>
  defDataAdd(varname = "Y0", formula = "Q0*(P0==0) + Q1*(P0==1)", dist = "nonrandom") |>
  defDataAdd(varname = "Y1", formula = "Q0*(P1==0) + Q1*(P1==1)", dist = "nonrandom") |>
  defDataAdd(varname = "P", formula = "(A==0)*P0 + (A==1)*P1", dist = "nonrandom") |>
  defDataAdd(varname = "Y", formula = "(P==0)*Y0 + (P==1)*Y1", dist = "nonrandom")

dd <- addColumns(def_A, dd)
dd[, .(A, P0, P1, P, Q0 = round(Q0, 2), Q1 = round(Q1, 2), 
       Y0 = round(Y0, 2), Y1 = round(Y1, 2), Y = round(Y, 2))]

##            A    P0    P1     P    Q0    Q1    Y0    Y1     Y
##        <int> <int> <num> <num> <num> <num> <num> <num> <num>
##     1:     1     0     1     1 -2.51 -1.21 -2.51 -1.21 -1.21
##     2:     1     0     1     1 -2.62 -1.59 -2.62 -1.59 -1.59
##     3:     1     0     0     0 -1.24  0.67 -1.24 -1.24 -1.24
##     4:     0     0     1     0  2.05  3.13  2.05  3.13  2.05
##     5:     1     0     1     1  2.49  4.32  2.49  4.32  4.32
##    ---                                                      
##  9996:     0     0     1     0  0.95  3.29  0.95  3.29  0.95
##  9997:     1     0     0     0  1.34  2.49  1.34  1.34  1.34
##  9998:     1     0     0     0  2.66  7.25  2.66  2.66  2.66
##  9999:     0     0     1     0  1.59  3.93  1.59  3.93  1.59
## 10000:     0     0     1     0  1.79  3.49  1.79  3.49  1.79

This data generating process forces the causal effect of palliative care to be zero for never-takers and always-takers, and to average 2 for compliers.

dd[, mean(Y1 - Y0), keyby = .(P0, P1)]

## Key: <P0, P1>
##       P0    P1       V1
##    <int> <num>    <num>
## 1:     0     0 0.000000
## 2:     0     1 1.999411
## 3:     1     1 0.000000

A model that adjusts for observed covariates only will provide a biased estimate for the effect of palliative care (in this case, leading to an overestimate):

tidy(lm(Y ~ P + X, data = dd))

## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -0.124    0.0518     -2.40 1.66e-  2
## 2 P              2.40     0.0894     26.8  6.65e-153
## 3 X              0.944    0.0427     22.1  1.03e-105

However, if we are able to measure and adjust for all covariates, we will get an unbiased estimate:

tidy(lm(Y ~ P + X + U, data = dd))

## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  0.00817    0.0509     0.161 8.72e-  1
## 2 P            1.98       0.0891    22.3   2.07e-107
## 3 X            0.988      0.0417    23.7   1.08e-120
## 4 U            0.944      0.0417    22.6   1.33e-110

Although we cannot actually measure $U$ (i.e., it is not observed), we can use IV estimation to get an unbiased estimate of the CATE. Note that $A$ is introduced into the regression equation:

tidy(ivreg(Y ~ P + X  |  A + X, data = dd))

## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   0.0203    0.0686     0.296 7.67e-  1
## 2 P             1.98      0.158     12.5   1.56e- 35
## 3 X             0.983     0.0445    22.1   8.17e-106
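
To see what the IV estimator is doing, the point estimate can be reproduced with two ordinary regressions (two-stage least squares): first regress the exposure on the instrument and $X$, then regress the outcome on the fitted exposure and $X$. The sketch below uses base R and a simplified re-creation of the data generating process; it relies on the fact that the observed outcome equals $Q^1$ when $P = 1$ and $Q^0$ otherwise, and it assigns $A$ by independent coin flips rather than balanced allocation. The seed and object names are assumptions. The standard errors reported by the second-stage lm are not valid, which is one reason to use ivreg in practice.

```r
set.seed(2025)  # arbitrary seed
n <- 10000
A <- rbinom(n, 1, 0.5)   # instrument: counseling assignment
X <- rnorm(n)            # measured confounder
U <- rnorm(n)            # unmeasured confounder

# potential and observed exposures
P0 <- rbinom(n, 1, plogis(-2.5 + 0.2 * X + 0.2 * U))
P1 <- ifelse(P0 == 1, 1L, rbinom(n, 1, plogis(0.5 + X + U)))
P  <- ifelse(A == 1, P1, P0)

# interim outcomes and observed outcome
Q0 <- rnorm(n, mean = X + U, sd = 4)            # variance 16
Q1 <- rnorm(n, mean = 2 + Q0, sd = sqrt(0.75))  # variance 0.75
Y  <- ifelse(P == 1, Q1, Q0)

# stage 1: exposure on instrument and measured covariate
stage1 <- lm(P ~ A + X)
P_hat  <- fitted(stage1)

# stage 2: outcome on predicted exposure and measured covariate
stage2 <- lm(Y ~ P_hat + X)

iv_est <- coef(stage2)[["P_hat"]]
iv_est
```

The coefficient on P_hat should land near the true complier effect of 2, up to sampling error.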

To get a more robust assessment of the analytic methods, I’ve conducted a simple experiment that generates 1000 smaller data sets of 500 individuals each. For each generated data set, I am recording the following:

overall_po: average overall causal effect of never-takers, always-takers, and compliers combined based on the unobserved potential outcomes
complier_po: compliers-only causal effect based on the potential outcomes
overall: overall observed effect without any adjustment
within: comparison of palliative care vs the standard protocol for patients who received counseling
cross: comparison of patients who received counseling and opted for palliative care with patients who did not receive counseling and received the standard protocol
lm_x: estimated effect adjusting for $X$ alone
lm_xu: estimated effect adjusting for both $X$ and $U$
iv: IV estimate

gen_ests <- function(n) {
  
  dd <- genData(n, def_p0)
  dd <- addCondition(def_p1, dd, newvar = "P1")
  dd <- addColumns(def_A, dd) 
  
  overall_po <- dd[,mean(Y1-Y0)]
  
  dx <- dd[P0 == 0 & P1 == 1]
  complier_po <- dx[,mean(Y1-Y0)]
  
  overall <- dd[P == 1, mean(Y)] - dd[P == 0, mean(Y)]
  within <- dd[P == 1 & A == 1, mean(Y)] - dd[P == 0 & A == 1, mean(Y)]
  cross <- dd[P == 1 & A == 1, mean(Y)] - dd[P == 0 & A == 0, mean(Y)]
  
  lm_x <- coef(lm(Y ~ P + X, data = dd))["P"]
  lm_xu <- coef(lm(Y ~ P + X + U, data = dd))["P"]
  
  iv <- coef(ivreg(Y ~ P + X  |  A + X, data = dd))["P"]
  
  data.table(overall_po, complier_po, overall, within, cross, lm_x, lm_xu, iv)
  
}

res <- rbindlist(lapply(1:1000, function(x) gen_ests(500)))

Below, the means and standard deviations of the estimates across all 1000 iterations are shown. As expected, all estimates are biased except for the complier average treatment effect based on the potential outcomes, the linear model adjusting for both $X$ and $U$, and the IV estimate:

lapply(res, function(x) c( round(mean(x), 2), round(sd(x), 2) ))

## $overall_po
## [1] 1.08 0.05
## 
## $complier_po
## [1] 2.00 0.05
## 
## $overall
## [1] 2.80 0.39
## 
## $within
## [1] 3.44 0.54
## 
## $cross
## [1] 2.56 0.43
## 
## $lm_x
## [1] 2.41 0.39
## 
## $lm_xu
## [1] 1.99 0.39
## 
## $iv
## [1] 1.97 0.68
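
The overall_po value has a simple interpretation. Since the individual causal effect averages 2 for compliers and is exactly 0 for never-takers and always-takers, the population average effect is the complier effect scaled by the share of compliers, written here as $\pi_c$ (a symbol introduced only for this sketch). The simulated mean of 1.08 therefore implies that roughly 54% of patients are compliers:

```latex
E[Y^1 - Y^0] = \pi_c \times 2 + (1 - \pi_c) \times 0 = 2\pi_c
\quad\Longrightarrow\quad
\pi_c \approx \frac{1.08}{2} = 0.54
```

This is in line with the single large data set above, where about 92% of patients had $P^0 = 0$ and about 58% of those had $P^1 = 1$, giving $0.916 \times 0.58 \approx 0.53$.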

Caveats

While IV estimation works well in this idealized setting, there are some key limitations worth noting. First, consider the Complier Average Treatment Effect (CATE). Is the CATE truly what we are interested in? If most patients are compliers, then perhaps it is, but if they represent only a small proportion of the population, the usefulness of the information becomes less clear. We are likely more interested in understanding the effect of an approach that will be acceptable to a significant portion of eligible patients. The size of this “significant portion” will depend on the context—perhaps 25% is sufficient in some cases, while in others, we may require closer to 75%.

The second issue is the assumption that counseling has no direct effect on the outcome. Although I’ve been vague about the outcome in question, it’s crucial to choose an outcome carefully to ensure that the act of receiving counseling, independent of the protocol selected, does not influence the outcome. If this assumption is violated, the IV estimate will no longer be unbiased for the CATE.

Reference:

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Generating binary data by specifying the relative risk, with simulations

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 02 Jul 2024 00:00:00 +0000

The most traditional approach for analyzing binary outcome data is logistic regression, where the estimated parameters are interpreted as log odds ratios or, if exponentiated, as odds ratios (ORs). No one other than statisticians (and maybe not even statisticians) finds the odds ratio to be a very intuitive statistic, and many feel that a risk difference or risk ratio/relative risks (RRs) are much more interpretable. Indeed, there seems to be a strong belief that readers will, more often than not, interpret odds ratios as risk ratios. This turns out to be reasonable when an event is rare. However, when the event is more prevalent, the odds ratio will diverge from the risk ratio. (Here is a paper that discusses some of these issues in greater depth, in case you came here looking for more.)

I was playing around with ORs and RRs using simstudy and realized that up until now, one could not specify a binary data generating process using an assumption about the underlying RR. (Well, you actually could, but it required the extra step of explicitly creating probability parameters using a RR assumption.) I’ve rectified that in the latest development version, by including a “log” link option for the binary distribution (and for binomial data generation more broadly). Here’s some simulation code to show this in action.

Simulation

In this data set, the treatment indicator is $A$. The control group ($A = 0$) will have a $20\%$ underlying probability of an outcome. The risk ratio is 1.8, so that the underlying probability of an outcome in the treatment group ($A = 1$) is $1.8 \times 0.20 = 36\%$.

library(simstudy)
library(data.table)

def <- 
  defData(varname = "A", formula = "1;1", dist = "trtAssign") |>
  defData(
    varname = "y", 
    formula = "log(0.2) + A * log(1.8)",
    dist = "binary",
    link = "log"
  )

The data generation bears this out:

set.seed(123)
dd <- genData(1000, def)
dd

## Key: <id>
##          id     A     y
##       <int> <int> <int>
##    1:     1     0     0
##    2:     2     0     0
##    3:     3     0     0
##    4:     4     1     0
##    5:     5     0     0
##   ---                  
##  996:   996     0     1
##  997:   997     0     0
##  998:   998     0     0
##  999:   999     1     0
## 1000:  1000     0     0

dd[, .(obs_p = mean(y)), keyby = A]

## Key: <A>
##        A obs_p
##    <int> <num>
## 1:     0 0.210
## 2:     1 0.366

Added bonus: estimating the RR using regression

Under the traditional approach, we might estimate a logistic regression model:

summary(glm(y ~ A, family = binomial, data = dd))

## 
## Call:
## glm(formula = y ~ A, family = binomial, data = dd)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3249     0.1098 -12.067  < 2e-16 ***
## A             0.7755     0.1438   5.393 6.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1200.7  on 999  degrees of freedom
## Residual deviance: 1170.7  on 998  degrees of freedom
## AIC: 1174.7
## 
## Number of Fisher Scoring iterations: 4

The parameter of interest is $e^{0.7755} = 2.1716$, which is a bit higher than the $RR = 1.8$ used to generate the data. The reason, of course, is that it is the OR.
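
The gap between the two measures can be worked out directly from the probabilities used to generate the data. A short base R check (the helper odds is defined here only for illustration) shows that a risk ratio of 1.8 with a 20% baseline risk corresponds to an odds ratio of 2.25, which is the quantity the logistic regression is estimating:

```r
p0 <- 0.20       # risk in the control group
rr <- 1.8        # risk ratio used to generate the data
p1 <- rr * p0    # risk in the treatment group: 0.36

# odds corresponding to a probability
odds <- function(p) p / (1 - p)

or <- odds(p1) / odds(p0)

c(RR = p1 / p0, OR = or)   # RR = 1.8, OR = 2.25
```

The rarer the event, the closer $1 - p$ is to 1 in both groups and the closer the OR gets to the RR.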

We can estimate the RR by using a log-binomial regression model. Like logistic regression, this is implemented using the glm function with the binomial “family”, but with a “log” link instead of a “logit” link:

fit_logbin <- glm(y ~ A, family = binomial(link = "log"), data = dd)
summary(fit_logbin)

## 
## Call:
## glm(formula = y ~ A, family = binomial(link = "log"), data = dd)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.56065    0.08674  -17.99  < 2e-16 ***
## A            0.55553    0.10483    5.30 1.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1200.7  on 999  degrees of freedom
## Residual deviance: 1170.7  on 998  degrees of freedom
## AIC: 1174.7
## 
## Number of Fisher Scoring iterations: 6

Here, the estimated parameter is the log(RR), so the estimated $RR =e^{0.5555} = 1.7428$, much closer to the true value.

The paper I’ve referenced suggests that binomial regression with a log link does not always reliably converge, and proposes a modified Poisson regression as a better approach. The modification is needed because the standard errors estimated from the Poisson model are too conservative (i.e., too high). We know that the variance of binary (Bernoulli) data is $p(1-p)$, whereas the variance under the Poisson model is $p$. Since $0 < p < 1$, $p > p(1-p)$. This makes sense, as binary outcome data are limited to $0$ and $1$, but Poisson data can include values greater than $1$.

We can see this by generating data from each distribution:

rb <- rbinom(10000, 1, 0.20)
rp <- rpois(10000, 0.20)

##      dist   avg   var   min   max
##    <char> <num> <num> <num> <int>
## 1:  binom   0.2  0.16     0     1
## 2:   pois   0.2  0.19     0     3
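
The code that produced the summary table is not shown; one way to build a similar table in base R is sketched below (the helper name summarize_draws, the object draw_summary, and the seed are assumptions). In theory the Bernoulli variance is $0.2 \times 0.8 = 0.16$ and the Poisson variance is $0.2$, so the sample values should land near those figures:

```r
set.seed(2025)  # arbitrary seed

rb <- rbinom(10000, 1, 0.20)   # Bernoulli draws
rp <- rpois(10000, 0.20)       # Poisson draws with the same mean

# mean, variance, and range of a vector of draws
summarize_draws <- function(x) {
  c(avg = mean(x), var = var(x), min = min(x), max = max(x))
}

draw_summary <- round(
  rbind(binom = summarize_draws(rb), pois = summarize_draws(rp)),
  2
)
draw_summary
```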

Now, let’s estimate a Poisson regression model. Since “log” is the default link for Poisson regression for function glm, we should get a risk ratio estimate similar to the log-binomial regression above, and in fact we do:

fit_pois <- glm(y ~  A, family = poisson, data = dd)
summary(fit_pois)

## 
## Call:
## glm(formula = y ~ A, family = poisson, data = dd)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.56065    0.09758 -15.994  < 2e-16 ***
## A            0.55553    0.12242   4.538 5.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 717.00  on 999  degrees of freedom
## Residual deviance: 695.61  on 998  degrees of freedom
## AIC: 1275.6
## 
## Number of Fisher Scoring iterations: 5

The only issue is that the standard error is somewhat inflated: the log-binomial model standard error for the treatment parameter was $0.10483$, whereas the estimate is $0.12242$ in the Poisson model. If we estimate a robust standard error using the function vcovHC in the sandwich package, the standard error is now in line with the log-binomial estimate. (I could simulate many data sets to see what the empirical variation of the estimate is, but I’ll leave that to you as an exercise.)

library(sandwich)

data.table(
  estimate = summary(fit_pois)$coef["A", "Estimate"],
  pois.SE = summary(fit_pois)$coef["A", "Std. Error"],
  logbin.SE = summary(fit_logbin)$coef["A", "Std. Error"],
  robust.SE = sqrt(diag(vcovHC(fit_pois, type = "HC3")))["A"]
)

##     estimate  pois.SE logbin.SE robust.SE
##        <num>    <num>     <num>     <num>
## 1: 0.5555258 0.122418  0.104825 0.1050351
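
For readers curious about what the robust estimator computes, here is a hedged base R sketch of the simplest sandwich form (HC0), not the HC3 variant used above, applied to freshly simulated data rather than to dd. The "bread" is the usual model-based covariance, which assumes the variance equals the mean; the "meat" replaces that assumption with the observed squared residuals. The seed and all object names are assumptions.

```r
set.seed(2025)  # arbitrary seed
n <- 1000
A <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.2 * 1.8^A)   # risk 0.20 if A = 0, 0.36 if A = 1

fit <- glm(y ~ A, family = poisson)

X  <- model.matrix(fit)   # n x 2 design matrix
mu <- fitted(fit)         # fitted means

# bread: inverse of X'WX with W = diag(mu), the Poisson working weights
bread <- solve(t(X) %*% (X * mu))

# meat: X' diag(residual^2) X, using observed rather than assumed variance
meat <- t(X) %*% (X * (y - mu)^2)

se_model  <- sqrt(diag(vcov(fit)))
se_robust <- sqrt(diag(bread %*% meat %*% bread))

round(cbind(se_model, se_robust), 4)
```

Because the squared residuals of a binary outcome average about $p(1-p)$ rather than $p$, the robust standard error for A comes out smaller than the model-based one.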

And just to doubly make sure that the robust standard errors are reasonable, we can estimate standard errors using a simple bootstrap:

bootstrap_both <- function(dx) {
  
  selected.rows <- dx[, sample(id, replace = TRUE), keyby = A][, V1]
  ds <- dx[selected.rows]
  
  fit_logbin = glm(y ~ A, family = binomial(link="log"), data = ds)
  fit_pois <- glm(y ~  A, family = poisson, data = ds)
  
  data.table(bs.logbin = coef(fit_logbin)["A"], bs.pois = coef(fit_pois)["A"])
  
}

bs.res <- rbindlist(lapply(1:1500, function(x) bootstrap_both(dd)))
bs.res[, .(bs.se_logbin = sd(bs.logbin), bs.se_pois = sd(bs.pois))]

##    bs.se_logbin bs.se_pois
##           <num>      <num>
## 1:    0.1043538  0.1043538

It does look like the robust standard errors for the Poisson regression model are indeed pretty robust, and that in this simple case at least, there is no difference between the Poisson and log-binomial models.

References:

Guangyong Zou. “A Modified Poisson Regression Approach to Prospective Studies with Binary Data.” American Journal of Epidemiology. Volume 159, Issue 7, 1 April 2004, Pages 702–706.

simstudy: another way to generate data from a non-standard density

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 04 Jun 2024 00:00:00 +0000

One of my goals for the simstudy package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDensity and addDataDensity, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download the latest development version.) This post provides a simple illustration of the new functionality.

Here are the libraries needed, in case you want to follow along:

library(simstudy)
library(data.table)
library(ggplot2)

set.seed(1234)

The target density is simply defined by specifying a vector that is intended to loosely represent a data distribution. We start by specifying the vector (which can be of any length):

base_data <- 
  c(1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10)

We can look at the density to make sure this is the distribution we are interested in drawing our data from:

emp_density <- density(base_data, n = 10000)
den_curve <- data.table(x = emp_density$x, y = emp_density$y)

ggplot(data = den_curve, aes(x = x, y = y)) +
  geom_line(linewidth = 1) +
  scale_y_continuous(name = "density\n", limits = c(0, 0.11), 
     breaks = seq(0, .10, by = .02)) +
  scale_color_manual(values = colors) +
  theme(panel.grid = element_blank())

Actually drawing samples from this distribution is a simple call to genDataDensity. The key argument is the data distribution as represented by the vector of integers:

dx <- genDataDensity(10000, dataDist = base_data, varname = "x1")
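
Without reproducing the internals of genDataDensity, the general idea of drawing from a kernel density estimate can be sketched in base R. With a Gaussian kernel, sampling from the estimated density is equivalent to resampling the original values with replacement and adding normal noise whose standard deviation is the bandwidth. The sketch assumes the default bandwidth rule of density (bw.nrd0); the seed and the name x_sim are illustrative.

```r
set.seed(2025)  # arbitrary seed

base_data <-
  c(1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10)

bw <- bw.nrd0(base_data)   # default bandwidth used by density()
n  <- 10000

# resample the base values, then jitter each draw with the Gaussian kernel
x_sim <- sample(base_data, n, replace = TRUE) + rnorm(n, mean = 0, sd = bw)

c(mean_target = mean(base_data), mean_sim = mean(x_sim))
```

The added noise explains why sampled values can fall below 1 or above 10 even though the base vector does not.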

Here’s a look at the sampled data and their relationship to the target density:

ggplot(data = dx, aes(x=x1)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), 
    binwidth = 1, fill = "grey", color = "black", alpha = .2) +
  geom_line(data = den_curve, aes(x = x, y = y), 
    color = "black", linewidth = 2) +
  scale_y_continuous(name = "density\n", limits = c(0, 0.11), 
     breaks = seq(0, .10, by = .02)) +
  scale_x_continuous(limits = c(-6, 15), breaks = seq(-5, 10, by = 5)) +
  theme(panel.grid = element_blank(),
        plot.title = element_text(face = "bold", size = 10))

Just to show that this was not a fluke, here are three additional target distributions, specified with three different vectors:

base_data <- list(
  c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7,   7,  8,  9, 10),
  c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10),
  c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10)
)

We can generate data from each of the distributions and then confirm that each one adequately fits its target distribution:

dx1 <- genDataDensity(10000, dataDist = base_data[[1]], varname = "x1")
dx2 <- genDataDensity(10000, dataDist = base_data[[2]], varname = "x1")
dx3 <- genDataDensity(10000, dataDist = base_data[[3]], varname = "x1")
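
The plotting code further down refers to a data set called dens that holds the three target density curves, but its construction is not shown. One plausible way to build it is sketched here in base R, assuming columns named .id, x, and y to match the plotting calls and the default of 512 evaluation points per curve:

```r
base_data <- list(
  c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7,   7,  8,  9, 10),
  c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10),
  c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10)
)

# estimate one density curve per vector and stack them with an index column
dens <- do.call(rbind, lapply(seq_along(base_data), function(i) {
  d <- density(base_data[[i]], n = 512)
  data.frame(.id = i, x = d$x, y = d$y)
}))

head(dens)
```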

Addendum: code to generate multiple distribution plot

Here is a little more detail in case someone might find it useful to have the code that generates the “facet” plot. In the plot with the single distribution, I specified the histogram with this command:

geom_histogram(aes(y = after_stat(count / sum(count))), ...)

When I tried to apply this to the “facet” plot, the denominator of that plot (sum(count)) was not calculated for each subgroup (i.e., dataset), but was the total across all datasets. As a result, the dataset-specific proportions were underestimated; we can see that here:

dx <- rbindlist(list(dx1, dx2, dx3), idcol = TRUE)

ggplot(data = dx, aes(x=x1)) +
  geom_histogram(
    aes(y = after_stat(count / sum(count)), fill = factor(.id), color = factor(.id)),
    binwidth = 1, alpha = .2) +
  geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) +
  xlab("\nx1") +
  ylab("density\n") +
  scale_fill_manual(values = colors) +
  scale_color_manual(values = colors) +
  theme(panel.grid = element_blank(),
        plot.title = element_text(face = "bold", size = 10),
        legend.position = "none") + 
  facet_grid(~ .id)
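
The effect of the shared denominator can be reproduced without any plotting. In this small base R sketch (simulated bins, arbitrary seed, illustrative object names), three equally sized data sets are tabulated together; dividing by the grand total makes each data set's proportions sum to one third, whereas dividing within each data set makes them sum to one:

```r
set.seed(2025)  # arbitrary seed

d <- data.frame(
  .id = rep(1:3, each = 1000),               # three data sets of equal size
  bin = sample(1:5, 3000, replace = TRUE)    # bin membership
)

counts <- table(d$.id, d$bin)

pooled   <- counts / sum(counts)             # denominator: all observations
by_group <- prop.table(counts, margin = 1)   # denominator: each data set

rowSums(pooled)     # each row sums to 1/3
rowSums(by_group)   # each row sums to 1
```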

I looked around for a way to address this, but couldn’t find anything that obviously addressed this shortcoming (though I am convinced it must be possible, and I just couldn’t locate the solution). I considered using ggarrange or something similar, but was not satisfied with the results. Instead, it turned out to be faster just to calculate the proportions myself. This is the process I used:

First, I created a dataset with the bins (using a bin size of 1):

cuts <- seq(dx[,floor(min(x1))], dx[,ceiling(max(x1))], by = 1)
dcuts <- data.table(bin = 1:length(cuts), binlab = cuts)

dcuts

##       bin binlab
##     <int>  <num>
##  1:     1     -3
##  2:     2     -2
##  3:     3     -1
##  4:     4      0
##  5:     5      1
##  6:     6      2
##  7:     7      3
##  8:     8      4
##  9:     9      5
## 10:    10      6
## 11:    11      7
## 12:    12      8
## 13:    13      9
## 14:    14     10
## 15:    15     11
## 16:    16     12
## 17:    17     13
## 18:    18     14

Then, I allocated each observation to a bin using the cut function:

dx[, bin := cut(x1, breaks = cuts, labels = FALSE)]
dx <- merge(dx, dcuts, by = "bin")

dx

## Key: <bin>
##          bin   .id    id        x1 binlab
##        <int> <int> <int>     <num>  <num>
##     1:     1     1  1251 -2.097413     -3
##     2:     1     1  2215 -2.580587     -3
##     3:     1     1  2404 -2.042049     -3
##     4:     1     1  3228 -2.078958     -3
##     5:     1     1  5039 -2.055471     -3
##    ---                                   
## 29996:    17     3  7690 13.290347     13
## 29997:    17     3  8360 13.083991     13
## 29998:    17     3  8860 13.149421     13
## 29999:    17     3  9214 13.043727     13
## 30000:    17     3  9743 13.199752     13

Finally, I calculated the distribution-specific proportions (showing only the second distribution):

dp <- dx[, .N, keyby = .(.id, binlab)]
dp[, p := N/sum(N), keyby = .id]

dp[.id == 2]

## Key: <.id>
##       .id binlab     N      p
##     <int>  <num> <int>  <num>
##  1:     2     -3     3 0.0003
##  2:     2     -2    38 0.0038
##  3:     2     -1   130 0.0130
##  4:     2      0   340 0.0340
##  5:     2      1   619 0.0619
##  6:     2      2   938 0.0938
##  7:     2      3  1161 0.1161
##  8:     2      4  1155 0.1155
##  9:     2      5  1035 0.1035
## 10:     2      6   882 0.0882
## 11:     2      7   828 0.0828
## 12:     2      8   861 0.0861
## 13:     2      9   822 0.0822
## 14:     2     10   641 0.0641
## 15:     2     11   384 0.0384
## 16:     2     12   140 0.0140
## 17:     2     13    23 0.0023

And now the facet plot will work just fine. Here is the code and the plot (again).

ggplot(data = dp, aes(x = binlab, y = p)) +
  geom_bar(aes(fill = factor(.id), color = factor(.id)), stat = "identity", alpha = .4) +
  geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), 
            linewidth = 2) +
  xlab("\nx1") +
  ylab("density\n") +
  scale_fill_manual(values = colors) +
  scale_color_manual(values = colors) +
  theme(panel.grid = element_blank(),
        plot.title = element_text(face = "bold", size = 10),
        legend.position = "none") + 
  facet_grid(~ .id)

Addendum follow-up

Well, that was quick. Andrea provided code on Disqus - which for some reason is no longer publishing on my site, and if anyone has thoughts about that issue, feel free to contact me :) - that does exactly what I was trying to do without any pre-plotting data transformations. The trick is to use the density stat available in geom_histogram. This actually looks better, because it lines up more precisely with the density curve.

ggplot(data = dx, aes(x=x1)) +
  geom_histogram(
    aes(y = after_stat(density), fill = factor(.id), color = factor(.id)),
    binwidth = 1, alpha = .2) +
  geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) +
  xlab("\nx1") +
  ylab("density\n") +
  scale_fill_manual(values = colors) +
  scale_color_manual(values = colors) +
  theme(panel.grid = element_blank(),
        plot.title = element_text(face = "bold", size = 10),
        legend.position = "none") +
  facet_grid(~ .id)

simstudy 0.8.0: customized distributions

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 May 2024 00:00:00 +0000

Over the past few years, a number of folks have asked if simstudy accommodates customized distributions. There’s been interest in truncated, zero-inflated, or even more standard distributions that haven’t been implemented in simstudy. While I’ve come up with approaches for some of the specific cases, I was never able to develop a general solution that could provide broader flexibility.

This shortcoming changes with the latest version of simstudy, now available on CRAN. Custom distributions can now be specified in defData and defDataAdd by setting the argument dist to “custom”. To introduce the new option, I am providing a couple of examples.

Specifying the customized distribution

When defining a custom distribution in defData, you provide the name of the user-defined function as a string in the formula argument. The arguments of this custom function are listed in the variance argument, separated by commas and formatted as “arg_1 = val_form_1, arg_2 = val_form_2, $\dots$, arg_K = val_form_K”.

The arg_k’s represent the names of the arguments passed to the customized function, where $k$ ranges from $1$ to $K$. You can use values or formulas for each val_form_k. If formulas are used, ensure that the variables have been previously generated. Double dot notation is available in specifying val_form_k. It is important to note that the parameter list of the actual function must include an argument “n = n”, but $n$ should not be included in the definition as part of defData or defDataAdd (specified in the variance field).

Example 1

Here is an example where we generate data from a zero-inflated beta distribution. (I’ve implemented something like this in the past using a mixture distribution, which is also a fine way to go.) I’ve created a user-defined function zeroBeta that takes on shape parameters $a$ and $b$ for the beta distribution, as well as $p_0$, the proportion of the sample that takes on a value of zero. Note that the function also takes an argument $n$ that will not be specified in the data definition; $n$ will represent the number of observations being generated:

zeroBeta <- function(n, a, b, p0) {
  betas <- rbeta(n, a, b)
  is.zero <- rbinom(n, 1, p0)
  betas*!(is.zero)
}

The data definition specifies that we want to create a variable $zb$ from the user-defined zeroBeta function with $a$ and $b$ set to 0.75, and $p_0 = 0.02$:

def <- defData(
  varname = "zb", 
  formula = "zeroBeta", 
  variance = "a = 0.75, b = 0.75, p0 = 0.02", 
  dist = "custom"
)

The data are generated with a call to genData as is typically done in simstudy:

set.seed(1234)
dd <- genData(100000, def)

## Key: <id>
##             id         zb
##          <int>      <num>
##      1:      1 0.93922887
##      2:      2 0.35609519
##      3:      3 0.08087245
##      4:      4 0.99796758
##      5:      5 0.28481522
##     ---                  
##  99996:  99996 0.81740836
##  99997:  99997 0.98586333
##  99998:  99998 0.68770216
##  99999:  99999 0.45096868
## 100000: 100000 0.74101272
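As a quick check on the generated values, the share of exact zeroes should sit close to $p_0 = 0.02$. The sketch below repeats the zeroBeta definition so that it runs on its own in base R (no simstudy needed); the seed is arbitrary, so the draws will not match the output above, and prop_zero is a name introduced only for this illustration:

```r
# zero-inflated beta generator, repeated from above so the block is self-contained
zeroBeta <- function(n, a, b, p0) {
  betas <- rbeta(n, a, b)
  is.zero <- rbinom(n, 1, p0)
  betas * !(is.zero)
}

set.seed(2718)  # arbitrary seed, chosen for this illustration only
zb <- zeroBeta(100000, a = 0.75, b = 0.75, p0 = 0.02)

prop_zero <- mean(zb == 0)   # observed share of exact zeroes
prop_zero
```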

A plot of the data highlights an over-representation of zeroes:

Example 2

In this second example, I am generating sets of truncated Gaussian distributions with means ranging from $-1$ to $1$. (I wrote about this a while ago - the approach implemented here is an alternative way to generate these data.) rnormt is a customized (user-defined) function that generates the truncated Gaussian data, and requires four arguments (the left truncation value, the right truncation value, the distribution average without truncation and the distribution standard deviation without truncation):

rnormt <- function(n, min, max, mu, s) {
  
  F.a <- pnorm(min, mean = mu, sd = s)
  F.b <- pnorm(max, mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
  
}
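rnormt relies on inverse-transform sampling: it draws uniform values between the two truncation points on the CDF scale and maps them back through the normal quantile function. A minimal check, assuming nothing beyond base R, can confirm that the draws stay inside the limits (the function is repeated so the block runs on its own, and the seed and argument values are arbitrary):

```r
rnormt <- function(n, min, max, mu, s) {
  F.a <- pnorm(min, mean = mu, sd = s)   # CDF at the left truncation point
  F.b <- pnorm(max, mean = mu, sd = s)   # CDF at the right truncation point
  u <- runif(n, min = F.a, max = F.b)    # uniform draws restricted to [F.a, F.b]
  qnorm(u, mean = mu, sd = s)            # map back to the normal scale
}

set.seed(42)  # arbitrary seed for illustration
x <- rnormt(10000, min = -1, max = 1, mu = 1, s = 1.5)
range(x)      # both values should fall within [-1, 1]
```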

In this example, truncation limits differ based on group membership. Initially, three groups are created (represented by the variable defined as limit), followed by the generation of truncated values (named tn). For Group 1, truncation is defined by the range of $-1$ to $1$; for Group 2, the range is $-2$ to $2$; and for Group 3, the range is $-3$ to $3$. We’ll generate three data sets, each with a distinct mean denoted by M, using the double-dot notation to implement the different means.

def <-
  defData(
    varname = "limit", 
    formula = "1/4;1/2;1/4",
    dist = "categorical"
  ) |>
  defData(
    varname = "tn", 
    formula = "rnormt", 
    variance = "min = -limit, max = limit, mu = ..M, s = 1.5",
    dist = "custom"
  )

The data generation requires three calls to genData, one for each different mean value $\mu$. I have chosen to implement this with lapply:

mu <- c(-1, 0, 1)
dd <- lapply(mu, function(M) genData(100000, def))

The output is a list of three data sets; here are the first six observations from each of the three data sets:

## [[1]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     2  0.6949619
## 2:     2     2 -0.3641963
## 3:     3     2 -0.4721632
## 4:     4     3 -2.6083796
## 5:     5     2 -0.6800441
## 6:     6     3 -0.5813880
## 
## [[2]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     1  0.4853614
## 2:     2     2 -0.5690811
## 3:     3     2  0.5282246
## 4:     4     2  0.1107778
## 5:     5     2 -0.3504309
## 6:     6     2  1.9439890
## 
## [[3]]
## Key: <id>
##       id limit         tn
##    <int> <int>      <num>
## 1:     1     2  1.3560628
## 2:     2     2  1.4543616
## 3:     3     3  1.4491010
## 4:     4     2  0.7328855
## 5:     5     2 -0.1254556
## 6:     6     2 -0.7455908

A plot highlights the group differences for each of the three data sets:

simstudy enhancement: specifying idiosyncratic follow-up times for longitudinal data

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 16 Apr 2024 00:00:00 +0000

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are $\{0, 1, 2, ..., n - 1\}$, where there are $n$ total periods. However, when follow-up periods required more specificity, such as $\{0, 90, 180, 365\}$ days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.

Just as a quick introduction, here is a simple example that shows the default settings of addPeriods. We are generating three individuals that will have measurements at four different periods, generically identified as $\{0, 1, 2, 3\}$:

library(simstudy)
set.seed(123)

dd <- genData(3)
dp <- addPeriods(dd, nPeriods = 4)

## Key: <timeID>
##        id period timeID
##     <int>  <int>  <int>
##  1:     1      0      1
##  2:     1      1      2
##  3:     1      2      3
##  4:     1      3      4
##  5:     2      0      5
##  6:     2      1      6
##  7:     2      2      7
##  8:     2      3      8
##  9:     3      0      9
## 10:     3      1     10
## 11:     3      2     11
## 12:     3      3     12

In this next example, we still assume four measurement periods, but they will be at baseline, 90 days, 180 days, and 1 year. The outcome $Y$ is a function of the day of follow-up:

def <- defDataAdd(varname = "Y", formula = "100 + 0.25 * day", variance = 400)

dd <- genData(3)
dp <- addPeriods(dd, nPeriods = 4, perName = "day", periodVec = c(0, 90, 180, 365))
dp <- addColumns(def, dp)

Here is the resulting data set:

## Key: <timeID>
##        id   day timeID         Y
##     <int> <num>  <int>     <num>
##  1:     1     0      1  88.79049
##  2:     1    90      2 117.89645
##  3:     1   180      3 176.17417
##  4:     1   365      4 192.66017
##  5:     2     0      5 102.58575
##  6:     2    90      6 156.80130
##  7:     2   180      7 154.21832
##  8:     2   365      8 165.94878
##  9:     3     0      9  86.26294
## 10:     3    90     10 113.58676
## 11:     3   180     11 169.48164
## 12:     3   365     12 198.44628
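For anyone working with a release that predates periodVec, the same long structure can be put together by hand. This is a base R sketch rather than simstudy code; dp_manual is a hypothetical name used only for the illustration, and the outcome is drawn with rnorm using the mean and variance from the definition above:

```r
ids <- 1:3
days <- c(0, 90, 180, 365)   # follow-up days, in place of the default 0, 1, 2, 3

# one row per individual per follow-up day
dp_manual <- data.frame(
  id = rep(ids, each = length(days)),
  day = rep(days, times = length(ids))
)
dp_manual$timeID <- seq_len(nrow(dp_manual))

set.seed(123)  # seed chosen for illustration
dp_manual$Y <- rnorm(
  nrow(dp_manual),
  mean = 100 + 0.25 * dp_manual$day,   # same mean structure as the definition of Y
  sd = sqrt(400)                       # variance of 400
)

dp_manual
```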

The second example transforms data in “wide” format to “long” format. Here is the wide data generation:

tdef <- 
  defData(varname = "Y0", dist = "normal", formula = 10, variance = 1) |>
  defData(varname = "Y1", dist = "normal", formula = "Y0 + 5", variance = 1) |>
  defData(varname = "Y2", dist = "normal", formula = "Y0 + 10", variance = 1)

dd <- genData(3, tdef)

## Key: <id>
##       id        Y0       Y1       Y2
##    <int>     <num>    <num>    <num>
## 1:     1 10.400771 17.18768 21.10213
## 2:     2 10.110683 15.60853 19.63789
## 3:     3  9.444159 12.47754 18.37634

And here is the transformation, with the time periods:

dp <- addPeriods(
  dd, 
  perName = "day", 
  timevars = paste0("Y", 0:2),
  timevarName = "Y",
  periodVec = c(0, 180, 365)
)

## Key: <timeID>
##       id   day         Y timeID
##    <int> <num>     <num>  <int>
## 1:     1     0 10.400771      1
## 2:     1   180 17.187685      2
## 3:     1   365 21.102127      3
## 4:     2     0 10.110683      4
## 5:     2   180 15.608533      5
## 6:     2   365 19.637891      6
## 7:     3     0  9.444159      7
## 8:     3   180 12.477542      8
## 9:     3   365 18.376335      9

As a little bonus, here is additional code to introduce a little more reality into the data generation process. In this case, not all the follow-up measurements would be collected precisely on the exact follow-up date (although all the baseline measurements would be made on day zero). In particular, we are assuming that about 60% of the cases would be collected after the scheduled time (never before). I’ve implemented this logic by creating a lag variable in the data definition:

deflag <- 
  defDataAdd(varname = "lagdays", formula = 10, dist = "noZeroPoisson") |>
  defDataAdd(varname = "lag", formula = "0 | .4 + lagdays | .6", dist = "mixture") |>
  defDataAdd(varname = "obs_day", formula = "(day > 0) * (day + lag)")

dd <- genData(3)
dp <- addPeriods(dd, 3, perName = "day", periodVec = c(0, 180, 365))
dp <- addColumns(deflag, dp)

## Key: <timeID>
##       id   day timeID lagdays   lag obs_day
##    <int> <num>  <int>   <num> <num>   <num>
## 1:     1     0      1       9     9       0
## 2:     1   180      2       9     9     189
## 3:     1   365      3       7     0     365
## 4:     2     0      4       7     7       0
## 5:     2   180      5       8     0     180
## 6:     2   365      6      10     0     365
## 7:     3     0      7       8     8       0
## 8:     3   180      8      13    13     193
## 9:     3   365      9       5     0     365
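The lag logic can also be sketched in base R, which makes the two pieces explicit: a zero-truncated Poisson draw for the number of lag days, and a 40/60 mixture that decides whether the lag applies. This is an illustration under assumptions, not simstudy's internal code; in particular, the zero-truncated Poisson is generated here by inverse-CDF sampling, and rztpois is a hypothetical helper name:

```r
# zero-truncated Poisson via inverse CDF: restrict the uniform draw to (P(X = 0), 1)
rztpois <- function(n, lambda) {
  u <- runif(n, min = dpois(0, lambda), max = 1)
  qpois(u, lambda)
}

set.seed(321)  # arbitrary seed for illustration
day <- rep(c(0, 180, 365), times = 3)     # three individuals, three periods each
lagdays <- rztpois(length(day), lambda = 10)
late <- rbinom(length(day), 1, 0.6)       # 1 = collected after the scheduled day
lag <- late * lagdays                     # lag is 0 with probability 0.4
obs_day <- (day > 0) * (day + lag)        # baseline is always observed on day zero

data.frame(day, lagdays, lag, obs_day)
```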

Since addPeriods is a work in progress, feel free to reach out to me with suggestions, either directly or by creating an issue in GitHub.

Perfectly balanced treatment arm distribution in a multifactorial CRT using stratified randomization

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 20 Feb 2024 00:00:00 +0000

Over two years ago, I wrote a series of posts (starting here) that described possible analytic approaches for a proposed cluster-randomized trial with a factorial design. That proposal was recently funded by NIA/NIH, and now the Emergency departments leading the transformation of Alzheimer’s and dementia care (ED-LEAD) trial is just getting underway. Since the trial is in its early planning phase, I am starting to think about how we will do the randomization, and I’m sharing some of those thoughts (and code) here.

A brief overview of ED-LEAD

The ED-LEAD study is evaluating a set of three independent yet potentially synergistic interventions all targeted at improving the healthcare of persons living with dementia (PLWD) and their caregivers. The target population is patients 66 and older with a diagnosis of dementia who come to the Emergency Department (ED) but are discharged home. The common thread linking the three interventions – an emergency care redesign, community paramedicine, and nurse-telephonic care – is the particular focus of each on PLWD who have gone to the ED for care. The primary outcome is whether a patient has a subsequent ED visit within 30 days.

Currently, a total of 14 health systems are participating in the study, with each system contributing between 1 and 12 EDs, for a total of 80 EDs. The proposed randomization structure is based on the factorial design, where an ED will be randomized to any combination of the three interventions: (1) none, (2) a only, (3) b only, (4) c only, (5) a + b, (6) a + c, (7) b + c, and (8) a + b + c. A key challenge, which I am focusing on here, is how to randomize 10 EDs to each of the 8 arms while at the same time stratifying by health system. (Another challenge is how to ensure decent balance of ED-level characteristics, such as patient volume or location, across the arms; for that, we’ll be using constrained randomization, which I wrote about quite a few years ago.)

The randomization algorithm

In the presentation of the algorithm, I am simplifying a bit, assuming there are eight arms, but only six health systems, and 40 total EDs. Since we have 40 EDs and eight arms, our target is a random allocation that has exactly five EDs per arm.

In our simplified example, this is the distribution of the number of EDs across the health systems:

Split health systems with more than eight EDs into smaller subgroups. For each health system that has more than eight EDs, randomly select eight EDs from the health system to create subgroup A; the remainder go into subgroup B. If subgroup B still has more than eight EDs, split those into B and C, and continue the process until the last subgroup has eight or fewer EDs. In the example, Health System (HS) 1 has 12 EDs, so we create subgroups 1A and 1B. Likewise, HS 3 is split into two subgroups:
Sort the health systems (including the subgroups) in descending order by number of EDs:
Sequentially make random allocations for each health system. Now we are at the heart of the algorithm. Each system or subgroup with eight EDs will be assigned all eight arms. So, HS 1A and HS 3A will each be allocated all eight arms. For each health system with fewer than eight EDs, K arms are sampled without replacement, where K is the number of EDs in the health system or subgroup. In the example below, HS 2 has seven EDs, so seven arms are selected at random. The process is repeated for each health system in descending order based on the number of EDs, with one key modification as we progress down the list: at some point, we will reach the maximum number of EDs for an arm, so that arm must be removed from the possible arms for the subsequent health systems. In the example, after the arms for HS 1B have been determined (the fifth iteration of the sequence), arms 7 and 8 have been allocated to five health systems; since the limit has been reached, these two arms are no longer available for the three remaining health systems. When we get to the final health system (HS 5) that has only one ED, there is only a single arm available, since all the other arms have been allocated to five health systems.
Re-sort the assignment matrix. This sorting needs to happen so that we can merge the allocations with the actual EDs in each health system:
Assign specific EDs in each health system to the treatment arms that have been allocated. For each health system (or subgroup), sample all of the EDs without replacement to create a random ordering. The treatment arms allocated to the health system are then assigned to the EDs in that order. In the case of HS 1A in our example, the first sampled ED is assigned to treatment arm 1, the second to arm 2, etc., all the way until the ED that was sampled eighth, which is assigned to arm 8. In the case of HS 3B, there are only three EDs; the first sampled is assigned to arm 1, the second ED is assigned to arm 5, and the third to arm 6.

In practice, we could call it a day once the assignments have been made, but in the case of ED-LEAD, we are still concerned about the balance of ED characteristics across the arms. This will be accomplished using constrained randomization, which involves generating a large number (say 500K or 1M) of possible random allocations. We will evaluate the balance for each one with respect to the desired characteristics in order to create a set of possible balanced random allocations. (This possible set will likely be a very small subset of all the generated allocations, which is why we need such a large pool.) The final random allocation will be a random selection from the subset of balanced allocations. (In the code that follows, I have not included this step.)
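To give a sense of what the constrained randomization step involves, here is a rough base R sketch. It rests on several assumptions that are not part of the algorithm above: volume is a made-up ED-level characteristic, gen_alloc is a placeholder that ignores the health-system strata (in practice it would be replaced by the stratified algorithm), imbalance is just one possible balance metric, and the 5% cutoff and the pool of 5,000 allocations are arbitrary:

```r
set.seed(2024)  # arbitrary seed for illustration
n_eds <- 40
n_arms <- 8

# hypothetical ED-level characteristic to be balanced across arms
volume <- rnorm(n_eds, mean = 50, sd = 10)

# placeholder generator: a balanced permutation that ignores health-system strata
gen_alloc <- function() sample(rep(1:n_arms, each = n_eds / n_arms))

# one possible imbalance metric: spread of the arm-level means
imbalance <- function(arm) {
  m <- tapply(volume, arm, mean)
  max(m) - min(m)
}

allocs <- replicate(5000, gen_alloc())   # n_eds x 5000 matrix, one allocation per column
scores <- apply(allocs, 2, imbalance)

cutoff <- quantile(scores, 0.05)         # keep the best-balanced 5% (arbitrary threshold)
balanced <- allocs[, scores <= cutoff, drop = FALSE]

# the final allocation is a random draw from the balanced subset
final <- balanced[, sample(ncol(balanced), 1)]
table(final)
```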

The R implementation

The implementation of the algorithm in R follows. First, we load the necessary libraries.

set.seed(12345)

library(data.table)
library(simstudy)

The number of arms is set to eight, and the number of EDs for each of the six health systems is also specified. As in the example above, there are 40 total EDs:

n_arms <- 8
n_eds <- c(12, 1, 11, 5, 4, 7)

In this chunk of code, we are randomly sorting the EDs within each health system and splitting the large health systems into subgroups with a maximum of eight EDs.

dd.hs <- data.table(hsid = 1:length(n_eds), N = n_eds )
dd.ed <- genCluster(dd.hs, "hsid", "N", "id")

n_groups <- ceiling(n_eds / n_arms)

dd.split <- data.table()

for (i in seq_along(n_groups)) {
  eds <- dd.ed[hsid == i, id]
  n_to_samp <- min(8, dd.ed[hsid == i, .N])
  for (j in 1:n_groups[i]) {
      
    if (length(eds) > 1) sample_eds <- sample(eds, n_to_samp)
    else sample_eds <- eds
      
    dd.split <- rbind(dd.split, data.table(hs = i, group = j, ed = sample_eds) ) 
    eds <- eds[!(eds %in% sample_eds)]
    n_to_samp <- min(8, length(eds))
  }
}
  
dd.split[, ed_index := 1:.N, keyby = .(hs, group)]
dd.hs <- dd.split[, .N, keyby = .(hs, group)][, hs_index := .I][]
dd.split <- merge(dd.split, dd.hs, by = c("hs", "group"))

Here is HS 1, which has been split into two groups, one with eight EDs and the second with four. Note that the EDs have been randomly sorted and randomly assigned to the two subgroups.

dd.split[hs == 1]

##     hs group ed ed_index N hs_index
##  1:  1     1  3        1 8        1
##  2:  1     1 10        2 8        1
##  3:  1     1  8        3 8        1
##  4:  1     1 11        4 8        1
##  5:  1     1  2        5 8        1
##  6:  1     1  6        6 8        1
##  7:  1     1 12        7 8        1
##  8:  1     1  9        8 8        1
##  9:  1     2  1        1 4        2
## 10:  1     2  5        2 4        2
## 11:  1     2  4        3 4        2
## 12:  1     2  7        4 4        2

The arms for each health system (or subgroup if a health system has been divided into smaller groups) are generated here. However, before arms are sampled for a particular group, the arms that have already reached the threshold of five EDs are removed from the list of possible arms:

n_per_arm <- dd.hs[, sum(N)] / n_arms
  
dd.hs <- dd.hs[order(-N)]
n_eds <- copy(dd.hs$N)
  
dd.hs[, ed_index := .I ]
setkey(dd.hs, hs_index)

A <- 1:n_arms
mat.assign <- matrix(0, nrow = length(n_eds), ncol = n_arms)
  
for (i in seq_along(n_eds)) {
  
  # keeping only arms that haven't reached threshold
  
  B <- A[apply(mat.assign, 2, sum) < n_per_arm] 
  
  if (length(B) == 1) {       # only one possible arm remains
    
    mat.assign[i, B] <- 1
    
  } else {                    # multiple arms remain
    
    a <- sample(B, n_eds[i], replace = FALSE)
    mat.assign[i, a] <- 1
    
  }
}

mat.assign <- mat.assign[dd.hs$ed_index,] # sorted in the original order

In the assignment matrix, each row represents a health system or subgroup, and each column represents a treatment arm. A value of 1 indicates that the arm in that column has been allocated to the health system in that row. For systems or subgroups with eight EDs, all of the arms are allocated:

mat.assign

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    1    1    1    1    1    1    1
## [2,]    0    0    0    1    1    0    1    1
## [3,]    0    0    0    0    1    0    0    0
## [4,]    1    1    1    1    1    1    1    1
## [5,]    1    0    0    0    1    1    0    0
## [6,]    0    1    1    0    0    1    1    1
## [7,]    1    1    1    1    0    0    0    0
## [8,]    1    1    1    1    0    1    1    1

Confirming that the number of arms per system and the number of EDs per arm are both correct:

apply(mat.assign, 1, sum) # number of arms per system or subgroup

## [1] 8 4 1 8 3 5 4 7

apply(mat.assign, 2, sum) # number of EDs per arm

## [1] 5 5 5 5 5 5 5 5

The final step is to apply the assignment matrix to the list of EDs, which is done by vectorizing the matrix and then merging with the list of EDs.

rownames(mat.assign) <- 1:nrow(mat.assign)
colnames(mat.assign) <- 1:ncol(mat.assign)

dd.assign <- data.table(as.table(mat.assign)) # vectorize
dd.assign <- dd.assign[N == 1]
  
dd.assign <- dd.assign[, .(hs_index = V1, arm = V2)]
dd.assign[,`:=`(hs_index = as.numeric(hs_index), arm = as.numeric(arm))]
  
setkey(dd.assign, hs_index)
dd.assign[, ed_index := 1:.N, keyby = hs_index]
  
dd.assign <- merge(dd.assign, dd.split, by = c("hs_index", "ed_index"))
  
dd.assign <- dd.assign[, .(hs, group, ed, arm)]
setkey(dd.assign, ed)

The treatment assignments for HS 1:

dd.assign[hs == 1]

##     hs group ed arm
##  1:  1     2  1   4
##  2:  1     1  2   5
##  3:  1     1  3   1
##  4:  1     2  4   7
##  5:  1     2  5   5
##  6:  1     1  6   6
##  7:  1     2  7   8
##  8:  1     1  8   3
##  9:  1     1  9   8
## 10:  1     1 10   2
## 11:  1     1 11   4
## 12:  1     1 12   7

Verifying the algorithm works

A randomization protocol is only valid to the extent that each unit of randomization (in this case, the ED) has equal probability of being assigned to any of the arms. Since we have eight arms, we should expect that each ED has a probability of 0.125 (1/8) of getting randomized to each arm.

In addition, within each health system, we expect that each arm has equal probability of being assigned; the probability is a function of the number of EDs in the health system. The expected rate (over repeated random allocations) for an arm in a particular health system is the number of EDs divided by eight. For example, we would expect the rates for HS 1 to be 1.5 (12/8).
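The expected rates follow directly from the counts specified earlier; a short calculation (expected_rate is a name introduced only for this illustration) gives the target for each of the six health systems:

```r
n_arms <- 8
n_eds <- c(12, 1, 11, 5, 4, 7)    # EDs per health system, as specified earlier

# expected number of EDs assigned to any given arm within each health system
expected_rate <- n_eds / n_arms
expected_rate
```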

These hypothetical probabilities can be checked by generating a large number of possible random allocations and evaluating the empirical probabilities. I have generated 50,000 possible random allocations. (I haven’t included the code as it is really just an adaptation of the code above.)

The probabilities for the first six EDs are shown below, one row for each ED. In every case, the probability of a particular ED getting assigned to one of the arms was quite close to the expected value of 0.125. (The results for the other EDs, not included here for brevity, are consistent with these.)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] 0.125 0.125 0.127 0.125 0.125 0.123 0.124 0.126
## [2,] 0.125 0.124 0.123 0.126 0.125 0.127 0.125 0.125
## [3,] 0.126 0.125 0.127 0.123 0.125 0.126 0.123 0.126
## [4,] 0.125 0.126 0.128 0.125 0.125 0.125 0.123 0.123
## [5,] 0.127 0.124 0.123 0.125 0.126 0.124 0.125 0.126
## [6,] 0.126 0.127 0.125 0.123 0.125 0.123 0.126 0.125

The averages at the health system level are shown next. Each row represents a health system (1 through 6), showing the rate of an arm getting allocated to the health system. The theoretical rates for each of the six health systems are $12/8 = 1.500$, $1/8 = 0.125$, $11/8 = 1.375$, $5/8 = 0.625$, $4/8 = 0.500$, and $7/8 = 0.875$, respectively. We are looking to see if the rates are consistent across the arms for a particular health system, and that they are close to the theoretical value. It appears that they are:

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] 1.500 1.501 1.499 1.500 1.499 1.499 1.500 1.502
## [2,] 0.124 0.126 0.127 0.125 0.124 0.124 0.125 0.125
## [3,] 1.373 1.374 1.375 1.377 1.379 1.373 1.376 1.373
## [4,] 0.627 0.621 0.628 0.626 0.624 0.626 0.625 0.623
## [5,] 0.502 0.502 0.497 0.498 0.498 0.500 0.501 0.502
## [6,] 0.874 0.876 0.874 0.874 0.876 0.876 0.873 0.875

Support:

This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U19AG078105, which funds the Emergency departments leading the transformation of Alzheimer’s and dementia care (ED-LEAD) study. The author, the leader of the Statistics Analysis Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

A three-arm trial using two-step randomization

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 19 Dec 2023 00:00:00 +0000

Clinical Decision Support (CDS) tools are systems created to support clinical decision-making. Health care professionals using these tools can get guidance about diagnostic and treatment options when providing care to a patient. I’m currently involved with designing a trial focused on comparing a standard CDS tool with an enhanced version (CDS+). The main goal is to directly compare patient-level outcomes for those who have been exposed to the different versions of the CDS. However, we might also be interested in comparing the basic CDS with a control arm, which would suggest some type of three-arm trial.

A key question is the unit of randomization - should it be at the provider or patient level? If we can assume that CDS and CDS+ can be implemented at the patient level without any contamination, a trial comparing the two CDS versions could take advantage of patient-level randomization. However, once you turn on CDS or CDS+ at the provider level, it may not be possible to have an uncontaminated control arm, suggesting a less efficient cluster randomized trial. One of my colleagues came up with a design that takes advantage of patient-level randomization while acknowledging the need to address potential contamination of the patients in the control arm. This would be accomplished by randomizing providers to either the control or CDS arms, and then randomizing patients within the CDS arm (stratified by provider) to either CDS or CDS+ intervention.

It should be fairly obvious that a two-step approach with at least some patient-level randomization would be better (with respect to statistical power) than a more standard three-arm cluster randomized trial that could provide the same effect estimates (though my initial reaction was that it wasn’t so obvious). My goal here is to simulate data (and provide code) using each design and estimate power under each.

Preliminaries

Before we get started, here are the libraries that will be needed for the simulations, model fitting, and outputting the results:

library(simstudy)
library(ggplot2)
library(lmerTest)
library(data.table)
library(multcomp)
library(parallel)
library(flextable)

RNGkind("L'Ecuyer-CMRG")  # to set seed for parallel process

Three-arm cluster randomized trial

We’ll start with the three-arm cluster randomized trial, where providers would be randomized in a 1:1:1 ratio to either $Control$, $CDS$, or $CDS+$. The data definitions for the cluster-level data include a random effect $b$ and a three-level treatment indicator $A$. The individual-level outcome is a continuous variable centered around 0 for patients in the control arm (offset by the provider-specific random effect). The effect of CDS (compared to Control) is 1.5, and the incremental effect of CDS+ (compared to CDS) is 0.75.

defs <- 
  defData(varname = "b", formula = 0, variance = 1) |>
  defData(varname = "A", formula = "1;1;1", dist = "trtAssign") 

defy <- 
  defDataAdd(varname = "y", formula = "1.50 * (A==2) + 2.25 * (A==3) + b", variance = 8)

To generate a single data set, we generate the provider level data, add the patient-level records and generate the patient-level outcomes.

set.seed(9612)

dc <- genData(30, defs, id = "provider")
dd <- genCluster(dc, "provider", 40, "id")
dd <- addColumns(defy, dd)

A mixed effects model with a provider-level random effect gives the parameter estimates for the effect of CDS (versus control) and CDS+ (also versus control):

lmerfit <- lmer(y ~ factor(A) + (1 | provider), data = dd)
as_flextable(lmerfit) |> delete_part(part = "footer")

group		Estimate	Standard Error	df	statistic	p-value
Fixed effects
	(Intercept)	-0.733	0.339	27	-2.164	0.0395	*
	factor(A)2	1.853	0.479	27	3.865	0.0006	***
	factor(A)3	2.567	0.479	27	5.353	0.0000	***
Random effects
provider	sd__(Intercept)	0.975
Residual	sd__Observation	2.818

However, we are really interested in comparing CDS with Control and CDS+ with CDS (and not CDS+ with control); we can use the glht function in the multcomp package to provide the contrasts. To ensure that our overall Type I error rate is 5%, we use a Bonferroni-corrected p-value threshold of 0.025. In this particular case, we would not infer that there is any benefit to CDS+ over CDS, though it does appear that CDS is better than no CDS.

K1 <- matrix(c(0, 1, 0, 0, -1, 1), 2, byrow = T)
summary(glht(lmerfit, K1))

## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lmer(formula = y ~ factor(A) + (1 | provider), data = dd)
## 
## Linear Hypotheses:
##        Estimate Std. Error z value Pr(>|z|)    
## 1 == 0   1.8530     0.4794   3.865  0.00022 ***
## 2 == 0   0.7136     0.4794   1.488  0.23509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
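To see what the contrast matrix is doing, it helps to apply it by hand to the fixed-effect estimates reported in the table above (intercept, CDS, CDS+). The first row picks out the CDS coefficient, and the second row takes the difference between the CDS+ and CDS coefficients. The vector beta_hat below is simply typed in from the rounded table values, so the results agree with the glht output only up to rounding:

```r
K1 <- matrix(c(0, 1, 0, 0, -1, 1), 2, byrow = TRUE)

# rounded estimates from the table: (Intercept), factor(A)2, factor(A)3
beta_hat <- c(-0.733, 1.853, 2.567)

est <- as.vector(K1 %*% beta_hat)
est   # first element: CDS vs Control; second element: CDS+ vs CDS
```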

In the next step, I’m generating 2500 data sets and fitting a model for each one. The estimated parameters, standard errors, and p-values are saved to an R list so that I can estimate the power to detect an effect for each comparison (based on the assumptions used to generate the data):

genandfit <- function() {
  
  dc <- genData(30, defs, id = "provider")
  dd <- genCluster(dc, "provider", 40, "id")
  dd <- addColumns(defy, dd)
  
  lmerfit <- lmer(y ~ factor(A) + (1 | provider), data = dd)
  K1 <- matrix(c(0, 1, 0, 0, -1, 1), 2, byrow = T)
  data.table(ests = summary(glht(lmerfit, K1))$test$coefficients,
             sd = summary(glht(lmerfit, K1))$test$sigma,
             pvals = summary(glht(lmerfit, K1))$test$pvalues
  )
  
}

reps <- mclapply(1:2500, function(x) genandfit())
reps[1:3]

## [[1]]
##         ests        sd       pvals
## 1: 1.6634678 0.5213168 0.002768887
## 2: 0.8372875 0.5213168 0.189092375
## 
## [[2]]
##         ests        sd        pvals
## 1: 0.8206728 0.4380257 0.1097610533
## 2: 1.6465060 0.4380257 0.0003375337
## 
## [[3]]
##        ests        sd        pvals
## 1: 1.340631 0.4691332 0.0082278928
## 2: 1.777505 0.4691332 0.0002994282

The estimated power for each effect estimate is the proportion of iterations with a p-value less than 0.025. Using this standard cluster-randomized three-arm study design, there is 74% power to detect a difference between CDS and Control when the true difference is 1.50, and only 19% power to detect a difference between CDS+ and CDS when the true difference is 0.75:

pvals <- rbindlist(mclapply(reps, function(x) data.table(t(x[, pvals]))))
pvals[, .(`CDS vs Control` = mean(V1 < 0.025), `CDS+ vs CDS` = mean(V2 < 0.025))]

##    CDS vs Control CDS+ vs CDS
## 1:         0.7356      0.1904
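Because these power estimates come from 2500 simulated data sets, they carry some Monte Carlo error. A rough gauge, using the binomial standard error for a proportion (an addition here, not part of the analysis above), suggests that the estimates are accurate to within a couple of percentage points:

```r
n_reps <- 2500

# power estimates reported above
power_hat <- c(`CDS vs Control` = 0.7356, `CDS+ vs CDS` = 0.1904)

# binomial standard error of each estimated proportion
mc_se <- sqrt(power_hat * (1 - power_hat) / n_reps)
round(mc_se, 4)
```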

Two-step randomization

The same process can be applied to evaluate the two-step randomization. In this formulation, the clusters are randomized in a 1:2 ratio to control $(A = 0)$ or CDS $(A = 1)$. For the clusters randomized to CDS ($2/3$ of the total clusters), the individual patients are randomized to standard CDS $(X=0)$ or CDS+ $(X=1)$. The randomization is stratified by provider with a 1:1 ratio. The randomization scheme is facilitated by the simstudy defCondition and addCondition functions. For the patients in the control arm, $A=0$ and $X=0$.

The individual outcome $y$ is generated slightly differently than in the three-arm trial, but the effect sizes are equivalent: 1.50 point difference between standard CDS and control and 0.75 difference between CDS+ and CDS.

defs <- 
  defData(varname = "b", formula = 0, variance = 1) |>
  defData(varname = "A", formula = "1;2", dist = "trtAssign")

defc <- 
  defCondition(condition = "A == 1", formula = "1;1", 
    variance = "provider", dist = "trtAssign") |>
  defCondition(condition = "A == 0", formula = 0, 
    dist = "nonrandom")

defy <- 
  defDataAdd(varname = "y", formula = "1.50 * A + 0.75 * X + b", variance = 8)

dc <- genData(30, defs, id = "provider")
dd <- genCluster(dc, "provider", 40, "id")
dd <- addCondition(defc, dd, "X")
dd <- addColumns(defy, dd)
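For readers who want to see the two randomization steps spelled out, the following base R sketch mimics the same scheme without simstudy. It is a hand-rolled approximation, not a description of what trtAssign does internally; A_prov and b_prov are names introduced for the illustration, and the seed is arbitrary:

```r
set.seed(1)    # arbitrary seed for illustration
n_prov <- 30   # providers (clusters)
n_pat <- 40    # patients per provider

# step 1: randomize providers 1:2 to control (A = 0) or CDS (A = 1)
A_prov <- sample(rep(c(0, 1), times = c(n_prov / 3, 2 * n_prov / 3)))
b_prov <- rnorm(n_prov, mean = 0, sd = 1)   # provider-level random effect

provider <- rep(seq_len(n_prov), each = n_pat)
A <- A_prov[provider]

# step 2: within each CDS provider, randomize patients 1:1 to CDS (X = 0) or CDS+ (X = 1);
# patients of control providers all have X = 0
X <- unlist(lapply(A_prov, function(a) {
  if (a == 1) sample(rep(c(0, 1), each = n_pat / 2)) else rep(0, n_pat)
}))

# outcome with the same effect sizes and variance as the definition above
y <- rnorm(length(A), mean = 1.50 * A + 0.75 * X + b_prov[provider], sd = sqrt(8))

table(A, X)   # no control patients with X = 1; CDS patients split evenly
```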

The model for this data set also looks slightly different than in the three-arm case, as we can model CDS+ versus CDS directly; as a result, there is no need to consider the contrasts. In this example, we would conclude that CDS+ is an improvement over CDS and CDS is better than no CDS.

lmerfit <- lmer(y ~ A + X + (1|provider), data = dd)
as_flextable(lmerfit) |> delete_part(part = "footer")

group		Estimate	Standard Error	df	statistic	p-value
Fixed effects
	(Intercept)	-0.123	0.362	28	-0.341	0.7358
	A	1.641	0.455	31	3.608	0.0011	**
	X	1.078	0.200	1,169	5.391	0.0000	***
Random effects
provider	sd__(Intercept)	1.055
Residual	sd__Observation	2.828

The estimated power from replicated data sets makes it pretty clear that the two-step randomization design has considerably more power for both effects, each reaching the 80% threshold.

pvals[, .(`CDS vs Control` = mean(A < .025), `CDS+ vs CDS` = mean(X < 0.025))]

##    CDS vs Control CDS+ vs CDS
## 1:         0.8392      0.8544

While it appears that the two-step randomization design is clearly superior to the three-arm cluster randomized design, it is important to again point out the key caveat here: conditional on the provider, the outcomes of patients in the CDS and CDS+ arms need to be independent of each other. For example, if the provider can’t avoid applying CDS+ tools to the CDS-only patients, this assumption of independence is violated and the two-step design is not going to be appropriate. Instead, a three-arm cluster randomized design with more providers (clusters) will be needed.

Creating a nice looking Table 1 with standardized mean differences

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 26 Sep 2023 00:00:00 +0000

I’m in the middle of a perfect storm, winding down three randomized clinical trials (RCTs), with patient recruitment long finished and data collection all wrapped up. This means a lot of data analysis, presentation prep, and paper writing (and not so much blogging). One common (and not so glamorous) thread cutting across all of these RCTs is the need to generate a Table 1, the comparison of baseline characteristics that convinces readers that randomization worked its magic (i.e., that study groups are indeed “comparable”). My primary goal here is to provide some R code to automate the generation of this table, but not before highlighting some issues related to checking for balance and pointing you to a couple of really interesting papers.

Table 1 and balance

In Table 1, we report summary statistics by intervention arm. The mean and standard deviation (or median and interquartile range) of continuous measures and percentages for categorical measures are provided for a selected set of baseline subject characteristics, such as age, sex, and baseline health status. Typically, these tables will include a statistic intended to provide an objective measure of balance (or lack thereof). Any indication of imbalance for a particular characteristic across groups might suggest that the final estimate of the treatment effect should be adjusted, lest there be any residual confounding bias of the effect estimate.

There are two key (and related) questions here. First, what should that “objective” measure of imbalance be, and second, should we even be checking for imbalance in the first place? Traditionally, Table 1 has included a set of p-values resulting from a series of between-group comparisons, one for each baseline measure. This paper, written by Douglas Altman in 1985 (back when I was still an undergraduate!), points out a slew of issues with using the p-value in this context. I’ve always been concerned that studies with large sample sizes may have small p-values for small differences (i.e., differences that we should not be worried about), but Altman is actually most concerned that using p-values in the context of smaller studies can mislead us into ignoring important differences. He states that “unfortunately, … use of significance tests may be unhelpful. It is the strength of the association rather than the significance level (which also depends upon sample size) which is of importance.”

A second paper, written nearly ten years later by Stephen Senn, argued even more strongly that it may actually be counter-productive to attempt a “formal” assessment of balance. Using the p-value, you are implicitly conducting a hypothesis test, with the null hypothesis that the study has used randomization to allocate subjects to the groups. But this is indeed what you have done, so there is no question there. (Of course, it is possible that someone has cheated.) Senn recommends instead that rather than conducting these tests, “the practical statistician will do well to establish beforehand a limited list of covariates deemed useful and fit them regardless. Such a strategy will usually lead to a gain in power, has no adverse effect on unconditional size and controls conditional size with respect to the covariates identified.”

Given all of this, why should we do anything beyond reporting the group means and percentages and let the readers decide about comparability? It is hardly compelling to say this, but I think most journals will demand some formal comparison (though, to be honest, I haven’t attempted to submit a table without). And, if there needs to be a comparison, I would move away from the p-value given its shortcomings alluded to here, and use a standardized mean difference (SMD). In the case of a continuous measure, this is the difference in group means divided by the pooled standard deviation (and is defined differently for categorical measures). The SMD quantifies the difference on a scale that is comparable across measures so that the reader can identify where the largest imbalances are, and make a judgement about comparability.

Which is a long-winded way of getting to the point of this post: how can we easily generate a nice looking Table 1 with SMDs in R?

Simulating data

The first step is to generate some data for an RCT. Here are the R packages I will be using:

library(simstudy)
library(table1)
library(smd)
library(flextable)
library(officer)

The data set will have four “baseline” measures, two numerical and two categorical; missingness will be generated for two of the measures. Three of the variables are actually derived from the same categorical variable in order to compare the SMD for a categorical variable treated numerically as well as with missing data.

Here is the data generation process:

def <-  
  defData(varname = "rx", formula = "1;1", dist = "trtAssign") |>
  defData(varname = "x", formula = 0, variance = 10) |>
  defData(varname = "v1", formula = ".5;.3;.2", dist = "categorical")

dm <- 
  defMiss(varname = "x", formula = .10) |>
  defMiss(varname = "f2_v1", formula = '.05 + .05*(frx == "Control")')

set.seed(8312)

dd <- genData(1000, def)
dd <- genFactor(dd, "rx", labels = c("Control", "Treatment"), replace = TRUE)
dd <- genFactor(dd, "v1", prefix = "f1_")
dd <- genFactor(dd, "v1", prefix = "f2_", labels = c("red", "blue", "green"))

missMat <- genMiss(dd, dm, idvars = "id")
dobs <- genObs(dd, missMat, idvars = "id")

dobs

##         id          x v1       frx f1_v1 f2_v1
##    1:    1  3.3059744  1   Control     1   red
##    2:    2 -2.7291981  3 Treatment     3 green
##    3:    3         NA  1 Treatment     1   red
##    4:    4  3.1638764  1   Control     1   red
##    5:    5  5.2252358  1 Treatment     1   red
##   ---                                         
##  996:  996         NA  1   Control     1   red
##  997:  997 -1.8891992  2 Treatment     2  blue
##  998:  998  0.2994518  2 Treatment     2  blue
##  999:  999 -0.8043489  2 Treatment     2  blue
## 1000: 1000  1.5822111  1 Treatment     1  <NA>

Calculating the SMD

Standardized mean differences can be calculated using the smd package, which uses the methods described in this paper by Yang and Dalton. The standardized mean difference for a numeric measure is

\[ d = \frac{ \left( \bar{x}_1 - \bar{x}_2 \right) } {\text{se}_p }, \]

where $\bar{x}_1$ and $\bar{x}_2$ are the means for each group. $\text{se}_p$ is the pooled standard deviation:

\[ \text{se}_p = \sqrt{\frac{s^2_1 + s^2_2}{2}}, \]

where $s^2_1$ and $s^2_2$ are the group-specific variances. Here is the SMD for the continuous measure x:

with(dobs, smd(x = x, g = frx, na.rm = TRUE))

##        term   estimate
## 1 Treatment 0.06275175
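
As a rough cross-check on the formula, the same quantity can be computed by hand in base R. This is only a sketch on toy data: smd_by_hand is a helper name made up for illustration (it is not part of the smd package), and the sign of the result depends on which group is treated as the first level.

```r
# Hypothetical helper (not from the smd package): SMD for a numeric measure,
# using the pooled standard deviation sqrt((s1^2 + s2^2) / 2) defined above
smd_by_hand <- function(x, g) {
  keep <- !is.na(x) & !is.na(g)        # drop missing values, like na.rm = TRUE
  x <- x[keep]
  g <- droplevels(factor(g[keep]))
  stopifnot(nlevels(g) == 2)

  m <- tapply(x, g, mean)              # group means
  v <- tapply(x, g, var)               # group variances

  (m[[1]] - m[[2]]) / sqrt((v[[1]] + v[[2]]) / 2)
}

# Toy data: group means of 2 and 5, both variances equal to 1
x <- c(1, 2, 3, 4, 5, 6, NA)
g <- c("a", "a", "a", "b", "b", "b", "a")
smd_by_hand(x, g)
```

In the toy example the difference in means is -3 and the pooled standard deviation is 1, so the result is -3.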

For categorical measures, the SMD is the multivariate Mahalanobis distance between the group-specific proportion vectors: $\{p_{11}, \dots,p_{1k}\}$ and $\{p_{21}, \dots,p_{2k}\}$. Here is the SMD for the categorical measure f1_v1:

with(dobs, smd(x = f1_v1, g = frx, na.rm = TRUE))

##        term   estimate
## 1 Treatment 0.02629001
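
To make the Mahalanobis distance a little more concrete, here is a sketch of the calculation as I understand it from Yang and Dalton: the first category is dropped, and the covariance matrix is built from the averaged multinomial variances and covariances of the two groups. The function name smd_categorical is hypothetical, and the counts are the complete-case counts for f1_v1 by arm.

```r
# Hypothetical helper: SMD for a categorical measure from two vectors of counts
smd_categorical <- function(n1, n2) {
  p1 <- (n1 / sum(n1))[-1]             # proportions, dropping the first level
  p2 <- (n2 / sum(n2))[-1]

  # Off-diagonal terms: -(p1k * p1l + p2k * p2l) / 2
  S <- -(outer(p1, p1) + outer(p2, p2)) / 2
  # Diagonal terms: (p1k * (1 - p1k) + p2k * (1 - p2k)) / 2
  diag(S) <- (p1 * (1 - p1) + p2 * (1 - p2)) / 2

  d <- p1 - p2
  sqrt(drop(t(d) %*% solve(S) %*% d))
}

# Counts for levels 1, 2, 3 of f1_v1 in the Control and Treatment arms
smd_categorical(c(234, 151, 115), c(238, 145, 117))
```

With these counts, the result should land very close to the 0.026 reported by smd.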

Creating Table 1

We are creating Table 1 with package table1. (See here for a nice vignette.) The package does not explicitly calculate the SMD, but allows us to customize the table creation with a user-defined function, which is shown below. An alternative package, tableone, does have an SMD option built in. However, missing data reporting and integration with the flextable package, two very important features, are not built into tableone; in contrast, table1 provides both capabilities.

Here is the relatively simple code used to generate the table:

mysmd <- function(x, ...) {
  
  # Construct vectors of data y, and groups (strata) g
  
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  
  abs(round(smd::smd(y, g, na.rm = TRUE)[2], 3))
  
}

tab_1 <- table1(
  ~ x + v1 + f1_v1 + f2_v1 | frx, 
  data = dobs, overall = FALSE, 
  render.continuous=c(.="Mean (SD)"),
  extra.col = list(`SMD`= mysmd)
)

And here is the table that is generated by table1:

	Control (N=500)	Treatment (N=500)	SMD
x			0.063
Mean (SD)	0.204 (3.19)	0.00807 (3.06)
Missing	59 (11.8%)	52 (10.4%)
v1			0.005
Mean (SD)	1.76 (0.802)	1.76 (0.808)
f1_v1			0.026
1	234 (46.8%)	238 (47.6%)
2	151 (30.2%)	145 (29.0%)
3	115 (23.0%)	117 (23.4%)
f2_v1			0.042
red	216 (43.2%)	223 (44.6%)
blue	139 (27.8%)	137 (27.4%)
green	103 (20.6%)	114 (22.8%)
Missing	42 (8.4%)	26 (5.2%)

This is pretty nice as it is, but we might want to embellish a bit by using the capabilities of flextable, another package I’ve become enamored with lately. The table1 object can be turned directly into a flextable using function t1flex. And once we have transformed the table type, the possibilities are almost endless. One really nice thing about a flextable is that it can be output to a Word file, to a PowerPoint file, html, or other useful formats. Taking this approach, Table 1 generation (or really any table generation) can be fully automated, obviating any need for manual table creation and eliminating at least one possible source of human error.

For example, here is code that approximates a JAMA-style table:

set_flextable_defaults(
  font.family = "Calibri", 
  font.size = 11
)

header <- "Table 1"
footer <- "Values are No. (%) unless otherwise noted. SD = standard deviation"

tab_1f <- t1flex(tab_1) |> 
  add_header_lines(header) |>
  add_footer_lines(footer) |>
  bold(i = 1, part = "header") |> 
  hline_top(part = "header", border = fp_border(color = "red", width = 3)) |> 
  hline(i = 1, part = "header", border = fp_border(width = 0.25)) |>
  hline_top(part = "body", border = fp_border(width = 0.25)) |> 
  hline_bottom(part = "body", border = fp_border(width = 0.25)) |> 
  hline_bottom(part = "footer", border = fp_border(width = 0.25)) |> 
  border_inner_h(part = "body", border = fp_border(width = 0.25, style = "dotted")) |> 
  autofit(part = "body") |>
  bg(part = "body", bg = "#f5f5f5") |>
  align(part = "all", align = "center") |> 
  align(part = "header", j=1, i=2, align = "left")  |>
  align(part = "footer", align = "left") |>
  merge_v(j = 1) |>
  valign(j = 1, valign = "top") |>
  align(j = 1, align = "left")

And, the final product:

Table 1
	Control (N=500)	Treatment (N=500)	SMD
x			0.063
Mean (SD)	0.204 (3.19)	0.00807 (3.06)
Missing	59 (11.8%)	52 (10.4%)
v1			0.005
Mean (SD)	1.76 (0.802)	1.76 (0.808)
f1_v1			0.026
1	234 (46.8%)	238 (47.6%)
2	151 (30.2%)	145 (29.0%)
3	115 (23.0%)	117 (23.4%)
f2_v1			0.042
red	216 (43.2%)	223 (44.6%)
blue	139 (27.8%)	137 (27.4%)
green	103 (20.6%)	114 (22.8%)
Missing	42 (8.4%)	26 (5.2%)
Values are No. (%) unless otherwise noted. SD = standard deviation

References:

Altman, Douglas G. “Comparability of randomised groups.” Journal of the Royal Statistical Society Series D: The Statistician 34, no. 1 (1985): 125-136.

Senn, Stephen. “Testing for baseline balance in clinical trials.” Statistics in Medicine 13, no. 17 (1994): 1715-1726.

Yang, D. and Dalton, J.E., 2012, April. A unified approach to measuring the effect size between two groups using SAS. In SAS global forum (Vol. 335, pp. 1-6).

Rich B (2023). table1: Tables of Descriptive Statistics in HTML. R package version 1.4.3.

Gohel D, Skintzos P (2023). flextable: Functions for Tabular Reporting.

Finding logistic models to generate data with desired risk ratio, risk difference and AUC profiles

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 20 Jun 2023 00:00:00 +0000

About two years ago, someone inquired whether simstudy had the functionality to generate data from a logistic model with a specific AUC. It did not, but now it does, thanks to a paper by Peter Austin that describes a nice algorithm to accomplish this. The paper actually describes a series of related algorithms for generating coefficients that target specific prevalence rates, risk ratios, and risk differences, in addition to the AUC. simstudy has a new function logisticCoefs that implements all of these. (The Austin paper also describes an additional algorithm focused on survival outcome data and hazard ratios, but that has not been implemented in simstudy.) This post describes the new function and provides some simple examples.

A little background …

In simstudy, there are at least two ways to define a binary data generating process. The first is to operate on the scale of the proportion or probability using the identity link. This allows users to define a data generating process that reflects assumptions about risk ratios and risk differences when comparing two groups defined by an exposure or treatment. However, this process can become challenging when introducing other covariates, because it can be difficult to constrain the probabilities so that they fall between 0 and 1.

The second approach works on the log-odds scale using a logit link, and is much more amenable to accommodating covariates. Unfortunately, this comes at the price of no longer being able to easily generate specific risk ratios and risk differences, because all parameters are log-odds ratios. The overall (marginal) prevalence of an outcome in a population will vary depending on the distribution of covariates in that population, and the strengths (both absolute and relative) of the association of those covariates with the outcome. That is, the coefficients of a logistic model (including the intercept) determine the prevalence. The same is true regarding the risk ratio and risk difference (if there is one particular exposure or treatment of interest) and the AUC.

Since neither approach will help us out here, I created the function logisticCoefs to fill in the gap. Here we start with the simplest case where we have a target marginal proportion or prevalence, and then illustrate data generation with three other target statistics: risk ratios, risk differences, and AUCs.

Prevalence

In this first example, we start with one set of assumptions for four covariates $x_1, x_2 \sim N(0, 1)$, $b_1 \sim Bin(0.3)$, and $b_2 \sim Bin(0.7)$, and generate the outcome y with the following data generating process:

\[ \text{logit}(y) = 0.15x_1 + 0.25x_2 + 0.10b_1 + 0.30b_2\]

library(simstudy)
library(ggplot2)
library(data.table)

coefs1 <- c(0.15, 0.25, 0.10, 0.30)

d1 <- defData(varname = "x1", formula = 0, variance = 1)
d1 <- defData(d1, varname = "x2", formula = 0, variance = 1)
d1 <- defData(d1, varname = "b1", formula = 0.3, dist = "binary")
d1 <- defData(d1, varname = "b2", formula = 0.7, dist = "binary")

d1a <- defData(d1, varname = "y", 
  formula = "t(..coefs1) %*% c(x1, x2, b1, b2)",
  dist = "binary", link = "logit")

set.seed(48392)

dd <- genData(500000, d1a)
dd

## Key: <id>
##             id    x1     x2    b1    b2     y
##          <int> <num>  <num> <int> <int> <int>
##      1:      1  0.29  0.390     0     1     1
##      2:      2  0.76 -0.925     0     0     0
##      3:      3 -1.47  0.939     0     0     1
##      4:      4  1.92  0.560     0     1     1
##      5:      5  1.40 -0.238     0     1     0
##     ---                                      
## 499996: 499996 -0.32  0.367     0     0     0
## 499997: 499997 -1.08  2.152     0     0     0
## 499998: 499998 -1.10  0.380     1     0     0
## 499999: 499999  0.56 -1.042     0     1     0
## 500000: 500000  0.52  0.076     0     1     1

The overall proportion of $y=1$ in this case is

dd[, mean(y)]

## [1] 0.56

If we have a desired marginal proportion of 0.40, then we can add an intercept of -0.66 to the data generating process:

\[ \text{logit}(y) = -0.66 + 0.15x_1 + 0.25x_2 + 0.10b_1 + 0.30b_2\]

The simulation now gives us the desired target:

d1a <- defData(d1, varname = "y", 
  formula = "t(c(-0.66, ..coefs1)) %*% c(1, x1, x2, b1, b2)",
  dist = "binary", link = "logit")

genData(500000, d1a)[, mean(y)]

## [1] 0.4

If we change the distribution of the covariates, so that $x_1 \sim N(1, 1)$, $x_2 \sim N(3, 1)$, $b_1 \sim Bin(0.5)$, and $b_2 \sim Bin(0.8)$, and the strength of the association of these covariates with the outcome so that

\[ \text{logit}(y) = 0.20x_1 + 0.35x_2 + 0.20b_1 + 0.45b_2,\]

the marginal proportion/prevalence (assuming no intercept term) also changes, going from 0.56 to 0.84:

coefs2 <- c(0.20, 0.35, 0.20, 0.45)

d2 <- defData(varname = "x1", formula = 1, variance = 1)
d2 <- defData(d2, varname = "x2", formula = 3, variance = 1)
d2 <- defData(d2, varname = "b1", formula = 0.5, dist = "binary")
d2 <- defData(d2, varname = "b2", formula = 0.8, dist = "binary")

d2a <- defData(d2, varname = "y", 
  formula = "t(..coefs2) %*% c(x1, x2, b1, b2)",
  dist = "binary", link = "logit")

genData(500000, d2a)[, mean(y)]

## [1] 0.84

But under this new distribution, adding an intercept of -2.13 yields the desired target.

\[ \text{logit}(y) = -2.13 + 0.20x_1 + 0.35x_2 + 0.20b_1 + 0.45b_2 \]

d2a <- defData(d2, varname = "y", 
  formula = "t(c(-2.13, ..coefs2)) %*% c(1, x1, x2, b1, b2)",
  dist = "binary", link = "logit")

genData(500000, d2a)[, mean(y)]

## [1] 0.4

Finding the intercept

Where did those two intercepts come from? The paper by Peter Austin describes an iterative bisection procedure that takes a distribution of covariates and a set of coefficients to identify the intercept coefficient that yields the target marginal proportion or prevalence.

The general idea of the algorithm is to try out a series of different intercepts in an intelligent way that ends up at the right spot. (If you want the details for the algorithm, take a look at the paper.) The starting search range is pre-defined (we’ve used -10 to 10 for the intercept), and we start with a value of 0 for the initial intercept and simulate a large data set (the paper uses 1 million observations, but 100,000 seems to work just fine) and record the population prevalence. If we’ve overshot the target prevalence, we turn our attention to the range between -10 and 0, taking the average, which is -5. Otherwise, we focus on the range between 0 and 10. We iterate this way, choosing the range we need to focus on and setting the intercept at the mid-point (hence the name bisection). The algorithm will converge pretty quickly on the value of the intercept that gives the target population prevalence for the underlying covariate distribution and coefficient assumptions.
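
To give a feel for the search, here is a minimal base R sketch of the bisection idea for the first set of assumptions. It is not the code inside logisticCoefs: find_intercept is a made-up name, the covariates are drawn directly with rnorm and rbinom rather than from a simstudy definition table, and (as a simplification) the prevalence at each step is the average of the model probabilities rather than the mean of simulated outcomes.

```r
set.seed(2023)

# Simulate the covariates once (assumed distributions from the first example)
n <- 100000
X <- cbind(
  x1 = rnorm(n), x2 = rnorm(n),
  b1 = rbinom(n, 1, 0.3), b2 = rbinom(n, 1, 0.7)
)
coefs <- c(0.15, 0.25, 0.10, 0.30)
lp <- drop(X %*% coefs)                # linear predictor without an intercept

# Hypothetical helper: bisection search for the intercept
find_intercept <- function(lp, target, lower = -10, upper = 10,
                           tol = 1e-4, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    B0 <- (lower + upper) / 2          # mid-point of the current range
    prev <- mean(plogis(B0 + lp))      # prevalence implied by this intercept
    if (abs(prev - target) < tol) break
    if (prev > target) upper <- B0 else lower <- B0   # keep the half with the target
  }
  B0
}

B0 <- find_intercept(lp, target = 0.40)
B0
```

The first mid-point tried is 0, and the search should settle near the intercept of -0.66 used above; the exact value will vary a little with the simulated covariates.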

In the current implementation in simstudy, the intercept is provided by a simple call to logisticCoefs. Here are the calls for the two sets of definitions in definition tables d1 and d2.

logisticCoefs(defCovar = d1, coefs = coefs1, popPrev = 0.40)

##    B0    x1    x2    b1    b2 
## -0.66  0.15  0.25  0.10  0.30

logisticCoefs(defCovar = d2, coefs = coefs2, popPrev = 0.40)

##    B0    x1    x2    b1    b2 
## -2.13  0.20  0.35  0.20  0.45

Risk ratios

Just as the prevalence depends on the distribution of covariates and their association with the outcome, risk ratios comparing the outcome probabilities for two groups also depend on the additional covariates. The marginal risk ratio comparing treatment ($A = 1$) to control ($A = 0$), given the distribution of covariates, is

\[RR = \frac{P(y=1 | A = 1)}{P(y=1 | A = 0)}\]

In the data generation process we use a log-odds ratio of -0.40 (odds ratio of approximately 0.67) in both cases, but we get different risk ratios (0.82 vs. 0.93), depending on the covariates (defined in d1 and d2).

d1a <- defData(d1, varname = "rx", formula = "1;1", dist = "trtAssign")
d1a <- defData(d1a, varname = "y",
  formula = "t(c(-0.40, ..coefs1)) %*% c(rx, x1, x2, b1, b2)",
  dist = "binary", link = "logit"
)

dd <- genData(500000, d1a)
dd[rx==1, mean(y)]/dd[rx==0, mean(y)]

## [1] 0.82

d2a <- defData(d2, varname = "rx", formula = "1;1", dist = "trtAssign")
d2a <- defData(d2a, varname = "y",
  formula = "t(c(-0.40, ..coefs2)) %*% c(rx, x1, x2, b1, b2)",
  dist = "binary", link = "logit"
)

dd <- genData(500000, d2a)
dd[rx==1, mean(y)]/dd[rx==0, mean(y)]

## [1] 0.93
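
A back-of-the-envelope calculation helps explain why the same log-odds ratio produces different risk ratios. The sketch below treats the control-arm prevalence (0.56 and 0.84) as a single baseline risk and ignores the variation in risk across covariate patterns, so it is only an approximation; rr_from_log_or is a helper name introduced here for illustration.

```r
# Hypothetical helper: risk ratio implied by a baseline risk and a log-odds ratio
rr_from_log_or <- function(p0, log_or) {
  p1 <- plogis(qlogis(p0) + log_or)    # shift the baseline risk on the logit scale
  p1 / p0
}

round(rr_from_log_or(c(0.56, 0.84), log_or = -0.40), 2)
```

The two approximate risk ratios come out at 0.82 and 0.93, in line with the simulated values: the higher the baseline risk, the closer the risk ratio is pulled toward 1 for a given odds ratio.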

By specifying both a population prevalence and a target risk ratio in the call to logisticCoefs, we can get the necessary parameters. When specifying the target risk ratio, it is required to be between 0 and 1/popPrev. A risk ratio cannot be negative, and the probability of the outcome under treatment cannot exceed 1 (which will happen if the risk ratio is greater than 1/popPrev).

C1 <- logisticCoefs(d1, coefs1, popPrev = 0.40, rr = 0.85)
C1

##    B0     A    x1    x2    b1    b2 
## -0.66 -0.26  0.15  0.25  0.10  0.30

If we use $C_1$ in the data generation process, we will get a data set with the desired target prevalence and risk ratio:

d1a <- defData(d1, varname = "rx", formula = "1;1", dist = "trtAssign")
d1a <- defData(d1a, varname = "y",
  formula = "t(..C1) %*% c(1, rx, x1, x2, b1, b2)",
  dist = "binary", link = "logit"
)

dd <- genData(500000, d1a)

Here are the prevalence and risk ratio:

dd[rx==0, mean(y)]

## [1] 0.4

dd[rx==1, mean(y)]/dd[rx==0, mean(y)]

## [1] 0.86

You can do the same for the second set of assumptions.

Risk differences

Risk differences have the same set of issues, and are handled in the same way. The risk difference is defined as

\[ RD = P(y=1 | A = 1) - P(y=1 | A = 0)\]

To get the coefficients related to a population prevalence of 0.40 and risk difference of -0.15 (so that the proportion in the exposure arm is 0.25), we use the rd argument:

C1 <- logisticCoefs(d1, coefs1, popPrev = 0.40, rd = -0.15)
C1

##    B0     A    x1    x2    b1    b2 
## -0.66 -0.71  0.15  0.25  0.10  0.30

Again, using $C_1$ in the data generation process, we will get a data set with the desired target prevalence and risk difference:

d1a <- defData(d1, varname = "rx", formula = "1;1", dist = "trtAssign")
d1a <- defData(d1a, varname = "y",
  formula = "t(..C1) %*% c(1, rx, x1, x2, b1, b2)",
  dist = "binary", link = "logit"
)

dd <- genData(500000, d1a)

dd[rx==0, mean(y)]

## [1] 0.4

dd[rx==1, mean(y)] - dd[rx==0, mean(y)]

## [1] -0.15
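
It can be useful to compare the treatment coefficients returned by logisticCoefs with the marginal log-odds ratios implied by the targets. The sketch below is plain arithmetic on the logit scale (the variable names are my own); the marginal values need not match the conditional coefficient for A, because odds ratios are not collapsible across covariates.

```r
pop_prev <- 0.40

# Treated-arm probabilities implied by the two targets used above
p1_rr <- 0.85 * pop_prev               # risk ratio of 0.85
p1_rd <- pop_prev - 0.15               # risk difference of -0.15

# A valid target keeps the treated-arm probability between 0 and 1
stopifnot(p1_rr > 0, p1_rr <= 1, p1_rd >= 0, p1_rd <= 1)

# Marginal log-odds ratios implied by each target
marg_log_or <- c(
  rr = qlogis(p1_rr) - qlogis(pop_prev),
  rd = qlogis(p1_rd) - qlogis(pop_prev)
)
round(marg_log_or, 2)
```

The implied marginal values are about -0.26 and -0.69, compared with the fitted coefficients for A of -0.26 and -0.71.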

AUC

The AUC is another commonly used statistic to evaluate a logistic model. (I described the AUC in a post a while back.) We can use logisticCoefs to find the parameters that will allow us to generate data from a model with a specific AUC. To get the coefficients related to a population prevalence of 0.40 and an AUC of 0.85, we use the auc argument (which must be between 0.5 and 1):

C1 <- logisticCoefs(d1, coefs1, popPrev = 0.40, auc = 0.85)
C1

##    B0    x1    x2    b1    b2 
## -1.99  0.85  1.41  0.56  1.69

Again, using $C_1$ in the data generation process, we will get a data set with the desired target prevalence and the AUC (calculated here using the lrm function in the rms package):

d1a <- defData(d1, varname = "y",
  formula = "t(..C1) %*% c(1, x1, x2, b1, b2)",
  dist = "binary", link = "logit"
)

dd <- genData(500000, d1a)

dd[, mean(y)]

## [1] 0.4

fit <- rms::lrm(y ~ x1 + x2 + b1 + b2, data = dd)
fit$stats["C"]

##    C 
## 0.85
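
The C statistic can be read as the probability that a randomly selected individual with $y=1$ has a higher predicted score than a randomly selected individual with $y=0$. As an illustration of that interpretation (and not of how lrm computes it), here is a small rank-based sketch; auc_by_rank is a hypothetical helper name.

```r
# Hypothetical helper: rank-based (Mann-Whitney) estimate of the AUC
auc_by_rank <- function(score, y) {
  r <- rank(score)                     # average ranks are used for ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny example: 3 of the 4 (case, non-case) pairs are ordered correctly
score <- c(0.10, 0.40, 0.35, 0.80)
y <- c(0, 0, 1, 1)
auc_by_rank(score, y)
```

The toy example returns 0.75. Applied to the model's predicted probabilities and the observed outcomes in dd, the same function should give a value close to the C statistic above.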

Visualizing the different AUCs

To finish up, here is an application of logisticCoefs that facilitates visualization of data generated by different prevalence and AUC assumptions. In this case, there are three different scenarios, all based on a single covariate score. The score is used to predict whether an individual is “qualified”. (In the figures, those who are qualified are colored red, those who are not are green.)

In the first scenario, we want to generate data for a sample where 40% are considered “qualified”, though the AUC is only 0.75. In the second scenario, we still assume 40%, but the AUC is 0.95.

d1 <- defData(varname = "score", formula = 0, variance = 1)

C1 <- logisticCoefs(d1, coefs = .3, popPrev = 0.40, auc = .75)
C2 <- logisticCoefs(d1, coefs = .3, popPrev = 0.40, auc = .95)

Here are the parameters for each data generating process:

rbind(C1, C2)

##       B0 score
## C1 -0.49   1.1
## C2 -1.11   3.9

Given the higher AUC in the second scenario, we should see more separation between the qualified and non-qualified based on the scores. Indeed, the figure seems to support that:

In the third scenario, the proportion of qualified people drops to 20% and the AUC based on the model with the score is 0.75:

C3 <- logisticCoefs(d1, coefs = .3, popPrev = 0.20, auc = .75)

rbind(C1, C3)

##       B0 score
## C1 -0.49   1.1
## C3 -1.69   1.1

We see fewer red dots overall in the right hand plot, but the separation between qualified and unqualified is not noticeably different:

References:

Austin, Peter C. “The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations.” BMC Medical Research Methodology 23, no. 1 (2023): 1-10.

A demo of power estimation by simulation for a cluster randomized trial with a time-to-event outcome

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 23 May 2023 00:00:00 +0000

A colleague reached out for help designing a cluster randomized trial to evaluate a clinical decision support tool for primary care physicians (PCPs), which aims to improve care for high-risk patients. The outcome will be a time-to-event measure, collected at the patient level. The unit of randomization will be the PCP, and one of the key design issues is settling on the number to randomize. Surprisingly, I’ve never been involved with a study that required a clustered survival analysis. So, this particular sample size calculation is new for me, which led to the development of simulations that I can share with you. (There are some analytic solutions to this problem, but there doesn’t seem to be a consensus about the best approach to use.)

Overview

In tackling this problem, there were four key elements that I needed to work out before actually conducting the power simulations. First, I needed to determine the hypothetical survival curve in the context of a single (control) arm and simulate data to confirm that I could replicate the desired curve. Second, I wanted to generate cluster-level variation so that I could assess the implications of the variance assumptions (still in a single-arm context). Third, I generated two intervention arms (without any clustering) to assess effect size assumptions. And lastly, I generated a full data set that included clustering, randomization, and censoring, and then fit a model that would be the basis for the power analysis to ensure that everything was working as expected. Once this was all completed, I was confident that I could move on to generating estimates of power under a range of sample size and variability assumptions. I apologize in advance for a post that is a bit long, but the agenda is quite packed and includes a lot of code.

Defining shape of survival curve

Defining the shape of the survival curve is made relatively easy using the function survGetParams in the simstudy package. All we need to do is specify some (at least two) coordinates along the curve and the function will return the parameters for the mean and shape of a Weibull function that best fit the points. These parameters are used in the data generation process. In this case, the study’s investigators provided me with a couple of points, indicating that approximately 10% of the sample would have an event by day 30, and half would have an event by day 365. Since the study is following patients at most for 365 days, we will consider anything beyond that to be censored (more on censoring later).

To get things started, here are the libraries needed for all the code that follows:

library(simstudy)
library(data.table)
library(survival)
library(survminer)
library(GGally)
library(coxme)
library(parallel)

Now, we get the parameters that define the survival curve:

points <- list(c(30, 0.90), c(365, .50))
r <- survGetParams(points)
r

## [1] -4.8  1.3
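
To see where numbers like these come from, here is a base R sketch that recovers similar values directly from the two points. It assumes the survival function has the form $S(t) = \exp(-\exp(f) \ t^{1/\nu})$, where $f$ is the formula parameter and $\nu$ is the shape, which is consistent with the output above. Under that assumption, $\log(-\log S(t))$ is linear in $\log t$, so a straight-line fit does the job; survGetParams may use a different fitting criterion when more than two points are supplied.

```r
# Target points on the survival curve: (time, probability of survival)
pts <- rbind(c(30, 0.90), c(365, 0.50))

# Assumed parameterization: S(t) = exp(-exp(f) * t^(1 / shape)), so that
# log(-log(S(t))) = f + (1 / shape) * log(t)
lhs <- log(-log(pts[, 2]))
log_t <- log(pts[, 1])
fit <- lm(lhs ~ log_t)

f <- unname(coef(fit)[1])              # intercept of the line
shape <- 1 / unname(coef(fit)[2])      # reciprocal of the slope

# Survival curve implied by the recovered parameters
surv_prob <- function(t) exp(-exp(f) * t^(1 / shape))

round(c(f = f, shape = shape), 1)
surv_prob(c(30, 365))
```

The recovered parameters round to -4.8 and 1.3, and the implied curve passes through both target points (0.90 at day 30 and 0.50 at day 365).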

To simulate data from this curve, the time-to-event variable tte is defined using the parameters generated by survGetParams. The observed time is the minimum of one year and the time-to-event, and we create an event indicator that is set to 1 if the time-to-event is no more than one year.

defs <- defSurv(varname = "tte", formula = r[1], shape = r[2])

defa <- defDataAdd(varname = "time", formula = "min(365, tte)", dist = "nonrandom")
defa <- defDataAdd(defa, "event", "1*(tte <= 365)", dist = "nonrandom")

Generating the data is quite simple in this case:

set.seed(589823)

dd <- genData(1000)
dd <- genSurv(dd, defs, digits = 0)
dd <- addColumns(defa, dd)

dd

##         id  tte time event
##    1:    1  234  234     1
##    2:    2 1190  365     0
##    3:    3  395  365     0
##    4:    4  178  178     1
##    5:    5   72   72     1
##   ---                     
##  996:  996  818  365     0
##  997:  997  446  365     0
##  998:  998  118  118     1
##  999:  999  232  232     1
## 1000: 1000 5308  365     0

The plots below show the source function determined by the parameters on the left and the actual data generated on the right. It appears that the generated data matches the data generation process:

splot <- survParamPlot(r[1], r[2], points = points, n = 1000, limits = c(0, 365) )

fit <- survfit(Surv(time, event) ~ 1, data = dd)

j <- ggsurv(fit, CI = FALSE, surv.col = "#ed7c67", size.est = 0.8) + 
  theme(panel.grid = element_blank(),
        axis.text = element_text(size = 7.5),
        axis.title = element_text(size = 8, face = "bold"),
        plot.title = element_blank()) +
  ylim(0, 1) +
  xlab("time") + ylab("probability of survival")

ggarrange(splot, j, ncol = 2, nrow = 1)

Evaluating cluster variation

Cluster variation in the context of survival curves implies that there is a cluster-specific survival curve. This variation is induced with a random effect in the data generation process. In this case, I am assuming a normally distributed random effect with mean 0 and some variance (distributions other than a normal distribution can be used). The variance assumption is a key one (which will ultimately impact the estimates of power), and I explore that a bit more in the second part of this section.

Visualizing cluster variation

The data generation process is a tad more involved than above, though not much more. We need to generate clusters and their random effect first, before adding the individuals. tte is now a function of the distribution parameters as well as the cluster random effect b. We are still using a single arm and assuming that everyone is followed for one year. In the first simulation, we set the variance of the random effect $b$ to 0.1.

defc <- defData(varname = "b", formula = 0, variance = 0.1)

defs <- defSurv(varname = "tte", formula = "r[1] + b", shape = r[2])

defa <- defDataAdd(varname = "time", formula = "min(365, tte)", dist = "nonrandom")
defa <- defDataAdd(defa, "event", "1*(tte <= 365)", dist = "nonrandom")

dc <- genData(20, defc, id = "pcp")
dd <- genCluster(dc, "pcp", numIndsVar = 1000, "id")
dd <- genSurv(dd, defs, digits = 0)
dd.10 <- addColumns(defa, dd)
dd.10

##        pcp    b    id  tte time event
##     1:   1 0.25     1  788  365     0
##     2:   1 0.25     2   93   93     1
##     3:   1 0.25     3  723  365     0
##     4:   1 0.25     4  151  151     1
##     5:   1 0.25     5 1367  365     0
##    ---                               
## 19996:  20 0.22 19996  424  365     0
## 19997:  20 0.22 19997  207  207     1
## 19998:  20 0.22 19998   70   70     1
## 19999:  20 0.22 19999  669  365     0
## 20000:  20 0.22 20000  994  365     0

The following plot shows two sets of survival curves, each based on a different level of variation: 0.10 on the left and 0.01 on the right. Each curve represents a different cluster. With this plot, we get a clear visualization of how the variance assumption for the random effect impacts the variation of the survival curves:

Variation of the probability of an event across clusters

The plot of the survival curves is only one way to consider the impact of cluster variation. Another option is to look at the binary event outcome under the assumption of no censoring. I like to evaluate the variation in the probability of an event across the clusters, particularly by looking at the range of probabilities, or considering the coefficient of variation, which is $\sigma / \mu$.

To show how this is done, I am generating a data set with a very large number of clusters (2000) and a large cluster size (500), and then calculating the probability of an event for each cluster:

defc <- defData(varname = "b", formula = 0, variance = 0.100)

dc <- genData(2000, defc, id = "pcp")
dd <- genCluster(dc, "pcp", numIndsVar = 500, "id")
dd <- genSurv(dd, defs, digits = 0)
dd <- addColumns(defa, dd)

ds <- dd[, .(p = mean(event)), keyby = pcp]
ds

##        pcp    p
##    1:    1 0.41
##    2:    2 0.43
##    3:    3 0.36
##    4:    4 0.51
##    5:    5 0.50
##   ---          
## 1996: 1996 0.82
## 1997: 1997 0.53
## 1998: 1998 0.88
## 1999: 1999 0.42
## 2000: 2000 0.63

Here is the distribution of observed cluster-level proportions:

Here are the mean probability, the standard deviation of probabilities, the coefficient of variation for the probabilities, and the 95% interval of the probabilities when the random effect variance in the survival generation process is 0.10:

ds[, .(mu = mean(p), s = sd(p), cv = sd(p)/mean(p))]

##     mu    s   cv
## 1: 0.5 0.11 0.22

ds[, .(quantile(p, probs = c(0.025, .975)))]

##      V1
## 1: 0.31
## 2: 0.72

To compare across a range of variance assumptions, I’ve generated ten data sets and plotted the results below. If you hover over the points, you will get the CV estimate. This could be useful in helping collaborators decide what level of variance is appropriate to focus on in the final power estimation and sample size determination.

In this case, we’ve decided that the coefficient of variation is not likely to exceed 0.17 (with a corresponding 95% interval of proportions ranging from 35% to 66%), so we’ll consider that when evaluating power.

Evaluating the effect size

Next, we generate data that includes treatment assignment (but excludes cluster variability and censoring before one year). The treatment effect is expressed as a log hazard ratio, which in this case is 0.4 (equal to a hazard ratio of just about 1.5).

The data generation starts with treatment assignment, adds the time-to-event survival data, and then adds the one-year censoring data, as before:

defa <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")

defs <- defSurv(varname = "tte", formula = "r[1] + 0.4 * rx", shape = r[2])

defe <- defDataAdd(varname = "time", formula = "min(365, tte)", dist = "nonrandom")
defe <- defDataAdd(defe, "event", "1*(tte <= 365)", dist = "nonrandom")

dd <- genData(1000, defa)
dd <- genSurv(dd, defs, digits = 0)
dd <- addColumns(defe, dd)

dd

##         id rx  tte time event
##    1:    1  0 1902  365     0
##    2:    2  0  286  286     1
##    3:    3  0  480  365     0
##    4:    4  0   32   32     1
##    5:    5  1   12   12     1
##   ---                        
##  996:  996  0  663  365     0
##  997:  997  1  962  365     0
##  998:  998  1   19   19     1
##  999:  999  1  502  365     0
## 1000: 1000  0   85   85     1

The plot of the survival curves by treatment arms provides a visualization of the treatment effect:

Again, since we have no censoring, we can estimate the probability of an event within 365 days for each arm:

dd[, .(p = mean(event)), keyby = rx]

##         rx    p
## 1: control 0.51
## 2:      rx 0.63

An effect size of 0.4 on the log hazard scale translates to an odds ratio of about 1.64, a risk ratio of 1.24, and a risk difference of 12 percentage points. In the absence of any external data on the potential effect size, we can use an effect size that is minimally clinically meaningful based on any or all of these effect size measurements.
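As a quick check, these conversions follow directly from the two observed event proportions in the table above (0.51 in the control arm, 0.63 in the treatment arm). The snippet below is only a sketch of that arithmetic in base R; the object names are illustrative.

```r
# Observed one-year event proportions from the table above
p0 <- 0.51  # control arm
p1 <- 0.63  # treatment arm

odds <- function(p) p / (1 - p)

effects <- c(
  HR = exp(0.4),              # hazard ratio implied by the log HR used to generate the data
  OR = odds(p1) / odds(p0),   # odds ratio
  RR = p1 / p0,               # risk ratio
  RD = p1 - p0                # risk difference
)

round(effects, 2)
```

The rounded odds ratio, risk ratio, and risk difference come out to 1.64, 1.24, and 0.12.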

Complete data generation and model estimation

With the pieces in place, we are ready to put it all together and add censoring to the mix to finalize the full data generating process. We will fit a mixed-effects Cox proportional hazards model to see if we can recover the parameters that we have used to generate the data, and, if that goes well, we will be ready to estimate power.

We start in defc by defining the cluster-level random effect variation and treatment assignment design (in this case 1 to 1, treatment to control). We add a censoring process in defa. This assumes that we will be enrolling patients for six months, spread out across this time period. The study will last exactly one year, so every patient will be followed for at least six months, and only some will be followed for a full year (i.e., those who join the study on the first day).

Finally, defs defines the data generation process for the survival outcome, which we’ve seen above, though now we have both a treatment effect and a random effect, in addition to the baseline parameters in the vector r.

defc <- defData(varname = "b", formula = 0, variance = 0.05)
defc <- defData(defc, varname = "rx", formula = "1;1", dist = "trtAssign")

defa <- defDataAdd(varname = "start_day", formula = "1;182", dist = "uniformInt")
defa <- defDataAdd(defa, varname = "censor", 
  formula = "365 - start_day ", dist = "nonrandom")

defs <- defSurv(varname = "tte", formula = "r[1] + 0.4 * rx + b", shape = r[2])

The data generation is the same as before, though there is an additional censoring process, which is done with the function addCompRisk:

dc <- genData(500, defc, id = "pcp")
dd <- genCluster(dc, "pcp", numIndsVar = 200, "id")
dd <- addColumns(defa, dd)
dd <- genSurv(dd, defs, digits = 0)
dd <- addCompRisk(dd, events = c("tte", "censor"), 
  timeName = "time", censorName = "censor", keepEvents = TRUE)

dd

##         pcp       b rx     id start_day censor  tte time event   type
##      1:   1 -0.0158  1      1        65    300  307  300     0 censor
##      2:   1 -0.0158  1      2       113    252  907  252     0 censor
##      3:   1 -0.0158  1      3       121    244  110  110     1    tte
##      4:   1 -0.0158  1      4        60    305  170  170     1    tte
##      5:   1 -0.0158  1      5        95    270 2673  270     0 censor
##     ---                                                              
##  99996: 500 -0.0015  0  99996       160    205  291  205     0 censor
##  99997: 500 -0.0015  0  99997        79    286  193  193     1    tte
##  99998: 500 -0.0015  0  99998        64    301   28   28     1    tte
##  99999: 500 -0.0015  0  99999       153    212  459  212     0 censor
## 100000: 500 -0.0015  0 100000        19    346  298  298     1    tte
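The censoring logic can be verified by hand on the first five rows of this output. The sketch below uses base R only and is not the addCompRisk implementation; in particular, treating a tie between tte and censor as an observed event is an assumption made here for illustration.

```r
# First five rows of the output above
censor <- c(300, 252, 244, 305, 270)    # 365 - start_day
tte    <- c(307, 907, 110, 170, 2673)   # latent time to event

time  <- pmin(tte, censor)              # observed follow-up time
event <- as.integer(tte <= censor)      # 1 if the event occurs before censoring

data.frame(censor, tte, time, event)
```

The resulting time and event columns match the ones shown in the generated data.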

Since we have generated a rather large data set, we should be able to recover the parameters pretty closely if we are using the correct model. We are going to fit a mixed effects survival model (also known as a frailty model) to see how well we did.

fit_coxme <- coxme(Surv(time, event) ~ rx + (1 | pcp), data = dd)
summary(fit_coxme)

## Cox mixed-effects model fit by maximum likelihood
##   Data: dd
##   events, n = 49955, 1e+05
##   Iterations= 11 48 
##                   NULL Integrated  Fitted
## Log-likelihood -555429    -553708 -553058
## 
##                   Chisq  df p  AIC  BIC
## Integrated loglik  3442   2 0 3438 3421
##  Penalized loglik  4743 414 0 3914  259
## 
## Model:  Surv(time, event) ~ rx + (1 | pcp) 
## Fixed coefficients
##    coef exp(coef) se(coef)  z p
## rx 0.39       1.5    0.022 18 0
## 
## Random effects
##  Group Variable  Std Dev Variance
##  pcp   Intercept 0.22    0.05

Pretty good! The estimated log hazard ratio of 0.39 (95% CI: 0.35 - 0.43) is on target (we used 0.40 in the data generation process), and the estimated variance for the PCP random effect was 0.05, also on the mark. I’d say we are ready to proceed to the final step.
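The interval quoted here is a standard Wald interval built from the coefficient and standard error in the model output. A minimal sketch, using the rounded values printed by summary (so the results are approximate):

```r
est <- 0.39    # estimated log hazard ratio for rx (rounded, from the output above)
se  <- 0.022   # its standard error (rounded, from the output above)

ci_log <- est + c(-1, 1) * qnorm(0.975) * se   # Wald interval on the log HR scale
ci_hr  <- exp(ci_log)                          # the same interval on the HR scale

round(ci_log, 2)
round(exp(est), 2)
round(ci_hr, 2)
```

On the hazard ratio scale this works out to roughly 1.48 with an interval of about 1.41 to 1.54.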

Power estimation

To conduct the power estimation, I’ve essentially wrapped the data generation and model estimation code in a collection of functions that can be called repeatedly to generate multiple data sets and model estimates. The goal is to calculate the proportion of data sets with a statistically significant result for a particular set of assumptions (i.e., the estimate of power for the assumed effect size, variation, and sample sizes). I’ve provided the code below in the addendum in case you haven’t grown weary of all this detail. I described a general framework for using simulation to estimate sample size/power, and I’m largely following that process here.

I’ve written a little function scenario_list (which I’m now thinking I should add to simstudy) to create different parameter combinations that will determine the power estimation. In this case, the parameters I am interested in are the number of PCPs that should be randomized and the variance assumption. The number of patients per PCP (cluster size) is also important to vary, but for illustration purposes here I am keeping it constant.

Here is the simplified scenario list with four possible combinations:

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

npcps <- c(20, 30)
npats <- c(15)
s2 <- c(0.03, 0.04)

scenarios <- scenario_list(npcps = npcps, npats = npats, s2 = s2)

scenarios

## [[1]]
## npcps npats    s2 
## 20.00 15.00  0.03 
## 
## [[2]]
## npcps npats    s2 
## 30.00 15.00  0.03 
## 
## [[3]]
## npcps npats    s2 
## 20.00 15.00  0.04 
## 
## [[4]]
## npcps npats    s2 
## 30.00 15.00  0.04

I use the mclapply function in the parallel package to generate three iterations for each scenario:

model.ests <- mclapply(scenarios, function(a) s_scenarios(a, nreps = 3))

model.ests

## [[1]]
##    npcps npats   s2 est_s re.var_s   p_s
## 1:    20    15 0.03  0.41   0.1596 0.100
## 2:    20    15 0.03  0.32   0.0306 0.065
## 3:    20    15 0.03  0.32   0.0004 0.041
## 
## [[2]]
##    npcps npats   s2 est_s re.var_s     p_s
## 1:    30    15 0.03  0.13    0.017 3.5e-01
## 2:    30    15 0.03  0.62    0.029 7.3e-05
## 3:    30    15 0.03  0.45    0.040 3.8e-03
## 
## [[3]]
##    npcps npats   s2 est_s re.var_s    p_s
## 1:    20    15 0.04  0.55  7.9e-02 0.0084
## 2:    20    15 0.04  0.07  4.0e-04 0.6700
## 3:    20    15 0.04  0.24  8.2e-05 0.1500
## 
## [[4]]
##    npcps npats   s2 est_s re.var_s     p_s
## 1:    30    15 0.04  0.42  8.3e-05 2.7e-03
## 2:    30    15 0.04  0.57  4.0e-04 2.6e-05
## 3:    30    15 0.04  0.36  1.0e-01 4.2e-02

In the actual power calculations, which are reported below, I used 60 scenarios defined by these data generation parameters:

npcps <- c(18, 24, 30, 36)
npats <- c(15, 18, 21)
s2 <- c(0.01, 0.02, 0.03, 0.04, 0.05)

For each of these scenarios, I generated 5000 data sets and estimated models for each (i.e., a total of 300,000 data sets and model fits). For each of the 60 scenarios, I estimated the proportion of the 5000 model fits that yielded a p-value < 0.05 for the estimated log hazard ratio. I had the benefit of using a high performance computer, because running this on my laptop would have taken well over 10 hours (only about 10 minutes on the HPC).
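The power estimate for a scenario is just the share of replications with p < 0.05. The sketch below applies that to the three p-values shown earlier for the second scenario of the toy run, and adds the Monte Carlo standard error of a power estimate; the 80% power value plugged in at the end is a hypothetical number chosen only to show the precision that 5000 replications buy.

```r
# p-values for the second scenario in the three-iteration example above
p_s <- c(3.5e-01, 7.3e-05, 3.8e-03)

power_hat <- mean(p_s < 0.05)   # proportion of replications with p < 0.05

# Monte Carlo standard error of a power estimate based on nreps replications
mc_se <- function(power, nreps) sqrt(power * (1 - power) / nreps)

power_hat
mc_se(0.80, 5000)   # hypothetical true power of 0.80
```

With 5000 replications the standard error is under 0.006 when power is near 80%, so differences of a few percentage points between scenarios are not simulation noise.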

At the end, we have a plot of “power curves” that shows estimated power for each of the scenarios. If we assume that we can expect at least 18 patients per PCP and that the between-PCP variance will be around 0.03 or 0.04, we should be OK randomizing 30 PCPs (15 in each arm), though it might be more prudent to go with 36, just to be safe:

Addendum

Here is the code I used to generate the data for the power curve plot. It is based on the framework I mentioned earlier. There is one extra function here, extract_coxme_table, which I pulled from Stack Overflow, because there is currently no obvious way to extract data from the coxme model fit.

extract_coxme_table <- function (mod) {
  beta <- mod$coefficients 
  nvar <- length(beta)
  nfrail <- nrow(mod$var) - nvar
  se <- sqrt(diag(mod$var)[nfrail + 1:nvar])
  z <- round(beta/se, 2)
  p <- signif(1 - pchisq((beta/se)^2, 1), 2)
  table=data.table(beta = beta, se = se, z = z, p = p)
  return(table)
}

s_def <- function() {
  
  defc <- defData(varname = "b", formula = 0, variance = "..s2")
  defc <- defData(defc, varname = "rx", formula = "1;1", dist = "trtAssign")
  
  defa <- defDataAdd(varname = "start_day", formula = "1;182", dist = "uniformInt")
  defa <- defDataAdd(defa, varname = "censor", 
                     formula = "365 - start_day ", dist = "nonrandom")
  
  defs <- defSurv(varname = "tte", formula = "-4.815 + 0.4 * rx + b", shape = 1.326)
  
  defa2 <- defDataAdd(varname = "event6", 
                      formula = "1*(tte <= 182)", dist = "nonrandom")
  
  return(list(defc = defc, defa = defa, defs = defs, defa2 = defa2))
  
}

s_generate <- function(argsvec, list_of_defs) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  dc <- genData(npcps, defc, id = "pcp")
  dd <- genCluster(dc, "pcp", npats, "id")
  dd <- addColumns(defa, dd)
  dd <- genSurv(dd, defs, digits = 0)
  dx <- addCompRisk(dd, events = c("tte", "censor"), 
                    timeName = "time", censorName = "censor", keepEvents = TRUE)
  dx <- addColumns(defa2, dx)
  
  dx[]
  
}

s_replicate <- function(argsvec, list_of_defs) {
  
  dx <- s_generate(argsvec, list_of_defs)
  
  coxfitm <- coxme(Surv(time, event) ~ rx + (1 | pcp), data = dx)
  
  list2env(as.list(argsvec), envir = environment())
  
  return(data.table(
    npcps = npcps,
    npats = npats,
    s2 = s2,
    est_s = fixef(coxfitm), 
    re.var_s = VarCorr(coxfitm)$pcp,
    p_s = extract_coxme_table(coxfitm)$p
  ))
  
}

s_scenarios <- function(argsvec, nreps) {
  
  list_of_defs <- s_def()
  
  rbindlist(
    parallel::mclapply(
      X = 1 : nreps, 
      FUN = function(x) s_replicate(argsvec, list_of_defs), 
      mc.cores = 4)
  )
  
}
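
A side note on the p-value computed in extract_coxme_table: the chi-square form used there is the same as a two-sided normal test on z. The check below is a sketch using hypothetical z statistics. It also shows that pchisq with lower.tail = FALSE keeps very small p-values from collapsing to exactly zero.

```r
z <- c(0.5, 1.96, 3)   # hypothetical z statistics

p_chisq  <- 1 - pchisq(z^2, df = 1)   # form used in extract_coxme_table
p_normal <- 2 * pnorm(-abs(z))        # equivalent two-sided normal p-value

all.equal(p_chisq, p_normal)

# For a very large z statistic, the upper-tail form stays strictly positive
pchisq(18^2, df = 1, lower.tail = FALSE)
```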

Generating variable cluster sizes to assess power in cluster randomized trials

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 18 Apr 2023 00:00:00 +0000

In recent discussions with a number of collaborators at the NIA IMPACT Collaboratory about setting the sample size for a proposed cluster randomized trial, the question of variable cluster sizes has come up a number of times. Given a fixed overall sample size, it is generally better (in terms of statistical power) if the sample is equally distributed across the different clusters; highly variable cluster sizes increase the standard errors of effect size estimates and reduce the ability to determine if an intervention or treatment is effective.

When I started to prepare a quick simulation to demonstrate this phenomenon, I quickly realized that there was no easy way using simstudy (my simulation package of choice) to generate the desired variable cluster sizes while holding the total sample size constant. I thought about it for a bit and came up with a simple solution that is now implemented and available for download (devtools::install_github("kgoldfeld/simstudy")). My plan here is to describe this solution, and then show the results of the simulation that inspired the need for (and uses) the new functionality.

Quick recap on how to generate cluster data with simstudy

There are two ways I would have typically simulated clustered data using a data generation process defined by a linear mixed-effects model. In both cases, we would define a variable $n$ that represents the number of subjects per cluster.

In the first approach, we would assume perfectly balanced cluster sizes and fix $n$ at a constant value. In the example, we will generate 10 clusters with 20 members each. In this case, I am generating the cluster-level random effect and treatment assignment. The individual-level outcome is a function of the treatment assignment and the cluster effect, as well as random individual-level variation. All of this is specified in the data generation definitions:

library(simstudy)

d0 <- defData(varname = "n", formula = 20, dist = "nonrandom")
d0 <- defData(d0, varname = "a", formula = 0, variance = 0.33)
d0 <- defData(d0, varname = "rx", formula = "1;1", dist = "trtAssign")

d1 <- defDataAdd(varname = "y", formula = "18 + 1.6 * rx + a", 
          variance = 16, dist = "normal")

The data are generated in two steps. First, the cluster-level data are generated:

set.seed(2761)

dc <- genData(10, d0, "site")
dc

##     site  n       a rx
##  1:    1 20 -0.3548  1
##  2:    2 20 -1.1232  1
##  3:    3 20 -0.5963  0
##  4:    4 20 -0.0503  1
##  5:    5 20  0.0894  0
##  6:    6 20  0.5294  1
##  7:    7 20  1.2302  0
##  8:    8 20  0.9663  1
##  9:    9 20  0.0993  0
## 10:   10 20  0.6508  0

And then the individual level data are generated, $n = 20$ subjects for each site:

dd <- genCluster(dc, "site", "n", "id")
dd <- addColumns(d1, dd)
dd

##      site  n      a rx  id    y
##   1:    1 20 -0.355  1   1 17.7
##   2:    1 20 -0.355  1   2 16.2
##   3:    1 20 -0.355  1   3 19.2
##   4:    1 20 -0.355  1   4 20.6
##   5:    1 20 -0.355  1   5 14.7
##  ---                           
## 196:   10 20  0.651  0 196 25.3
## 197:   10 20  0.651  0 197 22.1
## 198:   10 20  0.651  0 198 13.2
## 199:   10 20  0.651  0 199 15.6
## 200:   10 20  0.651  0 200 13.8

If we want variable cluster sizes, we could slightly modify the data definitions so that $n$ is no longer constant. (From here on out, I am just generating $n$ and the cluster-level data without the random effects and treatment assignment, but I could have just as easily included the full data.) Here, I am using the Poisson distribution, but I could use the negative binomial distribution if I wanted more variation across clusters:

d0 <- defData(varname = "n", formula = 20, dist = "poisson")
genData(10, d0, "site")

##     site  n
##  1:    1 13
##  2:    2 18
##  3:    3 21
##  4:    4 26
##  5:    5 25
##  6:    6 27
##  7:    7 23
##  8:    8 30
##  9:    9 23
## 10:   10 20

This is great, but the total sample size is no longer fixed at 200 (here we have randomly generated 226 individuals). The total will vary from sample to sample. So, if we want to have both across-cluster variability and constant total sample size, we need a new approach.
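To get a feel for how much the total can move around, here is a small base R sketch (no simstudy required) that repeats the Poisson draw many times and summarizes the resulting totals. The seed and the number of repetitions are arbitrary choices.

```r
set.seed(123)  # arbitrary seed

# Total sample size across 10 clusters when each cluster size is Poisson(20)
totals <- replicate(10000, sum(rpois(10, lambda = 20)))

mean(totals)    # close to 200 on average
sd(totals)      # standard deviation near sqrt(200), about 14
range(totals)
```

Since a sum of independent Poisson variables is itself Poisson, the total has mean 200 and variance 200, so a draw like 226 is not unusual.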

New approach using simstudy

There is a new simstudy distribution called “clusterSize”, which requires two parameters: the (fixed) total sample size (input into the formula field) and a (non-negative) dispersion measure that represents the variability across clusters (input into the variance field). (The idea behind the data generation is described in the addendum.) If the dispersion is set to $0$, then we will have constant cluster sizes:

d0 <- defData(varname = "n", formula = 200, variance = 0, dist = "clusterSize")
genData(10, d0, "site")

##     site  n
##  1:    1 20
##  2:    2 20
##  3:    3 20
##  4:    4 20
##  5:    5 20
##  6:    6 20
##  7:    7 20
##  8:    8 20
##  9:    9 20
## 10:   10 20

When we increase the dispersion, we start to introduce cluster-size variability but keep the overall sample size at 200:

d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
genData(10, d0, "site")

##     site  n
##  1:    1 20
##  2:    2 28
##  3:    3 25
##  4:    4 24
##  5:    5 28
##  6:    6 22
##  7:    7  7
##  8:    8 13
##  9:    9 22
## 10:   10 11

And we can have extreme variability with a very high dispersion value:

d0 <- defData(varname = "n", formula = 200, variance = 5, dist = "clusterSize")
genData(10, d0, "site")

##     site   n
##  1:    1  10
##  2:    2   2
##  3:    3  17
##  4:    4   2
##  5:    5  49
##  6:    6 110
##  7:    7   1
##  8:    8   4
##  9:    9   1
## 10:   10   4

Application: cluster-size variability and statistical power

I conducted a simulation experiment to assess the impact of the dispersion parameter on the estimated power for a cluster randomized trial with cluster-level effects. In the simulation (code is available here), I assumed 20 clusters (10 randomized to the experimental arm, 10 to the control arm) and a total of 500 participants (so on average 25 per cluster).

The specific model I used to generate the data was

\[y_{ij} = 20 + 1.6 * A_{i} + a_{i} + e_{ij},\]

where $y_{ij}$ is the continuous outcome for subject $j$ in cluster $i$. $A_i$ is the treatment indicator for cluster $i$: $A_i = 1$ if cluster $i$ has been randomized to the experimental arm, $A_i = 0$ otherwise. $a_i$ is the cluster-specific random effect, and is normally distributed: $a_i \sim N(\mu = 0, \sigma_a^2)$. $e_{ij}$ is the (unmeasured) individual $j$ effect, and is also normally distributed: $e_{ij} \sim N(\mu =0, \sigma_e^2 = 16)$.

Statistical power is directly influenced by overall variability of the outcome, which in this case includes the cluster and individual level variation. Specifically, power is a function of the intra-class (or intra-cluster) correlation (ICC), which can be calculated using \[ICC = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2}.\] In the simulations, ICCs ranged from $0.1$ to $0.4$. Since $\sigma_e^2$ was fixed, the variance $\sigma^2_a$ was determined by the ICC.
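Solving the ICC formula for the cluster-level variance gives $\sigma_a^2 = \sigma_e^2 \times ICC / (1 - ICC)$. The sketch below applies this at the two endpoints of the ICC range mentioned above; the object names are illustrative.

```r
s2_e <- 16            # individual-level variance, fixed in the simulations
icc  <- c(0.1, 0.4)   # endpoints of the ICC range used

s2_a <- icc * s2_e / (1 - icc)   # implied between-cluster variance
round(s2_a, 2)

# Check: plugging the variances back in recovers the ICCs
s2_a / (s2_a + s2_e)
```

So the cluster-level variance runs from about 1.78 at an ICC of 0.1 to about 10.67 at an ICC of 0.4.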

The focus of these simulations is to provide a figure that illustrates the impact of cluster-size variability (with constant total sample size) on power. I used different dispersion assumptions, ranging from 0 to 0.5, to generate different data sets. For each of the 44 ICC/dispersion parameter combinations, I generated 50,000 data sets (yes, I used a high performance computing core) and estimated a linear mixed effect model for each. The power was calculated for each combination by looking at the proportion of the p-values less than 0.05. The figure below shows the results; it appears that both higher ICCs and cluster size variability lead to reduced power:

Given a particular ICC level, it seems pretty clear that cluster-size variability does matter when thinking about power. However, the impact may not be as substantial as the general variability in the outcome across the clusters. Unfortunately, I know of no analytic software that will provide insight into this. But if you are using simulation to conduct the sample size calculation, it is now extremely easy to incorporate cluster size variability into the simulations (particularly if you are using simstudy).

Support:

This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

Addendum - a simple trick to generate variation

Generating the variable cluster sizes under a fixed total is actually quite simple if you take advantage of the Dirichlet distribution. The Dirichlet distribution is essentially a multivariate generalization of the beta distribution. In the Dirichlet distribution, the multivariate values range from 0 to 1, and they sum to 1. Given the range of the data, it is very natural to use values generated from this distribution as probabilities or proportions, which is what the simstudy clusterSize distribution does. This is perhaps easiest to see in a simple example.

Values from the Dirichlet distribution can be generated using the rdirichlet function in the dirmult package. The key parameter, called the concentration parameter, is a vector of length k, where k is the number of values (e.g. clusters) we are interested in generating. In the first example, I am generating 20 values using a concentration parameter of 32. (In the simstudy implementation of the clusterSize distribution, the dispersion parameter $d$ is 1/concentration.)

Generating 20 values, we can see that all values are between 0 and 1, and sum to 1:

x <- dirmult::rdirichlet(1, alpha = rep(32, 20) )[1,]
x

##  [1] 0.0442 0.0465 0.0374 0.0506 0.0516 0.0516 0.0623 0.0429 0.0601 0.0404
## [11] 0.0490 0.0517 0.0589 0.0566 0.0320 0.0568 0.0458 0.0610 0.0542 0.0462

sum(x)

## [1] 1
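
If dirmult is not installed, a draw of the same kind can be produced in base R, because independent gamma variables divided by their sum follow a Dirichlet distribution. This is a sketch of that standard construction, not the code used by rdirichlet or simstudy; the seed and the name x_alt are arbitrary.

```r
set.seed(2023)  # arbitrary seed

k     <- 20   # number of clusters
alpha <- 32   # concentration parameter (dispersion d = 1 / alpha in simstudy terms)

g     <- rgamma(k, shape = alpha, rate = 1)   # independent gamma draws
x_alt <- g / sum(g)                           # normalizing gives one symmetric Dirichlet draw

range(x_alt)
sum(x_alt)
```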

From here, it is easy to generate values between 0 and 400 if we have a total sample size of 400.

s1 <- floor(x*400)
s1

##  [1] 17 18 14 20 20 20 24 17 24 16 19 20 23 22 12 22 18 24 21 18

sum(s1)

## [1] 389

Due to rounding, the sum does not equal 400. In the simstudy function this rounding error is accounted for by allocating an additional unit to randomly selected clusters.
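Here is a sketch of that idea applied to the s1 values printed above. It illustrates the approach only and is not the package's actual code; the seed and the name s1_adj are arbitrary.

```r
# Cluster sizes after flooring, copied from the output above
s1 <- c(17, 18, 14, 20, 20, 20, 24, 17, 24, 16,
        19, 20, 23, 22, 12, 22, 18, 24, 21, 18)

total     <- 400
shortfall <- total - sum(s1)   # units lost to rounding down (11 here)

set.seed(1)  # arbitrary seed
lucky <- sample(length(s1), shortfall)   # clusters that each receive one extra unit

s1_adj <- s1
s1_adj[lucky] <- s1_adj[lucky] + 1

sum(s1_adj)
```

Because each floored value loses less than one unit, the shortfall is always smaller than the number of clusters, so sampling without replacement is enough.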

If we use a lower concentration parameter (in this case 4), there should be more variability, and indeed there appears to be:

x <- dirmult::rdirichlet(1, alpha = rep(4, 20) )[1,]
s2 <- floor(x * 400)
s2

##  [1] 17 27  7 11 18 24 13 26  5 25 23 28 19 26 14 11 43 22 24  9

It appears that the second sample is more variable than the first, and we can confirm this with the standard deviation:

c(sd1 = sd(s1), sd2 = sd(s2))

##  sd1  sd2 
## 3.30 9.05

Implementing a one-step GEE algorithm for very large cluster sizes in R

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 Mar 2023 00:00:00 +0000

Very large data sets can present estimation problems for some statistical models, particularly ones that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models that are used when individual observations are correlated within groups can pose severe computational challenges when the cluster sizes get too large. GEE are often used when repeated measures for an individual are collected over time; the individual is considered the cluster in this analysis. Estimation in this case is not really an issue because the cluster sizes are typically relatively small. However, if there are groups of individuals, we also need to account for correlation. Unfortunately, if these group/cluster sizes are too large - perhaps bigger than 1000 - traditional GEE estimation techniques just may not be feasible.

An approach to GEE that is feasible has been described by Lipsitz et al and implemented using a SAS macro (which is available on the journal’s website). I am not much of a SAS user, so I searched for an implementation in R. Since I didn’t come across anything, I went ahead and implemented it myself. I am undecided about creating an R package for this, but in the meantime I thought I would compare it to a standard package in R and provide a link to the code if you’d like to implement it yourself. (And, if it does already exist in an R package, definitely let me know, because I certainly don’t want to duplicate anything.)

The one-step GEE algorithm

Traditional GEE models (such as those fit with R packages gee and geepack) allow for flexibility in specifying the within-cluster correlation structure (we generally still assume that individuals in different clusters are uncorrelated). For example, one could assume that the correlation across individuals is constant within a cluster. We call this exchangeable or compound symmetry correlation, and the intra-cluster correlation (ICC) is the measure of that correlation. Alternatively, if measurements are collected over time, we might assume that measurements taken closer together are more highly correlated; this is called auto-regressive correlation.

The proposed algorithm that is implemented here is called the one-step GEE, and is operating under the assumption of exchangeable correlation. To provide a little more detail on the algorithm, but to keep it simple, let me quote directly from the paper’s abstract:

We propose a one-step GEE estimator that (1) matches the asymptotic efficiency of the fully iterated GEE; (2) uses a simpler formula to estimate the [intra-cluster correlation] ICC that avoids summing over all pairs; and (3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this article is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.
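
To give a flavor of how a sum over all pairs can be avoided, the sketch below uses a simple algebraic identity: the sum of $r_j r_k$ over all pairs $j < k$ equals $((\sum_j r_j)^2 - \sum_j r_j^2)/2$. This illustrates the general idea only and is not the estimator from the paper; the vector r is a simulated stand-in for the residuals in a single cluster.

```r
set.seed(42)  # arbitrary seed
r <- rnorm(200)   # stand-in for the residuals in one cluster of size 200

# Brute force: sum of r_j * r_k over all pairs j < k
m <- outer(r, r)
pairs_slow <- sum(m[upper.tri(m)])

# Same quantity from two running sums, with no enumeration of pairs
pairs_fast <- (sum(r)^2 - sum(r^2)) / 2

all.equal(pairs_slow, pairs_fast)
```

The brute-force version builds an n-by-n matrix, which is what becomes infeasible with very large clusters, while the second version only needs two sums over the cluster.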

The rest of the way, I will simulate data and fit the models using the traditional estimation approach as well as the one-step approach.

Comparing standard GEE with one-step GEE

To start, I am simulating a simple data set with 100 clusters that average 100 individuals per cluster.

library(simstudy)
library(data.table)
library(geepack)

First, the definitions used in the data generation. Each individual has three covariates, and the probability is a function of two of them:

d1 <- defData(varname = "n", formula = 100, dist = "noZeroPoisson")

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "p", formula = "-0.7 + 0.7*x1 - 0.4*x2", 
        dist = "nonrandom", link="logit")

And then the data generation - the final step creates the within-site correlated outcomes with an ICC of 0.15:

set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dc <- addColumns(d2, dc)

dd <- addCorGen(dc, idvar = "site", param1 = "p",
        rho = 0.15, corstr = "cs", dist = "binary", cnames = "y", method = "ep")

Here are a few records from the data set, which has just under 10,000 observations across the 100 clusters:

dd

##       site   n   id          x1          x2          x3         p y
##    1:    1  88    1 -0.57111723  0.21022042 -0.31201547 0.2343570 0
##    2:    1  88    2 -0.18406857  0.68915628 -0.72801775 0.2488957 0
##    3:    1  88    3 -0.35066169 -0.10884256  0.28141278 0.2886548 0
##    4:    1  88    4 -0.32095917 -0.03676544 -0.33847489 0.2870070 0
##    5:    1  88    5 -0.05132678  0.07095837 -0.09308927 0.3177108 1
##   ---                                                              
## 9784:  100 106 9784 -0.26912777 -0.22893853  0.16303770 0.3107074 1
## 9785:  100 106 9785  0.16399013  0.26200677 -0.03187061 0.3340309 0
## 9786:  100 106 9786  0.14450843 -0.24399797 -0.28108422 0.3772482 1
## 9787:  100 106 9787  0.29307929  0.03889844 -0.37495296 0.3750989 0
## 9788:  100 106 9788 -0.06246070 -0.41041326  0.43236084 0.3590345 1

We can fit a regular GEE model here, since the cluster sizes are relatively small:

system.time(geefit <- geese(y ~ x1 + x2 + x3, id = site, data = dd, 
  family = binomial, corstr = "exchangeable"))

##    user  system elapsed 
##   2.245   0.103   1.903

summary(geefit)

## 
## Call:
## geese(formula = y ~ x1 + x2 + x3, id = site, data = dd, family = binomial, 
##     corstr = "exchangeable")
## 
## Mean Model:
##  Mean Link:                 logit 
##  Variance to Mean Relation: binomial 
## 
##  Coefficients:
##               estimate     san.se        wald            p
## (Intercept) -0.7275165 0.08459309  73.9632404 0.000000e+00
## x1           0.7396758 0.06584516 126.1929384 0.000000e+00
## x2          -0.3191633 0.06073004  27.6196887 1.476680e-07
## x3          -0.0477172 0.06113708   0.6091727 4.350995e-01
## 
## Scale Model:
##  Scale Link:                identity 
## 
##  Estimated Scale Parameters:
##              estimate     san.se     wald p
## (Intercept) 0.9954265 0.03464657 825.4633 0
## 
## Correlation Model:
##  Correlation Structure:     exchangeable 
##  Correlation Link:          identity 
## 
##  Estimated Correlation Parameters:
##        estimate     san.se     wald            p
## alpha 0.1441719 0.02168204 44.21411 2.943523e-11
## 
## Returned Error Value:    0 
## Number of clusters:   100   Maximum cluster size: 125

The one-step GEE function (which I’ve called gee1step) runs quite a bit faster than the standard GEE model (more than 10 times faster), but the results are virtually identical.

system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   0.138   0.018   0.090

fit1

## $estimates
##                   est     se.err          z      p.value
## Intercept -0.72743563 0.08549337 -8.5086793 1.759257e-17
## x1         0.73943837 0.06704127 11.0296000 2.750859e-28
## x2        -0.31903588 0.06136514 -5.1989760 2.003894e-07
## x3        -0.04758544 0.06165700 -0.7717768 4.402466e-01
## 
## $rho
## [1] 0.1440159
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 97.88
## 
## $clusters$min_size
## [1] 77
## 
## $clusters$max_size
## [1] 125
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"
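The two printouts report different test statistics: geese gives a Wald chi-square, while gee1step gives a z-statistic. Here is a quick base R check that uses only the numbers copied from the output above (the vector names are mine, chosen for illustration). It confirms that the Wald statistic is the squared ratio of the estimate to its standard error, and it shows how close the two fits are.

```r
# Estimates and standard errors copied from the two printouts above
geese_est <- c(-0.7275165, 0.7396758, -0.3191633, -0.0477172)
geese_se  <- c(0.08459309, 0.06584516, 0.06073004, 0.06113708)
one_est   <- c(-0.72743563, 0.73943837, -0.31903588, -0.04758544)
one_se    <- c(0.08549337, 0.06704127, 0.06136514, 0.06165700)

# geese: Wald chi-square = (estimate / standard error)^2, 1 degree of freedom
geese_wald <- (geese_est / geese_se)^2
geese_p    <- pchisq(geese_wald, df = 1, lower.tail = FALSE)

# gee1step: z = estimate / standard error
one_z <- one_est / one_se

# Largest absolute differences between the two fits
max_diff_est <- max(abs(geese_est - one_est))
max_diff_se  <- max(abs(geese_se - one_se))

round(cbind(geese_wald, one_z_squared = one_z^2), 2)
c(max_diff_est = max_diff_est, max_diff_se = max_diff_se)
```

Squaring the gee1step z-statistics puts them on the same scale as the geese Wald column, so the two tables can be read side by side.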

The one-step algorithm with very large cluster sizes

Obviously, in the previous example, gee1step is unnecessary because geese handled the data set just fine. But, in the next example, with an average of 10,000 observations per cluster, geese will not run - at least not on my MacBook Pro. gee1step does just fine. I’m generating the data slightly differently here since simstudy doesn’t do well with extremely large correlation matrices. I’m using a random effect instead to induce correlation:

vicc <- iccRE(0.15, dist = "binary")

d1 <- defData(varname = "n", formula = 10000, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "b", formula = 0, variance = vicc)

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "y", formula = "-0.7 + 0.7*x1 - 0.4*x2 + b", 
        dist = "binary", link="logit")

### generate data

set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dd <- addColumns(d2, dc)
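For anyone wondering where the random-effect variance comes from, here is a minimal base R sketch. It assumes the usual latent-variable definition of the ICC for a logistic model, ICC = v / (v + pi^2/3), where v is the variance of the random effect. That this is what iccRE computes for a binary outcome is an assumption on my part, and the helper names are hypothetical.

```r
# Assumed latent-variable definition: ICC = v / (v + pi^2 / 3)
icc_to_var_logit <- function(icc) {
  icc / (1 - icc) * (pi^2 / 3)
}

var_to_icc_logit <- function(v) {
  v / (v + pi^2 / 3)
}

v <- icc_to_var_logit(0.15)   # variance of the site-level effect b
icc_back <- var_to_icc_logit(v)

c(variance = v, icc = icc_back)
```

Note that this ICC is defined on the latent (logit) scale, so it need not equal the correlation estimated on the observed binary scale.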

Now, we have almost one million observations:

dd

##         site     n          b     id          x1            x2          x3 y
##      1:    1  9879 -1.3761022      1 -0.11929302  0.4136057653 -0.16119383 1
##      2:    1  9879 -1.3761022      2  0.03086998  0.3085905336  0.46055431 0
##      3:    1  9879 -1.3761022      3  0.51821656  0.2047086274  0.16721059 0
##      4:    1  9879 -1.3761022      4 -0.27688665  0.0030742246 -1.08113944 0
##      5:    1  9879 -1.3761022      5  0.03850389 -0.0991678903 -0.09447011 0
##     ---                                                                     
## 997872:  100 10065  0.2648166 997872  0.46022770  0.1588510133 -0.11851768 0
## 997873:  100 10065  0.2648166 997873  0.10696608  0.0002424567 -0.10632926 1
## 997874:  100 10065  0.2648166 997874  0.32317258 -0.1626614787  0.37855380 1
## 997875:  100 10065  0.2648166 997875 -0.17992641 -0.0333636043 -0.20060539 0
## 997876:  100 10065  0.2648166 997876  0.65942651  0.1960137135 -0.05687481 1

Despite the very large cluster sizes, the one-step algorithm still runs very fast. In addition to what is shown here, I have conducted experiments with repeated data sets to confirm that the coefficient estimates are unbiased and the standard error estimates are correct.

system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   2.371   0.339   1.813

fit1

## $estimates
##                    est      se.err          z      p.value
## Intercept -0.569908988 0.069360637  -8.216605 2.093444e-16
## x1         0.615260991 0.010232501  60.128117 0.000000e+00
## x2        -0.349871546 0.008694526 -40.240439 0.000000e+00
## x3         0.008132196 0.006470784   1.256756 2.088421e-01
## 
## $rho
## [1] 0.1016716
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 9978.76
## 
## $clusters$min_size
## [1] 9766
## 
## $clusters$max_size
## [1] 10242
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"

And please, if someone thinks it would be valuable for me to create a package for this, let me know. It would certainly help motivate me :).

UPDATE: I actually went ahead and created the most bare-bones of packages, gee1step. The package can be downloaded from GitHub (not CRAN) by using the command devtools::install_github("kgoldfeld/gee1step"). I welcome anyone who wants to help me improve it so that it can go up on CRAN.

Reference:

Lipsitz, Stuart, Garrett Fitzmaurice, Debajyoti Sinha, Nathanael Hevelone, Jim Hu, and Louis L. Nguyen. “One-step generalized estimating equations with large cluster sizes.” Journal of Computational and Graphical Statistics 26, no. 3 (2017): 734-737.

simstudy 0.6.0 released: more flexible correlation patterns

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 Feb 2023 00:00:00 +0000

The new version (0.6.0) of simstudy is available for download from CRAN. In addition to some important bug fixes, I’ve added new functionality that should make data generation with correlated data a little more flexible. In the previous post, I described enhancements to the function genCorMat. As part of this release announcement, I’m describing blockExchangeMat and blockDecayMat, two new functions that can be used to generate correlation matrices when there is a temporal element to the data generation.

This is not an entirely new feature in simstudy. In the previous development version of simstudy, I introduced a function genBlockMat (which I wrote about here). I wasn’t satisfied with the (lack of) flexibility and (lack of) clarity of that function, so I took another approach that ultimately made the final cut of the new release. (Of course, it has always been possible to generate correlated data in simstudy indirectly by using random effects, but here I am focusing on marginal correlation rather than conditional correlation.) I provided the motivation for time-dependent correlation matrices in the earlier post, so I won’t repeat myself here (but if you do read that, don’t pay attention to the function I describe there).

The parameterization of the data generation process implemented in blockExchangeMat and blockDecayMat relies on correlation matrix structures described by Li et al. They’ve created a very useful nomenclature for laying out different patterns of correlation that can occur within a cluster over time. This classification is based on the distinction between cross-sectional vs. cohort samples as well as the exchangeable vs. decay patterns of correlation.

Cross-sectional data

In the case where individuals in clusters are “treated” only once over the course of a study, the sample is considered cross-sectional. (This treatment might be the control condition or the experimental condition.) I am assuming that the outcome is always collected relatively close in time to the treatment - this just simplifies things conceptually. However, this does not necessarily need to be the case, as long as the patient is exposed only a single time.

The key point is that in a cross-sectional design, individuals in the same cluster who are treated at different time periods will be unique. The correlation between individual outcomes will be driven entirely by the intra-cluster correlation. The structure of that intra-cluster correlation will depend on the assumption we make about how correlation changes over time: the correlation can reflect either exchangeability or decay.

Exchangeable correlation

Under the assumption of exchangeability, there is a constant within-cluster within-period correlation ($\rho_w$) across all study participants in the same cluster and period. For participants in the same cluster but different periods, the within-cluster between-period correlation ($\rho_b$) is different from $\rho_w$ (presumably lower) but constant over time.

A matrix that includes three periods and two individuals per period is shown below. This represents the correlation structure for a single cluster. Each box represents a different time period. So, the correlation represented in the box in the upper left hand corner is the within-cluster within-period correlation for the first period. The bottom left box represents the within-cluster between-period correlation for the individuals in the first and third periods. (Note that we are assuming that individuals in different clusters are not correlated with each other.)

\scriptsize{ R = \left ( \begin{array}{c|c|c} \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} & \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} \\ \hline \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} \\ \hline \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} & \begin{matrix} \rho_b & \rho_b \\ \rho_b & \rho_b \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} \end{array} \right ) }
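To make the block structure concrete, here is a minimal base R sketch that builds this matrix directly. The values $\rho_w = 0.5$ and $\rho_b = 0.3$ are illustrative, and this is only my own construction of the pattern, not a description of how simstudy builds it internally.

```r
n_ind <- 2     # individuals per period
n_per <- 3     # periods
rho_w <- 0.5   # within-period correlation (illustrative value)
rho_b <- 0.3   # between-period correlation (illustrative value)

n <- n_ind * n_per

# Start with rho_b everywhere, then lift the diagonal blocks up to rho_w
R <- matrix(rho_b, n, n) +
  kronecker(diag(n_per), matrix(rho_w - rho_b, n_ind, n_ind))

# Each individual is perfectly correlated with itself
diag(R) <- 1

R
```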

Decaying correlation

Under the assumption of decay, the within-cluster within-period correlation ($\rho_w$) is the same as under the exchangeability assumptions. The between-period correlation is now a function of the difference in time when the two individuals were treated. The correlation is $\rho_w \times r^{|s-t|}$, where $r$ is a decay parameter between 0 and 1, and $s$ and $t$ are the periods under consideration. For example, in the lower left-hand box, we have the correlation between individuals in the first period ($s=1$) and individuals in the third period ($t=3$), which gives a correlation coefficient of $\rho_w \times r^{|1-3|} = \rho_w \times r^2$. As the difference in periods grows, $r^{|s-t|}$ (and therefore, correlation) gets smaller.

\scriptsize{ R = \left ( \begin{array}{c|c|c} \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_w r & \rho_w r \\ \rho_w r & \rho_w r \end{matrix} & \begin{matrix} \rho_w r^2 & \rho_w r^2 \\ \rho_w r^2 & \rho_w r^2 \end{matrix} \\ \hline \begin{matrix} \rho_w r & \rho_w r \\ \rho_w r & \rho_w r \end{matrix}& \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_w r & \rho_w r \\ \rho_w r & \rho_w r \end{matrix} \\ \hline \begin{matrix} \rho_w r^2 & \rho_w r^2 \\ \rho_w r^2 & \rho_w r^2 \end{matrix} & \begin{matrix} \rho_w r & \rho_w r \\ \rho_w r & \rho_w r \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} \end{array} \right ) }
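The decay pattern can be written compactly by tracking the period each individual belongs to. This is a minimal base R sketch with illustrative values $\rho_w = 0.5$ and $r = 0.8$; the variable names are my own and the code is not taken from simstudy.

```r
n_ind <- 2     # individuals per period
n_per <- 3     # periods
rho_w <- 0.5   # within-period correlation (illustrative value)
r     <- 0.8   # decay parameter (illustrative value)

# Period membership for each row/column of the matrix: 1 1 2 2 3 3
period <- rep(seq_len(n_per), each = n_ind)

# |s - t| for every pair of individuals
lag <- abs(outer(period, period, "-"))

# rho_w * r^|s-t|; within a period the lag is 0, which leaves rho_w
R <- rho_w * r^lag
diag(R) <- 1

R
```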

Closed cohort

When individuals in clusters are “treated” or exposed in each period of a study, the sample is considered to be a cohort. If every individual is measured in each period, as I’ve just described, this would be a closed cohort; it is closed in the sense that once the cohort is defined at the beginning of the study, no new participants are added. If we allow participants to start and stop at random points, this would be an open cohort design. For the purposes of simulation, it is challenging to generate data under an open cohort design with this marginal approach (using correlation matrices), and it is much easier to do with random effects (which I did here). Everything I describe here applies to closed cohorts only.

Exchangeable

The key difference between the cross-sectional and cohort design is the within-individual between-period (auto) correlation. Under the exchangeable assumption, the autocorrelation is specified with the correlation coefficient $\rho_a$. The within-period between-individual correlation is still $\rho_w$, and the between-period between-individual correlation is still $\rho_b$. All of these correlations remain constant in the exchangeable framework:

\scriptsize{ R = \left ( \begin{array}{c|c|c} \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} & \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} \\ \hline \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} \\ \hline \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} & \begin{matrix} \rho_a & \rho_b \\ \rho_b & \rho_a \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} \end{array} \right ) }
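In a cohort, each row of the matrix is identified by both a period and a person, and the three correlations follow from whether two rows share one, the other, or neither. Here is a minimal base R sketch with illustrative values ($\rho_w = 0.5$, $\rho_b = 0.3$, $\rho_a = 0.4$); it is my own construction, not simstudy's code.

```r
n_ind <- 2     # individuals in the cohort
n_per <- 3     # periods
rho_w <- 0.5   # same period, different person (illustrative value)
rho_b <- 0.3   # different period, different person (illustrative value)
rho_a <- 0.4   # same person, different period (illustrative value)

period <- rep(seq_len(n_per), each = n_ind)    # 1 1 2 2 3 3
person <- rep(seq_len(n_ind), times = n_per)   # 1 2 1 2 1 2

same_period <- outer(period, period, "==")
same_person <- outer(person, person, "==")

R <- matrix(rho_b, n_ind * n_per, n_ind * n_per)
R[same_period] <- rho_w   # within-period blocks
R[same_person] <- rho_a   # repeated measures on the same person
diag(R) <- 1              # same person and same period

R
```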

Decay

The decay structure under an assumption of a closed cohort is the last of the four possible variations. The within-period between-individual correlation $\rho_w$ remains the same, and so does the between-period between-individual correlation $\rho_wr^{|s-t|}$. The between-period within-individual correlation is specified as $r^{|s-t|}$:

\scriptsize{ R = \left ( \begin{array}{c|c|c} \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} r & \rho_w r \\ \rho_w r & r \end{matrix} & \begin{matrix} r^2 & \rho_w r^2 \\ \rho_w r^2 & r^2 \end{matrix} \\ \hline \begin{matrix} r & \rho_w r \\ \rho_w r & r \end{matrix}& \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} & \begin{matrix} r & \rho_w r \\ \rho_w r & r \end{matrix} \\ \hline \begin{matrix} r^2 & \rho_w r^2 \\ \rho_w r^2 & r^2 \end{matrix} & \begin{matrix} r & \rho_w r \\ \rho_w r & r \end{matrix} & \begin{matrix} 1 & \rho_w \\ \rho_w & 1 \end{matrix} \end{array} \right ) }
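Under this parameterization the cohort decay matrix is a Kronecker product of a first-order autoregressive matrix over periods and a 2 x 2 exchangeable matrix over individuals. The base R sketch below builds the matrix entry by entry and then checks it against that product. The values $\rho_w = 0.5$ and $r = 0.8$ are illustrative, and the code is my own rather than simstudy's.

```r
n_ind <- 2     # individuals in the cohort
n_per <- 3     # periods
rho_w <- 0.5   # same period, different person (illustrative value)
r     <- 0.8   # decay parameter (illustrative value)

period <- rep(seq_len(n_per), each = n_ind)
person <- rep(seq_len(n_ind), times = n_per)

lag <- abs(outer(period, period, "-"))
same_person <- outer(person, person, "==")

# Different people: rho_w * r^|s-t|; same person: r^|s-t|
R <- rho_w * r^lag
R[same_person] <- r^lag[same_person]

# Equivalent construction: AR(1) over periods (x) exchangeable over people
ar1 <- r^abs(outer(seq_len(n_per), seq_len(n_per), "-"))
cs  <- matrix(rho_w, n_ind, n_ind)
diag(cs) <- 1
R_kron <- kronecker(ar1, cs)

all.equal(R, R_kron)
```

One consequence of the product form is that the matrix is positive definite whenever both factors are, which holds for any $r$ and $\rho_w$ strictly between 0 and 1.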

Generating block matrices and simulating data

We can put all of this into action with the help of two new functions blockExchangeMat and blockDecayMat. In the simulations that follow, I will start with two clusters and three periods, just to keep it simple. In the addendum, I will provide a more elaborate example.

Cross-sectional data with exchangeable correlation

In the first example, we specify $\rho_w = 0.5$ and $\rho_b = 0.3$, and there will be two individuals per cluster per period, so a total of six individuals per cluster. Since the assumptions are the same for both clusters, it is only necessary to generate a single correlation matrix that will apply to each cluster (as opposed to a list of clusters, which I will generate in the more involved example at the end).

library(simstudy)
library(data.table)

R_XE <- blockExchangeMat(ninds = 2 , nperiods = 3, rho_w = 0.5,
  rho_b = 0.3, pattern = "xsection")

R_XE

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  0.5  0.3  0.3  0.3  0.3
## [2,]  0.5  1.0  0.3  0.3  0.3  0.3
## [3,]  0.3  0.3  1.0  0.5  0.3  0.3
## [4,]  0.3  0.3  0.5  1.0  0.3  0.3
## [5,]  0.3  0.3  0.3  0.3  1.0  0.5
## [6,]  0.3  0.3  0.3  0.3  0.5  1.0

The code block that follows generates two clusters (with a different mean outcome for each cluster), then 3 periods for each cluster, and finally 2 individuals per period (per cluster). The correlated data are generated in the last step, using the correlation matrix $R_{XE}$.

defC <- defData(varname = "lambda", formula = "c(4, 7)", dist = "nonrandom")

set.seed(1234)

dc <- genData(2, defC, id = "site")
dp <- addPeriods(dc, 3, idvars = "site")
dx <- genCluster(dtClust = dp, cLevelVar = "timeID", 
                 numIndsVar = 2, level1ID = "id")

dd <- addCorGen(dx, idvar = "site", corMatrix = R_XE,
  dist = "poisson", param1 = "lambda", cnames = "y")

dd

##     site period lambda timeID id y
##  1:    1      0      4      1  1 4
##  2:    1      0      4      1  2 2
##  3:    1      1      4      2  3 2
##  4:    1      1      4      2  4 4
##  5:    1      2      4      3  5 2
##  6:    1      2      4      3  6 2
##  7:    2      0      7      4  7 3
##  8:    2      0      7      4  8 5
##  9:    2      1      7      5  9 2
## 10:    2      1      7      5 10 4
## 11:    2      2      7      6 11 5
## 12:    2      2      7      6 12 5

The next function generates 5000 data sets for these 12 individuals, so that we can estimate an empirical correlation matrix that we can compare with the true correlation matrix. In this case, it looks like things have worked out quite well.

replicate <- function(R, dx) {
  reps <- lapply(1:5000, function(x)
    addCorGen(dx, idvar = "site", corMatrix = R,
      dist = "poisson", param1 = "lambda", cnames = "y")
  )

  drep <- data.table::rbindlist(reps, idcol = "rep")
  drep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1])
  round(cor(dmat), 1) 
}

replicate(R = R_XE, dx = dx)

##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.5 0.3 0.3 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.5 1.0 0.3 0.3 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.3 0.3 1.0 0.5 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.3 0.3 0.5 1.0 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.3 0.3 0.3 0.3 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.3 0.3 0.3 0.3 0.5 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.3 0.3 0.3 0.3
## 8  0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.3 0.3 0.3 0.3
## 9  0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 1.0 0.5 0.3 0.3
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 0.5 1.0 0.3 0.3
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 0.3 0.3 1.0 0.5
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 0.3 0.3 0.5 1.0
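Rounding to one decimal place hides some simulation noise, so it is worth asking how much noise to expect. As a rough guide, the large-sample standard error of a sample correlation based on $n$ replicates is about $(1 - \rho^2)/\sqrt{n}$. That approximation assumes bivariate normal data, and the outcomes here are Poisson, so the numbers below are only indicative.

```r
n_reps <- 5000
rho <- c(0, 0.3, 0.5)   # correlations that appear in the target matrix

# Approximate standard error of a sample correlation (bivariate normal theory)
se_rho <- (1 - rho^2) / sqrt(n_reps)

data.frame(rho = rho, se = round(se_rho, 4))
```

With standard errors of this size, agreement to one decimal place is what we would expect if the data generation is working as intended.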

Cross-sectional data with correlation decay

We repeat these steps, but now with a decay parameter $r = 0.8$ and no parameter $\rho_b$.

R_XD <- blockDecayMat(ninds = 2 , nperiods = 3, rho_w = 0.5,
  r = 0.8, pattern = "xsection")

R_XD

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.50  0.4  0.4 0.32 0.32
## [2,] 0.50 1.00  0.4  0.4 0.32 0.32
## [3,] 0.40 0.40  1.0  0.5 0.40 0.40
## [4,] 0.40 0.40  0.5  1.0 0.40 0.40
## [5,] 0.32 0.32  0.4  0.4 1.00 0.50
## [6,] 0.32 0.32  0.4  0.4 0.50 1.00

The empirical correlation matches the matrix $R_{XD}$:

replicate(R = R_XD, dx = dx)

##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.5 0.4 0.4 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.5 1.0 0.4 0.4 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.4 0.4 1.0 0.5 0.4 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.4 0.4 0.5 1.0 0.4 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.3 0.3 0.4 0.4 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.3 0.3 0.4 0.4 0.5 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.4 0.4 0.3 0.3
## 8  0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.4 0.4 0.3 0.3
## 9  0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.4 1.0 0.5 0.4 0.4
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.4 0.5 1.0 0.4 0.4
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 0.4 0.4 1.0 0.5
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3 0.4 0.4 0.5 1.0

Cohort data with exchangeable correlation

Since we have a cohort, we introduce $\rho_a$ = 0.4, and specify $pattern = \text{"cohort"}$:

R_CE <- blockExchangeMat(ninds = 2 , nperiods = 3, rho_w = 0.5, 
  rho_b = 0.3, rho_a = 0.4, pattern = "cohort")

R_CE

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  0.5  0.4  0.3  0.4  0.3
## [2,]  0.5  1.0  0.3  0.4  0.3  0.4
## [3,]  0.4  0.3  1.0  0.5  0.4  0.3
## [4,]  0.3  0.4  0.5  1.0  0.3  0.4
## [5,]  0.4  0.3  0.4  0.3  1.0  0.5
## [6,]  0.3  0.4  0.3  0.4  0.5  1.0

And here is the empirical correlation matrix:

replicate(R = R_CE, dx = dx)

##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.5 0.4 0.3 0.4 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.5 1.0 0.3 0.4 0.3 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.4 0.3 1.0 0.5 0.4 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.3 0.4 0.5 1.0 0.3 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.4 0.3 0.4 0.3 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.3 0.4 0.3 0.4 0.5 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.4 0.3 0.4 0.3
## 8  0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.3 0.4 0.3 0.4
## 9  0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.3 1.0 0.5 0.4 0.3
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.4 0.5 1.0 0.3 0.4
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.3 0.4 0.3 1.0 0.5
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.4 0.3 0.4 0.5 1.0

Cohort data with correlation decay

In the final case, the parameterization for decaying correlation with a cohort is the same as for decay in a cross-sectional design; the only difference is that we set $pattern = \text{"cohort"}$:

R_CD <- blockDecayMat(ninds = 2 , nperiods = 3, rho_w = 0.5, 
  r = 0.8, pattern = "cohort")

R_CD

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.50  0.8  0.4 0.64 0.32
## [2,] 0.50 1.00  0.4  0.8 0.32 0.64
## [3,] 0.80 0.40  1.0  0.5 0.80 0.40
## [4,] 0.40 0.80  0.5  1.0 0.40 0.80
## [5,] 0.64 0.32  0.8  0.4 1.00 0.50
## [6,] 0.32 0.64  0.4  0.8 0.50 1.00

And in this final case, the empirical correlation matrix also matches quite well:

replicate(R = R_CD, dx = dx)

##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.5 0.8 0.4 0.6 0.3 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.5 1.0 0.4 0.8 0.3 0.6 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.8 0.4 1.0 0.5 0.8 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.4 0.8 0.5 1.0 0.4 0.8 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.6 0.3 0.8 0.4 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.3 0.6 0.4 0.8 0.5 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.8 0.4 0.6 0.3
## 8  0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.4 0.8 0.3 0.6
## 9  0.0 0.0 0.0 0.0 0.0 0.0 0.8 0.4 1.0 0.5 0.8 0.4
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.5 1.0 0.4 0.8
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.3 0.8 0.4 1.0 0.5
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.6 0.4 0.8 0.5 1.0

Reference:

Li, Fan, James P. Hughes, Karla Hemming, Monica Taljaard, Edward R. Melnick, and Patrick J. Heagerty. “Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: an overview.” Statistical Methods in Medical Research 30, no. 2 (2021): 612-639.

Addendum

I want to provide a slightly more elaborate example to show the true flexibility of the new functions. In the case of a cross-sectional design, the number of observations per period for a specific cluster does not need to remain constant (though in the case of data generation under a cohort design it does). So, we can vary the total number of observations as well as the correlation parameters by cluster.

In this example, there are 10 clusters and three periods. The number of individuals per cluster per period is randomly generated, and ranges from two to four. The decay rate $r$ varies by cluster (and is generated using the beta distribution with shape parameters 6 and 2). The parameter $\rho_w$ is constant across all clusters, and is 0.6.

defC <- defData(varname = "lambda", formula = "sample(5:10, 1)", dist = "nonrandom")
defP <- defDataAdd(varname = "n", formula = "2;4", dist="uniformInt")

dc <- genData(n = 10, dtDefs = defC, id = "site")
dc <- addPeriods(dtName = dc, nPeriods = 3, 
                 idvars = "site", perName = "period")
dc <- addColumns(defP, dc)

dd <- genCluster(dtClust = dc, cLevelVar = "timeID", 
  numIndsVar = "n", level1ID = "id")

Here are the counts for three sites:

dc[site %in% c(1, 3, 7), .(site, period, n)]

##    site period n
## 1:    1      0 4
## 2:    1      1 4
## 3:    1      2 4
## 4:    3      0 2
## 5:    3      1 4
## 6:    3      2 2
## 7:    7      0 4
## 8:    7      1 4
## 9:    7      2 2

And here are the unique decay rates for the same sites:

r <- round(rbeta(10, 6, 2), 2)
r[c(1, 3, 7)]

## [1] 0.82 0.80 0.93

And finally, here are the correlation matrices for these three sites:

N <- dd[, .N, keyby = .(site, period)][, N]

R <- blockDecayMat(ninds = N , nperiods = 3, rho_w = 0.6, r = r, nclusters = 10)

lapply(R, function(x) round(x,2))[c(1, 3, 7)]

## [[1]]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
##  [1,] 1.00 0.60 0.60 0.60 0.49 0.49 0.49 0.49 0.40  0.40  0.40  0.40
##  [2,] 0.60 1.00 0.60 0.60 0.49 0.49 0.49 0.49 0.40  0.40  0.40  0.40
##  [3,] 0.60 0.60 1.00 0.60 0.49 0.49 0.49 0.49 0.40  0.40  0.40  0.40
##  [4,] 0.60 0.60 0.60 1.00 0.49 0.49 0.49 0.49 0.40  0.40  0.40  0.40
##  [5,] 0.49 0.49 0.49 0.49 1.00 0.60 0.60 0.60 0.49  0.49  0.49  0.49
##  [6,] 0.49 0.49 0.49 0.49 0.60 1.00 0.60 0.60 0.49  0.49  0.49  0.49
##  [7,] 0.49 0.49 0.49 0.49 0.60 0.60 1.00 0.60 0.49  0.49  0.49  0.49
##  [8,] 0.49 0.49 0.49 0.49 0.60 0.60 0.60 1.00 0.49  0.49  0.49  0.49
##  [9,] 0.40 0.40 0.40 0.40 0.49 0.49 0.49 0.49 1.00  0.60  0.60  0.60
## [10,] 0.40 0.40 0.40 0.40 0.49 0.49 0.49 0.49 0.60  1.00  0.60  0.60
## [11,] 0.40 0.40 0.40 0.40 0.49 0.49 0.49 0.49 0.60  0.60  1.00  0.60
## [12,] 0.40 0.40 0.40 0.40 0.49 0.49 0.49 0.49 0.60  0.60  0.60  1.00
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1.00 0.60 0.48 0.48 0.48 0.48 0.38 0.38
## [2,] 0.60 1.00 0.48 0.48 0.48 0.48 0.38 0.38
## [3,] 0.48 0.48 1.00 0.60 0.60 0.60 0.48 0.48
## [4,] 0.48 0.48 0.60 1.00 0.60 0.60 0.48 0.48
## [5,] 0.48 0.48 0.60 0.60 1.00 0.60 0.48 0.48
## [6,] 0.48 0.48 0.60 0.60 0.60 1.00 0.48 0.48
## [7,] 0.38 0.38 0.48 0.48 0.48 0.48 1.00 0.60
## [8,] 0.38 0.38 0.48 0.48 0.48 0.48 0.60 1.00
## 
## [[3]]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] 1.00 0.60 0.60 0.60 0.56 0.56 0.56 0.56 0.52  0.52
##  [2,] 0.60 1.00 0.60 0.60 0.56 0.56 0.56 0.56 0.52  0.52
##  [3,] 0.60 0.60 1.00 0.60 0.56 0.56 0.56 0.56 0.52  0.52
##  [4,] 0.60 0.60 0.60 1.00 0.56 0.56 0.56 0.56 0.52  0.52
##  [5,] 0.56 0.56 0.56 0.56 1.00 0.60 0.60 0.60 0.56  0.56
##  [6,] 0.56 0.56 0.56 0.56 0.60 1.00 0.60 0.60 0.56  0.56
##  [7,] 0.56 0.56 0.56 0.56 0.60 0.60 1.00 0.60 0.56  0.56
##  [8,] 0.56 0.56 0.56 0.56 0.60 0.60 0.60 1.00 0.56  0.56
##  [9,] 0.52 0.52 0.52 0.52 0.56 0.56 0.56 0.56 1.00  0.60
## [10,] 0.52 0.52 0.52 0.52 0.56 0.56 0.56 0.56 0.60  1.00

And here are the empirical correlation matrices for the three sites:

reps <- lapply(1:5000, 
  function(x) addCorGen(dd, idvar = "site", corMatrix = R,
    dist = "poisson", param1 = "lambda", cnames = "y")
)

drep <- data.table::rbindlist(reps, idcol = "rep")

empir_corr <- function(cluster) {
  dcrep <- drep[site == cluster, ]
  dcrep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(dcrep, rep ~ seq, value.var = "y")[, -1])
  
  return(round(cor(dmat), 2))
}

empir_corr(cluster = 1)

##       1    2    3    4    5    6    7    8    9   10   11   12
## 1  1.00 0.59 0.60 0.60 0.48 0.49 0.49 0.49 0.37 0.37 0.37 0.39
## 2  0.59 1.00 0.59 0.59 0.50 0.49 0.49 0.50 0.38 0.37 0.39 0.39
## 3  0.60 0.59 1.00 0.58 0.49 0.47 0.49 0.49 0.37 0.37 0.37 0.40
## 4  0.60 0.59 0.58 1.00 0.50 0.48 0.48 0.49 0.39 0.39 0.39 0.41
## 5  0.48 0.50 0.49 0.50 1.00 0.59 0.58 0.60 0.49 0.49 0.48 0.49
## 6  0.49 0.49 0.47 0.48 0.59 1.00 0.59 0.60 0.47 0.47 0.46 0.49
## 7  0.49 0.49 0.49 0.48 0.58 0.59 1.00 0.59 0.48 0.48 0.47 0.50
## 8  0.49 0.50 0.49 0.49 0.60 0.60 0.59 1.00 0.48 0.47 0.47 0.49
## 9  0.37 0.38 0.37 0.39 0.49 0.47 0.48 0.48 1.00 0.58 0.57 0.58
## 10 0.37 0.37 0.37 0.39 0.49 0.47 0.48 0.47 0.58 1.00 0.59 0.58
## 11 0.37 0.39 0.37 0.39 0.48 0.46 0.47 0.47 0.57 0.59 1.00 0.59
## 12 0.39 0.39 0.40 0.41 0.49 0.49 0.50 0.49 0.58 0.58 0.59 1.00

empir_corr(cluster = 3)

##      1    2    3    4    5    6    7    8
## 1 1.00 0.59 0.47 0.46 0.48 0.47 0.38 0.39
## 2 0.59 1.00 0.48 0.46 0.49 0.47 0.37 0.40
## 3 0.47 0.48 1.00 0.60 0.61 0.60 0.48 0.49
## 4 0.46 0.46 0.60 1.00 0.60 0.60 0.48 0.49
## 5 0.48 0.49 0.61 0.60 1.00 0.60 0.49 0.49
## 6 0.47 0.47 0.60 0.60 0.60 1.00 0.49 0.49
## 7 0.38 0.37 0.48 0.48 0.49 0.49 1.00 0.59
## 8 0.39 0.40 0.49 0.49 0.49 0.49 0.59 1.00

empir_corr(cluster = 7)

##       1    2    3    4    5    6    7    8    9   10
## 1  1.00 0.60 0.59 0.62 0.57 0.57 0.56 0.57 0.54 0.52
## 2  0.60 1.00 0.59 0.60 0.56 0.56 0.55 0.55 0.51 0.51
## 3  0.59 0.59 1.00 0.61 0.56 0.57 0.57 0.57 0.51 0.52
## 4  0.62 0.60 0.61 1.00 0.56 0.57 0.57 0.57 0.51 0.53
## 5  0.57 0.56 0.56 0.56 1.00 0.61 0.61 0.60 0.55 0.56
## 6  0.57 0.56 0.57 0.57 0.61 1.00 0.61 0.60 0.56 0.56
## 7  0.56 0.55 0.57 0.57 0.61 0.61 1.00 0.61 0.55 0.56
## 8  0.57 0.55 0.57 0.57 0.60 0.60 0.61 1.00 0.55 0.57
## 9  0.54 0.51 0.51 0.51 0.55 0.56 0.55 0.55 1.00 0.58
## 10 0.52 0.51 0.52 0.53 0.56 0.56 0.56 0.57 0.58 1.00

Flexible correlation generation: an update to genCorMat in simstudy

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 14 Feb 2023 00:00:00 +0000

I’ve been slowly working on some updates to simstudy, focusing mostly on the functionality to generate correlation matrices (which can be used to simulate correlated data). Here, I’m briefly describing the function genCorMat, which has been updated to facilitate the generation of correlation matrices for clusters of different sizes and with potentially different correlation coefficients.

I’ll briefly describe what the existing function can currently do, and then give an idea about what the enhancements will provide.

Simple correlation matrix generation

In its original form, genCorMat could generate a single (square) correlation matrix of a specified dimension. This could be a randomly generated (valid) correlation matrix, or a correlation matrix with a set of specified coefficients.

Here is an example of the first, a randomly generated correlation matrix:

library(simstudy)
library(data.table)
set.seed(37265)

genCorMat(4)

##             [,1]        [,2]        [,3]       [,4]
## [1,]  1.00000000 -0.22742403  0.01285282 -0.3201579
## [2,] -0.22742403  1.00000000 -0.04973973 -0.1218070
## [3,]  0.01285282 -0.04973973  1.00000000 -0.2940923
## [4,] -0.32015788 -0.12180695 -0.29409234  1.0000000

And here is a matrix with a specified set of coefficients (and you will get an error message if it is not positive semidefinite!):

R <- genCorMat(4, cors = c(0.6, 0.4, 0.2, 0.5, 0.3, 0.4))
R

##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.6  0.4  0.2
## [2,]  0.6  1.0  0.5  0.3
## [3,]  0.4  0.5  1.0  0.4
## [4,]  0.2  0.3  0.4  1.0
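Judging from the output above, the six coefficients fill the lower triangle column by column (equivalently, the upper triangle row by row). Here is a minimal base R sketch of that ordering, along with an eigenvalue check for validity. The helper names are hypothetical, the check is for strict positive definiteness, and none of this is taken from the simstudy source.

```r
# Fill a correlation matrix from a vector, lower triangle column by column
cor_from_vector <- function(nvars, cors) {
  R <- diag(nvars)
  R[lower.tri(R)] <- cors
  R[upper.tri(R)] <- t(R)[upper.tri(R)]   # mirror to make it symmetric
  R
}

# A valid (non-degenerate) correlation matrix has all eigenvalues > 0
is_pos_def <- function(R) {
  min(eigen(R, symmetric = TRUE, only.values = TRUE)$values) > 0
}

R_ok  <- cor_from_vector(4, c(0.6, 0.4, 0.2, 0.5, 0.3, 0.4))
R_bad <- cor_from_vector(3, c(0.9, -0.9, 0.9))   # internally inconsistent

R_ok
c(ok = is_pos_def(R_ok), bad = is_pos_def(R_bad))
```

The second matrix fails because variable 1 cannot be strongly positively correlated with variable 2 and strongly negatively correlated with variable 3 while variables 2 and 3 are themselves strongly positively correlated.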

This matrix can be used to generate data using functions genCorData or genCorGen:

dd <- genCorGen(n = 1000, nvars = 4, corMatrix = R, params1 = c(3, 5, 8, 9), 
  dist = "poisson", wide = TRUE)

head(dd)

##    id V1 V2 V3 V4
## 1:  1  3  3  5  8
## 2:  2  3  9 12  8
## 3:  3  1  2 14 12
## 4:  4  4  9 13 14
## 5:  5  4  9  7 11
## 6:  6  4  5  6  8

And the correlation from this data set is quite close to the specified matrix R.

round(cor(as.matrix(dd[, -1])), 1)

##     V1  V2  V3  V4
## V1 1.0 0.6 0.4 0.2
## V2 0.6 1.0 0.5 0.3
## V3 0.4 0.5 1.0 0.5
## V4 0.2 0.3 0.5 1.0

Specifying a structure

With the updated version of genCorMat, it is now possible to specify an exchangeable/compound symmetry or auto-regressive structure. Here is the compound symmetry structure:

genCorMat(nvars = 4, rho = 0.6, corstr = "cs")

##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.6  0.6  0.6
## [2,]  0.6  1.0  0.6  0.6
## [3,]  0.6  0.6  1.0  0.6
## [4,]  0.6  0.6  0.6  1.0

And here is a matrix with an auto-regressive or decaying structure:

genCorMat(nvars = 4, rho = 0.6, corstr = "ar1")

##       [,1] [,2] [,3]  [,4]
## [1,] 1.000 0.60 0.36 0.216
## [2,] 0.600 1.00 0.60 0.360
## [3,] 0.360 0.60 1.00 0.600
## [4,] 0.216 0.36 0.60 1.000
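Both structures are simple enough to write by hand, which can be a useful sanity check. This is a minimal base R sketch with hypothetical helper names; it reproduces the two matrices above but is not how genCorMat is implemented.

```r
# Compound symmetry: the same correlation for every pair
cs_mat <- function(nvars, rho) {
  R <- matrix(rho, nvars, nvars)
  diag(R) <- 1
  R
}

# AR(1): correlation rho^|i - j| decays with distance between positions
ar1_mat <- function(nvars, rho) {
  idx <- seq_len(nvars)
  rho^abs(outer(idx, idx, "-"))
}

R_cs  <- cs_mat(4, 0.6)
R_ar1 <- ar1_mat(4, 0.6)

# Eigenvalues of a compound symmetry matrix: 1 + (n - 1) * rho and 1 - rho
eig_cs <- eigen(R_cs, symmetric = TRUE, only.values = TRUE)$values

R_cs
R_ar1
eig_cs
```

The eigenvalues show why a compound symmetry matrix of dimension $n$ is valid only when $\rho > -1/(n-1)$.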

Cluster-specific correlation matrices

The final major enhancement is the capability to generate a list of correlation matrices, each of which corresponds to a specific cluster. These matrices can be of different sizes (to accommodate different cluster sizes) and have different parameters (if not random). The only constraints are that the overall structure of the matrices needs to be the same (i.e. random, cs, or ar1), and it is not possible to use the cors argument (since the number of correlation parameters would be different depending on the cluster size).

In this example, I am generating matrices with a cs structure for four clusters with sizes 2, 3, 4, and 3, respectively, and within-cluster correlations of $\rho_1 = 0.6$, $\rho_2 = 0.7$, $\rho_3 = 0.5$, and $\rho_4 = 0.4$. This reflects an overall block correlation matrix that looks like this:

\scriptsize{ R = \left ( \begin{array}{c|c|c|c} \begin{matrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.7 & 0.7 \\ 0.7 & 1.0 & 0.7 \\ 0.7 & 0.7 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.5 & 0.5 & 0.5 \\ 0.5 & 1.0 & 0.5 & 0.5 \\ 0.5 & 0.5 & 1.0 & 0.5 \\ 0.5 & 0.5 & 0.5 & 1.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} \\ \hline \begin{matrix} 0.0 & 0.0 \\ 0.0 & 0.0 \\ 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{matrix} & \begin{matrix} 1.0 & 0.4 & 0.4 \\ 0.4 & 1.0 & 0.4 \\ 0.4 & 0.4 & 1.0 \end{matrix} \\ \end{array} \right ) }

Each column represents an individual unit (and so does each row). Reading down a column (or across a row) gives the correlations with the other individual units. The clusters are represented by the grids drawn over the matrix. In this case, individuals are correlated only with other individuals in the same cluster.
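To make the block structure concrete, here is a minimal base R sketch that assembles the same 12 x 12 matrix from four compound-symmetry blocks. The helper names cs_mat and block_diag are mine, purely for illustration; they are not simstudy functions.

```r
# Build a compound-symmetry (exchangeable) correlation matrix of size n.
cs_mat <- function(n, rho) {
  m <- matrix(rho, nrow = n, ncol = n)
  diag(m) <- 1
  m
}

# Place a list of square matrices along the diagonal of a larger matrix,
# leaving zeros everywhere else.
block_diag <- function(mats) {
  sizes <- vapply(mats, nrow, integer(1))
  out <- matrix(0, nrow = sum(sizes), ncol = sum(sizes))
  end <- cumsum(sizes)
  start <- end - sizes + 1
  for (i in seq_along(mats)) {
    idx <- start[i]:end[i]
    out[idx, idx] <- mats[[i]]
  }
  out
}

# One block per cluster: sizes 2, 3, 4, 3 with their own correlations
blocks <- Map(cs_mat, n = c(2, 3, 4, 3), rho = c(0.6, 0.7, 0.5, 0.4))
R_full <- block_diag(blocks)
dim(R_full)
```

The zeros off the diagonal blocks are what encode the assumption that individuals in different clusters are uncorrelated.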

To generate this system of matrices, we just need to specify the number of observations per cluster ($nvars$), the correlation coefficients for each cluster ($rho$, which in this case is a vector), and the number of clusters. The $nvars$ argument needs to match the numbers of individuals in each cluster in the data set, and the lengths of $nvars$ and $rho$ must be the same as the number of clusters (though either or both can be scalars, in which case the values are shared across the clusters). The output is a list of correlation matrices, one for each cluster.

RC <- genCorMat(nvars = c(2, 3, 4, 3), rho = c(0.6, 0.7, 0.5, 0.4), 
  corstr = "cs", nclusters = 4)

RC

## $`1`
##      [,1] [,2]
## [1,]  1.0  0.6
## [2,]  0.6  1.0
## 
## $`2`
##      [,1] [,2] [,3]
## [1,]  1.0  0.7  0.7
## [2,]  0.7  1.0  0.7
## [3,]  0.7  0.7  1.0
## 
## $`3`
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.5  0.5  0.5
## [2,]  0.5  1.0  0.5  0.5
## [3,]  0.5  0.5  1.0  0.5
## [4,]  0.5  0.5  0.5  1.0
## 
## $`4`
##      [,1] [,2] [,3]
## [1,]  1.0  0.4  0.4
## [2,]  0.4  1.0  0.4
## [3,]  0.4  0.4  1.0

To create these correlated data, first we can generate a data set of individuals that are clustered in groups. The outcome will be Poisson distributed, so we are specifying mean $\lambda$ for each cluster:

d1 <- defData(varname = "n", formula = "c(2, 3, 4, 3)", dist = "nonrandom")
d1 <- defData(d1, varname = "lambda", formula = "c(6, 7, 9, 8)", dist = "nonrandom")

ds <- genData(4, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")

Now, we can generate some data using the correlation matrix RC:

dd <- addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = RC,
          dist = "poisson", cnames = "y", method = "copula")

dd

##     site n lambda id  y
##  1:    1 2      6  1 11
##  2:    1 2      6  2  7
##  3:    2 3      7  3  4
##  4:    2 3      7  4  3
##  5:    2 3      7  5  5
##  6:    3 4      9  6  8
##  7:    3 4      9  7  7
##  8:    3 4      9  8 10
##  9:    3 4      9  9 11
## 10:    4 3      8 10  2
## 11:    4 3      8 11  6
## 12:    4 3      8 12  4
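For anyone curious about what a copula-based approach is doing conceptually, here is a rough base R sketch of a Gaussian copula for a single cluster. This is my simplified illustration of the general idea, not a copy of the simstudy internals, and the function name rpois_copula is made up for this example. Note that the correlation specified for the latent normal variables will typically differ a little from the correlation of the resulting counts.

```r
set.seed(2023)  # arbitrary seed, only so the sketch is reproducible

# Draw one vector of correlated Poisson counts using a Gaussian copula.
#   R      : correlation matrix for the latent normal variables
#   lambda : Poisson means, one per individual
rpois_copula <- function(R, lambda) {
  z <- as.vector(t(chol(R)) %*% rnorm(nrow(R)))  # correlated standard normals
  u <- pnorm(z)                                   # map to correlated uniforms
  qpois(u, lambda)                                # map to Poisson counts
}

# Compound-symmetry matrix for a cluster of four with rho = 0.5
R_site3 <- matrix(0.5, 4, 4)
diag(R_site3) <- 1

y_site3 <- rpois_copula(R_site3, lambda = rep(9, 4))
y_site3
```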

If we want to confirm that everything is working as expected, we can recover the overall correlation matrix by generating a large number of data sets (in this case 5000):

replicate <- function(R, dc) {
  reps <- lapply(1:5000, function(x)
  addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = R,
    dist = "poisson", cnames = "y", method = "copula")
  )

  drep <- data.table::rbindlist(reps, idcol = "rep")
  drep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1])
  round(cor(dmat), 1) 
}

replicate(R = RC, dc = dc)

##      1   2   3   4   5   6   7   8   9  10  11  12
## 1  1.0 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.6 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 3  0.0 0.0 1.0 0.7 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 4  0.0 0.0 0.7 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 5  0.0 0.0 0.7 0.7 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 6  0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.0 0.5 1.0 0.5 0.5 0.0 0.0 0.0
## 8  0.0 0.0 0.0 0.0 0.0 0.5 0.5 1.0 0.5 0.0 0.0 0.0
## 9  0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 1.0 0.0 0.0 0.0
## 10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.4 0.4
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 1.0 0.4
## 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.4 1.0

It seems to have worked quite well - this empirical matrix matches the hypothetical matrix above. In the next post, I’ll describe how block matrices for different clusters over different time periods can also be flexibly generated for different groups.

Addendum

As a bonus feature, here is a code snippet that generates data for a large number of clusters, where the parameters (cluster sizes, cluster means, and correlation coefficients) themselves are randomly generated; the variance is held fixed at 4. By providing this flexibility, we induce a lot of variability in the data generation process.

d1 <- defData(varname = "n", formula = 20, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "mu", formula = 10, variance = 8, dist = "normal")
d1 <- defData(d1, varname = "s2", formula = 4, dist = "nonrandom")

ds <- genData(100, d1, id = "site")
dd <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")

n <- dd[, .N, keyby = site][, N]
nsites <- length(n)
rho <- rbeta(nsites, 25, 15)

RM <- genCorMat(nvars = n, rho = rho, corstr = "cs", nclusters = nsites)

dd <- addCorGen(dd, idvar = "site", param1 = "mu", param2 = "s2",
                corMatrix = RM, dist = "normal", cnames = "y", method = "copula")

dd

##       site  n        mu s2   id         y
##    1:    1 22  7.095561  4    1  4.102052
##    2:    1 22  7.095561  4    2  6.001640
##    3:    1 22  7.095561  4    3  4.942132
##    4:    1 22  7.095561  4    4  4.062929
##    5:    1 22  7.095561  4    5  6.112493
##   ---                                    
## 1989:  100 23 13.073472  4 1989 14.111518
## 1990:  100 23 13.073472  4 1990 13.773178
## 1991:  100 23 13.073472  4 1991 13.763948
## 1992:  100 23 13.073472  4 1992 13.562904
## 1993:  100 23 13.073472  4 1993 12.891312
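For reference, the $Beta(25, 15)$ distribution used to draw the site-level correlation coefficients has mean and variance

\[ E[\rho] = \frac{25}{25 + 15} = 0.625, \qquad Var(\rho) = \frac{25 \times 15}{(25 + 15)^2 (25 + 15 + 1)} = \frac{375}{65600} \approx 0.0057 \]

so the standard deviation is about 0.076, and most of the within-site correlations will land roughly between 0.47 and 0.78.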

A GAM for time trends in a stepped-wedge trial with a binary outcome

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 17 Jan 2023 00:00:00 +0000

In a previous post, I described some ways one might go about analyzing data from a stepped-wedge, cluster-randomized trial using a generalized additive model (a GAM), focusing on continuous outcomes. I have spent the past few weeks developing a similar model for a binary outcome, and have started to explore model comparison and methods to evaluate goodness-of-fit. The following describes some of my thought process.

Data generation

The data generation process I am using here follows along pretty closely with the earlier post, except, of course, the outcome has changed from continuous to binary. In this example, I’ve increased the correlation for between-period effects because it doesn’t seem like outcomes would change substantially from period to period, particularly if the time periods themselves are relatively short. The correlation still decays over time.

Here are the necessary libraries:

library(simstudy)
library(ggplot2)
library(data.table)
library(mgcv)
library(gratia)
library(patchwork)
library(mgcViz)
library(DHARMa)
library(itsadug)

The data generation, based on 24 sites, 25 time periods, and 100 individuals per site per time period, is formulated this way:

\[ y_{ijk} \sim Bin\left(p_{ijk}\right) \\ \ \\ log\left( \frac{p_{ijk}}{1-p_{ijk}} \right) = -1.5 + a_j + b_{jk} + 0.65 A_{jk} \]

$y_{ijk} \in \{0,1\}$ is the outcome, and $p(y_{ijk} = 1) = p_{ijk}$. The log-odds is a linear function of the site-specific random intercept $a_{j}$, the site-specific period $k$ effect $b_{jk}$, and the treatment status of site $j$ in period $k$, $A_{jk} \in \{ 0, 1\}$, which depends on the stage of the stepped wedge. The treatment effect in this case (an odds ratio) is $exp(0.65) = 1.9$. The site-specific intercepts are distributed $a_j \sim N(0, 0.6)$, where 0.6 is the variance. The vector of site-period effects $\mathbf{b_j} \sim N(0, \Sigma_b)$, where $\Sigma_b = DRD$ is a $25 \times 25$ covariance matrix based on a diagonal matrix $D$ and an auto-regressive correlation structure $R$:

\[ D = \sqrt{0.1} * I_{25 \times 25} \]

and

\[ R =\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{24} \\ \rho & 1 & \rho & \dots & \rho^{23} \\ \rho^2 & \rho & 1 & \dots & \rho^{22} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \rho^{24} & \rho^{23} & \rho^{22} & \dots & 1 \\ \end{bmatrix}, \ \ \rho = 0.95 \]
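The covariance matrix $\Sigma_b = DRD$ can be built directly in base R, which is a handy way to see what the AR-1 structure implies. This is just a sketch of the algebra above; the object names are my own.

```r
n_per <- 25      # number of periods
rho   <- 0.95    # AR-1 correlation between adjacent periods
s2_b  <- 0.1     # variance of each site-period effect

# AR-1 correlation matrix: entry (k, k') is rho^|k - k'|
R <- rho ^ abs(outer(1:n_per, 1:n_per, "-"))

# Diagonal matrix of standard deviations
D <- diag(sqrt(s2_b), n_per)

Sigma_b <- D %*% R %*% D

round(Sigma_b[1:3, 1:3], 4)
```

Each diagonal element is the variance 0.1, and the covariance between two periods shrinks by a factor of 0.95 for every additional period of separation.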

Here is the implementation of this data generation process using simstudy:

def <- defData(varname = "a", formula = 0, variance = 0.6)
def <- defData(def, varname = "mu_b", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "s2_b", formula = 0.1, dist = "nonrandom")
  
defOut <- defDataAdd(varname = "y", 
  formula = "-1.5 + a + b + 0.65 * A", 
  dist = "binary", link="logit"
)

set.seed(1913)

ds <- genData(24, def, id = "site")
ds <- addPeriods(ds, 25, "site", perName = "k")
ds <- addCorGen(
  dtOld = ds, idvar = "site", 
  rho = 0.95, corstr = "ar1",
  dist = "normal", param1 = "mu_b", param2 = "s2_b", cnames = "b"
)

ds <- trtStepWedge(ds, "site", nWaves = 24, 
  lenWaves = 1, startPer = 1, 
  grpName = "A", perName = "k"
)

ds$site <- as.factor(ds$site)
  
dd <- genCluster(ds, "timeID", numIndsVar = 100, level1ID = "id")
dd <- addColumns(defOut, dd)

dd

##        site  k     a mu_b s2_b timeID      b startTrt A    id y
##     1:    1  0 0.142    0  0.1      1 -0.867        1 0     1 0
##     2:    1  0 0.142    0  0.1      1 -0.867        1 0     2 0
##     3:    1  0 0.142    0  0.1      1 -0.867        1 0     3 0
##     4:    1  0 0.142    0  0.1      1 -0.867        1 0     4 0
##     5:    1  0 0.142    0  0.1      1 -0.867        1 0     5 0
##    ---                                                         
## 59996:   24 24 0.879    0  0.1    600 -0.291       24 1 59996 0
## 59997:   24 24 0.879    0  0.1    600 -0.291       24 1 59997 1
## 59998:   24 24 0.879    0  0.1    600 -0.291       24 1 59998 1
## 59999:   24 24 0.879    0  0.1    600 -0.291       24 1 59999 1
## 60000:   24 24 0.879    0  0.1    600 -0.291       24 1 60000 1
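As a rough check on the scale of the outcome, the fixed part of the model implies a probability of about 0.18 in the control condition and about 0.30 under treatment for a site-period where both random effects are zero. A quick sketch (object names are mine):

```r
# Probabilities implied by the fixed part of the model for a site-period
# with a = 0 and b = 0 (the means of the random effects)
p_control   <- plogis(-1.5)          # A = 0
p_treatment <- plogis(-1.5 + 0.65)   # A = 1

odds <- function(p) p / (1 - p)

round(c(control = p_control, treatment = p_treatment), 3)
odds(p_treatment) / odds(p_control)  # equals exp(0.65)
```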

Here is a visualization of the observed proportions of a good outcome ($y = 1$) by site and period:

Model estimation using a GAM

The first model will include a treatment effect and an overall smooth function of time, and then a site-specific smooth “effect”. I am using the function bam in the mgcv package, though I could use the gamm function, the gam function, or even the gamm4 function of the gamm4 package. In this case, all provide quite similar estimates, but bam has the advantage of running faster with this large data set. Here is the model:

\[ \text{log-odds}\left[P(y_{ijk} = 1)\right] = \beta_0 + \beta_1 A_{jk} + s(k) + s_j(k) \]

fit.A <- bam(
  y ~ A + s(k) + s(k, site, bs = "fs"), 
  data = dd, 
  method = "fREML",
  family = "binomial"
)

summary(fit.A)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## y ~ A + s(k) + s(k, site, bs = "fs")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.4379     0.1548   -9.29   <2e-16 ***
## A             0.6828     0.0489   13.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df  Chi.sq p-value    
## s(k)       2.53   3.06    2.11    0.53    
## s(k,site) 79.45 238.00 5865.39  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.125   Deviance explained = 10.6%
## fREML =  85184  Scale est. = 1         n = 60000

The model does well to recover the true values of the parameters used in the data generation process (not always guaranteed for a single data set). The plot on the right shows the “main” smooth effect of time (i.e., across all sites), and the plot on the left shows the site-specific effects over time. The between-site variability is quite apparent.

draw(fit.A) +
  plot_annotation("") &
  theme(panel.grid = element_blank())

Goodness of fit

I am particularly interested in understanding if the model is a good fit based on the observed data. One way to do this is to repeatedly simulate predicted data from the model, and then visually assess whether the observed (i.e., actual) data fall reasonably within the range of the simulated data. To do this, I am using the simulate.gam function from the mgcViz package. I’ve created 95% bands based on the simulated data, and in this case it looks like the observed data fit into the bands quite well.

sim <- simulate.gam(fit.A, nsim = 1000)

ls <- split(sim, rep(1:ncol(sim), each = nrow(sim)))

dq <- lapply(ls, 
  function(x) {
    d <- cbind(dd, sim = x)
    d[, .(obs = mean(y), sim = mean(sim)), keyby = .(site, k)]
  }
)

dl <- rbindlist(dq, idcol = ".id")
df <- dl[, .(obs = mean(obs), min = quantile(sim, p = 0.025), 
             max = quantile(sim, 0.975)), keyby = .(site, k)]

ggplot(data = df, aes(x= k, y = obs)) +
  geom_ribbon(aes(x = k, ymin = min, ymax = max),
              alpha = 0.2, fill = "forestgreen") +
  geom_point(color = "forestgreen", size = 1) +
  
  facet_wrap( ~ site, ncol = 6) +
  theme(panel.grid = element_blank())

An alternative way to assess the goodness of fit is to generate a QQ-type plot that will alert us to any deviations. I am using the DHARMa package, which “uses a simulation-based approach to create readily interpretable quantile residuals for fitted generalized linear mixed models.” This residual is defined as “the value of the empirical density function at the value of the observed data.” The empirical density function comes from the same simulated data I just used to generate the 95% bands. It turns out that these residuals should be uniformly distributed if the model is a good fit. (See here for more details.)
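To give a sense of what a simulation-based quantile residual is, here is a bare-bones sketch for a single observation. This reflects my reading of the general idea rather than the actual DHARMa implementation (which handles many more details), and the function name quantile_resid is hypothetical.

```r
# Simulation-based quantile residual for a single observation:
# the share of simulated values below the observed value, with ties
# broken at random so that discrete outcomes still give a continuous residual.
quantile_resid <- function(obs, sims) {
  below <- mean(sims < obs)
  ties  <- mean(sims == obs)
  below + runif(1) * ties
}

set.seed(11)  # arbitrary seed for the illustration

# 1000 simulated proportions for one hypothetical site-period of 100 people
sims <- rbinom(1000, size = 100, prob = 0.3) / 100
quantile_resid(obs = 0.31, sims = sims)
```

If the observed values really are draws from the same distribution as the simulations, residuals computed this way across many observations should look uniform on [0, 1], which is what the QQ-plot checks.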

The QQ-plot indicates a good fit if all the residuals lie on the diagonal line, as they do here:

simResp <- matrix(dl$sim, nrow = 600)
obsResp <- dq[[1]]$obs

DHARMaRes <- createDHARMa(
  simulatedResponse = simResp, 
  observedResponse = obsResp, 
  integerResponse = FALSE
)

plotQQunif(DHARMaRes, testDispersion = FALSE, testOutliers = FALSE)

A plot of the residuals against the predicted values also indicates a uniform distribution:

plotResiduals(DHARMaRes, quantreg = TRUE)

A model with no site-specific period effects

Now, it is clearly a bad idea to fit a model without site-specific time effects, since I generated the data with exactly those effects. However, I wanted to make sure the goodness-of-fit tests signal that this reduced model is not appropriate:

fit.1curve <- bam(
  y ~ A + s(k, k = 4)  , 
  data = dd, 
  method = "fREML",
  family = "binomial"
)

In the visual representation, it is apparent that the model is not properly capturing the site variability; for a number of sites, the observed data lies far outside the model’s 95% prediction bands:

Both the QQ- and residual plots are consistent with the prediction plot; it is pretty clear this second model is not a good fit:

We can formally compare the AIC from each model using the function compareML from the package itsadug, which provides confirmation that the model with the site-specific curve is an improvement:

compareML(fit.A, fit.1curve)

## fit.A: y ~ A + s(k) + s(k, site, bs = "fs")
## 
## fit.1curve: y ~ A + s(k, k = 4)
## 
## Model fit.1curve preferred: lower fREML score (37.173), and lower df (3.000).
## -----
##        Model Score Edf Difference    Df
## 1      fit.A 85184   7                 
## 2 fit.1curve 85147   4    -37.173 3.000
## 
## AIC difference: -6308.48, model fit.A has lower AIC.

A model with no treatment effect

It is not obvious that including a treatment effect is necessary, since the smoothed curve can likely accommodate the shifts arising due to treatment. After all, treatment is confounded with time. So, I am fitting a third model that excludes a term for the treatment effect:

fit.noA <- bam(
  y ~ s(k) + s(k, site, bs = "fs"), 
  data = dd, 
  method = "fREML",
  family = "binomial"
)

The QQ-plot indicates that this model fits quite well, which is not entirely a surprise. (The 95% band plot looks reasonable as well, but I haven’t included it here.)

However, if we compare the two models using AIC, then the model with the treatment effect does appear superior:

compareML(fit.A, fit.noA)

## fit.A: y ~ A + s(k) + s(k, site, bs = "fs")
## 
## fit.noA: y ~ s(k) + s(k, site, bs = "fs")
## 
## Model fit.noA preferred: lower fREML score (28.266), and lower df (1.000).
## -----
##     Model Score Edf Difference    Df
## 1   fit.A 85184   7                 
## 2 fit.noA 85156   6    -28.266 1.000
## 
## AIC difference: -113.13, model fit.A has lower AIC.

My point here has been to show that we can indeed estimate flexible models with respect to time for data collected from a stepped wedge trial when the outcome is binary. And not only can we fit these models and get point estimates and measures of uncertainty, but we can also evaluate the goodness-of-fit to check the appropriateness of the model using a couple of different approaches.

In the earlier posts, I saw that the standard error estimate for the treatment effect is likely underestimated when the outcome is continuous. I did conduct a simulation experiment to determine if this is the case with a binary outcome, and unfortunately, it is. However, the extent of the bias seems to be small, particularly when the time trend is not too wiggly (i.e. is relatively stable within a reasonable time frame). I do feel comfortable using this approach, and will rely more on confidence intervals than p-values, particularly given the very large sample sizes. I will be particularly careful about drawing conclusions about a treatment effect if the point estimate of the effect size is quite low but still statistically significant based on the estimated standard errors.

Addendum (added 11/02/2023)

In the comments below, there was a question regarding the output from the function compareML, because in the first comparison above a p-value was reported, but in the second comparison, the p-value was not reported. I reached out to Jacolien van Rij, the developer of the itsadug package, and this is her response:

There is no p-value in the second comparison, because there is no trade-off between added complexity (in the sense of model terms) and increased explained variance. We use statistics to determine whether the explained variance is significantly increased while taking into account the increased complexity of the model. This is not the question in the second comparison, because model fit.noA is less complex AND explains more variance – so it’s an absolute win, we do not need to do model comparisons. (Unless the difference in explained variance is very small – but then we would generally prefer the simpler model too.)

Jacolien also had two additional bits of advice:

Important in model comparisons is that you compare models that are minimally different. In the first comparison, this is not the case: model fit.A is different in two aspects, namely it is missing the random effects term and it’s smooth term is constrained to a k of 4. So this is not a comparison I would recommend doing.

Note that you’re doing a Chisquare test on fREML scores, rather than ML scores. REML scores are not valid for model comparison procedures, as the fitted fixed effects of the two models may not be constrained/fitted in the same way. Instead, please use ML scores (add method = "ML" in bam(), which may take more time to run) for model comparison. I’m planning to add a warning in the next version of the package.
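Taking that last piece of advice at face value, a comparison based on ML might be set up along these lines. This is only a sketch on a small made-up data set (the names d, fit_A, and fit_noA are placeholders), with a simple site random intercept standing in for the site-specific smooths to keep it quick, and with AIC() used for the comparison instead of compareML:

```r
library(mgcv)

set.seed(42)  # arbitrary seed

# Small hypothetical data set: 8 sites, 10 periods, 20 binary outcomes per cell
d <- expand.grid(id = 1:20, k = 0:9, site = factor(1:8))
d$A <- as.integer(d$k >= as.integer(d$site))   # staggered roll-out
a <- rnorm(8, sd = 0.5)                         # site random intercepts
d$y <- rbinom(nrow(d), 1, plogis(-1 + a[as.integer(d$site)] + 0.6 * d$A))

# Two models that differ only in the treatment term, both fit with ML
fit_A <- bam(y ~ A + s(k, k = 5) + s(site, bs = "re"),
             data = d, family = binomial, method = "ML")
fit_noA <- bam(y ~ s(k, k = 5) + s(site, bs = "re"),
               data = d, family = binomial, method = "ML")

aic_tab <- AIC(fit_A, fit_noA)
aic_tab
```

The two models here differ in only one respect (the treatment term), which is in the spirit of the first piece of advice as well.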

Modeling the secular trend in a stepped-wedge design

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 13 Dec 2022 00:00:00 +0000

Recently I started a discussion about modeling secular trends using flexible models in the context of cluster randomized trials. I’ve been motivated by a trial I am involved with that is using a stepped-wedge study design. The initial post focused on more standard parallel designs; here, I want to extend the discussion explicitly to the stepped-wedge design.

The stepped-wedge design

Stepped-wedge designs are a special class of cluster randomized trial where each cluster is observed in both treatment arms (as opposed to the classic parallel design where only some of the clusters receive the treatment). In what is essentially a cross-over design, each cluster transitions in a single direction from control (or pre-intervention) to intervention. I’ve written about this in a number of different contexts (for example, with respect to power analysis, complicated ICC patterns, using Bayesian models for estimation, open cohorts, and baseline measurements to improve efficiency).

In the classic stepped-wedge design, groups of sites (considered waves) are randomized to intervention starting times. For example, if there are 24 sites divided into 6 waves (so 4 sites per wave), there will be six starting times and 7 measurement periods (if we want to have at least one baseline/control period for each wave, and at least one intervention period per wave). Schematically, the design looks like this:
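One way to make that layout concrete is to write it as a site-by-period matrix of treatment indicators. The sketch below uses a helper I have called sw_design (a name made up for this illustration); it assumes periods are numbered from 0 and that wave $w$ starts the intervention in period $w$.

```r
# Treatment indicator matrix for a stepped-wedge design.
# Rows are sites, columns are periods (starting at 0); wave w starts
# treatment in period w, so every wave has at least one control period.
sw_design <- function(n_sites, n_waves) {
  stopifnot(n_sites %% n_waves == 0)
  wave <- rep(1:n_waves, each = n_sites / n_waves)
  periods <- 0:n_waves
  A <- outer(wave, periods, function(w, k) as.integer(k >= w))
  dimnames(A) <- list(site = 1:n_sites, period = periods)
  A
}

A_classic <- sw_design(n_sites = 24, n_waves = 6)   # 4 sites per wave
dim(A_classic)                                       # 24 sites by 7 periods
A_classic[c(1, 5, 24), ]                             # one site from waves 1, 2, and 6
```

Setting n_waves equal to n_sites gives the one-site-per-wave version of the design, with 25 periods for 24 sites.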

We could use a linear mixed effects model to estimate the intervention effect $\delta$, which might look like this:

\[ Y_{ijk} = a_{j} + \beta_{k} + \delta A_{jk} + e_{ijk} \]

where $Y_{ijk}$ is the (continuous) outcome of individual $i$ in cluster $j$ during time period $k$. $a_j$ is the random intercept for site $j$, and we assume that $a_j \sim N(0, \sigma^2_a)$. $A_{jk}$ is the intervention indicator for site $j$ during time period $k$. $\beta_k$ is a period-specific effect. And $e_{ijk}$ is the individual level effect, $e_{ijk} \sim N(0, \sigma^2_e)$.

In the particular study motivating these posts, the situation is different in a key way: given its complexity, the intervention can only be implemented at one site at a time, so that the number of waves equals the number of sites. This leads to this slightly more elaborate schematic:

The challenge under this scenario is that $k$ (the number of periods) is starting to get quite large, requiring us to estimate a large number of period specific effects $\beta_k$. In addition, the periods are actually shorter, so we have less information available to estimate those period effects. An alternative approach, as you may have anticipated, is to smooth the secular trend, using a model that looks like this:

\[ Y_{ijk} = a_{j} + s(k) + \delta A_{jk} + e_{ijk} \]

where $s(.)$ is a smooth function of time. And by using a smooth function, we can take this one step further and specify a site-specific smoothing function $s_j(.)$:

\[ Y_{ijk} = a_{j} + s_j(k) + \delta A_{jk} + e_{ijk} \]

So, we will use either cubic splines or generalized additive models (GAMs) to estimate the curve, which will allow us to control for the period effect while estimating the treatment effect. By smoothing the function, we are assuming that the measurements closer in time are more highly correlated than measurements further apart.

Data generation process

Here is the data generation process that we will use to explore the different models:

\[ Y_{ijk} \sim N(\mu_{ijk}, \sigma^2 = 40) \\ \mu_{ijk} = a_{j} + b_{jk} + \delta A_{jk} \\ a_j \sim N(0, \sigma^2_a = 9) \\ \mathbf{b_j} \sim N(0, \Sigma_b) \\ \delta = 5 \]

In this data generation process, the time effect will not be explicitly smooth, but the underlying covariance structure used to generate the period effects will induce some level of smoothness. This is similar to what was described in the previous post. As in that earlier example, $b_{jk}$ is a site-specific time period effect for each time period $k$; the vector of cluster-time effects $\mathbf{b_j} \sim N(0, \Sigma_b)$, where $\Sigma_b = DRD$ is a $25 \times 25$ covariance matrix based on a diagonal matrix $D$ and an auto-regressive correlation structure $R$:

\[ D = 4 * I_{25 \times 25} \]

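and

\[ R =\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{24} \\ \rho & 1 & \rho & \dots & \rho^{23} \\ \rho^2 & \rho & 1 & \dots & \rho^{22} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \rho^{24} & \rho^{23} & \rho^{22} & \dots & 1 \\ \end{bmatrix}, \ \ \rho = 0.8 \]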

Now we are ready to implement this data generating process using simstudy. First the R packages that we will need:

library(simstudy)
library(ggplot2)
library(data.table)
library(mgcv)
library(lme4)
library(splines)

The data definitions for $a_j$, $b_{jk}$, and $Y_{ijk}$ are established first:

def <- defData(varname = "a", formula = 0, variance = 9)
def <- defData(def, varname = "mu_b", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "s2_b", formula = 16, dist = "nonrandom")

defOut <- defDataAdd(varname = "y", formula = "a + b + 5 * A", variance = 40)

We (1) generate 24 sites with random intercepts, (2) create 25 periods for each site, (3) generate the period-specific effects ($b_{jk}$) for each site, and (4) assign the treatment status based on the stepped-wedge design:

set.seed(1234)

ds <- genData(24, def, id = "site")                  #1
ds <- addPeriods(ds, 25, "site", perName = "k")      #2

ds <- addCorGen(dtOld = ds, idvar = "site",  
                rho = 0.8, corstr = "ar1",
                dist = "normal", param1 = "mu_b", 
                param2 = "s2_b", cnames = "b")       #3

ds <- trtStepWedge(ds, "site", nWaves = 24, 
                   lenWaves = 1, startPer = 1, 
                   grpName = "A", perName = "k")     #4

ds$site <- as.factor(ds$site)

ds

##      site  k         a mu_b s2_b timeID          b startTrt A
##   1:    1  0 -3.621197    0   16      1 -3.6889733        1 0
##   2:    1  1 -3.621197    0   16      2 -1.6620662        1 1
##   3:    1  2 -3.621197    0   16      3  1.5816344        1 1
##   4:    1  3 -3.621197    0   16      4  4.0869655        1 1
##   5:    1  4 -3.621197    0   16      5  1.9385573        1 1
##  ---                                                         
## 596:   24 20  1.378768    0   16    596 -1.5470625       24 0
## 597:   24 21  1.378768    0   16    597 -1.7687554       24 0
## 598:   24 22  1.378768    0   16    598  4.1179282       24 0
## 599:   24 23  1.378768    0   16    599  7.1421562       24 0
## 600:   24 24  1.378768    0   16    600  0.3645013       24 1

In the last two steps, we create 30 individuals per site per period and generate each individual-level outcome. The figure shows the outcomes for all the sites over time:

dd <- genCluster(ds, "timeID", numIndsVar = 30, level1ID = "id")
dd <- addColumns(defOut, dd)
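It may help to spell out what these definitions imply for the outcomes. With $\sigma^2_a = 9$, a site-period variance of 16, residual variance $\sigma^2 = 40$, and the AR-1 correlation of 0.8 used in the code, the total variance and the correlation between two different individuals $i$ and $i'$ in the same site $j$ (in periods $k$ and $k'$, conditional on treatment status) are

\[ Var(Y_{ijk}) = 9 + 16 + 40 = 65 \\ \ \\ Corr(Y_{ijk}, Y_{i'jk'}) = \frac{9 + 16 \times 0.8^{|k - k'|}}{65} \]

This works out to $25/65 \approx 0.38$ for two individuals in the same period, $21.8/65 \approx 0.34$ for adjacent periods, and about $0.14$ for the first and last periods.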

Model estimation

I am fitting three models to this simulated data set: (1) a mixed effects model with fixed time period effects, (2) a mixed effects model with a random cubic spline for the period effect for each site, and (3) a generalized additive model with a site-specific smooth function for time. For each estimated model, I’ve overlaid the predicted values on top of the observed (generated) data points.

I’ve also conducted an experiment using 5000+ replicated data sets to see how each model really performs with respect to the estimation of the treatment effect. (Code for these replications can be found here). These replications provide information about some operating characteristics of the different models (estimated bias, root mean squared error (RMSE), average estimated standard error, and coverage rate, i.e. proportion of 95% confidence intervals that include the true value 5).
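For clarity about how those operating characteristics are defined, here is a small sketch of the calculations given a set of replicated estimates and standard errors. The function name op_chars and the toy inputs are made up for illustration; they are not the replication code or results from this post.

```r
# Summarize replicated estimates of a treatment effect.
#   est   : point estimates, one per replicated data set
#   se    : estimated standard errors, one per replicated data set
#   truth : true value of the effect used to generate the data
op_chars <- function(est, se, truth) {
  lower <- est - 1.96 * se
  upper <- est + 1.96 * se
  c(
    bias     = mean(est) - truth,
    rmse     = sqrt(mean((est - truth)^2)),
    avg_se   = mean(se),
    coverage = mean(lower <= truth & truth <= upper)
  )
}

# Toy inputs (made up for illustration, not results from the post)
est <- c(4.2, 5.1, 5.9, 4.8, 6.3)
se  <- c(0.2, 0.2, 0.2, 0.2, 0.2)
op_chars(est, se, truth = 5)
```

When the bias is zero, the RMSE is the standard deviation of the estimates across replications, so an average estimated standard error well below the RMSE is a sign that the model is understating uncertainty.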

Mixed effects model with fixed time-period effects

Here’s the first model. Note that I am not estimating an intercept so that each period effect is directly estimated. (I did try to estimate the 600 site-specific period random effects, but it proved too computationally intensive for my computer, which ground away for a half hour before I mercifully stopped it). The model does include a site-specific random intercept.

fitlme_k <- lmer(y ~ A + factor(k) - 1 + (1|site), data = dd)
summary(fitlme_k)

## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ A + factor(k) - 1 + (1 | site)
##    Data: dd
## 
## REML criterion at convergence: 122095.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1545 -0.6781 -0.0090  0.6765  4.2816 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  site     (Intercept) 10.76    3.280   
##  Residual             51.37    7.167   
## Number of obs: 18000, groups:  site, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## A            5.83701    0.18482  31.582
## factor(k)0  -1.72745    0.72085  -2.396
## factor(k)1  -2.44884    0.72089  -3.397
## factor(k)2  -1.61074    0.72102  -2.234
## factor(k)3  -0.09524    0.72122  -0.132
## factor(k)4  -0.85081    0.72151  -1.179
## factor(k)5  -0.11645    0.72188  -0.161
## factor(k)6  -0.57468    0.72233  -0.796
## factor(k)7  -0.13628    0.72287  -0.189
## factor(k)8   0.01035    0.72348   0.014
## factor(k)9  -0.63440    0.72418  -0.876
## factor(k)10  0.33878    0.72496   0.467
## factor(k)11  0.34778    0.72581   0.479
## factor(k)12  0.21387    0.72675   0.294
## factor(k)13  1.25549    0.72777   1.725
## factor(k)14  0.60881    0.72887   0.835
## factor(k)15 -0.30760    0.73005  -0.421
## factor(k)16 -0.56911    0.73131  -0.778
## factor(k)17 -1.43275    0.73265  -1.956
## factor(k)18 -1.46688    0.73406  -1.998
## factor(k)19 -2.12147    0.73555  -2.884
## factor(k)20 -1.47431    0.73712  -2.000
## factor(k)21 -1.85067    0.73877  -2.505
## factor(k)22 -1.32576    0.74050  -1.790
## factor(k)23 -1.12176    0.74229  -1.511
## factor(k)24 -1.31230    0.74417  -1.763

The predicted values indicate that the model does not pick up the site-specific variation over time:

Although the estimate of the treatment effect from the single data set is 5.8 [95% CI: 5.5, 6.2], the treatment effect estimate from this model is actually unbiased (based on evaluating the results from the replications). However, RMSE = 0.97 (which is equivalent to the true standard error of the estimated treatment effect since there is no bias), but the average estimated standard error was only 0.18, and the coverage of the 95% CIs was only 29%. Indeed the estimated confidence interval from our single data set did not include the true value. Based on all of this, the model doesn’t seem all that promising, particularly with respect to measuring the uncertainty.

Mixed effects model with site-specific natural cubic spline

With the second model, also a mixed effects model, I’ve included a random cubic spline (based on four knots) instead of the random intercept:

dd[, normk := (k - min(k))/(max(k) - min(k))]
knots <- c(.2, .4, .6, .8)

fitlme_s <- lmer(y ~ A + ( ns(normk, knots = knots) - 1 | site ), data = dd)
summary(fitlme_s)

## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ A + (ns(normk, knots = knots) - 1 | site)
##    Data: dd
## 
## REML criterion at convergence: 120120.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.0707 -0.6768  0.0080  0.6671  4.1554 
## 
## Random effects:
##  Groups   Name                      Variance Std.Dev. Corr                   
##  site     ns(normk, knots = knots)1 30.97    5.565                           
##           ns(normk, knots = knots)2 59.08    7.686     0.33                  
##           ns(normk, knots = knots)3 28.10    5.301     0.12 -0.10            
##           ns(normk, knots = knots)4 75.10    8.666     0.16  0.55 -0.31      
##           ns(normk, knots = knots)5 28.28    5.318     0.49  0.14  0.58 -0.28
##  Residual                           45.25    6.727                           
## Number of obs: 18000, groups:  site, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -2.0677     0.1830  -11.30
## A             5.0023     0.3052   16.39
## 
## Correlation of Fixed Effects:
##   (Intr)
## A -0.088

This time we can see that the model predictions better reflect the site-specific time trends:

This model also provides an unbiased estimate (in the case of the first data set, the estimate was spot on: 5.0 [4.4, 5.6]). The RMSE (0.78) was lower than that of the first model (0.97), and the average estimate of the standard error was slightly higher (0.31 vs. 0.18). The coverage was also higher, but still only 56%. There is still room for improvement.

GAM with site-specific smoother

This last model is a GAM (using the gam function from the mgcv package). A key parameter in the smoothing function s is the bs argument for the type of basis spline. I’ve used the option “fs” that allows for “random factor smooth interactions,” which is what we need here. In addition, the dimension of the basis (the argument k, not to be confused with the period k), was set by evaluating the selection criterion (GCV) as well as investigating the RMSE and the average estimated standard errors. A value of k between 10 and 15 seems to be ideal; I’ve settled on $k = 12$.

gamfit <- gam(y ~ A + s(k, site, bs = "fs", k = 12), data = dd)
summary(gamfit)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ A + s(k, site, bs = "fs", k = 12)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4595     0.7429  -0.619    0.536    
## A             5.2864     0.4525  11.682   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df    F p-value    
## s(k,site) 261.7    287 32.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.403   Deviance explained = 41.2%
## GCV = 41.693  Scale est. = 41.082    n = 18000
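
The comparison of basis dimensions described above can be sketched on a small scale. This is only an illustration under stated assumptions: the data frame toy and its data generating values are hypothetical (not the data set dd from this post), the candidate values of k are kept small so the fits run quickly, and the GCV score is read from the gcv.ubre component of the fitted gam object.

```r
library(mgcv)

set.seed(1)

# Hypothetical toy data: 6 sites, 10 periods, 5 individuals per site-period
toy <- expand.grid(id = 1:5, k = 0:9, site = factor(1:6))
site_shift <- rnorm(6, sd = 2)
toy$y <- 10 + site_shift[toy$site] + sin(toy$k / 2) + rnorm(nrow(toy))

# Fit the site-specific smoother for several basis dimensions and
# collect the GCV score reported by each fit
cand <- c(4, 6, 8)
gcv <- sapply(cand, function(kk) {
  fit <- gam(y ~ s(k, site, bs = "fs", k = kk), data = toy)
  fit$gcv.ubre
})

data.frame(k = cand, GCV = gcv)
```

A lower GCV score favors that basis dimension, though (as noted above) the RMSE and the average estimated standard errors from replicated data sets were also part of the decision.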

The predicted value plot highlights that this model has estimated site-specific secular functions that are a little more wriggly than the cubic splines.

In spite of the less smooth curve, the GAM estimate is unbiased as well, with a slightly lower RMSE (0.76) than the cubic spline model (0.78). Better yet, the estimated standard errors averaged 0.45, and the coverage is 76% (compared to 56% from the cubic spline model).

In general, at least in this simulation setting, the GAM seems to be an improvement over the random cubic spline model. However, this last model still underestimates the measure of uncertainty, suggesting there is more work to be done. Next, I will explore estimation of robust standard errors using bootstrap methods.

To be continued …

Generating clustered data with marginal correlations

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 22 Nov 2022 00:00:00 +0000

A student is working on a project to derive an analytic solution to the problem of sample size determination in the context of cluster randomized trials and repeated individual-level measurement (something I’ve thought a little bit about before). Though the goal is an analytic solution, we do want confirmation with simulation. So, I was a little disheartened to discover that the routines I’d developed in simstudy for this were not quite up to the task. I’ve had to quickly fix that, and the updates are available in the development version of simstudy, which can be downloaded using devtools::install_github("kgoldfeld/simstudy"). While some of the changes are under the hood, I have added a new function, genBlockMat, which I’ll describe here.

Correlation in cluster randomized trials

The fundamental issue with cluster randomized trials is that outcomes for a group of patients in a specific cluster are possibly correlated; the degree to which this is true impacts how much we “learn” from each individual. The more highly correlated individuals are, the less information we actually have. (In the extreme case, if there is perfect correlation, we really only have a sample of one from each group.)
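
One way to put numbers on this loss of information is the standard design effect $1 + (m - 1)\rho$. The sketch below is my own illustration rather than anything from simstudy: it assumes $J$ clusters of equal size $m$ and a single exchangeable within-cluster correlation $\rho$, and the function name n_eff is hypothetical.

```r
# Effective sample size under the design effect 1 + (m - 1) * rho,
# assuming J clusters of equal size m and a common correlation rho
n_eff <- function(J, m, rho) {
  (J * m) / (1 + (m - 1) * rho)
}

n_eff(J = 10, m = 30, rho = 0)     # independent outcomes: every individual counts
n_eff(J = 10, m = 30, rho = 0.05)  # modest correlation already costs information
n_eff(J = 10, m = 30, rho = 1)     # perfect correlation: one observation per cluster
```

With $\rho = 1$ the effective sample size collapses to the number of clusters, which is the extreme case mentioned above.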

When generating data and modeling associations, the structure of the correlation needs to reflect the context of the study design. The specific structure can depend on whether outcomes generally vary over time (so that patient outcomes within a cluster closer temporally might be more highly correlated than outcomes collected from patients far apart in time) and whether measurements are collected for the same individuals over time (you might expect the measurements of the same individual to be more highly correlated than measurements of two different individuals).

There are at least two ways to go about simulating correlated data from a cluster randomized trial. The first is to use a random effect to induce correlation. For example, a simple data generating process for a binary outcome with a treatment indicator and one covariate would start with a formulation like this:

\[ P(Y_{ij} = 1) = \pi_{ij}, \ \ \ Y_{ij} \in \{0,1\}\] \[ log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2X_i + b_j \]

where $Y_{ij}$ is the outcome for individual $i$ in cluster $j$. ($A$ is a treatment indicator and $X$ is a covariate.) The key here is $b_j$, which is a cluster-level effect that is typically assumed to have a normal distribution $N(0, \sigma_b^2)$. In a simulation, we would use specific parameter values to generate a probability $\pi_{ij}$ for each individual; the $\pi_{ij}$’s within a cluster would be correlated by the presence of the cluster effect $b_j$. It would follow that the $Y_{ij}$’s would also be correlated within cluster $j$. We can call this the conditional data generation process, and we could use a mixed-effects regression model to recover the parameters. But we won’t do this here.
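
For reference, the conditional process just described can be sketched in a few lines of base R. All parameter values here are hypothetical choices for illustration, the object names are my own, and treatment simply alternates across clusters rather than being randomized.

```r
set.seed(2022)

J <- 20   # number of clusters
m <- 50   # individuals per cluster

# Hypothetical parameter values, chosen only for illustration
beta0 <- -1; beta1 <- 0.8; beta2 <- 0.5; sigma_b <- 0.6

cluster <- rep(seq_len(J), each = m)
A <- rep(rep(0:1, length.out = J), each = m)  # arms alternate across clusters
b <- rnorm(J, mean = 0, sd = sigma_b)         # cluster-level effects b_j
X <- rnorm(J * m)                             # individual-level covariate

# The shared b_j makes the probabilities (and outcomes) within a cluster correlated
pi_ij <- plogis(beta0 + beta1 * A + beta2 * X + b[cluster])
Y <- rbinom(J * m, size = 1, prob = pi_ij)

dcond <- data.frame(cluster, A, X, p = pi_ij, Y)
head(dcond)
```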

Instead, we can dispose of $b_j$, like this:

\[ log \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_0 + \beta_1 A_j + \beta_2X_i \]

As before, we would generate the $\pi_{ij}$’s, but the probabilities are going to be uncorrelated now (except of course the correlation due to randomization assignment, but this would be across clusters). The within-cluster correlation is directly introduced into the $Y_{ij}$’s by using a multivariate data generation process. If we were in the realm of normally distributed outcomes, we would use a multivariate normal data generating process $MVN(\mathbf{\mu}, \Sigma)$, where $\Sigma$ is a covariance matrix. (This could be done in simstudy using genCorData or addCorData.) In this case, with a binary outcome, we need an analogous approach, which is implemented in the simstudy functions genCorGen and addCorGen. To recover the parameters used to generate these data, a generalized estimating equations (GEE) model would be used; and rather than being conditional, the parameter estimates from this model will be marginal, just as the data generation process was.

Generating data - multiple time periods, single individual measurement

OK - that is a bit more background than I intended (though probably not enough). Now onto the new function and simulations.

In the first example here, the outcomes are measured at three different periods, but an individual in a cluster is measured only once. In other words, the time periods include different sets of individuals.

If we have 3 time periods and 3 individuals in each time period, the within-cluster correlation between two individuals in the same time period is $\alpha_1$, the correlation between individuals in adjacent time periods (periods 1&2 and periods 2&3) is $\alpha_2$, and the correlation between individuals in time periods 1 and 3 would be $\alpha_3$. The correlation structure for the cluster would be represented like this with each period represented in $3 \times 3$ sub-blocks:

\[ \mathbf{R} = \left( \begin{matrix} 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_3 & \alpha_3 & \alpha_3 \\ \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha_2 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & 1 & \alpha_1 \\ \alpha_3 & \alpha_3 & \alpha_3 & \alpha_2 & \alpha_2 & \alpha_2 & \alpha_1 & \alpha_1 & 1 \end{matrix} \right ) \]

The overall correlation matrix for the full data set (assuming 5 clusters) is represented by block matrix $\textbf{B}$ with

\[ \mathbf{B} = \left( \begin{matrix} \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R} \\ \end{matrix} \right ) \]

where $\mathbf{0}$ is a $9 \times 9$ matrix of $0$’s.
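
The two matrices above can be written down directly with Kronecker products in base R. This is a sketch for intuition only, with illustrative values for the three correlations and object names of my own choosing; it does not use simstudy.

```r
# Illustrative values for the three correlations
a1 <- 0.3; a2 <- 0.2; a3 <- 0.1

# Period-by-period correlations: a1 within a period, a2 for adjacent
# periods, a3 for periods 1 and 3
P <- matrix(c(a1, a2, a3,
              a2, a1, a2,
              a3, a2, a1), nrow = 3, byrow = TRUE)

# Expand each period-level entry to a 3 x 3 sub-block (3 individuals per period)
R_block <- kronecker(P, matrix(1, 3, 3))
diag(R_block) <- 1

# Five independent clusters: R repeated along the diagonal, zeros elsewhere
B <- kronecker(diag(5), R_block)
dim(B)
```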

The new function genBlockMat enables us to generate the $R$ blocks (though currently it requires that the number of individuals per period per cluster is constant - I will relax that requirement in the future). Here are a couple of examples. In the first we are fixing $\alpha_1 = 0.3$, $\alpha_2 = 0.2$, and $\alpha_3 = 0.1$:

library(simstudy)
library(data.table)

R <- genBlockMat(rho =c(0.3, 0.2, 0.1), nInds = 3 , nPeriods = 3)
R

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]  1.0  0.3  0.3  0.2  0.2  0.2  0.1  0.1  0.1
##  [2,]  0.3  1.0  0.3  0.2  0.2  0.2  0.1  0.1  0.1
##  [3,]  0.3  0.3  1.0  0.2  0.2  0.2  0.1  0.1  0.1
##  [4,]  0.2  0.2  0.2  1.0  0.3  0.3  0.2  0.2  0.2
##  [5,]  0.2  0.2  0.2  0.3  1.0  0.3  0.2  0.2  0.2
##  [6,]  0.2  0.2  0.2  0.3  0.3  1.0  0.2  0.2  0.2
##  [7,]  0.1  0.1  0.1  0.2  0.2  0.2  1.0  0.3  0.3
##  [8,]  0.1  0.1  0.1  0.2  0.2  0.2  0.3  1.0  0.3
##  [9,]  0.1  0.1  0.1  0.2  0.2  0.2  0.3  0.3  1.0

In the second example, we specify the correlation using an auto-regressive structure with $\alpha = 0.3$, so that $\alpha_1 = \alpha =0.3$, $\alpha_2 = \alpha^2 = 0.09$, and $\alpha_3 = \alpha^3 = 0.027$:

genBlockMat(rho = 0.3, corstr = "ar1", nInds = 3 , nPeriods = 3)

##        [,1]  [,2]  [,3] [,4] [,5] [,6]  [,7]  [,8]  [,9]
##  [1,] 1.000 0.300 0.300 0.09 0.09 0.09 0.027 0.027 0.027
##  [2,] 0.300 1.000 0.300 0.09 0.09 0.09 0.027 0.027 0.027
##  [3,] 0.300 0.300 1.000 0.09 0.09 0.09 0.027 0.027 0.027
##  [4,] 0.090 0.090 0.090 1.00 0.30 0.30 0.090 0.090 0.090
##  [5,] 0.090 0.090 0.090 0.30 1.00 0.30 0.090 0.090 0.090
##  [6,] 0.090 0.090 0.090 0.30 0.30 1.00 0.090 0.090 0.090
##  [7,] 0.027 0.027 0.027 0.09 0.09 0.09 1.000 0.300 0.300
##  [8,] 0.027 0.027 0.027 0.09 0.09 0.09 0.300 1.000 0.300
##  [9,] 0.027 0.027 0.027 0.09 0.09 0.09 0.300 0.300 1.000
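
To make the pattern in this output explicit, the same matrix can be reproduced by hand. The sketch below only mirrors the printed result (it is not a description of how genBlockMat is implemented): periods that are $d$ steps apart get correlation $\alpha^{d+1}$, so that $d = 0$ gives the within-period value $\alpha$.

```r
alpha <- 0.3
n_inds <- 3
n_periods <- 3

# Number of periods separating each pair of periods (0, 1, or 2)
lag <- abs(outer(seq_len(n_periods), seq_len(n_periods), "-"))

# Correlation between individuals in periods s and t: alpha^(|s - t| + 1)
P_ar1 <- alpha^(lag + 1)

# Expand to individuals and put 1's on the diagonal
R_ar1 <- kronecker(P_ar1, matrix(1, n_inds, n_inds))
diag(R_ar1) <- 1
round(R_ar1, 3)
```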

Finally, we can specify using an exchangeable or compound symmetry structure with $\alpha = 0.3$, so that $\alpha_1 = \alpha_2 = \alpha_3 = \alpha = 0.3$ (i.e., there is constant between-individual correlation within each cluster over time):

genBlockMat(rho =0.3, corstr = "cs", nInds = 3 , nPeriods = 3)

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]  1.0  0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3
##  [2,]  0.3  1.0  0.3  0.3  0.3  0.3  0.3  0.3  0.3
##  [3,]  0.3  0.3  1.0  0.3  0.3  0.3  0.3  0.3  0.3
##  [4,]  0.3  0.3  0.3  1.0  0.3  0.3  0.3  0.3  0.3
##  [5,]  0.3  0.3  0.3  0.3  1.0  0.3  0.3  0.3  0.3
##  [6,]  0.3  0.3  0.3  0.3  0.3  1.0  0.3  0.3  0.3
##  [7,]  0.3  0.3  0.3  0.3  0.3  0.3  1.0  0.3  0.3
##  [8,]  0.3  0.3  0.3  0.3  0.3  0.3  0.3  1.0  0.3
##  [9,]  0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3  1.0

Using the first target correlation matrix ($\alpha_1 = 0.3$, $\alpha_2 = 0.2$, and $\alpha_3 = 0.1$), we can go ahead and generate some data. The probability of outcome $Y$ will be a function of cluster-level treatment $A$, $A \in \{0,1\}$ and individual-level covariate $z$, a continuous measure centered closely around 0.

Here are the data definitions:

b0 <- -1.0; delta <- 1; number_inds = 3;

defC <- defData(varname = "A", formula = "1;1", dist = "trtAssign")

defI <- defDataAdd(varname = "z", formula = 0, variance = 0.10)
defI <- defDataAdd(defI, varname = "p",
                   formula = "..b0 + ..delta * A + .5*z",
                   dist = "nonrandom", link = "logit")

The key to generating the data using the specified correlation matrix $R$ is that the data need to be set up in the correct order; it will need to be sorted by cluster and period (and in this case the order of individuals does not matter). First, we generate the cluster level data, and period data for the clusters, and then the 3 individuals within each cluster and time period.

set.seed(1234)

dc <- genData(n = 10, dtDefs = defC, id = "site")
dc <- addPeriods(dtName = dc, nPeriods = 3, 
        idvars = "site", perName = "period")
dd <- genCluster(dtClust = dc, cLevelVar = "timeID", 
        numIndsVar = number_inds, level1ID = "idnum")
dd <- addColumns(defI, dd)

setkey(dd, "site", "period", "idnum")

The correlated outcomes are generated using addCorGen, where the correlation matrix $R$ is applied. And, as I said before (I cannot reiterate this point enough), it is critical that the data are sorted in the correct order - by site and period:

dres <- addCorGen(dd, idvar = "site", corMatrix = R,
      dist = "binary", param1 = "p", cnames = "y", method = "ep")

head(dres, n = 9)

##    site period A timeID idnum          z         p y
## 1:    1      0 1      1     1  0.3109796 0.5387943 0
## 2:    1      0 1      1     2 -0.1968381 0.4754151 1
## 3:    1      0 1      1     3 -0.2313320 0.4711157 0
## 4:    1      1 1      2     4 -0.1633853 0.4795882 1
## 5:    1      1 1      2     5 -0.5536305 0.4312347 1
## 6:    1      1 1      2     6  0.2783134 0.5347331 1
## 7:    1      2 1      3     7  0.4332353 0.5539436 1
## 8:    1      2 1      3     8 -0.5335796 0.4336954 0
## 9:    1      2 1      3     9 -0.1984128 0.4752187 1

To convince myself that the data are indeed being generated as intended, I am generating a large number of data sets using addCorGen (with fixed individual level probabilities). The empirical within-cluster correlation can be estimated as the observed correlation of the repeated data sets (within each cluster). I am outputting the nine $\pi$’s for the individuals in one of the clusters next to the mean observed outcome for that individual. And, more importantly, I am showing the within-cluster correlation calculated from the 2,500 observations. In both cases, the observed values are very much consistent with the true values used to generate the data. Though I am not showing you, this is the case for all 10 clusters.

reps <- lapply(1:2500, 
  function(x)  addCorGen(dd, idvar = "site", corMatrix = R,
      dist = "binary", param1 = "p", cnames = "y", method = "ep"))

empir_corr <- function(cluster) {
  drep <- data.table::rbindlist(reps, idcol = "rep")
  drep <- drep[site == cluster, ]
  drep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1])
  
  mu <- cbind(true = round(dd[site == cluster, p], 2), 
              observed = round(apply(dmat, 2, mean), 2))
  
  R_hat <- round(cor(dmat), 1) 
  
  return(list(mu = mu, R_hat = R_hat))
}

empir_corr(cluster = 7)

## $mu
##   true observed
## 1 0.32     0.33
## 2 0.25     0.25
## 3 0.27     0.28
## 4 0.25     0.26
## 5 0.28     0.27
## 6 0.26     0.26
## 7 0.27     0.26
## 8 0.28     0.28
## 9 0.31     0.31
## 
## $R_hat
##     1   2   3   4   5   6   7   8   9
## 1 1.0 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1
## 2 0.3 1.0 0.3 0.2 0.2 0.2 0.1 0.1 0.1
## 3 0.3 0.3 1.0 0.2 0.2 0.2 0.1 0.1 0.1
## 4 0.2 0.2 0.2 1.0 0.3 0.3 0.2 0.2 0.2
## 5 0.2 0.2 0.2 0.3 1.0 0.3 0.2 0.2 0.2
## 6 0.2 0.2 0.2 0.3 0.3 1.0 0.2 0.2 0.2
## 7 0.1 0.1 0.1 0.2 0.2 0.2 1.0 0.3 0.3
## 8 0.1 0.1 0.1 0.2 0.2 0.2 0.3 1.0 0.3
## 9 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 1.0

Multiple time periods, repeated individual measurements

In this next example, I will simulate a case where there are two measurement periods but the same set of individuals is in each period (say a baseline and follow-up). In this case, the within-individual (between-period) correlation ($\alpha^*$) can be different from the between-individual, within-period correlation ($\alpha_1$) and the between-individual, between-period correlation ($\alpha_2$):

\[ \mathbf{R} = \left( \begin{matrix} 1 & \alpha_1 & \alpha_1 & \alpha^* & \alpha_2 & \alpha_2 \\ \alpha_1 & 1 & \alpha_1 & \alpha_2 & \alpha^* & \alpha_2 \\ \alpha_1 & \alpha_1 & 1 & \alpha_2 & \alpha_2 & \alpha^* \\ \alpha^* & \alpha_2 & \alpha_2 & 1 & \alpha_1 & \alpha_1 \\ \alpha_2 & \alpha^* & \alpha_2 & \alpha_1 & 1 & \alpha_1 \\ \alpha_2 & \alpha_2 & \alpha^* & \alpha_1 & \alpha_1 & 1 \\ \end{matrix} \right ) \]

genBlockMat implements this with an additional argument iRho for the within-individual correlations. Here we are using iRho to set $\alpha^* = 0.5$:

R <- genBlockMat(rho =c(0.3, .1), nInds = number_inds, nPeriods = 2, iRho = 0.5)
R

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  0.3  0.3  0.5  0.1  0.1
## [2,]  0.3  1.0  0.3  0.1  0.5  0.1
## [3,]  0.3  0.3  1.0  0.1  0.1  0.5
## [4,]  0.5  0.1  0.1  1.0  0.3  0.3
## [5,]  0.1  0.5  0.1  0.3  1.0  0.3
## [6,]  0.1  0.1  0.5  0.3  0.3  1.0
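
The structure of this matrix can also be reproduced by hand, which helps clarify where each of the three correlations lands. This is a sketch that mirrors the printed output under the ordering used here (all individuals in period 1, then the same individuals in period 2); the object names are my own and it makes no claim about the internals of genBlockMat.

```r
n_inds <- 3
n_periods <- 2
a1 <- 0.3      # between individuals, same period
a2 <- 0.1      # between individuals, different periods
a_star <- 0.5  # same individual, different periods

# Start with the between-individual correlations, period by period
P <- matrix(c(a1, a2,
              a2, a1), nrow = 2, byrow = TRUE)
R_manual <- kronecker(P, matrix(1, n_inds, n_inds))

# Rows/columns are ordered by period, then individual, so the individual
# index repeats 1, 2, 3, 1, 2, 3
ind <- rep(seq_len(n_inds), times = n_periods)
same_ind <- outer(ind, ind, "==")

R_manual[same_ind] <- a_star  # same individual measured in different periods
diag(R_manual) <- 1
R_manual
```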

The data generation process here is slightly modified to accommodate this structure of individuals repeated across time periods. The clusters are generated first, as before, but then the individuals are created within the clusters. Once the individuals are created, then the time periods are introduced before generating the outcome (which in this case depends on the time period). Before generating the correlated outcomes using addCorGen, it is now important that the data be sorted by cluster, period, and individual, as this is in line with the correlation structure.

b0 <- -1.0; b1 <- .5; delta <- -0.3; number_inds = 3;

defC <- defData(varname = "A", formula = "1;1", dist = "trtAssign")

defI <- defDataAdd(varname = "z", formula = 0, variance = 0.10)
defI <- defDataAdd(defI, varname = "p",
                   formula = "..b0 + ..b1* measure + ..delta * A * measure + z",
                   dist = "nonrandom", link = "logit")

set.seed(1234)

dc <- genData(n = 10, dtDefs = defC, id = "site")
dc <- genCluster(dtClust = dc, cLevelVar = "site", 
        numIndsVar = number_inds, level1ID = "idnum")
dd <- addPeriods(dtName = dc, nPeriods = 2, idvars = "idnum", perName = "measure")
dd <- addColumns(defI, dd)

setkey(dd, "site", "measure", "idnum")

dres <- addCorGen(dd, idvar = "site", corMatrix = R,
      dist = "binary", param1 = "p", cnames = "y", method = "ep")

head(dres, n = 6)

##    idnum measure site A timeID          z         p y
## 1:     1       0    1 1      1  0.3109796 0.3342510 0
## 2:     2       0    1 1      3 -0.2313320 0.2259484 0
## 3:     3       0    1 1      5 -0.5536305 0.1745625 1
## 4:     1       1    1 1      2 -0.1968381 0.2695635 0
## 5:     2       1    1 1      4 -0.1633853 0.2762009 0
## 6:     3       1    1 1      6  0.2783134 0.3724579 0

Again, we can confirm that the data generation process is working as anticipated by looking at the empirical means and correlation matrix based on 2,500 sets of outcomes:

reps <- lapply(1:2500, function(x) 
  addCorGen(dd, idvar = "site", corMatrix = R,
      dist = "binary", param1 = "p", cnames = "y", method = "ep"))

empir_corr(cluster = 2)

## $mu
##   true observed
## 1 0.36     0.36
## 2 0.23     0.23
## 3 0.31     0.30
## 4 0.21     0.21
## 5 0.31     0.31
## 6 0.27     0.27
## 
## $R_hat
##     1   2   3   4   5   6
## 1 1.0 0.3 0.3 0.5 0.1 0.1
## 2 0.3 1.0 0.3 0.1 0.5 0.1
## 3 0.3 0.3 1.0 0.1 0.1 0.5
## 4 0.5 0.1 0.1 1.0 0.3 0.3
## 5 0.1 0.5 0.1 0.3 1.0 0.3
## 6 0.1 0.1 0.5 0.3 0.3 1.0

Varying cluster sizes for single period designs

I’ve alluded to the fact that genBlockMat and addCorGen cannot accommodate varying cluster sizes (yet), but if we need to generate clustered data with correlated outcomes and varying cluster sizes in a single period using addCorGen, that is now possible. Here is a simple example to demonstrate how this works.

I am generating three clusters with sizes of 3, 4, and 2 individuals, respectively. The outcome is binary, and the probability of success varies slightly by cluster. First, we generate the clusters:

d1 <- defData(varname = "n", formula = "c(3, 4, 2)", dist = "nonrandom")
d1 <- defData(d1, varname = "p", formula = 0.4, variance = 40, dist = "beta")

set.seed(1234)

ds <- genData(3, d1, id = "site")
ds

##    site n         p
## 1:    1 3 0.2942948
## 2:    2 4 0.4245886
## 3:    3 2 0.5027532

And then the individual level data with the correlated outcomes:

dd <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")

addCorGen(dd, idvar = "site", rho =0.4, corstr = "cs", param1 = "p", 
  dist = "binary", cnames = "y", method = "ep")

##    site n         p id y
## 1:    1 3 0.2942948  1 0
## 2:    1 3 0.2942948  2 0
## 3:    1 3 0.2942948  3 0
## 4:    2 4 0.4245886  4 1
## 5:    2 4 0.4245886  5 0
## 6:    2 4 0.4245886  6 1
## 7:    2 4 0.4245886  7 1
## 8:    3 2 0.5027532  8 1
## 9:    3 2 0.5027532  9 0

Once again, to confirm that the correlation structure is what we expect, I’ve generated 2,500 sets of outcomes. This time, I am showing the full observed correlation matrix where it is clear that the between cluster outcomes are independent of each other:

reps <- lapply(1:2500, function(x) 
  addCorGen(dd, idvar = "site", rho =0.4, corstr = "cs", param1 = "p", 
            dist = "binary", cnames = "y", method="ep")
)

drep <- data.table::rbindlist(reps, idcol = "rep")
dmat <- as.matrix(dcast(drep, rep ~ id, value.var = "y")[, -1])

mu <- cbind(true = round(dd[, p], 2), observed = round(apply(dmat, 2, mean), 2))
R_hat <- round(cor(dmat), 1) 

mu

##   true observed
## 1 0.29     0.29
## 2 0.29     0.30
## 3 0.29     0.29
## 4 0.42     0.43
## 5 0.42     0.41
## 6 0.42     0.43
## 7 0.42     0.42
## 8 0.50     0.50
## 9 0.50     0.50

R_hat

##     1   2   3   4   5   6   7   8   9
## 1 1.0 0.4 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 2 0.4 1.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0
## 3 0.4 0.4 1.0 0.0 0.0 0.0 0.0 0.0 0.0
## 4 0.0 0.0 0.0 1.0 0.4 0.4 0.4 0.0 0.0
## 5 0.0 0.0 0.0 0.4 1.0 0.4 0.4 0.0 0.0
## 6 0.0 0.0 0.0 0.4 0.4 1.0 0.4 0.0 0.0
## 7 0.0 0.0 0.0 0.4 0.4 0.4 1.0 0.0 0.0
## 8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.4
## 9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 1.0
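
For comparison, the target correlation matrix that R_hat is approximating can be constructed directly in base R. This is a sketch using the cluster sizes and $\rho = 0.4$ from the example; the object names site_id and R_target are my own.

```r
sizes <- c(3, 4, 2)   # cluster sizes from the example
rho <- 0.4

# Cluster membership for each of the 9 individuals
site_id <- rep(seq_along(sizes), times = sizes)

# Compound symmetry within a cluster, independence across clusters
R_target <- rho * outer(site_id, site_id, "==")
diag(R_target) <- 1
R_target
```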

Modeling the secular trend in a cluster randomized trial using very flexible models

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 01 Nov 2022 00:00:00 +0000

A key challenge - maybe the key challenge - of a stepped wedge clinical trial design is the threat of confounding by time. This is a cross-over design where the unit of randomization is a group or cluster, where each cluster begins in the control state and transitions to the intervention. It is the transition point that is randomized. Since outcomes could be changing over time regardless of the intervention, it is important to model the time trends when conducting the efficacy analysis. The question is how we choose to model time, and I am going to suggest that we might want to use a very flexible model, such as a cubic spline or a generalized additive model (GAM).

I am not going to talk more about stepped wedge designs here (if you want more background this paper would be a fine place to start), but will briefly describe a flexible way to model time trends. And I am going to simplify a bit to assume that we are talking about a cluster randomized trial (CRT), where clusters are randomized to treatment or control only. Confounding by time is not really an issue here, since treatment and control are implemented in parallel across different clusters, but we still might want to model time to get more efficient estimates of the treatment effect. I will consider the flexible modeling approaches for stepped wedge designs in a future post.

Simulating the data (the data generation process)

As I typically do, I will frame this discussion around a simulated data set, which I will describe in some detail. Before we start, here are the libraries I use to generate and present the data:

library(simstudy)
library(ggplot2)
library(cowplot)
library(data.table)
library(mgcv)
library(lme4)
library(splines)

The simulated data will include 48 clusters over 20 time periods. 24 will be randomized to the control arm, 24 to the intervention. For each cluster and period, there are 30 individuals. The figure shows the cluster averages at each time point $k$ for one randomly generated data set:

The data generation process that underlies this plot is:

\[ Y_{ijk} \sim N(\mu =100 + b^0_{j} + b^1_{jk} - 0.1k^2 + 5A_j, \sigma^2 = 9) \]

$Y_{ijk}$ is the outcome measurement for individual $i$ in cluster $j$ at period $k$. In this case, $k \in \{0, \dots, 19\}$. There is an increasing decline in $Y$ over time (based on the quadratic term $k^2$). $A_j$ is a treatment indicator for cluster $j$, and $A \in \{0 ,1\}$, and the treatment effect is $5$.

$b^0_{j}$ is a cluster-level random intercept for cluster $j$, $b^0_{j} \sim N(\mu = 0, \sigma^2 = 6)$. $b^1_{jk}$ is a cluster-specific time period effect for each time period $k$; the vector of cluster-time effects $\mathbf{b^1_j} \sim N(0, \Sigma)$, where $\Sigma = DRD$ is a $20 \times 20$ covariance matrix based on a diagonal matrix $D$ and an auto-regressive correlation structure $R$:

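\[ D = 4 * I_{20 \times 20}\] (the diagonal of $D$ holds standard deviations, so that each variance on the diagonal of $\Sigma$ is $\sigma^2 = 16$) and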

\[ R =\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{19} \\ \rho & 1 & \rho & \dots & \rho^{18} \\ \rho^2 & \rho & 1 & \dots & \rho^{17} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \rho^{19} & \rho^{18} & \rho^{17} & \dots & 1 \\ \end{bmatrix}, \ \ \rho = 0.7 \]
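
The covariance matrix $\Sigma$ can be assembled in base R to check the scale of its entries. This sketch assumes that $D$ holds standard deviations, so that $\Sigma = DRD$ has the variance of 16 on its diagonal; the object names are my own.

```r
rho <- 0.7
K <- 20

# AR-1 correlation matrix: rho^|s - t| for periods s and t
R_ar1 <- rho^abs(outer(seq_len(K), seq_len(K), "-"))

# Assumption: D holds standard deviations, sqrt(16) = 4
D <- diag(sqrt(16), K)

Sigma <- D %*% R_ar1 %*% D
round(Sigma[1:3, 1:3], 2)
```

Under that assumption, $\Sigma$ is simply $16R$: the covariance between cluster-time effects falls from 16 at lag 0 to $16 \times 0.7$ at lag 1, and so on.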

The simstudy definitions establish $b^0$, $A$, $D$, and $Y$. The vector $\mathbf{b^1}$ is created separately in the actual data generation process using addCorGen, using $\mu = 0$ and $\sigma^2 = 16$. Here are the initial definitions:

def <- defData(varname = "b0", formula = 0, variance = 6)
def <- defData(def, varname = "A", formula = "1;1", dist = "trtAssign")
def <- defData(def, varname = "mu", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "s2", formula = 16, dist = "nonrandom")

defOut <- defDataAdd(varname = "y", 
  formula = "100 + b0 + b1k - 0.1 * k^2 + 5*A", 
  variance = 9)

I’ve wrapped the data generation process inside a function so that I can use it in a replication study at the end of the post. The function adds a normalized version of time and ensures that the site variable is a factor, both adjustments needed for modeling.

s_generate <- function() {
  
  d <- genData(48, def, id = "site")
  d <- addPeriods(d, 20, "site", perName = "k")
  
  d <- addCorGen(dtOld = d, idvar = "site", nvars = 20, 
                 rho = .7, corstr = "ar1",
                 dist = "normal", param1 = "mu", param2 = "s2", cnames = "b1k")
  
  d <- genCluster(d, "timeID", numIndsVar = 30, level1ID = "id")
  d <- addColumns(defOut, d)
  
  d[, normk := (k - min(k))/(max(k) - min(k))]
  d[, site := as.factor(site)]
  
  d[]
}

set.seed(123)
dd <- s_generate()

Some modeling options

If we are interested in accounting for the secular (or time) trend when estimating the treatment effect, we have a number of different options. We can assume there is no structure to the pattern of time, we can impose an extreme form of structure, or we can try to find a flexible middle ground.

Time without structure

In stepped wedge designs, it is quite common to assume little if any structure in time trends. In the context of a CRT this could be set up by including a time-specific effect for each period $k$, as in this model for an outcome $Y_{ijk}$ for individual $i$ in group $j$:

\[ Y_{ijk} = \beta_0 + \gamma_k + \delta A_j + b_j +e_{ijk} \]

where $A_j$ is a treatment indicator, set to 1 if cluster $j$ has been randomized to the intervention. $\beta_0$ and $b_j$ are the intercept and random intercept, respectively. $\delta$ is the effect size parameter. $\gamma_k$ is the time-specific effect for period $k$. This is a totally reasonable approach to take, but if the number of periods starts to get quite large, we would need to estimate a large number of parameters ($K$ period parameters, to be more precise), which is not always desirable, so we won’t take this approach here.
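
The growth in the number of time parameters is easy to see by comparing design matrices. This is a small base R sketch with a hypothetical period vector (20 periods, as in the simulated data); note that with an intercept in the model, the categorical coding uses $K - 1$ indicator columns, with the first period as the reference.

```r
# Hypothetical period vector: 20 periods, each observed three times
k <- rep(0:19, times = 3)

X_cat <- model.matrix(~ factor(k))  # one indicator per period (minus reference)
X_lin <- model.matrix(~ k)          # a single slope for time

# Number of columns, including the intercept
c(categorical = ncol(X_cat), linear = ncol(X_lin))
```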

Time with over-simplified structure

An alternative approach is to model time in a linear fashion as

\[ Y_{ijk} = \beta_0 + \gamma k + \delta A_j + b_j + e_{ijk} \]

where we have a single parameter $\gamma$ instead of $K$ period parameters. Here is an estimate of the treatment effect $\delta$ using a mixed effects model assuming a common linear time trend:

linear <- lmer(y ~ A + k + ( 1  | site) , data = dd)
summary(linear)$coefficients["A", c("Estimate", "Std. Error")]

##   Estimate Std. Error 
##      6.533      0.844

The linear model gets around the problem of a large number of parameters, but it imposes a very strong assumption that the outcome $Y$ changes linearly over time (and in this case at the same rate for each cluster). This is unlikely to be the case. We could fit a quadratic model like

\[ Y_{ijk} = \beta_0 + \gamma_0 k + \gamma_1 k^2 + \delta A_j + b_j + e_{ijk} \]

but the assumption is still quite strong. We could also fit a mixed effects model with a random slope $b_{1j}$ as well:

\[ Y_{ijk} = \beta_0 + \gamma k + \delta A_j + b_{0j} + b_{1j} k + e_{ijk} \]

But, if the temporal trend is not linear, there is no reason to think this would be the best approach.

Mixed effects model with fixed cubic spline and random intercept

We can introduce some flexibility into the model by using a cubic spline, which is constructed using a piece-wise cubic polynomial defined by specific points (knots) along the x-axis.

\[ Y_{ijk} = \beta_0 + cs(k) + \delta A_j + b_j + e_{ijk} \]

The cubic spline model is fit in R using the function bs in the splines package. In order to get more stable estimates, I’ve normalized the time measurement to the range $[0, 1]$ before using it in the model. In this case, the effect size estimate and standard error are the same as the linear model.

fix_cs <- lmer(y ~ A + bs(normk) + ( 1  | site) , data = dd)
summary(fix_cs)$coefficients["A", c("Estimate", "Std. Error")]

##   Estimate Std. Error 
##      6.533      0.844
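
It can be useful to look at the basis that bs creates when it is called with its default arguments, as it is here. The sketch below rebuilds the normalized time variable for 20 periods (the name normk matches the post, but this is a stand-alone vector, not the column in dd) and inspects the attributes of the basis matrix.

```r
library(splines)

# Normalized time for 20 periods, as in the data generation function
normk <- (0:19) / 19

B <- bs(normk)

dim(B)             # one row per period, one column per basis function
attr(B, "knots")   # interior knots chosen by bs
attr(B, "degree")  # polynomial degree
```

With the defaults, bs uses degree 3 and no interior knots, so the three basis columns span the same curves as a single cubic polynomial in normk. A more flexible piece-wise fit would require specifying the df or knots argument, which is not done in this post.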

Mixed effects model with random cubic spline

There is no reason to believe that each cluster shares the same time trend, as assumed in the first two models estimated here. So now we introduce additional flexibility by fitting a random cubic spline for each cluster.

\[ Y_{ijk} = \beta_0 + \delta A_j + b_j + cs_j(k) + e_{ijk} \]

The only difference from the fixed cubic spline model estimation is that the bs function appears in the random effect portion of the model. The effect size estimate is slightly more biased than the previous estimates but has slightly less uncertainty.

ran_cs <- lmer(y ~ A + ( bs(normk) | site) , data = dd)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00672918 (tol = 0.002, component 1)

summary(ran_cs)$coefficients["A", c("Estimate", "Std. Error")]

##   Estimate Std. Error 
##       6.56       0.80

Generalized additive model with site-specific smoothing

Another flexible modeling approach is the generalized additive model, which provides potentially even more flexibility than the spline models and can provide protections against overfitting. The underlying flexibility of the GAM is due to the wide range of basis functions that are available for the construction of the curve. I recommend taking a look at the link for a nice introduction.

In this case, the model includes cluster-specific curves $f_j(k)$:

\[ Y_{ijk} = \beta_0 + \delta A_j + f_j(k) + e_{ijk} \]

We estimate the model using the gamm function in the mgcv package. By setting the bs argument to “fs” in the smoothing function s, we will get estimated cluster-specific curves. “fs” refers to a special smooth factor interaction basis, where the interaction in this case is between site and time $k$.

gam <- gamm(y ~ A + s(k, site, bs = "fs", k = 5), data = dd, method="REML")
cbind(summary(gam$gam)$p.coeff, summary(gam$gam)$se)[2,]

## [1] 6.537 0.846

The figure below shows the predicted site-specific curves for each of the estimated models. The rigidity of the linear and fixed cubic spline models is pretty clear. And in at least this particular case, the two flexible methods appear to generate quite similar predicted curves.

The next figure shows the individual-level outcomes and the predicted curves for a small number of sites. It is clear that the curves for the less flexible methods are biased. The similarity of the flexible models is particularly evident here.

Evaluating bias and variance of treatment effect estimate

The results from a single data set are interesting, but we really need to understand how well the models perform over a large number of data sets. How do the model estimates of the true treatment effect ($\delta = 5$) compare when considering bias, variance, and coverage of the 95% confidence interval?
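To make the three criteria concrete, here is one way to write them down, assuming $R$ replications with point estimates $\hat{\delta}_r$ and model-based standard errors $\widehat{se}_r$ (this notation is introduced only for illustration):

```latex
\[
\begin{aligned}
\text{bias} &= \frac{1}{R}\sum_{r=1}^{R} \hat{\delta}_r - \delta \\
\text{empirical SE} &= \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\left(\hat{\delta}_r - \bar{\delta}\right)^2},
  \quad \bar{\delta} = \frac{1}{R}\sum_{r=1}^{R}\hat{\delta}_r \\
\text{coverage} &= \frac{1}{R}\sum_{r=1}^{R}
  I\left(\hat{\delta}_r - 1.96\,\widehat{se}_r < \delta < \hat{\delta}_r + 1.96\,\widehat{se}_r\right)
\end{aligned}
\]
```

A model whose average estimated standard error is smaller than the empirical SE is understating its uncertainty, and its coverage will tend to fall below 95%.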

The replication process requires generating data and then fitting the models. The data generation uses the data definitions and data generating function provided earlier in the post. In this case, we will use 1000 data sets.

replicate <- function(){
  
  dd <- s_generate()
  
  linear <- lmer(y ~ A + k + ( 1  | site) , data = dd)
  est.lin <- summary(linear)$coefficients["A", c("Estimate", "Std. Error")]
  
  fix_cs <- lmer(y ~ A + bs(normk) + ( 1  | site) , data = dd)
  est.fcs <- summary(fix_cs)$coefficients["A", c("Estimate", "Std. Error")]
  
  ran_cs <- lmer(y ~ A + ( bs(normk) | site) , data = dd)
  est.rcs <- summary(ran_cs)$coefficients["A", c("Estimate", "Std. Error")]
  
  gam <- gamm(y ~ A + s(k, site, bs = "fs", k = 5), data = dd, method="REML")
  est.gam <- cbind(summary(gam$gam)$p.coeff, summary(gam$gam)$se)[2,]
  
  dres <- data.table(t(est.lin), t(est.fcs), t(est.rcs), t(est.gam))
  setnames(dres, 
    c("d.lin", "se.lin", "d.fcs", "se.fcs", "d.rcs", "se.rcs", "d.gam", "se.gam")
  )
  
  dres[]
}

res <- rbindlist(pblapply(1:1000, function(x) replicate()))

Each replication provides the point estimate of the treatment effect as well as the estimate of the standard error. Here is a sampling of the results:

res

##       d.lin se.lin d.fcs se.fcs d.rcs se.rcs d.gam se.gam
##    1:  4.53  0.847  4.53  0.847  4.73  0.805  4.55  0.850
##    2:  3.68  0.911  3.68  0.911  3.68  0.873  3.77  0.908
##    3:  3.34  0.734  3.34  0.734  3.33  0.685  3.36  0.729
##    4:  4.95  0.690  4.95  0.690  5.07  0.680  4.89  0.688
##    5:  5.75  0.865  5.75  0.865  5.67  0.776  5.81  0.868
##   ---                                                    
##  996:  5.27  1.001  5.27  1.001  5.17  0.974  5.35  1.008
##  997:  5.94  0.842  5.94  0.842  6.10  0.818  5.89  0.839
##  998:  4.92  0.910  4.92  0.910  4.91  0.876  4.94  0.916
##  999:  4.72  0.799  4.72  0.799  4.41  0.696  4.71  0.786
## 1000:  4.56  0.887  4.56  0.887  4.74  0.852  4.54  0.888

The average of the point estimates across all replications provides an estimate of the bias for each model. The four approaches are relatively unbiased, and this includes the less flexible approaches that didn’t seem to do so well at prediction on the individual level. The random cubic spline seems to have slightly less bias:

res[, .(lin = mean(d.lin), fcs = mean(d.fcs), rcs = mean(d.rcs), gam = mean(d.gam))] - 5

##       lin    fcs    rcs    gam
## 1: 0.0314 0.0314 0.0206 0.0304

A comparison of the observed standard errors suggests that the random cubic spline model is slightly more variable than the other three modeling approaches, suggesting a bias-variance trade-off.

res[, .(lin = sd(d.lin), fcs = sd(d.fcs), rcs = sd(d.rcs), gam = sd(d.gam))]

##     lin  fcs   rcs   gam
## 1: 0.92 0.92 0.941 0.919

And while all four methods underestimate the uncertainty, on average, the random cubic spline model most severely underestimated the standard errors:

res[, .(lin = mean(se.lin), fcs = mean(se.fcs), rcs = mean(se.rcs), gam = mean(se.gam))]

##      lin   fcs  rcs   gam
## 1: 0.907 0.907 0.87 0.908

Consistent with the disparities in variance estimates, the random cubic splines did not perform as well with respect to the coverage rates of the 95% confidence intervals:

coverage <- function(est, se) {
  rmin <- est - 1.96 * se
  rmax <- est + 1.96 * se
  
  mean(rmin < 5 & rmax > 5)
}

res[, .(lin = coverage(d.lin, se.lin), fcs = coverage(d.fcs, se.fcs),
        rcs = coverage(d.rcs, se.rcs), gam = coverage(d.gam, se.gam))]

##      lin   fcs   rcs  gam
## 1: 0.944 0.944 0.924 0.94

It will be interesting to see how the bias-variance trade-off plays out in the context of a stepped wedge design, particularly if the outcomes are binary. Will the less flexible methods continue to perform as well as the GAM model, and will the cubic spline model continue to underestimate the standard errors? More to come. (The next post is now available.)

Presenting results for multinomial logistic regression: a marginal approach using propensity scores

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 20 Sep 2022 00:00:00 +0000

Multinomial logistic regression modeling can provide an understanding of the factors influencing an unordered, categorical outcome. For example, if we are interested in identifying individual-level characteristics associated with political parties in the United States (Democratic, Republican, Libertarian, Green), a multinomial model would be a reasonable approach for estimating the strength of the associations. In the case of a randomized trial or epidemiological study, we might be primarily interested in the effect of a specific intervention or exposure while controlling for other covariates. Unfortunately, interpreting results from a multinomial logistic model can be a bit of a challenge, particularly when there is a large number of possible responses and covariates.

My goal here is to generate a data set to illustrate how difficult it might be to interpret the parameter estimates from a multinomial model. And then I lay out a relatively simple solution that allows us to easily convert from the odds scale to the probability scale so we can more easily see the effect of the exposure on the outcome.

The multinomial logistic model

This is the formal specification of the model:

\[\log\left[\frac{P(P_i = y)}{P(P_i = \text{ref})}\right] = \alpha_y + \gamma_y A_i + \boldsymbol{X_i \beta_y}\]

where $P_i$ is the political party of individual $i$, $y \in \{R, L, G\}$, and ref is the reference party - we’ll set that to be $D$ in this case. $A_i$ is a treatment/exposure indicator, and is 1 if exposed, 0 otherwise. (Say there is a particular ad campaign that we are assessing.) $\boldsymbol{X_i}$ is a vector of covariates.

The complexity of the model starts to emerge as we consider the parameters, which are all specific to the level of $y$. $\alpha_y$ is the intercept for the log odds of $y$ vs. the reference category (so $\alpha_y$ represents the log odds for the unexposed with average covariate values, assuming all covariates have been centered at zero). $\gamma_y$ is the log odds ratio comparing the odds of $y$ vs. $D$ for the exposed and unexposed. $\boldsymbol{\beta_y}$ is a vector of parameters that reflect the association of the covariates with the party choice.

If we have $K$ possible categories (in this example $K = 4$), there are $K-1$ sets of parameters. That is challenging enough. But each of those parameters is in relation to the reference category, so if you want to compare the odds across two non-reference categories, it can be a little challenging.
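To see why, take two non-reference categories $y_1$ and $y_2$ (labels introduced just for this illustration). Subtracting one log odds equation from the other cancels the reference category, so the comparison between them depends on differences of parameters rather than on any single estimate:

```latex
\[
\log\left[\frac{P(P_i = y_1)}{P(P_i = y_2)}\right]
  = (\alpha_{y_1} - \alpha_{y_2})
  + (\gamma_{y_1} - \gamma_{y_2}) A_i
  + \boldsymbol{X_i}\left(\boldsymbol{\beta_{y_1}} - \boldsymbol{\beta_{y_2}}\right)
\]
```

So the log odds ratio for exposure comparing, say, $R$ with $L$ is $\gamma_R - \gamma_L$, a quantity that does not appear directly in the model output.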

Simulating the data

Before getting started, here are the necessary R packages for everything we are doing here.

library(simstudy)
library(data.table)
library(ggplot2)
library(nnet)
library(MatchIt)

Definitions

In this simulation, I am generating a categorical outcome that has four levels. There is a single exposure ($A$) and two covariates ($x_1$ and $x_2$). Six time periods’ worth of data are being generated. The probability of exposure ($A$) depends on both covariates and time. The outcome is associated with the covariates, and the effect of the intervention changes over time, complicating matters.

First, I define the effects of time on both exposure and the outcome (defT). Then I define covariates $x_1$ and $x_2$, exposure $A$, and outcome $Y$ (defY).

trunc_norm <- function(mean, sd, lower, upper) {
  msm::rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper)
}

defT <- defData(varname = "a_j", 
  formula = "trunc_norm(mean = 0, sd = 0.4, 
  lower = -Inf, upper = 0)", dist = "nonrandom")
defT <- defData(defT, varname = "b_j", 
  formula = "trunc_norm(mean = 0, sd = 0.4, 
  lower = -Inf, upper = 0)", dist = "nonrandom")

defY <- defDataAdd(varname = "x1", formula = 0, variance = 1)
defY <- defDataAdd(defY, varname = "x2", formula = 0.5, dist = "binary")
defY <- defDataAdd(defY, varname = "A", formula = "-1 + a_j - .5*x1 + .6*x2", 
  dist = "binary", link = "logit")
defY <- defDataAdd(defY, varname = "y", 
  formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
             b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
             -0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ", 
  dist = "categorical", link = "logit")

Data generation

In the data generation step, we assume six periods and 200 individuals in each period:

set.seed(999)

dd <- genData(6, defT, id = "period")
dd <- genCluster(dd, "period", 200, level1ID = "id")

dd <- addColumns(defY, dd)
dd <- genFactor(dd, "y")

dd[, fy := relevel(fy, ref = "4")]
dd[, period := factor(period)]

Here are a few rows of the data set:

##       period   id A     x1 x2 fy
##    1:      1    1 0  0.101  0  3
##    2:      1    2 0  0.901  1  4
##    3:      1    3 0 -2.074  1  4
##    4:      1    4 1 -1.229  0  2
##    5:      1    5 0  0.643  1  4
##   ---                           
## 1196:      6 1196 0  0.164  1  1
## 1197:      6 1197 1 -1.571  1  1
## 1198:      6 1198 1  0.398  1  1
## 1199:      6 1199 0 -0.199  0  4
## 1200:      6 1200 0  1.022  1  1

Here are two figures that allow us to visualize the relationship of the exposure $A$ with the covariates and time:

Typically, I would like to plot the raw outcome data to get an initial sense of how the outcome relates to the covariates of interest, but with a categorical measure that has four levels, it is not so obvious how to present the data in an insightful way. At the very least, we can show the distributions of the outcome over time and exposure, without taking into account the covariates (so there might be some confounding inherent in the plots):

“Traditional” analysis

If we do suspect there might be confounding due to covariates $x_1$ and $x_2$, and we see that there appear to be different effects of exposure over time, it would not be unreasonable to estimate a multinomial model that adjusts for both covariates and includes an interaction term for exposure and period.

fit <- multinom(fy ~ x1 + x2 + A*period, data = dd, trace = FALSE)

Interpreting the results

The parameter estimates for a multinomial model are shown in the table below. In this case, the fourth response category is the reference, so the table shows the odds ratios for each response relative to the reference (the intercepts $\alpha_y$ are not shown in the table). Each section of the table (labeled “1”, “2”, and “3”) represents the estimated parameters for each response level. While some readers may be able to get a lot out of this table, I find it a little overwhelming, particularly when it comes to understanding (a) the impact of time on the exposure effect, and (b) how responses other than the reference category compare to each other.

Characteristic	1	2	3
OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
x1	0.70	0.57, 0.85	<0.001	1.16	0.97, 1.38	0.10	0.65	0.55, 0.76	<0.001
x2	0.68	0.46, 1.01	0.056	0.83	0.60, 1.15	0.3	0.65	0.48, 0.87	0.005
A	0.63	0.13, 3.16	0.6	5.28	2.28, 12.2	<0.001	0.34	0.10, 1.13	0.078
period
1	—	—		—	—		—	—
2	1.54	0.71, 3.34	0.3	0.81	0.41, 1.61	0.5	0.99	0.58, 1.71	>0.9
3	0.55	0.21, 1.46	0.2	0.94	0.49, 1.82	0.9	0.85	0.49, 1.47	0.6
4	0.88	0.37, 2.10	0.8	1.36	0.73, 2.55	0.3	0.81	0.46, 1.41	0.5
5	0.32	0.11, 0.94	0.038	0.46	0.22, 0.94	0.034	0.86	0.51, 1.46	0.6
6	0.91	0.40, 2.06	0.8	0.94	0.50, 1.75	0.8	0.63	0.36, 1.09	0.10
A * period
A * 2	3.32	0.52, 21.1	0.2	1.04	0.33, 3.33	>0.9	0.87	0.16, 4.60	0.9
A * 3	26.2	3.78, 182	<0.001	1.40	0.42, 4.67	0.6	5.34	1.18, 24.1	0.029
A * 4	14.0	2.22, 88.0	0.005	0.36	0.11, 1.17	0.089	2.00	0.42, 9.42	0.4
A * 5	27.4	3.87, 193	<0.001	0.44	0.12, 1.68	0.2	1.83	0.40, 8.44	0.4
A * 6	89.7	11.3, 714	<0.001	1.16	0.22, 6.14	0.9	2.02	0.15, 27.0	0.6
¹ OR = Odds Ratio, CI = Confidence Interval

Probability scale

What I really want is to be able to see the estimates on the probability scale. This is always challenging because we can only get predicted probabilities at specific levels of the covariates (i.e. the model is conditional), and it is not clear what levels of the covariates we should use for this purpose. While we could just use the average value of each covariate to generate an average probability for each exposure group and each time period, there is something arbitrary to doing this.
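As a small illustration of what converting to the probability scale involves at one fixed covariate pattern, the sketch below turns a set of log odds (each relative to the reference category) into probabilities. The three log-odds values are hypothetical numbers chosen for the example, not estimates from the model above:

```r
# Hypothetical log odds of categories 1-3 vs. the reference category (4)
# for a single covariate pattern; these are made-up values for illustration.
eta <- c("1" = -1.3, "2" = -0.6, "3" = -0.3)

# The reference category has log odds 0 by construction, which is where
# the 1 in the denominator comes from
p_nonref <- exp(eta) / (1 + sum(exp(eta)))
p <- c(p_nonref, "4" = 1 - sum(p_nonref))

round(p, 3)
```

Every different covariate pattern gives a different set of probabilities, which is why a single summary is hard to choose.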

Perhaps a somewhat more palatable approach is to get estimates of the marginal probabilities. A while ago, I presented an approach in the context of logistic regression (which is just a special case of multinomial regression, where the categorical outcome has only two levels) that estimated a predicted probability for each individual under each treatment arm, and then calculated an average risk difference by averaging across all the patients. This could presumably work here, but I decided to try another approach that eliminates the covariates from the analysis by using propensity score matching.
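For reference, the averaging approach mentioned above might look something like this for a multinomial outcome. This is only a sketch on a small hypothetical data set (the variable names and coefficients are made up), not the analysis used in this post:

```r
library(nnet)

set.seed(1)
n <- 500

# Hypothetical data: one covariate, one exposure, three-level outcome
d <- data.frame(x1 = rnorm(n), A = rbinom(n, 1, 0.5))
eta2 <- -0.5 + 0.8 * d$A + 0.3 * d$x1
eta3 <- -0.2 - 0.4 * d$A - 0.3 * d$x1
den <- 1 + exp(eta2) + exp(eta3)
probs <- cbind(1 / den, exp(eta2) / den, exp(eta3) / den)
d$y <- factor(apply(probs, 1, function(p) sample(1:3, 1, prob = p)))

fit <- multinom(y ~ A + x1, data = d, trace = FALSE)

# Predict for every individual as if unexposed, then as if exposed
p0 <- predict(fit, newdata = transform(d, A = 0), type = "probs")
p1 <- predict(fit, newdata = transform(d, A = 1), type = "probs")

# Average over individuals to get marginal differences by response level
rd <- colMeans(p1) - colMeans(p0)
rd
```

Because the predicted probabilities sum to one for each individual, the marginal differences across response levels sum to zero.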

Propensity score matching

I’ve described propensity score matching in an earlier post (and provided at least one good reference there), so I won’t go into much detail here. The general idea is that we can estimate a probability of exposure (i.e., the propensity score), and then match individuals in the two exposure groups based on those scores. If done well, this creates two comparable groups that are balanced with respect to the confounders (assuming all the confounders have been measured and are included in the exposure model for the propensity score). Once the matching is done, we can estimate the multinomial model without any covariates - and convert to predicted probabilities without relying on any assumptions about the covariates.
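To make the idea of a propensity score a little more concrete, here is a minimal base R sketch on hypothetical data: the score is the fitted probability from an exposure model, and a standardized mean difference is one common way to see how imbalanced a covariate is before any matching. The helper smd is an illustrative function written for this example, not part of MatchIt:

```r
set.seed(2022)
n <- 400

# Hypothetical covariates and exposure
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-1 - 0.5 * x1 + 0.6 * x2))

# Propensity score: estimated probability of exposure given the covariates
ps <- fitted(glm(A ~ x1 + x2, family = binomial))

# Standardized mean difference of a variable between exposure groups
smd <- function(x, a) {
  (mean(x[a == 1]) - mean(x[a == 0])) /
    sqrt((var(x[a == 1]) + var(x[a == 0])) / 2)
}

c(x1 = smd(x1, A), x2 = smd(x2, A), ps = smd(ps, A))
```

After a successful match, the same differences computed on the matched sample should be much closer to zero.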

In the first step, I am matching individuals within each time period. This way, the groups will be balanced at each time point, and I could estimate marginal probabilities for each period.

matchby <- function(dx) {
  
  m <- matchit(A ~ x1 + x2, data = dx,
        method = "nearest", distance = "glm", caliper = .25)
  match.data(m)
  
  }

m.out <- rbindlist(lapply(1:6, function(x) matchby(dd[period==x])))

Analysis of matched data

In the second step, I fit a multinomial model that includes only exposure and time, and then generate a predicted probability for each exposure and period combination.

mfit <- multinom(fy ~ A*period, data = m.out, trace = FALSE)

dnew <- data.table(expand.grid(A = c(0,1), period = factor(c(1:6))))
dpred <- data.table(predict(mfit, newdata = dnew, "probs"))

dpred <- cbind(dnew, dpred)
dplot <- melt(data = dpred, id.vars = c("A", "period"), 
              value.name = "proportion", variable.name = "response")
dplot[, response := factor(response, levels = c(1, 2, 3, 4))]
dplot[, A := factor(A, labels = c("not exposed", "exposed"))]

Here are the predicted probabilities for the second time period:

##              A period response proportion
## 1: not exposed      2        1     0.1786
## 2:     exposed      2        1     0.2143
## 3: not exposed      2        2     0.1250
## 4:     exposed      2        2     0.4286
## 5: not exposed      2        3     0.2321
## 6:     exposed      2        3     0.0714
## 7: not exposed      2        4     0.4643
## 8:     exposed      2        4     0.2857

Bootstrap estimates of confidence bands

To go along with our point estimates, we need a measure of uncertainty, which we will estimate by bootstrap. For this analysis, I am bootstrapping the whole process, starting by sampling with replacement within each period and each exposure group, doing the matching, fitting the model, and generating the predictions.

bs <- function(x) {
  
  ids <- dd[, .(id = sample(id, .N, replace = TRUE)), keyby = .(period, A)][, id]
  db <- dd[ids]
  db[, id := .I]
  
  mb.out <- rbindlist(lapply(1:6, function(x) matchby(db[period==x])))
  
  mfit <- multinom(fy ~ A*period, data = mb.out, trace = FALSE)
  
  dbnew <- data.table(expand.grid(A = c(0,1), period = factor(c(1:6))))
  dbpred <- data.table(predict(mfit, newdata = dbnew, "probs"))
  
  cbind(iter = x, dbnew, dbpred)
}

bspred <- rbindlist(lapply(1:500, function(x) bs(x)))
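The same resample, re-estimate, and collect pattern can be seen on a much simpler statistic. The sketch below uses hypothetical binary data and bootstraps a single proportion; the percentile interval at the end is the same kind of summary that gets applied to the bootstrapped predictions:

```r
set.seed(42)
y <- rbinom(200, 1, 0.3)   # hypothetical binary outcomes

# Resample with replacement and recompute the proportion each time
boot_p <- replicate(500, mean(sample(y, replace = TRUE)))

# Percentile-based 95% interval
ci <- quantile(boot_p, c(0.025, 0.975))
ci
```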

Plot point estimates and confidence bands

What follows is the code to generate the figure showing the predicted probabilities for each arm. But before creating the plot, I’ve extracted 95% confidence intervals for each response level and period from the bootstrap data that will be used to draw the confidence bands.

bsplot <- melt(data = bspred, id.vars = c("iter", "A", "period"), 
  value.name = "proportion", variable.name = "response")
bsplot[, response := factor(response, levels = c(1, 2, 3, 4))]
bsplot[, A := factor(A, labels = c("not exposed", "exposed"))]

ci <- bsplot[, 
  .(l95 = quantile(proportion, 0.025), u95 = quantile(proportion, 0.975)), 
  keyby = .(response, A, period)
]

ggplot(data = dplot, aes(x = period,  group = A)) +
  geom_ribbon(data = ci, 
              aes(ymin = l95, ymax = u95, fill = A),
              alpha = .2)  +
  geom_line(aes(y = proportion, color = A), size = .8) +
  facet_grid(. ~ response, labeller = label_both) +
  theme(panel.grid = element_blank(),
        legend.title = element_blank()) +
  scale_color_manual(values = c("#806cc6", "#b2c66c")) +
  scale_fill_manual(values = c("#806cc6", "#b2c66c"))

The point estimates mirror the marginal raw data plots quite closely, which is hardly surprising since we treated time as a categorical variable and the model is saturated. The benefit of doing the modeling is that we have generated estimates of uncertainty, and are in a position to make some inferences. For example, it looks like the exposure has an increasing effect on the probability of a level “1” response in the last three periods. Likewise, the effect of the exposure on the probability of a level “2” response was strongest in the first two periods, and then disappears. And there is too much uncertainty to say anything definitive about level “3” and “4” responses.

Flexible simulation in simstudy with customized distribution functions

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 30 Aug 2022 00:00:00 +0000

Really, the only problem with the simstudy package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 - see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as we are willing to provide some extra code.

I am going to illustrate this with two examples, first by implementing a truncated normal distribution, and second by implementing the flexible non-linear data generating algorithm that I described last time.

Before we get going, here are the necessary libraries:

library(simstudy)
library(data.table)
library(msm)
library(ggplot2)
library(mgcv)

General concept

In the data definition step, it is possible to specify any valid R function in the formula argument. If dist is specified as “nonrandom”, then simstudy will generate data based on that function. (Yes, the specification as “nonrandom” is a bit awkward since we are defining a stochastic data generating process in this case; in future versions I plan to allow dist to be specified as “custom” to make this less dissonant.)

In this example, I want to be able to generate data from a truncated normal distribution. There is an existing function rtnorm in the msm package that I can take advantage of here. What I have done is essentially create a wrapper function that makes a single draw from the truncated distribution with a specified mean, standard deviation, and pair of truncation bounds:

trunc_norm <- function(mean, sd, lower, upper) {
  rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper)
}

Now that trunc_norm has been created, I am free to use it in a data definition statement. And even more important, the call to trunc_norm can depend on other variables; in this case, I have created a binary variable x that will determine the upper and lower bounds of the distribution. When $x=0$, the $N(0, 3.5^2)$ distribution is truncated at -5 and 5, and when $x=1$, the distribution is truncated at -8 and 8.

defI <- defData(varname = "x", formula = 0.5, dist = "binary")
defI <- defData(defI, varname = "y", 
  formula = "trunc_norm(mean = 0, sd = 3.5, 
               lower = -5 + -3*x, upper = 5 + 3*x)",
  dist = "nonrandom")

The generated data appear to have the properties that we would expect:

dd <- genData(1000, defI)

Application to non-linear data generation

Last time, I described an approach to generate a variable $y$ that has a non-linear response with respect to an input variable $x$. At the end of that post, I created two functions, one of which can be referred to in the defData statement to generate the data. (I plan on implementing these functions in simstudy, but I was eager to get the concept out there in case anyone has some suggestions or could use this feature right away.)

In the first step, I need to generate a smooth function by specifying a few points. I do this by calling getNLfunction. (If you want the code for this, let me know, but I actually provided most of it last week.) The variable nlf is an object that contains the function:

dpoints <- data.table(x = c(20, 30, 53, 65, 80), y = c(15, 44, 60, 55, 35))
nlf <- getNLfunction(dpoints)

The function genNL makes predictions based on the nlf object and adds a little Gaussian noise. We use the same approach as we did above for the truncated normal to generate different responses $y$ based on the level of $x$:

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y", 
  formula = "genNL(nf = ..nlf, x, sd = 10)", dist = "nonrandom")

dd <- genData(300, def)

And if we introduce much less noise, we get much closer to the original underlying function specified by our points:

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y", 
  formula = "genNL(nf = ..nlf, x, sd = 0.5)", dist = "nonrandom")

dd <- genData(300, def)

Simulating data from a non-linear function by specifying a handful of points

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 09 Aug 2022 00:00:00 +0000

Trying to simulate data with non-linear relationships can be frustrating, since there is not always an obvious mathematical expression that will give you the shape you are looking for. I’ve come up with a relatively simple solution for somewhat complex scenarios that only requires the specification of a few points that lie on or near the desired curve. (Clearly, if the relationships are straightforward, such as relationships that can easily be represented by quadratic or cubic polynomials, there is no need to go through all this trouble.) The translation from the set of points to the desired function and finally to the simulated data is done by leveraging generalized additive modelling (GAM) methods, and is described here.

Before we begin, here are the R packages used in the post:

library(data.table)
library(ggplot2)
library(mgcv)

Starting with the idea of a non-linear function

Say we desire a functional relationship $y = f(x)$ that looks something like this:

Conceptualize the function with a handful of points

Since it is not obvious (to me at least) how to express this curve mathematically, let’s start by selecting a few points that describe the general shape. In this case, I think using seven points does the trick pretty well:

x <- c(20, 30, 40, 50, 60, 70, 80)
y <- c(15, 40, 25, 22, 35, 24, 15)

dd <- data.table(x = x, y = y)

ggplot(data = dd, aes(x = x, y = y)) +
  geom_line(lty = 2, color = "grey70") +    
  geom_point() +
  theme(panel.grid = element_blank())

Fill in the gaps

If we want to extrapolate from this small set of points to a curve, we will need more data. We can easily generate this data by taking points along the lines that connect our original points. This requires only some rudimentary algebra. First, just create a data set that includes the start and end points of each line segment:

dc <- copy(dd)
  
dc[, xend := (shift(x, type = "lead"))]
dc[, yend := (shift(y, type = "lead"))]
dc <- dc[ -.N]
dc[, id := .I]

dc

##     x  y xend yend id
## 1: 20 15   30   40  1
## 2: 30 40   40   25  2
## 3: 40 25   50   22  3
## 4: 50 22   60   35  4
## 5: 60 35   70   24  5
## 6: 70 24   80   15  6

And then after calculating the slope and intercept, you’ve got points along those line segments:

interp_line <- function(x, y, xend, yend) {
  slope <- (yend - y)/(xend - x)
  b <- y - slope*x
  newx <- seq(x, xend, length = 100)
  newy <- newx*slope + b
  
  data.table::data.table(x = newx, y = newy)
}
  
dx <- dc[, interp_line(x, y, xend, yend), keyby = id]
dx

##      id        x        y
##   1:  1 20.00000 15.00000
##   2:  1 20.10101 15.25253
##   3:  1 20.20202 15.50505
##   4:  1 20.30303 15.75758
##   5:  1 20.40404 16.01010
##  ---                     
## 596:  6 79.59596 15.36364
## 597:  6 79.69697 15.27273
## 598:  6 79.79798 15.18182
## 599:  6 79.89899 15.09091
## 600:  6 80.00000 15.00000

The augmented set of points looks like a sequence of line segments when plotted:

ggplot(data = dx, aes(x = x, y = y)) +
  geom_point(size = .05) +
  theme(panel.grid = element_blank())

Smoothing out points to get our curve

The idea here is to convert this set of line segments into a curve that can be represented by a functional object we can use to generate new outcomes. Any model fit in R can serve as a functional object in the sense that if you provide an input (or number of inputs) to the predict function for that model fit, you will get an output from that model.

GAMs are a family of flexible models that use spline functions to generate highly curved functions, so are a perfect solution here. (There are many, many sources that describe GAMs: this and this provide excellent overviews, and of course, there is this book.) In this case, we will fit a GAM model to our augmented data set to get a smoothed version of the desired function. A call to the function gam in the mgcv package with a smoothing term specified for our input $x$ (and a relatively low number of degrees of freedom $k$ to ensure a very smooth result) gives the object that will serve as our function:

gam.fit <- gam(y ~ s(x, k = 7), data = dx)

Using predict.gam, we can see how the GAM has abstracted from the line segments to give us our desired curve. (Of course, we might need to tweak our input set of points to get closer to the curve we are looking for.)

dx[, ypred := predict(gam.fit)]
  
ggplot(data = dx, aes(x=x, y = ypred)) +
      geom_line()  +
      geom_point(data = dd, aes(x = x, y = y)) +
      theme(panel.grid = element_blank())

Generating the random data from this non-linear ‘function’

It is now easy to generate outcome data $y$ that is centered around our function by generating random inputs ($x$’s), calling predict.gam, and adding some Gaussian noise:

set.seed(123)

rand_data <- data.table(x = runif(120, 20, 80))
rand_data$y <- predict(gam.fit, newdata = rand_data) + rnorm(nrow(rand_data),0, 2)

ggplot(data = rand_data, aes(x = x, y = y)) +
  geom_point(size = 1)  +
  theme(panel.grid = element_blank())

Wrapping all of this up in a function

This can all be simplified by putting the code inside a function; with this, we can generate the desired data in a few simple steps.

Here, we are trying to generate a second curve - this time using six points. The function getNLfunction takes those points, plots the curve, and returns the fitted GAM object.

x <- c(20, 30, 60, 67, 73, 80)
y <- c(15, 42, 42, 67, 67, 30)

nlf <- getNLfunction(data.table(x = x, y = y), k = 1)

The new function genNL is really just a wrapper for predict.gam. Here, we generate new $x$’s and the corresponding $y$’s with noise:

rand_data <- data.table(x = runif(180, 20, 80))
rand_data$y <- genNL(nlf, rand_data$x, sd = 2)
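I haven’t shown the code for the two helper functions, but a rough sketch pieced together from the steps above might look like this. The names getNLfunction_sketch and genNL_sketch, the argument n_interp, and the default for k are assumptions made for this illustration, and the plotting step is left out:

```r
library(data.table)
library(mgcv)

# Sketch only: interpolate along the line segments between the points,
# then fit a GAM to the augmented data (no plot is produced here).
getNLfunction_sketch <- function(dd, k = max(3, nrow(dd)), n_interp = 100) {
  dc <- copy(dd)
  dc[, xend := shift(x, type = "lead")]
  dc[, yend := shift(y, type = "lead")]
  dc <- dc[-.N]
  dc[, id := .I]

  # n_interp equally spaced points along each segment
  dx <- dc[, .(
    x = as.numeric(seq(x, xend, length.out = n_interp)),
    y = as.numeric(seq(y, yend, length.out = n_interp))
  ), keyby = id]

  gam(y ~ s(x, k = k), data = dx)
}

# Sketch only: predict from the fitted GAM and add Gaussian noise
genNL_sketch <- function(nf, x, sd) {
  mu <- as.numeric(predict(nf, newdata = data.frame(x = x)))
  mu + rnorm(length(x), mean = 0, sd = sd)
}

# Usage with the seven points from earlier in the post
pts <- data.table(
  x = c(20, 30, 40, 50, 60, 70, 80),
  y = c(15, 40, 25, 22, 35, 24, 15)
)
nlf_sketch <- getNLfunction_sketch(pts, k = 7)

set.seed(123)
sim <- data.table(x = runif(180, 20, 80))
sim[, y := genNL_sketch(nlf_sketch, x, sd = 2)]
```

Keeping the fitted GAM as the returned object means the generator only needs predict and rnorm, which is why it is such a thin wrapper.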

So, with just a few lines of code, it is possible to generate observations from an infinite range of flexible functions. Sometime soon, I will demonstrate how this flexible data generation process can be incorporated directly into a simstudy data generation process.

simstudy updated to version 0.5.0

keith.goldfeld@nyumc.org (Keith Goldfeld) — Wed, 20 Jul 2022 00:00:00 +0000

A new version of simstudy is available on CRAN. There are two major enhancements and several new features. In the “major” category, I would include (1) changes to survival data generation that accommodate hazard ratios that can change over time, as well as competing risks, and (2) the addition of functions to allow users to sample from existing data sets with replacement to generate “synthetic” data with real-life distribution properties. Other less monumental, but important, changes were made: updates to functions genFormula and genMarkov, and two added utility functions, survGetParams and survParamPlot. (I did describe the survival data generation functions in two recent posts, here and here.)

Here are the highlights of the major enhancements:

Non-proportional hazards

If we want to simulate a scenario where survival time is a function of sex and the relative risk of death (comparing males to females) changes after 150 days, we cannot use the proportional hazards assumption that simstudy has typically assumed. Rather, we need to be able to specify different hazards at different time points. This is now implemented in simstudy by using the defSurv function and the transition argument.

In this case, the same outcome variable “death” is specified multiple times (currently the limit is actually two times) in defSurv, and the transition argument indicates the point at which the hazard ratio (HR) changes. In the example below, the log(HR) comparing males and females between day 0 and 150 is -1.4 (HR = 0.25), and after 150 days the hazards are more closely aligned, log(HR) = -0.3 (HR = 0.74). The data definitions determine the proportion of males in the sample and specify the time to death outcomes:

library(simstudy)
library(survival)
library(gtsummary)
  
def <- defData(varname = "male", formula = 0.5, dist = "binary")

defS <- defSurv(varname = "death", formula = "-14.6 - 1.4 * male", 
  shape = 0.35, transition = 0)
defS <- defSurv(defS, varname = "death", formula = "-14.6 - 0.3 * male", 
  shape = 0.35, transition = 150)
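As a quick check of the hazard ratios quoted above, the coefficients on male in the two formulas are on the log(HR) scale, so exponentiating them gives the HRs for the two time windows:

```r
# log hazard ratios (male vs. female) before and after day 150,
# taken from the two defSurv formulas above
log_hr <- c(before_150 = -1.4, after_150 = -0.3)

hr <- exp(log_hr)
round(hr, 2)
```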

If we generate the data and take a look at the survival curves, it is possible to see a slight inflection point at 150 days where the HR shifts:

set.seed(10)

dd <- genData(600, def)
dd <- genSurv(dd, defS, digits = 2)

If we fit a standard Cox proportional hazard model and test the proportionality assumption, it is quite clear that the assumption is violated (as the p-value < 0.05):

coxfit <- coxph(formula = Surv(death) ~ male, data = dd)
cox.zph(coxfit)

##        chisq df       p
## male    12.5  1 0.00042
## GLOBAL  12.5  1 0.00042

If we split the data at the proper inflection point of 150 days, and refit the model, we can recover the parameters (or at least get pretty close):

dd2 <- survSplit(Surv(death) ~ ., data= dd, cut=c(150),
                 episode= "tgroup", id="newid")

coxfit2 <- coxph(Surv(tstart, death, event) ~ male:strata(tgroup), data=dd2)

tbl_regression(coxfit2)

Characteristic	log(HR)¹	95% CI¹	p-value
male * strata(tgroup)
male * tgroup=1	-1.3	-1.6, -1.0	<0.001
male * tgroup=2	-0.51	-0.72, -0.29	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

Competing risks

A new function addCompRisk generates a single time to event outcome from a collection of time to event outcomes, where the observed outcome is the earliest event time. This can be accomplished by specifying a timeName argument that will represent the observed time value. The event status is indicated in the field set by the eventName argument (which defaults to “event”). And if a variable name is indicated using the censorName argument, the censored events automatically have a value of 0.
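Conceptually, the observed outcome is just the row-wise minimum of the latent times plus a label for which one came first. The base R sketch below mimics that logic on a few hypothetical exponential event times; it is not how addCompRisk is implemented, only an illustration of the idea:

```r
set.seed(1)
n <- 6

# Hypothetical latent times for two event types and censoring
times <- data.frame(
  event_1 = rexp(n, rate = 1 / 20),
  event_2 = rexp(n, rate = 1 / 25),
  censor  = rexp(n, rate = 1 / 40)
)

# Observed time is the earliest of the three
obs_time <- pmin(times$event_1, times$event_2, times$censor)

# Which column produced the minimum in each row
type <- names(times)[max.col(-as.matrix(times), ties.method = "first")]

# Event indicator: 1 or 2 for the event types, 0 if censored first
delta <- ifelse(type == "censor", 0L, match(type, names(times)))

data.frame(obs_time = round(obs_time, 1), delta, type)
```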

To use addCompRisk, we first define and generate unique events - in this case event_1, event_2, and censor:

set.seed(1)

dS <- defSurv(varname = "event_1", formula = "-10", shape = 0.3)
dS <- defSurv(dS, "event_2", "-6.5", shape = 0.5)
dS <- defSurv(dS, "censor", "-7", shape = 0.55)

dtSurv <- genData(1001)
dtSurv <- genSurv(dtSurv, dS)

dtSurv

##         id censor event_1 event_2
##    1:    1     55    15.0     9.7
##    2:    2     47    19.8    23.4
##    3:    3     34     8.0    33.1
##    4:    4     13    25.2    40.8
##    5:    5     61    28.6    18.9
##   ---                            
##  997:  997     30    22.3    33.7
##  998:  998     53    22.3    20.5
##  999:  999     62    19.8    12.1
## 1000: 1000     55    11.1    22.1
## 1001: 1001     37     7.2    33.9

Now we generate a competing risk outcome “obs_time” and an event indicator “delta”:

dtSurv <- addCompRisk(dtSurv, events = c("event_1", "event_2", "censor"), 
  eventName = "delta", timeName = "obs_time", censorName = "censor")

dtSurv

##         id obs_time delta    type
##    1:    1      9.7     2 event_2
##    2:    2     19.8     1 event_1
##    3:    3      8.0     1 event_1
##    4:    4     13.0     0  censor
##    5:    5     18.9     2 event_2
##   ---                            
##  997:  997     22.3     1 event_1
##  998:  998     20.5     2 event_2
##  999:  999     12.1     2 event_2
## 1000: 1000     11.1     1 event_1
## 1001: 1001      7.2     1 event_1
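Conceptually, the observed outcome is just the row-wise minimum across the latent event times, together with a label for whichever came first. Here is a minimal base R sketch of that idea, using the first three records shown above as a small hand-entered data frame. It is meant to illustrate the logic only, and is not a description of how addCompRisk is implemented internally.

```r
# Latent event times for three individuals (copied from the output above)
d <- data.frame(
  id      = 1:3,
  censor  = c(55, 47, 34),
  event_1 = c(15.0, 19.8, 8.0),
  event_2 = c(9.7, 23.4, 33.1)
)

events <- c("event_1", "event_2", "censor")
m <- as.matrix(d[, events])

# Column index of the earliest time in each row
first <- unname(apply(m, 1, which.min))

d$obs_time <- m[cbind(seq_len(nrow(m)), first)]
d$type     <- events[first]

# Censoring is coded 0; the events are numbered in the order listed
d$delta <- ifelse(d$type == "censor", 0, first)

d[, c("id", "obs_time", "delta", "type")]
```

Listing the censoring variable last in events keeps the event codes 1 and 2 aligned with event_1 and event_2.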

Here’s a plot of the competing risk data using the cumulative incidence functions (rather than the survival curves):

The data generation can be done in two (instead of three) steps by including the timeName and eventName arguments in the call to genSurv. By default, the competing events will be all the events defined in defSurv:

set.seed(1)

dtSurv <- genData(1001)
dtSurv <- genSurv(dtSurv, dS, timeName = "obs_time", 
  eventName = "delta", censorName = "censor")

dtSurv

##         id obs_time delta    type
##    1:    1      9.7     2 event_2
##    2:    2     19.8     1 event_1
##    3:    3      8.0     1 event_1
##    4:    4     13.0     0  censor
##    5:    5     18.9     2 event_2
##   ---                            
##  997:  997     22.3     1 event_1
##  998:  998     20.5     2 event_2
##  999:  999     12.1     2 event_2
## 1000: 1000     11.1     1 event_1
## 1001: 1001      7.2     1 event_1

Synthetic data

Sometimes, it may be useful to generate data that will represent the distributions of an existing data set. Two new functions, genSynthetic and addSynthetic, make it fairly easy to do this.

Let’s say we start with an existing data set $A$ that has fields $a$, $b$, $c$, and $d$:

##       index    a b c    d
##    1:     1 2.74 8 0 11.1
##    2:     2 4.57 4 1 13.6
##    3:     3 2.63 4 0  8.0
##    4:     4 4.74 7 0 12.5
##    5:     5 1.90 4 0  7.2
##   ---                    
##  996:   996 0.92 3 0  5.2
##  997:   997 2.89 4 0  8.5
##  998:   998 2.80 7 0 10.9
##  999:   999 2.47 6 0  8.1
## 1000:  1000 2.63 6 0 12.5

We can create a synthetic data set by sampling records with replacement from data set $A$:

S <- genSynthetic(dtFrom = A, n = 250, id = "index")
S

##      index   a b c    d
##   1:     1 4.0 6 0 11.4
##   2:     2 3.2 4 1  9.5
##   3:     3 2.7 4 0  6.5
##   4:     4 1.7 4 0  6.2
##   5:     5 4.2 4 0  8.9
##  ---                   
## 246:   246 1.1 5 0  6.5
## 247:   247 3.1 4 1  8.7
## 248:   248 3.3 2 0  1.2
## 249:   249 3.6 6 0  9.3
## 250:   250 3.1 3 0  6.2

The distribution of variables in $S$ matches their distribution in $A$. Here are the univariate distributions for each variable in each data set:

summary(A[, 2:5])

##        a             b              c              d       
##  Min.   :0.0   Min.   : 0.0   Min.   :0.00   Min.   : 0.1  
##  1st Qu.:2.3   1st Qu.: 4.0   1st Qu.:0.00   1st Qu.: 6.9  
##  Median :3.0   Median : 5.0   Median :0.00   Median : 9.0  
##  Mean   :3.0   Mean   : 5.1   Mean   :0.32   Mean   : 9.1  
##  3rd Qu.:3.8   3rd Qu.: 6.0   3rd Qu.:1.00   3rd Qu.:11.2  
##  Max.   :6.0   Max.   :13.0   Max.   :1.00   Max.   :18.1

summary(S[, 2:5])

##        a             b              c              d       
##  Min.   :0.1   Min.   : 0.0   Min.   :0.00   Min.   : 0.6  
##  1st Qu.:2.3   1st Qu.: 3.0   1st Qu.:0.00   1st Qu.: 6.6  
##  Median :3.0   Median : 5.0   Median :0.00   Median : 8.6  
##  Mean   :3.0   Mean   : 4.7   Mean   :0.33   Mean   : 8.6  
##  3rd Qu.:3.8   3rd Qu.: 6.0   3rd Qu.:1.00   3rd Qu.:10.5  
##  Max.   :5.2   Max.   :12.0   Max.   :1.00   Max.   :18.1

And here are the correlation matrices for both:

cor(A[, cbind(a, b, c, d)])

##         a       b       c    d
## a  1.0000 -0.0283 -0.0019 0.30
## b -0.0283  1.0000  0.0022 0.72
## c -0.0019  0.0022  1.0000 0.42
## d  0.3034  0.7212  0.4205 1.00

cor(S[, cbind(a, b, c, d)])

##        a     b      c    d
## a  1.000 0.033 -0.028 0.33
## b  0.033 1.000  0.052 0.76
## c -0.028 0.052  1.000 0.39
## d  0.335 0.764  0.388 1.00
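The sampling-with-replacement idea described above can be sketched in a few lines of base R. The data set A below is a made-up stand-in (its columns and distributions are assumptions for illustration), and this is not meant to reproduce the internals of genSynthetic.

```r
set.seed(123)

# A hypothetical "existing" data set
A <- data.frame(
  index = 1:1000,
  a = rnorm(1000, mean = 3, sd = 1),
  b = rpois(1000, lambda = 5)
)

# Draw 250 row indices with replacement, keeping whole records together
rows <- sample(nrow(A), size = 250, replace = TRUE)
S <- A[rows, c("a", "b")]
S$index <- seq_len(nrow(S))   # re-number the synthetic records
rownames(S) <- NULL

# Marginal summaries should be similar, up to sampling noise
colMeans(A[, c("a", "b")])
colMeans(S[, c("a", "b")])
```

Because entire rows are sampled, the relationships between columns carry over along with the marginal distributions.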

To impute or not: the case of an RCT with baseline and follow-up measurements

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 12 Apr 2022 00:00:00 +0000

Under normal conditions, conducting a randomized clinical trial is challenging. Throw in a pandemic and things like site selection, patient recruitment and patient follow-up can be particularly vexing. In any study, subjects need to be retained long enough so that outcomes can be measured; during a period when there are so many potential disruptions, this can become quite difficult. This issue of loss to follow-up recently came up during a conversation among a group of researchers who were troubleshooting challenges they are all experiencing in their ongoing trials. While everyone agreed that missing outcome data is a significant issue, there was less agreement on how to handle this analytically when estimating treatment effects.

For me, this discussion quickly brought to mind two posts I did on missing data, where I reflected on the different missing data mechanisms (MCAR, MAR, and NMAR) and explored when it might be imperative to use multiple imputation as part of the analysis.

In light of the recent conversation, I wanted to revisit this issue of loss to follow-up in the context of a clinical trial where the outcome measure is also collected at baseline (a topic I’ve written about before, here and here) and we can be fairly certain that this baseline measurement will be quite well balanced across the treatment arms.

The data generating process

In my earlier posts on missing data, I described the observed and missing data processes using directed acyclic graphs (DAGs), which allow us to visualize the assumed causal relationships in our model. Here is a DAG for a clinical trial that collects measure $Y$ at baseline ($Y_0$) and again one year later ($Y_1$):

$A$ is the treatment indicator, $A \in \{0,1\}$, $A=1$ if the patient has been randomized to the treatment arm, and $A=0$ under the control arm. $R_Y$ is a missing data indicator set to 1 if there is loss to follow up (i.e., $Y_1$ is not collected), and 0 otherwise. $Y_1^*$ is the observed value of $Y_1$. If $R_Y = 1$, the value of $Y_1^*$ is NA (i.e. missing), otherwise $Y_1 ^*= Y_1$.

In the scenario depicted in this DAG, both $Y_0$ and $A$ potentially influence the (possibly unobserved) outcome $Y_1$ and whether there is loss to follow-up $R_Y$. (I have explicitly left out the possibility that $Y_1$ itself can impact missingness, because this is a much more challenging problem - not missing at random or NMAR.)

The strengths of the relationships between the variables are determined by the parameters $\delta$, $\alpha$, and $\beta$. (I have fixed the direct relationship between $Y_0$ and $Y_1$ to a value of 1, but there is no reason that needs to be so.) The dashed line from $Y_0$ to the causal line connecting $A$ and $Y_1$ which has parameter $\lambda$ reflects the possibility that the treatment effect of $A$ will vary across different levels of the baseline measurement (i.e., there is an interaction between $Y_0$ and $A$).

For the purposes of this simulation, I am assuming this linear relationship:

\[Y_{1i} = Y_{0i} + \delta A_i - \lambda A_i Y_{0i} + e_i, \ \ A_i \in \{0, 1\}\]

I am using $-\lambda$ in order to simulate a situation where patients with lower values of $Y_0$ actually have larger overall treatment effects than those with higher values.

$Y_0$ and $e$ are both normally distributed:

\[ Y_{0i} \sim N(\mu =0, \sigma^2 = 1)\] \[e_i \sim N(\mu =0, \sigma^2 = 0.5)\]

The missing data mechanism is also linear, but on the logistic scale. In this scenario, patients with lower baseline values $Y_0$ are more likely to be lost to follow-up than patients with higher values (assuming, of course, $\alpha > 0$):

\[\text{logit}(P(R_{Yi} = 1)) =-1.5 - \alpha Y_{0i} - \beta A_i\]

Under these assumptions, the probability that a patient with baseline measure $Y_0 = 0$ in the control arm is lost to follow-up is $\frac{1}{1 + exp(1.5)} \approx 18\%$.
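That probability is just the inverse logit of the linear predictor, which is easy to verify in base R. The helper p_missing below is a hypothetical name introduced only for this check.

```r
# Inverse logit of the linear predictor -1.5 - alpha * y0 - beta * a
p_missing <- function(y0, a, alpha, beta) {
  plogis(-1.5 - alpha * y0 - beta * a)
}

# Control-arm patient with y0 = 0 (alpha and beta drop out here)
p_missing(y0 = 0, a = 0, alpha = 1, beta = 0)

# With alpha = 1, lower baseline values mean more loss to follow-up
p_missing(y0 = c(-1, 0, 1), a = 0, alpha = 1, beta = 0)
```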

Data simulation

I am using the simstudy package to simulate data from these models, which allows me to define the data generating process described above. First, let’s load the necessary libraries:

library(simstudy)
library(data.table)
library(mice)

The table def implements the definitions for the data generating process. I’ve created two versions of $Y_1$. The first is the true underlying value of $Y_1$, and the second $Y_{1_{obs}}$ is really $Y_1^*$ from the DAG. At the outset, there are no missing data, so initially $Y_{1_{obs}}$ is just a replicate of $Y_1$:

def <- defData(varname = "y0", formula = 0, variance = 1)
def <- defData(def, "a", formula = "1;1", dist = "trtAssign")
def <- defData(def, "y1", "y0 + ..delta * a - ..lambda * y0 * a", 0.5)
def <- defData(def, "y1_obs", formula = "y1", dist = "nonrandom")

The missing data generating process is defined in table defM:

defM <- defMiss(
    varname = "y1_obs", 
    formula = "-1.5  - ..alpha * y0 - ..beta * a", 
    logit.link = TRUE
)

For this particular simulation, I am assuming $\delta = 1$, $\lambda = 0.8$, $\alpha = 1$, and $\beta = 0$:

delta <- 1
lambda <- 0.8

alpha <- 1
beta <- 0

With all the definitions and parameters set, we are ready to generate the data:

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(1234)

dd <- genData(1200, def)
dmiss <- genMiss(dd, defM, idvars = "id")
dobs <- genObs(dd, dmiss, idvars = "id")

dobs

##         id     y0 a     y1 y1_obs
##    1:    1 -0.068 1  0.365     NA
##    2:    2 -0.786 1  0.842  0.842
##    3:    3  0.154 0 -0.072 -0.072
##    4:    4  0.037 0 -1.593 -1.593
##    5:    5  0.926 0  1.915  1.915
##   ---                            
## 1196: 1196  0.442 1  1.333  1.333
## 1197: 1197  2.363 1  2.385  2.385
## 1198: 1198 -1.104 0 -2.115 -2.115
## 1199: 1199 -1.380 1  0.947  0.947
## 1200: 1200 -1.023 1  1.250     NA

Estimating the treatment effect

Now, with the data in hand, we can estimate the treatment effect. In this case, I will fit three different models. The first assumes that there was no missing data at all (i.e., we had full access to $Y_1$ for all study participants). The second is an analysis using only cases with complete data, which ignores missing data entirely and assumes that the missing data process is MCAR (missing completely at random). The third analysis uses multiple imputation to generate values for the missing cases based on distributions of the observed data - and does this repeatedly to come up with a series of data sets (in this case 20). In this last analysis, a model is fit for each of the 20 data sets, and the results are pooled:

fit_all <- lm(y1 ~ y0 + a, data = dobs)
fit_comp <- lm(y1_obs ~ y0 + a, data = dobs)

imp_dd <- dobs[, -c("id", "y1")]
imp <- mice(imp_dd, m=20, maxit=5, print=FALSE)
fit_imp <- pool(with(imp, lm(y1_obs ~ y0 + a)))
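The pooling step combines the per-data-set results using Rubin's rules. As a rough illustration of what that involves for a single coefficient, here is a base R sketch with made-up estimates and standard errors from three imputed data sets; it shows only the point estimate and total variance, not the degrees-of-freedom adjustments that a full implementation also handles.

```r
# Hypothetical treatment-effect estimates and standard errors
# from m = 3 imputed data sets (made-up numbers)
est <- c(1.0, 1.2, 0.8)
se  <- c(0.1, 0.1, 0.1)
m   <- length(est)

q_bar <- mean(est)              # pooled point estimate
u_bar <- mean(se^2)             # average within-imputation variance
b     <- var(est)               # between-imputation variance
t_var <- u_bar + (1 + 1/m) * b  # total variance

c(estimate = q_bar, std.error = sqrt(t_var))
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the pooled standard error.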

Here is a figure that shows the estimated regression lines for each of the models (shown sequentially in animated form). In all three cases, we are adjusting for baseline measurement $Y_0$, which is a good thing to do even when there is good balance across treatment arms; this tends to reduce standard errors. Also note that I am ignoring the possibility of heterogeneous treatment effects with respect to different levels of $Y_0$ (determined by $\lambda$ in the data generation process); I am effectively estimating the average treatment effect across all levels of $Y_0$.

The analysis based on the full data set (A) recovers the treatment effect parameter quite well, but the complete data analysis (B) underestimates the treatment effect; the imputed analysis (C) does much better.

Estimating the bias of each modeling approach

To conduct a more systematic assessment of the bias associated with each model $m, \ m \in \{A, B, C\},$ I repeatedly simulated data under a range of assumptions about $\lambda$, $\alpha$ and $\beta$ (I fixed $\delta$ since it has no impact on the bias). In total, I assessed 54 scenarios by setting $\lambda = \{0, 0.2, \dots, 1\}$, $\alpha = \{0, 0.5, 1\}$, and $\beta = \{0, 1, 2\}$. (The code for this simulation can be found below in the addendum.)

For each set of assumptions $s, \ s \in \{1, \dots, 54\}$, I generated 5000 data sets with 200 patients and estimated the parameters from all three models for each data set. I was particularly interested in the estimate of the average treatment effect $\hat\delta_{smk}$ (i.e. the average treatment effect for each model $m$ under assumptions $s$ and each iteration $k, \ k\in \{1,\dots,5000\}).$

Using the results from the iterations, I estimated the bias $\hat{\text{B}}_{sm}$ for each set of assumptions $s$ and model $m$ as:

\[\hat{\text{B}}_{sm} =\frac{1}{5000} \sum_{k=1}^{5000} (\hat\delta_{smk} - \delta)\]
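To make the bias calculation concrete, here is a small base R sketch for a single scenario ($\lambda = 0.8$, $\alpha = 1$, $\beta = 0$) comparing the full-data and complete-case models. It is a simplified stand-in for the simstudy-based code in the addendum: it uses far fewer replications, skips the imputation model, and the function and object names are assumptions for illustration.

```r
set.seed(2022)

delta <- 1; lambda <- 0.8; alpha <- 1; beta <- 0
n <- 200; n_rep <- 500   # far fewer replications than the 5000 used in the post

one_rep <- function() {
  y0 <- rnorm(n)
  a  <- sample(rep(0:1, each = n / 2))   # 1:1 randomization
  y1 <- y0 + delta * a - lambda * y0 * a + rnorm(n, sd = sqrt(0.5))
  miss <- rbinom(n, 1, plogis(-1.5 - alpha * y0 - beta * a))
  y1_obs <- ifelse(miss == 1, NA, y1)
  c(all      = unname(coef(lm(y1 ~ y0 + a))["a"]),
    complete = unname(coef(lm(y1_obs ~ y0 + a))["a"]))  # lm drops NA rows
}

est <- t(replicate(n_rep, one_rep()))

# Estimated bias: average deviation of the estimates from the true delta
bias <- colMeans(est - delta)
bias
```

In this scenario the complete-case estimate should come out noticeably below the true value, since the retained patients tend to have higher $Y_0$ and therefore smaller treatment effects.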

The following figure shows $\hat{\text{B}}_{sm}$ for each of the three modeling approaches:

It is clear that if we have no missing data, all the estimates are unbiased. And in this case, it appears that missingness related specifically to treatment arm (determined by parameter $\beta$) does not have much of an impact. However, bias is impacted considerably by both heterogeneous treatment effect (parameter $\lambda$) and missingness related to $Y_0$ (parameter $\alpha$), and especially by the combination of the two.

If missingness is independent of $Y_0$ ($\alpha = 0$), there is no induced bias just using complete data (model B), even with substantial heterogeneity of treatment effect ($\lambda = 1$). With moderate missingness due to $Y_0$ ($\alpha = 0.5$), there is still no bias for the complete data analysis with low heterogeneity. However, bias is introduced here as heterogeneity becomes more pronounced. Using imputation reduces a good amount of the bias. Finally, when missingness is strongly related to $Y_0$, both the complete data and imputed data analysis fare poorly, on average. Although multiple imputation worked well in our initial data set above with $\alpha = 1$, the figure from the repeated simulations suggests that multiple imputation did not perform so well on average at that level. This is probably due to the fact that if there is a lot of missing data, imputation has much less information at its disposal and the imputed values are not so helpful.

Addendum

Here is the code used to generate the iterative simulations:

s_define <- function() {
  
  def <- defData(varname = "y0", formula = 0, variance = 1)
  def <- defData(def, "a", formula = "1;1", dist = "trtAssign")
  def <- defData(def, "y1", 
           formula = "y0 + ..delta * a - ..lambda * y0 * a", variance = 0.5)
  def <- defData(def, "y1_obs", formula = "y1", dist = "nonrandom")
  
  defM <- defMiss(
    varname = "y1_obs", formula = "-1.5  - ..alpha * y0 - ..beta * a", 
    logit.link = TRUE
  )
  
  return(list(def = def, defM = defM))
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  dd <- genData(200, def)
  dmiss <- genMiss(dd, defM, idvars = "id")
  dobs <- genObs(dd, dmiss, idvars = "id")

  return(dobs) #  generated_data is a data.table
}

s_model <- function(generated_data) {
  
  imp_dd <- generated_data[, -c("id", "y1")]
  imp <- mice(imp_dd, m=20, maxit=5, print=FALSE)
  
  a_all <- coef(lm(y1 ~ y0 + a, data = generated_data))["a"]
  a_missing <- coef(lm(y1_obs ~ y0 + a, data = generated_data))["a"]

  fit_imp <- pool(with(imp, lm(y1_obs ~ y0 + a)))
  a_imp <- summary(fit_imp)[3, "estimate"]

  return(data.table(a_all, a_missing, a_imp)) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()
  
  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim, 
      FUN = function(x) s_single_rep(list_of_defs, argsvec), 
      mc.cores = 4)
  )
  
  #--- add summary statistics code ---#
  
  summary_stats <- model_results[, .(
      mean_all = mean(a_all, na.rm = TRUE), 
      bias_all = mean(a_all - delta, na.rm = TRUE), 
      var_all = var(a_all, na.rm = TRUE), 
      
      mean_missing = mean(a_missing, na.rm = TRUE), 
      bias_missing = mean(a_missing - delta, na.rm = TRUE), 
      var_missing = var(a_missing, na.rm = TRUE),
      
      mean_imp = mean(a_imp, na.rm = TRUE), 
      bias_imp = mean(a_imp - delta, na.rm = TRUE), 
      var_imp = var(a_imp, na.rm = TRUE)
    )]
  
  summary_stats <- data.table(t(argsvec), summary_stats)
  
  return(summary_stats) # summary_stats is a data.table
}

#---- specify varying power-related parameters ---#

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

delta <- 1
lambda <- c(0, 0.2, .4, .6, .8, 1)
alpha <- c(0, 0.5, 1)
beta <- c(0, 1, 2)

scenarios <- scenario_list(delta = delta, lambda = lambda, alpha = alpha, beta = beta)

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 5000)))

Simulating time-to-event outcomes with non-proportional hazards

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 29 Mar 2022 00:00:00 +0000

As I mentioned last time, I am working on an update of simstudy that will make generating survival/time-to-event data a bit more flexible. I previously presented the functionality related to competing risks, and this time I’ll describe generating survival data that has time-dependent hazard ratios. (As before, if you want to try this at home, you will need the development version of simstudy, which you can install using devtools::install_github("kgoldfeld/simstudy").)

Constant/proportional hazard ratio

In the current version of simstudy 0.4.0 on CRAN, the data generation process for survival/time-to-event outcomes can include covariates that affect the hazard rate (which is the risk/probability of having an event conditional on not having experienced that event earlier). The ratio of hazards comparing different levels of a covariate is constant across all time points. For example, if we have a single binary covariate $x$, the hazard $\lambda(t)$ at time $t$ is \[\lambda(t|x) = \lambda_0(t) e ^ {\beta x}\] where $\lambda_0(t)$ is a baseline hazard when $x=0$. The ratio of the hazards for $x=1$ compared to $x=0$ is \[\frac{\lambda_0(t) e ^ {\beta}}{\lambda_0(t)} = e ^ \beta,\] so the log of the hazard ratio is a constant $\beta$, and the hazard ratio is always $e^\beta$.
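A short numerical check can make the constant ratio tangible. The sketch below assumes a Weibull survival function of the form $S(t) = \exp(-e^{f} t^{1/\nu})$, where $f$ plays the role of the formula and $\nu$ the shape; that parameterization, and the function name hazard, are assumptions made for illustration.

```r
# Weibull hazard implied by S(t) = exp(-exp(f) * t^(1/shape));
# this parameterization is an assumption made for illustration
hazard <- function(t, f, shape) {
  exp(f) * (1 / shape) * t^(1 / shape - 1)
}

t <- c(10, 50, 100, 200)
beta <- -0.7

h0 <- hazard(t, f = -14.6, shape = 0.35)          # x = 0
h1 <- hazard(t, f = -14.6 + beta, shape = 0.35)   # x = 1

# The ratio is the same at every time point and equals exp(beta)
h1 / h0
exp(beta)
```

The baseline hazard changes a great deal over time, but it cancels out of the ratio.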

Here is a simulated example that assumes a constant log hazard ratio of $-0.7$:

library(simstudy)
library(data.table)
library(survival)

def <- defData(varname = "x", formula = 0.4, dist = "binary")

defS <- defSurv(varname = "death", formula = "-14.6 - 0.7 * x", shape = 0.35)
defS <- defSurv(defS, varname = "censor", scale = exp(13), shape = 0.5)

set.seed(7361)
dd <- genData(500, def)
dd <- genSurv(dd, defS, digits = 2, timeName = "time", censorName = "censor")
dd

##       id x   time event   type
##   1:   1 0 164.98     1  death
##   2:   2 1 247.84     1  death
##   3:   3 0  28.54     1  death
##   4:   4 0 138.05     1  death
##   5:   5 0 228.53     1  death
##  ---                          
## 496: 496 0  79.47     1  death
## 497: 497 1   5.41     0 censor
## 498: 498 1 211.54     1  death
## 499: 499 0 240.73     1  death
## 500: 500 1 256.66     1  death

This is the Kaplan-Meier plot comparing survival curves for cases where $x=0$ with cases where $x=1$, which illustrates what a proportional hazard rate looks like:

The Cox proportional hazards model recovers the correct log hazard ratio:

coxfit <- coxph(formula = Surv(time, event) ~ x, data = dd)

Characteristic	log(HR)¹	95% CI¹	p-value
x	-0.72	-0.92, -0.52	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

Since we know that we used proportional hazards to generate the data, we can expect that a test evaluating the proportional hazards assumption using weighted residuals will confirm that the assumption is met. If the $\text{p-value} < 0.05$, then we would conclude that the assumption of proportional hazards is not warranted. In this case $p = 0.68$, so the model is apparently reasonable (which we already knew):

cox.zph(coxfit)

##        chisq df    p
## x       0.17  1 0.68
## GLOBAL  0.17  1 0.68

Non-constant/non-proportional hazard ratio

When generating data, we may not always want to be limited to a situation where the hazard ratio is constant over all time periods. To facilitate this, it is possible to specify two different data definitions for the same outcome, using the transition field to specify the point at which the second definition replaces the first. (While it would theoretically be possible to generate data for more than two periods, the process is more involved, and has not been implemented.)

In this next case, the risk of death when $x=1$ is lower at all time points compared to when $x=0$, but the relative risk (or hazard ratio) changes at 150 days:

def <- defData(varname = "x", formula = 0.4, dist="binary")

defS <- defSurv(varname = "death", formula = "-14.6 - 1.3 * x", 
  shape = 0.35, transition = 0)
defS <- defSurv(defS, varname = "death", formula = "-14.6 - 0.4 * x", 
  shape = 0.35, transition = 150)
defS <- defSurv(defS, varname = "censor", scale = exp(13), shape = 0.50)

dd <- genData(500, def)
dd <- genSurv(dd, defS, digits = 2, timeName = "time", censorName = "censor")

The survival curve for the sample with $x=1$ has a slightly different shape under this data generation process compared to the previous example under a constant hazard ratio assumption; there is more separation early on (prior to day 150), and then the gap is closed at a quicker rate.

If we ignore the possibility that there might be a different relationship over time, the Cox proportional hazards model gives a single estimate of the log hazard ratio (here -0.84) that falls between the two true values of -1.3 and -0.4:

coxfit <- survival::coxph(formula = Surv(time, event) ~ x, data = dd)

Characteristic	log(HR)¹	95% CI¹	p-value
x	-0.84	-1.0, -0.65	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

However, further inspection of the proportionality assumption should make us question the appropriateness of the model. Since $p<0.05$, we would be wise to see if we can improve on the model.

cox.zph(coxfit)

##        chisq df      p
## x       10.1  1 0.0015
## GLOBAL  10.1  1 0.0015

We might be able to see from the plot where proportionality diverges, in which case we can split the data set into two parts at the identified time point. (In many cases, the transition point or points won’t be so obvious, in which case the investigation might be more involved.) By splitting the data at day 150, we get the desired estimates:

dd2 <- survSplit(Surv(time, event) ~ ., data= dd, cut=c(150),
                 episode= "tgroup", id="newid")

coxfit2 <- survival::coxph(Surv(tstart, time, event) ~ x:strata(tgroup), data=dd2)

Characteristic	log(HR)¹	95% CI¹	p-value
x * strata(tgroup)
x * tgroup=1	-1.5	-1.8, -1.1	<0.001
x * tgroup=2	-0.54	-0.78, -0.29	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

And the diagnostic test of proportionality confirms the appropriateness of the model:

cox.zph(coxfit2)

##                  chisq df   p
## x:strata(tgroup)  1.38  2 0.5
## GLOBAL            1.38  2 0.5

The actual data generation process implemented in simstudy is based on an algorithm described in this paper by Peter Austin.
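One way to build intuition for this kind of data generation is to invert a piecewise cumulative hazard directly. The base R sketch below does that for a single change point, assuming a Weibull cumulative hazard $H(t) = e^{f} t^{1/\nu}$ whose coefficient switches from f1 to f2 at the cut point. It is an illustration of the general idea, not the simstudy code, and the function name r_piecewise is hypothetical.

```r
# Draw event times with linear predictor f1 before `cut` and f2 afterwards,
# by inverting the cumulative hazard H(t) = exp(f) * t^(1/shape) piece by piece
r_piecewise <- function(u, f1, f2, shape, cut) {
  e     <- -log(u)                      # unit-exponential draw
  h_cut <- exp(f1) * cut^(1 / shape)    # cumulative hazard at the cut point
  early <- (e / exp(f1))^shape
  late  <- (pmax(e - h_cut, 0) / exp(f2) + cut^(1 / shape))^shape
  ifelse(e <= h_cut, early, late)
}

set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.4)
u <- runif(n)

time <- r_piecewise(u,
  f1 = -14.6 - 1.3 * x, f2 = -14.6 - 0.4 * x,
  shape = 0.35, cut = 150)

tapply(time, x, median)
```

When $x = 0$ the two linear predictors are identical, so the draws reduce to the ordinary single-period Weibull times, which is a convenient sanity check.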

Reference:

Austin, Peter C. “Generating survival times to simulate Cox proportional hazards models with time‐varying covariates.” Statistics in medicine 31, no. 29 (2012): 3946-3958.

Adding competing risks in survival data generation

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 15 Mar 2022 00:00:00 +0000

I am working on an update of simstudy that will make generating survival/time-to-event data a bit more flexible. There are two biggish enhancements. The first facilitates generation of competing events, and the second allows for the possibility of generating survival data that has time-dependent hazard ratios. This post focuses on the first enhancement, and a follow up will provide examples of the second. (If you want to try this at home, you will need the development version of simstudy, which you can install using devtools::install_github("kgoldfeld/simstudy").)

Competing risks

In standard survival analysis, we are interested in understanding the distribution of time to a single event, such as time to death or time to symptom-free recovery. However, there may be situations where there are at least two possible events, and the occurrence of one precludes the occurrence of the other. In this case, we say that the events are competing risks. In the work that motivated this simstudy enhancement, we are trying to model the time to opioid overdose death for people being released from jail, but there are at least two other competing risks, death from other causes and re-incarceration. In addition, there is censoring, which from a data generation point of view (though maybe not from an analytic perspective), can also be considered a competing risk.

Generating competing risk outcomes under 0.4.0

Previously, it was possible to generate competing risk outcomes, but it had to be done “manually” with additional lines of code or data definitions. It was a bit cumbersome:

library(simstudy)
library(data.table)
library(survival)

First, we would generate a data set with genSurv:

d1 <- defData(varname = "x1", formula = .5, dist = "binary")
d1 <- defData(d1, "x2", .5, dist = "binary")

dS <- defSurv(varname = "event_1", formula = "-12 - 0.1*x1 - 0.2*x2", shape = 0.3)
dS <- defSurv(dS, "event_2", "-12 - 0.3*x1 - 0.2*x2", shape = 0.3)
dS <- defSurv(dS, "event_3", "-12 - 0.4*x1 - 0.3*x2", shape = 0.3)
dS <- defSurv(dS, "censor", "-13", shape = 0.3)

set.seed(2140)

dtCov <- genData(3001, d1)
dtSurv <- genSurv(dtCov, dS)

head(dtSurv)

##    id x1 x2 censor event_1 event_2 event_3
## 1:  1  0  1 56.809  40.321  35.050  48.356
## 2:  2  0  0 62.278  24.865  23.121  19.198
## 3:  3  1  0 25.650  27.898  24.716  49.633
## 4:  4  1  1 51.486  35.065  44.731  40.086
## 5:  5  1  0 54.522  37.042  49.701  34.751
## 6:  6  1  0 42.749  41.419  34.136  43.811

And then we would add the observed time variable with a new data definition and call to addColumns:

f <- "(time==censor)*0 + (time==event_1)*1 + (time==event_2)*2 + (time==event_3)*3"

cdef <- defDataAdd(varname = "time", 
  formula = "pmin(censor, event_1, event_2, event_3)", dist = "nonrandom")
cdef <- defDataAdd(cdef, varname = "event", 
  formula = f, 
  dist = "nonrandom")

dtSurv_final <- addColumns(cdef, dtSurv)
head(dtSurv_final)

##    id x1 x2 censor event_1 event_2 event_3   time event
## 1:  1  0  1 56.809  40.321  35.050  48.356 35.050     2
## 2:  2  0  0 62.278  24.865  23.121  19.198 19.198     3
## 3:  3  1  0 25.650  27.898  24.716  49.633 24.716     2
## 4:  4  1  1 51.486  35.065  44.731  40.086 35.065     1
## 5:  5  1  0 54.522  37.042  49.701  34.751 34.751     3
## 6:  6  1  0 42.749  41.419  34.136  43.811 34.136     2

Streamlined approach

The function addCompRisk can now generate an observed time variable (which will be the first of the four event times). This is done by specifying a timeName argument that will represent the observed time value. The event status is indicated in the field set by the eventName argument (which defaults to “event”). And, if a variable name is indicated in the censorName argument, the censored events automatically have a value of 0.

dtSurv_final <- addCompRisk(dtSurv, 
  events = c("event_1", "event_2", "event_3", "censor"), 
  timeName = "time", censorName = "censor")

head(dtSurv_final)

##    id x1 x2   time event    type
## 1:  1  0  1 35.050     2 event_2
## 2:  2  0  0 19.198     3 event_3
## 3:  3  1  0 24.716     2 event_2
## 4:  4  1  1 35.065     1 event_1
## 5:  5  1  0 34.751     3 event_3
## 6:  6  1  0 34.136     2 event_2

The competing risk data can be plotted using the cumulative incidence functions (rather than the survival curves):

The data generation can be streamlined even further, as there is no need to call addCompRisk at all. The same arguments can be provided directly in the call to genSurv, so everything is done at once. And there is no need to explicitly name the events, as they are extracted from the survival distribution definitions:

dtSurv_final <- genSurv(dtCov, dS, timeName = "time", censorName = "censor")
head(dtSurv_final)

##    id x1 x2   time event    type
## 1:  1  0  1 30.229     3 event_3
## 2:  2  0  0 28.473     1 event_1
## 3:  3  1  0 11.654     0  censor
## 4:  4  1  1 28.248     1 event_1
## 5:  5  1  0 28.524     3 event_3
## 6:  6  1  0 18.255     0  censor

If you don’t want to generate competing risk data, you can set timeName to NULL, or just not specify it:

dtSurv_final <- genSurv(dtCov, dS)
head(dtSurv_final)

##    id x1 x2 censor event_1 event_2 event_3
## 1:  1  0  1 40.276  16.858  42.624  34.609
## 2:  2  0  0 69.382  24.191  37.574  10.783
## 3:  3  1  0 26.637  48.476  54.240  60.507
## 4:  4  1  1 36.697  46.690  36.443  43.195
## 5:  5  1  0 64.498  36.755  39.666  10.935
## 6:  6  1  0 69.150  38.706  44.930  17.668

Follow-up: simstudy function for generating parameters for survival distribution

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 22 Feb 2022 00:00:00 +0000

In the previous post I described how to determine the parameter values for generating a Weibull survival curve that reflects a desired distribution defined by two points along the curve. I went ahead and implemented these ideas in the development version of simstudy 0.4.0.9000, expanding the idea to allow for any number of points rather than just two. This post provides a brief overview of the approach, the code, and a simple example using the parameters to generate simulated data.

The distribution

Just to recap briefly (really, take a glance at the previous post if you haven’t already - this post may make a bit more sense), the times to an event can be drawn from the Weibull distribution by generating a uniform random variable $u$ between 0 and 1 that represents the survival probability. There are two parameters that I am using here - the formula $f$ and shape $\nu$. (There is a third possible scale parameter $\lambda$ that I have set to 1 as a default, so that $f$ alone determines the scale.)

\[T = \left[ \frac{- \text{log}(u) }{\text{exp}(f)} \right] ^ \nu\]
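Here is a small base R sketch of that transformation, using illustrative values for $f$ and $\nu$ (these particular numbers are assumptions, not values from the post). Since $u$ is the survival probability, the draws should satisfy $P(T > t) = \exp(-e^{f} t^{1/\nu})$, which the last lines check empirically.

```r
set.seed(42)

f     <- -17    # illustrative value for the formula parameter
shape <- 0.3    # illustrative value for the shape parameter

u    <- runif(100000)
time <- (-log(u) / exp(f))^shape

# Under this transformation, P(T > t) = exp(-exp(f) * t^(1/shape))
t0 <- 100
c(empirical   = mean(time > t0),
  theoretical = exp(-exp(f) * t0^(1 / shape)))
```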

Determining the parameter values

If we have some idea about how the shape of the survival curve might look, we can use this information to find the parameters that will allow us to draw simulated data from that distribution. When I say “we might have some idea,” I mean that we need to have survival probabilities in mind for specific time points.

Last time, I showed that if we have two time points and their associated probabilities, we can generate parameters that define a curve that passes through both points. In the main body of that post I proposed a simple analytic solution for the parameters, but in the addendum of the previous post, I suggested we can minimize a simple loss function to find the same parameters. I’ve decided to implement the minimization algorithm in simstudy, because it can more flexibly incorporate any number of input points, not just two. Here is the more general loss function for $k$ input points:

\[\sum_{i=1}^k\left[ \nu^* \ (\text{log} \left[ -\text{log}(u_i)\right] - f^*) - \text{log}(T_i) \right]^2\]
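One way to read this loss, which is my own gloss and not something simstudy relies on: writing $x_i = \text{log}\left[-\text{log}(u_i)\right]$, $y_i = \text{log}(T_i)$, and $\alpha = -\nu^* f^*$, the loss is an ordinary least-squares criterion for a straight line with slope $\nu^*$ and intercept $\alpha$:

\[\sum_{i=1}^k\left[ \nu^* \ (x_i - f^*) - y_i \right]^2 = \sum_{i=1}^k\left[ y_i - (\alpha + \nu^* x_i) \right]^2\]

With $k = 2$ the line passes through both points and the loss is zero. With $k > 2$ the fitted curve is a compromise among the points, and $f^* = -\alpha / \nu^*$ as long as the fitted slope is positive.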

Implementing the optimization in R

Below is a version of the function survGetParams that is implemented in simstudy. The key components are the specification of the loss function to be optimized and the use of the base R optim function to find the pair of parameters that minimizes this function. The optim function requires starting values for the parameters, as well as the data points that are to be used to define the curve. In addition, we set a lower boundary on the shape parameter, which cannot be negative.

library(simstudy)
library(data.table)
library(ggplot2)
library(survival)
library(survminer)

survGetParams <- function(points) {
  
  loss_surv <- function(params, points) {
    
    loss <- function(a, params) {
      ( params[2]*(log(-log(a[2])) - params[1]) - log(a[1]) ) ^ 2
    }
    
    sum(sapply(points, function(a) loss(a, params)))
    
  }
  
  r <- optim(
    par = c(1, 1), 
    fn = loss_surv, 
    points = points,
    method = "L-BFGS-B", 
    lower = c(-Inf, 0),
    upper = c(Inf, Inf)
  )
  
  return(r$par)
}

In the first example, we would like to find the parameters for a distribution where 20% of the population has an event (e.g. does not survive) before 100 days and 90% have an event before 200 days. Alternatively, we can say 80% survive until day 100, and 10% survive until day 200:

points <- list(c(100, 0.80), c(200, 0.10))
r <- survGetParams(points)
r

## [1] -17.0065167   0.2969817
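As a cross-check on the optimizer, and only as a sketch: the loss is quadratic in the slope and intercept of a line relating $\text{log}(T)$ to $\text{log}\left[-\text{log}(u)\right]$, so the same parameters can be recovered with lm. This is a substitute for the optim call, not what survGetParams does internally, and fit_by_lm is a hypothetical helper name.

```r
# Least-squares shortcut (an alternative to optim, for checking only)
fit_by_lm <- function(points) {
  m <- do.call(rbind, points)          # column 1 = time, column 2 = survival prob
  x <- log(-log(m[, 2]))
  y <- log(m[, 1])
  cf <- unname(coef(lm(y ~ x)))        # intercept = -shape * f, slope = shape
  c(f = -cf[1] / cf[2], shape = cf[2])
}

points <- list(c(100, 0.80), c(200, 0.10))
r_lm <- fit_by_lm(points)
r_lm
```

The two values should agree with the optim result above to several decimal places.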

We can view the Weibull curve generated by the optimized parameters using the function survParamPlot, which is also implemented in simstudy. When we specify two points, we can find an exact curve:

survParamPlot <- function(f, shape, points, n = 100) {
  
  u <- seq(1, 0.001, length = n)
  
  dd <- data.table::data.table(
    f = f,
    shape = shape,
    T = (-(log(u)/exp(f)))^(shape),
    p = round(1 - cumsum(rep(1/length(u), length(u))), 3)
  )
  
  dpoints <- as.data.frame(do.call(rbind, points))  

  ggplot(data = dd, aes(x = T, y = p)) +
    geom_line(size = 0.8) +
    scale_y_continuous(limits = c(0,1), name = "probability of survival") +
    scale_x_continuous(name = "time") +
    geom_point(data = dpoints, aes(x = V1, y = V2), 
                    pch = 21, fill = "#DCC949", size = 2.5) +
    theme(panel.grid = element_blank(),
          axis.text = element_text(size = 7.5),
          axis.title = element_text(size = 8, face = "bold")
    )
  
}

survParamPlot(f = r[1], shape = r[2], points)

If, for some reason, we have additional information about the target distribution that we want to simulate from, we can re-optimize the parameters. In this case, we’ve added a third constraint that 5% will survive longer than 300 days. With this new set of points, the optimized curve will not necessarily pass through all the points.

points <- list(c(100, 0.80), c(200, 0.10), c(300, 0.05))
r <- survGetParams(points)
survParamPlot(f = r[1], shape = r[2], points)
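To see how far the compromise curve sits from each target, one option is to compare the target probabilities with the survival probabilities implied by the fitted parameters. The sketch below is self-contained, so it refits the three points with a least-squares line on the log scale (a stand-in for survGetParams that minimizes the same loss) and then evaluates $\text{exp}(-\text{exp}(f) \ t^{1/\nu})$ at each target time. The object names are illustrative.

```r
points <- list(c(100, 0.80), c(200, 0.10), c(300, 0.05))
m <- do.call(rbind, points)

# fit log(T) = shape * (log(-log(u)) - f) by least squares
x <- log(-log(m[, 2]))
y <- log(m[, 1])
cf <- unname(coef(lm(y ~ x)))
shape <- cf[2]
f <- -cf[1] / cf[2]

# survival probabilities implied by the fitted curve at the target times
implied <- exp(-exp(f) * m[, 1]^(1 / shape))

data.frame(time = m[, 1], target = m[, 2], implied = round(implied, 3))
```

With three targets that do not fall on a single Weibull curve, the implied column will differ from the target column at one or more time points.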

And finally - here is one last scenario, where we have four target probabilities:

points <- list(c(100, 0.80), c(120, 0.50), c(200, 0.10), c(300, 0.05))
r <- survGetParams(points)
survParamPlot(f = r[1], shape = r[2], points)

Simulating data from a desired distribution

With a target distribution in hand, and a pair of parameters, we are now ready to simulate data. In the first example above, we learned that $f = -17$ and $\nu = 0.3$ define the distribution where 80% survive until day 100, and 10% survive until day 200. We can simulate data from that distribution to see how well we do. And while we are at it, I’ll add in a treatment that improves survival time (note that negative coefficients imply improvements in survival).

defD <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
defS <- defSurv(varname = "time", formula = "-17 - 0.8 * rx", scale = 1, shape = 0.3)

set.seed(11785)
dd <- genData(200, defD)
dd <- genSurv(dd, defS, digits = 0)
dd$status <- 1

dd

##       id rx time status
##   1:   1  1  180      1
##   2:   2  1  189      1
##   3:   3  1  183      1
##   4:   4  1  280      1
##   5:   5  1   84      1
##  ---                   
## 196: 196  0  219      1
## 197: 197  1  273      1
## 198: 198  0  131      1
## 199: 199  1  261      1
## 200: 200  1  203      1

Eyeballing the survival curve for the control arm ($rx = 0$), we can see that this roughly matches the distribution of 80%/100 and 10%/200. Incidentally, the survival curve for the treatment arm has effectively been generated by a distribution with parameters $f=-17.8$ and $\nu = 0.3$.

fit <- survfit( Surv(time, status) ~ rx, data = dd )
ggsurvplot(fit, data = dd, palette = c("#7D9D33","#CD8862"))

Under the assumption of a constant shape (that is, $\nu$ is the same for both arms), the log hazard ratio can be viewed as the shift of the scale parameter (here represented entirely by $f$). While that probably won’t help anyone understand the real world implications of the hazard ratio, it does provide some insight into the underlying data generation process.

coxfit <- coxph(Surv(time, status) ~ rx, data = dd)

Characteristic	log(HR)¹	95% CI¹	p-value
rx	-0.78	-1.1, -0.49	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval
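
To make the link between $f$ and the hazard ratio explicit, here is a short derivation based on the generating formula (my own working, assuming $\lambda = 1$). Solving the formula for $u$ gives the survival function, and differentiating the negative log of the survival function gives the hazard:

\[\begin{aligned} S(t) &= \text{exp}\left[-\text{exp}(f) \ t^{1/\nu}\right] \\ \\ h(t) &= -\frac{d}{dt} \ \text{log} \ S(t) = \text{exp}(f) \ \frac{1}{\nu} \ t^{1/\nu - 1} \end{aligned}\]

If the treatment arm uses $f - 0.8$ in place of $f$ with the same $\nu$, the ratio of the two hazards is $\text{exp}(-0.8)$ at every $t$, so the log hazard ratio is $-0.8$, which is close to the estimate of $-0.78$ from the Cox model.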

Simulating survival outcomes: setting the parameters for the desired distribution

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 08 Feb 2022 00:00:00 +0000

The package simstudy has some functions that facilitate generating survival data using an underlying Weibull distribution. Originally, I added this to the package because I thought it would be interesting to try to do, and I figured it would be useful for me someday (and hopefully some others, as well). Well, now I am working on a project that involves evaluating at least two survival-type processes that are occurring simultaneously. To get a handle on the analytic models we might use, I’ve started to try to simulate a simplified version of the data that we have.

At some point, I’d like to describe the motivating simulations in more detail. But here, I want to focus more generally on the underlying survival data generating process used in simstudy, looking in particular at how to identify the parameters so that the simulated data will come close to matching a desired distribution.

Generating survival data in simstudy

To put all of this in context, this is a simple example of how we can generate survival data using simstudy and then take a look at it using the survival and survminer packages:

library(simstudy)
library(data.table)
library(ggplot2)
library(survminer)
library(survival)

def <- defSurv(varname = "time", formula = -13.7, scale = 1, shape = 0.49)

set.seed(1134)
dd <- genData(101)
dd <- genSurv(dd, def, digits = 0)
dd$status <- 1

dd

##       id time status
##   1:   1  604      1
##   2:   2 1012      1
##   3:   3 1338      1
##   4:   4 1009      1
##   5:   5  335      1
##  ---                
##  97:  97  756      1
##  98:  98  971      1
##  99:  99  381      1
## 100: 100  952      1
## 101: 101  697      1

fit <- survfit( Surv(time, status) ~ 1, data = dd )
ggsurvplot(fit, data = dd, palette = "#7D9D33")

So, we have generated survival data, and it all looks great. However, where did those parameters (formula, scale, and shape) in defSurv come from? That’s what we’re focusing on here.

Weibull-Cox proportional hazard data generation process

This underlying data generating process draws from a Weibull distribution and satisfies the requirements of a Cox proportional hazards model. The approach was drawn directly from this Bender, Augustin, and Blettner paper, so head over there if you really want the details.

The times to survival $T$ are generated using three parameters, $\lambda$ (scale), $\nu$ (shape), and $f$, which is really $\mathbf{\beta^\prime x}$ from a Cox proportional hazard model that may include a covariate vector $\mathbf{x}$. In the examples here, there will only be an intercept.

\[T = \left[ \frac{- \lambda \ \text{log}(u) }{\text{exp}(f)} \right] ^ \nu\]

A single instance $T$ is “drawn” from the Weibull distribution by generating $u$ from the uniform $U(0,1)$ distribution. It will be the case that a proportion $1-u$ of the survival times $T$ will fall below the value of $T$ determined by $u$; this will be helpful later when we need to generate data with specific distributions in mind.

It turns out that we don’t really need the scale parameter $\lambda$, because it can be absorbed into $f$, so in all the examples that follow, we’ll set $\lambda = 1$, which leaves us with

\[T = \left[ \frac{- \text{log}(u) }{\text{exp}(f)} \right] ^ \nu\]

Visualizing the survival curves

Weibull distribution data generation is extremely flexible, and can provide an infinite number of distributions of a wide range of shapes. Here are some examples, but first, to get things started, here are two more packages needed to run all the code here:

library(data.table)
library(ggplot2)

The function get_surv generates data for the survival curve. It is deterministic in that it does not generate draws of $u$, but calculates a specific $T$ for each value of $u$ distributed evenly between 0 and 1.

get_surv <- function(f, shape, n = 100) {
  
  u <- seq(1, 0.001, length = n)
  
  dd <- data.table(
    f = f,
    shape = shape,
    T = (-(log(u)/exp(f)))^(shape),
    p = round(1 - cumsum(rep(1/length(u), length(u))), 3)
  )
  
  return(dd)

}

get_surv(-10, .3, n = 10)

##       f shape        T   p
##  1: -10   0.3  0.00000 0.9
##  2: -10   0.3 10.56988 0.8
##  3: -10   0.3 13.26785 0.7
##  4: -10   0.3 15.31471 0.6
##  5: -10   0.3 17.11875 0.5
##  6: -10   0.3 18.85288 0.4
##  7: -10   0.3 20.64903 0.3
##  8: -10   0.3 22.68619 0.2
##  9: -10   0.3 25.40811 0.1
## 10: -10   0.3 35.86613 0.0

Here are 16 arbitrary distributions using four different values of $f$ and $\nu$. Each panel represents a different value of $\nu$, ranging from 0.16 to 0.22.

f <- c(-26, -27, -28, -29)
shape <- c(0.16, .18, .20, .22)

eg <- expand.grid(f=f, shape=shape)
eg <- asplit(eg, MARGIN = 1)

l <- lapply(eg, function(x) get_surv(x[1], x[2]))
l <- rbindlist(l)

ggplot(data = l, aes(x = T, y = p)) +
  geom_line(aes(group = f, color = factor(f))) +
  ylim(0,1) +
  xlim(0, 800) +
  facet_grid ( ~ shape) +
  theme(panel.grid = element_blank(),
        axis.text = element_text(size = 7.5)) +
  scale_color_manual(
    values = c("#7D9D33", "#CED38C", "#DCC949", "#BCA888", "#CD8862", "#775B24"),
    name = "  f"
  )

Generating a particular distribution

The interpretation of the parameters is a bit opaque. If we have covariates embedded in $f$ the coefficients do have pretty clear interpretations as hazard ratios. But the intercept term (remember, I have set $\lambda = 1$) really defines the scale. So how do we go about selecting values for $f$ and $\nu$ to get the desired distribution?

If we can reasonably characterize the desired distribution by two points on the survival curve, the task actually becomes remarkably easy. By this, I mean we pick two time points and a probability of survival for each time point. For example, we may want a distribution with 80% survival until day 180 and 20% survival at day 365. Or, in a second scenario, we may want to have 90% survival at day 180 and 40% survival at day 365. Here is an animation of how we might find the curves by adjusting $f$ and $\nu$ (see addendum for code to generate this):

Determining the parameter values

If we have two points in mind, there is an extremely simple analytic solution that can derive $f$ and $\nu$ so that we can identify a Weibull-Cox based survival curve that is guaranteed to pass through both points.

For any time point $T$, we have from the equation above

\[\begin{aligned} \text{log}(T) &= \nu \ \text{log} \left[ \frac{-\text{log}(u)}{\text{exp}(f)} \right] \\ \\ &= \nu \ \left ( \text{log}\left[-\text{log}(u)\right] - \text{log}\left[ \text{exp}(f) \right] \right ) \\ \\ &= \nu \ (\text{log}\left[-\text{log}(u)\right] - f) \end{aligned}\]

If we have desired points $(T_1,u_1)$ and $(T_2,u_2)$, then we can write a simple system of two equations with two unknowns, $f^*$ and $\nu^*$, the target parameters:

\[ \text{log}(T_1) = \nu^* \ (\text{log} \left[ -\text{log}(u_1)\right] - f^*)\]

\[ \text{log}(T_2) = \nu^* \ (\text{log} \left[ -\text{log}(u_2)\right] - f^*)\]

Using simple algebra, we can rearrange terms to provide solutions for $f$ and $\nu$ (I’ll spare you the extra steps):

\[\nu^* = \frac{\text{log}(T_2) - \text{log}(T_1)}{\text{log}(-\text{log}(u_2)) - \text{log}(-\text{log}(u_1))}\]

\[f^* = \text{log}(-\text{log}(u_1)) - \frac{1}{\nu^*} \ \text{log}(T_1)\]

Scenario three

If we want a curve where there is 95% survival at day 180 and 40% survival at the end of 2 years, the desired parameters are

\[\nu^* = \frac{\text{log}(365*2) - \text{log}(180)}{\text{log}(-\text{log}(0.40)) - \text{log}(-\text{log}(0.95))} = \frac{1.400}{2.883} = 0.486\]

\[f^* = \text{log}(-\text{log}(0.95)) - \frac{1}{0.486} \text{log}(180) = -2.970 - 10.685 = -13.655\]
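
The two formulas translate directly into a few lines of R. This is a minimal sketch of the analytic solution for exactly two points; the function name get_params_2pt and its argument names are my own and are not part of simstudy.

```r
# Analytic two-point solution (get_params_2pt is an illustrative name)
get_params_2pt <- function(T1, u1, T2, u2) {
  shape <- (log(T2) - log(T1)) / (log(-log(u2)) - log(-log(u1)))
  f <- log(-log(u1)) - log(T1) / shape
  c(f = f, shape = shape)
}

# Scenario three: 95% survival at day 180, 40% survival at two years
p3 <- get_params_2pt(T1 = 180, u1 = 0.95, T2 = 365 * 2, u2 = 0.40)
round(p3, 3)

# the implied curve should pass through both target points
exp(-exp(p3[["f"]]) * c(180, 730)^(1 / p3[["shape"]]))
```

Because the function does not round $\nu^*$ before computing $f^*$, its value of $f^*$ may differ slightly from the hand calculation above, which plugs in the rounded 0.486.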

We can generate points along this curve and then plot them:

dsurv <- get_surv(-13.655, 0.486, n = 1000)
dsurv

##             f shape          T     p
##    1: -13.655 0.486    0.00000 0.999
##    2: -13.655 0.486   26.55994 0.998
##    3: -13.655 0.486   37.20774 0.997
##    4: -13.655 0.486   45.32308 0.996
##    5: -13.655 0.486   52.13693 0.995
##   ---                               
##  996: -13.655 0.486 1714.16472 0.004
##  997: -13.655 0.486 1748.87893 0.003
##  998: -13.655 0.486 1792.58585 0.002
##  999: -13.655 0.486 1852.33947 0.001
## 1000: -13.655 0.486 1950.02084 0.000

Of course, what we really want to do is sample from a distribution defined by these parameters in order to conduct simulation experiments, not just generate and look at deterministic functions. This would include adding covariates and possibly censoring. But all of that remains for another day.

Reference:

Bender, Ralf, Thomas Augustin, and Maria Blettner. “Generating survival times to simulate Cox proportional hazards models.” Statistics in medicine 24, no. 11 (2005): 1713-1723.

Addendum

Here is the code that is used to generate the animated plot, which is done with the gganimate package:

library(gganimate)

sdd <- list()

sdd[[1]] <- get_surv(-6,  1, n = 1000)
sdd[[2]] <- get_surv(-7.566, 0.783 , n = 1000)
sdd[[3]] <- get_surv(-9.981, 0.587, n = 1000)
sdd[[4]] <- get_surv(-11.137, 0.521, n = 1000)
sdd[[5]] <- get_surv(-13.814, 0.417, n = 1000)
sdd[[6]] <- get_surv(-16.014, 0.358, n = 1000)
sdd[[7]] <- get_surv(-6,  1, n = 1000)
sdd[[8]] <- get_surv(-8.125, 0.777 , n = 1000)
sdd[[9]] <- get_surv(-10.183, 0.619, n = 1000)
sdd[[10]] <- get_surv(-12.481, 0.490, n = 1000)
sdd[[11]] <- get_surv(-14.595, 0.414, n = 1000)
sdd[[12]] <- get_surv(-18.139, 0.327, n = 1000)

k <- length(sdd)

sdds <- lapply(1:k, function(x) sdd[[x]][ , c("iter", "color") := list(x, "black")])
sdds[[k/2]][, color := "green"]
sdds[[k]][, color := "green"]
sdd <- rbindlist(sdds)

targets <- data.table(iter = 1:k, days1 = rep(180, k), days2 = rep(365, k),
  p1 = rep(c(.8, .9), each = k/2), p2 = rep(c(.2, .4), each = k/2))

dt_anot <- sdd[, .SD[1,], keyby = iter]
dt_anot[iter <= (k/2), targets := 1]
dt_anot[iter > (k/2), targets := 2]
dt_anot[, targets := factor(targets, labels = c("Scenario one", "Scenario two"))]
dt_anot[, color := "black"]
dt_anot[iter == (k/2), color := "green"]
dt_anot[iter == k, color := "green"]

a <- ggplot() +
  geom_point(data = targets, aes(x = days1, y=p1), pch = 1, size = 2) +
  geom_point(data = targets, aes(x = days2, y=p2), pch = 1, size = 2) +
  geom_point(data = sdd, aes(x = T, y = p, group = p, color = color), size = .2) +
  geom_vline(xintercept = c(180, 365), lty = 1, size = .3, color = "grey70") +
  geom_text(x = 750, y = .68, size = 5.5, hjust = "left", fontface = "bold",
            aes(label = targets), data = dt_anot) +
  geom_text(x = 750, y = .6, size = 5.5, hjust = "left",
            aes(label = paste("f:", f), color = color), data = dt_anot) +
  geom_text(x = 750, y = .54, size = 5.5, hjust = "left",
            aes(label = paste("shape:", shape), color = color), data = dt_anot) +
  scale_x_continuous(limits = c(0, 1250), 
                     breaks = c(seq(0, 1250, by = 250), 180, 365), name = "time") +
  scale_color_manual(values = c("black","#7D9D33")) +
  scale_y_continuous(limits = c(0.05, 0.995), 
                     breaks = c(0.2, 0.4, 0.6, 0.8), name = "probability of survival") +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  transition_states(iter, state_length = 1, transition_length = 1) 

animate(a, duration = 24, fps = 10, height = 350, width = 550)

A note on where the parameters for the animation came from

You may be wondering where the parameters used in the animation come from. I really wanted to generate a series of curves that started at a fair distance from the true value and converged to the right spot. My idea was to use a simple loss function that involved the unknown parameters $f$ and $\nu$, which would be optimized (in this case minimized) at the correct values (the same as those derived above). If I could recover the interim values of the optimization algorithm, those would provide a sequence of parameters and curves that converge on the true values.

The loss function is a simple squared loss, which is the sum of the squared loss for both points that define the curve:

\[\left[ \hat{\nu} \ (\text{log} \left[ -\text{log}(u_1)\right] - \hat{f}) - \text{log}(T_1) \right]^2 + \left[ \hat{\nu} \ (\text{log} \left[ -\text{log}(u_2)\right] - \hat{f}) - \text{log}(T_2) \right]^2\]

This is implemented as function fx, which is to be optimized using function optim. I know this may not be the best optimization option in R, but given that this is quite a simple problem, it should suffice. In the function $x[1]$ represents $f$, and $x[2]$ represents $\nu$.

fx <- function(x, p, times) {
  (x[2]*(log(-log(p[1])) - x[1]) - log(times[1])) ^ 2 + 
  (x[2]*(log(-log(p[2])) - x[1]) - log(times[2])) ^ 2 
}

The optimization requires starting values for $f$ and $\nu$. I chose values of $f$ and $\nu$ that would locate the initial curve between the two target points. $\nu$ is constrained to be non-negative (and $f$ is unconstrained). The key here is that the trace option is set so that interim values of $f$ and $\nu$ are reported. I am not showing the full output here, because it is quite lengthy. I only used four interim values (plus the starting and ending values) to create the animation. The final output includes the values of $f$ and $\nu$ that optimize the quadratic loss:

optim(
  par = c(-(log((180+365)/2) - log(-log(.5))), 1), 
  fn = fx, 
  p = c(.8, .2), 
  times = c(180, 365),
  method = "L-BFGS-B", 
  lower = c(-Inf, 0),
  upper = c(Inf, Inf),
  control= list(trace = 5)
)

## $par
## [1] -16.0137061   0.3577952
## 
## $value
## [1] 1.065948e-13
## 
## $counts
## function gradient 
##       33       33 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Incidentally, our analytic formulas give us

\[\nu^* = \frac{\text{log}(365) - \text{log}(180)}{\text{log}(-\text{log}(0.2)) - \text{log}(-\text{log}(0.8))} = 0.3577951\]

\[f^* = \text{log}(-\text{log}(0.8)) - \frac{1}{0.3577951} \text{log}(180) =-16.01371\]

simstudy update: ordinal data generation that violates proportionality

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 25 Jan 2022 00:00:00 +0000

Version 0.4.0 of simstudy is now available on CRAN and GitHub. This update includes two enhancements (and at least one major bug fix). genOrdCat now includes an argument to generate ordinal data without an assumption of cumulative proportional odds. And two new functions defRepeat and defRepeatAdd make it a bit easier to define multiple variables that share the same distribution assumptions.

Ordinal data

In simstudy, it is relatively easy to specify multinomial distributions that characterize categorical data. Order becomes relevant when the categories take on meanings related to strength of opinion or agreement (as in a Likert-type response) or frequency. A motivating example could be when a response variable takes on four possible values: (1) strongly disagree, (2) disagree, (3) agree, (4) strongly agree. There is a natural order to the response possibilities.

If we are interested in comparing responses for two groups (say an exposed group and an unexposed group), we can look at the cumulative odds of the response at each level $x$ of the response:

\[\small{\frac{P(response > x|exposed)}{P(response \le x|exposed)} \ \ vs. \ \frac{P(response > x|unexposed)}{P(response \le x|unexposed)}},\]

The comparison is often a ratio of those two odds - the cumulative odds ratio - or a log of the odds ratio.

genOrdCat

The way to generate ordered categorical data in simstudy is with the function genOrdCat. The probability of responses or categories is specified for the reference group - in this case the unexposed. The effect of exposure (and any other covariates) is expressed in an adjustment variable (here z). In the data generating process defined below, we are saying that the cumulative odds for the exposed is about 1/2 the odds for the unexposed at each level of response x. This is the proportional odds assumption, and on the log(OR) scale this is $log(0.5) = -0.7$.

baseprobs <- c(0.35, 0.25, 0.20, 0.20)

defA <- defData(varname = "exposed", formula = "1;1", dist = "trtAssign")
defA <- defData(defA, varname = "z", formula = "-0.7 * exposed", dist = "nonrandom")

set.seed(230)

dd <- genData(25000, defA)
dx <- genOrdCat(dd, adjVar = "z", baseprobs, catVar = "response")
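
To make the proportional odds adjustment concrete, here is a base R sketch of the category probabilities that this definition implies for the exposed group. It assumes the convention used in this post, where the cumulative odds are $P(response > x)/P(response \le x)$ and z is added to the log cumulative odds at every level. It is meant to illustrate the arithmetic, not to reproduce the internals of genOrdCat.

```r
baseprobs <- c(0.35, 0.25, 0.20, 0.20)
z <- -0.7                                   # log cumulative OR for the exposed

cum_base <- cumsum(baseprobs)[1:3]          # P(response <= x), unexposed
lcodds_base <- log((1 - cum_base) / cum_base)

lcodds_exp <- lcodds_base + z               # same shift at every level
cum_exp <- 1 / (1 + exp(lcodds_exp))        # back to P(response <= x), exposed
probs_exp <- diff(c(0, cum_exp, 1))

round(probs_exp, 2)                         # implied category probabilities
exp(lcodds_exp - lcodds_base)               # cumulative OR, constant across levels
```

The implied probabilities should be close to the observed proportions for the exposed group in the table that follows.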

Here is a manual calculation of the observed probabilities and odds (for a more visual interpretation of all of this, see this description):

dp <- dx[, .(n = .N), keyby = .(exposed, response)]
dp[, p := n/sum(n), keyby = .(exposed)]
dp[, cump := round(cumsum(p),3), keyby = .(exposed)]
dp[, codds := (1-cump)/cump]
dp[, lcodds := log(codds)]
dp

##    exposed response    n    p cump codds lcodds
## 1:       0        1 4406 0.35 0.35  1.84   0.61
## 2:       0        2 3168 0.25 0.61  0.65  -0.43
## 3:       0        3 2471 0.20 0.80  0.24  -1.41
## 4:       0        4 2455 0.20 1.00  0.00   -Inf
## 5:       1        1 6616 0.53 0.53  0.89  -0.12
## 6:       1        2 2860 0.23 0.76  0.32  -1.14
## 7:       1        3 1638 0.13 0.89  0.12  -2.08
## 8:       1        4 1386 0.11 1.00  0.00   -Inf

We can calculate the cumulative odds ratio at each response level …

dc <- dcast(dp, response ~ exposed, value.var = "codds")
dc [, cOR := `1`/`0`]
dc

##    response    0    1  cOR
## 1:        1 1.84 0.89 0.48
## 2:        2 0.65 0.32 0.49
## 3:        3 0.24 0.12 0.51
## 4:        4 0.00 0.00  NaN

and the log(cOR):

dc <- dcast(dp, response ~ exposed, value.var = "lcodds")
dc [, lcOR := `1` - `0`]
dc

##    response     0     1  lcOR
## 1:        1  0.61 -0.12 -0.73
## 2:        2 -0.43 -1.14 -0.71
## 3:        3 -1.41 -2.08 -0.67
## 4:        4  -Inf  -Inf   NaN

Estimating the parameters of the model using function clm in the ordinal package, we can recover the original parameters quite well. Note that the threshold coefficients are log cumulative odds at each response level for the reference group, the unexposed.

library(ordinal)
clmFit <- clm(response ~ exposed, data = dx)
summary(clmFit)

## formula: response ~ exposed
## data:    dx
## 
##  link  threshold nobs  logLik    AIC      niter max.grad cond.H 
##  logit flexible  25000 -31750.20 63508.39 4(0)  6.05e-07 1.9e+01
## 
## Coefficients:
##         Estimate Std. Error z value Pr(>|z|)    
## exposed  -0.7146     0.0236   -30.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -0.6017     0.0176   -34.3
## 2|3   0.4305     0.0173    24.9
## 3|4   1.3941     0.0201    69.3

A plot of the modeled cumulative probabilities (the lines) shows that the proportionality assumption fits the observed data (the points) quite well.

Non-proportional odds

With the recent update, it is now possible to generate data that violate the proportionality assumption by using new arguments npVar and npAdj. npVar indicates the variable(s) for which the proportional odds assumption is violated, and npAdj is a vector that specifies the extent and direction of the violation at each level of the response (on the logit scale). (Since the log cumulative odds for the highest level response are infinite, the final value in the vector has no impact.)

dx <- genOrdCat(dd, baseprobs = baseprobs, catVar = "response", adjVar = "z",
                npVar = "exposed", npAdj = c(1.0, 0, -1.0, 0))

dp <- dx[, .(n = .N), keyby = .(exposed, response)]
dp[, p := n/sum(n), keyby = .(exposed)]
dp[, cump := round(cumsum(p),3), keyby = .(exposed)]
dp[, codds := (1-cump)/cump]
dp[, lcodds := log(codds)]
dp

##    exposed response    n     p cump codds lcodds
## 1:       0        1 4351 0.348 0.35 1.874   0.63
## 2:       0        2 3145 0.252 0.60 0.667  -0.41
## 3:       0        3 2444 0.196 0.80 0.258  -1.36
## 4:       0        4 2560 0.205 1.00 0.000   -Inf
## 5:       1        1 3506 0.280 0.28 2.571   0.94
## 6:       1        2 5784 0.463 0.74 0.346  -1.06
## 7:       1        3 2629 0.210 0.95 0.048  -3.03
## 8:       1        4  581 0.046 1.00 0.000   -Inf

We can see that the cumulative OR for response level 2 remains close to 0.5, but the cORs shift away from 0.5 at response levels 1 and 3.

dc <- dcast(dp, response ~ exposed, value.var = "codds")
dc [, cOR := `1`/`0`]
dc

##    response    0     1  cOR
## 1:        1 1.87 2.571 1.37
## 2:        2 0.67 0.346 0.52
## 3:        3 0.26 0.048 0.19
## 4:        4 0.00 0.000  NaN

On the log odds scale, it is possible to see the direct effect of the values specified in the adjustment vector npAdj. The expected log cumulative OR at response level 1 is $1.0 - 0.7 = 0.3$, and at level 3 it is $-1.0 - 0.7 = -1.7$; the observed values are close to these:

dc <- dcast(dp, response ~ exposed, value.var = "lcodds")
dc [, lcOR := `1` - `0`]
dc

##    response     0     1  lcOR
## 1:        1  0.63  0.94  0.32
## 2:        2 -0.41 -1.06 -0.66
## 3:        3 -1.36 -3.03 -1.68
## 4:        4  -Inf  -Inf   NaN
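
The arithmetic behind those values can be sketched in base R. This assumes that z and the level-specific npAdj values are simply added to the log cumulative odds of the reference group, which is consistent with the output above but is not taken from the genOrdCat source code.

```r
baseprobs <- c(0.35, 0.25, 0.20, 0.20)
z <- -0.7
npAdj <- c(1.0, 0, -1.0, 0)

cum_base <- cumsum(baseprobs)[1:3]
lcodds_base <- log((1 - cum_base) / cum_base)

# level-specific shift: the last element of npAdj is never used
lcodds_exp <- lcodds_base + z + npAdj[1:3]
cum_exp <- 1 / (1 + exp(lcodds_exp))

round(lcodds_exp - lcodds_base, 2)   # expected log cumulative ORs by level
round(diff(c(0, cum_exp, 1)), 3)     # implied category probabilities, exposed
```

One practical point this makes visible: the adjusted cumulative probabilities still need to be increasing across levels, so very large values in npAdj could produce an invalid distribution.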

The lack of proportionality is confirmed by a plot of the model fit with a proportional odds assumption along with the observed cumulative proportions. Since the model imposes proportionality, the observed points no longer lie along the prediction line:

Generating multiple variables with a single definition

defRepeat is a new function that allows us to specify multiple versions of a variable based on a single set of distribution assumptions. (There is a similar function defRepeatAdd to be used for definitions when adding data to an existing data set.) The function will add nVars variables to the data definition table, each of which will be specified with a single set of distribution assumptions. The names of the variables will be based on the prefix argument and the distribution assumptions are specified as they are in the defData function. Calls to defRepeat can be integrated with calls to defData.

def <- defRepeat(nVars = 4, prefix = "g", formula = "1/3;1/3;1/3", 
   variance = 0, dist = "categorical")
def <- defData(def, varname = "a", formula = "1;1", dist = "trtAssign")
def <- defRepeat(def, 3, "b", formula = "5 + a", variance = 3, dist = "normal")
def <- defData(def, "y", formula = "0.10", dist = "binary")

def

##    varname     formula variance        dist     link
## 1:      g1 1/3;1/3;1/3        0 categorical identity
## 2:      g2 1/3;1/3;1/3        0 categorical identity
## 3:      g3 1/3;1/3;1/3        0 categorical identity
## 4:      g4 1/3;1/3;1/3        0 categorical identity
## 5:       a         1;1        0   trtAssign identity
## 6:      b1       5 + a        3      normal identity
## 7:      b2       5 + a        3      normal identity
## 8:      b3       5 + a        3      normal identity
## 9:       y        0.10        0      binary identity

genData(5, def)

##    id g1 g2 g3 g4 a  b1  b2  b3 y
## 1:  1  1  3  3  1 1 8.7 8.2 7.2 0
## 2:  2  1  2  2  1 0 2.0 4.6 2.8 0
## 3:  3  2  2  2  1 1 3.2 6.7 4.8 0
## 4:  4  1  2  2  1 1 6.3 6.6 8.5 0
## 5:  5  2  3  1  2 0 5.2 5.1 9.5 0

Including uncertainty when comparing response rates across clusters

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 18 Jan 2022 00:00:00 +0000

Since this is a holiday weekend here in the US, I thought I would write up something relatively short and simple since I am supposed to be relaxing. A few weeks ago, someone presented me with some data that showed response rates to a survey that was conducted at about 30 different locations. The team that collected the data was interested in understanding if there were some sites that had response rates that might have been too low. To determine this, they generated a plot that looked something like this:

It looks like a few sites are far enough below the target threshold of 75% to merit concern. The question is, is this concern justified? Is this the best we can do to draw such a conclusion?

Actually, the data I’ve shown here are simulated under the assumption that each site has the same 80% underlying probability of response, so in truth, there is no need to be concerned that some sites are under performers; if they fell short, it was only because they were unlucky. The problem with the plot is that it ignores any uncertainty that might highlight this. I thought it would be fun to show a couple of ways how we might estimate that uncertainty for each site, and then plot those solutions.

Data simulation

The data simulation has two key elements. First, the size of the sites (i.e. the total number of possible responses) is assumed to have a negative binomial distribution with a mean $\mu$ of 80 and dispersion parameter $d$ set at 0.3; the average size of the sites is 80 with standard deviation of $\sqrt{\mu + \mu^2*d} = 44.7$. This is important, because the estimates for the smaller sites should reflect more uncertainty. Second, the number of responses at each site has a binomial distribution with a probability of response of 0.80. I am using a logit link, and the log odds is $log(.8/.2) = 1.386$. In the last step of the data generation, I’m calculating the observed probability of response $p$:

library(ggplot2)
library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggdist)
library(gtsummary)

def <- defData(varname = "n", formula = 80, variance = .3, dist="negBinomial")
def <- defData(def, varname = "y", formula = 1.386, 
          variance = "n", link = "logit", dist = "binomial")
def <- defData(def, varname = "p", formula = "y/n", dist = "nonrandom")

set.seed(4601)
dd <- genData(30, def, id = "site")
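
As a sanity check on the site sizes, the standard deviation quoted above can be verified with base R. This sketch assumes that the simstudy dispersion parameter $d$ corresponds to 1/size in the mean parameterization of rnbinom, which matches the variance formula $\mu + \mu^2 d$ but should be read as my assumption rather than a statement about simstudy internals.

```r
mu <- 80
d <- 0.3

sd_theory <- sqrt(mu + mu^2 * d)     # sqrt(2000)

set.seed(2022)
n_sim <- rnbinom(100000, size = 1 / d, mu = mu)

c(theory = sd_theory, simulated_mean = mean(n_sim), simulated_sd = sd(n_sim))
```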

We can fit a generalized linear model with just an intercept to show that we can recover the log odds used to generate the data. Note that in the glm modeling statement, we are modeling the responses and non-responses in aggregate form, as opposed to individual 1’s and 0’s:

fit1 <- glm(cbind(y, n - y) ~ 1, data = dd, family = "binomial")

tbl_regression(fit1, intercept = TRUE)

Characteristic	log(OR)¹	95% CI¹	p-value
(Intercept)	1.3	1.2, 1.4	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

It is easy to recover the estimated probability by extracting the parameter estimate for the log odds (lodds) and converting it to a probability using

\[p = \frac{1}{1 + e^{(-lodds)}}\]

lOR <- coef(summary(fit1))[1]
1/(1+exp(-lOR))

## [1] 0.78
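
Base R also provides this transformation directly as plogis (with qlogis as its inverse), which could be used in place of the hand-written expression. A small sketch using the log odds from the data generation rather than the fitted model:

```r
lodds <- log(0.8 / 0.2)              # 1.386, the log odds used to generate the data

p_manual <- 1 / (1 + exp(-lodds))    # the formula shown above
p_plogis <- plogis(lodds)            # built-in inverse logit

c(manual = p_manual, plogis = p_plogis, back_to_lodds = qlogis(p_plogis))
```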

And here is a 95% confidence interval for the estimated probability:

ci <- data.table(t(confint(fit1)))
setnames(ci, c("l95_lo", "u95_lo"))
ci[, .(1/(1+exp(-l95_lo)), 1/(1+exp(-u95_lo)))]

##      V1  V2
## 1: 0.76 0.8

For completeness, here is the observed probability:

with(dd, sum(y)/sum(n))

## [1] 0.78

Site-specific probabilities

So far what we’ve done doesn’t really help us with getting at the site-level estimates. One way to do this is to fit the same model, but with site-specific intercepts.

fit2 <- glm(cbind(y, n - y) ~ factor(site) - 1, data = dd, family = "binomial")

Just as before we can get point estimates and 95% confidence intervals for each location based on the model’s estimated site-specific log odds.

sites <- rownames(coef(summary(fit2)))
p_est <- 1/(1 + exp(-coef(summary(fit2))[,"Estimate"]))

ci <- data.table(confint(fit2)) 
setnames(ci, c("l95_lo", "u95_lo"))
ci[, `:=`(l95_p = 1/(1+exp(-l95_lo)), u95_p = 1/(1+exp(-u95_lo)))]

Plotting the point estimates with confidence intervals gives us a slightly different picture than the initial bar plot. The outliers at the bottom all have confidence intervals that cross the desired 75% threshold, suggesting that any differences might be due to chance.

dp <- data.table(sites, p_est, ci[, .(l95_p, u95_p)])

setkey(dp, p_est)
dp[, index := .I]

ggplot(data = dp, aes(x = p_est, y = index)) +
  geom_vline(xintercept = 0.75, color = "white", size = .8) +
  geom_point(size = .8, color = "#23518e") +
  geom_segment(aes(x = l95_p, xend = u95_p, yend = index),
               color = "grey40", size = .2) +
  theme(panel.grid = element_blank(),
        axis.text.y= element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(size = 11, face = "bold")) +
  scale_x_continuous(limits = c(0.2,1), breaks = seq(0.2,1, by = 0.2),
                     name = "response rate") +
  ylab("site")

Bayesian estimation

This problem also lends itself very nicely to a hierarchical Bayesian approach, which is the second estimation method that we’ll use here. Taking this approach, we can assume that each site $s$ has its own underlying response probability $\theta_s$, and these probabilities are drawn from a common Beta distribution (where values range from 0 to 1) with unknown parameters $\alpha$ and $\beta$:

\[\theta_s \sim Beta(\alpha, \beta), \ \ s \in \{1,2,\dots,30\}\]

The model is implemented easily in Stan. The output of the model is the joint posterior distribution of $\alpha$, $\beta$, the $\theta_s\text{'s}$, and $\mu$. $\mu$ is the overall mean response rate implied by the Beta distribution parameters $\alpha$ and $\beta$:

\[\mu = \frac{\alpha}{\alpha + \beta}\]

Here is the Stan implementation:

data {
  int<lower=0> S;
  array[S] int<lower=0> y; // numerator
  array[S] int<lower=0> n; // total observations (denominator)
}

parameters {
  array[S] real<lower=0,upper=1> theta;
  real<lower=0> alpha;
  real<lower=0> beta;
}

model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  
  theta ~ beta(alpha, beta);
  
  y ~ binomial(n, theta);
}

generated quantities {
  real mu;
  mu = alpha/(alpha + beta);
}

The Stan model is compiled and sampled using the cmdstanr package. I’m generating 20,000 samples (using 4 chains), following a warm-up of 1,000 samples in each of the chains.

mod <- cmdstan_model("code/binom.stan")

data_list <- list(S = nrow(dd), y = dd$y, n = dd$n)

fit <- mod$sample(
  data = data_list,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 5000,
  show_messages = FALSE
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.5 seconds.
## Chain 3 finished in 0.5 seconds.
## Chain 4 finished in 0.5 seconds.
## Chain 2 finished in 0.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.5 seconds.
## Total execution time: 0.6 seconds.

Here are summary statistics for the key parameters $\alpha$, $\beta$, and $\mu$. I’ve also included estimates of $\theta_s$ for two of the sites:

fit$summary(c("alpha", "beta", "mu","theta[1]", "theta[3]"))

## # A tibble: 5 × 10
##   variable   mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 alpha    28.2   27.9   6.13   6.24   18.6   38.7    1.00   10819.   11066.
## 2 beta      8.48   8.37  1.87   1.89    5.58  11.7    1.00   10217.   11486.
## 3 mu        0.768  0.769 0.0158 0.0156  0.742  0.794  1.00   18130.   16541.
## 4 theta[1]  0.756  0.758 0.0424 0.0427  0.683  0.823  1.00   28870.   15269.
## 5 theta[3]  0.797  0.799 0.0389 0.0388  0.730  0.858  1.00   25294.   14662.

And finally, we can plot the site-specific estimates of $\theta_s$, showing the median and 95% credible intervals of the posterior distribution. The biggest difference between this plot and the one above based on the generalized linear model is that the medians shrink towards the common median and the credible intervals are narrower. Even the smaller sites have narrower credible intervals, because the estimates are pooling information across the sites.

In the end, we would draw the same conclusions either way: we have no reason to believe that any sites are underperforming.

post_array <- fit$draws()

df <- data.frame(as_draws_rvars(fit$draws(variables = "theta")))
df$index <- rank(median(df$theta))

ggplot(data = df, aes(dist = theta, y = index)) +
  geom_vline(xintercept = 0.75, color = "white", size = .8) +
  stat_dist_pointinterval(.width = c(.95), 
                          point_color = "#23518e",
                          interval_color = "grey80",
                          interval_size = 1,
                          point_size = .4) +
  theme(panel.grid = element_blank(),
        axis.text.y= element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(size = 11, face = "bold")) +
  scale_x_continuous(limits = c(0.2,1), breaks = seq(0.2,1, by = 0.2),
                     name = "response rate") +
  ylab("site")
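
To see where the shrinkage comes from, here is a rough sketch based on the conjugate beta-binomial update. It treats $\alpha$ and $\beta$ as fixed at their posterior means from the summary above (a simplification, since the Stan model estimates them jointly with the $\theta_s$), and the two sites are made up for illustration. Conditional on $\alpha$ and $\beta$, the posterior mean for a site is $(\alpha + y_s)/(\alpha + \beta + n_s)$:

```r
alpha <- 28.2   # posterior mean of alpha (from the summary above)
beta <- 8.48    # posterior mean of beta
mu <- alpha / (alpha + beta)

# Two hypothetical sites with the same observed rate but different sizes
y <- c(12, 120)
n <- c(20, 200)

p_obs <- y / n
p_shrunk <- (alpha + y) / (alpha + beta + n)

round(data.frame(n, p_obs, p_shrunk, mu), 3)
```

Both sites have an observed rate of 0.60, but the estimate for the smaller site is pulled further towards the overall mean, because the prior carries more weight relative to 20 observations than to 200.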

I did conduct some simulations where there were actually true underlying differences between the sites, but to keep this post more manageable, I will not include that here - I leave that data generation to you as an exercise.

Skeptical Bayesian priors might help minimize skepticism about subgroup analyses

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 04 Jan 2022 00:00:00 +0000

Over the past couple of years, I have been working with an amazing group of investigators as part of the CONTAIN trial to study whether COVID-19 convalescent plasma (CCP) can improve the clinical status of patients hospitalized with COVID-19 and requiring noninvasive supplemental oxygen. This was a multi-site study in the US that randomized 941 patients to either CCP or a saline solution placebo. The overall findings suggest that CCP did not benefit the patients who received it, but if you drill down a little deeper, the story may be more complicated than that.

Of course, it is the “drilling down” part that gets people (and biostatisticians) a little nervous. Once we get beyond the primary analysis, all bets are off. If we look hard enough at the data, we may eventually find something that is interesting to report on. But, just because we find something with the data in hand, it does not mean that we would find it again in another data set from another study. Any conclusions we draw may be unwarranted:

Source: XKCD

The CONTAIN trial was conducted under very difficult circumstances, where the context was rapidly changing over time. Particularly challenging was the fact that new therapies were introduced for hospitalized patients over the course of the trial. And when we looked at the data, we noticed that while all patients had poor clinical outcomes in the first several months of the study, CCP appeared to offer some benefits. Later on in the study, when corticosteroids and remdesivir were standard of care, overall patient outcomes were dramatically improved, and CCP was no longer providing any benefits. This was a very interesting finding that we felt merited some discussion.

Not surprisingly, we have received some pushback, suggesting that this finding is a classic case of regression to the mean. Normally, I would not have been comfortable presenting those findings, particularly not in a highly visible journal article. But we had used a Bayesian modelling framework with quite skeptical prior distribution assumptions to evaluate the primary outcome and the exploratory outcomes, so we felt that while we needed to be cautious in how we presented the results, these were not green jelly bean findings. Given the strong biological plausibility, we felt quite strongly about adding these findings to the growing body of literature about CCP.

In this post, I am sharing a series of simulations to explore how conservative our conservative approach really is. This is another look at assessing Type I error rates, a frequentist notion, in the context of a Bayesian study design.

The subgroup-specific treatment effect

Let’s say there is a data generation process for individual $i$ with a (binary) outcome $Y_i$ with some (binary) treatment or exposure $A_i$ that looks like

\[log\left(\frac{P(Y_i=1)}{P(Y_i=0)}\right ) = \alpha + \delta A_i.\]

The log-odds is dependent only on the level of exposure $A$.

Let’s say that we’ve also measured some characteristic $G$, which is a categorical variable with three levels. While the primary aim of the study is to estimate $\delta$, the log-odds ratio comparing treated with controls, we might also be interested in a subgroup analysis based on $G$. That is, is there a unique group-level treatment effect for any level of $G$, $G \in \{1,2,3\}$? In terms of the model, this would be

\[ log\left(\frac{P(Y_i=1|G_i=g)}{P(Y_i=0|G_i=g)}\right ) = \alpha_g + \delta_g A_i, \ \ g \in \{1,2,3\}.\]

In the case here, we actually know that $\alpha_g = \alpha$ and $\delta_g = \delta$ for all $g$ because the data generation process is independent of $G$. However, due to sampling error it is quite possible that we will observe some differences in the data.

Simulation

We start by simulating this case of a single grouping with three potential subgroup analyses. First, here are the libraries used to create the examples in this post:

library(simstudy)
library(data.table)
library(posterior)
library(bayesplot)
library(cmdstanr)
library(gtsummary)
library(paletteer)
library(ggplot2)
library(ggdist)

Data generation

I’m generating 1000 subjects, with 500 in each treatment arm $A$. About a third fall into each level of $G$, and the binary outcome $Y$ takes a value of $1$ with probability $10\%$ for all subjects, regardless of their treatment or group level:

def <- defData(varname = "g", formula = "1/3;1/3;1/3", dist = "categorical")
def <- defData(def, "a", formula = "1;1", dist = "trtAssign")
def <- defData(def, "y", formula = "0.10", dist = "binary")

RNGkind("L'Ecuyer-CMRG")
set.seed(67)

dd <- genData(1000, def)
dd

##         id g a y
##    1:    1 3 1 0
##    2:    2 3 1 0
##    3:    3 2 1 0
##    4:    4 2 0 0
##    5:    5 3 0 0
##   ---           
##  996:  996 2 0 0
##  997:  997 1 0 0
##  998:  998 1 1 0
##  999:  999 3 0 0
## 1000: 1000 3 1 0

We can fit a simple logistic regression model to estimate $\delta$, the overall effect of the treatment. We see that the estimate of the OR is very close to 1, suggesting the odds of $Y=1$ are similar for both groups, so there is no apparent treatment effect. (Note that the exponentiated intercept is an estimate of the odds of $Y=1$ for the control arm. The data generation process assumed $P(Y=1) = 0.10$, so the odds are $0.1/0.9 = 0.11$.)

fitglm <- glm(y ~ a, data = dd, family = "binomial")
tbl_regression(fitglm, intercept = TRUE, exponentiate = TRUE)

Characteristic	OR¹	95% CI¹	p-value
(Intercept)	0.11	0.08, 0.14	<0.001
a	1.02	0.67, 1.55	>0.9
¹ OR = Odds Ratio, CI = Confidence Interval

So, we could not infer that for the entire group treatment $A$ has any effect. But, maybe for some subgroups there is a treatment effect? We can fit a generalized linear model that allows the intercept and effect estimate to vary by level of $G$, and assess whether this is the case for subgroups defined by $G$.

fitglm <- glm(y ~ factor(g) + a:factor(g) - 1, data = dd, family = "binomial")
tbl_regression(fitglm, exponentiate = TRUE)

Characteristic	OR¹	95% CI¹	p-value
factor(g)
1	0.09	0.05, 0.15	<0.001
2	0.11	0.07, 0.18	<0.001
3	0.12	0.08, 0.20	<0.001
factor(g) * a
1 * a	1.23	0.57, 2.76	0.6
2 * a	1.24	0.64, 2.45	0.5
3 * a	0.69	0.32, 1.44	0.3
¹ OR = Odds Ratio, CI = Confidence Interval

While there is more variability in the estimates, we still wouldn’t conclude that there is a treatment effect within any level of $G$.

The Bayesian model

Since the purpose of this post is to illustrate how an appropriately specified Bayesian model can provide slightly more reliable estimates, particularly in the case where there really are no underlying treatment effects, here is a Bayes model that estimates subgroup-level intercepts and treatment effects:

\[Y_{ig} \sim Bin(p_{ig})\]

\[log\left(\frac{p_{ig}}{1-p_{ig}}\right) = \alpha_g + \delta_gA_i, \ \ g \in \{1,2,3\} \]

The prior distribution assumptions for the parameters $\alpha_g$ and $\delta_g$ are

\[\begin{aligned} \alpha_g &\sim N(\mu = 0, \sigma = 10), \ \ g \in \{1,2,3\} \\ \delta_g &\sim N(\mu=\delta, \sigma = 0.3537), \ \ g \in \{1,2,3\} \\ \delta &\sim N(\mu = 0, \sigma = 0.3537) \end{aligned}\]

Note that the variance of the $\delta_g\text{'s}$ around $\delta$ has been specified, but it could be estimated. However, since there are very few levels of $G$, estimation of the variance can be slow; to speed the simulations, I’ve chosen a quicker path by specifying a pretty informative prior.
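
To get a feel for how informative that prior is, here is a small sketch that translates the standard deviation of 0.3537 on the log-odds scale into a range for the odds ratio. The interpretation is an inference on my part rather than something stated in the model specification: a normal prior with this standard deviation, centered at zero, puts about 95% of its mass on odds ratios between roughly 0.5 and 2. For $\delta$ this is a range for the overall OR; for each $\delta_g$ it is a range relative to the overall effect.

```r
prior_sd <- 0.3537

# Central 95% prior interval on the log-odds scale, then on the OR scale
lo_hi <- qnorm(c(0.025, 0.975), mean = 0, sd = prior_sd)
exp(lo_hi)

# The standard deviation that would put the 97.5th percentile at OR = 2
log(2) / qnorm(0.975)
```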

Fitting the Bayes model

The Stan code that implements this model can be found in the addendum. To estimate the model, the data need to be passed as a list, and here is a function to convert the R data into the proper format:

listdat <- function(dx, grpvar) {
  
  dx[, grp := factor(get(grpvar))]
  
  N <- dx[, .N]
  L <- dx[, nlevels(grp)]
  y <- dx[, y]
  a <- dx[, a]
  grp <- dx[, as.numeric(grp)]
  
  list(N = N, L = L, y = y, a = a, grp = grp)
}

After compiling the program, samples from the posterior are drawn using four chains. There will be a total of 12000 samples (not including the warm-up samples):

mod <- cmdstan_model("extra/simulation.stan")

fitbayes <- mod$sample(
    data = listdat(dd, "g"),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 1500,
    iter_sampling = 3000
)

## Running MCMC with 4 parallel chains...
## 
## Chain 3 finished in 5.3 seconds.
## Chain 2 finished in 5.7 seconds.
## Chain 1 finished in 5.9 seconds.
## Chain 4 finished in 6.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.8 seconds.
## Total execution time: 6.2 seconds.

The estimates are quite similar to the glm estimates, though the ORs are pulled slightly towards 1 as a result of the informative prior. This is going to be the trick that ultimately protects the Type I error rate from completely blowing up.

fitbayes$summary(c("Odds_g","OR_g"))

## # A tibble: 6 × 10
##   variable    mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Odds_g[1] 0.0932 0.0910 0.0237 0.0230 0.0586 0.136  1.00    8636.    9489.
## 2 Odds_g[2] 0.118  0.116  0.0258 0.0254 0.0800 0.165  1.00    8924.    8875.
## 3 Odds_g[3] 0.115  0.112  0.0262 0.0260 0.0757 0.161  1.00    9734.    9569.
## 4 OR_g[1]   1.16   1.11   0.345  0.317  0.689  1.80   1.00    8077.    8824.
## 5 OR_g[2]   1.17   1.13   0.320  0.300  0.724  1.77   1.00    7922.    8641.
## 6 OR_g[3]   0.884  0.848  0.256  0.239  0.527  1.36   1.00    8817.    8620.

And here is a plot of the posterior distributions for the treatment effect at each subgroup defined by the levels of $G$:

OR_df <- data.frame(as_draws_rvars(fitbayes$draws(variables = "OR_g")))

p_data <- with(OR_df, data.frame(
  cat = c("c1", "c2", "c3"),
  OR_g = OR_g
))

ggplot(data = p_data, aes(dist = OR_g, y = cat)) +
  geom_vline(xintercept = 1, color = "grey80", size = .3) +
  stat_dist_halfeye(fill = palettes_d$awtools$a_palette[6], position="dodge") +
  theme(panel.grid = element_blank()) +
  ylab("category") +
  scale_x_continuous(limits = c(0, 2), name = "OR")

It is easy to see that the 95% credible intervals all include the value of 1, no treatment effect, so we wouldn’t be tempted to conclude that there is any treatment effect. We can also check manually to see if at least one of the credible intervals excludes 1. The answer is still “no.”

with(OR_df, any((quantile(OR_g, .025) > 1) | (quantile(OR_g, .975) < 1)))

## [1] FALSE
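
The same check can be written without rvars. As a sketch, suppose the posterior draws are stored in an ordinary matrix with one column per subgroup; the draws below are made-up, evenly spaced values (not output from the model), chosen so that the second interval excludes 1.

```r
# Hypothetical draws: rows are posterior samples, columns are subgroups
draws <- cbind(
  c1 = seq(0.6, 1.8, length.out = 1000),
  c2 = seq(1.1, 2.0, length.out = 1000)
)

# 95% interval for each subgroup (rows are the two quantiles)
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))

# Flag subgroups whose interval excludes an OR of 1
excludes_1 <- ci[1, ] > 1 | ci[2, ] < 1
excludes_1
any(excludes_1)
```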

Increasing the number of subgroups

As the jelly beans make clear, things can really start to go awry when we start to investigate many possible subgroups. This can either be a single characteristic (like color) that has many, many levels, each of which can be a subgroup, or this can be many different characteristics that each have a few different levels. In our trial, we had the latter, all based on baseline data collection (and all reported as categories). These included health status, age, blood type, time between COVID symptom onset, medications, quarter of enrollment, and others.

My interest here is to see how fast the Type I error rate increases as the number of subgroups increases. I consider that a Type I error has occurred if any of the subgroups would be declared what I am calling potentially “interesting”. I say “interesting” rather than definitive because, while the results might suggest that treatment effects are stronger in a particular subgroup, we do need to be aware that these are exploratory analyses.

To explore Type I error rates, I will generate 20 categories, each of which has three levels. I can then use the model estimates from each of the subgroup analyses to evaluate how many times I would draw the wrong conclusion.
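
As a rough benchmark for what to expect, here is a back-of-the-envelope sketch. If $k$ tests were independent and each was conducted at the 0.05 level, the probability of at least one false positive would be $1 - 0.95^k$. Independence is an assumption made only for this illustration; the subgroup analyses here all reuse the same outcomes and treatment assignments, so the simulated rates need not match these numbers.

```r
# Family-wise error rate for k independent tests at level alpha
fwer <- function(k, alpha = 0.05) 1 - (1 - alpha)^k

n_cat <- c(1, 2, 4, 8, 12, 20)   # number of categories
k <- 3 * n_cat                   # three subgroup tests per category

data.frame(categories = n_cat, tests = k, fwer = round(fwer(k), 2))
```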

Generating multiple categories

To illustrate how I code all of this, I am starting with a case where there are four categories, each with three levels. In the data set, these categories are named g1, g2, g3, and g4:

genRepeatDef <- function(nvars, prefix, formula, variance, dist, link = "identity") {
  varnames <- paste0(prefix, 1 : nvars)
  data.table(varname = varnames, 
             formula = formula, 
             variance = variance, 
             dist = dist, link = link)
}

def <- genRepeatDef(4, "g", "1/3;1/3;1/3", 0, "categorical")
def <- defData(def, "a", formula = "1;1", dist = "trtAssign")
def <- defData(def, "y", formula = "0.10", dist = "binary")

def

##    varname     formula variance        dist     link
## 1:      g1 1/3;1/3;1/3        0 categorical identity
## 2:      g2 1/3;1/3;1/3        0 categorical identity
## 3:      g3 1/3;1/3;1/3        0 categorical identity
## 4:      g4 1/3;1/3;1/3        0 categorical identity
## 5:       a         1;1        0   trtAssign identity
## 6:       y        0.10        0      binary identity

A single data set based on these definitions looks like this:

RNGkind("L'Ecuyer-CMRG")
set.seed(67) #4386212 83861 7611291

dd <- genData(1000, def)
dd

##         id g1 g2 g3 g4 a y
##    1:    1  3  3  2  1 0 0
##    2:    2  3  3  1  3 1 1
##    3:    3  2  3  2  1 1 0
##    4:    4  2  1  1  1 0 0
##    5:    5  3  1  2  3 0 0
##   ---                     
##  996:  996  2  1  2  1 0 0
##  997:  997  1  1  2  2 1 0
##  998:  998  1  1  2  1 0 0
##  999:  999  3  2  1  1 1 0
## 1000: 1000  3  3  3  3 1 0

The function fitmods estimates both the glm and stan models for a single category. The models provide subgroup estimates of treatment effects, just as the example above did:

fitmods <- function(dx, grpvar) {
  
  # GLM
  
  dx[, grp := factor(get(grpvar))]
  fitglm <- glm(y ~ grp + a:grp - 1, data = dx, family = "binomial")
  
  pvals <- coef(summary(fitglm))[, "Pr(>|z|)"]

  lpval <- length(pvals)
  freq_res <- any(pvals[(lpval/2 + 1) : lpval] < 0.05)
  
  # Bayes
  
  dat <- listdat(dx, grpvar)
  
  fitbayes <- mod$sample(
    data = dat,
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 1500,
    iter_sampling = 3000
  )
  
  OR_df <- data.frame(as_draws_rvars(fitbayes$draws(variables = "OR_g")))
  bayes_res <- with(OR_df, 
    any((quantile(OR_g, .025) > 1) | (quantile(OR_g, .975) < 1)))
  
  # Return results

  list(type1_dt = data.table(var = grpvar, bayes_res, freq_res),
       OR_post = data.frame(var = grpvar, cat = paste0("c",1:3), OR_df)
  )
  
}

In this case, I am calling the function fitmods for each of the four categorical groupings g1 through g4:

res <- lapply(paste0("g", 1:4), function(a) fitmods(dd, a))

## Running MCMC with 4 parallel chains...
## 
## Chain 2 finished in 5.3 seconds.
## Chain 1 finished in 5.4 seconds.
## Chain 4 finished in 5.5 seconds.
## Chain 3 finished in 5.7 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.5 seconds.
## Total execution time: 5.7 seconds.
## Running MCMC with 4 parallel chains...
## 
## Chain 2 finished in 5.5 seconds.
## Chain 1 finished in 5.6 seconds.
## Chain 3 finished in 5.6 seconds.
## Chain 4 finished in 5.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.5 seconds.
## Total execution time: 5.6 seconds.
## Running MCMC with 4 parallel chains...
## 
## Chain 4 finished in 5.5 seconds.
## Chain 2 finished in 5.6 seconds.
## Chain 1 finished in 5.8 seconds.
## Chain 3 finished in 5.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.7 seconds.
## Total execution time: 5.9 seconds.
## Running MCMC with 4 parallel chains...
## 
## Chain 2 finished in 5.8 seconds.
## Chain 1 finished in 5.8 seconds.
## Chain 3 finished in 5.8 seconds.
## Chain 4 finished in 6.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 5.9 seconds.
## Total execution time: 6.1 seconds.

One of the elements returned by fitmods is a data.table of summary results for each of the four categorical groupings. If any of the effect estimates for one or more of the three subgroup levels in a category was deemed “interesting” (based on a p-value < 0.05 for the glm model, and the value 1 falling outside the 95% credible interval for the stan model), then the function returned a value of 1 or TRUE. In this case, at least one of the subgroups within g4 would have been declared interesting based on both the glm and stan models; at least one of the subgroups within g2 would have been deemed interesting, but only based on the glm model:

type1_dt <- rbindlist(lapply(res, function(x) x$type1_dt))
type1_dt

##    var bayes_res freq_res
## 1:  g1     FALSE    FALSE
## 2:  g2     FALSE     TRUE
## 3:  g3     FALSE    FALSE
## 4:  g4      TRUE     TRUE

We can see this visually for the stan models by looking at the density plots for each subgroup within each category:

dc <- do.call("rbind", lapply(res, function(x) x$OR_post))

ggplot(data = dc, aes(dist = OR_g, y = cat)) +
  geom_vline(xintercept = 1, color = "grey80", size = .3) +
  stat_dist_halfeye(fill = palettes_d$awtools$a_palette[7], alpha = .8) +
  theme(panel.grid = element_blank()) +
  xlab("OR") +
  facet_wrap(~ var)

We calculate the first occurrence of an interesting subgroup by looking across g1 through g4. This will be useful for figuring out the Type I error rates for different numbers of categories. In this case, the first interesting subgroup based on the stan model is found when there are four categories; for the glm model, the first interesting subgroup is with two categories.

first_true <- sapply(type1_dt[, c(2,3)], function(x) match(TRUE, x))
first_true

## bayes_res  freq_res 
##         4         2

Simulation study results

Now, with that background, here are the results based on 1050 simulated data sets (because I had access to 75 cores on a high performance computing cluster) with 20 categorical groups of three levels each. For each data set $r, \ r \in \{1,\dots,1050\}$, I determined the first occurrence of an “interesting” finding among the 20 categories for each model, and stored this in $F_{br}$ and $F_{fr}$ for the Bayes and frequentist models, respectively. $T_m(g)$ is the error rate for the model type $m$, $m \in \{b, f\}$ with $g$ number of categories, and

\[ T_m(g) = \frac{\sum_{r=1}^{1050} I(F_{mr} \le g)}{1050} \]
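
Here is a minimal sketch of that calculation. The vector `first_hit` is a hypothetical stand-in for the $F_{mr}$ values from one model, with `NA` (what `match` returns when there is no `TRUE`) indicating that no category produced an “interesting” finding in that data set.

```r
# Share of data sets with a first "interesting" finding at or before g
type1_rate <- function(first_hit, g_max) {
  sapply(seq_len(g_max), function(g) mean(!is.na(first_hit) & first_hit <= g))
}

# Made-up first occurrences for four simulated data sets
first_hit <- c(2, NA, 4, 1)

type1_rate(first_hit, g_max = 4)
```

Because the indicator $I(F_{mr} \le g)$ can only switch on as $g$ grows, the resulting error rate never decreases as categories are added.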

Here is a plot of the Type I error rates calculated for different numbers of categories. With the frequentist model (based on p-values) the error rates get quite large quite quickly, exceeding 50% by the time we reach 8 categories. In comparison, the error rates under the Bayes model with skeptical prior assumptions are held in check quite a bit better.

Even for the Bayes approach, however, the error rate is close to 20% for 12 categories. So, we still need to be careful in drawing conclusions. In the case that we do find some potentially interesting results (which we did in the case of the CCP CONTAIN trial), readers certainly have a right to be skeptical, but there is no reason to completely dismiss the findings out of hand. These sorts of findings suggest that more work needs to be done to better understand the nature of the treatment effects.

Addendum

data {
  int<lower=0> N;                    // number of observations
  int<lower=1> L;                    // number of levels
  array[N] int<lower=0,upper=1> y;   // binary outcome for individual
  array[N] int<lower=0,upper=1> a;   // treatment arm for individual
  array[N] int<lower=1,upper=L> grp; // grp for individual
}

parameters {
  vector[L] alpha_g;           // group effect
  array[L] real delta_g;       // group treatment effects
  real delta;                  // overall treatment effect
}

transformed parameters{ 
  
  vector[N] yhat;

  for (i in 1:N)  
    yhat[i] = alpha_g[grp[i]] + a[i] * delta_g[grp[i]];
}

model {
  
  // priors
  
  alpha_g ~ normal(0, 10);
  delta_g ~ normal(delta, 0.3537);
  delta ~ normal(0, 0.3537);
  
  // outcome model
  
  for (i in 1:N)
    y[i] ~  bernoulli_logit(yhat[i]);
}

generated quantities {
  
  array[L] real<lower = 0> OR_g;
  array[L] real<lower = 0> Odds_g;

  real<lower = 0> OR;

  for (i in 1:L) {
    OR_g[i] = exp(delta_g[i]);   
    Odds_g[i] = exp(alpha_g[i]);
  }
  
  OR = exp(delta);
  
}

Controlling Type I error in RCTs with interim looks: a Bayesian perspective

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 Dec 2021 00:00:00 +0000

Recently, a colleague submitted a paper describing the results of a Bayesian adaptive trial where the research team estimated the probability of effectiveness at various points during the trial. This trial was designed to stop as soon as the probability of effectiveness exceeded a pre-specified threshold. The journal rejected the paper on the grounds that these repeated interim looks inflated the Type I error rate, and increased the chances that any conclusions drawn from the study could have been misleading. Was this a reasonable position for the journal editors to take?

My colleague’s experience resonated with me, as I’ve been thinking a bit about how frequentist concepts like Type I error rates and statistical power should be considered in the context of studies that are using Bayesian designs and modeling. Although these frequentist probabilities are not necessarily a natural logical or philosophical fit with the Bayesian approach, many reviewers, funders, and regulators will often require that the Bayesian design be justified by confirming control of Type I error at a pre-specified level.

My first inclination, of course, was to do a quick simulation to see if the journal’s concerns had any merit. So, I generated lots of data sets with two treatment arms but no treatment effect and simulated interim looks for each data set; my goal was to see how frequently we would be misled by the findings to draw unwarranted conclusions. Spoiler alert: the editor had reasons to be concerned; however, if we make some modifications to the evaluation criteria, we may be able to alleviate some of those concerns. This post documents the process and shows some results.

A little philosophy

Any post describing Bayesian models, p-values, and Type I errors - even one focused on simulation - would be remiss without at least some discussion of underlying philosophical differences between the two approaches. To start, this statement succinctly describes a key conflict:

The Bayesian approach is based on determining the probability of a hypothesis with a model using an “a priori” probability that is then updated based on data. On the contrary, [in] classical hypothesis testing … the error-I type is the probability of wrongly refusing the null hypothesis when it is true. Thus, [frequentist logic] is completely different from Bayesian logic (since probability is referred to making a mistake, not to the hypothesis itself).

So, the Bayesian approach is concerned with estimating $P(hypothesis \ | \ data)$, whereas the frequentist p-value is an estimate of $P(data \ | \ null \ hypothesis)$. These are quite different animals, and it isn’t obvious that they can be reconciled. Indeed, as Frank Harrell points out, there is a logical inconsistency with assuming that a particular (null) hypothesis is true and then trying to draw conclusions about that very same hypothesis:

The probability of a treatment not being effective is the probability of “regulator’s regret.” One must be very clear on what is conditioned upon (assumed) in computing this probability. Does one condition on the true effectiveness or does one condition on the available data? Type I error conditions on the treatment having no effect … Can one quantify evidence for making a wrong decision if one assumes that all conclusions of non-zero effect are wrong up front because $H_0$ was assumed to be true? Aren’t useful error probabilities the ones that are not based on assumptions about what we are assessing but rather just on the data available to us? [emphasis added]

In addition to Harrell’s post, there is a paper by Greenland et al that describes how statistical tests, confidence intervals, and statistical power can easily be misinterpreted and abused. However, these points of view have not been fully adopted by the scientific community more broadly, and Type I error rates and statistical power are still the predominant way of evaluating study designs.

This paper by Ryan et al acknowledges these realities while asking whether “we need to adjust for interim analyses in a Bayesian adaptive trial design?” The conclusion is “yes” if the goal is indeed to satisfy operating characteristics defined within the frequentist paradigm - and I will get to that in a second - but they conclude, maybe a little wistfully, that

if we avoid this dichotomisation [reject or fail to reject] and simply report the posterior probability of benefit, then we could potentially avoid having to specify the type I error of a Bayesian design.

But until more reviewers and funders accept this, investigators who opt for a Bayesian approach will likely still be required to square this circle.

Simulating interim looks

To assess the Type I error, I’ve used a relatively simple data generating process and estimation model. There is a continuous outcome $Y$ that is normally distributed, and the mean is entirely a function of the treatment arm assignment $Z$. When $Z_i = 0$, the subject is in the control arm and the mean is $\alpha$; when $Z_i = 1$, the subject is in the treatment arm, and the mean is $\alpha + \beta$. The standard deviation for both groups is assumed to be the same, $\sigma_s$:

\[ Y_i \sim N(\alpha + \beta Z_i, \sigma_s) \]

To fit the model, we have to assume prior distributions for the parameters $\alpha$, $\beta$, and $\sigma_s$; here is the assumption for $\beta$, which is the treatment effect, the parameter of primary interest:

\[ \beta \sim t_\text{student}(df = 3, \mu = 0, \sigma = 10) \]

In the model specification below, I have allowed the prior standard deviation of $\beta$ to be specified as an argument, so that we can explore the impact of different assumptions on the operating characteristics. Here is the Stan code to implement the model:

data {
  int<lower=0> N;
  int<lower=0,upper=1> rx[N];
  vector[N] y;
  real p_mu;
  real p_sigma;
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

transformed parameters {
  real yhat[N];
  
  for (i in 1:N) 
    yhat[i] = alpha + beta * rx[i];
}

model {
  alpha ~ student_t(3, 0, 10);
  beta ~ student_t(3, p_mu, p_sigma);
  
  y ~ normal(yhat, sigma);
}

And here is a call to compile the code using package cmdstanr:

library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)

mod <- cmdstan_model("code/multiple.stan")

I’ve written a function bayes_fit that (1) estimates the model, (2) collects the samples of $\beta$ from the posterior distribution, and (3) returns an indicator of “success.” Here, success is determined in one of two ways. In the first, we evaluate $P(\beta > 0)$ (using the posterior distribution) and declare success if this probability exceeds 95%; this is the posterior probability that the intervention has a positive effect. The second decision rule uses the first criterion and adds a second one to ensure that the effect size is clinically meaningful by requiring $P(\beta > M) > 50\%$. $M$ is some meaningful threshold that has been agreed upon prior to conducting the study. Under the two-part decision rule, both requirements ($P(\beta > 0) > 95\%$ and $P(\beta > M) > 50\%$) need to be satisfied in order to declare success.

bayes_fit <- function(dx, p_sigma, m_effect, decision_rule, x) {
  
  # 1: estimate model
  
  data_list <- list(N = nrow(dx), y = dx$y, rx = dx$rx, p_mu = 0, p_sigma = p_sigma)

  fit <- mod$sample(
    data = data_list,
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500,
    step_size = 0.1,
    show_messages = FALSE
  )
  
  # 2: collect sample of betas from posterior
  
  df <- data.frame(as_draws_rvars(fit$draws(variables = "beta")))
  
  # 3: evaluate success based on desired decision rule
  
  if (decision_rule == 1) {
    return((mean(df$beta > 0) > 0.95))
  } else { # decision_rule == 2
    return( ((mean(df$beta > 0) > 0.95) & (mean(df$beta > m_effect ) > 0.5)) )  
  }
}
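
To see the two decision rules in isolation, here is a minimal sketch that applies them to a vector of simulated draws rather than to output from the Stan model. The draws (normal with mean 0.25 and standard deviation 0.10) and the helper decide are hypothetical and only illustrate the logic used in step 3 of bayes_fit:

```r
# Hypothetical stand-in for posterior draws of beta (NOT output from the Stan model)
set.seed(123)
beta_draws <- rnorm(10000, mean = 0.25, sd = 0.10)

# decide() is an illustrative helper, not part of the original code
decide <- function(draws, m_effect, decision_rule) {
  p_gt_0 <- mean(draws > 0)         # estimate of P(beta > 0)
  p_gt_m <- mean(draws > m_effect)  # estimate of P(beta > M)
  if (decision_rule == 1) {
    p_gt_0 > 0.95
  } else {
    (p_gt_0 > 0.95) && (p_gt_m > 0.5)
  }
}

rule_1 <- decide(beta_draws, m_effect = 0.4, decision_rule = 1)
rule_2 <- decide(beta_draws, m_effect = 0.4, decision_rule = 2)
```

With these made-up draws, nearly all of the mass lies above 0 but well under half lies above $M = 0.4$, so the single rule declares success while the two-part rule does not.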

Function freq_fit fits a linear model and returns a p-value for the estimate of $\beta$, to be used as a basis for comparison with the Bayesian models:

freq_fit <- function(dx) {
  
  lmfit <- lm(y ~ rx, data = dx)
  coef(summary(lmfit))["rx", "Pr(>|t|)"]
  
}

The data generation process is simple and assumes that $\alpha = 0$, and more importantly, that the treatment effect $\beta=0$ (and that $\sigma_s = 1$):

def <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
def <- defData(def, varname = "y", formula = 0, variance = 1, dist = "normal")
  
set.seed(1918721)
dd <- genData(1000, def)

dd

##         id rx      y
##    1:    1  1 -0.593
##    2:    2  1  0.648
##    3:    3  0 -0.819
##    4:    4  0 -0.386
##    5:    5  0 -0.304
##   ---               
##  996:  996  1  0.343
##  997:  997  0 -0.948
##  998:  998  1 -0.497
##  999:  999  0 -1.301
## 1000: 1000  0  0.282

And here we are at the key point where we simulate the interim looks. I’ve taken a crude, probably inefficient approach that has the advantage of being extremely easy to code. Using the R function sapply, we can sequentially estimate models using incrementally larger segments of a particular data set. In the first round, the model is estimated for the first 100 observations, in the second round the model is estimated for the first 200, and so on. For each round, we get an indicator for whether the trial would have been declared successful based on whatever criteria were being used. If any of the interim looks results in a success, then that simulated study would be deemed (inappropriately) successful. This approach is inherently inefficient, because we are conducting all interim looks regardless of the outcome of earlier looks; in effect, we are fitting too many models. But in the case where only a relatively small number of studies will be “successful,” this is a small price to pay for ease of coding.

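# decision_rule is named explicitly; passed positionally, x would be matched to it
bayes_ci <- sapply(seq(100, 1000, by = 100), 
    function(x) bayes_fit(dd[1:x], p_sigma = 10, m_effect = 0.2, 
      decision_rule = 1, x = x))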

The function returns a vector of success indicators, one for each interim look.

bayes_ci

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Here is a visualization representing the underlying posterior densities at each interim look. It is clear that less than 95% of each density (shaded in darker red) falls to the right of $0$ at each stage of evaluation, and based on the single decision rule, the study would not have been declared successful at any interim look:

In practice, all we need to do is check whether any of the looks has resulted in a “success”. In this case, there were no successes:

any(bayes_ci)

## [1] FALSE

Operating characteristics: Type I error rates

Single decision rule

To compare the Type I error rates for each approach, I considered smaller and larger sample sizes:

160 subjects, with interim looks every 20 subjects starting after the 80th subject has been enrolled
1000 subjects, with interim looks every 100 subjects

For the Bayesian models, I assumed three different prior distribution assumptions for $\beta$:

\[\beta \sim t_{\text{student}}(\text{df}=3, 0, \sigma), \ \sigma \in \{1, 5, 10\}\]

I generated 3000 data sets under each scenario, and the Type I error rate is the proportion of those 3000 iterations that were declared a “success” under each model. (Code for running these replications on a high performance computing cluster is shown below in the addendum.) Here are the estimates of the error rate using the single criterion for the threshold, comparing the frequentist analysis and the Bayesian analysis.

Type I error rates based on 160 subjects/5 interim looks

##    p_sigma m_effect  freq  bayes
## 1:       1        0 0.118 0.0937
## 2:       5        0 0.099 0.0993
## 3:      10        0 0.101 0.0917

Type I error rates based on 1000 subjects/10 interim looks

##    p_sigma m_effect  freq bayes
## 1:       1        0 0.193 0.157
## 2:       5        0 0.197 0.166
## 3:      10        0 0.189 0.169

At both sample sizes, the Bayesian analysis tends to have a slightly lower Type I error rate than the frequentist analysis (the two are essentially equal with 160 subjects and $\sigma = 5$), although in both cases error rates are considerably inflated beyond 5%. In this setting at least, the prior distribution assumptions do not appear to have much impact on the estimated Type I error rate.
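
The inflation on the frequentist side can be reproduced without Stan. The sketch below is a simplified stand-in for the full simulation: it uses base R only, 400 replications rather than 3000, and the helper names p_at_look and one_trial are made up for illustration. It compares the error rate from declaring success at any of ten looks with the error rate from testing only once at the end:

```r
set.seed(2021)

# p-value for the treatment coefficient using the first n rows of d
p_at_look <- function(d, n) {
  fit <- lm(y ~ rx, data = d[seq_len(n), ])
  coef(summary(fit))["rx", "Pr(>|t|)"]
}

# one simulated trial under the null (beta = 0), analyzed at every look
one_trial <- function(looks) {
  n_max <- max(looks)
  d <- data.frame(
    rx = sample(rep(0:1, each = n_max / 2)),  # balanced 1:1 assignment
    y = rnorm(n_max)                          # alpha = 0, sigma_s = 1
  )
  p_all <- sapply(looks, function(n) p_at_look(d, n))
  c(any_look = any(p_all < 0.05), final_only = p_all[length(p_all)] < 0.05)
}

res <- replicate(400, one_trial(seq(100, 1000, by = 100)))
error_rates <- rowMeans(res)
error_rates
```

Because the final analysis is one of the looks, the any_look rate can never fall below the final_only rate; the gap between the two is the cost of the unadjusted interim analyses.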

Two criteria

It is pretty clear that, in this simple scenario, the single Bayesian criterion might not satisfy reviewers looking for control at the 5% level, so we might need to use a modified approach. The next set of simulations explores the double decision rule. In addition to the sample size variation and different prior distribution assumptions, I also considered three possible thresholds for the second criterion in exploring impacts on Type I error rates:

\[P(\beta > 0) > 95\% \ \ \textbf{and} \ \ P(\beta > M) > 50\%, \ M \in \{0.2, 0.3, 0.4\}\]

Introducing the second criterion in this case substantially lowers the Type I error rate, particularly when the threshold for the second criterion is more conservative (higher):

Once again, the prior distribution assumptions for $\beta$ are not very important, but the threshold $M$ is clearly key to controlling the Type I error, particularly when the sample size is smaller. With sample size of 160, 5% error rates are achieved with $M=0.4$, though with 1000 subjects, we could use $M=0.3$.

The overall takeaway is that if funders and reviewers insist that Bayesian study designs control Type I error rates at a level of 5%, decision criteria and possibly modelling assumptions might need to be adjusted to accommodate those requirements. This is particularly true if the study design allows for interim evaluations of the data, as is common in a study using an adaptive Bayesian design. Simulations similar to what I have done here will be required to demonstrate that these conditions are satisfied.

References:

Ryan, Elizabeth G., Kristian Brock, Simon Gates, and Daniel Slade. “Do we need to adjust for interim analyses in a Bayesian adaptive trial design?.” BMC medical research methodology 20, no. 1 (2020): 1-9.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.” European journal of epidemiology 31, no. 4 (2016): 337-350.

Addendum

Code used to generate the Type I error rate estimates for a variety of scenarios was written to be executed on a high performance computing cluster, which is a highly parallel computing environment. I am using package rslurm to manage the parallelization of the process (there is an alternative package called slurmR which I have used in the past). The replications of each of the scenarios (represented by a set of simulation parameters) are stored in data frame scenarios, and the function slurm_apply cycles through the data frame, with many replications executed in parallel. The function s_replicate calls the two functions bayes_fit and freq_fit described above.

library(simstudy)
library(data.table)
library(parallel)
library(cmdstanr)
library(posterior)
library(rslurm)

### Function to generate data and estimate parameters

s_replicate <- function(iter, p_sigma, decision_rule, m_effect, seq) {
  
  set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")
  
  def <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
  def <- defData(def, varname = "y", formula = 0, variance = 1, dist = "normal")
  
  dd <- genData(end, def)
  
  freq_ps <- sapply(seq(start, end, by = by), function(x) freq_fit(dd[1:x]))
  freq_effect <- any(freq_ps < 0.05)
  
  bayes_ci <- sapply(seq(start, end, by = by), 
    function(x) bayes_fit(dd[1:x], p_sigma, m_effect, decision_rule, x))
  bayes_effect <- any(bayes_ci)
  
  return(data.table(seq, iter, p_sigma, m_effect, decision_rule, 
    freq_effect, bayes_effect))  
}

### Set simulation parameters

scenario_dt <- function(...) {
  argdt <- data.table(expand.grid(...))
  argdt[, seq := .I]
  argdt[]
}

iter <- c(1:1000)
p_sigma <- c(1, 5, 10)
decision_rule = 2
m_effect <- c(0.2, 0.3, 0.4) # if decision_rule = 2
# decision_rule = 1
# m_effect <- 0

start <- 100L
end <- 1000L
by <- 100L

scenarios <- scenario_dt(
  iter = iter, 
  p_sigma = p_sigma, 
  decision_rule = decision_rule,
  m_effect = m_effect
)

### Compile stan code

set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")
mod <- cmdstan_model("multiple.stan")

### Set rslurm arguments

sopts <- list(time = '12:00:00', partition = "cpu_short", `mem-per-cpu` = "5G")
sobjs <- c("freq_fit", "bayes_fit", "mod", "start", "end", "by")

### Replicate over iterations

sjob <- slurm_apply(
  f = s_replicate, # the function
  params = scenarios, # a data frame
  jobname = 'mult_i',
  nodes = 50, 
  slurm_options = sopts,
  global_objects = sobjs,
  submit = TRUE
)

### Collect the results and save them

res <- get_slurm_out(sjob, outtype = 'table', wait = TRUE)
save(res, file = "/gpfs/.../mult.rda")

Exploring design effects of stepped wedge designs with baseline measurements

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 07 Dec 2021 00:00:00 +0000

In the previous post, I described an incipient effort that I am undertaking with two colleagues, Monica Taljaard and Fan Li, to better understand the implications of collecting baseline measurements for sample size requirements in stepped wedge cluster randomized trials. (The three of us are on the Design and Statistics Core of the NIA IMPACT Collaboratory.) In that post, I conducted a series of simulations that illustrated the design effects in parallel cluster randomized trials derived analytically in a paper by Teerenstra et al. In this post, I am extending those simulations to stepped wedge trials; the hope is that the design effects can be formally derived at some point soon.

Stepped wedge designs

At the end of the previous post, I provided links to a few earlier entries where I described stepped wedge designs in some detail. The first post I suggested provides an introductory overview.

In the stepped-wedge design, all clusters in a trial will receive the intervention at some point, but the start of the intervention at different sites will be staggered. The amount of time in each state (control or intervention) will differ for each cluster (or group of clusters if there are waves of more than one cluster starting up at the same time). In this design, time is divided into discrete data collection/phase-in periods.

In the figure below there are four waves of clusters (the rows), with sets of five clusters (the columns) in each wave. All waves are in the control condition in the first observation period, Wave 1 starts the intervention in period 2, Wave 2 starts in period 3, Wave 3 in period 4, and the last wave starts in period 5. Data from individuals in each cluster are collected for all periods, regardless of intervention status. The points in the figure represent individual (follow-up) responses within each cluster and time period.
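
The rollout schedule described above can be written down as a small treatment indicator matrix. This is only a base R sketch of the layout (one row per wave, one column per period), not the simstudy code used to generate the data later in the post:

```r
nwaves <- 4
nperiods <- 5

# wave w starts the intervention in period w + 1
start_period <- seq_len(nwaves) + 1

# Z[w, t] = 1 if wave w is under the intervention in period t, 0 otherwise
Z <- outer(start_period, seq_len(nperiods), function(s, t) as.integer(t >= s))
dimnames(Z) <- list(paste0("wave_", 1:nwaves), paste0("period_", 1:nperiods))
Z
```

Every wave is under control in the first column and under the intervention in the last, and the number of intervention periods drops by one from each wave to the next.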

Data generation process

In order to accommodate a range of scenarios, I’ve written a general function base_sw that will be used in the data generating process (see addendum). This function allows for time period effects (specified as a time trend) as well as all the possible cluster and individual random effects. Importantly, this function will generate two outcome measures for each subject within a specific cluster and time period.

This is the general data generation process that I’ve used here:

\[ Y_{ijkt} = \tau (t-1) + \gamma k Z_{jt} + c_j + cp_{jt} + s_{ijt} + sp_{ijkt}, \]

where $Y_{ijkt}$ is the outcome for subject $i$ in cluster $j$ measured in period $t,$ $t \in \{1,\dots, T\};$ $k \in \{0, 1\}, \ k=0$ if the measurement is taken at baseline and $k=1$ if the measurement is the follow-up. $Z_{jt}$ is a treatment indicator for cluster $j$ during period $t$.

The parameter $\tau$ is the general time trend. (The linear time trend is not a requirement - it is just what I am using here to easily generate time period effects.) $\gamma$ is the treatment effect, which can only impact follow-up measurements. $c_j$ and $cp_{jt}$ are the cluster and cluster-period specific random effects with distributions $N(0, \sigma_c^2)$ and $N(0, \sigma_{cp}^2)$, respectively. $s_{ijt}$ is the subject-level effect (and is specific to period $t$ since the subject is only observed in that period), $s_{ijt} \sim N(0, \sigma_{s}^2)$. $sp_{ijkt}$ is the measurement noise, $sp_{ijkt} \sim N(0, \sigma_{sp}^2).$

The code itself is a little involved, because there are two longitudinal processes that need to be combined - clusters are repeated over time and subjects are repeated over time within each cluster period. In addition, there are two possible layers of cluster and subject random effects.

Below are three plots of data generated from three possible scenarios, using ten sites divided into two waves. The first scenario includes cluster and subject specific effects, but no fixed or random period effects. The second adds fixed period effects, and the third adds both fixed and random period effects:

dd_1 <- base_sw(
  effect = 1, 
  trend = 0, 
  nsites = 10, 
  nwaves = 2, 
  nperiods = 5, 
  n = 30, 
  s_c = 6, 
  s_cp = 0, 
  s_s = 44, 
  s_sp = 20
)

dd_2 <- base_sw(1, 3, 10, 2, 5, 30, 6, 0, 44, 20)
dd_3 <- base_sw(1, 3, 10, 2, 5, 30, 4, 2, 44, 20)

Models for estimating treatment effects

Of course, given a data set, there are many ways we can choose to analyze it. Which model we pick depends in large part on the structure of the data and the assumptions we are willing to make about the data generation process. Here, we describe five models, all of which are linear mixed effects models. Ideally, we will choose the one that provides the most information given the data on hand, and we hope these explorations might help clarify which of these is our best bet.

To start, we are focusing on a situation where there is no cluster-period effect (Scenario 2 from above), so that $cp_{jt} = 0$ for all $j$ and $t$. The true effect size $\gamma = 1$ and the time trend $\tau = 3$. There will be 20 clusters divided into 4 waves of 5 clusters. All waves are in the control condition in period $t=1$, and the intervention is rolled out in periods 2 through 5, just as in the figure at the top. There will be 30 subjects for each cluster, each period. $\sigma_c^2 = 6$, $\sigma_{cp}^2 = 0$, $\sigma_s^2 = 44$, and $\sigma_{sp}^2 = 20.$ (See the second part of the addendum for examples from Scenarios 1 and 3.)
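
These variance components imply specific correlations, which help explain why the baseline measurement is informative. The sketch below works them out from the data generation equation, assuming (as in that equation) that the four random components are independent; the variable names are made up for illustration:

```r
# variance components under Scenario 2
s_c <- 6; s_cp <- 0; s_s <- 44; s_sp <- 20

total_var <- s_c + s_cp + s_s + s_sp

# correlation between two different subjects in the same cluster and period
icc_within_period <- (s_c + s_cp) / total_var

# correlation between baseline and follow-up for the same subject
rho_subject <- (s_c + s_cp + s_s) / total_var

# the same correlation after conditioning on the cluster-level effects
rho_subject_cond <- s_s / (s_s + s_sp)

round(c(icc_within_period, rho_subject, rho_subject_cond), 3)
```

The total variance is 70, so the within-period intracluster correlation is about 0.09 and the baseline/follow-up correlation for a subject is about 0.71 (about 0.69 within a cluster). A strong within-subject correlation like this is what makes the baseline measurement worth using.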

Here is a data set generated under these assumptions:

dd <- base_sw(1, 3, 20, 4, 5, 30, 6, 0, 44, 20)
dd[, t := factor(s_per, labels = c(1:5))]

dd

##       s_time site s_per   id k    c c.p     s p_time startTrt Z    y t
##    1:      1    1     0    1 0  2.4   0 -0.43      1        1 0  3.4 1
##    2:      1    1     0    1 1  2.4   0 -0.43      2        1 0 -7.6 1
##    3:      1    1     0    2 0  2.4   0 12.76      3        1 0 10.4 1
##    4:      1    1     0    2 1  2.4   0 12.76      4        1 0  9.4 1
##    5:      1    1     0    3 0  2.4   0  0.89      5        1 0  2.5 1
##   ---                                                                 
## 5996:    100   20     4 2998 1 -3.9   0  0.72   5996        4 1 12.5 5
## 5997:    100   20     4 2999 0 -3.9   0  3.29   5997        4 1  9.6 5
## 5998:    100   20     4 2999 1 -3.9   0  3.29   5998        4 1 16.0 5
## 5999:    100   20     4 3000 0 -3.9   0  0.86   5999        4 1  2.7 5
## 6000:    100   20     4 3000 1 -3.9   0  0.86   6000        4 1  8.8 5

And here is a subset of the data that we will actually be able to observe in the real world, which will be used to estimate the models:

dd[, .(id, site, t, k, Z, y)]

##         id site t k Z    y
##    1:    1    1 1 0 0  3.4
##    2:    1    1 1 1 0 -7.6
##    3:    2    1 1 0 0 10.4
##    4:    2    1 1 1 0  9.4
##    5:    3    1 1 0 0  2.5
##   ---                     
## 5996: 2998   20 5 1 1 12.5
## 5997: 2999   20 5 0 1  9.6
## 5998: 2999   20 5 1 1 16.0
## 5999: 3000   20 5 0 1  2.7
## 6000: 3000   20 5 1 1  8.8

Here is a description of each of the five models, along with code to estimate the models:

Model 1: Hussey & Hughes model with no baseline measurement

If we do not have access to the baseline data (or for some reason, choose to ignore it in the modeling), we could fit the stepped wedge model described by Hussey & Hughes:

\[ Y_{ijt} = \alpha_t + \gamma Z_{jt} + c_j + s_{ijt}, \]

where $Y_{ijt}$ is the outcome for subject $i$ in cluster $j$ measured in period $t$. $Z_{jt}$ is a treatment indicator for cluster $j$ during period $t$. The $\alpha_t\text{'s}$ are the period-specific fixed effects and $\gamma$ is the treatment effect. $c_j$ is the cluster-specific random effect with distribution $N(0, \sigma_c^2)$. $s_{ijt}$ is the subject-level effect or noise, $s_{ijt} \sim N(0, \sigma_{s}^2).$ Since there is only a single measurement for each subject, there is only a single subject-level effect. And because we have assumed no cluster-period random effects, $cp_{jt}$ is not included in the model (this will also be the case for the other models described below).

This model can be estimated using the lmer function in package lme4. Note that we are using only the follow-up data in the model estimation:

d_std <- dd[k == 1]
fit_1 <-  lmer( y ~  t + Z  - 1 + (1|site), data = d_std)

tbl_regression(fit_1, tidy_fun = broom.mixed::tidy) %>%
   modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
t
1	-0.73	-1.9, 0.43	0.2
2	2.9	1.7, 4.1	<0.001
3	5.0	3.8, 6.3	<0.001
4	8.3	7.0, 9.7	<0.001
5	11	9.7, 13	<0.001
Z	1.0	-0.03, 2.0	0.057
site.sd__(Intercept)	2.1
Residual.sd__Observation	7.9

The treatment and period effects are recovered quite well, as are the variance estimates of the random effects.

Model 2: Hussey & Hughes extended to baseline measurement

If we have access to both baseline and follow-up outcome measurements for each subject, we have a number of modeling options, four of which are described here. First, we can extend the Hussey & Hughes model to estimate the treatment effect. The baseline and follow-up measurements are assumed to be collected in the same period $t$. The baseline measurement is collected prior to randomization (so will by definition be under the control condition). The follow-up measurement will be under the control or treatment condition, depending on the cluster and time period.

\[ Y_{ijkt} = \alpha_t + \gamma k Z_{jt} + c_j + s_{ijt} + sp_{ijkt}, \]

where $Y_{ijkt}$ is outcome $k$ for subject $i$ in cluster $j$ measured in period $t$. $k = 0$ when the measurement is from baseline and $k=1$ when it is the follow-up. $Z_{jt}$ is a treatment indicator for cluster $j$ during period $t$. The parameters $\alpha_t$ and $\gamma$ are the same as above. The random effect $c_j$ is also unchanged from the previous model. The subject-level effects of this model are different, reflecting the fact that there are now two measurements per subject: $s_{ijt}$ is the overall subject-level effect for subject $i$ in cluster $j$ during period $t$, $s_{ijt} \sim N(0,\sigma_s^2)$ and $sp_{ijkt}$ is the measurement noise (which assumes that any two measurements from the same individual will vary), $sp_{ijkt} \sim N(0,\sigma_{sp}^2)$.

Once again, the model estimates recover the true values pretty well:

fit_2 <- lmer(y ~ t + k:Z  - 1 +  (1|id:site) + (1|site), data = dd)

Characteristic	Beta	95% CI	p-value
t
1	-0.82	-2.0, 0.33	0.2
2	2.8	1.6, 3.9	<0.001
3	5.1	4.0, 6.3	<0.001
4	8.2	7.0, 9.3	<0.001
5	11	9.9, 12	<0.001
k * Z	1.1	0.74, 1.4	<0.001
id:site.sd__(Intercept)	6.5
site.sd__(Intercept)	2.1
Residual.sd__Observation	4.5

Model 3: Teerenstra et al difference in change model

In the previous post, I described a “difference in change” model proposed in a paper by Teerenstra et al for parallel designs. This can be extended to the context of a stepped wedge design:

\[ Y_{ijkt} = \alpha_t + \gamma_0 k + \gamma_1 Z_{jt} + \gamma_2 k Z_{jt} + c_j + s_{ijt} + sp_{ijkt} \]

This is essentially the same as Model 2, though this expanded model makes slightly fewer assumptions. In particular, Model 3 includes the additional parameters $\gamma_0$ and $\gamma_1$, where $\gamma_0$ is the change from baseline to follow-up in the control arm and $\gamma_{1}$ is the difference at baseline between control and treatment arms (we would expect this to be $0$ in a randomized trial). $\gamma_{2}$ is the difference in the change from baseline to follow-up between the two arms. In Model 2, we essentially make the assumption that both $\gamma_0 = 0$ and $\gamma_1=0$, which is probably true on average in a randomized trial with no secular trend between baseline and follow-up measurements (and is certainly true in our simulated data generation process). In this case, under the assumption of randomization and no within-period secular trend, $\gamma_2$ should be equivalent to $\gamma$ in Model 2.

fit_3 <- lmer(y ~  t + k * Z - 1 + (1|id:site) + (1|site), data = dd)

Characteristic	Beta	95% CI	p-value
t
1	-0.92	-2.1, 0.23	0.11
2	2.6	1.5, 3.8	<0.001
3	4.9	3.7, 6.1	<0.001
4	7.9	6.5, 9.2	<0.001
5	11	9.2, 12	<0.001
k	0.20	-0.12, 0.52	0.2
Z	0.38	-0.57, 1.3	0.4
k * Z	0.83	0.38, 1.3	<0.001
id:site.sd__(Intercept)	6.5
site.sd__(Intercept)	2.1
Residual.sd__Observation	4.5

The notable difference between the Model 2 and Model 3 estimates is that the confidence interval for the treatment effect estimate is considerably wider in Model 3. We will see how this plays out in the power estimates down below.

Model 4: ANCOVA

The next model is an ANCOVA model that adjusts each follow-up outcome $Y_1$ with the observed baseline measurement $Y_0.$ Because we have only a single outcome per subject, the subject-specific effect $sp_{ijkt}$ disappears, and $s_{ijt}$ is now individual subject-level noise.

\[ Y_{ijt} = \alpha_t + \beta Y_0 + \gamma Z_{jt} + c_j + s_{ijt} \]

dd_b <- dd[k ==0, .(id, y0 = y)]
dd_f <- dd[k ==1, ]

dd <- merge(dd_f, dd_b, by = "id")

fit_4 <- lmer(y ~  t +  y0 + Z  - 1 + (1|site), data = dd)

Characteristic	Beta	95% CI	p-value
t
1	-0.11	-0.66, 0.44	0.7
2	1.1	0.54, 1.7	<0.001
3	1.5	0.85, 2.2	<0.001
4	2.9	2.1, 3.7	<0.001
5	3.9	2.9, 4.8	<0.001
y0	0.69	0.67, 0.72	<0.001
Z	0.74	0.06, 1.4	0.033
site.sd__(Intercept)	0.64
Residual.sd__Observation	5.8

Model 5: Change as outcome

The final model considers the change from baseline to follow-up ($D_{ijt} = Y_{ij1t} - Y_{ij0t}$) as the outcome of interest:

\[ D_{ijt} = \alpha_t + \gamma Z_{jt} + c_j + s_{ijt} \]

dd[, d := y - y0]

fit_5 <- lmer(d ~  t + Z - 1 + (1|site), data = dd)

Characteristic	Beta	95% CI	p-value
t
1	0.17	-0.35, 0.69	0.5
2	0.33	-0.21, 0.88	0.2
3	-0.04	-0.66, 0.58	>0.9
4	0.55	-0.17, 1.3	0.13
5	0.63	-0.22, 1.5	0.15
Z	0.58	-0.09, 1.3	0.087
site.sd__(Intercept)	0.26
Residual.sd__Observation	6.3

Power estimates

The ultimate goal of simulating data under different scenarios is to compare how efficiently different modeling assumptions estimate the treatment effect. In frequentist terms, if we assume a specific effect size, which model will lead us to the “correct” inference most often? Or putting it more technically, which model has the most statistical power?

By repeatedly generating data sets, we can estimate the parameters using each of the models and compare the overall proportion of data sets where the observed p-value is less than 0.05 to evaluate the power of each modeling approach. We have generated data under a range of effect size and time trend assumptions, as well as under the different random effect assumptions, to see if relative power changes across different sets of assumptions.

The simulations were conducted under a range of sample size assumptions (we assumed between 12 and 60 clusters) as well as two different time trends. (See the last section of the addendum for example code used for the power analysis.) Power was estimated for each set of assumptions using 1500 data sets.
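
Since each power estimate is a proportion based on 1500 data sets, it carries some Monte Carlo error of its own. Here is a quick sketch of the binomial standard error of such an estimate (mc_se is a hypothetical helper, and the power values 0.80 and 0.50 are just illustrative inputs):

```r
# binomial standard error of a proportion estimated from n_rep replications
mc_se <- function(p, n_rep) sqrt(p * (1 - p) / n_rep)

se_80 <- mc_se(0.80, 1500)  # when true power is 80%
se_50 <- mc_se(0.50, 1500)  # worst case, when true power is 50%

round(c(se_80, se_50), 4)
```

Both values are close to 0.01, so differences between power curves of a percentage point or two are within simulation noise.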

First up is Scenario 1:

The curves in the figures show estimated power under a range of effect size (the columns of plots), time trends (the rows of plots) and sample size assumptions (the x-axes). In the case without any fixed or random period effects, there really is no difference between Models 3, 4, and 5; power appears to be consistent under the different scenarios. Model 2, the Hussey & Hughes model for baseline and follow-up measures, has the most power; this is not surprising, given that the model has an embedded set of simplifying assumptions relative to Model 3 (see above). On the flip side, if those assumptions are violated, the Model 2 estimates for the treatment effect may be biased, so it is not the obvious choice. On the other end of the spectrum, Model 1, which ignores the baseline measures, provides the least power; it seems pretty clear that if the two measurements are available, it would be unwise to throw out one of them.

The relative weakness of Model 1 and relative strength (though potential bias) of Model 2 persist in the remaining two scenarios (below), so the interesting question is how the remaining three models fare. In the second scenario, where there are fixed period effects but no random period effects, differences emerge. Model 3, the difference in change or Teerenstra model, appears to have more power than the ANCOVA or change models across all effect sizes and secular trends.

In Scenario 3, where we have both fixed and random period effects, the general ordering of the models appears similar to the ranking in Scenario 2. However, Model 5, the change model, appears, surprisingly, slightly superior to Model 4, the ANCOVA model.

Back in the real world, of course, we do not know the true data generation process. It seems pretty clear that if we can make the assumptions that $\gamma_0 = 0$ and $\gamma_1 = 0$, then Model 2 appears to be the way to go. However, if we want to be careful not to introduce bias, Model 3 might be preferred.

The important thing to note is that statistical power translates directly to sample size. If we are interested in achieving 80% power, we can see how the number of required clusters is reduced as we move leftward across the horizontal white lines in the plots, from Model 1 to Models 4 & 5 to Model 3 and then to Model 2. This is the design effect in action.

Of course, these conclusions are based on the specific data generating assumptions of the simulations. More general guidelines will require the formal derivation of design effects of the different models under the various scenarios. That is the next step.

References:

Hussey, Michael A., and James P. Hughes. “Design and analysis of stepped wedge cluster randomized trials.” Contemporary clinical trials 28, no. 2 (2007): 182-191.

Teerenstra, Steven, Sandra Eldridge, Maud Graff, Esther de Hoop, and George F. Borm. “A simple sample size formula for analysis of covariance in cluster randomized trials.” Statistics in medicine 31, no. 20 (2012): 2169-2178.

Addendum

(A) General data generation code: function `base_sw`

base_sw <- function(effect, trend, nsites, nwaves, nperiods, n, s_c, s_cp, s_s, s_sp) {
  
  # define the cluster and subject level effects
  
  defC <- defData(varname = "c", formula = 0, variance = "..s_c")
  defCP <- defDataAdd(varname = "c.p", formula = 0, variance = "..s_cp")
  defS <- defDataAdd(varname = "s", formula = 0, variance = "..s_s")
  defSP <- defDataAdd(varname = "y",
    formula = "..trend * s_per + ..effect * Z * k + c + c.p + s", 
    variance = "..s_sp"
  )
  
  # generate clusters
  
  dsite <- genData(nsites, defC, id = "site")
  
  # generate cluster-period data
  
  dper <- addPeriods(dsite, nPeriods = nperiods, idvars = "site", 
                     timeid = "s_time", perName = "s_per")
  dper <- addColumns(defCP, dper)
  
  # make treatment assignments
  
  dsw <- trtStepWedge(dper, "site", nWaves = nwaves, lenWaves = 1, 
                      startPer = 1, perName = "s_per",
                      grpName = "Z")
  
  # generate individual level data within each cluster
  
  dpat <- genCluster(dper, cLevelVar = "s_time", 
                     numIndsVar = n, level1ID = "id")
  dpat <- addColumns(defS, dpat)
  
  # generate two observation periods for each subject - baseline and follow-up
  
  dpat <- addPeriods(dpat, nPeriods=2, idvars="id", timeid="p_time", perName="k")

  # merge subjects with stepped wedge assignments and generate outcome
  
  setkey(dpat, "s_time", "site", "s_per", "id")
  dsw[, c("c", "c.p") := NULL]
  setkey(dsw, "s_time", "site", "s_per")
  
  dpat <- merge(dpat, dsw)
  setkey(dpat, "id", "k")
  dpat <- addColumns(defSP, dpat)
}

(B) Model 3 under Scenarios 1 and 3

The modeling examples above were based on Scenario 2 assumptions. Here is Model 3 under Scenarios 1 and 3. (The other models would be described and implemented in similar fashion.)

Scenario 1

In Scenario 1, there are no fixed or random period effects, so there is a constant intercept $\alpha$ for each period, rather than period-specific intercepts $\alpha_t$, and there is no site-period random effect $cp_{jt}$:

\[ Y_{ijkt} = \alpha + \gamma_0 k + \gamma_1 Z_{jt} + \gamma_2 k Z_{jt} + c_j + s_{ijt} + sp_{ijkt} \]

dd <- base_sw(effect = 1, trend = 0, 20, 4, 5, 30, 6, 0, 44, 20)

fit_1a <- lmer(y ~  k * Z + (1|id:site) + (1|site), data = dd)

Characteristic	Beta	95% CI	p-value
k	0.24	-0.09, 0.56	0.2
Z	0.51	-0.11, 1.1	0.11
k * Z	0.47	0.01, 0.93	0.044
id:site.sd__(Intercept)	6.5
site.sd__(Intercept)	2.1
Residual.sd__Observation	4.5

Scenario 3

Scenario 3 includes both fixed and random period effects, so we have $\alpha_t$ and $cp_{jt}$.

\[ Y_{ijkt} = \alpha_t + \gamma_0 k + \gamma_1 Z_{jt} + \gamma_2 k Z_{jt} + c_j + cp_{jt} + s_{ijt} + sp_{ijkt} \]

dd <- base_sw(effect = 1, trend = 3, 20, 4, 5, 30, 4, 2, 44, 20)
dd[, t := factor(s_per, labels = c(1, 2, 3, 4, 5))]

fit_2a <- lmer(y ~  t + k * Z - 1 + (1|id:site) + (1|s_time:site) + (1|site), 
  data = dd)

Characteristic	Beta	95% CI	p-value
t
1	-0.89	-2.1, 0.36	0.2
2	2.6	1.3, 3.9	<0.001
3	6.0	4.6, 7.4	<0.001
4	9.6	8.0, 11	<0.001
5	13	11, 15	<0.001
k	0.07	-0.25, 0.38	0.7
Z	-1.5	-2.9, -0.15	0.030
k * Z	1.0	0.50, 1.4	<0.001
id:site.sd__(Intercept)	6.8
s_time:site.sd__(Intercept)	1.4
site.sd__(Intercept)	1.9
Residual.sd__Observation	4.4

(C) Code to estimate statistical power

Here is the code to estimate statistical power of each model under a range of parameter assumptions, assuming Scenario 2 with fixed period effects. The estimation was conducted using a high performance computing cluster, because the replication process is computationally intensive.

library(simstudy)
library(data.table)
library(slurmR)
library(lmerTest)  # lmer() with p-values; supplies the "Pr(>|t|)" column used below

s_single_rep <- function(argsvec) {
  
  list2env(as.list(argsvec), envir = environment())

  dd <- base_sw(effect, trend, sites, waves, periods, npat, 6, 0, 44, 20)
  
  # Model 1
  
  d_std <- dd[k == 1]
  fit_1 <-  lmer( y ~  t + Z  - 1 + (1|site), data = d_std)  
  pval_1 <- coef(summary(fit_1))["Z", "Pr(>|t|)"]
  
  # Model 2
  
  fit_2 <- lmer(y ~ t + k:Z  - 1 +  (1|id:site) + (1|site), data = dd)
  pval_2 <- coef(summary(fit_2))["k:Z", "Pr(>|t|)"]
  
  # Model 3
  
  fit_3 <- lmer(y ~  t + k * Z - 1 + (1|id:site) + (1|site), data = dd)
  pval_3 <- coef(summary(fit_3))["k:Z", "Pr(>|t|)"]
  
  # Model 4
  
  dd_b <- dd[p_fu == 0, .(id, y0 = y)]
  dd_f <- dd[p_fu == 1, ]
  dd_a <- merge(dd_f, dd_b, by = "id")
  
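  # ANCOVA: follow-up outcome adjusted for the baseline measurement y0
  # (dd_a has one row per subject, so no subject-level random effect is included)
  fit_4 <- lmer(y ~  t + y0 + Z - 1 + (1|site), data = dd_a)
  pval_4 <- coef(summary(fit_4))["Z", "Pr(>|t|)"]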
  
  # Model 5
  
  # change from baseline; y0 exists only in the merged data set dd_a
  dd_a[, d := y - y0]

  fit_5 <- lmer(d ~  t + Z - 1 + (1|site), data = dd_a)
  pval_5 <- coef(summary(fit_5))["Z", "Pr(>|t|)"]
  
  ###
  
  data.table(pval_1, pval_2, pval_3, pval_4, pval_5)
  
}

s_replicate <- function(argsvec) {
  
  model_results <- rbindlist(
    parallel::mclapply(
      X = 1:1500, 
      FUN = function(x) s_single_rep(argsvec), 
      mc.cores = 8)
  )
  
  #--- summary statistics ---#
  
  power <- model_results[, .(pval_1 = mean(pval_1 <= 0.05),
                             pval_2 = mean(pval_2 <= 0.05),
                             pval_3 = mean(pval_3 <= 0.05),
                             pval_4 = mean(pval_4 <= 0.05),
                             pval_5 = mean(pval_5 <= 0.05)
  )]
  
  summary_stats <- data.table(t(argsvec), power)
  
  return(summary_stats) # summary_stats is a data.table

}

### Set simulation parameters

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

effect <- 1
trend <- c(0, 3)
sites <- c(12, 20, 28, 36, 44, 52, 60)
npat <- 30
waves <- 4
periods <- 5

scenarios <- scenario_list(effect = effect, trend = trend, 
  sites = sites, npat=npat, waves = waves, periods = periods)

### Execute simulation

job <- Slurm_lapply(
  X = scenarios, 
  FUN = s_replicate, 
  njobs = min(length(scenarios), 90L), 
  mc.cores = 8L,
  job_name = "i_swt",
  tmp_path = "/gpfs/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("simple_swt", "base_sw", "s_single_rep"),
  overwrite = TRUE
)

### Gather and save results

res <- Slurm_collect(job)
save(res, file = "/gpfs/.../res.rda")
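
To give a sense of why this is computationally intensive, here is a small sketch (not part of the original script) that rebuilds the scenario grid from the parameters above and counts the implied number of model fits. It uses base R only; the names n_scenarios, n_reps, n_models, and total_fits are introduced here purely for illustration.

```r
# Rebuild the scenario grid from the parameters listed above
scenario_list <- function(...) {
  argmat <- expand.grid(...)
  asplit(argmat, MARGIN = 1)   # one named vector per row (i.e. per scenario)
}

scenarios <- scenario_list(
  effect = 1, trend = c(0, 3),
  sites = c(12, 20, 28, 36, 44, 52, 60),
  npat = 30, waves = 4, periods = 5
)

n_scenarios <- length(scenarios)  # 2 trend values x 7 site counts
n_reps <- 1500                    # replications per scenario (see s_replicate)
n_models <- 5                     # Models 1 through 5 are fit to every data set

total_fits <- n_scenarios * n_reps * n_models

n_scenarios
total_fits
```

With 14 scenarios, the grid implies 21,000 simulated data sets and 105,000 mixed model fits.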

The design effect of a cluster randomized trial with baseline measurements

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 23 Nov 2021 00:00:00 +0000

Is it possible to reduce the sample size requirements of a stepped wedge cluster randomized trial simply by collecting baseline information? In a trial with randomization at the individual level, it is generally the case that if we are able to measure an outcome for subjects at two time periods, first at baseline and then at follow-up, we can reduce the overall sample size. But does this extend to (a) cluster randomized trials generally, and to (b) stepped wedge designs more specifically?

The answer to (a) is a definite “yes,” as described in a 2012 paper by Teerenstra et al (more details on that below). As for (b), two colleagues on the Design and Statistics Core of the NIA IMPACT Collaboratory, Monica Taljaard and Fan Li, and I have just started thinking about this. Ultimately, we hope to have an analytic solution that provides more formal guidance for stepped wedge designs; but to get things started, we thought we could explore a bit using simulation.

Quick overview

Generally speaking, why might baseline measurements have any impact at all? The curse of any clinical trial is variability - the more noise (variability) there is in the outcome, the more difficult it is to identify the signal (effect). For example, if we are interested in measuring the impact of an intervention on the quality of life (QOL) across a diverse range of patients, the measurement (which typically ranges from 0 to 1) might vary considerably from person to person, regardless of the intervention. If the intervention has a real but moderate effect of, say, 0.1 points, it could easily get lost if the standard deviation is considerably larger, say 0.25.
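
As a rough illustration of how much data a moderate effect requires when the outcome is noisy, this sketch uses power.t.test from base R with the hypothetical QOL numbers above (an effect of 0.1 points and a standard deviation of 0.25). The 80% power and 5% two-sided significance level are my assumptions, chosen to match the examples later in this post.

```r
# Sample size per arm for a simple two-arm RCT with the hypothetical QOL numbers
qol <- power.t.test(delta = 0.1, sd = 0.25, power = 0.80, sig.level = 0.05)

n_per_arm <- ceiling(qol$n)
n_per_arm
```

Under these assumptions the requirement is on the order of 100 subjects per arm, before any clustering is taken into account.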

It turns out that if we collect baseline QOL scores and can “control” for those measurements in some way (by conducting a repeated measures analysis, using ANCOVA, or assessing the difference itself as an outcome), we might be able to reduce the variability across study subjects sufficiently to give us a better chance at picking up the signal. Previously, I’ve written about baseline covariate adjustment in the context of clinical trials where randomization is at the individual subject level; now we will turn to the case where randomization is at the cluster or site level.

This post focuses on work already done to derive design effects for parallel cluster randomized trials (CRTs) that collect baseline measurements; we will get to stepped wedge designs in future posts. I described the design effect pretty generally in an earlier post, but the paper by Teerenstra et al, titled “A simple sample size formula for analysis of covariance in cluster randomized trials” provides a great foundation to understand how baseline measurements can impact sample sizes in clustered designs.

Here’s a brief outline of what follows. I start with a simple 2-arm randomized controlled trial with 350 subjects that has 80% power to detect a standardized effect size of 0.3. I then describe and simulate a series of designs with cluster sizes of 30 subjects that require progressively fewer clusters while still providing 80% power under the same effect size and total variance assumptions: a simple CRT that needs 64 sites, a cross-sectional pre-post design that needs 52, a repeated measures design that needs 38, and a repeated measures design that models follow-up outcomes only (i.e. uses an ANCOVA model) that needs only 32.

Simple RCT

We start with a simple RCT (without any clustering) that randomizes individuals to treatment or control.

\[ Y_{i} = \alpha + \delta Z_{i} + s_{i} \]

where $Y_{i}$ is a continuous outcome measure for individual $i$, and $Z_{i}$ is the treatment status of individual $i$. $\delta$ is the treatment effect. $s_{i} \sim N(0, \sigma_s^2)$ are the individual random effects or noise.

Now that we are about to start coding, here are the necessary packages:

RNGkind("L'Ecuyer-CMRG")
set.seed(19287)

library(simstudy)
library(ggplot2)
library(lmerTest)
library(parallel)
library(data.table)
library(pwr)
library(gtsummary)
library(paletteer)
library(magrittr)

In the examples that follow, overall variance $\sigma^2 = 64$. In this first example, then, $\sigma_s^2 = 64$ since that is the only source of variation. The overall effect size $\delta$, which is the difference in average scores across treatment groups, is assumed to be 2.4, a standardized effect size of $2.4/8 = 0.3$. We will need to generate 350 individual subjects (175 in each arm) to achieve power of 80%.

pwr.t.test(d = 0.3, power = 0.80)

## 
##      Two-sample t test power calculation 
## 
##               n = 175
##               d = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
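
This result can be cross-checked against the familiar normal-approximation formula for a two-sample comparison, $n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2/d^2$ per arm. The sketch below is only a sanity check under that approximation; pwr.t.test works with the t distribution, so its answer is slightly larger.

```r
d <- 0.3        # standardized effect size
alpha <- 0.05   # two-sided significance level
power <- 0.80

# normal-approximation sample size per arm
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2
n_approx
```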

Data generation process

Here is the data definition and generation process:

simple_rct <- function(N) {
  
  # data definition for outcome
  
  defS <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
  defS <- defData(defS, varname = "y", formula = "2.4*rx", variance = 64, dist = "normal")

  dd <- genData(N, defS)
  
  dd[]
}

dd <- simple_rct(350)

Here is a visualization of the outcome measures by treatment arm.

Estimating effect size

A simple linear regression model estimates the effect size:

fit1 <- lm(y ~  rx, data = dd)
tbl_regression(fit1) %>% 
  modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
rx	3.2	1.6, 4.9	<0.001

Confirming power

We can confirm the power by repeatedly generating data sets and fitting models, recording the p-values for each replication.

replicate <- function() {
  dd <- simple_rct(350)
  fit1 <- lm(y ~  rx, data = dd)
  
  coef(summary(fit1))["rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)

Here is the estimated power based on 1000 replications:

mean(unlist(p_values) < 0.05)

## [1] 0.79
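
An estimate based on 1000 replications carries some Monte Carlo error of its own. As a rough guide (assuming independent replications and a true power of 0.80), the standard error of the estimated power follows from the binomial variance:

```r
n_reps <- 1000
true_power <- 0.80

# Monte Carlo standard error of a proportion estimated from n_reps replications
mc_se <- sqrt(true_power * (1 - true_power) / n_reps)
mc_se

# approximate range within two standard errors
true_power + c(-2, 2) * mc_se
```

The standard error is a bit over 0.01, so an estimate of 0.79 is well within simulation noise of the 80% target.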

Parallel cluster randomized trial

If we need to randomize at the site level (i.e., conduct a CRT), we can describe the data generation process as

\[Y_{ij} = \alpha + \delta Z_{j} + c_j + s_i\]

where $Y_{ij}$ is a continuous outcome for subject $i$ in site $j$. $Z_{j}$ is the treatment indicator for site $j$. Again, $\delta$ is the treatment effect. $c_j \sim N(0,\sigma_c^2)$ is a site level effect, and $s_i \sim N(0, \sigma_s^2)$ is the subject level effect. The correlation of any two subjects in a cluster is $\rho$ (the ICC):

\[\rho = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_s^2}\]

If we have a pre-specified number ($n$) of subjects at each site, we can estimate the sample size required in the CRT by applying a design effect $1+(n-1)\rho$ to the sample size of an RCT that has the same overall variance. So, if $\sigma_c^2 + \sigma_s^2 = 64$, we can augment the sample size we used in the initial example. If $\sigma_c^2 = 9.6$ and $\sigma_s^2 = 54.4$, then $\rho = 9.6/64 = 0.15$. We anticipate having 30 subjects at each site, so the design effect is

\[1 + (30 - 1) \times 0.15 = 5.35\]

This means we will need $5.35 \times 350 = 1872$ total subjects based on the same effect size and power assumptions. Since we anticipate 30 subjects per site, we need $1872 / 30 = 62.4$ sites - we will round up to the nearest even number and use 64 sites.
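
The arithmetic above can be reproduced in a few lines of base R. This is just a sketch of the calculation described in the text; the variable names are mine.

```r
sigma2_c <- 9.6    # between-site variance
sigma2_s <- 54.4   # within-site (subject) variance
n <- 30            # subjects per site
n_rct <- 350       # total sample size of the individually randomized trial

rho <- sigma2_c / (sigma2_c + sigma2_s)      # ICC
des_effect <- 1 + (n - 1) * rho              # design effect
sites_needed <- des_effect * n_rct / n       # number of sites before rounding
sites_even <- 2 * ceiling(sites_needed / 2)  # round up to an even number

c(rho = rho, des_effect = des_effect, 
  sites_needed = sites_needed, sites_even = sites_even)
```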

Data generation process

simple_crt <- function(nsites, n) {
  
  defC <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
  defC <- defData(defC, varname = "c", formula = "0", variance = 9.6, dist = "normal")  
  
  defS <- defDataAdd(varname="y", formula="c + 2.4*rx", variance = 54.4, dist="normal")

# site/cluster level data
  
  dc <- genData(nsites, defC, id = "site")

# individual level data
  
  dd <- genCluster(dc, "site", n, "id")
  dd <- addColumns(defS, dd)
  
  dd[]
}

dd <- simple_crt(20, 50)

Once again, the sites randomized to the treatment arm are colored red:

Estimating effect size

A mixed effects model is used to estimate the effect size. I’m using a larger data set to recover the parameters used in the data generation process:

dd <- simple_crt(200,100)

fit2 <- lmer(y ~  rx + (1|site), data = dd)
tbl_regression(fit2, tidy_fun = broom.mixed::tidy)  %>% 
  modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
rx	1.2	0.21, 2.1	0.018
site.sd__(Intercept)	3.4
Residual.sd__Observation	7.4

Confirming power

Now, I will confirm power using 64 sites with 30 subjects per site, for a total of 1920 subjects (compared with only 350 in the RCT):

replicate <- function() {
  dd <- simple_crt(64, 30)
  fit2 <- lmer(y ~  rx + (1|site), data = dd)
  
  coef(summary(fit2))["rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)

mean(unlist(p_values) < 0.05)

## [1] 0.8

CRT with baseline measurement

We paid quite a hefty price moving from an RCT to a CRT in terms of the number of subjects we need to collect data on. If these data are coming from administrative systems, that added burden might not be an issue, but if we need to consent all the subjects and survey them individually, this could be quite burdensome.

We may be able to decrease the required number of clusters (i.e. reduce the design effect) if we can collect baseline measurements of the outcome. The baseline and follow-up measurements can be collected from the same subjects or different subjects, though the impact on the design effect depends on what approach is taken.

\[ Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1}k Z_{j} + c_j + cp_{jk} + s_{ij} + sp_{ijk} \]

where $Y_{ijk}$ is a continuous outcome measure for individual $i$ in site $j$ and measurement $k \in \{0,1\}$. $k=0$ for baseline measurement, and $k=1$ for the follow-up. $Z_{j}$ is the treatment status of cluster $j$, $Z_{j} \in \{0,1\}.$ $\alpha_0$ is the mean outcome at baseline for subjects in the control clusters, $\alpha_1$ is the change from baseline to follow-up in the control arm, $\delta_{0}$ is the difference at baseline between control and treatment arms (we would expect this to be $0$ in a randomized trial), and $\delta_{1}$ is the difference in the change from baseline to follow-up between the two arms. (In a randomized trial, since $\delta_0$ should be close to $0$, $\delta_1$ is the treatment effect.)

The model has cluster-specific and subject-specific random effects. For both, there can be time-invariant effects and time-varying effects. $c_j \sim N(0,\sigma_c^2)$ are time invariant site-specific effects, and $cp_{jk}$ are the site-specific period (time varying) effects, where $cp_{jk} \sim N(0, \sigma_{cp}^2)$. At the subject level there can be $s_{ij} \sim N(0, \sigma_s^2)$ and $sp_{ijk} \sim N(0, \sigma_{sp}^2)$.

Here is the generic code that will facilitate data generation in this model:

crt_base <- function(effect, nsites, n, s_c, s_cp, s_s, s_sp) {

  defC <- defData(varname = "c", formula = 0, variance = "..s_c")
  defC <- defData(defC, varname = "rx", formula = "1;1", dist = "trtAssign")
  
  defCP <- defDataAdd(varname = "c.p", formula = 0, variance = "..s_cp")
  
  defS <- defDataAdd(varname = "s", formula = 0, variance = "..s_s")
  
  defSP <- defDataAdd(varname = "y",
    formula = "..effect * rx * period + c + c.p + s", 
    variance ="..s_sp")
  
  dc <- genData(nsites, defC, id = "site")

  dcp <- addPeriods(dc, 2, "site")
  dcp <- addColumns(defCP, dcp)
  dcp <- dcp[, .(site, period, c.p, timeID)]
  
  ds <- genCluster(dc, "site", n, "id")
  ds <- addColumns(defS, ds)
  
  dsp <- addPeriods(ds, 2)
  setnames(dsp, "timeID", "obsID")
  
  setkey(dsp, site, period)
  setkey(dcp, site, period)
  
  dd <- merge(dsp, dcp)
  dd <- addColumns(defSP, dd)
  setkey(dd, site, id, period)
  
  dd[]
}

Design effect

In their paper, Teerenstra et al develop a design effect that takes into account the baseline measurement. Here are a few key quantities that are needed for the calculation:

The correlation of two subject measurements in the same cluster and same time period is the ICC or $\rho$, and is:

\[\rho = \frac{\sigma_c^2 + \sigma_{cp}^2}{\sigma_c^2 + \sigma_{cp}^2 + \sigma_s^2 + \sigma_{sp}^2} \]

In order to estimate design effect, we need two more correlations. The correlation between baseline and follow-up random effects at the cluster level is

\[\rho_c = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_{cp}^2}\]

and the correlation between baseline and follow-up random effects at the subject level is

\[\rho_s = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{sp}^2}\]

A value $r$ is used to estimate the design effect, and is defined as

\[ r = \frac{n\rho\rho_c + (1-\rho)\rho_s}{1 + (n-1)\rho}\]

If we are able to collect baseline measurements and our focus is on estimating $\delta_1$ from the model, the design effect is slightly modified from before:

\[ (1 + (n-1)\rho)(2(1-r)) \]
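
One consequence of this expression is worth spelling out (this is my own reading of the formula, not a result quoted from the paper). Relative to the simple CRT design effect $1 + (n-1)\rho$, the baseline measurement contributes the extra factor $2(1-r)$, and that factor is below 1 only when $r$ exceeds one half:

\[ 2(1-r) < 1 \iff r > \frac{1}{2} \]

So under this formula, a comparison of changes from baseline reduces the required number of clusters only if the correlation $r$ is larger than 0.5.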

Cross-sectional cohorts

We may not be able to collect two measurements for each subject at a site, but if we can collect measurements on two different cohorts, one at baseline before the intervention is implemented, and one cohort in a second period (either after the intervention has been implemented or not, depending on the randomization assignment of the cluster), we might be able to reduce the number of clusters.

In this case, $\sigma_s^2 = 0$ and $\rho_s = 0$, so the general model reduces to

\[ Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1} k Z_{j} + c_j + cp_{jk} + sp_{ijk} \]

Data generation process

The parameters for this simulation are $\delta_1 = 2.4$, $\sigma_c^2 = 6.8$, $\sigma_{cp}^2 = 2.8$, $\sigma_{sp}^2 = 54.4$. Total variance $\sigma_c^2 + \sigma_{cp}^2 + \sigma_{sp}^2 = 6.8 + 2.8 + 54.4 = 64$, as used previously.

dd <- crt_base(effect = 2.4, nsites = 20, n = 30, s_c=6.8, s_cp=2.8, s_s=0, s_sp=54.4)

Here is a visualization of the outcome measures by site and by period, with the sites in the treatment arm colored in red (only in the follow-up period).

Estimating effect size

To estimate the effect size we fit a mixed effect model with cluster-specific effects only (both time invariant and time varying).

dd <- crt_base(effect = 2.4, nsites = 200, n = 100, s_c=6.8, s_cp=2.8, s_s=0, s_sp=54.4)

fit3 <- lmer(y ~ period*rx+ (1|timeID:site) + (1 | site), data = dd)
tbl_regression(fit3, tidy_fun = broom.mixed::tidy)  %>% 
  modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
period	-0.03	-0.52, 0.46	>0.9
rx	0.17	-0.78, 1.1	0.7
period * rx	2.7	2.0, 3.4	<0.001
timeID:site.sd__(Intercept)	1.6
site.sd__(Intercept)	2.9
Residual.sd__Observation	7.4

Confirming power

Based on the variance assumptions, we can update our design effect:

s_c <- 6.8
s_cp <- 2.8
s_s <- 0
s_sp <- 54.4

rho <- (s_c + s_cp)/(s_c + s_cp + s_s + s_sp)
rho_c <- s_c/(s_c + s_cp)
rho_s <- s_s/(s_s + s_sp)

n <- 30

r <- (n * rho * rho_c + (1-rho) * rho_s) / (1 + (n-1) * rho)

The design effect for the CRT without any baseline measurement was 5.35. With the two-cohort design, the design effect is reduced slightly:

(des_effect <- (1 + (n - 1) * rho) * 2 * (1 - r))

## [1] 4.3

des_effect * 350 / n

## [1] 50

The desired number of sites is over 50, so rounding up to the next even number gives us 52:

replicate <- function() {
  dd <- crt_base(2.4, 52, 30, s_c = 6.8, s_cp = 2.8, s_s = 0, s_sp = 54.4)
  fit3 <- lmer(y ~ period * rx+ (1|timeID:site) + (1 | site), data = dd)
  
  coef(summary(fit3))["period:rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)

mean(unlist(p_values) < 0.05)

## [1] 0.8

Repeated measurements

We can reduce the number of clusters further if instead of measuring one cohort prior to the intervention and another after the intervention, we measure a single cohort twice - once at baseline and once at follow-up. Now we use the full model that decomposes the subject level variance into a time invariant effect ($s_{ij}$) and a time varying effect ($sp_{ijk}$):

\[ Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1} k Z_{j} + c_j + cp_{jk} + s_{ij} + sp_{ijk} \]

Data generation process

These are the parameters: $\delta_1 = 2.4$, $\sigma_c^2 = 6.8$, $\sigma_{cp}^2 = 2.8$, $\sigma_s^2 = 38$, and $\sigma_{sp}^2 = 16.4$.

dd <- crt_base(effect=2.4, nsites=20, n=30, s_c=6.8, s_cp=2.8, s_s=38, s_sp=16.4)

Here is what the data look like; each line represents an individual subject at the two time points, baseline and follow-up.

Estimating effect size

The mixed effect model includes cluster-specific effects (both time invariant and time varying) as well as subject level effects. Again, total variance ($\sigma_c^2 + \sigma_{cp}^2 + \sigma_s^2 + \sigma_{sp}^2$) is 64.

dd <- crt_base(effect = 2.4, nsites = 200, n = 100, 
  s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

fit4 <- lmer(y ~ period*rx + (1 | id:site) + (1|timeID:site) + (1 | site), data = dd)
tbl_regression(fit4, tidy_fun = broom.mixed::tidy)  %>% 
  modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
period	-0.21	-0.73, 0.31	0.4
rx	-0.19	-1.1, 0.73	0.7
period * rx	2.4	1.7, 3.2	<0.001
id:site.sd__(Intercept)	6.2
timeID:site.sd__(Intercept)	1.8
site.sd__(Intercept)	2.7
Residual.sd__Observation	4.1

Confirming power

Based on the variance assumptions, we can update our design effect a second time:

s_c <- 6.8
s_cp <- 2.8
s_s <- 38
s_sp <- 16.4

rho <- (s_c + s_cp)/(s_c + s_cp + s_s + s_sp)
rho_c <- s_c/(s_c + s_cp)
rho_s <- s_s/(s_s + s_sp)

n <- 30

r <- (n * rho * rho_c + (1-rho) * rho_s) / (1 + (n-1) * rho)

And again, the design effect (and sample size requirement) is reduced:

(des_effect <- (1 + (n - 1) * rho) * 2 * (1 - r))

## [1] 3.1

des_effect * 350 / n

## [1] 37

The desired number of sites is over 36, so I will round up to 38:

replicate <- function() {
  dd <- crt_base(2.4, 38, 30, s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)
  fit4 <-  lmer(y ~ period*rx + (1 | id:site) + (1|timeID:site) + (1 | site), data = dd)
  
  coef(summary(fit4))["period:rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)

mean(unlist(p_values) < 0.05)

## [1] 0.79

Repeated measurements - ANCOVA

We may be able to reduce the number of clusters even further by changing the model so that we are comparing follow-up outcomes of the two treatment arms (as opposed to measuring the differences in changes as we just did). This model is

\[ Y_{ij1} = \alpha_0 + \gamma Y_{ij0} + \delta Z_j + c_j + s_{ij} \]

where we have adjusted for baseline measurement $Y_{ij0}.$ Even though the estimation model has changed, I am using the exact same data generation process as before, with the same effect size and variance assumptions:

dd <- crt_base(effect = 2.4, nsites = 200, n = 100, 
  s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

dobs <- dd[, .(site, rx, id, period, timeID, y)]
dobs <- dcast(dobs, site + rx + id ~ period, value.var = "y")

fit5 <- lmer(`1` ~ `0` + rx + (1 | site), data = dobs)
tbl_regression(fit5, tidy_fun = broom.mixed::tidy)  %>% 
  modify_footnote(ci ~ NA, abbreviation = TRUE)

Characteristic	Beta	95% CI	p-value
0	0.70	0.69, 0.71	<0.001
rx	2.5	1.8, 3.1	<0.001
site.sd__(Intercept)	2.2
Residual.sd__Observation	5.3

Design effect

Teerenstra et al derived an alternative design effect that is specific to the ANCOVA model:

\[ (1 + (n-1)\rho) (1-r^2) \]

where $r$ is the same as before. Since $(1-r^2) < 2(1-r), \ 0 \le r < 1$, this will be a reduction from the earlier model.
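
To see the size of the reduction, note that $1 - r^2 = (1-r)(1+r)$, so the ratio of the ANCOVA design effect to the earlier one simplifies (a short derivation of my own, using only the two formulas above):

\[ \frac{(1 + (n-1)\rho)(1-r^2)}{(1 + (n-1)\rho) \ 2(1-r)} = \frac{(1-r)(1+r)}{2(1-r)} = \frac{1+r}{2} \]

With the variance assumptions used here, $r$ works out to roughly 0.71, so the ANCOVA design effect should be about 85% of the repeated measures design effect.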

(des_effect <- (1 + (n - 1) * rho) * (1 - r^2))

## [1] 2.7

des_effect * 350 / n

## [1] 31

Confirming power

replicate <- function() {
  
  dd <- crt_base(2.4, 32, 30, s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)
  dobs <- dd[, .(site, rx, id, period, timeID, y)]
  dobs <- dcast(dobs, site + rx + id ~ period, value.var = "y")

  fit5 <- lmer(`1` ~ `0` + rx + (1 | site), data = dobs)
  coef(summary(fit5))["rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)

mean(unlist(p_values) < 0.05)

## [1] 0.78
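
To pull the four designs together, here is a small helper that recomputes the number of sites implied by each design effect. It is a sketch built from the formulas and variance assumptions used in this post; the function sites_by_design and its arguments are hypothetical names of mine and are not part of any package.

```r
# Number of sites (rounded up to an even number) implied by each design effect
sites_by_design <- function(n, n_rct, s_c, s_cp, s_s, s_sp) {
  rho <- (s_c + s_cp) / (s_c + s_cp + s_s + s_sp)
  rho_c <- s_c / (s_c + s_cp)
  rho_s <- s_s / (s_s + s_sp)
  r <- (n * rho * rho_c + (1 - rho) * rho_s) / (1 + (n - 1) * rho)

  base <- 1 + (n - 1) * rho
  des_effect <- c(
    no_baseline = base,           # simple CRT
    change = base * 2 * (1 - r),  # model with period * rx interaction
    ancova = base * (1 - r^2)     # follow-up adjusted for baseline
  )

  2 * ceiling(des_effect * n_rct / n / 2)
}

cross_sectional <- sites_by_design(30, 350, s_c = 6.8, s_cp = 2.8, s_s = 0, s_sp = 54.4)
cohort <- sites_by_design(30, 350, s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

summary_sites <- c(
  simple_crt = cohort[["no_baseline"]],
  cross_sectional = cross_sectional[["change"]],
  repeated_measures = cohort[["change"]],
  ancova = cohort[["ancova"]]
)

summary_sites
```

The result reproduces the 64, 52, 38, and 32 sites used in the simulations above.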

Next steps

These simulations confirmed the design effects derived by Teerenstra et al. In the next post, we will turn to baseline measurements in the context of a stepped wedge design, to see if these results translate to a more complex setting. The design effects themselves have not yet been derived. In the meantime, to get yourself psyched up for what is coming, you can read more generally about stepped wedge designs here, here, here, here, here, and here.

Update: you can now proceed directly to the second part.

simstudy update: adding flexibility to data generation

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 09 Nov 2021 00:00:00 +0000

A new version of simstudy (0.3.0) is now available on CRAN and on the package website. Along with some less exciting bug fixes, we have added capabilities to a few existing features: double-dot variable reference, treatment assignment, and categorical data definition. These simple additions should make the data generation process a little smoother and more flexible.

Using non-scalar double-dot variable reference

Double-dot notation was introduced in the last version of simstudy to allow data definitions to be more dynamic. Previously, the double-dot variable could only be a scalar value; with the current version, double-dot notation is also array-friendly.

Before the examples, here are the necessary packages for this post:

library(simstudy)
library(data.table)
library(ggplot2)

Example 1

In the first example, we want to create a mixture distribution from a vector of values (which we can also do using a categorical distribution, more on that in a little bit). We can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4, specified as

sizes <- c(2, 4)

The data definition references each element of the vector:

defblk <- defData(varname = "blksize", 
   formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

genData(1000, defblk)

##         id blksize
##    1:    1       2
##    2:    2       4
##    3:    3       2
##    4:    4       4
##    5:    5       4
##   ---             
##  996:  996       4
##  997:  997       4
##  998:  998       4
##  999:  999       4
## 1000: 1000       4

Example 2

In this second example, there is a vector variable tau of positive real numbers that sum to 1, and we want to calculate the weighted average of three numbers using tau as the weights. We could use the following code to estimate a weighted average theta:

tau <- rgamma(3, 5, 2)
tau <- tau / sum(tau)
tau

## [1] 0.362 0.400 0.238

d <- defData(varname = "a", formula = 3, variance = 4)
d <- defData(d, varname = "b", formula = 8, variance = 2)
d <- defData(d, varname = "c", formula = 11, variance = 6)
d <- defData(d, varname = "theta", formula = "..tau[1]*a + ..tau[2]*b + ..tau[3]*c", 
  dist = "nonrandom")

set.seed(19483)
genData(4, d)

##    id    a    b     c theta
## 1:  1 1.87 8.16 13.72  7.21
## 2:  2 3.45 7.45  6.08  5.68
## 3:  3 7.41 6.27 10.21  7.62
## 4:  4 2.34 9.52 10.01  7.04

However, we can simplify the calculation of theta a bit by using matrix multiplication:

d <- updateDef(d, changevar = "theta", newformula = "t(..tau) %*% c(a, b, c)")

set.seed(19483)
genData(4, d)

##    id    a    b     c theta
## 1:  1 1.87 8.16 13.72  7.21
## 2:  2 3.45 7.45  6.08  5.68
## 3:  3 7.41 6.27 10.21  7.62
## 4:  4 2.34 9.52 10.01  7.04
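
The equivalence of the two formulas can be checked outside of simstudy with plain R. This sketch uses the rounded weights and the first row of the generated data as printed above, so the result agrees with the reported theta only up to rounding.

```r
tau <- c(0.362, 0.400, 0.238)          # rounded weights printed above
x <- c(a = 1.87, b = 8.16, c = 13.72)  # first row of the generated data

# element-by-element weighted average
theta_sum <- tau[1] * x[["a"]] + tau[2] * x[["b"]] + tau[3] * x[["c"]]

# same quantity as a (1 x 3) by (3 x 1) matrix product
theta_mat <- as.numeric(t(tau) %*% x)

c(theta_sum = theta_sum, theta_mat = theta_mat)
all.equal(theta_sum, theta_mat)
```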

Example 3

The arrays can also have multiple dimensions, as in an $m \times n$ matrix. If we want to specify the mean outcomes for a $2 \times 2$ factorial study design with two interventions $a$ and $b$, we can use a simple matrix and draw the means directly from the matrix, which in this example is stored in the variable effect:

effect <- matrix(c(0, 8, 10, 12), nrow = 2)
effect

##      [,1] [,2]
## [1,]    0   10
## [2,]    8   12

Using double dot notation, it is possible to reference the matrix cell values directly, depending on the values of a and b:

d1 <- defData(varname = "a", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "b", formula = ".5;.5", 
              variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "outcome", formula = "..effect[a, b]",
              variance = 9, dist="normal")

dx <- genData(1000, d1)
dx

##         id a b outcome
##    1:    1 1 2   12.07
##    2:    2 2 2    9.70
##    3:    3 2 2   10.76
##    4:    4 2 2   11.04
##    5:    5 2 1    5.51
##   ---                 
##  996:  996 1 1   -2.80
##  997:  997 2 1    5.15
##  998:  998 2 2   19.47
##  999:  999 2 1   10.53
## 1000: 1000 2 1    3.89
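
As a reminder of which mean goes with which combination of a and b, here is a base R sketch that looks up every cell of the effect matrix. It only illustrates ordinary matrix indexing in R and makes no claim about how simstudy evaluates the formula internally; the data frame arms is a name introduced for this example.

```r
effect <- matrix(c(0, 8, 10, 12), nrow = 2)

# one row per (a, b) combination
arms <- expand.grid(a = 1:2, b = 1:2)

# index the matrix with a two-column matrix of (row, column) pairs
arms$mean_outcome <- effect[cbind(arms$a, arms$b)]
arms
```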

The plot shows individual values as well as mean values by intervention arm:

Assigning treatment using the trtAssign distribution in defData

The function trtAssign currently provides functionality to randomize group assignments using stratification and non-standard ratios (e.g. 2:2:1 for a three arm trial). Starting with version 0.3.0, it is also possible to generate these treatment assignments directly in the defData and genData process without a separate call to trtAssign. We’ve done this by adding trtAssign as a possible distribution.

In this example, randomization is stratified by gender and age (specified in the variance argument), and randomization is 1:1 treatment to control (specified in formula). The outcome y is affected by both of these factors as well as the treatment assignment variable rx.

def <- defData(varname = "male", dist = "binary", 
               formula = .5 , id="cid")
def <- defData(def, varname = "over65", dist = "binary",  
               formula = "-1.7 + .8*male", link="logit")
def <- defData(def, varname = "rx", dist = "trtAssign",
               formula = "1;1", variance = "male;over65")
def <- defData(def, varname = "y", dist = "normal", 
               formula = "20 + 5*male + 10*over65 + 10*rx", variance = 40)

dtstudy <- genData(330, def)
dtstudy

##      cid male over65 rx    y
##   1:   1    1      0  0 20.4
##   2:   2    1      0  0 23.9
##   3:   3    0      1  0 23.7
##   4:   4    1      0  1 25.9
##   5:   5    0      1  0 35.4
##  ---                        
## 326: 326    1      1  1 46.3
## 327: 327    1      0  1 33.2
## 328: 328    1      0  1 35.5
## 329: 329    1      1  0 42.2
## 330: 330    0      0  0 12.5

Here are the counts and average outcomes for each gender, age, and treatment combination:

dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)]

##    male over65 rx  n  avg
## 1:    0      0  0 72 20.3
## 2:    0      0  1 72 30.3
## 3:    0      1  0 12 28.7
## 4:    0      1  1 12 38.5
## 5:    1      0  0 55 25.0
## 6:    1      0  1 56 34.3
## 7:    1      1  0 26 36.6
## 8:    1      1  1 25 45.7

Categorical data

Finally, in previous versions, the categorical distribution generated a set of integer categories:

def <- defData(varname = "grp", formula = ".4;.3;.2;.1", dist="categorical")
genData(1000, def)

##         id grp
##    1:    1   1
##    2:    2   4
##    3:    3   4
##    4:    4   1
##    5:    5   2
##   ---         
##  996:  996   1
##  997:  997   4
##  998:  998   2
##  999:  999   4
## 1000: 1000   1

Now, it is possible to generate specific values or string categories by using the variance argument:

def <- defData(varname = "grp", formula = ".4;.3;.2;.1", 
               variance = "a;b;c;d", dist="categorical")

dd <- genData(1000, def)
dd

##         id grp
##    1:    1   d
##    2:    2   b
##    3:    3   a
##    4:    4   c
##    5:    5   b
##   ---         
##  996:  996   d
##  997:  997   a
##  998:  998   b
##  999:  999   a
## 1000: 1000   d

To replicate Example 1 above, here is an alternative way to generate block sizes of 2 and 4 using the categorical distribution and the new functionality.

defblk <- defData(varname = "blksize", 
   formula = ".5;.5", variance = "2;4", dist = "categorical")

genData(1000, defblk)

##         id blksize
##    1:    1       2
##    2:    2       4
##    3:    3       2
##    4:    4       4
##    5:    5       4
##   ---             
##  996:  996       2
##  997:  997       2
##  998:  998       2
##  999:  999       2
## 1000: 1000       2

Possible next steps

As we expand the functionality of simstudy, we realize that the current structure of the data definition table cannot always easily accommodate all of our new ideas. As a result, we have ended up having to shoehorn some solutions in non-intuitive ways as we grow. We are in the process of reconsidering that structure so that we won’t have (as many of) these awkward specifications in the future (though we will be making everything backwards compatible, so no worries there).

Sample size requirements for a Bayesian factorial study design

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 26 Oct 2021 00:00:00 +0000

How do you determine sample size when the goal of a study is not to conduct a null hypothesis test but to provide an estimate of multiple effect sizes? I needed to get a handle on this for a recent grant submission, which I’ve been writing about over the past month, here and here. (I provide a little more context for all of this in those earlier posts.) The statistical inference in the study will be based on the estimated posterior distributions from a Bayesian model, so it seems like we’d like those distributions to be as informative as possible. We need to set the sample size large enough to reduce the dispersion of those distributions to a helpful level.

Once I determined that I wanted to target the variance of the posterior distributions, it was just a matter of figuring out what that target should be and then simulating data to see what sample sizes could give us that target. I used the expected standard deviation ($\sigma$) as the criterion for sample size selection.

Setting the target

To determine the target level of precision, I assessed the width of the posterior distributions under different standard deviations. In particular, I identified the posterior distributions with a mean OR = 1.25 $(\log(OR) = 0.22)$ for which $P(\log(OR) > 0) \ge 0.95$. The target OR is somewhat arbitrary, but seemed like a meaningful effect size based on discussions with my collaborators.

I did a quick search for the standard deviation that would yield a 95% threshold at or very close to 0. That is, 95% of the distribution should lie to the right of 0. Assuming that the target posterior distribution will be approximately normal with a mean of 0.22, I used the qnorm function to find the 95% thresholds for a range of standard deviations between 0.10 and 0.15.

sd <- seq(.15, .10, by = -0.005)
cbind(sd, threshold = round(qnorm(.05, .22 , sd = sd), 3))

##          sd threshold
##  [1,] 0.150    -0.027
##  [2,] 0.145    -0.019
##  [3,] 0.140    -0.010
##  [4,] 0.135    -0.002
##  [5,] 0.130     0.006
##  [6,] 0.125     0.014
##  [7,] 0.120     0.023
##  [8,] 0.115     0.031
##  [9,] 0.110     0.039
## [10,] 0.105     0.047
## [11,] 0.100     0.056
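Under the same normality assumption, the grid search has a closed-form counterpart: the 5th percentile is $\mu + \sigma z_{0.05} = \mu - \sigma z_{0.95}$, so setting it to 0 gives $\sigma = \mu / z_{0.95}$. Here is a small sketch of that calculation; `target_mean` and `sd_exact` are names introduced just for this illustration.

```r
# Closed-form version of the grid search, assuming a normal posterior
target_mean <- 0.22                      # log(OR) for OR = 1.25, rounded as in the text

# solve target_mean - sd * qnorm(0.95) = 0 for sd
sd_exact <- target_mean / qnorm(0.95)
sd_exact                                 # a little under 0.135

# check: the 5th percentile sits at 0 (up to floating point error)
qnorm(0.05, mean = target_mean, sd = sd_exact)
```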

It looks like the target standard deviation should be close to 0.135, which is also apparent from the plot of the 95% intervals centered at 0.22:

Using simulation to establish sample size

The final step was to repeatedly simulate data sets under different sample size assumptions, fit models, and estimate the posterior standard deviations associated with each data set (and sample size). I evaluated sample sizes ranging from 400 to 650 individuals, increasing in increments of 50. For each sample size, I generated 250 data sets, for a total of 1,500 data sets and model estimates. Given that each model estimation is quite resource intensive, I generated all the data and estimated the models using a high performance computing environment that provided me with 90 nodes and 4 processors on each node, so that the four MCMC chains for each model could also run in parallel (parallelization of parallel processes). In total, this took about 2 hours to run.

(I am including the code in the addendum below. The structure is similar to what I have described in the past on how one might do these types of explorations with simulated data and Bayesian modelling.)

Below is the output for a single data set to provide an example of the data being generated by the simulations. We have estimated seven log-odds ratios (see here for an explanation of why there are seven), and the simulation returns a summary of the posterior distribution for each: selected quantiles and the standard deviation.

##      n    var p0.025 p0.25  p0.5 p0.75 p0.975    sd
## 1: 650 lOR[1]  0.113 0.269 0.354 0.441  0.603 0.126
## 2: 650 lOR[2]  0.410 0.569 0.652 0.735  0.892 0.123
## 3: 650 lOR[3]  0.427 0.585 0.667 0.752  0.906 0.123
## 4: 650 lOR[4]  0.367 0.526 0.608 0.691  0.851 0.122
## 5: 650 lOR[5]  0.436 0.592 0.675 0.757  0.913 0.122
## 6: 650 lOR[6]  0.703 0.861 0.945 1.028  1.180 0.122
## 7: 650 lOR[7]  0.722 0.886 0.969 1.054  1.214 0.125

The plot below shows the estimated standard deviations for a single log-odds ratio (in this case $\lambda_4$), with a point for each of the 1,500 simulated data sets. At 550 subjects, the mean standard deviation (represented by the curve) is starting to get close to 0.135, but there is still quite a bit of uncertainty. To be safe, we might want to set the upper limit for the study to be 600 patients, because we are quite confident that the standard deviation will be low enough to meet our criteria (almost 90% of the standard deviations from the simulations were below 0.135, though at 650 patients that proportion was over 98%).
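The proportions quoted above come from comparing each simulated standard deviation with the 0.135 target. Here is a minimal sketch of that summary step, assuming the results have been stacked into a table with columns `n` and `sd` like the output shown earlier. The numbers in `sim` are made up purely to illustrate the calculation and are not simulation results.

```r
# Made-up posterior SDs for illustration only; real values come from the simulations
sim <- data.frame(
  n  = rep(c(550, 600, 650), each = 4),
  sd = c(0.138, 0.133, 0.141, 0.136,
         0.131, 0.134, 0.129, 0.137,
         0.126, 0.128, 0.124, 0.130)
)

target <- 0.135

# share of data sets at each sample size whose posterior SD is below the target
prop_below <- aggregate(sd ~ n, data = sim, FUN = function(x) mean(x < target))
names(prop_below)[2] <- "p_below"
prop_below
```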

Addendum

This code generates repeated data sets under different sample size assumptions and draws samples from the posterior distribution for each of those data sets. The simulations are set up to run on a high performance computing (HPC) environment, so multiple data sets can be generated and analyzed simultaneously. If you do not have access to an HPC, you can run locally using lapply or mclapply rather than Slurm_lapply, but unless you have an extremely powerful desktop or laptop, expect these kinds of simulations to take days rather than hours.

One particularly interesting feature of the data generation process used in these simulations is that the effect size parameters are not considered to be fixed, but are themselves drawn from a distribution of parameters. Given that we are never certain about what the parameters should be in the data generation process, this adds an appropriate level of uncertainty that gets reflected in our target estimates. If we are slightly conservative in our sample size selection, this will take into account this additional uncertainty. Of course, how much uncertainty will depend on the situation.

library(cmdstanr)
library(simstudy)
library(data.table)
library(posterior)
library(slurmR)
library(glue)

s_define <- function() {
  
  f <- "..t_0 + ..t_a*a + ..t_b*b + ..t_c*c + 
      ..t_ab*a*b + ..t_ac*a*c + ..t_bc*b*c + ..t_abc*a*b*c"
  
  defY <- defDataAdd(varname = "y", formula = f, dist = "binary", link="logit")
  
  return(list(defY = defY)) 
  
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  # introducing uncertainty into the data generation process
  
  t_0 <- mu_int
  t_a <- rnorm(1, mu_a, .10)
  t_b <- rnorm(1, mu_b, .10)
  t_c <- rnorm(1, mu_c, .10)
  t_ab <- rnorm(1, mu_ab, .10)
  t_ac <- rnorm(1, mu_ac, .10)
  t_bc <- rnorm(1, mu_bc, .10)
  t_abc <- mu_abc
  
  dd <- genData(8 * n)
  dd <- addMultiFac(dd, nFactors = 3, colNames = c("a", "b", "c"))
  dd <- addColumns(defY, dd)
  
  return(dd)
  
}

s_model <- function(generated_data, mod) {
  
  dt_to_list <- function(dx) {
    
    N <- nrow(dx)                               
    x_abc <- model.matrix(~a*b*c, data = dx)
    y <- dx[, y]
    
    list(N = N, x_abc = x_abc, y = y)
  }
  
  fit <- mod$sample(
    data = dt_to_list(generated_data),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500,
    adapt_delta = 0.98,
    max_treedepth = 20,
    show_messages = FALSE
  )
  
  posterior <- data.frame(as_draws_rvars(fit$draws(variables = "lOR")))
  
  pcts <- c(.025, 0.25, .50, 0.75, .975)
  sumstats <- data.table(t(quantile(posterior$lOR, pcts)))
  setnames(sumstats, glue("p{pcts}"))
  sumstats$sd <- sd(posterior$lOR)
  sumstats$var <- glue("lOR[{1:7}]") 
  
  return(sumstats) # model_results is a data.table
  
}

s_replicate <- function(argsvec, mod) {
  
  set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")
  
  list_of_defs <- s_define()
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data, mod)
  
  #--- summary statistics ---#
  
  summary_stats <- data.table(t(argsvec), model_results)
  
  return(summary_stats) # summary_stats is a data.table
}

#--- Set arguments ---#

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

n <- c(400, 450, 500, 550, 600, 650)

mu_int <- -1.4
mu_m <- 0.5
mu_x <- -0.3
mu_abc <- 0.3

scenarios <- scenario_list(n = n,
  mu_int = mu_int, mu_a = mu_m, mu_b = mu_m, mu_c = mu_m, 
  mu_ab = mu_x, mu_ac = mu_x, mu_bc = mu_x, mu_abc = mu_abc)

scenarios <- rep(scenarios, each = 250)

#--- run on HPC ---#

set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")
smodel <- cmdstan_model("/gpfs/.../model_ind.stan")

job <- Slurm_lapply(
  X = scenarios, 
  FUN = s_replicate, 
  mod = smodel,
  njobs = min(90L, length(scenarios)), 
  mc.cores = 4L,
  job_name = "i_ss",
  tmp_path = "/gpfs/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "12:00:00", partition = "cpu_short", `mem-per-cpu` = "4G"),
  export = c("s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

res <- Slurm_collect(job)

save(res, file = "/gpfs/.../post_ss.rda")
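To see what `scenario_list` hands to each replication, here is a small stand-alone sketch using the same function with a shortened argument list. Only three of the arguments are included, for brevity, and `scen` and `scen_rep` are names used just for this illustration.

```r
# Same helper as above: one named vector per row of the parameter grid
scenario_list <- function(...) {
  argmat <- expand.grid(...)
  asplit(argmat, MARGIN = 1)
}

# shortened argument list for illustration
scen <- scenario_list(n = c(400, 450, 500, 550, 600, 650), mu_int = -1.4, mu_a = 0.5)

length(scen)      # one element per sample size
scen[[1]]         # named vector holding n, mu_int, and mu_a for the first scenario

# repeating each scenario 250 times yields one list element per simulated data set
scen_rep <- rep(scen, each = 250)
length(scen_rep)
```

Because each list element is a named vector, `list2env(as.list(argsvec), ...)` inside `s_generate` can turn the names into local variables.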

A Bayesian analysis of a factorial design focusing on effect size estimates

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 12 Oct 2021 00:00:00 +0000

Factorial study designs present a number of analytic challenges, not least of which is how to best understand whether simultaneously applying multiple interventions is beneficial. Last time I presented a possible approach that focuses on estimating the variance of effect size estimates using a Bayesian model. The scenario I used there focused on a hypothetical study evaluating two interventions with four different levels each. This time around, I am considering a proposed study to reduce emergency department (ED) use for patients living with dementia that I am actually involved with. This study would have three different interventions, but only two levels for each (i.e., yes or no), for a total of 8 arms. In this case, the model I proposed previously does not seem like it would work well; the posterior distributions based on the variance-based model turn out to be bi-modal in shape, making it quite difficult to interpret the findings. So, I decided to turn the focus away from variance and emphasize the effect size estimates for each arm compared to control.

Model specification

As I mentioned, this is a case with three interventions ($a$, $b$, and $c$), each of which has two levels; the full factorial design will have 8 arms:

\[\begin{aligned} (1) \ a&=0, \ b=0, \ c=0 \\ (2) \ a&=1, \ b=0, \ c=0 \\ (3) \ a&=0, \ b=1, \ c=0 \\ (4) \ a&=0, \ b=0, \ c=1 \\ (5) \ a&=1, \ b=1, \ c=0 \\ (6) \ a&=1, \ b=0, \ c=1 \\ (7) \ a&=0, \ b=1, \ c=1 \\ (8) \ a&=1, \ b=1, \ c=1 \\ \end{aligned}\]

Although the proposed study is a cluster randomized trial, where each participating site will be assigned to one of the eight arms, I am simplifying things here a bit by assuming each individual patient $i$ will be randomized to each of $a$, $b$, and $c$, and $a_i \in \{0,1\}$, $b_i\in \{0,1\}$, and $c_i\in \{0,1\}$.

Here is a model for outcome $y_i$, a binary measure $(y_i \in \{0,1\})$, where the log-odds of the outcome for each patient is a function of the random assignment:

\[ y_{i} \sim \text{binomial}\left(p_{i}\right) \]

\[ \text{log}\left( \frac{p_{i}}{1-p_{i}}\right) = \tau_0 + \tau_a a_i + \tau_b b_i + \tau_c c_i + \tau_{ab} a_i b_i + \tau_{ac} a_i c_i + \tau_{bc} b_i c_i + \tau_{abc}a_i b_i c_i \]

This is just a standard logistic model specification, where the parameters can be interpreted as log-odds ratios. For example, $\lambda_b = \tau_b$ is the log odds ratio comparing patients randomized to receive only $b$ (group 3 from above) with the control arm where patients receive none of the interventions (group 1), and $\lambda_{ac} = \tau_a + \tau_c + \tau_{ac}$ is the log odds ratio comparing patients randomized to only $a$ and $c$ but not $b$ (group 6) compared with the control patients (group 1). This is the full set of log odds ratios for this design:

\[\begin{aligned} \lambda_a &= \tau_a \\ \lambda_b &= \tau_b \\ \lambda_c &= \tau_c \\ \lambda_{ab} &= \tau_a + \tau_b + \tau_{ab} \\ \lambda_{ac} &= \tau_a + \tau_c + \tau_{ac} \\ \lambda_{bc} &= \tau_b + \tau_c + \tau_{bc} \\ \lambda_{abc} &= \tau_a + \tau_b + \tau_c + \tau_{ab} + \tau_{ac} + \tau_{bc} + \tau_{abc} \\ \end{aligned}\]

The focus of the analysis is to estimate posterior probability distributions for the $\lambda\text{'s}$, and possibly to compare across the $\lambda\text{'s}$ (also using posterior distributions) to assess whether combining multiple interventions seems beneficial.

Prior distribution assumptions

Here are the prior distribution assumptions for the parameters in the Bayesian model:

\[\begin{aligned} \tau_0 &\sim N(\mu=0, \sigma = 1) \\ \tau_a, \tau_b, \tau_c &\sim N(\mu = \delta_m, \sigma = \sigma_m) \\ \tau_{ab}, \tau_{ac}, \tau_{bc} &\sim N(\mu = \delta_x, \sigma = \sigma_x) \\ \tau_{abc} &\sim N(\mu = 0, \sigma = 1) \\ \delta_m &\sim N(\mu = 0, \sigma = 1) \\ \sigma_m &\sim t_\text{student}(\text{df}=3, \mu=0, \sigma = 2.5), \ \sigma_m \ge 0 \\ \delta_x &\sim N(0, 1) \\ \sigma_x &\sim t_\text{student}(\text{df}=3, \mu = 0, \sigma = 2.5), \ \sigma_x \ge 0 \\ \end{aligned}\]

While the focus of this model estimation is different from the approach I discussed last time, the prior distributions here share a key element with the earlier model. The priors for the main effects $\tau_a, \ \tau_b, \text{ and } \tau_c$ share a common mean $\delta_m$ and standard deviation $\sigma_m$. Likewise, the prior distributions for the pair-wise interaction effects share a common mean $\delta_x$ and standard deviation $\sigma_x$. These four hyperparameters are estimated from the data. The prior distributions for the mean intervention effects $\delta_m$ and $\delta_x$ are specified with an aim towards conservatism or skepticism, with a large portion of the distribution centered around 0. The priors for the variance parameters are more diffuse (using a $t$-distribution with 3 degrees of freedom, a compromise between a Cauchy distribution with very broad tails and a normal distribution with more constrained tails).
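To get a feel for what these priors imply, here is a rough sketch that draws the eight $\tau$ parameters from the hierarchy in base R. It illustrates the prior structure only and is not part of the model fitting; the helper names `r_half_t` and `draw_prior` are made up for this example, and the half-$t$ draw is taken as the absolute value of a scaled $t$ variate.

```r
set.seed(2021)

# half-t draw: absolute value of a scaled Student-t variate
r_half_t <- function(n, df, scale) abs(scale * rt(n, df))

# one joint draw of the eight tau parameters from the priors
draw_prior <- function() {
  delta_m <- rnorm(1, 0, 1)
  sigma_m <- r_half_t(1, df = 3, scale = 2.5)
  delta_x <- rnorm(1, 0, 1)
  sigma_x <- r_half_t(1, df = 3, scale = 2.5)

  tau <- c(
    rnorm(1, 0, 1),              # tau_0
    rnorm(3, delta_m, sigma_m),  # tau_a, tau_b, tau_c share delta_m and sigma_m
    rnorm(3, delta_x, sigma_x),  # tau_ab, tau_ac, tau_bc share delta_x and sigma_x
    rnorm(1, 0, 1)               # tau_abc
  )
  names(tau) <- c("t_0", "t_a", "t_b", "t_c", "t_ab", "t_ac", "t_bc", "t_abc")
  tau
}

# 1000 prior draws, one row per draw
prior_draws <- t(replicate(1000, draw_prior()))
dim(prior_draws)
```

Within a single draw, the three main effects cluster around their shared mean, which is the pooling that the hyperparameters are meant to induce.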

Statistical inference will be based on an examination of the posterior distributions for the log odds ratios comparing each of the treatment combinations with the control arm where none of the interventions is implemented. We can also compare across different combinations to assess if one particular combination seems to be stronger than another. Since we are not using a null-hypothesis testing framework and the effect estimates are pooled across the interventions, adjustments for multiple testing are not necessary. (In the future, I can show results of the experiments where I explored the operating characteristics of these models. Because of the pooling and shrinkage that is built into the model, there are no inflated type 1 errors, analogous to the situation where I evaluated Bayesian methods for subgroup analysis.)

Data definition and generation

Here are the libraries needed for the simulation, model estimation, and presentation of results:

library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(glue)
library(ggplot2)
library(cowplot)
library(ggdist)
library(paletteer)

In this simulation, the log odds for the outcome in the control group has been set at -1.4, corresponding to odds = exp(-1.4) = 0.25, and probability of outcome = 1/(1+exp(1.4)) = 20%. Here are the log-odds ratios that I assumed for each of the different arms with at least one treatment assignment:

\[\begin{aligned} \lambda_a &= 0.5 \\ \lambda_b &= 0.7 \\ \lambda_c &= 0.0 \\ \lambda_{ab} &= 0.5 + 0.7 - 0.3 = 0.9 \\ \lambda_{ac} &= 0.5 + 0.0 + 0.0 = 0.5 \\ \lambda_{bc} &= 0.7 + 0.0 + 0.0 = 0.7 \\ \lambda_{abc} &= 0.5 + 0.7 + 0.0 - 0.3 + 0.0 + 0.0 + 0.0 = 0.9 \\ \end{aligned}\]

f <- "..t_0 + ..t_a*a + ..t_b*b + ..t_c*c + 
      ..t_ab*a*b + ..t_ac*a*c + ..t_bc*b*c + ..t_abc*a*b*c"

defY <- defDataAdd(varname = "y", formula = f, dist = "binary", link="logit")

t_0 <-  -1.4

t_a <-   0.5
t_b <-   0.7
t_ab <- -0.3

t_c <- t_ac <- t_bc <- t_abc <- 0.0
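As a check on the arithmetic, the seven log-odds ratios can be recovered from the $\tau$ values set in the code above with a small contrast matrix. This is only a sketch; the names `tau`, `C`, and `lambda` are introduced here and are not used elsewhere in the post.

```r
# tau values as set in the simulation (excluding the intercept)
tau <- c(t_a = 0.5, t_b = 0.7, t_c = 0, t_ab = -0.3, t_ac = 0, t_bc = 0, t_abc = 0)

# rows: treatment arms; columns: which tau terms enter each log-odds ratio
C <- rbind(
  a   = c(1, 0, 0, 0, 0, 0, 0),
  b   = c(0, 1, 0, 0, 0, 0, 0),
  c   = c(0, 0, 1, 0, 0, 0, 0),
  ab  = c(1, 1, 0, 1, 0, 0, 0),
  ac  = c(1, 0, 1, 0, 1, 0, 0),
  bc  = c(0, 1, 1, 0, 0, 1, 0),
  abc = c(1, 1, 1, 1, 1, 1, 1)
)

# log-odds ratio for each arm relative to control
lambda <- drop(C %*% tau)
lambda

# control-arm probability implied by the intercept of -1.4
plogis(-1.4)
```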

4000 patients will be randomized to the eight arms, 500 in each:

set.seed(37159)

dd <- genData(8*500)
dd <- addMultiFac(dd, nFactors = 3, colNames = c("a", "b", "c"))
dd <- addColumns(defY, dd)

dd

##         id a b c y
##    1:    1 1 0 0 0
##    2:    2 1 0 0 0
##    3:    3 1 0 0 1
##    4:    4 1 0 0 0
##    5:    5 0 0 1 0
##   ---             
## 3996: 3996 1 1 1 0
## 3997: 3997 1 1 0 0
## 3998: 3998 1 0 1 0
## 3999: 3999 0 0 1 0
## 4000: 4000 0 1 0 1

Here are the observed proportions by treatment arm. The fact that the two panels ($c = 0$ and $c = 1$) are pretty similar is an indication that intervention $c$ has no impact. And the fact that the lines are not parallel in each panel is an indication that there is some interaction (in this case negative).

Model fitting

The Bayesian sampling is using four chains of length 2,500 (following 1,000 warm-up iterations for each), so the posterior distribution will be estimated with 10,000 total samples. The code for the Stan model can be found in the addendum.

dt_to_list <- function(dx) {
  
  N <- nrow(dx)                          
  x_abc <- model.matrix(~a*b*c, data = dx)
  y <- dx[, y]
  
  list(N = N, x_abc = x_abc, y = y)
}

mod <- cmdstan_model("code/model_ind.stan")

fit <- mod$sample(
  data = dt_to_list(dd),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 2500,
  adapt_delta = 0.98,
  max_treedepth = 20,
  show_messages = FALSE,
  seed = 29817
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 113.5 seconds.
## Chain 3 finished in 124.9 seconds.
## Chain 4 finished in 129.5 seconds.
## Chain 2 finished in 130.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 124.5 seconds.
## Total execution time: 130.5 seconds.
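One detail worth checking in `dt_to_list` is the column order of the design matrix, since the Stan model indexes the coefficients by position. Here is a small sketch using one row per arm; `arms` and `X` are illustrative names.

```r
# one row per arm of the 2 x 2 x 2 factorial
arms <- expand.grid(a = 0:1, b = 0:1, c = 0:1)

# same formula as in dt_to_list
X <- model.matrix(~ a * b * c, data = arms)

# intercept, main effects, two-way interactions, then the three-way interaction
colnames(X)
```

The order (intercept, three main effects, three two-way interactions, one three-way interaction) is what allows the eight coefficients to be grouped by position in the Stan code.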

Presenting the results

Here is the code for the first plot, which shows the distribution of effect sizes (on the log-odds scale) for each of the intervention arms. I’ve extracted the samples using the posterior package function as_draws_rvars that I recently described here.

posterior <- data.frame(as_draws_rvars(fit$draws(variables = "lOR")))

pcts <- c(.025, 0.25, .50, 0.75, .975)
sumstats <- data.table(t(quantile(posterior$lOR, pcts)))
setnames(sumstats, glue("p{pcts}"))
sumstats$var <- glue("lOR[{1:7}]") 

p <- ggplot(data = sumstats, aes(y = var, yend = var)) +
  geom_vline(xintercept = 0, color = "grey85") +
  geom_segment(aes(x = p0.025, xend = p0.975)) +
  geom_segment(aes(x = p0.25, xend = p0.75), 
    size = 1.25, color = palettes_d$wesanderson$Moonrise2[2]) +
  geom_point(aes(x = p0.5), size = 2.5) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(margin = margin(t = 0, r = -12, b = 0, l = 0)),
        plot.title = element_text(size = 10, face = "bold")
  ) +
  ylab("treatment assignments (three interventions)") +
  xlab("log odds ratio") +
  xlim(-.5, 1.5) +
  ggtitle("Posterior distribution of log OR by treatment assignment")

pimage <- axis_canvas(p, axis = 'y') +
  draw_image("r_icons/r111.png", y = 6.5, scale = 0.35) +
  draw_image("r_icons/r011.png", y = 5.5, scale = 0.35) +
  draw_image("r_icons/r101.png", y = 4.5, scale = 0.35) +
  draw_image("r_icons/r110.png", y = 3.5, scale = 0.35) +
  draw_image("r_icons/r001.png", y = 2.5, scale = 0.35) +
  draw_image("r_icons/r010.png", y = 1.5, scale = 0.35) +
  draw_image("r_icons/r100.png", y = 0.5, scale = 0.35)

Looking at the figure, it is apparent that $a$ and $b$ likely had an effect, while $c$ probably did not. It also appears that the combination of $a$ and $b$ might be an improvement, both with and without $c$:

ggdraw(insert_yaxis_grob(p, pimage, position = "left", width = grid::unit(.17, "null")))

In the next and last plot, my goal is to compare the log-odds ratios of the different arms. I am showing the posterior distributions for the differences between the estimated log-odds ratios. In this particular data set, $a$ does not look any different from $b$, but the combination of $a$ and $b$ does indeed look superior to either alone, regardless of whether $c$ is involved:

data <- with(posterior, data.frame(
  x = c(
    "(1) b vs a",
    "(2) ab vs a", 
    "(3) ab vs b", 
    "(4) abc vs ab",
    "(5) abc vs ac",
    "(6) abc vs bc"
  ),
  diff = c(
    lOR[2] - lOR[1], 
    lOR[4] - lOR[1], 
    lOR[4] - lOR[2], 
    lOR[7] - lOR[4],
    lOR[7] - lOR[5], 
    lOR[7] - lOR[6]
  )
))

ggplot(data = data, aes(dist = diff, x = x)) +
  geom_hline(yintercept = 0, color = "grey80", size = .3) +
  stat_dist_eye(fill = palettes_d$wesanderson$Moonrise2[1], position="dodge") +
  theme(panel.grid = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank()) +
  ylab("difference")
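A numerical companion to the eye plots is the posterior probability that a difference is positive, which is just the share of draws above zero. The sketch below uses independent stand-in draws generated with `rnorm`, not draws from the fitted model (real posterior draws would be correlated); the means and standard deviation are illustrative assumptions.

```r
set.seed(77)

# Stand-in draws, NOT output from the fitted model
lOR_a  <- rnorm(10000, mean = 0.5, sd = 0.13)   # hypothetical draws for arm a
lOR_ab <- rnorm(10000, mean = 0.9, sd = 0.13)   # hypothetical draws for arm ab

# draw-by-draw difference, analogous to lOR[4] - lOR[1]
diff_ab_a <- lOR_ab - lOR_a

# share of draws in which ab beats a, and a 95% interval for the difference
mean(diff_ab_a > 0)
quantile(diff_ab_a, c(.025, .5, .975))
```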

Ultimately, how we present the data and draw our conclusions will depend on what we specify up front regarding the parameters and comparisons of interest. The great thing about a Bayesian model is that we have estimated everything in a single model, so there are no real concerns with multiple comparisons. However, reviewers still like to see results for analyses that were pre-specified. And if a decision is to be made based on those results, those decision rules should be pre-specified. But, my preference would be to show the findings and let readers decide if the results are compelling and/or determine if a more focused trial is needed.

In the next (and most likely, for now at least, final) post on this topic, I plan on describing how I approached sample size estimation for this proposed study.

Addendum

data {
  
  int<lower=0> N;                       // number patients
  matrix<lower=0, upper=1>[N, 8] x_abc;
  int<lower=0,upper=1> y[N];            // outcome for individual i
  
}

parameters {
  
  vector[8] z;

  real delta_m;
  real<lower = 0> sigma_m;
  
  real delta_x;
  real<lower=0> sigma_x;
  
}

transformed parameters {
  
  vector[8] tau;
  
  tau[1] = z[1];
  
  for (i in 2:4){
    tau[i] = sigma_m * z[i] + delta_m;
  }
  
  for (i in 5:7){
    tau[i] = sigma_x * z[i] + delta_x;
  }
  
  tau[8] = z[8];
  
  
}

model {
  
  sigma_m ~ student_t(3, 0, 2.5);
  sigma_x ~ student_t(3, 0, 2.5);

  delta_m ~ normal(0, 1);
  delta_x ~ normal(0, 1);
  
  z ~ std_normal();

  y ~ bernoulli_logit(x_abc * tau);
  
}

generated quantities {
  
  real lOR[7];
  
  lOR[1] = tau[2];                                            //  a=1, b=0, c=0
  lOR[2] = tau[3];                                            //  a=0, b=1, c=0
  lOR[3] = tau[4];                                            //  a=0, b=0, c=1
  lOR[4] = tau[2] + tau[3] + tau[5];                          //  a=1, b=1, c=0
  lOR[5] = tau[2] + tau[4] + tau[6];                          //  a=1, b=0, c=1
  lOR[6] = tau[3] + tau[4] + tau[7];                          //  a=0, b=1, c=1
  lOR[7] = tau[2]+tau[3]+tau[4]+tau[5]+tau[6]+tau[7]+tau[8];  //  a=1, b=1, c=1
  
}
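The transformed parameters block uses what is usually called a non-centered parameterization: standard normal variates `z` are scaled and shifted rather than sampling `tau` directly from $N(\delta, \sigma)$. A quick sketch in R, with illustrative values for `delta` and `sigma`, suggests that the two routes give the same distribution:

```r
set.seed(404)

delta <- 0.4    # illustrative hyperparameter values, not estimates
sigma <- 0.25

# non-centered: scale and shift standard normal draws
z <- rnorm(1e5)
tau_nc <- sigma * z + delta

# centered: draw directly from N(delta, sigma)
tau_c <- rnorm(1e5, mean = delta, sd = sigma)

# both should have mean near delta and sd near sigma
c(mean = mean(tau_nc), sd = sd(tau_nc))
c(mean = mean(tau_c), sd = sd(tau_c))
```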

Analyzing a factorial design by focusing on the variance of effect sizes

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 28 Sep 2021 00:00:00 +0000

Way back in 2018, long before the pandemic, I described a soon-to-be implemented simstudy function genMultiFac that facilitates the generation of multi-factorial study data. I followed up that post with a description of how we can use these types of efficient designs to answer multiple questions in the context of a single study.

Fast forward three years, and I am thinking about these designs again for a new grant application that proposes to study simultaneously three interventions aimed at reducing emergency department (ED) use for people living with dementia. The primary interest is to evaluate each intervention on its own terms, but also to assess whether any combinations seem to be particularly effective. While this will be a fairly large cluster randomized trial with about 80 EDs being randomized to one of the 8 possible combinations, I was concerned about our ability to estimate the interaction effects of multiple interventions with sufficient precision to draw useful conclusions, particularly if the combined effects of two or three interventions are less than additive. (That is, two interventions may be better than one, but not twice as good.)

I am thinking that a null hypothesis testing framework might not be so useful here, given that the various estimates could be highly uncertain, not to mention the multiple statistical tests that we would need to conduct (and presumably adjust for). Rather, a Bayesian approach that pools estimates across interventions and provides posterior probability distributions may provide more insight into how the interventions interact, and could be a better way to go.

With this in mind, I went to the literature, and I found these papers by Kassler et al and Gelman. They both describe a way of thinking about interaction that emphasizes the estimates of variance across effect estimands. I went ahead and tested the idea with simulated data, which I’m showing here. Ultimately, I decided that this approach will not work so well for our study, and I came up with a pretty simple solution that I will share next time.

Identifying interaction through variance

The scenarios described by both papers involve studies that may be evaluating many possible interventions or exposures, each of which may have two or more levels. If we are dealing with a normally distributed (continuous) outcome measure, we can model that outcome as

\[ y_{i} \sim N\left(\mu = \tau_0 + \tau^1_{j_{1_i}} + \dots + \tau^k_{j_{k_i}} + \tau^{12}_{j_{12_i}} + \dots + \tau^{k-1, k}_{j_{k-1,k_i}} + \tau^{123}_{j_{123_i}} + \dots + \tau^{k-2, k-1, k}_{j_{k-2, k-1, k_i}} + \dots, \ \sigma = \sigma_0\right), \]

where there are $K$ interventions, and intervention $k$ has $j_k$ levels. So, if intervention $3$ has 4 levels, $j_3 \in \{1,2,3,4\}.$ $\tau_0$ is effectively the grand mean. $\tau^k_1, \tau^k_2, \dots, \tau^k_{j_k},$ are the mean contributions for the $k$th intervention, and we constrain $\sum_{m=1}^{j_k} \tau^k_m = 0.$ Again, for intervention $3$, we would have $\tau^3_1 \dots, \tau^3_4,$ with $\sum_{m=1}^{4} \tau^3_m = 0.$

The adjustments made for the two-way interactions are represented by the $\tau^{12}$’s through the $\tau^{k-1,k}$’s. If intervention 5 has $2$ levels then for the interaction between interventions 3 and 5 we have $\tau^{35}_{11}, \tau^{35}_{12}, \tau^{35}_{21}, \dots, \tau^{35}_{42}$ and $\sum_{m=1}^4 \sum_{n=1}^2 \tau^{35}_{m,n} = 0.$

This pattern continues for higher orders of interaction (i.e. 3-way, 4-way, etc.).

In the Bayesian model, each set of $\tau^k$’s shares a common prior distribution with mean 0 and standard deviation $\sigma_k$:

\[ \tau^k_1, \dots, \tau^k_{j_k} \sim N(\mu = 0, \sigma = \sigma_k), \]

where $\sigma_k$ is a hyperparameter that will be estimated from the data. The same is true for the interaction terms for interventions $k$ and $l$:

\[ \tau^{kl}_{11}, \dots, \tau^{kl}_{j_k, j_l} \sim N(\mu = 0, \sigma = \sigma_{kl}), \ \ \text{where } k < l \]

To assess whether there is interaction between the interventions (i.e. the effects are not merely additive), we are actually interested in the variance parameters of the interaction $\tau\text{'s}$. If, for example, there is no interaction between different levels of interventions 3 and 5, then $\sigma_{35}$ should be close to $0$, implying that $\tau^{35}_{11} \approx \tau^{35}_{12} \approx \dots \approx \tau^{35}_{42} \approx 0$. On the other hand, if there is some interaction effect, then $\sigma_{35} > 0,$ implying that at least one $\tau^{35} > 0.$

One advantage of the proposed Bayesian model is that we can use partial pooling to get more precise estimates of the variance terms. By this, I mean that we can use information from each $\sigma^{kl}$ to inform the others. So, in the case of 2-way interaction, the prior probability assumption would suggest that the variance terms were drawn from a common distribution:

\[ \sigma^{12}, \sigma^{13}, \dots, \sigma^{k-1,k} \sim N(\mu = 0, \sigma = \sigma_{\text{2-way}}) \]

We can impose more structure (and hopefully precision) by doing the same for the main effects:

\[ \sigma^{1}, \sigma^{2}, \dots, \sigma^{k} \sim N(\mu = 0, \sigma = \sigma_{\text{main}}) \]

Of course, for each higher order interaction (above 2-way), we could impose the same structure:

\[ \sigma^{123}, \dots, \sigma^{12k}, \dots, \sigma^{k-2, k-1, k} \sim N(\mu = 0, \sigma = \sigma_{\text{3-way}}) \]

And so on. Though at some point, we might want to assume that there is no higher order interaction and exclude it from the model; in most cases, we could stop at 2- or 3-way interaction and probably not sacrifice too much.

Example from simulation

When I set out to explore this model, I started relatively simple, using only two interventions with four levels each. In this case, the factorial study would have 16 total arms $(4 \times 4)$. (Since I am using only 2 interventions, I am changing the notation slightly, using interventions $a$ and $b$ rather than $1$ and $2$.) Individual $i$ is randomized to one level of $a$ and one level of $b$, so $a_i \in \{1,2,3,4\}$, $b_i\in \{1,2,3,4\}$, and $ab_i \in \{11, 12, 13, 14, 21, 22, \dots, 44\}.$ Using the same general model from above, here is the specific model for continuous $y$:

\[ y_{i} \sim N\left(\mu = \tau_0 + \tau^a_{a_i} + \tau^b_{b_i} + \tau^{ab}_{ab_i}, \ \sigma = \sigma_0\right) \]

Take note that we only have a single set of 2-way interactions since there are only two groups of interventions. Because of this, there is no need for a $\sigma_{\text{2-way}}$ hyperparameter; however, there is a hyperparameter $\sigma_{\text{main}}$ to pool across the main effects of $a$ and $b$. Here are the prior distribution assumptions:

\[\begin{aligned} \tau_0 &\sim N(0, 5) \\ \tau^a_1, \tau^a_2, \tau^a_3, \tau^a_4 &\sim N(0, \sigma_a) \\ \tau^b_1, \tau^b_2, \tau^b_3, \tau^b_4 &\sim N(0, \sigma_b) \\ \tau^{ab}_{11}, \tau^{ab}_{12}, \dots, \tau^{ab}_{44} &\sim N(0, \sigma_{ab}) \\ \sigma_a, \sigma_b &\sim N(0, \sigma_\text{main}) \\ \sigma_{ab} &\sim N(0, 5) \\ \sigma_\text{main} &\sim N(0, 5) \\ \sigma_0 &\sim N(0,5) \end{aligned}\]

In order to ensure identifiability, we have the following constraints:

\[\begin{aligned} \tau^a_1 + \tau^a_2 + \tau^a_3 + \tau^a_4 &= 0 \\ \tau^b_1 + \tau^b_2 + \tau^b_3 + \tau^b_4 &= 0 \\ \tau^{ab}_{11} + \tau^{ab}_{12} + \dots + \tau^{ab}_{43} + \tau^{ab}_{44} &= 0 \end{aligned}\]

Required libraries

library(simstudy)
library(data.table)
library(cmdstanr)
library(caret)
library(posterior)
library(bayesplot)
library(ggdist)
library(glue)

Data generation

The parameters $\tau_0, \tau_a, \tau_b, \text{ and } \tau_{ab}$ are set so that there is greater variation in treatment $a$ compared to treatment $b$. In both cases, the sum of the parameters is set to $0$.

t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(-3, -1, 0, 4)

The interaction is set in this case so that there is an added effect when both $a=2 \ \& \ b=2$ and $a=3 \ \& \ b=2$. Again, the parameters are set so that the sum-to-zero constraint is maintained.

x <- c(4, 3) 
nox <- - sum(x) / (16 - length(x))

t_ab <- matrix(c(nox, nox, nox, nox,
                 nox,   4, nox, nox,
                 nox,   3, nox, nox,
                 nox, nox, nox, nox), nrow = 4, byrow = TRUE)

t_ab

##      [,1] [,2] [,3] [,4]
## [1,] -0.5 -0.5 -0.5 -0.5
## [2,] -0.5  4.0 -0.5 -0.5
## [3,] -0.5  3.0 -0.5 -0.5
## [4,] -0.5 -0.5 -0.5 -0.5

sum(t_ab)

## [1] 0

The data definitions for the arm assignments and the outcome $y$ are established using the simstudy package:

d1 <- defDataAdd(varname = "y", formula = "mu", variance = 16, dist = "normal")

Now we are ready to generate the data:

set.seed(110)

dd <- genMultiFac(nFactors = 2, levels = 4, each = 30, colNames = c("a", "b"))
dd[, mu := t_0 + t_a[a] + t_b[b] + t_ab[a, b], keyby = id]
dd <- addColumns(d1, dd)
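
As a quick check on what these parameters imply, the expected mean of each $a/b$ cell can be tabulated directly. This is only a sketch in base R that re-creates the parameter values from above; the object name mu_cells is my own and is not used elsewhere in the post:

```r
t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(-3, -1, 0, 4)

t_ab <- matrix(-0.5, nrow = 4, ncol = 4)
t_ab[2, 2] <- 4
t_ab[3, 2] <- 3

# expected mean for each cell: rows are levels of a, columns are levels of b
mu_cells <- t_0 + outer(t_a, t_b, "+") + t_ab
dimnames(mu_cells) <- list(a = 1:4, b = 1:4)

mu_cells
```

The cells $a=2, b=2$ and $a=3, b=2$ are the only ones that depart from the otherwise additive pattern.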

Plot of $\bar{y}$ by arm

The plot shows the average outcomes by arm. The interaction when $a=2 \ \& \ b=2$ and $a=3 \ \& \ b=2$ is apparent in the two locations where the smooth pattern of increases is interrupted.

Sampling from the posterior

The function shown next simply generates the data needed by Stan. (The Stan implementation is shown below in the addendum.) Take note that we convert the $\tau_{ab}$ design matrix of 0’s and 1’s to a single vector with values ranging from 1 to 16.

dt_to_list <- function(dx) {
  
  dx[, a_f := factor(a)]
  dx[, b_f := factor(b)]
  
  dv <- dummyVars(~ b_f:a_f , data = dx, n = c(4, 4))
  dp <- predict(dv, dx )
  
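  N <- nrow(dx)                               ## number of observations 
  I <- 2                                      ## number of interventions
  X2 <- 1                                     ## number of sets of 2-way interactions
  
  main <- as.matrix(dx[,.(a,b)])

  ab <- as.vector(dp %*% c(1:16))             ## interaction cell index (1 to 16)
  x <- matrix(ab, nrow = N, ncol = X2)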
  
  y <- dx[, y]
  
  list(N=N, I=I, X2=X2, main=main, x=x, y=y)
  
}
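
To make the 1-to-16 coding concrete, here is a minimal sketch in base R that builds the same kind of index without the dummy matrix. It assumes the cells are ordered with $b$ varying fastest within each level of $a$, which is what the labels used later in this post imply (t_x[1,6] for $\tau_{22}$ and t_x[1,10] for $\tau_{32}$). The function name cell_index is made up for illustration:

```r
# hypothetical helper: map a pair of levels (a, b) to a single cell index,
# assuming b varies fastest within a
cell_index <- function(a, b, n_levels = 4) {
  (a - 1) * n_levels + b
}

grid <- expand.grid(b = 1:4, a = 1:4)   # b varies fastest
grid$ab <- cell_index(grid$a, grid$b)

cell_index(2, 2)   # index for the cell a = 2, b = 2
cell_index(3, 2)   # index for the cell a = 3, b = 2
```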

I am using cmdstanr to interact with Stan:

mod <- cmdstan_model("code/model_2_factors.stan", force_recompile = TRUE)

fit <- mod$sample(
  data = dt_to_list(dd),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500,
  adapt_delta = 0.99,
  step_size = .05,
  max_treedepth = 20,
  seed = 1721
)

Diagnostic checks

Here is just one set of trace plots for $\tau^a_1, \dots, \tau^a_4$ that indicate the sampling went quite well - the variables not shown were equally well-behaved.

posterior <- as_draws_array(fit$draws())
mcmc_trace(posterior, pars = glue("t[{1},{1:4}]"))

Variance estimates

Since we are focused on the possibility of 2-way interaction, the primary parameter of interest is $\sigma_{ab},$ the variation of the interaction effects. (In the Stan model specification this variance parameter is sigma_x, as in interaction.) The plot shows the 95% credible intervals for each of the main effect variance parameters as well as the interaction variance parameter.

The fact that the two main effect variance parameters ($\sigma_a$ and $\sigma_b$) are greater than zero supports the data generation process which assumed different outcomes for different levels of interventions $a$ and $b$, respectively.

Likewise, the credible interval for $\sigma_{ab}$ (sigma_x) is shifted away from zero, suggesting there might be some interaction between $a$ and $b$ at certain levels of each.

mcmc_intervals(posterior, pars = c(glue("sigma_m[{1:2}]"), "sigma_x[1]"))

We can home in a bit more on the specific estimates of the $\tau_{ab}$’s to see where those interactions might be occurring. It appears that t_x[1,6] (representing $\tau_{22}$) is an important interaction term, which is consistent with the data generation process. However, $\tau_{32}$, represented by t_x[1,10], is not obviously important. Perhaps we need more data.

mcmc_intervals(posterior, pars = glue("t_x[1,{1:16}]"))

Below is a visual representation of how well the model fits the data, showing the interval of predicted cell means for each $a/b$ pair. The observed means (shown as white dots) sit on top of the predictions (shown by the colored lines), suggesting the model is appropriate.

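# observed cell means; dsum is not defined in the code shown, so this
# definition is an assumption based on how it is used in geom_point() below
dsum <- dd[, .(yhat = mean(y)), keyby = .(a, b)]

r <- as_draws_rvars(fit$draws(variables = c("t_0","t", "t_x")))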

dnew <- data.frame(
  genMultiFac(nFactors = 2, levels = 4, each = 1, colNames = c("b", "a")))

dnew$yhat <- with(r, 
  rep(t_0, 16) + rep(t[1, ], each = 4) + rep(t[2, ], times = 4) + t(t_x))

ggplot(data = dnew, aes(x=b, dist = yhat)) +
  geom_vline(aes(xintercept = b), color = "white", linewidth = .25) +
  stat_dist_lineribbon() +
  geom_point(data = dsum, aes(y = yhat), color = "white", size = 2) +
  facet_grid(.~a, labeller = labeller(a = label_both)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())  + 
  scale_fill_brewer()

Only one treatment effect and no interaction

Perhaps the rationale for focusing on the variance can be best appreciated by looking at a contrasting scenario where there is only a single main effect (for intervention $a$) and no interaction. Here we would expect the estimates for the intervention $b$ main effects variance as well as the variance of the interaction terms to be close to zero.

t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(0, 0, 0, 0)
t_ab <- matrix(0, nrow = 4, ncol = 4)

The plot of the observed means is consistent with the data generation process:

And yes, the posterior distribution for $\sigma_{ab}$ (sigma_x) is now very close to zero …

and the effect parameters are all centered around zero:

Once again, the predicted values are quite close to the observed means - indicating the model is a good fit:

Next steps

In the motivating application, there are actually three interventions, but each one has only two levels (yes or no). In this case, the level mean and across-level variance parameters were poorly estimated, probably because there are so few levels. This forced me to take a more traditional approach, where I estimate the means of each randomization arm. I’ll share that next time.

References:

Gelman, Andrew. “Analysis of variance—why it is more important than ever.” The annals of statistics 33, no. 1 (2005): 1-53.

Kassler, Daniel, Ira Nichols-Barrer, and Mariel Finucane. “Beyond ‘treatment versus control’: How Bayesian analysis makes factorial experiments feasible in education research.” Evaluation review 44, no. 4 (2020): 238-261.

Addendum

The model is implemented in Stan using a non-centered parameterization, so that the parameters $\tau$ are a function of a set of $z$ parameters, which are standard normal parameters. This does not dramatically change the estimates, but it eliminates divergent transitions, improving sampling behavior.

data {
  
  int<lower=1> N;        // number of observations
  int<lower=1> I;        // number of interventions
  int<lower=1> X2;       // number of 2-way interactions
  array[N, I] int main;  // interventions
  array [N, X2] int x;   // interactions - provide levels for each intervention?

  vector[N] y;           // outcome
  
}

parameters {
  
  real t_0;
  
  array[I] vector[3] z_raw;
  array[X2] vector[15] z_x_raw;
  
  real<lower=0> sigma;
  array[I] real<lower=0> sigma_m;
  array[X2] real<lower=0> sigma_x;
  
  real<lower=0> sigma_main;

}

transformed parameters {
  
  // constrain parameters to sum to 0
  
  array[I] vector[4] z; 
  array[X2] vector[16] z_x; 
  
  array[I] vector[4] t;
  array[X2] vector[16] t_x;
  
  vector[N] yhat;
  
  for (i in 1:I) {
    z[i] = append_row(z_raw[i], -sum(z_raw[i]));    
  }
  
  for (i in 1:X2) {
    z_x[i] = append_row(z_x_raw[i], -sum(z_x_raw[i]));    
  }

  for (i in 1:I) 
     for (j in 1:4) 
        t[i, j] = sigma_m[i] * z[i, j];
        
  for (i in 1:X2) 
     for (j in 1:16) 
        t_x[i, j] = sigma_x[i] * z_x[i, j];
     
  // yhat
  
  for (n in 1:N) {
    real ytemp; 
    ytemp = t_0;
    for (i in 1:I) ytemp = ytemp + t[i, main[n, i]]; // 2 sets of main effects
    for (i in 1:X2) ytemp = ytemp + t_x[i, x[n, i]]; // 1 set of interaction effects
    yhat[n] = ytemp;
  }
}

model {
  
  sigma ~ normal(0, 5);
  sigma_m ~ normal(0, sigma_main);
  sigma_x ~ normal(0, 5);
  
  sigma_main ~ normal(0, 5);
  
  t_0 ~ normal(0, 5);

  for (i in 1:I) z_raw[i] ~ std_normal();
  for (i in 1:X2) z_x_raw[i] ~ std_normal();

  y ~ normal(yhat, sigma);

}
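
The sum-to-zero construction and the scaling in the transformed parameters block can be mimicked outside of Stan. This is only a sketch in R with made-up values (the seed, the value of sigma_m, and the name t_main are assumptions for illustration), not a substitute for the model above:

```r
set.seed(1)

z_raw <- rnorm(3)              # three free standard normal values
z <- c(z_raw, -sum(z_raw))     # the fourth is implied by the sum-to-zero constraint

sigma_m <- 2                   # made-up scale for one set of main effects
t_main <- sigma_m * z          # effects on the outcome scale

sum(t_main)                    # zero, up to floating point error
```

Because the fourth value is determined by the other three, only three $z$ parameters are sampled for each set of main effects (and fifteen for the interactions).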

Drawing the wrong conclusion about subgroups: a comparison of Bayes and frequentist methods

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 14 Sep 2021 00:00:00 +0000

In the previous post, I simulated data from a hypothetical RCT that had heterogeneous treatment effects across subgroups defined by three covariates. I presented two Bayesian models, a strongly pooled model and an unpooled version, that could be used to estimate all the subgroup effects in a single model. I compared the estimates to a set of linear regression models that were estimated for each subgroup separately.

My goal in doing these comparisons is to see how often we might draw the wrong conclusion about subgroup effects when we conduct these types of analyses. In a typical frequentist framework, the probability of making a mistake is usually considerably greater than the 5% error rate that we allow ourselves, because conducting multiple tests gives us more chances to make a mistake. By using Bayesian hierarchical models that share information across subgroups and more reasonably measure uncertainty, I wanted to see if we can reduce the chances of drawing the wrong conclusions.
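
To get a rough sense of how quickly the error rate can grow, consider the simplified case of $m$ independent tests, each conducted at the 5% level, when there is no effect anywhere. The independence assumption is mine and does not hold for the overlapping subgroups considered here, so this is only a back-of-the-envelope sketch (the function name fwer is hypothetical):

```r
# probability of at least one false positive across m independent tests
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

round(fwer(c(1, 3, 7)), 3)
## [1] 0.050 0.143 0.302
```

With seven tests (six subgroups plus an overall estimate), the chance of at least one spurious finding under these assumptions is already about 30%.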

Simulation framework

The simulations used here are based on the same general process I used to generate a single data set the last time around. The key difference is that I now want to understand the operating characteristics of the models, and this requires many data sets (and their model fits). Much of the modeling is similar to last time, so I’m primarily showing new code.

This is a pretty computing-intensive exercise. While the models don’t take too long to fit, especially with only 150 observations per data set, fitting 2500 sets of models can take some time. As I do for all the simulations that require repeated Bayesian estimation, I executed all of this on a high-performance computer. I used a framework similar to what I’ve described for conducting power analyses and exploring the operating characteristics of Bayesian models.

Definitions

The definitions of the data generation process are the same as in the previous post, except I’ve made the generation of theta more flexible. Last time, I fixed the coefficients ($\tau$’s) at specific values. Here, the $\tau$’s can vary from iteration to iteration. Even though I am generating data with no treatment effect, I am taking a Bayesian point of view on this - so that the treatment effect parameters will have a distribution that is centered around 0 with very low variance.

library(cmdstanr)
library(simstudy)
library(posterior)
library(data.table)
library(slurmR)

setgrp <- function(a, b, c) {
  
  if (a==0 & b==0 & c==0) return(1)
  if (a==1 & b==0 & c==0) return(2)
  if (a==0 & b==1 & c==0) return(3)
  if (a==0 & b==0 & c==1) return(4)
  if (a==1 & b==1 & c==0) return(5)
  if (a==1 & b==0 & c==1) return(6)
  if (a==0 & b==1 & c==1) return(7)
  if (a==1 & b==1 & c==1) return(8)
  
}

s_define <- function() {
  
  d <- defData(varname = "a", formula = 0.6, dist="binary")
  d <- defData(d, varname = "b", formula = 0.4, dist="binary")
  d <- defData(d, varname = "c", formula = 0.3, dist="binary")
  d <- defData(d, varname = "theta",
    formula = "..tau[1] + ..tau[2]*a  + ..tau[3]*b + ..tau[4]*c +
               ..tau[5]*a*b + ..tau[6]*a*c + ..tau[7]*b*c + ..tau[8]*a*b*c",
    dist = "nonrandom"
  )
  
  drx <- defDataAdd(
    varname = "y", formula = "0 + theta*rx", 
    variance = 16, 
    dist = "normal"
  )

  return(list(d = d, drx = drx))
  
}

Data generation

We are generating the eight values of tau for each iteration from a $N(\mu = 0, \sigma = 0.5)$ distribution before generating theta and the outcome y:

s_generate <- function(n, list_of_defs) {
  
  list2env(list_of_defs, envir = environment())
  
  tau <- rnorm(8, 0, .5)
  
  dd <- genData(n, d)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(drx, dd)
  
  dd[, grp := setgrp(a, b, c), keyby = id]
  
  dd[]
  
}

Looking at a single data set, we can see that theta is close to, but not exactly, 0. In a simulation using a frequentist framework (where the parameters are presumed known), we would typically set it to exactly 0.

set.seed(298372)

defs <- s_define()
s_generate(10, defs)

##     id a b c theta rx     y grp
##  1:  1 0 0 1  0.34  0 -3.45   4
##  2:  2 1 0 1  0.78  1  3.20   6
##  3:  3 0 0 0 -0.28  1  7.29   1
##  4:  4 0 1 0 -0.37  1  2.76   3
##  5:  5 0 0 0 -0.28  0 -0.48   1
##  6:  6 1 0 0 -0.25  0  1.09   2
##  7:  7 0 0 0 -0.28  0 -1.45   1
##  8:  8 0 0 1  0.34  1 -5.78   4
##  9:  9 1 0 0 -0.25  1  2.97   2
## 10: 10 1 0 0 -0.25  0 -1.25   2

Model fitting

The models here are defined precisely as they were in the last post. The code is a bit involved, so I’m not including it - let me know if you’d like to see it. For each data set, I fit a set of subgroup-specific linear regression models (as well as an overall model that ignored the subgroups), in addition to the two Bayesian models described in the previous post. Each replication defines the data, generates a new data set, and estimates the three different models before returning the results.

s_model <- function(dd, mod_pool, mod_nopool) {
  ...
}

s_replicate <- function(x, n, mod_pool, mod_nopool) {
  
  set_cmdstan_path(path = "/.../cmdstan/2.25.0")
  
  defs <- s_define()
  generated_data <- s_generate(n, defs)
  estimates <- s_model(generated_data, mod_pool, mod_nopool)

  estimates[]
}

The computation is split up so that 50 multi-core computing nodes each run 50 replications, for a total of 2500. There’s actually parallelization in parallel, as each of the nodes has multiple processors so the Bayesian models can be estimated with parallel chains:

set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")

model_pool <- cmdstan_model("/.../subs_pool_hpc.stan")
model_nopool <- cmdstan_model("/.../subs_nopool_hpc.stan")

job <- Slurm_lapply(
  X = 1:2500, 
  FUN = s_replicate, 
  n = 150,
  mod_pool = model_pool,
  mod_nopool = model_nopool,
  njobs = 50, 
  mc.cores = 4L,
  job_name = "i_subs",
  tmp_path = "/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "12:00:00", partition = "cpu_short", `mem-per-cpu` = "5G"),
  export = c("s_define", "s_generate", "s_model", "setgrp"),
  overwrite = TRUE
)

job
res <- Slurm_collect(job)

save(res, file = "/.../sub_0.rda")

Results

The figure shows the results from 80 models. Each column is a different subgroup (and the last is the overall treatment effect estimate). The intervals are the 95% credible intervals from the Bayesian models, and the 95% confidence interval from the linear regression model. The intervals are color coded based on whether the interval includes 0 (grey) or not (red). The red intervals are cases where we might incorrectly conclude that there is indeed some sort of effect. There are many more red lines for the linear regression estimates compared to either of the Bayesian models:

For the full set of 2500 replications, about 5% of the intervals from the pooled Bayes model did not include 0, lower than the unpooled model, and far below the approach using individual subgroup regression models:

##    pooled unpooled   lm
## 1:  0.051     0.11 0.37

I started off the last post by motivating this set of simulations with an experience I recently had with journal reviewers who were skeptical of an analysis of a subgroup effect size. I’m not sure that the journal reviewers would buy the approach suggested here, but it seems that pooling estimates across subgroups provides a viable way to guard against making overly strong statements about effect sizes when they are not really justified.

Subgroup analysis using a Bayesian hierarchical model

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 31 Aug 2021 00:00:00 +0000

I’m part of a team that recently submitted the results of a randomized clinical trial for publication in a journal. The overall findings of the study were inconclusive, and we certainly didn’t try to hide that fact in our paper. Of course, the story was a bit more complicated, as the RCT was conducted during various phases of the COVID-19 pandemic; the context in which the therapeutic treatment was provided changed over time. In particular, other new treatments became standard of care along the way, resulting in apparent heterogeneous treatment effects for the therapy we were studying. It appears as if the treatment we were studying might have been effective only in one period when alternative treatments were not available. While we planned to evaluate the treatment effect over time, it was not our primary planned analysis, and the journal objected to the inclusion of these secondary analyses.

Which got me thinking, of course, about subgroup analyses. In the context of a null hypothesis significance testing framework, it is well known that conducting numerous post hoc analyses carries the risk of dramatically inflating the probability of a Type 1 error - concluding there is some sort of effect when in fact there is none. So, if there is no overall effect, and you decide to look at a subgroup of the sample (say patients over 50), you may find that the treatment has an effect in that group. But, if you failed to adjust for multiple tests, then that conclusion may not be warranted. And if that second subgroup analysis was not pre-specified or planned ahead of time, that conclusion may be even more dubious.

If we use a Bayesian approach, we might be able to avoid this problem, and there might be no need to adjust for multiple tests. I have started to explore this a bit using simulated data under different data generation processes and prior distribution assumptions. It might all be a bit too much for a single post, so I am planning on spreading it out a bit.

The data

To get this going, here are the libraries used in this post:

library(simstudy)
library(data.table)
library(ggplot2)
library(cmdstanr)
library(posterior)

In this simulated data set of 150 individuals, there are three binary covariates $A, B, C \in \{0,1\}$ and a treatment indicator $rx \in \{0,1\}$. When we randomize the individuals to arms, we should have pretty good balance across treatment arms, so a comparison of the two treatment arms without adjusting for the covariates should provide a good estimate of the overall treatment effect. However, we might still be interested in looking at specific subgroups defined by $A$, $B$, and $C$, say patients for whom $A=0$ or those where $C=1$. (We could also look at subgroups defined by combinations of these covariates.)

In the data generation process, the treatment effect will be a parameter $\theta$ that will be determined by the levels of the three covariates. In this case, for patients with $A=B=C=0$, there will be no treatment effect. However, for patients with only $A=1$ (i.e., $B=0$ and $C=0$), there will be a small treatment effect of $2$. There will be a larger effect of $4$ for patients with $C=1$ (and $A=0$), and a treatment effect of $5$ for patients with $A=1 \ \& \ C=1$. For patients with $B=1$ (alone) there is no treatment effect.

d <- defData(varname = "a", formula = 0.6, dist="binary")
d <- defData(d, varname = "b", formula = 0.3, dist="binary")
d <- defData(d, varname = "c", formula = 0.4, dist="binary")
d <- defData(d, varname = "theta", formula = "0 + 2*a  + 4*c - 1*a*c", dist = "nonrandom")

drx <- defDataAdd(varname = "y", formula = "0 + theta*rx", variance = 16, dist = "normal")

In the data generation process, I am assigning eight group identifiers based on the covariates that will be relevant for the Bayes model (described further below).

setgrp <- function(a, b, c) {
  
  if (a==0 & b==0 & c==0) return(1)
  if (a==1 & b==0 & c==0) return(2)
  if (a==0 & b==1 & c==0) return(3)
  if (a==0 & b==0 & c==1) return(4)
  if (a==1 & b==1 & c==0) return(5)
  if (a==1 & b==0 & c==1) return(6)
  if (a==0 & b==1 & c==1) return(7)
  if (a==1 & b==1 & c==1) return(8)
  
}
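
Since setgrp() handles one set of covariate values at a time, it has to be called row by row (hence the keyby = id in the data generation code). A vectorized alternative could be based on a lookup table. This is just a sketch, and the name setgrp_vec is hypothetical:

```r
# hypothetical vectorized version of setgrp(): look up the group identifier
# from the pattern formed by pasting the three binary covariates together
setgrp_vec <- function(a, b, c) {
  lookup <- c("000" = 1, "100" = 2, "010" = 3, "001" = 4,
              "110" = 5, "101" = 6, "011" = 7, "111" = 8)
  unname(lookup[paste0(a, b, c)])
}

setgrp_vec(a = c(0, 1, 1), b = c(0, 1, 1), c = c(0, 0, 1))
```

The group numbering is the same as in setgrp(), so the three patterns in the usage line map to groups 1, 5, and 8.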

To generate the data:

set.seed(3871598)

dd <- genData(150, d)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(drx, dd)

dd[, grp:= setgrp(a, b, c), keyby = id]

dd

##       id a b c theta rx     y grp
##   1:   1 1 0 1     5  0  0.28   6
##   2:   2 1 1 0     2  0  3.14   5
##   3:   3 0 0 0     0  0  0.73   1
##   4:   4 1 1 0     2  1  0.78   5
##   5:   5 1 1 1     5  0 -5.94   8
##  ---                             
## 146: 146 1 1 0     2  1  4.68   5
## 147: 147 0 0 1     4  0  3.10   4
## 148: 148 1 0 0     2  0  5.88   2
## 149: 149 1 1 1     5  1  4.22   8
## 150: 150 0 1 1     4  1  4.76   7

Here is a plot of the average outcome $Y$ for each of the subgroups with and without treatment. The treatment effect for a particular subgroup is the difference of the $Y$ values for each segment. Now, it appears that there is a treatment effect for the two subgroups $B=0$ and $B=1$, yet $B$ was not supposed to have any impact on the overall effect size, which is $0$. Just in case this is at all confusing, this is due to the fact that these patients have characteristics $A$ and $C$, which do influence the effect size. Indeed, if you compare the subgroups $B=0$ and $B=1$, it appears that the effect size could be the same, which is consistent with the fact that $B$ has no impact on effect size. This is definitely not the case when comparing $C=0$ and $C=1$. I point this out, because when I report the estimated effect sizes from the models, I will be reporting the subgroup-specific effects shown here, rather than parameter estimates of $\theta$.

Subgroup analysis using simple linear regression

Before jumping into the Bayes models, I am fitting seven simple linear regression models to estimate seven treatment effects, one for each of the six subgroups defined by the covariates $A$, $B$, and $C$, plus an overall estimate.

df <- data.frame(dd)

est_lm <- function(dx) {
  fit <- lm(y ~ rx, data = dx)
  c(coef(fit)["rx"], confint(fit)[2,])
}

est_cis <- function(sub_grp) {
  mean_pred <- lapply(split(df[,c(sub_grp, "y", "rx")], df[, c(sub_grp)]), est_lm) 
  do.call(rbind, mean_pred)
}

ci_subgroups <- do.call(rbind, lapply(c("a","b","c"), est_cis))
ci_overall <- est_lm(dd)

cis <- data.table(
  subgroup = c("a = 0", "a = 1", "b = 0", "b = 1", "c = 0", "c = 1", "overall"),
  model = 3,
  rbind(ci_subgroups, ci_overall)
)

setnames(cis, c("rx","2.5 %", "97.5 %"), c("p.50","p.025", "p.975"))

Inspecting the point estimates (denoted as p.50) for the treatment effect for each subgroup (and the overall group), we see that they match pretty closely with the effect sizes depicted in the figure of the means by subgroup above. I’ll compare these estimates to the Bayes estimates in a bit.

cis

##    subgroup model p.50 p.025 p.975
## 1:    a = 0     3  3.3  1.30   5.3
## 2:    a = 1     3  3.0  1.31   4.6
## 3:    b = 0     3  3.2  1.57   4.9
## 4:    b = 1     3  2.6  0.61   4.6
## 5:    c = 0     3  2.5  0.90   4.1
## 6:    c = 1     3  4.4  2.19   6.5
## 7:  overall     3  3.1  1.79   4.4

Two possible Bayesian models

I am including two Bayesian models here, one that I am calling a pooled model and the other an unpooled model (though the second is not absolutely unpooled, just relatively unpooled). In both cases, the outcome model is described as

\[ y_{ij} \sim N\left(\alpha_j +\theta_{j}x_i, \ \sigma_0 \right) \]

where $y_{ij}$ is the outcome measure for individual $i$ who has covariate/subgroup pattern $j$. (These subgroup patterns were defined above in R code. For example group 1 is all cases where $a=b=c=0$ and group 5 is $a=b=1, \ c=0$.) $x_i$ is a treatment indicator, $x \in \{0,1\}$. $\alpha_j$ is the intercept for covariate pattern $j$ (representing the mean outcome for all patients with pattern $j$ randomized to control). $\theta_j$ represents the treatment effect for patients with pattern $j$. $\sigma_0$ is the within treatment arm/covariate pattern standard deviation, and is assumed to be constant across arms and patterns.

The treatment effect parameter $\theta_j$ can be further parameterized as a function of a set of $\tau\text{'s}.$ (This parameterization was inspired by this paper by Jones et al.) The treatment effect is a deterministic function of the covariates $a$, $b$, and $c$ as well as their interactions:

\[\begin{aligned} \theta_1 &= \tau_0 \\ \theta_2 &= \tau_0 + \tau_a \\ \theta_3 &= \tau_0 + \tau_b \\ \theta_4 &= \tau_0 + \tau_c \\ \theta_5 &= \tau_0 + \tau_a + \tau_b + \tau_{ab} \\ \theta_6 &= \tau_0 + \tau_a + \tau_c + \tau_{ac} \\ \theta_7 &= \tau_0 + \tau_b + \tau_c + \tau_{bc} \\ \theta_8 &= \tau_0 + \tau_a + \tau_b + \tau_c + \tau_{ab} + \tau_{ac} + \tau_{bc} + \tau_{abc} \end{aligned}\]
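
The same mapping can be written as a matrix product, which may make the structure easier to see. In this sketch the rows of X are the eight covariate patterns (in the order of the group identifiers defined above), and the values of tau are the ones implied by the data generation formula for theta in this post. The object names are my own:

```r
# covariate patterns for groups 1 through 8
patterns <- data.frame(
  a = c(0, 1, 0, 0, 1, 1, 0, 1),
  b = c(0, 0, 1, 0, 1, 0, 1, 1),
  c = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# one column per tau: intercept, main effects, 2-way and 3-way interactions
X <- with(patterns,
  cbind(t0 = 1, a, b, c, ab = a * b, ac = a * c, bc = b * c, abc = a * b * c))

# tau_0, tau_a, tau_b, tau_c, tau_ab, tau_ac, tau_bc, tau_abc
tau <- c(0, 2, 0, 4, 0, -1, 0, 0)

theta <- as.vector(X %*% tau)
theta
## [1] 0 2 0 4 2 5 4 5
```

These values agree with the theta column of the simulated data set shown earlier (for example, group 6 has a treatment effect of 5).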

So far, the parameterization for the pooled and unpooled models is the same. Now we see how they diverge:

Pooled model

The idea behind the pooled model is that the main effects of $a$, $b$, $c$ ($\tau_a$, $\tau_b$, and $\tau_c$, respectively) are drawn from the same distribution centered around $\delta_m$ with a standard deviation $\sigma_m$, both of which will be estimated from the data. The estimated effect of one covariate will, to some extent, inform the estimated effect of the others. Of course, as the number of observations increases, the strength of pooling will be reduced. The three two-way interaction effects ($\tau_{ab}$, $\tau_{ac}$ and $\tau_{bc}$) are independent of the main effects, but they also share a common distribution to be estimated from the data. (In this case we have only a single three-way interaction term $\tau_{abc}$, but if we had 4 covariates rather than 3, we would have 4 three-way interaction terms, which could all share the same prior distribution. At some point, it might be reasonable to exclude higher order interactions, such as four- or five-way interactions.)

\[\begin{aligned} \tau_a, \tau_b, \tau_c &\sim N(\mu = \delta_m, \sigma = \sigma_m) \\ \tau_{ab}, \tau_{ac}, \tau_{bc} &\sim N(\mu = \delta_x, \sigma = \sigma_x) \\ \end{aligned}\]

With the exception of $\alpha_j$ and $\sigma_0$, the prior distributions for the model parameters are quite conservative/pessimistic, centered pretty closely around 0. (It would certainly be wise to explore how these prior assumptions impact the findings, but since this is just an illustrative example, I won’t dwell too much on these particular assumptions).

\[\begin{aligned} \alpha_j &\sim N(\mu = 0, \sigma = 10), \ \ \ j \in \{1,\dots,8\} \\ \tau_0 &\sim N(\mu=0, \sigma = 2) \\ \tau_{abc} &\sim N(\mu = 0, \sigma = 2) \\ \delta_m &\sim N(\mu = 0, \sigma = 2) \\ \delta_x &\sim N(\mu = 0, \sigma = 2) \\ \sigma_0 &\sim N(\mu = 0, \sigma = 10), \ \ \ \sigma_0 \ge 0 \\ \sigma_m &\sim N(\mu = 0, \sigma = 1), \ \ \ \ \sigma_m \ge 0 \\ \sigma_x &\sim N(\mu = 0, \sigma = 1), \ \ \ \ \sigma_x \ge 0 \\ \end{aligned}\]
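
One way to get a feel for this hierarchical structure is to simulate a single draw from the prior for the main effects. This is only an illustration of the prior (no data are involved), and the seed and object names are arbitrary choices of mine:

```r
set.seed(123)

delta_m <- rnorm(1, mean = 0, sd = 2)        # common mean of the main effects
sigma_m <- abs(rnorm(1, mean = 0, sd = 1))   # half-normal: sd of the main effects

# the three main effects are drawn from the same distribution
tau_main <- rnorm(3, mean = delta_m, sd = sigma_m)
names(tau_main) <- c("tau_a", "tau_b", "tau_c")

tau_main
```

When the draw of sigma_m is small, the three main effects are nearly identical to delta_m, which is the sense in which the effects are pooled.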

Unpooled model

In the unpooled model, the $\tau$’s (and $\alpha$’s) are not jointly parameterized with a common mean, and the prior distributions are more diffuse. The only variance estimation is for $\sigma_0$:

\[\begin{aligned} \alpha_j &\sim N(\mu=0, \sigma = 10), \ \ \ j \in \{1,\dots,8\} \\ \tau_0 &\sim N(\mu=0, \sigma = 10) \\ \tau_q &\sim N(\mu=0, \sigma = 10), \ \ \ q \in \{a, b, c\} \\ \tau_{qq} &\sim N(\mu=0, \sigma = 10), \ \ \ qq \in \{ab, ac, bc\} \\ \tau_{abc} &\sim N(\mu = 0, \sigma = 10) \\ \sigma_0 &\sim N(\mu = 0, \sigma = 10), \ \ \ \sigma_0 \ge 0 \\ \end{aligned}\]

Model estimation

I’m using cmdstanr to estimate the models in Stan. (The Stan code is available if anyone is interested, or you can try to write it yourself.) For each model, I am sampling in 4 chains of length 2500 following 500 warm-up steps. I’ll skip the required diagnostics here (e.g. trace plots) for brevity, but I did check everything, and things looked OK.

model_pool <- cmdstan_model("code/pooled_subgroup.stan")
model_unpool <- cmdstan_model("code/unpooled_subgroup.stan")

fit_pool <- model_pool$sample(
    data = list(N = dd[,.N], rx = dd[,rx], sub_grp = dd[,grp], y=dd[,y]),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500,
    adapt_delta = 0.99,
    max_treedepth = 20,
    seed = 898171
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 1.4 seconds.
## Chain 2 finished in 1.4 seconds.
## Chain 3 finished in 1.5 seconds.
## Chain 4 finished in 1.6 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 1.5 seconds.
## Total execution time: 1.8 seconds.

fit_unpool <- model_unpool$sample(
    data = list(N = dd[,.N], rx = dd[,rx], sub_grp = dd[,grp], y=dd[,y], prior_sigma=10),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500,
    adapt_delta = 0.99,
    max_treedepth = 20,
    seed = 18717
)

## Running MCMC with 4 parallel chains...
## 
## Chain 3 finished in 1.4 seconds.
## Chain 2 finished in 1.5 seconds.
## Chain 4 finished in 1.7 seconds.
## Chain 1 finished in 2.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 1.7 seconds.
## Total execution time: 2.2 seconds.

Extracting posterior probabilities

In this case, I am not directly interested in the effect parameters $\theta_j$, but rather in the estimated treatment effects for the six subgroups defined by $a=0$, $a=1$, $b=0$, $b=1$, $c=0$, and $c=1$. (These groups are not distinct from one another, as each individual has measures for each of $a$, $b$, and $c$.) I will step through the process of how I get these estimates, and then will plot a summary of the estimates.

First, I extract the key parameter estimates into an rvars data structure (I discussed this data structure recently in a couple of posts - here and here). Although the object r below looks like a list of 3 items with just a handful of values, there is actually an entire data set supporting each value that contains 10,000 samples from the posterior distribution. What we are seeing are the mean and standard deviation of those distributions.

r <- as_draws_rvars(fit_pool$draws(variables = c("alpha","theta","sigma")))
r

## # A draws_rvars: 2500 iterations, 4 chains, and 3 variables
## $alpha: rvar<2500,4>[8] mean ± sd:
## [1] -2.42 ± 0.89   0.49 ± 0.79  -1.61 ± 1.46  -0.88 ± 1.09   0.93 ± 1.30 
## [6]  1.06 ± 0.89   2.64 ± 1.59  -0.18 ± 1.19 
## 
## $theta: rvar<2500,4>[8] mean ± sd:
## [1] 2.1 ± 1.03  2.8 ± 0.89  2.7 ± 1.10  3.6 ± 1.11  2.6 ± 1.31  4.2 ± 1.17 
## [7] 4.0 ± 1.31  3.6 ± 1.72 
## 
## $sigma: rvar<2500,4>[1] mean ± sd:
## [1] 3.8 ± 0.23

A cool feature of the rvar data type (which is part of the package posterior) is that rvars can be stored in a data.frame, and easily manipulated. Here I am matching the $\theta_j$ to each individual depending on their covariate pattern $j$. The plan is to generate simulated data for each individual based on the estimated means and standard deviations.

df <- as.data.frame(dd)
    
df$theta_hat <- r$theta[dd$grp]
df$alpha_hat <- r$alpha[dd$grp]
df$mu_hat <- with(df, alpha_hat + rx* theta_hat)

Here are the first ten rows (out of the 150 individual records):

head(df, 10)

##    id a b c theta rx     y grp  theta_hat    alpha_hat       mu_hat
## 1   1 1 0 1     5  0  0.28   6 4.2 ± 1.17  1.06 ± 0.89  1.06 ± 0.89
## 2   2 1 1 0     2  0  3.14   5 2.6 ± 1.31  0.93 ± 1.30  0.93 ± 1.30
## 3   3 0 0 0     0  0  0.73   1 2.1 ± 1.03 -2.42 ± 0.89 -2.42 ± 0.89
## 4   4 1 1 0     2  1  0.78   5 2.6 ± 1.31  0.93 ± 1.30  3.52 ± 0.96
## 5   5 1 1 1     5  0 -5.94   8 3.6 ± 1.72 -0.18 ± 1.19 -0.18 ± 1.19
## 6   6 1 1 1     5  0 -1.45   8 3.6 ± 1.72 -0.18 ± 1.19 -0.18 ± 1.19
## 7   7 1 1 0     2  0  5.47   5 2.6 ± 1.31  0.93 ± 1.30  0.93 ± 1.30
## 8   8 1 1 0     2  1 -2.33   5 2.6 ± 1.31  0.93 ± 1.30  3.52 ± 0.96
## 9   9 0 0 1     4  1  0.84   4 3.6 ± 1.11 -0.88 ± 1.09  2.69 ± 1.06
## 10 10 1 0 0     2  1  7.05   2 2.8 ± 0.89  0.49 ± 0.79  3.26 ± 0.78

We can add a column of predicted “values” to the data frame.

df$pred <- rvar_rng(rnorm, nrow(df), df$mu_hat, r$sigma)
head(df[,c("id", "grp", "mu_hat", "pred")], 10)

##    id grp       mu_hat        pred
## 1   1   6  1.06 ± 0.89  1.01 ± 3.9
## 2   2   5  0.93 ± 1.30  0.95 ± 4.0
## 3   3   1 -2.42 ± 0.89 -2.41 ± 3.9
## 4   4   5  3.52 ± 0.96  3.53 ± 3.9
## 5   5   8 -0.18 ± 1.19 -0.19 ± 4.0
## 6   6   8 -0.18 ± 1.19 -0.18 ± 4.0
## 7   7   5  0.93 ± 1.30  0.90 ± 4.1
## 8   8   5  3.52 ± 0.96  3.49 ± 3.9
## 9   9   4  2.69 ± 1.06  2.67 ± 3.9
## 10 10   2  3.26 ± 0.78  3.30 ± 3.9

But note that we don’t just have a single value for each of the 150 individuals, but 10,000 samples for each of the 150 individuals (for a total of 1.5 million predicted values). Here is a little bit of evidence that this is the case, as you can see that this is an rvar of dimension $2500 \text{ samples} \times 4 \text{ chains}$, or $10,000$ predicted values:

df[9, "pred"]

## rvar<2500,4>[1] mean ± sd:
## [1] 2.7 ± 3.9

Finally, we are ready to get estimates of the within-subgroup effect sizes. I’ve written a little function to help out here. For each covariate $a$, $b$, and $c$, the function splits the data set into four subsets. So, for covariate $a$ we have $[a=0 \ \& \ rx=0]$, $[a=0 \ \& \ rx=1]$, $[a=1 \ \& \ rx=0]$, and $[a=1 \ \& \ rx=1]$. For each of those subsets, we get a distribution of mean predicted values by averaging across the distribution of individual predicted values. The effect within a subgroup is the difference between the treated and untreated means. So, the variable effects contains the distribution of effects for the six subgroups ($a=0$, $a=1$, $b=0$, $b=1$, $c=0$, and $c=1$):

est_effects <- function(sub_grp) {
  mean_pred <- lapply(split(df[,c(sub_grp, "rx","pred")], df[, c(sub_grp, "rx")]), 
    function(x) rvar_mean(x$pred)
  )
  c(mean_pred[["0.1"]] - mean_pred[["0.0"]], mean_pred[["1.1"]] - mean_pred[["1.0"]])
}

effects <- do.call(c, lapply(c("a","b","c"), est_effects))
effects

## rvar<2500,4>[6] mean ± sd:
## [1] 2.5 ± 1.3  3.2 ± 1.1  2.8 ± 1.0  3.1 ± 1.5  2.7 ± 1.1  3.5 ± 1.3
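To make the averaging step concrete, here is a minimal base-R sketch of the same idea as est_effects, assuming the predicted draws sit in a plain matrix (draws in rows, individuals in columns) rather than in an rvar. The toy data, the matrix pred_draws, and the helper effect_draws are all made up for illustration and are not part of the analysis above.

```r
set.seed(101)

n_draws <- 1000   # hypothetical number of posterior draws
n_ind   <- 40     # hypothetical number of individuals

# toy covariate and treatment indicators (10 individuals per a-by-rx cell)
a  <- rep(c(0, 1), each = n_ind / 2)
rx <- rep(c(0, 1), times = n_ind / 2)

# toy predictive draws: treatment effect of 2 when a = 0 and 4 when a = 1
mu <- 1 + rx * ifelse(a == 1, 4, 2)
pred_draws <- sapply(mu, function(m) rnorm(n_draws, mean = m, sd = 1))
# pred_draws is an n_draws x n_ind matrix

# average across individuals *within each draw*, then difference the two arms
effect_draws <- function(level) {
  trt <- rowMeans(pred_draws[, a == level & rx == 1, drop = FALSE])
  ctl <- rowMeans(pred_draws[, a == level & rx == 0, drop = FALSE])
  trt - ctl
}

effects_a <- cbind(a0 = effect_draws(0), a1 = effect_draws(1))

# each column is a full distribution of a subgroup effect
round(apply(effects_a, 2, quantile, probs = c(0.025, 0.5, 0.975)), 2)
```

The point of the sketch is only that each subgroup effect is itself a vector of draws, which is why quantiles can be read off directly afterwards.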

We can also get the distribution of the overall (marginal) treatment effect by sub-setting by $rx$ only. The last step is to create a summary table for the pooled model. Remember, the effects table is really a table of distributions, and we can extract summary statistics from those distributions for reporting or plotting. Here, we are extracting the $2.5\%$, $50\%$, and $97.5\%$ quantiles to show the median and a $95\%$ interval.

mean_pred <- lapply(split(df[,c("rx","pred")], df[, "rx"]), function(x) rvar_mean(x$pred))
overall <- mean_pred[["1"]] - mean_pred[["0"]]
    
effects <- c(effects, overall)
    
sumstats_pooled <- data.table( 
  subgroup = c("a = 0", "a = 1", "b = 0", "b = 1", "c = 0", "c = 1", "overall"),
  model = 1,
  p.025 = quantile(effects, 0.025),
  p.50 = quantile(effects, 0.50),
  p.975 = quantile(effects, 0.975)
)

Comparing model estimates

Now to take a look at the distribution of effect sizes based on the different models. (I didn’t show it, but I also created a table called sumstats_unpooled using the same process I just walked you through.) Below is a plot of the effect estimates for each of the subgroups as well as the overall (marginal) effect estimates. The lm plot shows the point estimate with a $95\%$ confidence interval. The other two plots show the medians of the posterior distributions for the subgroup effects along with $95\%$ intervals.

Two important things to see in the plot, which will be very important when I write next time about “Type 1” errors, are the relative length of the intervals and the apparent shrinkage of some of the estimates. In every case, the interval for the standard linear regression model is shorter than the intervals for the two Bayesian models, reflecting less uncertainty. The pooled model also appears to have slightly less uncertainty compared to the unpooled model.

The second point is that the point estimates for the linear regression model and the median estimates for the unpooled model are quite close, while the pooled medians appear to be pulled away. The direction of the shrinkage is not coherent, because there is a mixture of main effects and interaction effects (the $\tau$’s) that are shifting things around. It appears that the effects of the subgroups $a=0,$ $b=0,$ and $c=0$ are being pulled towards each other, and the same appears to be true for the group defined by $a=1,$ $b=1,$ and $c=1.$ This seems right as we know that the underlying parameters $\tau_a$, $\tau_b$, and $\tau_c$ are shrinking towards each other.

If we were using the pooled model to draw conclusions, I would say that it appears that subgroups defined by $c$ seem to have heterogeneous treatment effects, though I would probably want to have more data to confirm, as the intervals are still quite wide. If we use the results from the linear regression model, we might want to proceed with caution, because the intervals are likely too narrow: we have not adjusted for multiple testing. We will see this next time when I look at a case where there are no underlying treatment effects in the data generation process.

Reference:

Jones, Hayley E., David I. Ohlssen, Beat Neuenschwander, Amy Racine, and Michael Branson. “Bayesian models for subgroup analysis in clinical trials.” Clinical Trials 8, no. 2 (2011): 129-143.

Posterior probability checking with rvars: a quick follow-up

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 17 Aug 2021 00:00:00 +0000

This is a relatively brief addendum to last week’s post, where I described how the rvar datatype implemented in the R package posterior makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and estimated parameters for a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), equally unsurprisingly, the model wasn’t a great fit.

Immediately after putting the post up, I decided to make sure the correct model with the quadratic term would not result in an extreme p-value (i.e. the p-value would fall between 0.02 and 0.98). And, again not surprisingly, the model was a good fit. I’m sharing all this here, because I got some advice on how to work with the rvar data a little more efficiently, and wanted to make sure those who are interested could see that. And while I was at it, I decided to investigate the distribution of Bayesian p-values under the condition that the model and data generating process are the same (i.e. the model is correct).

Just as a reminder, here is the data generation process:

\[y \sim N(\mu = 2 + 6*x - 0.3x^2, \ \sigma^2 = 4)\]

Here are the necessary libraries:

library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)

And here is the data generation:

b_quad <- -0.3

ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)

set.seed(72612)
dd <- genData(100, ddef)

The Stan model is slightly modified to include the additional term; $\gamma$ is the quadratic parameter:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data{
  vector[N] x2;
  
  for (i in 1:N) {
    x2[i] = x[i]*x[i];
  };
  
}

parameters {
  real alpha;
  real beta;
  real gamma;
  real<lower=0> sigma;
}

model {
  y ~ normal(alpha + beta*x + gamma*x2, sigma);
}

mod <- cmdstan_model("code/quadratic_regression.stan")

fit <- mod$sample(
  data = list(N = nrow(dd), x = dd$x, y = dd$y),
  seed = 72651,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500
)

## Running MCMC with 4 parallel chains...
## 
## Chain 3 finished in 0.5 seconds.
## Chain 1 finished in 0.5 seconds.
## Chain 2 finished in 0.5 seconds.
## Chain 4 finished in 0.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.5 seconds.
## Total execution time: 0.6 seconds.

As before, I am plotting the observed (actual data) along with the 80% intervals of predicted values at each level of $x$. The observed data appear to be randomly scattered within the intervals with no apparent pattern:

post_rvars <- as_draws_rvars(fit$draws())

mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2))
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)

df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90))))
df.80[, extreme := !(y >= V1 & y <= V2)]

The code to estimate the p-value is slightly modified from last time. The important difference is that the lists of rvars (bin_prop_y and bin_prop_pred) are converted directly into vectors of rvars using the do.call function:

df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)

bin_prop_y <- lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu))
rv_y <- do.call(c, bin_prop_y)
T_y <- rvar_var(rv_y)

bin_prop_pred <- lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x$mu))
rv_pred <- do.call(c, bin_prop_pred)
T_pred <- rvar_var(rv_pred)

mean(T_pred > T_y)

## [1] 0.583

In this one case, the p-value is 0.58, suggesting the model is a good fit. But, could this have been a fluke? Looking below at the density plot of p-values based on 10,000 simulated data sets suggests not; indeed, $P(0.02 < \text{p-value} < 0.98) = 99.8\%.$ (If you are interested in the code that estimated the density of p-values, I can post it as well.)
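In the meantime, here is a rough sketch of how such a simulation could be set up without Stan. It is not the code behind the density plot. To keep it self-contained in base R, it swaps MCMC for exact posterior draws from a normal linear model under the prior $p(\beta, \sigma^2) \propto 1/\sigma^2$ (an assumption that differs slightly from the flat priors implied by the Stan model above), and it uses far fewer data sets and draws. The function bayes_p and its arguments are hypothetical names.

```r
set.seed(707)

bayes_p <- function(n = 100, n_draws = 500) {
  # simulate one data set from the quadratic data generating process
  x <- runif(n, 0, 10)
  y <- rnorm(n, 2 + 6 * x - 0.3 * x^2, 2)
  X <- cbind(1, x, x^2)
  bin <- cut(x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)

  # exact posterior draws (assumed prior: p(beta, sigma^2) proportional to 1/sigma^2)
  XtX_inv <- solve(crossprod(X))
  b_hat <- drop(XtX_inv %*% crossprod(X, y))
  s2 <- sum((y - X %*% b_hat)^2) / (n - 3)
  sigma <- sqrt((n - 3) * s2 / rchisq(n_draws, n - 3))     # one sigma per draw
  z <- matrix(rnorm(n_draws * 3), n_draws, 3) %*% chol(XtX_inv)
  beta <- sweep(z * sigma, 2, b_hat, "+")                  # n_draws x 3

  # means and replicated outcomes, one row per draw and one column per individual
  mu <- beta %*% t(X)
  pred <- mu + matrix(rnorm(n_draws * n), n_draws, n) * sigma
  y_mat <- matrix(y, n_draws, n, byrow = TRUE)

  # T(): per draw, the variance across bins of the proportion falling below mu
  T_stat <- function(below) {
    apply(below, 1, function(r) var(tapply(r, bin, mean)))
  }

  mean(T_stat(pred < mu) > T_stat(y_mat < mu))
}

# 50 simulated data sets here, versus 10,000 for the plot described above
p_values <- replicate(50, bayes_p())
summary(p_values)
mean(p_values > 0.02 & p_values < 0.98)
```

Increasing the number of replications (and draws) would give a smoother picture of the distribution, at the cost of run time.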

Fitting your model is only the beginning: Bayesian posterior probability checks with rvars

keith.goldfeld@nyumc.org (Keith Goldfeld) — Mon, 09 Aug 2021 00:00:00 +0000

Say we’ve collected data and estimated parameters of a model that give structure to the data. An important question to ask is whether the model is a reasonable approximation of the true underlying data generating process. If we did a good job, we should be able to turn around and generate data from the model itself that looks similar to the data we started with. And if we didn’t do such a great job, the newly generated data will diverge from the original.

If we used a Bayesian approach to estimation, all the information we have about the parameters from our estimated model is contained in the data that have been sampled by the MCMC process. For example, if we are estimating a simple normal regression model with an intercept parameter $\alpha$, a slope parameter $\beta$, and a standard deviation parameter $\sigma$, and we collected 10,000 samples from a posterior distribution, then we will have a multivariate table of possible values of $\alpha$, $\beta$ and $\sigma$. To answer our question regarding model adequacy, we only need to extract the information contained in all that data.

I’ve been casting about for ways to do this extraction efficiently in R, so I posted an inquiry on the Stan Forums to get advice. I got a suggestion to look into the random variable datatype (rvar) recently implemented in the package posterior. Not at all familiar with this, I started off by reading through the vignette, and then looked at this Kerman & Gelman paper.

To get a better handle on the ideas and tools, I decided to simulate some data, fit some models, and investigate what posterior probability checks might look like using rvars. I’m sharing some of the code with you here to give a bit of the flavor of what can be done. A little advance warning: I am providing more output of the data than usual, because I think it is easier to grasp what is going on if you can see the data in the various stages of transformation.

Before I get started, here is the requisite list of the packages needed to run the code:

library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
library(abind)

Simple linear model

I am first generating data from a simple linear regression model where the outcome $y$ is a function of $x$, and $\alpha = 2$, $\beta=6$, and $\sigma = 2$:

\[y \sim N(\mu = 2 + 6*x, \ \sigma^2 = 4)\]

Data generation

To get things going, I define the relationship between $x$ and $y$ and generate the data using simstudy, and then take a look at the data:

b_quad <- 0

ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)

set.seed(2762)
dd <- genData(100, ddef)

Model fitting

I am using cmdstan and cmdstanr to estimate the model. Here is the Stan code:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  y ~ normal(alpha + beta*x, sigma);
}

Next, I compile the Stan code, and sample from the posterior. The sampling will be done in four parallel chains of 2,500 (after the warm-up of 500 samples), which will give me a total sample of 10,000 for each parameter. All of the samples are stored in the cmdstan object fit.

mod <- cmdstan_model("code/linear_regression.stan")

fit <- mod$sample(
  data = list(N = nrow(dd), x = dd$x, y = dd$y),
  seed = 93736,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.2 seconds.

Extracting results

Typically, I would extract the data using the draws method of the cmdstanr object. By default, the draws method returns an array, which is essentially (though not exactly) a multi-dimensional matrix. In this case there are multiple matrices, one for each parameter. The display of each parameter shows the first five rows of the four chains.

(post_array <- fit$draws())

## # A draws_array: 2500 iterations, 4 chains, and 4 variables
## , , variable = lp__
## 
##          chain
## iteration    1    2    3    4
##         1 -128 -129 -127 -128
##         2 -128 -128 -127 -128
##         3 -128 -129 -130 -128
##         4 -128 -128 -127 -129
##         5 -128 -127 -127 -128
## 
## , , variable = alpha
## 
##          chain
## iteration   1   2   3   4
##         1 2.1 1.5 2.0 2.4
##         2 1.6 1.7 1.8 2.3
##         3 1.7 2.2 1.7 2.4
##         4 1.5 2.5 1.7 1.3
##         5 2.0 2.1 1.6 2.2
## 
## , , variable = beta
## 
##          chain
## iteration   1   2   3   4
##         1 6.1 6.2 6.0 6.0
##         2 6.0 6.1 6.1 6.0
##         3 6.1 5.9 6.1 5.9
##         4 6.1 6.0 6.1 6.1
##         5 6.0 6.0 6.1 6.0
## 
## , , variable = sigma
## 
##          chain
## iteration   1   2   3   4
##         1 2.0 2.4 2.0 2.1
##         2 2.0 2.2 2.0 2.4
##         3 2.3 2.3 1.8 2.3
##         4 2.2 2.2 2.1 2.1
##         5 2.3 2.3 2.3 2.5
## 
## # ... with 2495 more iterations

The package bayesplot uses this array to generate a range of different plots, including the important diagnostic trace plot:

mcmc_trace(post_array, pars = c("alpha", "beta", "sigma"), facet_args = list(nrow = 3))

Random variable datatype

Instead of extracting the array data, it is possible to convert the array into a random variable datatype, or rvar. It is probably easiest to explain what this is by looking more closely at it.

(post_rvars <- as_draws_rvars(fit$draws()))

## # A draws_rvars: 2500 iterations, 4 chains, and 4 variables
## $lp__: rvar<2500,4>[1] mean ± sd:
## [1] -128 ± 1.2 
## 
## $alpha: rvar<2500,4>[1] mean ± sd:
## [1] 1.9 ± 0.41 
## 
## $beta: rvar<2500,4>[1] mean ± sd:
## [1] 6 ± 0.072 
## 
## $sigma: rvar<2500,4>[1] mean ± sd:
## [1] 2.2 ± 0.16

You can see that the post_rvars object is essentially a list of 4 items: lp__ (log probability), alpha, beta, and sigma. But what exactly are those items, “1.9 ± 0.41” for alpha, “6 ± 0.072” for beta, and “2.2 ± 0.16” for sigma? Well, the rvar is really a shorthand way of representing the detailed data that is in the underlying array, and the text displayed is merely the mean and standard deviation of the underlying data ($\mu \pm \sigma$). We can peek under the hood a bit by using the function draws_of, and confirm the mean and standard deviation of the samples:

beta_samples <- draws_of(post_rvars$beta)
data.table(beta_samples)

##          V1
##     1: 6.07
##     2: 6.04
##     3: 6.13
##     4: 6.14
##     5: 5.98
##    ---     
##  9996: 6.00
##  9997: 6.07
##  9998: 6.03
##  9999: 6.13
## 10000: 6.14

c(mean(beta_samples), sd(beta_samples))

## [1] 6.0350 0.0715

Why all the fuss?

The whole point of the rvar datatype is that it makes it much easier to do things like estimate the distributions of functions of the parameters and generate predicted values of new observations, both things we need for posterior probability checking. Of course, there are other ways to do all of this (as is always the case in R), but rvars seem to eliminate a lot of the manipulation that might be necessary if we chose to work directly with the data arrays.

In this next step, I am generating the distribution of means $\mu$ for each of the 100 individuals in the data set:

\[\mu_i = \alpha + \beta x_i\]

I want to do this, because ultimately, I want to generate predicted values for each individual, which come from $N(\mu_i, \sigma)$. And I am not going to generate just a single predicted value for each individual, but rather 10,000 predicted values for each individual. So, now we will have the distribution of predicted values for each individual, which is quite powerful. And importantly, the distributions of these predicted values will incorporate the uncertainty of each $\mu_i$ and $\sigma$. With the rvar datatype, all of this can be accomplished with just a few commands - no manipulation necessary.

All rvar equations need to be specified using rvar objects. We need the product of $x_i$ and $\beta$ to get $\mu_i$, but $x_i$ is observed data, not a random variable. No problem - we can convert the vector $x$ into a special kind of constant rvar that does not have a standard deviation. Once this is done, we can generate the $\mu_i$’s:

x_rvar <- as_rvar(dd$x)
x_rvar

## rvar<1>[100] mean ± sd:
##   [1]  6.554 ± NA   0.281 ± NA   2.115 ± NA   9.889 ± NA   8.715 ± NA 
##   [6]  3.448 ± NA   5.358 ± NA   1.069 ± NA   7.908 ± NA   0.445 ± NA 
##  [11]  4.574 ± NA   1.581 ± NA   8.915 ± NA   6.063 ± NA   6.278 ± NA 
##  [16]  7.146 ± NA   8.344 ± NA   6.385 ± NA   9.429 ± NA   1.391 ± NA 
##  [21]  3.542 ± NA   6.473 ± NA   4.689 ± NA   2.016 ± NA   3.818 ± NA 
##  [26]  1.566 ± NA   3.315 ± NA   0.118 ± NA   3.317 ± NA   4.523 ± NA 
##  [31]  4.711 ± NA   1.488 ± NA   8.488 ± NA   3.614 ± NA   3.521 ± NA 
##  [36]  0.653 ± NA   3.522 ± NA   5.541 ± NA   3.377 ± NA   0.142 ± NA 
##  [41]  0.626 ± NA   1.211 ± NA   5.616 ± NA   0.210 ± NA   5.320 ± NA 
##  [46]  3.459 ± NA   2.315 ± NA   6.498 ± NA   9.055 ± NA   6.483 ± NA 
##  [51]  1.087 ± NA   2.593 ± NA   8.007 ± NA   1.388 ± NA   7.268 ± NA 
##  [56]  1.101 ± NA   6.091 ± NA   7.920 ± NA   4.646 ± NA   7.842 ± NA 
##  [61]  3.113 ± NA   7.560 ± NA   6.683 ± NA   5.670 ± NA   8.468 ± NA 
##  [66]  9.152 ± NA   0.390 ± NA   4.365 ± NA   8.228 ± NA   6.732 ± NA 
##  [71]  4.516 ± NA   1.166 ± NA   6.558 ± NA   5.129 ± NA   9.666 ± NA 
##  [76]  4.314 ± NA   5.069 ± NA   0.872 ± NA   7.728 ± NA   0.780 ± NA 
##  [81]  0.053 ± NA   1.594 ± NA   5.457 ± NA   9.755 ± NA   0.147 ± NA 
##  [86]  8.765 ± NA   1.165 ± NA   9.738 ± NA   0.195 ± NA   9.965 ± NA 
##  [91]  9.175 ± NA   8.583 ± NA   9.464 ± NA   3.360 ± NA   3.768 ± NA 
##  [96]  9.105 ± NA   1.337 ± NA   7.878 ± NA   4.354 ± NA   2.428 ± NA

mu <- post_rvars$alpha + post_rvars$beta * x_rvar
mu

## rvar<2500,4>[100] mean ± sd:
##   [1] 41.5 ± 0.25   3.6 ± 0.39  14.7 ± 0.29  61.6 ± 0.43  54.5 ± 0.36 
##   [6] 22.7 ± 0.24  34.2 ± 0.22   8.4 ± 0.34  49.6 ± 0.31   4.6 ± 0.38 
##  [11] 29.5 ± 0.22  11.5 ± 0.32  55.7 ± 0.37  38.5 ± 0.24  39.8 ± 0.24 
##  [16] 45.0 ± 0.28  52.3 ± 0.34  40.4 ± 0.25  58.8 ± 0.40  10.3 ± 0.33 
##  [21] 23.3 ± 0.24  41.0 ± 0.25  30.2 ± 0.22  14.1 ± 0.30  25.0 ± 0.23 
##  [26] 11.4 ± 0.32  21.9 ± 0.24   2.6 ± 0.40  21.9 ± 0.24  29.2 ± 0.22 
##  [31] 30.3 ± 0.22  10.9 ± 0.32  53.1 ± 0.34  23.7 ± 0.23  23.2 ± 0.24 
##  [36]  5.9 ± 0.37  23.2 ± 0.24  35.4 ± 0.23  22.3 ± 0.24   2.8 ± 0.40 
##  [41]  5.7 ± 0.37   9.2 ± 0.34  35.8 ± 0.23   3.2 ± 0.39  34.0 ± 0.22 
##  [46] 22.8 ± 0.24  15.9 ± 0.28  41.1 ± 0.25  56.6 ± 0.38  41.0 ± 0.25 
##  [51]  8.5 ± 0.34  17.6 ± 0.27  50.2 ± 0.32  10.3 ± 0.33  45.8 ± 0.28 
##  [56]  8.6 ± 0.34  38.7 ± 0.24  49.7 ± 0.31  30.0 ± 0.22  49.2 ± 0.31 
##  [61] 20.7 ± 0.25  47.5 ± 0.30  42.2 ± 0.26  36.1 ± 0.23  53.0 ± 0.34 
##  [66] 57.1 ± 0.38   4.3 ± 0.38  28.3 ± 0.22  51.6 ± 0.33  42.5 ± 0.26 
##  [71] 29.2 ± 0.22   8.9 ± 0.34  41.5 ± 0.25  32.9 ± 0.22  60.2 ± 0.41 
##  [76] 28.0 ± 0.22  32.5 ± 0.22   7.2 ± 0.36  48.6 ± 0.30   6.6 ± 0.36 
##  [81]  2.2 ± 0.40  11.5 ± 0.32  34.8 ± 0.22  60.8 ± 0.42   2.8 ± 0.40 
##  [86] 54.8 ± 0.36   8.9 ± 0.34  60.7 ± 0.42   3.1 ± 0.40  62.1 ± 0.43 
##  [91] 57.3 ± 0.38  53.7 ± 0.35  59.0 ± 0.40  22.2 ± 0.24  24.7 ± 0.23 
##  [96] 56.9 ± 0.38  10.0 ± 0.33  49.5 ± 0.31  28.2 ± 0.22  16.6 ± 0.28

We can see that mu is an rvar vector of 100 objects, one for each individual $i$. But, as we saw before, each of those objects is actually 10,000 data points - the distribution of $\mu_i$ for each individual. Again, let’s peek under the hood: here is the distribution of $\mu$ for individual $i=6$:

data.table(draws_of(mu[6]))

##          V1
##     1: 23.0
##     2: 22.4
##     3: 22.8
##     4: 22.7
##     5: 22.6
##    ---     
##  9996: 22.9
##  9997: 22.4
##  9998: 22.1
##  9999: 22.8
## 10000: 22.8
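For intuition, the rvar arithmetic above amounts to applying the formula once per posterior draw. Here is a minimal base-R sketch of that idea. The vectors alpha_draws and beta_draws are simulated stand-ins rather than the fitted posterior, and they are drawn independently, so the standard deviations will not match the output above (real posterior draws of the intercept and slope are correlated).

```r
set.seed(202)

n_draws <- 10000

# stand-ins for posterior draws (loosely mimicking the summaries shown earlier)
alpha_draws <- rnorm(n_draws, mean = 1.9, sd = 0.41)
beta_draws  <- rnorm(n_draws, mean = 6.0, sd = 0.07)

x <- c(6.554, 0.281, 2.115)   # the first three observed x values

# one row per draw, one column per individual:
# mu_draws[s, i] = alpha_draws[s] + beta_draws[s] * x[i]
mu_draws <- alpha_draws + outer(beta_draws, x)

dim(mu_draws)

# summarize each individual's distribution the way the rvar display does
rbind(mean = colMeans(mu_draws), sd = apply(mu_draws, 2, sd))
```

Each column of mu_draws plays the role of one element of the rvar vector mu.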

Now we are ready to generate the distribution of predicted values for each individual - again using a single command, rvar_rng, specifying that we want to generate data for each individual using the distribution of the rvar mu and the rvar sigma. We get 10,000 predicted values (our estimated distribution) for each of the 100 individuals:

pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)
str(pred)

##  rvar<2500,4>[100]  41.4 ± 2.2   3.6 ± 2.2  14.7 ± 2.2  61.6 ± 2.3 ...
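Conceptually, this step pairs each draw of $\mu_i$ with the draw of $\sigma$ from the same posterior sample and generates one normal value. The base-R sketch below illustrates that pairing with made-up draws; it is meant as a picture of the idea, not a description of how rvar_rng is implemented, and all object names are hypothetical.

```r
set.seed(303)

n_draws <- 5000
x <- c(1, 5, 9)   # three hypothetical individuals

# stand-ins for posterior draws of the parameters
alpha_draws <- rnorm(n_draws, 2, 0.4)
beta_draws  <- rnorm(n_draws, 6, 0.07)
sigma_draws <- abs(rnorm(n_draws, 2, 0.15))

mu_draws <- alpha_draws + outer(beta_draws, x)   # n_draws x 3

# one predicted value per draw and individual; sigma_draws is recycled down
# each column, so row s always uses sigma_draws[s]
pred_draws <- matrix(
  rnorm(length(mu_draws), mean = mu_draws, sd = sigma_draws),
  nrow = n_draws
)

# the predictive sd combines parameter uncertainty with residual noise
rbind(sd_mu = apply(mu_draws, 2, sd), sd_pred = apply(pred_draws, 2, sd))
```

The predictive standard deviations come out much larger than those of the means, which matches the pattern in the output above (roughly 2.2 for pred versus 0.2 to 0.4 for mu).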

Here, I randomly sample from the sample of 10,000 predicted values and plot this one instance of predicted values (in orange) along with the original data (in blue):

newdd <- data.table(x = dd$x, y = draws_of(pred)[sample(10000, 1),])
head(newdd)

##        x     y
## 1: 6.554 44.17
## 2: 0.281  2.57
## 3: 2.115 16.13
## 4: 9.889 60.41
## 5: 8.715 56.27
## 6: 3.448 24.04

ggplot(data = dd, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 1) +
  geom_point(color = "orange", size = 1, data = newdd) +
  theme(panel.grid = element_blank())

But we can actually visualize the distribution of predicted values for each individual and plot those distributions in relation to the actual data. If we want to look at the 80% interval for each individual (we could look at the 95% interval just as easily), we can estimate the interval bounds simply by applying the quantile function to the rvar pred:

interval80 <- t(quantile(pred, c(0.10, 0.90)))
head(interval80)

##       [,1]  [,2]
## [1,] 38.58 44.21
## [2,]  0.79  6.44
## [3,] 11.79 17.52
## [4,] 58.75 64.45
## [5,] 51.55 57.44
## [6,] 19.89 25.55

If the model is a good fit, we would expect the actual data to be scattered across those distributions without any obvious pattern, as is the case here. Not so surprising given the simulated data generation process:

df.80 <- data.table(x = dd$x, y=dd$y, interval80)
df.80[, extreme := !(y >= V1 & y <= V2)]

ggplot(data = df.80, aes(x = x, y = y)) +
  geom_segment(aes(y = V1, yend = V2, x = x, xend = x), color = "grey30", size = .1) +
  geom_point(aes(color = extreme), size = 1) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c("black", "red"))
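The interval and flagging logic can also be written out on a plain matrix of draws. This is a small base-R sketch with made-up predictive draws and hypothetical observed outcomes (y_obs), just to show what the quantile step and the extreme indicator are doing.

```r
set.seed(404)

n_draws <- 4000
mu <- c(10, 20, 30)   # hypothetical predictive means for three individuals

# n_draws x 3 matrix of toy predictive draws (sd = 2)
pred_draws <- sapply(mu, function(m) rnorm(n_draws, m, 2))

# 80% interval per individual: 10th and 90th percentiles of each column
interval80 <- t(apply(pred_draws, 2, quantile, probs = c(0.10, 0.90)))
colnames(interval80) <- c("lower", "upper")

# flag observations that fall outside their own interval
y_obs <- c(10.5, 24.0, 29.0)
extreme <- !(y_obs >= interval80[, "lower"] & y_obs <= interval80[, "upper"])

cbind(interval80, y_obs, extreme)
```

With these inputs only the second observation lands outside its interval, since 24 is well above the 90th percentile of a distribution centered at 20 with a standard deviation of 2.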

Bayesian p-value

I find the visual presentation pretty compelling, but if we want to quantify the model fit, one option is to estimate a Bayesian p-value, described in this Gelman paper as

\[\text{p-value}(y) = P(T(y^{rep}) > T(y) \ |\ y)\]

The probability is averaged over the parameters $\theta$ and is a function of the observed data $y$. $y^{rep}$ is the replicated or predicted data from the model (what we have saved in the variable pred). $T()$ is any function of the data that is reasonable in this context. The idea is that the p-value will not be extremely high or low (e.g., not less than 0.05 and not greater than 0.95) if the model is a good approximation of the actual data generating process. Since my main goal here is to illustrate the usefulness of the rvar datatype, and not necessarily to come up with the ideal test statistic $T$, I’ve created a pretty crude idea for $T$ in the context of linear regression.

The first step is to split the data defined by different values of predictors $x$ into different bins (in this case I’ll use five) and calculate the proportion of observed $y_i$’s that fall below the predicted mean $\mu_i$:

\[p_{b} = \frac{1}{n_b}\sum_{i=1}^{n_b} I(y_i < \mu_i), \ b \in \{1,\dots,B\} \]

We also do the same to estimate $p_b^{rep}$ for each bin, using the replicated/predicted values of $y$. We expect little variability in $p^{rep}$ across bins (i.e. $p_1^{rep} \approx \dots \approx p_5^{rep}$): by definition, predictions are randomly scattered around the means in each bin, with half above and half below. If the model is a good fit of the observed data, we would expect the $p_b$’s based on the observed data to all be close to 0.5 as well. However, if the model is a poor fit, there will likely be variability in the proportions based on the observed $y$’s across bins, so that $P(\text{var}(p^{rep}) > \text{var}(p))$ should be quite close to 0.
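As a quick illustration of how this statistic behaves, here is a base-R sketch that computes $T$ once for data that follow a linear model and once for data with an ignored quadratic term. It uses a single fixed set of means per data set rather than a posterior distribution (for the second data set, an ordinary least-squares line stands in for the fitted model), so it shows the statistic itself, not the full p-value. The helper T_stat is a hypothetical name.

```r
set.seed(505)

n <- 100
x <- runif(n, 0, 10)
bin <- cut(x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)

# T(): variance across bins of the proportion of outcomes below the model mean
T_stat <- function(y, mu) var(tapply(y < mu, bin, mean))

# data that follow the linear model, evaluated against the true linear mean
mu_linear <- 2 + 6 * x
y_good <- rnorm(n, mu_linear, 2)

# data with a quadratic term, evaluated against a straight-line fit
y_bad <- rnorm(n, 2 + 6 * x - 0.3 * x^2, 2)
mu_bad_fit <- fitted(lm(y_bad ~ x))

t_vals <- c(good = T_stat(y_good, mu_linear), bad = T_stat(y_bad, mu_bad_fit))
t_vals
```

When the curvature is ignored, the observed outcomes sit mostly below the line at both ends of the range and mostly above it in the middle, so the bin proportions spread out and $T$ grows.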

[As I write this, I’m noticing that this binned test statistic might bear some of the same motivations that underlie the Goldfeld-Quandt test for heteroscedasticity. OK, not quite, but perhaps it is very, very tangentially related? In any case, the more famous test was developed in part by my father; today would have been his 81st birthday, so I am very happy to make that (very subtle) connection.]

Estimating the p-value from the data

One cool feature of rvars is that they can be included in data.frames (though not in data.tables). This allows us to do some cool summarization without a lot of manipulation.

df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)

head(df)

##       x     y            mu         pred grp
## 1 6.554 43.13 41.47 ± 0.253 41.42 ± 2.21   4
## 2 0.281  3.32  3.61 ± 0.390  3.59 ± 2.22   1
## 3 2.115 16.72 14.67 ± 0.291 14.66 ± 2.23   2
## 4 9.889 59.35 61.59 ± 0.425 61.60 ± 2.25   5
## 5 8.715 53.21 54.51 ± 0.356 54.51 ± 2.27   5
## 6 3.448 24.41 22.72 ± 0.239 22.72 ± 2.24   2

In this case, I want to calculate the proportion of values where the observed $y$ is less than $\mu$ in each bin; I can use lapply on the data frame df to calculate each of those proportions. However, I am actually calculating the proportion 10,000 times within each bin, once for each sample, so I have a distribution of proportions within each bin.

bin_prop_y <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], I(y < mu))))
bin_prop_y

## [[1]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.57 ± 0.022 
## 
## [[2]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.48 ± 0.041 
## 
## [[3]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.53 ± 0.027 
## 
## [[4]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.4 ± 0.079 
## 
## [[5]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.63 ± 0.027

A brief word about the function rvar_mean that I’ve used here (there is a more detailed description on the posterior website). If we have samples of multiple variables, we can apply a function across the variables within a sample (as opposed to across samples within a single variable) by using one of the rvar_ summary functions, such as rvar_mean or rvar_var. Within each bin, there are roughly 20 variables (one for each individual), and by using the function rvar_mean, I am averaging across individuals within each sample to get a distribution of proportions within each bin.
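The distinction between the two directions of summarizing is easy to see on a tiny matrix of draws. This base-R sketch uses a made-up 4 x 3 matrix (samples in rows, variables in columns); it only illustrates the idea and does not use the posterior package.

```r
set.seed(606)

# toy draws: 4 posterior samples (rows) of 3 variables (columns)
draws <- matrix(
  round(rnorm(12, 5, 1), 1),
  nrow = 4,
  dimnames = list(sample = 1:4, variable = c("v1", "v2", "v3"))
)
draws

# across samples within each variable: one number per variable
colMeans(draws)

# across variables within each sample: one number per sample, which is
# itself a distribution with as many draws as the inputs
rowMeans(draws)
```

The second summary is the one that corresponds to averaging across individuals within each sample.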

In the next steps, I need to do a little bit of manipulation to make things work. I was hoping to avoid this, but I haven’t been able to figure out any other way to get the data in the right format to estimate the probability. I am basically taking the data underlying the random variable (the 10,000 values for each bin), creating a single array, and then creating a new rvar.

array_y <- abind(lapply(bin_prop_y, function(x) as_draws_array(draws_of(x))))
head(array_y)

## , , ...1
## 
##       1
## 1 0.577
## 2 0.538
## 3 0.577
## 4 0.538
## 5 0.577
## 6 0.577
## 
## , , ...1
## 
##       1
## 1 0.500
## 2 0.444
## 3 0.500
## 4 0.500
## 5 0.444
## 6 0.500
## 
## , , ...1
## 
##       1
## 1 0.529
## 2 0.529
## 3 0.529
## 4 0.529
## 5 0.529
## 6 0.529
## 
## , , ...1
## 
##       1
## 1 0.579
## 2 0.368
## 3 0.579
## 4 0.579
## 5 0.368
## 6 0.474
## 
## , , ...1
## 
##      1
## 1 0.65
## 2 0.60
## 3 0.65
## 4 0.65
## 5 0.60
## 6 0.65

(rv_y <- rvar(array_y))

## rvar<10000>[1,5] mean ± sd:
##   ...1          ...1          ...1          ...1          ...1         
## 1 0.57 ± 0.022  0.48 ± 0.041  0.53 ± 0.027  0.40 ± 0.079  0.63 ± 0.027

Here, I am repeating the steps on the predicted values ($y^{rep}$). Even with the inelegant coding, it is still only three lines:

bin_prop_pred <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (pred < mu))))
array_pred <- abind(lapply(bin_prop_pred, function(x) as_draws_array(draws_of(x))))
rv_pred <- rvar(array_pred)

Finally, we are ready to calculate the p-value using the distribution of test statistics $T$. Note that rvar_var is calculating the variance of the proportions across the bins within a single sample to give us a distribution of variances of the proportions based on the observed and predicted values. The overall p-value is the average, across samples, of the indicator that T_pred exceeds T_y.

(T_y <- rvar_var(rv_y))

## rvar<10000>[1] mean ± sd:
## [1] 0.0092 ± 0.0033

(T_pred <- rvar_var(rv_pred))

## rvar<10000>[1] mean ± sd:
## [1] 0.013 ± 0.0088

# p-value
mean(T_pred > T_y)

## [1] 0.585

As expected, since the data generation process and the model are roughly equivalent, the p-value is neither extremely large nor extremely small, indicating good fit.

Straying from the simple model assumptions

If we tweak the data generation process slightly by including a quadratic term, things change a bit:

\[y \sim N(\mu = 2 + 6*x - 0.3x^2, \ \sigma^2 = 4)\]

Below, I give you the code and output without any commentary, except to say that both the visual display and the p-value strongly suggest that the simple linear regression model is not a good fit for these data generated with an added quadratic term.

b_quad <- -0.3
dd <- genData(100, ddef)

fit <- mod$sample(
  data = list(N = nrow(dd), x = dd$x, y = dd$y),
  seed = 72651,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.1 seconds.

post_rvars <- as_draws_rvars(fit$draws())

x_rvar <- as_rvar(dd$x)
mu <- post_rvars$alpha + post_rvars$beta * x_rvar
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)

df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90))))
df.80[, extreme := !(y >= V1 & y <= V2)]

df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)

bin_prop_y <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (y < mu))))
array_y <- abind(lapply(bin_prop_y, function(x) as_draws_array(draws_of(x))))
rv_y <- rvar(array_y)

bin_prop_pred <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (pred < mu))))
array_pred <- abind(lapply(bin_prop_pred, function(x) as_draws_array(draws_of(x))))
rv_pred <- rvar(array_pred)

(T_y <- rvar_var(rv_y))

## rvar<10000>[1] mean ± sd:
## [1] 0.11 ± 0.012

(T_pred <- rvar_var(rv_pred))

## rvar<10000>[1] mean ± sd:
## [1] 0.013 ± 0.0091

mean(T_pred > T_y)

## [1] 0
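To see why the binned statistic reacts so strongly to the quadratic term, a rough non-Bayesian sketch may help. This is not the Stan model used above: it fits an ordinary least squares line with lm() to data simulated from the quadratic process (the seed and variable names are my own choices for illustration) and computes the proportion of observations that fall below the fitted line within each of the five bins of $x$.

```r
# Simulate from y ~ N(2 + 6x - 0.3x^2, sd = 2) and fit a straight line
set.seed(1)
x <- runif(100, 0, 10)
y <- rnorm(100, mean = 2 + 6 * x - 0.3 * x^2, sd = 2)
fit_lm <- lm(y ~ x)

# Same five bins of x that are used in the post
grp <- cut(x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)

# Proportion of observations below the fitted line, by bin
prop_below <- tapply(y < fitted(fit_lm), grp, mean)
prop_below

# Variance of the proportions across bins (the analogue of T_y)
var(prop_below)
```

If the straight line were adequate, the proportions would all hover around one half. With the curvature, they tend to be high at the two ends and low in the middle, which is what inflates the variance across bins.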

I followed up this post with a quick update here.

References:

Gelman, Andrew. “A Bayesian formulation of exploratory data analysis and goodness‐of‐fit testing.” International Statistical Review 71, no. 2 (2003): 369-382.

Goldfeld, Stephen M., and Richard E. Quandt. “Some tests for homoscedasticity.” Journal of the American statistical Association 60, no. 310 (1965): 539-547.

Kerman, Jouni, and Andrew Gelman. “Manipulating and summarizing posterior simulations using random variable objects.” Statistics and Computing 17, no. 3 (2007): 235-244.

Estimating a risk difference (and confidence intervals) using logistic regression

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 15 Jun 2021 00:00:00 +0000

The odds ratio (OR) – the effect size parameter estimated in logistic regression – is notoriously difficult to interpret. It is a ratio of two quantities (odds, under different conditions) that are themselves ratios of probabilities. I think it is pretty clear that a very large or small OR implies a strong treatment effect, but translating that effect into a clinical context can be challenging, particularly since ORs cannot be mapped to unique probabilities.

One alternative measure of effect is the risk difference, which is certainly much more intuitive. Although a difference is very easy to calculate when measured non-parametrically (you just calculate the proportion for each arm and take the difference), things get a little less obvious when there are covariates that need adjusting. (There is a method developed by Richardson, Robins, & Wang that allows analysts to model the risk difference, but I won’t get into that here.)

Currently, I’m working on an NIA IMPACT Collaboratory study evaluating an intervention designed to increase COVID-19 vaccination rates for staff and long-term residents in nursing facilities. A collaborator suggested we report the difference in vaccination rates rather than the odds ratio, arguing in favor of the more intuitive measure. From my perspective, the only possible downside in using a risk difference instead of an OR is that risk difference estimates are marginal, whereas odds ratios are conditional. (I’ve written about this distinction before.) The marginal risk difference estimate is a function of the distribution of patient characteristics in the study that influence the outcome, so the reported estimate might not be generalizable to other populations. The odds ratio, on the other hand, is not dependent on the covariates. The ultimate consensus on our research team is that the benefits of improved communication outweigh the potential loss of generalizability.

My goal here is to demonstrate the relative simplicity of estimating the marginal risk difference described in these papers by Kleinman & Norton and Peter Austin. I won’t be using real data from the study that motivated this, but will generate simulated data so that I can illustrate the contrast between marginal and conditional estimates.

Quickly defining the parameters of interest

In the study that motivated this, we had two study arms: an intervention arm that involved extensive outreach and vaccination promotion, and a control arm where nothing special was done. So, there are two probabilities that we are interested in: $p_1 \equiv P(\text{vaccinated} | \text{intervention})$ and $p_0 \equiv P(\text{vaccinated} | \text{control}).$

The risk difference comparing the two groups is simply

\[\text{RD} = p_1 - p_0,\] the odds $w_a$ for each treatment group is

\[w_a = \frac{p_a}{1-p_a}, \ \ a \in \{0,1\},\] and the odds ratio comparing the intervention arm to the control arm is

\[\text{OR} = \frac{w_1}{w_0}.\]

The logistic regression model treats the log odds as a linear function of the intervention status and any other covariates that are being adjusted for. In the examples below, there is one continuous covariate $x$ that ranges from -0.5 to 0.5:

\[\text{log}(w_A) = \alpha + \beta A + \gamma X.\] $\beta$ represents the log(OR) conditional on a particular value of $X$:

\[\begin{aligned} \text{log}(w_1) &= \alpha + \beta + \gamma X \\ \text{log}(w_0) &= \alpha + \gamma X, \end{aligned}\] and

\[\text{log(OR)} = \text{log}\left(\frac{w_1}{w_0}\right) =\text{log}(w_1) - \text{log}(w_0) = \beta\]

More importantly, we can move between odds and probability relatively easily:

\[\begin{aligned} \frac{p_a}{1-p_a} &= w_a \\ p_a &= w_a(1- p_a) \\ p_a + w_ap_a &= w_a \\ p_a &= \frac{w_a}{1 + w_a} \\ p_a &= \frac{1}{1 + w_a^{-1}} \end{aligned}\]
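As a small illustration of these conversions, here is a sketch in base R. The helper names p_to_odds and odds_to_p are hypothetical (they are not from any package), and the baseline probabilities are arbitrary values chosen for illustration. It shows that a single odds ratio of 2.5 corresponds to quite different risk differences depending on the control probability $p_0$.

```r
# Hypothetical helpers implementing w = p / (1 - p) and p = 1 / (1 + 1/w)
p_to_odds <- function(p) p / (1 - p)
odds_to_p <- function(w) 1 / (1 + 1 / w)

or <- 2.5                       # same odds ratio in every row
p0 <- c(0.05, 0.20, 0.50)       # arbitrary control probabilities

p1 <- odds_to_p(or * p_to_odds(p0))   # w_1 = OR * w_0, converted back to a probability
rd <- p1 - p0

round(data.frame(p0, p1, rd), 3)
```

The odds ratio is fixed, but the risk difference grows as $p_0$ moves from 0.05 toward 0.50, which is why an OR on its own cannot be mapped to a unique pair of probabilities.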

Estimating the marginal probability using model estimates

After fitting the model, we have estimates $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$. We can generate a pair of odds for each individual $i$ ($w_{i1}$ and $w_{i0}$) using their observed $x_i$ and the estimated parameters. All we need to do is set $a=1$ and $a=0$ to generate a predicted $\hat{w}_{i1}$ and $\hat{w}_{i0}$, respectively, for each individual. Note we do not pay attention to the actual treatment arm that the individual was randomized to:

\[ \text{log}(\hat{w}_{i1}) = \hat{\alpha} + \hat{\beta} + \hat{\gamma}x_i, \]

\[ \hat{w}_{i1} = \text{exp}(\hat{\alpha} + \hat{\beta} + \hat{\gamma}x_i). \]

Likewise,

\[ \hat{w}_{i0} = \text{exp}(\hat{\alpha} + \hat{\gamma}x_i). \] We get $\hat{p}_{ia}$ for $a \in \{0,1\}$ as

\[ \hat{p}_{ia} = \frac{1}{1 + \hat{w}_{ia}^{-1}}\]

Finally, the marginal risk difference $\widehat{\text{RD}}$ can be estimated as

\[ \widehat{\text{RD}} = \frac{1}{n}\sum_{i=1}^n \hat{p}_{i1} - \frac{1}{n}\sum_{i=1}^n \hat{p}_{i0} \]

from all $n$ study participants regardless of actual treatment assignment.

Fortunately, in R we don’t need to do any of these calculations as predictions on the probability scale can be extracted from the model fit. Standard errors of this risk difference can be estimated using bootstrap methods.
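To make the link between the formulas and predict() concrete, here is a minimal sketch using base R only. The data are simulated with runif() and rbinom() rather than simstudy, and the seed, sample size, and object names are assumptions made for illustration. It computes $\widehat{\text{RD}}$ once by hand from the estimated coefficients and once from predictions on the probability scale.

```r
set.seed(1)
n <- 500
x <- runif(n, -0.5, 0.5)                      # covariate on the same range as in the post
rx <- rep(0:1, length.out = n)                # simple alternating assignment
y <- rbinom(n, 1, plogis(-2 + log(2.5) * rx + 1.5 * x))
dd <- data.frame(rx, x, y)

fit <- glm(y ~ rx + x, data = dd, family = binomial)
est <- unname(coef(fit))                      # alpha-hat, beta-hat, gamma-hat

# By hand: plogis(z) = 1 / (1 + exp(-z)) converts log odds to a probability
p1_hand <- mean(plogis(est[1] + est[2] + est[3] * dd$x))
p0_hand <- mean(plogis(est[1] + est[3] * dd$x))
rd_hand <- p1_hand - p0_hand

# Using predict(): set everyone to rx = 1, then everyone to rx = 0
p1_pred <- mean(predict(fit, data.frame(rx = 1, x = dd$x), type = "response"))
p0_pred <- mean(predict(fit, data.frame(rx = 0, x = dd$x), type = "response"))
rd_pred <- p1_pred - p0_pred

all.equal(rd_hand, rd_pred)
```

Both routes average over all $n$ individuals, ignoring the arm each person was actually assigned to.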

Simulated data set

Before getting into the simulations, here are the packages needed to run the code shown here:

set.seed(287362)

library(simstudy)
library(data.table)
library(ggplot2)
library(ggthemes)
library(parallel)

I am generating a binary outcome $y$ that is a function of a continuous covariate $x$ that ranges from -0.5 to 0.5. I use the beta distribution to generate $x1$ which is transformed into $x$. The advantage of this distribution is the flexibility we have in defining the shape. The OR used to generate the outcome is 2.5:

def <- defDataAdd(varname = "x1", formula = "..mu_x", variance = 8, dist = "beta")
def <- defDataAdd(def, varname = "x", formula = "x1 - 0.5", dist = "nonrandom")
def <- defDataAdd(def, varname = "y", 
  formula = "-2 + log(2.5) * rx + 1.5 * x",
  dist = "binary", link="logit")

In the first scenario of 500 observations, the distribution of $x$ will be right-skewed. This is established by setting the mean of $x1$ close to 0:

mu_x = 0.2
  
dd_2 <- genData(500)
dd_2 <- trtAssign(dd_2, grpName = "rx")
dd_2 <- addColumns(def, dd_2)

ggplot(data = dd_2, aes(x = x)) +
  geom_histogram(fill="#9ec785", binwidth = 0.05, boundary = 0) +
  scale_x_continuous(limits = c(-.55, .55), breaks = seq(-.5, .5, by = .25)) +
  theme(panel.grid = element_blank())

The first step in estimating the risk difference is to fit a logistic regression model:

glmfit <- glm(y ~ rx + x, data = dd_2, family = "binomial")

Next, we need to predict the probability for each individual based on the model fit under each treatment condition. This will give us $\hat{p}_{i1}$ and $\hat{p}_{i0}$:

newdata <- dd_2[, .(rx=1, x)]
p1 <- mean(predict(glmfit, newdata, type = "response"))

newdata <- dd_2[, .(rx=0, x)]
p0 <- mean(predict(glmfit, newdata, type = "response"))

c(p1, p0)

## [1] 0.152 0.068

A simple calculation gives us the point estimate for the risk difference (and note that the estimated OR is close to 2.5, the value used to generate the data):

risk_diff <- p1 - p0
odds_ratio <- exp(coef(glmfit)["rx"])
  
c(rd = risk_diff, or = odds_ratio)

##    rd or.rx 
## 0.084 2.456

We can use a bootstrap method to estimate a 95% confidence interval for the risk difference. This involves sampling ids from each treatment group with replacement, fitting a new logistic regression model, predicting probabilities, and calculating the risk difference. This is repeated 999 times to get a distribution of risk differences, from which we extract an estimated confidence interval:

bootdif <- function(dd) {
  
  db <- dd[, .(id = sample(id, replace = TRUE)), keyby = rx]
  db <- merge(db[, id, rx], dd, by = c("id", "rx"))
  
  glmfit <- glm(y ~ rx + x, data = db, family = "binomial")

  newdata <- db[, .(rx=1, x)]
  p1 <- mean(predict(glmfit, newdata, type = "response"))

  newdata <- db[, .(rx=0, x)]
  p0 <- mean(predict(glmfit, newdata, type = "response"))

  return(p1 - p0)
}

bootest <- unlist(mclapply(1:999, function(x) bootdif(dd_2), mc.cores = 4))
quantile(bootest, c(0.025, 0.975))

##  2.5% 97.5% 
## 0.031 0.137

Change in distribution changes risk difference

To illustrate how a shift in the distribution of $x$ can influence the marginal risk difference without changing the odds ratio, I just need to specify the mean of $x1$ to be closer to 1. This creates a left-skewed distribution that will increase the risk difference:

mu_x = 0.8

The risk difference appears to increase, but the OR seems to be pretty close to the true value of 2.5:

##    rd or.rx 
##  0.18  2.59

And for completeness, here is the estimated confidence interval:

##  2.5% 97.5% 
##  0.10  0.25
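As a rough check on the direction of this shift, the marginal risk difference implied by the data-generating model can be approximated directly, without fitting anything. The sketch below is base R only and rests on an assumption I have not verified against the simstudy source: that the "beta" distribution is parameterized by mean $\mu$ and precision $\phi$ (here 8), so that $x1 \sim \text{Beta}(\mu\phi, (1-\mu)\phi)$. The function name implied_rd is hypothetical.

```r
# Marginal RD implied by logit(p) = -2 + log(2.5) * rx + 1.5 * x,
# approximated by averaging over a large sample of x.
# Assumption: x1 ~ Beta(mu_x * phi, (1 - mu_x) * phi) with phi = 8.
implied_rd <- function(mu_x, phi = 8, n = 1e5) {
  x <- rbeta(n, mu_x * phi, (1 - mu_x) * phi) - 0.5
  p1 <- plogis(-2 + log(2.5) + 1.5 * x)   # everyone treated
  p0 <- plogis(-2 + 1.5 * x)              # everyone in control
  mean(p1) - mean(p0)
}

set.seed(1)
rd_by_mu <- sapply(c(0.2, 0.4, 0.6, 0.8), implied_rd)
round(rd_by_mu, 3)
```

Because the conditional odds ratio is held at 2.5 throughout, any change in these values comes only from the shift in the distribution of $x$.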

A more robust comparison

It is hardly fair to evaluate this property using only two data sets. It is certainly possible that the estimated risk differences are inconsistent just by chance. I have written some functions (provided below in the addendum) that facilitate the replication of numerous data sets created under different distribution assumptions to generate a distribution of estimated risk differences (as well as a distribution of estimated ORs). I have generated 5000 data sets of 500 observations each under four different assumptions of mu_x used in the data generation process defined above: {0.2, 0.4, 0.6, 0.8}.

It is pretty apparent that the risk difference increases as the distribution of $x$ shifts from right-skewed to left-skewed:

And it is equally apparent that shifting the distribution has no impact on the OR, which is consistent across different levels of $x$:

References:

Austin, Peter C. “Absolute risk reductions, relative risks, relative risk reductions, and numbers needed to treat can be obtained from a logistic regression model.” Journal of clinical epidemiology 63, no. 1 (2010): 2-6.

Kleinman, Lawrence C., and Edward C. Norton. “What’s the risk? A simple approach for estimating adjusted risk measures from nonlinear models including logistic regression.” Health services research 44, no. 1 (2009): 288-302.

Richardson, Thomas S., James M. Robins, and Linbo Wang. “On modeling and estimation for the relative risk and risk difference.” Journal of the American Statistical Association 112, no. 519 (2017): 1121-1130.

Addendum: replication code

s_define <- function() {
  
  def <- defDataAdd(varname = "x1", formula = "..mu_x", variance = 8, dist = "beta")
  def <- defDataAdd(def, varname = "x", formula = "x1 - 0.5", dist = "nonrandom")
  def <- defDataAdd(def, varname = "y", 
    formula = "-2 + log(2.5) * rx + 1.5 * x",
    dist = "binary", link="logit")
  
  return(list(def = def)) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addColumns(def, dx)
  
  return(dx) #  generated data is a data.table
}

s_model <- function(dx) {
  
  glmfit <- glm(y ~ rx + x, data = dx, family = "binomial")
  
  newdata <- dx[, .(rx=1, x)]
  p1 <- mean(predict(glmfit, newdata, type = "response"))
  
  newdata <- dx[, .(rx=0, x)]
  p0 <- mean(predict(glmfit, newdata, type = "response"))
  
  risk_diff <- p1 - p0
  odds_ratio <- exp(coef(glmfit)["rx"])
  
  model_results <- data.table(risk_diff, odds_ratio)
  
  return(model_results) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}


s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()
  
  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim, 
      FUN = function(x) s_single_rep(list_of_defs, argsvec), 
      mc.cores = 4)
  )
  
  model_results <- cbind(t(argsvec), model_results)
  
  return(model_results) # model_results is a data.table
}

### Scenarios

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

#----

n <- 500
mu_x <- c(0.2, 0.4, 0.6, 0.8)

scenarios <- scenario_list(n = n, mu_x = mu_x)

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 5000)))

ggplot(data = summary_stats, aes(x = risk_diff, group = mu_x)) +
  geom_density(aes(fill = factor(mu_x)), alpha = .7) +
  scale_fill_canva(palette = "Simple but bold", name = "mu_x") +
  theme(panel.grid = element_blank()) +
  xlab("estimated risk difference")

ggplot(data = summary_stats, aes(x = odds_ratio, group = mu_x)) +
  geom_density(aes(fill = factor(mu_x)), alpha = .7) +
  scale_fill_canva(palette = "Simple but bold", name = "mu_x") +
  theme(panel.grid = element_blank()) +
  xlab("estimated odds ratio")

Sample size determination in the context of Bayesian analysis

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 01 Jun 2021 00:00:00 +0000

Given my recent involvement with the design of a somewhat complex trial centered around a Bayesian data analysis, I am appreciating more and more that Bayesian approaches are a very real option for clinical trial design. A key element of any study design is sample size. While some would argue that sample size considerations are not critical to the Bayesian design (since Bayesian inference is agnostic to any pre-specified sample size and is not really affected by how frequently you look at the data along the way), it might be a bit of a challenge to submit a grant without telling the potential funders how many subjects you plan on recruiting (since that could have a rather big effect on the level of resources, both financial and time, required).

Earlier, I touched a bit on these issues while discussing the frequentist properties of Bayesian models, but I didn’t really get directly into sample size considerations. I’ve been doing some more exploring and simulating, so I am sharing some of that here.

Bayesian inference

In the Bayesian framework, all statistical inference is based on the estimated posterior probability distribution for the parameter(s) of interest (say $\theta$) once we have observed the data: $P(\theta | \text{data})$. In addition to extracting the mean or median of the distribution as a point estimate, we can get a measure of uncertainty by extracting quantiles from the distribution (a 95% interval comes to mind, though there is no reason to be limited by that convention).

Alternatively, we can make a probability statement about the parameter being above or below a threshold of effectiveness. For example, if we are estimating a log-odds ratio for an intervention that prevents a bad outcome, we might be interested in $P(log(OR) < 0).$ We may even pre-specify that the trial will be considered a success if $P(log(OR) < 0) > 0.95.$
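In practice, these quantities are simple summaries of the posterior draws. Here is a minimal sketch in base R. The draws are a stand-in generated with rnorm() (the mean, standard deviation, and number of draws are arbitrary assumptions), not output from a fitted model.

```r
set.seed(1)

# Stand-in for posterior draws of the log-odds ratio (NOT from a real fit)
draws <- rnorm(16000, mean = -0.6, sd = 0.5)

median(draws)                            # point estimate
quantile(draws, c(0.025, 0.975))         # 95% interval

p_below_0 <- mean(draws < 0)             # estimate of P(log(OR) < 0 | data)
success <- p_below_0 > 0.95              # pre-specified success criterion

c(p_below_0 = p_below_0, success = success)
```

Requiring $P(log(OR) < 0) > 0.95$ amounts to requiring that the 95th percentile of the draws fall below 0, so the criterion can be checked either way.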

library(simstudy)
library(data.table)
library(ggplot2)
library(cmdstanr)
library(posterior)
library(bayesplot)

Data generation

To investigate, I will use a simple binary outcome $Y$ that is changed by exposure or intervention $A$. In this first case, I will randomly select a log-odds ratio from $N(\mu = -1, \sigma = 0.5).$

defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * A", 
  dist = "binary", link="logit")

set.seed(21)
lor <- rnorm(1, -1, 0.5)

dT <- genData(200)
dT <- trtAssign(dT, grpName = "A")
dT <- addColumns(defB, dT)

Model fitting

I am primarily interested in recovering the log-odds ratio used to generate the data using a simple Bayesian model, written here in Stan. The parameter of interest in the Stan model is $\beta$, the log-odds ratio. The prior distribution is $t_{student}(df=3, \mu=0, \sigma=5).$

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
  vector[N] x;
  real mu;
  real s;
}

parameters {
  real alpha;
  real beta;
}

model {
  beta ~ student_t(3, mu, s);
  y ~ bernoulli_logit(alpha + beta * x);
}

To estimate the posterior distribution, I am using the R package cmdstanr:

mod <- cmdstan_model("code/bayes_logistic.stan")

fit <- mod$sample(
  data =  list(N=nrow(dT), y=dT$Y, x=dT$A, mu=0, s=5),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 4000,
  step_size = 0.1,
  show_messages = FALSE
)

## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.4 seconds.

(If you’re impressed at how fast that model ran, it is because it is on my new MacBook Pro with the new Apple M1 chip - 4 or 5 times faster than my previous MacBook Pro with an Intel chip. It took me a while to get R, RStudio, and particularly, cmdstan up and running, but once I did, it has been totally worth it.)

First thing to check, of course, is whether the sampling from the posterior distribution was well-behaved. Here is a trace plot for the parameter $\beta$:

draws_array <- as_draws_array(fit$draws())
mcmc_trace(draws_array, pars = "beta")

Here are the summary statistics of the posterior distribution. Based on these data, the median log-odds ratio is $-0.61$ and $P(lor < 0) = 89\%$:

res <- data.table(fit$summary(variables = "beta"))[, 
          .(median, sd, q95, len = q95-q5)]

betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
res$p0 <- mean(betas$beta.V1 < 0)

res

##        median       sd       q95      len      p0
## 1: -0.6050845 0.511862 0.2103548 1.673138 0.88875

A plot of the posterior distribution is the best way to fully assess the state of knowledge about the parameter having observed this data set. The density plot includes a vertical dashed line at the median, and the dark shading indicates the lowest $95\%$ of the density. The fact that the cutoff point $0$ lies within the bottom $95\%$ makes it clear that the threshold was not met.

d <- density(draws_array[,,"beta"], n = 1024)
plot_points <- as.data.table(d[c("x", "y")])
median_xy <- plot_points[findInterval(res$median, plot_points$x)]

ggplot(data = plot_points, aes(x = x, y = y)) +
  geom_area(aes(fill = (x < res$q95))) +
  geom_segment(x = median_xy$x, xend=median_xy$x, y=0, yend = median_xy$y,
               size = 0.2, color = "white", lty=3) +
  scale_fill_manual(values = c("#adc3f2", "#5886e5")) +
  theme(panel.grid = element_blank(),
        legend.position = "none")

Bayesian power

If we want to assess what kind of sample sizes we might want to target in a study based on this relatively simple design (binary outcome, two-armed trial), we can conduct a Bayesian power analysis that has a somewhat different flavor from the more typical frequentist power analysis that I usually do with simulation. There are a few resources I’ve found very useful here: this book by Spiegelhalter et al and these two papers, one by Wang & Gelfand and another by De Santis & Gubbiotti.

When I conduct a power analysis within a frequentist framework, I usually assume a set of fixed/known effect sizes, and the hypothesis tests are centered around the frequentist p-value at a specified level of $\alpha$. The Bayesian power analysis differs with respect to these two key elements: a distribution of effect sizes replaces the single fixed effect size to accommodate uncertainty, and the posterior distribution probability threshold (or another criterion such as the variance of the posterior distribution or the length of the 95% credible interval) replaces the frequentist hypothesis test.

We have a prior distribution of effect sizes. De Santis and Gubbiotti suggest it is not necessary (and perhaps less desirable) to use the same prior used in the model fitting. That means you could use a skeptical (conservative) prior centered around 0 in the analysis, but use a prior for data generation that is consistent with a clinically meaningful effect size. In the example above the analysis prior was

\[ \beta \sim t_{student}(df = 3, \mu = 0, \sigma = 5) \]

and the data generation prior was

\[ \beta \sim N(\mu = -1, \sigma = 0.5).\]

To conduct the Bayesian power analysis, I replicated the simulation and model fitting shown above 1000 times for each of seven different sample sizes ranging from 100 to 400. (Even though my laptop is quite speedy, I used the NYU Langone Health high performance cluster Big Purple to do this, because I wanted to save a few hours.) I’m not showing the parallelized code in this post, but take a look here for an example similar to this. (I’m happy to share with anyone if you’d like to have the code. Updated 7/1/2021: code has been added in the Addendum below.)

The plots below show a sample of 20 posterior distributions taken from the 1000 generated for each of three sample sizes. As in the frequentist context, an increase in sample size appears to reduce the variance of the posterior distribution estimated in a Bayesian model. We can see visually that as the sample size increases, the distribution collapses towards the mean or median, which has a direct impact on how confident we are in drawing conclusions from the data; in this case, it is apparent that as sample size increases, the proportion of posterior distributions that meet the 95% threshold increases.

Here is a curve that summarizes the probability of a posterior distribution meeting the 95% threshold at each sample size level. At a size of 400, 80% of the posterior distributions (which are themselves based on data generated from varying effect sizes specified by the data generation prior and the analysis prior) would lead us to conclude that the trial is a success.
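The logic of the power calculation can be sketched quickly in base R if we are willing to swap the Stan model for a crude approximation. To be clear, the code below does not use the $t$ prior or MCMC: it approximates each posterior by a normal distribution centered at the glm() estimate with the estimated standard error, so the numbers will not match the results described here. The function name approx_power, the number of replications, the seed, and the alternating treatment assignment are all assumptions made for illustration.

```r
# Approximate "Bayesian power": share of simulated trials in which the
# approximate P(log-OR < 0 | data) exceeds 0.95.
# Posterior approximated by N(estimate, SE^2) from glm(); NOT the Stan model.
approx_power <- function(nobs, nsim = 200, mu_lor = -1, sigma_lor = 0.5) {
  success <- replicate(nsim, {
    lor <- rnorm(1, mu_lor, sigma_lor)            # effect drawn from the data generation prior
    a <- rep(0:1, length.out = nobs)              # alternating 1:1 allocation (a simplification)
    y <- rbinom(nobs, 1, plogis(-2 + lor * a))    # same outcome model as defB
    fit <- suppressWarnings(glm(y ~ a, family = binomial))
    est <- coef(summary(fit))["a", ]
    p_neg <- pnorm(0, mean = est[["Estimate"]], sd = est[["Std. Error"]])
    isTRUE(p_neg > 0.95)                          # did this trial meet the threshold?
  })
  mean(success)
}

set.seed(2021)
power_by_n <- sapply(c(100, 400), approx_power)
names(power_by_n) <- c("n = 100", "n = 400")
power_by_n
```

Simulated trials with no events in one arm produce enormous standard errors under this approximation and are counted as failures. The Stan model with a proper prior handles those cases more gracefully.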

References:

Wang, Fei, and Alan E. Gelfand. “A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models.” Statistical Science 17, no. 2 (2002): 193-208.

Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. Bayesian approaches to clinical trials and health-care evaluation. Vol. 13. John Wiley & Sons, 2004.

De Santis, Fulvio, and Stefania Gubbiotti. “Sample Size Requirements for Calibrated Approximate Credible Intervals for Proportions in Clinical Trials.” International Journal of Environmental Research and Public Health 18, no. 2 (2021): 595.

Addendum

Here is the full R code for the Bayesian power analysis using simulation. I am including the slurmR code that I used to execute on the HPC:

library(simstudy)
library(data.table)
library(ggplot2)
library(bayesplot)
library(posterior)
library(cmdstanr)
library(slurmR)
library(collapse)

s_define <- function() {
    
  defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * rx", 
                     dist = "binary", link="logit")
  
  return(list(defB = defB)) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {
    
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- add data generation code ---#
  
  lor <- rnorm(1, mu.lor, sigma.lor)
  
  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "rx")
  dT <- addColumns(defB, dT)
  
  return(dT[])
  
}

s_model <- function(generated_data, mod, argsvec) {
    
  list2env(as.list(argsvec), envir = environment())
    
  dt_to_list <- function(dx) {
    
    N <- nrow(dx)                  ## number of observations 
    y <- dx$Y                      ## individual outcome 
    x <- dx$rx                     ## treatment arm for individual 
    s <- t_sigma
    mu <- 0 # can be mu.lor
      
    list(N=N, y=y, x=x, s=s, mu = mu)
  }

  fit <- mod$sample(
    data = dt_to_list(generated_data),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 1000,
    iter_sampling = 4000,
    step_size = 0.1,
    show_messages = FALSE
  )
    
  res <- data.table(fit$summary(variables = "beta"))[, .(median, sd, q95, len = q95-q5)]
    
  draws_array <- as_draws_array(fit$draws())
  betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
  res$p0 <- mean(betas$beta.V1 < 0)

  return(res) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec, mod) {
    
  set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
  
  list_of_defs <- s_define()
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data, mod, argsvec)
  
  return(model_results)
}
  
s_replicate <- function(argsvec, nsim, mod) {
  
  list_of_defs <- s_define()
  
  model_results <- 
    lapply(
      X = 1 : nsim, 
      FUN = function(x) s_single_rep(list_of_defs, argsvec, mod)
    )
  
  #--- add summary statistics code ---#
  
  model_sums <- unlist2d(lapply(model_results, function(x) x), 
    idcols = "replicate", DT = TRUE)
  
  summary_stats <- model_sums[ , 
    .(p_95 = mean(p0 >= 0.95), 
      p_len = mean(len <= 2),
      p_sd = mean(sd <= 0.5))
  ]
  
  model_ests <- data.table(t(argsvec), summary_stats)

  return(model_ests)
  
}

###

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

mu.lor <- c(0, -0.5, -1.0, -1.5)
sigma.lor <- c(0.25)
nobs <- c(100, 150, 200, 250, 300, 350, 400)
t_sigma <- c(1, 5, 10)

scenarios <- scenario_list(mu.lor = mu.lor, sigma.lor = sigma.lor, 
                           nobs = nobs, t_sigma = t_sigma)

set_cmdstan_path(path = ".../cmdstan/2.25.0")
mod <- cmdstan_model("present.stan")

job <- Slurm_lapply(
  X = scenarios, 
  FUN = s_replicate, 
  mod = mod,
  nsim = 1200,
  njobs = min(length(scenarios), 90L), 
  mc.cores = 4L,
  job_name = "i_bp",
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_single_rep", "s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

summary_stats <- Slurm_collect(job)
final_tab <- rbindlist(summary_stats)

save(final_tab, file = ".../bp.rda")

Generating random lists of names with errors to explore fuzzy word matching

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 13 Apr 2021 00:00:00 +0000

Health data systems are not always perfect, a point that was made quite obvious when a study I am involved with required a matched list of nursing home residents taken from one system with a set of results from PCR tests for COVID-19 drawn from another. Name spellings for the same person from the second list were not always consistent across different PCR tests, nor were they always consistent with the cohort we were interested in studying defined by the first list. My research associate, Yifan Xu, and I were asked to see what we could do to help out.

This is my first foray into fuzzy word-matching. We came up with a simple solution to match the names on the two lists based on the R function adist that should allow the research team to finalize a matched list with minimal manual effort.

In order to test our proposed approach, we developed a way to generate random lists of names with errors. This post presents both the code for random list generation with errors as well as the simple matching algorithm.

Distance between strings

Fuzzy word matching can be approached using the concept of string distance. Quite simply, this can be measured by counting the number of transformations required to move from the original to the target word. A transformation is one of three moves: (1) a substitution, (2) an insertion, or (3) a deletion. The figure below illustrates the 5 “moves” that are required to get from CAT to KITTEN: two substitutions and three insertions.

The adist function can calculate this string distance, and if you set counts = TRUE, the function will provide the number of substitutions, insertions, and deletions. Here are the results for our example:

adist("CAT", "KITTEN", counts = TRUE)

##      [,1]
## [1,]    5
## attr(,"counts")
## , , ins
## 
##      [,1]
## [1,]    3
## 
## , , del
## 
##      [,1]
## [1,]    0
## 
## , , sub
## 
##      [,1]
## [1,]    2
## 
## attr(,"trafos")
##      [,1]    
## [1,] "SSIMII"

Assessing whether a distance is meaningful or no longer fuzzy certainly depends on the nature of the problem and the length of the strings. The distance from CAT to DOG is 3 (with 3 substitutions), and so is the distance from DERE, STEPHEN to DEERE, STEVEN (1 insertion, 1 deletion, and 1 substitution); we might be willing to declare the individual’s name a match while declining to pair the two different animals.
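One way to take string length into account is to scale the distance by the length of the longer string. This is only a sketch of the idea, not necessarily the rule used in our matching procedure, and norm_dist is a hypothetical helper name.

```r
# Raw edit distances for the two pairs discussed above
adist("CAT", "DOG")
adist("DERE, STEPHEN", "DEERE, STEVEN")

# Hypothetical helper: edit distance as a fraction of the longer string's length
norm_dist <- function(a, b) {
  drop(adist(a, b)) / pmax(nchar(a), nchar(b))
}

norm_dist("CAT", "DOG")                         # every character differs
norm_dist("DERE, STEPHEN", "DEERE, STEVEN")     # 3 edits across 13 characters
```

On this scale the two animals are as far apart as two three-letter strings can be, while the two spellings of the name differ in less than a quarter of their characters.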

Simulating lists of names with errors

To test out our fuzzy matching process, we need to be able to create a master list of names from which we can create two sub-lists: (1) the cohort list of nursing home residents with correct name spellings, and (2) the list of PCR records that includes multiple records (test results) per individual, with possible inconsistent name spellings across the different tests for a specific person.

Generating names

The master list can easily be generated using the randomNames function in the R package randomNames. A call to this function provides samples of names from a large scale database. (It provides gender and ethnic variation if you need it.)

library(data.table)
library(randomNames)

set.seed(6251)
randomNames(4)

## [1] "Hale, James"      "el-Qazi, Najeema" "Sourn, Raj"       "Jensen, Tia"

Generating errors

To facilitate the generation of spelling errors, I’ve created a function that takes a string, a specified number of substitutions, a number of insertions (if negative then these are deletions), and an indicator that flips the order of the names (typically “Last Name, First Name”):

mis_string <- function(name, subs = 1, ins = 0, flip = FALSE) {
  
  names <- trimws(unlist(strsplit(name, split = ",")))
  
  if (subs) {
    for (i in 1 : subs) {
      change <- sample(1:2, 1)
      ii <- sample(nchar(names[change]), 1)
      l <- substr(names[change], ii, ii)
      s <- sample(letters[letters != l], 1)
      names[change] <- sub(l, s, names[change]) 
    }
  }
  
  if (ins > 0) {
    for (i in 1 : ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii-1) <- sample(letters, 1)
    }
  }
  
  if (ins < 0) {
    for (i in 1 : -ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii) <- ""
    }
  }
  
  paste(names[flip + 1], names[2 - flip], sep = ", ")
  
}

Here are two applications of mis_string on the name “Vazquez Topete, Deyanira”:

mis_string("Vazquez Topete, Deyanira", subs = 2, ins = 2)

## [1] "Vazhquiez Topete, Dmyanika"

mis_string("Vazquez Topete, Deyanira", subs = 1, ins = -2, flip = TRUE)

## [1] "Deynira, uazquez Topet"
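One detail of mis_string is worth noting: sub(l, s, ...) replaces the first occurrence of the sampled letter in the name, which is not necessarily the sampled position ii. For generating test data this hardly matters, but if an exact position is wanted, a variant along the following lines could be used. The helper sub_at is hypothetical and is shown only as a sketch.

```r
# Hypothetical variant: replace the character at exactly position `pos`
sub_at <- function(s, pos = sample(nchar(s), 1)) {
  old <- substr(s, pos, pos)
  # choose a lowercase letter different from the character being replaced
  substr(s, pos, pos) <- sample(letters[letters != old], 1)
  s
}

set.seed(1)
x <- sub_at("Deere", pos = 3)
x
adist("Deere", x)    # exactly one substitution away from the original
```

The replacement form of substr() overwrites characters in place, so the length of the string is unchanged and only the requested position is altered.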

Master list definitions

To generate the master list we define (using simstudy) a set of key parameters: an indicator pcr identifying whether the person has at least one test (70% will have a test), an indicator resident identifying whether the person is part of our resident cohort (20% of those on the master list will be residents), and a variable for the number of tests an individual has (conditional on having at least 1 test). There will be names on the master list that have no tests and are not residents; these patients are removed from the master list.

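library(simstudy)
library(data.table)
library(randomNames)  # provides randomNames(), used below to generate the names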

def_n <- defDataAdd(varname = "pcr", formula = 0.7, dist="binary")
def_n <- defDataAdd(def_n, varname = "resident", formula = 0.2, dist="binary")
def_n <- defDataAdd(def_n, varname = "n_tests", formula = 3, dist="noZeroPoisson")

PCR list error definitions

Each person with a PCR test will have one or more records in the PCR list. The following set of definitions indicates the number of substitutions and insertions (both specified as categorical integer variables) as well as whether the first and last names should be flipped.

def_p <- defDataAdd(varname = "subs", formula = ".8;.15;.04;.01", dist="categorical")
def_p <- defDataAdd(def_p, varname = "ins", 
  formula = "0.02;0.05;0.86;0.05;0.02", dist="categorical")
def_p <- defDataAdd(def_p, varname = "flip", formula = 0.10, dist="binary")
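As a quick sanity check on these definitions, here is a small base R sketch (the vector names are mine, not part of the original code). It maps each categorical level to the error count it represents, using the same shift that is applied when the data are generated (subs - 1 and ins - 3), and computes the share of PCR records we should expect to contain no error at all.

```r
# Probabilities copied from the definitions of subs, ins, and flip
p_subs <- c(0.80, 0.15, 0.04, 0.01)        # categorical levels 1..4
p_ins  <- c(0.02, 0.05, 0.86, 0.05, 0.02)  # categorical levels 1..5
p_flip <- 0.10

# Shift the levels: level - 1 substitutions, level - 3 insertions (negative = deletions)
names(p_subs) <- seq_along(p_subs) - 1
names(p_ins)  <- seq_along(p_ins) - 3

# Probability of no substitutions, no insertions or deletions, and no flip
p_clean <- unname(p_subs["0"] * p_ins["0"] * (1 - p_flip))
p_clean
```

Under these assumptions, a bit more than 60% of the PCR records should match their master-list name exactly.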

Generating the data

In this simulation I am generating 50 possible names:

set.seed(3695)

n <- 50

d_master <- data.table(id = 1:n, name = randomNames(n))
d_master <- addColumns(def_n, d_master)
d_master <- d_master[(pcr | resident)]

head(d_master)

##    id           name pcr resident n_tests
## 1:  1  Maas, Synneva   1        0       3
## 2:  2   Rock, Alyssa   1        0       3
## 3:  3    Lee, August   1        0       1
## 4:  4   Keefe, Dylan   1        0       1
## 5:  5       Yang, An   1        0       1
## 6:  6 Andrew, Crysta   1        0       3

In this case, there are 7 individuals in the resident cohort, and 40 individuals have at least one PCR test. Five residents were tested:

d_master[, .(
    num_res = sum(resident), 
    num_pcr = sum(pcr), 
    num_both = sum( (resident & pcr) )
  )
]

##    num_res num_pcr num_both
## 1:       7      40        5

The PCR list will have 110 total tests for the 40 people with tests.

d_pcr <- genCluster(d_master[pcr == 1], "id", "n_tests", "pcr_id")
d_pcr <- addColumns(def_p, d_pcr)
d_pcr[, subs := subs - 1]
d_pcr[, ins := ins - 3]
d_pcr[, obs_name := mapply(mis_string, name, subs, ins, flip)]

d_pcr[, .(pcr_id, id, name, obs_name, subs, ins, flip)]

##      pcr_id id            name         obs_name subs ins flip
##   1:      1  1   Maas, Synneva    Sycnfvq, Maas    3   0    1
##   2:      2  1   Maas, Synneva    Synneva, Maas    0   0    1
##   3:      3  1   Maas, Synneva    Maas, Sznneva    1   0    0
##   4:      4  2    Rock, Alyssa     Rock, Alyssa    0   0    0
##   5:      5  2    Rock, Alyssa     Ropk, Alyssa    1   0    0
##  ---                                                         
## 106:    106 48 Wall, Sebastian Wall, Sebtastian    0   1    0
## 107:    107 48 Wall, Sebastian  Wall, kebastian    1   0    0
## 108:    108 48 Wall, Sebastian  wall, Sebastian    1   0    0
## 109:    109 49   Tafoya, April    Tafoya, April    0   0    0
## 110:    110 49   Tafoya, April    Tafoya, April    0   0    0

We end up with two lists - one with residents only and one with PCR tests. This mimics the actual data we might get from our flawed health data systems.

d_res <- d_master[resident == 1, .(id, name)]
d_pcr <- d_pcr[, .(pcr_id, id, name = obs_name, pcr)]

The truth

Before proceeding to the matching, here are the PCR test records for the residents. This is the correct match that we hope to recover.

d_pcr[ id %in% d_res$id]

##     pcr_id id               name pcr
##  1:     57 26       Diaper, nody   1
##  2:     58 26       Draper, Cody   1
##  3:     59 26      Cody, Drapeyr   1
##  4:     60 27  al-Naqvi, Qamraaa   1
##  5:     61 27  al-Naqvi, Qamraaa   1
##  6:     64 29 el-Hallal, Zahraaa   1
##  7:     65 29 Zahraaa, el-Hallal   1
##  8:     86 42       Allen, Jalyn   1
##  9:     87 42        llen, Jalyn   1
## 10:     88 42       Allen, Jalyn   1
## 11:    102 47 Sanandres, Bzandon   1
## 12:    103 47 Sananores, Brandon   1
## 13:    104 47 Sanandues, Brandon   1
## 14:    105 47 Sanandres, Brandon   1

Fuzzy matching of simulated data

The fuzzy matching is quite simple (and I’ve simplified even more by ignoring the possibility that the data have been flipped). The first step is to merge each PCR row with each resident name, which in this case will result in $7 \times 110 = 770$ rows. The idea is that we will be comparing each of the names from the PCR tests with each of the resident names. In the merged data table dd, x is the resident name, and name is the PCR test list name.

dd <- data.table(merge(d_res$name, d_pcr))
dd

##                            x pcr_id id          name pcr
##   1:           Korenek, Tara      1  1 Sycnfvq, Maas   1
##   2:            Draper, Cody      1  1 Sycnfvq, Maas   1
##   3:       al-Naqvi, Qamraaa      1  1 Sycnfvq, Maas   1
##   4:      el-Hallal, Zahraaa      1  1 Sycnfvq, Maas   1
##   5: Slee Ackerson, Jeremiah      1  1 Sycnfvq, Maas   1
##  ---                                                    
## 766:       al-Naqvi, Qamraaa    110 49 Tafoya, April   1
## 767:      el-Hallal, Zahraaa    110 49 Tafoya, April   1
## 768: Slee Ackerson, Jeremiah    110 49 Tafoya, April   1
## 769:            Allen, Jalyn    110 49 Tafoya, April   1
## 770:      Sanandres, Brandon    110 49 Tafoya, April   1

Next, we calculate the string distance for each pair of strings in dd:

dd[, pid := .I]
dd[, dist := adist(x, name), keyby = pid]
dd

##                            x pcr_id id          name pcr pid dist
##   1:           Korenek, Tara      1  1 Sycnfvq, Maas   1   1   10
##   2:            Draper, Cody      1  1 Sycnfvq, Maas   1   2   11
##   3:       al-Naqvi, Qamraaa      1  1 Sycnfvq, Maas   1   3   12
##   4:      el-Hallal, Zahraaa      1  1 Sycnfvq, Maas   1   4   14
##   5: Slee Ackerson, Jeremiah      1  1 Sycnfvq, Maas   1   5   18
##  ---                                                             
## 766:       al-Naqvi, Qamraaa    110 49 Tafoya, April   1 766   13
## 767:      el-Hallal, Zahraaa    110 49 Tafoya, April   1 767   14
## 768: Slee Ackerson, Jeremiah    110 49 Tafoya, April   1 768   19
## 769:            Allen, Jalyn    110 49 Tafoya, April   1 769   11
## 770:      Sanandres, Brandon    110 49 Tafoya, April   1 770   15

We can refine the matched list of $770$ pairs to include only those that differ by 3 or fewer transformations, and can even create a score based on the distances, where a score of 100 represents a perfect match and each transformation costs 5 points. This refined list can then be reviewed manually to make a final determination in case there are any false matches.

dd <- dd[dist <= 3,]
dd[, score := 100 - 5*dist]
  
dd[, .(staff_name = x, pcr_name = name, pcr_id, pcr, pid, score)]

##             staff_name           pcr_name pcr_id pcr pid score
##  1:       Draper, Cody       Diaper, nody     57   1 394    90
##  2:       Draper, Cody       Draper, Cody     58   1 401   100
##  3:  al-Naqvi, Qamraaa  al-Naqvi, Qamraaa     60   1 416   100
##  4:  al-Naqvi, Qamraaa  al-Naqvi, Qamraaa     61   1 423   100
##  5: el-Hallal, Zahraaa el-Hallal, Zahraaa     64   1 445   100
##  6:       Allen, Jalyn       Allen, Jalyn     86   1 601   100
##  7:       Allen, Jalyn        llen, Jalyn     87   1 608    95
##  8:       Allen, Jalyn       Allen, Jalyn     88   1 615   100
##  9: Sanandres, Brandon Sanandres, Bzandon    102   1 714    95
## 10: Sanandres, Brandon Sananores, Brandon    103   1 721    95
## 11: Sanandres, Brandon Sanandues, Brandon    104   1 728    95
## 12: Sanandres, Brandon Sanandres, Brandon    105   1 735   100

We did pretty well, identifying 12 of the 14 resident records in the PCR data. The two we missed were the result of flipped names.

In practice, a relatively close distance is not necessarily a good match. For example SMITH, MARY and SMITH, JANE are only separated by three letter substitutions, but they are most likely not the same person. We could minimize this problem if we have additional fields to match on, such as date of birth. This would even allow us to increase the string distance we are willing to accept for a possible match without increasing the amount of manual inspection required.
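One way to pick up flipped names, which I ignored above, is to compute the distance for both orderings and keep the smaller one. This is only a sketch in base R; flip_name and dist_any_order are hypothetical helpers I am introducing for illustration, and they assume names of the form "Last, First".

```r
# Hypothetical helper: swap the comma-separated parts of a name
flip_name <- function(x) {
  parts <- trimws(strsplit(x, ",")[[1]])
  paste(rev(parts), collapse = ", ")
}

# Distance that ignores name order: the smaller of the two edit distances
dist_any_order <- function(x, y) {
  min(utils::adist(x, y), utils::adist(x, flip_name(y)))
}

utils::adist("Draper, Cody", "Cody, Drapeyr")    # large, because the order differs
dist_any_order("Draper, Cody", "Cody, Drapeyr")  # small once the name is flipped
utils::adist("SMITH, MARY", "SMITH, JANE")       # close, yet likely different people
```

The last call is the cautionary example from the paragraph above: a small distance on its own does not establish a match.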

The case of three MAR mechanisms: when is multiple imputation mandatory?

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 30 Mar 2021 00:00:00 +0000

I thought I’d written about this before, but I searched through my posts and I couldn’t find what I was looking for. If I am repeating myself, my apologies. I explored missing data two years ago, using directed acyclic graphs (DAGs) to help understand the various missing data mechanisms (MAR, MCAR, and MNAR). The DAGs provide insight into when it is appropriate to use observed data to get unbiased estimates of population quantities even though some of the observations are missing information.

In that original post, I mentioned I might have more to say at some point in the future. Well, two years later I am again thinking about missing data, this time in the context of an ongoing randomized controlled trial. The research team has been discussing various ways to address potential biases that missing information might be introducing into the analysis. The group has decided that we need to use imputation to fill in the missing data, but I wanted to be clear why this added step is called for. After all, it is quite well known that imputation may not be necessary in light of missing data (see this post, for example.)

I’ve created three scenarios with data missing at random (MAR), where the probability of missingness is a function of observed data. In the first scenario, the treatment effect can surprisingly be estimated simply by comparing the means, no adjustment or imputation needed. In the second case, comparing the means directly is not appropriate, but adjustment for the predictor of missingness is sufficient; again, no imputation needed. And in the third case, neither a simple comparison nor a modeling adjustment does the trick; imputation is mandatory.

A little background for context

The actual RCT is considerably more complicated than I am describing here, but this is the general idea. Individuals are randomized to one of two study arms $A$, where $A_i=1$ if patient $i$ is in the treatment arm, and $A_i = 0$ if the patient is in the control arm. We measure the outcome $Y$ at two time points, so we have $Y_1$ and $Y_2$; our primary interest, however, is $Y_2$. We measure a key covariate $X$ that influences both $Y_1$ and $Y_2$. This is the true underlying DAG:

The challenge is that, for some patients, the second measurement $Y_2$ is missing, and we believe that $Y_1$ is a good predictor of the missingness pattern. But before getting into this (which is Case #3), I’ll start with a simpler scenario.

Case #1

In the first scenario, there is only a single outcome measurement $Y$, and we have measured $X$. The simplified DAG looks like this:

Unfortunately, we’ve only been able to collect the outcome measurement $Y$ for a subset of the sample, so that the observed $Y^*$ includes missing values for some subjects. The missing data mechanism is MAR, because the level of the observed baseline covariate $X$ determines the probability of observing $Y$. The indicator $R_y = 1$ when we do observe $Y$ and $R_y = 0$ when we do not.

I’ll go ahead and simulate data based on this first DAG. In case you’d like to replicate, here are the libraries necessary for the simulations:

library(simstudy)
library(ggplot2)
library(broom)
library(data.table)
library(mice)

The data definitions establish the relationship between $A$, $X$ and $Y$ (the treatment effect of $A$ on $Y$ is 2.5) as well as create a missingness mechanism for $Y$ that is a function of $X$; subjects with higher values of $X$ are more likely to have missing outcome measurements.

def1 <- defData(varname = "x", formula=0.5, dist = "binary")
def2 <- defDataAdd(varname = "y", formula = "5 + 5*x + 2.5*a", variance = 2)
defm <- defMiss(varname = "y", formula = "-3.5 + 2.3*x", logit.link = TRUE)

To generate the observed data with missing data, we first generate a complete data set (based on the data definitions), and then we generate a missing data matrix, which finally gives us the observed data set which includes $\text{NA}$’s for about 13% of the $Y$’s.
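The 13% figure follows directly from the missingness definition. Here is a short check in base R (the variable names are mine), using the logit model $-3.5 + 2.3x$ and $P(X = 1) = 0.5$ from the definitions above:

```r
# Missingness probabilities implied by logit(P(missing)) = -3.5 + 2.3 * x
p_miss_x0 <- plogis(-3.5)        # x = 0
p_miss_x1 <- plogis(-3.5 + 2.3)  # x = 1

# x is binary with P(x = 1) = 0.5, so average over the two levels
p_miss <- 0.5 * p_miss_x0 + 0.5 * p_miss_x1

round(c(x0 = p_miss_x0, x1 = p_miss_x1, overall = p_miss), 3)
```

Roughly 3% of outcomes are missing when $X = 0$ and about 23% when $X = 1$, which averages to about 13%.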

set.seed(17236)

dd <- genData(500, def1)
dd <- trtAssign(dd, grpName = "a")
dd <- addColumns(def2, dd)

ddmiss <- genMiss(dd, defm, id = "id")
ddobs <- genObs(dd, ddmiss, id = "id")

ddobs

##       id x a    y
##   1:   1 0 0  6.1
##   2:   2 1 0  9.2
##   3:   3 1 1 11.6
##   4:   4 0 0  4.5
##   5:   5 1 1   NA
##  ---             
## 496: 496 0 0  5.8
## 497: 497 0 1  7.3
## 498: 498 0 1  6.9
## 499: 499 1 1 11.1
## 500: 500 1 0 10.0

Using the full data set dd (without any missing data), we can get a point estimate of the treatment effect $\delta$ merely by calculating

\[\hat{\delta} = \bar{Y}_{a=1} - \bar{Y}_{a=0}\]

dd[, .(avg = mean(y)), keyby = a][ , avg - shift(avg)][2]

## [1] 2.5

There is no reason to believe that the observed data means are the same as the complete data set means. That is, it is not likely that $\bar{Y^*}_{a=1}$ = $\bar{Y}_{a=1}$ or $\bar{Y^*}_{a=0}$ = $\bar{Y}_{a=0}$. Observations with higher values of $X$ (and thus higher values of $Y$) are more likely to have missing $Y$’s, so the average observed values in both treatment groups should be lower. This seems to be the case here:

dd[, .(avg = mean(y)), keyby = a]

##    a  avg
## 1: 0  7.5
## 2: 1 10.0

ddobs[, .(avg = mean(y, na.rm = TRUE)), keyby = a]

##    a avg
## 1: 0 7.2
## 2: 1 9.7

In the real world, we can only estimate the treatment effect $\delta^*$ with the data that we have: \[\hat{\delta}^* = \bar{Y}_{a=1}^* - \bar{Y}_{a=0}^*\]

It looks like, in this case at least, the biases in the estimates of the two means are in the same direction and of similar size, so that the estimate of the treatment effect based on the difference of means in the observed data is unbiased:

ddobs[!is.na(y), .(avg = mean(y)), keyby = a][ , avg - shift(avg)][2]

## [1] 2.5

If this is the case more generally for data sets generated using this mechanism, we may not need to worry at all about the missing data mechanism; even though we know it is MAR, we might be able to treat it as MCAR, and just use the observed measurements only, without any adjustment or imputation.

Simulating 2500 data sets using the steps outlined above provides insight into the nature of the bias. (I’ve provided generic code for generating repeated data sets in the addendum.) The estimates based on the complete data set are shown on the $x$ axis, and the observed data estimates are on the $y$ axis. The dotted lines show the average of the estimates for the complete and observed data sets, respectively.

For both treatment arms, the average estimate from the complete data sets is centered around the true value (used in the data generation process). As expected (since higher values of Y are likely to be missing), the average estimate for each arm is biased downwards when we do not take into consideration the missingness.

However, the bias is removed when we consider the treatment effect, which is our primary interest. In this (perhaps overly) simplistic scenario, there is no price to pay when ignoring the missing data. Both estimates are centered around 2.5, the true value.

Case #2

The second example differs from the first only in one respect: the size of the intervention effect depends on the baseline covariate $X$ (the line drawn from $X$ to the arrow connecting $A$ and $Y$ represents this effect modification).

In the example, $\delta_0 = 1$ for the sub-population with $X = 0$, and $\delta_1 = 4$ for the sub-population with $X = 1$. If the population were evenly distributed between $X=0$ and $X=1$, then we would observe an overall effect $\delta = 2.5$.

d1 <- defData(varname = "x", formula=0.5, dist = "binary")
d2 <- defDataAdd(varname = "y", formula = "6 + 1*a + 2*x + 3*a*x", variance = 2)
dm <- defMiss(varname = "y", formula = "-3.5 + 2.3*x", logit.link = TRUE)

But this time around, if we go ahead and naïvely estimate $\delta^* = \bar{Y}_{a=1}^* - \bar{Y}_{a=0}^*$, the estimate will be biased.

The reason for this bias is that the mix of $X$ in the observed sample is different from the complete sample (and population); since $X$ influences the effect size this change impacts the overall unadjusted estimate. In the complete data set $P(X=1) = 0.50$, but in an observed data set with missing values $P^*(X=1) = 0.44$ (this can be confirmed using the assumptions from the data generation process, but I’ll let you do that as an exercise if you’d like.) The population average treatment effect is $P(X=0) \times 1 + P(X=1) \times 4 = 2.5$. And in the data set with missing data $P^*(X=0) \times 1 + P^*(X=1) \times 4 = 0.56 \times 1 + 0.44 \times 4 = 2.3$.
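For anyone who would rather not do the exercise by hand, here is one way to confirm those numbers in base R (the variable names are mine). It applies Bayes' rule to the missingness model $-3.5 + 2.3x$ with $P(X=1) = 0.5$:

```r
p_x1 <- 0.5

# Probability that Y is observed at each level of X
p_obs_x0 <- 1 - plogis(-3.5)
p_obs_x1 <- 1 - plogis(-3.5 + 2.3)

# P*(X = 1): share of X = 1 among subjects with an observed outcome
p_star_x1 <- p_x1 * p_obs_x1 / (p_x1 * p_obs_x1 + (1 - p_x1) * p_obs_x0)

# Effect implied by the observed mix of X (delta_0 = 1, delta_1 = 4)
delta_star <- (1 - p_star_x1) * 1 + p_star_x1 * 4

round(c(p_star_x1 = p_star_x1, delta_star = delta_star), 2)
```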

We can still estimate the treatment effect if we adjust for $X$ in a regression model, or just take the difference in means within each level of $X$. These estimates are unbiased:

If we want to recover the population average treatment effect, we can reweight the group-level treatment effects by the distribution of $X$ in the complete sample (since $X$ is fully observed). No imputation is needed.
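Here is a minimal sketch of the adjustment and reweighting for a single large data set. To keep it self-contained I generate the data in base R rather than with simstudy, and I assign treatment with independent coin flips instead of trtAssign; both are simplifications on my part, and the sample size and seed are arbitrary.

```r
set.seed(2021)
n <- 100000

# Same structure as the Case #2 definitions
x <- rbinom(n, 1, 0.5)
a <- rbinom(n, 1, 0.5)
y <- 6 + 1*a + 2*x + 3*a*x + rnorm(n, sd = sqrt(2))

# MAR mechanism: missingness depends only on x
miss <- rbinom(n, 1, plogis(-3.5 + 2.3*x))
y_obs <- ifelse(miss == 1, NA, y)

# Naive comparison of observed means (biased toward the X = 0 effect)
naive <- mean(y_obs[a == 1], na.rm = TRUE) - mean(y_obs[a == 0], na.rm = TRUE)

# Adjusted model; lm() drops rows with missing outcomes by default
fit <- lm(y_obs ~ a * x)
delta_0 <- coef(fit)[["a"]]                       # effect when x = 0
delta_1 <- coef(fit)[["a"]] + coef(fit)[["a:x"]]  # effect when x = 1

# Reweight by the distribution of x in the complete sample
p_x1 <- mean(x)
ate <- (1 - p_x1) * delta_0 + p_x1 * delta_1

round(c(naive = naive, delta_0 = delta_0, delta_1 = delta_1, ate = ate), 2)
```

The reweighted estimate should land close to 2.5, while the naive difference stays closer to 2.3.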

Case #3

Now we are back to the original motivating scenario. The missing data mechanism is depicted in the next DAG. Those with higher scores in the first period are more likely to have missing values in the second time period, perhaps because they have improved sufficiently and no longer feel like participating in the study.

The DAG is implemented with these definitions:

def1 <- defData(varname = "x", formula=0.5, dist = "binary")

def2 <- defDataAdd(varname = "y1", formula = "5 + a*2.5 + 5*x", variance = 2)
def2 <- defDataAdd(def2, "y2", formula = "1 + y1 + 5*x", variance = 2)

defm <- defMiss(varname = "y2", formula = "-4.5 + 0.3*y1", logit.link = TRUE)

In this case, simply comparing the means in the data sets with missing data provides a biased estimate - we can see this on the left; the argument is similar to the one I made in the previous scenario. If we opt to control for $Y_1$, we introduce all sorts of biases, as $Y_1$ is a mediator between $A$ and $Y_2$, as well as a collider. (I’ve written about the dangers of controlling for post-intervention variables here and the need to be careful with colliders here.)

Since neither a simple comparison of means nor an adjusted model will suffice here, our only option is to use multiple imputation, which in R can be implemented with the package mice. Below, I am showing code that generates 20 imputed data sets, fits models for each of them, and pools the results to provide a single estimate and measure of uncertainty.

library(mice)

imp <- mice(ddobs[,-"id"], m=20, maxit=5, print=FALSE)
fit <- with(imp, lm(y2 ~ a))
results <- summary(pool(fit))

Multiple imputation has been applied to the same 2500 data sets with missing data that are represented in the biased estimate plots. The plot below shows a pretty strong correlation with the estimates from the full data model, and both are centered at the true population effect of 2.5.

The takeaway from all this is that while multiple imputation is not always necessary, if you think there are potentially unmeasured confounders or post-intervention measures that are conceivably in the mix, a multiple imputation approach might be wiser than trying to adjust your way out of the problem.

I plan on implementing a Bayesian model that treats the missing data as parameters. If I can get that working, I will share it here, of course.

Addendum

In case you’d like to play around with other scenarios, I’m including the code that will allow you to repeatedly sample data sets. Just provide your own data definitions.

s_generate <- function(n) {
  
  dd <- genData(n, d1)
  dd <- trtAssign(dd, grpName = "a")
  dd <- addColumns(d2, dd)
  
  dmiss <- genMiss(dd, dm, id = "id")
  dobs <- genObs(dd, dmiss, id = "id")
  
  return(list(dd, dobs))
  
}

s_replicate <- function(n) {
  
  dsets <- s_generate(n)
  
  diff.complete <- dsets[[1]][, .(avg = mean(y2)), keyby = a][ , avg - shift(avg)][2]    
  diff.obs<- dsets[[2]][!is.na(y2), .(avg = mean(y2)), keyby = a][ , avg - shift(avg)][2] 
  
  est.complete <- coef(lm(y2 ~ a, data = dsets[[1]]))["a"]
  est.obs <- coef(lm(y2 ~ y1 + a, data = dsets[[2]]))["a"]
  
  imp <- mice(dsets[[2]][,-"id"], m=20, maxit=5, print=FALSE)
  fit <- with(imp, lm(y2 ~ a))
  pooled.ests <- summary(pool(fit))
  est.impute <- pooled.ests$estimate[2]
  
  return(data.table(diff.complete, est.complete, diff.obs, est.obs, est.impute))
}

# mclapply comes from the parallel package
results <- rbindlist(
  parallel::mclapply(1:2500, function(x) s_replicate(300), mc.cores = 4)
)

Framework for power analysis using simulation

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 16 Mar 2021 00:00:00 +0000

The simstudy package started as a collection of functions I developed as I found myself repeating many of the same types of simulations for different projects. It was a way of organizing my work that I decided to share with others in case they wanted a routine way to generate data as well. simstudy has expanded a bit from that, but replicability is still a key motivation.

What I have here is another attempt to document and organize a process that I find myself doing quite often - repeated data generation and model fitting. Whether I am conducting a power analysis using simulation or exploring operating characteristics of different models, I take a pretty similar approach. I refer to this structure when I am starting a new project, so I thought it would be nice to have it easily accessible online - and that way others might be able to refer to it as well.

The framework

I will provide a simple application below, but first I’ll show the general structure. The basic idea is that we want to generate data under a variety of assumptions - for example, a power analysis will assume different sample sizes, effects, and/or levels of variation - and for each set of assumptions, we want to generate a large number of replications to mimic repeated sampling from a population. The key elements of the process include (1) defining the data, (2) generating a data set, (3) fitting a model to the data, and (4) providing summary statistics.

If you have familiarity with simstudy, I’d say the code is pretty self-explanatory. In the function s_generate, there is a call to base R function list2env, which makes all elements of a list available as variables in the function’s environment. The replication process is managed by the mclapply function from the parallel package. (Alternative approaches include using function lapply in base R or using a for loop.)

s_define <- function() {
  
  #--- add data definition code ---#
  
  return(list_of_defs) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- add data generation code ---#
  
  return(generated_data) #  generated_data is a data.table
}

s_model <- function(generated_data) {
  
  #--- add model code ---#
  
  return(model_results) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()

  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim, 
      FUN = function(x) s_single_rep(list_of_defs, argsvec), 
      mc.cores = 4)
  )
  
  #--- add summary statistics code ---#
  
  return(summary_stats) # summary_stats is a data.table
}

Specifying scenarios

The possible values of each data generating parameter are specified as a vector. The function scenario_list creates all possible combinations of the values of the various parameters, so that there will be $n_1 \times n_2 \times n_3 \times ...$ scenarios, where $n_i$ is the number of possible values for parameter $i$. Examples of parameters include sample size, effect size, and variance: really any value that can be used in the data generation process.

The process of data generation and model fitting is executed for each of the $n_1 \times n_2 \times n_3 \times ...$ scenarios. This can be done locally using function lapply or in a high performance computing environment with something like Slurm_lapply in the slurmR package. (I won’t provide an example of that here - let me know if you’d like to see that.)

#---- specify varying power-related parameters ---#

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

param_1 <- c(...)
param_2 <- c(...)
param_3 <- c(...)
.
.
.

scenarios <- scenario_list(param_1 = param_1, param_2 = param_2, param_3 = param_3, ...)

#--- run locally ---#

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 1000)))
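To make the structure of the scenarios concrete, here is what scenario_list returns for a small set of illustrative values (three effect sizes, two variances, and two cluster sizes, chosen only for this sketch). Each element is a named vector holding one combination, which is what gets passed to s_replicate as argsvec.

```r
scenario_list <- function(...) {
  argmat <- expand.grid(...)          # one row per combination of values
  return(asplit(argmat, MARGIN = 1))  # split the rows into a list
}

scenarios <- scenario_list(
  delta = c(0.50, 0.75, 1.00),
  svar = c(0.25, 0.50),
  npat = c(8, 16)
)

length(scenarios)  # 3 x 2 x 2 combinations
scenarios[[1]]     # first combination; the first argument varies fastest
```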

Example: power analysis of a CRT

To carry out a power analysis of a cluster randomized trial, I’ll fill in the skeletal framework. In this case I am interested in understanding how estimates of power vary based on changes in effect size, between cluster/site variation, and the number of patients per site. The data definitions use double dot notation to allow the definitions to change dynamically as we switch from one scenario to the next. We estimate a mixed effect model for each data set and keep track of the proportion of p-value estimates less than 0.05 for each scenario.

s_define <- function() {
  
  #--- data definition code ---#
  
  def1 <- defData(varname = "site_eff", 
    formula = 0, variance = "..svar", dist = "normal", id = "site")
  def1 <- defData(def1, "npat", formula = "..npat", dist = "poisson")
  
  def2 <- defDataAdd(varname = "Y", formula = "5 + site_eff + ..delta * rx", 
    variance = 3, dist = "normal")
  
  return(list(def1 = def1, def2 = def2)) 
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- data generation code ---#
  
  ds <- genData(40, def1)
  ds <- trtAssign(ds, grpName = "rx")
  dd <- genCluster(ds, "site", "npat", "id")
  dd <- addColumns(def2, dd)
  
  return(dd)
}

s_model <- function(generated_data) {
  
  #--- model code ---#
  
  require(lme4)
  require(lmerTest)
  
  lmefit <- lmer(Y ~ rx + (1|site), data = generated_data)
  est <- summary(lmefit)$coef[2, "Estimate"]
  pval <- summary(lmefit)$coef[2, "Pr(>|t|)"]
  
  return(data.table(est, pval)) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()

  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim, 
      FUN = function(x) s_single_rep(list_of_defs, argsvec), 
      mc.cores = 4)
  )
  
  #--- summary statistics ---#
  
  power <- model_results[, mean(pval <= 0.05)]
  summary_stats <- data.table(t(argsvec), power)
  
  return(summary_stats) # summary_stats is a data.table
}

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

delta <- c(0.50, 0.75, 1.00)
svar <- c(0.25, 0.50)
npat <- c(8, 16)

scenarios <- scenario_list(delta = delta, svar = svar, npat = npat)

#--- run locally ---#

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 250)))

The overall results (in this case, the power estimate) can be reported for each scenario.

summary_stats

##     delta svar npat power
##  1:  0.50 0.25    8 0.480
##  2:  0.75 0.25    8 0.844
##  3:  1.00 0.25    8 0.960
##  4:  0.50 0.50    8 0.368
##  5:  0.75 0.50    8 0.684
##  6:  1.00 0.50    8 0.904
##  7:  0.50 0.25   16 0.660
##  8:  0.75 0.25   16 0.940
##  9:  1.00 0.25   16 1.000
## 10:  0.50 0.50   16 0.464
## 11:  0.75 0.50   16 0.792
## 12:  1.00 0.50   16 0.956
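With only 250 replications per scenario, these power estimates carry some simulation error of their own. A rough way to gauge it, assuming the replications are independent so that the usual binomial standard error applies, is sketched here for the first three rows of the table:

```r
# Power estimates for the first three scenarios reported above
power_hat <- c(0.480, 0.844, 0.960)
nsim <- 250

# Monte Carlo standard error of a proportion estimated from nsim replications
mc_se <- sqrt(power_hat * (1 - power_hat) / nsim)

round(mc_se, 3)
```

The standard error is largest for estimates near 0.5 (roughly 0.03 here), so small differences between neighboring scenarios should not be over-interpreted without increasing nsim.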

We can also plot the results easily to get a clearer picture. Higher between-site variation clearly reduces power, as do smaller effect sizes and smaller sample sizes. None of this is surprising, but it is always nice to see things working out as expected:

Randomization tests make fewer assumptions and seem pretty intuitive

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 02 Mar 2021 00:00:00 +0000

I’m preparing a lecture on simulation for a statistical modeling class, and I plan on describing a couple of cases where simulation is intrinsic to the analytic method rather than just a tool for exploration and planning. MCMC methods used for Bayesian estimation, bootstrapping, and randomization tests all come to mind.

Randomization tests are particularly interesting as an approach to conducting hypothesis tests, because they allow us to avoid making unrealistic assumptions. I’ve written about this before under the rubric of a permutation test. The example I use here is a little different; truth be told, the real reason I’m sharing is that I came up with a nice little animation to illustrate a simple randomization process. So, even if I decide not to include it in the lecture, at least you’ve seen it.

The goal of a randomization test

In the context of an RCT with treatment and control arms, where we want to compare average responses (measured by some outcome $Y$), the standard hypothesis test is framed around a null hypothesis $H_0: \mu_\text{t} = \mu_\text{c}$. In this framework, $\mu_\text{t}$ and $\mu_\text{c}$ are the average responses in the population under treatment and control, respectively. For this to be valid, we need to assume that the study sample is representative of the population of interest, that is, that it has been randomly selected; for most RCTs, this is fairly untenable. Participants of studies are typically not randomly drawn from the population, but are more likely to have shown up in a particular setting, been identified for recruitment, and ultimately decided for themselves about participating.

The randomization test makes no assumption about whether the sample is representative. Rather, it asks a question that is limited to the sample at hand, regardless of how it was collected or created. The null hypothesis in the randomization test is that the average response in the sample is unrelated to treatment assignment. If we reject the null hypothesis and conclude treatment assignment mattered in this sample, we can decide to extend this conclusion to the population based on our understanding of the intervention and how the sample relates to this broader population. In the standard hypothesis testing framework, the leap from the sample to the population is baked in; in the context of a randomization test, the generalization is not so obvious.

It is also important to underscore that while the standard hypothesis test makes assumptions about normality (or at least about the validity of the Central Limit Theorem), the randomization test makes no assumptions about the underlying distribution of the outcome data.

Simulated data

To make things a little more interesting, and to test the importance of the normality assumption, I’m using a data generation process that mixes two populations - Group 1, where the outcome $Y_1 \sim N(\mu=0, \sigma^2=1)$ and Group 2, with larger mean and variance: $Y_2 \sim N(5,4)$. The treatment effect also differs across the groups. The population (or actually the sample) will be comprised of 70% from Group 1 and 30% from Group 2.

For the purposes of creating the animation, I am generating 1000 observations in total, randomizing 500 to each arm:

library(simstudy)

d1 <- defDataAdd(varname = "Y_1", formula = "0 + 2 * rx", 
                 variance = 1, dist = "normal")
d1 <- defDataAdd(d1, varname = "Y_2", formula = "5 + 1 * rx", 
                 variance = 4, dist = "normal")
d1 <- defDataAdd(d1, varname = "Y", 
                 formula = "Y_1 | 0.7 + Y_2 | 0.3", dist = "mixture")

set.seed(11778)
dd <- genData(1000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)

The histogram of these data makes it quite clear that the data are not normally distributed:

Randomization

In the simple case of a two-arm trial, the randomization test is quite simple: we repeatedly assign randomly generated alternate treatment arm labels to each of the observations, and calculate the test statistic following each iteration. In this case the test statistic is the difference in means $\Delta_\text{obs} = \bar{Y}_t - \bar{Y}_c$.

In a more complex situation, where the data have an underlying structure, such as clustering or block randomization, we have to make sure that the re-randomization does not violate that structure. For example, in the case of a cluster randomized trial where all individuals within the cluster are in the same intervention arm, the null hypothesis is that cluster-level treatment has no impact, so we would re-randomize the cluster as a whole, not the individuals.
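To make the cluster case a bit more concrete, here is a minimal sketch of what a structure-preserving re-randomization might look like. This is not part of the analysis in this post: the data set `dc`, the column names `cluster`, `rx`, and `Y`, and the helper `randomize_cluster()` are all hypothetical, made up for illustration. The arm labels are permuted across clusters and then mapped back to the individuals:

```r
library(data.table)

set.seed(123)

# Hypothetical data: 10 clusters of 20 individuals, 5 clusters per arm
dc <- data.table(cluster = rep(1:10, each = 20))
dc[, rx := as.integer(cluster <= 5)]
dc[, Y := rnorm(.N, mean = 0.5 * rx)]

# Permute arm labels at the cluster level, so that every individual in a
# cluster still shares the same (re-randomized) arm
randomize_cluster <- function(dx) {
  dclust <- unique(dx[, .(cluster, rx)])
  dclust[, rx_s := sample(rx, replace = FALSE)]
  dn <- merge(dx[, .(cluster, Y)], dclust[, .(cluster, rx_s)], by = "cluster")
  dn[rx_s == 1, mean(Y)] - dn[rx_s == 0, mean(Y)]
}

Delta_star <- randomize_cluster(dc)
```

Permuting the individual-level labels instead would break up the clusters and generate a reference distribution that is too narrow for the design.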

Here is the animation that depicts a single iteration of the re-randomization process, starting with the original data, permuting the data, and calculating $\Delta_i^*$, the test statistic for iteration $i$ of the procedure. In the data just generated $\Delta_\text{obs} = 1.9$ and the re-randomized $\Delta^* = 0.1$. (The code for the animation is in the addendum.)

dd[, rx_s := sample(rx, replace = FALSE)]

Estimating a p-value

The animation is kind of a cool way to depict a single iteration, but to estimate a distribution for $\Delta^*$ and ultimately the p-value, we need to do this repeatedly. Using 1000 observations, the p-value will be vanishingly small, so I’m creating a much smaller data set of 60 observations with an observed effect size of 1.8.

dd <- genData(60)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)

Delta_obs <- dd[rx == 1, mean(Y)] - dd[rx == 0, mean(Y)]
Delta_obs

## [1] 1.809

The iteration process consists of repeatedly calling a function that randomly assigns labels and returns the group differences based on these new labels. It is generally recommended to run between 500 and 1500 iterations (including the observed data set, which can be viewed as just another iteration under the null hypothesis); here I am using 1500.

randomize <- function(dx) {
  
  rx_s <- sample(dx$rx, replace = FALSE)
  dn <- data.table(Y = dx$Y, rx = rx_s)
  Delta_star <- dn[rx == 1, mean(Y)] - dn[rx == 0, mean(Y)]
  Delta_star

}

Delta_stars <- sapply(1:1499, function(x) randomize(dd))

The distribution of the $\Delta^*$’s appears to be normally distributed even though the underlying data are not; the red line indicates the observed value, $\Delta_\text{obs}$:

The p-value is estimated by comparing $\Delta_\text{obs}$ with a combined data set that includes the $\Delta^*$’s and $\Delta_\text{obs}$. Using an $\alpha = 0.05$, we would reject the null hypothesis and conclude that within this sample, treatment had an effect.

1 - mean(abs(Delta_obs) >= abs(c(Delta_obs, Delta_stars)))

## [1] 0.004667

Operating characteristics of the randomization test

I was very interested to see what the Type I error rate would be for this example, so I repeatedly generated data sets under the assumption of no treatment effect. For each data set, I estimated both a traditional as well as a randomization test p-value. I used a large number of iterations - 48,000 to be exact - to make sure my Type I error estimate converged as close as possible to the truth.

This would have taken days, probably close to a week, on my laptop; to get around this I used a high performance computer which I have described in the context of Bayesian modeling. Rather than days, it took about 2 hours. If you’d like the code for this, I’m happy to share. Using datasets with 20 observations, the standard Type I error rate was 4.5% and the error rate using the randomization test was exactly 5.0%. With 40 observations, the error rates were 4.8% and 4.9%, respectively. So it appears that, in these scenarios at least, the randomization test does a slightly better job of realizing the targeted 5% Type I error rate.
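For anyone who wants to try a scaled-down version on a laptop, here is a rough sketch of how such a simulation might be set up. To be clear, this is not the code that produced the numbers above. I am assuming that the "traditional" test is a two-sample t-test (the Welch version that `t.test` uses by default), the helper names `gen_null()`, `rand_pvalue()`, and `one_rep()` are made up, the mixture is generated in base R rather than with simstudy, and the number of replications is far too small to pin down the error rates precisely:

```r
set.seed(2021)

# Mixture outcome with no treatment effect in either group (an assumed
# stand-in for the data generation process used in this post)
gen_null <- function(n) {
  grp2 <- rbinom(n, 1, 0.3)
  Y <- ifelse(grp2 == 1, rnorm(n, 5, 2), rnorm(n, 0, 1))
  rx <- sample(rep(0:1, length.out = n))
  data.frame(Y = Y, rx = rx)
}

# Randomization test p-value; the observed data count as one iteration
rand_pvalue <- function(dx, n_iter = 500) {
  obs <- mean(dx$Y[dx$rx == 1]) - mean(dx$Y[dx$rx == 0])
  stars <- replicate(n_iter - 1, {
    rx_s <- sample(dx$rx)
    mean(dx$Y[rx_s == 1]) - mean(dx$Y[rx_s == 0])
  })
  mean(abs(c(obs, stars)) >= abs(obs))
}

# Both p-values for a single simulated data set of size n
one_rep <- function(n) {
  dx <- gen_null(n)
  c(standard = t.test(Y ~ rx, data = dx)$p.value,
    randomization = rand_pvalue(dx))
}

pvals <- replicate(200, one_rep(20))

# Estimated Type I error rates: proportion of p-values below 0.05
type1 <- rowMeans(pvals < 0.05)
type1
```

Note that the p-value in this sketch counts the observed statistic itself among the values at least as extreme as $\Delta_\text{obs}$, a common convention that differs slightly from the calculation shown earlier.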

Addendum

The animation is created using the gganimate package. This is completely new to me, so I am still exploring; if you want to learn more, I’d recommend checking out the website. The key element is to define a sequence of plots that represent states; gganimate magically creates the necessary transitions, and you can control observation times and smoothness of the transitions. The output is a gif file.

library(ggplot2)
library(gganimate)

dif_in_means_orig <- round(dd[rx == 1, mean(Y)] - dd[rx == 0, mean(Y)], 1)
dif_in_means_perm <- round(dd[rx_s == 1, mean(Y)] - dd[rx_s == 0, mean(Y)], 1)

dd1 <- dd[, .(iter = 1, id=id, rx = rx, rcolor = rx, Y=Y, perm = FALSE)]
dd2 <- dd[, .(iter = 2, id=id, rx = 0.5, rcolor = 3, Y=Y, perm = FALSE)]
dd3 <- dd[, .(iter = 3, id=id, rx = 0.5, rcolor = rx_s, Y=Y, perm = TRUE)]
dd4 <- dd[, .(iter = 4, id=id, rx = rx_s, rcolor = rx_s, Y=Y, perm = TRUE)]

ddx <- rbind(dd1, dd2, dd3, dd4)
ddx[, iter := factor(iter, 
  labels = c(
    paste0("Original data with effect size = ", dif_in_means_orig, " ..."), 
    "permutation ...", 
    "permutation ....", 
    paste0("... after permutation, the mean difference = ", dif_in_means_perm)))]

a <- ggplot(data = ddx, aes(x = rx, y = Y, group = id)) +
  geom_point(position = position_jitter(seed = 42), 
             aes(color = factor(rcolor), shape = perm)) +
  geom_vline(xintercept = 0.5, color = "white") +
  scale_color_manual(values = c("#bbb66c", "#6c71bb", "grey80")) +
  scale_shape_manual(values = c(19, 4)) +
  scale_x_continuous(limits = c(-.5, 1.5), breaks = c(0, 1), 
                     labels = c("control", "treatment")) +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 12),
        axis.title.y = element_text(size = 14)) +
  transition_states(iter, state_length = 2, transition_length = 1) +
  labs(title = "{closest_state}", y="Outcome")

animate(a, duration = 15, fps = 10, height = 450, width = 350)

Visualizing the treatment effect with an ordinal outcome

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 16 Feb 2021 00:00:00 +0000

If it’s true that many readers of a journal article focus on the abstract, figures and tables while skimming the rest, it is particularly important to tell your story with a well-conceived graphic or two. Along with a group of collaborators, I am trying to figure out the best way to represent an ordered categorical outcome from an RCT. In this case, there are a lot of categories, so the images can get confusing. I’m sharing a few of the possibilities that I’ve tried so far, including the code.

The motivation for this work is a data set we don’t have yet. The ongoing CONTAIN trial is a large multi-site trial evaluating the effectiveness of using COVID-19 convalescent plasma (CP) to treat newly hospitalized patients with COVID-19. The primary outcome is a World Health Organization (WHO) score that ranges from 0 to 10, where 0 is healthy and virus-free and 10 is death:

The primary goal of the study is to show whether patients receiving CP are more likely to have better outcomes (i.e. score lower on the WHO 11-point scale) 14 days after randomization compared to those who receive a saline solution placebo.

The analysis will use a cumulative proportional odds model, which I have written about previously (see here, for example). We plan on doing a Bayesian version of the model, but I won’t get into that either (I wrote about the Bayes model in the context of a related project here). Rather, I will show four options for presenting the data in a way that highlights the treatment effect (or lack thereof). Often, when I’m blogging, I’ll skip over the ggplot code, but that is the point here so you’ll see everything.

The data

Since I don’t have access to the actual data, simulated data will have to suffice. The data generation process I’m using is quite simple:

generate 1000 individuals
randomize to treatment and control using a 1:1 ratio
specify baseline probabilities for the distribution of outcomes in the control arm
outcome probabilities in the treatment arm are based on an assumed log cumulative odds ratio of -0.6 (which translates to a cumulative odds ratio of 0.55).

The odds we are talking about are

\[ Odds = \frac{P(WHO \ge y)}{P(WHO < y)}, \ \ y \in \{1,\dots,10\}, \] so it should be clear that lower odds are preferred.
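As a quick worked example of what the proportional odds assumption implies, the sketch below applies the cumulative odds ratio to the control-arm probabilities by hand. This is only my own arithmetic check, not how simstudy generates the data internally, and the variable names are made up for the illustration:

```r
# Control-arm probabilities for WHO scores 0 through 10 (from this post)
baseprobs <- c(0.06, 0.06, 0.10, 0.10, 0.10, 0.13, 0.13, 0.10, 0.10, 0.06, 0.06)

OR <- exp(-0.6)     # cumulative odds ratio, about 0.55

# P(WHO >= y) in the control arm at each of the 10 cut points of the scale
p_ge_control <- rev(cumsum(rev(baseprobs)))[-1]

# Apply the odds ratio on the odds scale and convert back to probabilities
odds_control <- p_ge_control / (1 - p_ge_control)
odds_treat <- OR * odds_control
p_ge_treat <- odds_treat / (1 + odds_treat)

# Implied category probabilities in the treatment arm
probs_treat <- -diff(c(1, p_ge_treat, 0))

round(probs_treat, 3)
```

Because the odds of a higher score are reduced at every cut point, probability mass shifts toward the lower (better) end of the scale in the treatment arm.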

Here is the data generation code:

library(simstudy)
library(data.table)
library(ggplot2)

def <- defDataAdd(varname = "z", formula = "rx * -0.6", dist = "nonrandom")

set.seed(39821)

dd <- genData(1000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(def, dd)

dd <- genOrdCat(
  dd, 
  adjVar = "z", 
  baseprobs = c(0.06, 0.06, 0.10, 0.10, 0.10, 0.13, 0.13, 0.10, 0.10, 0.06, 0.06),
  catVar = "WHO"
)

Summarizing the data

In the plots that follow, I’ll be using summary data: proportions and cumulative proportions of patients that fall into each category:

dsum <- dd[, .(N = sum(.N)), keyby = .(rx, WHO)]
dsum[, rx := factor(rx, labels = c("control", "treatment"))]
dsum

##            rx WHO  N
##  1:   control   1 27
##  2:   control   2 28
##  3:   control   3 48
##  4:   control   4 54
##  5:   control   5 52
##  6:   control   6 62
##  7:   control   7 63
##  8:   control   8 63
##  9:   control   9 40
## 10:   control  10 42
## 11:   control  11 21
## 12: treatment   1 50
## 13: treatment   2 46
## 14: treatment   3 78
## 15: treatment   4 63
## 16: treatment   5 64
## 17: treatment   6 62
## 18: treatment   7 53
## 19: treatment   8 27
## 20: treatment   9 33
## 21: treatment  10 13
## 22: treatment  11 11
##            rx WHO  N

I’ll also need the total number of patients in each arm so that I can provide informative labels:

dnames <- dsum[, sum(N), keyby = rx]
dnames[, legend_label := paste0(rx, " (n = ", V1, ")")]
dnames[, axis_label := paste0(rx, "\n(n = ", V1, ")")]
dnames

##           rx  V1        legend_label           axis_label
## 1:   control 500   control (n = 500)   control\n(n = 500)
## 2: treatment 500 treatment (n = 500) treatment\n(n = 500)

Proportions by arm

The first plot is quite straightforward, showing the proportion of each arm that falls in each category. This plot allows us to see right away that the treatment arm has more patients in the lower categories. While not particularly elegant, the plot makes it quite easy to gauge the relative proportions:

dsum[, prop := N/sum(N), keyby = rx]
dsum[, legend_label := factor(rx, labels = dnames[, legend_label])]

ggplot(data = dsum, aes(x = WHO, y = prop)) +
  geom_line(aes(group = legend_label, color = legend_label), size = 1) +
  geom_point(aes(color = legend_label), size = 2) +
  ylim(0, 0.2) +
  theme(panel.grid = element_blank(),
        legend.title = element_blank(),
        legend.position = c(.8, .88),
        legend.background = element_rect(fill = "grey92"),
        legend.key = element_rect(fill = "grey92")) +
  scale_color_manual(values = c("#798E87", "#C27D38"),
                     guide = guide_legend(reverse = TRUE)) +
  scale_x_discrete(labels = c("virus-free", 1:9, "died")) +
  ylab("proportion") +
  xlab("WHO score")

Cumulative proportion by arm

A slightly nicer version of the proportional line plot might be the same idea but with cumulative probabilities or proportions. We again can easily see that the treatment is having the desired effect, as the cumulative proportion is higher at the low end of the scale. I actually like this one, but it may be harder for folks to interpret, particularly if they haven’t worked with ordinal data extensively. Perhaps the first plot and this one in tandem would work well to give a complete picture.

dsum[, cumprop := cumsum(prop), keyby = rx]

ggplot(data = dsum, aes(x = WHO, y = cumprop)) +
  geom_line(aes(group = legend_label, color = legend_label), size = 1) +
  geom_point(aes(color = legend_label), size = 2) +
  scale_x_discrete(labels = c("virus-free", 1:9, "died")) +
  ylim(0, 1) +
  theme(panel.grid = element_blank(),
        legend.title = element_blank(),
        legend.position = c(.75, .2),
        legend.background = element_rect(fill = "grey92"),
        legend.key = element_rect(fill = "grey92")) +
  scale_color_manual(values = c("#798E87", "#C27D38"),
                     guide = guide_legend(reverse = TRUE)) +
  ylab("cumulative proportion") +
  xlab("WHO score")

Distribution of outcome using stacked bars

The next one was inspired by a recent paper describing the results of an RCT assessing the effect of hydroxychloroquine on COVID-19 patients. The plot is packed with information, but is still simple enough to understand. With a large number of categories, the stacked bars might not make it completely obvious that treatment appears effective. By using more colors, we might be able to overcome this. But I’ve reserved that for the final plot.

cc <- scales::seq_gradient_pal("#267efa", "white")(seq(0,1,length.out=11))
dsum[, linept := (rx == "treatment") * (2 - 0.3) + (rx == "control") * (1 + 0.3)]
dsum[, axis_label := factor(rx, labels = dnames[, axis_label])]

ggplot(data = dsum, aes( fill = WHO, y = N, x = axis_label)) +
  geom_bar(position=position_fill(reverse=TRUE), 
           stat = "identity", width = 0.6) +
  geom_line(aes(x = linept, y = N), 
            position = position_fill(reverse = TRUE),
            lty = 3, size = .2, color = "black") +
  geom_text(aes(label = scales::percent(prop, accuracy = 1L), y = prop),
            position = position_stack(vjust = 0.5, reverse= TRUE),
            size = 2.5, color = "grey20") +
  theme(axis.title.x = element_blank(),
        panel.grid = element_blank(),
        legend.title = element_text(size = 8, face = "bold")) +
  scale_fill_manual(
    values = cc, 
    name = "WHO score", 
    labels = c("virus-free", 1:9, "died"),
    guide = guide_legend(reverse = TRUE)) +
  ylab("proportion")

Distribution of outcome using divergent bars

In this last version, the stacked bars are rotated and shifted so that they diverge from the middle of the WHO scale. This emphasizes that the treatment arm does appear to have a higher proportion of patients who are doing relatively well. This divergent plot is a bit trickier to pull off manually using ggplot, and requires some manipulation of the factors to make it work out the way we want.

cc_low <- scales::seq_gradient_pal("#faa226", "white")(seq(0.2, 0.8, length.out=4))
cc_high <- scales::seq_gradient_pal("white", "#267efa")(seq(0.2, 0.8, length.out=7))
cc <- c(cc_low, cc_high)

dsum[, w_alt := factor(WHO, levels = 1:11, labels=letters[1:11])]
dsum_low <- dsum[as.numeric(w_alt) %in% c(1:4)]
dsum_high <- dsum[as.numeric(w_alt) %in% c(5:11)]
dsum_high[, w_alt := factor(w_alt, levels = letters[11:5])]

ggplot() +
  geom_bar(
    data = dsum_low,
    aes(x = axis_label, y = -prop, fill = w_alt),
    width = .6, stat="identity") +
  geom_bar(
    data = dsum_high,
    aes(x = axis_label, y = prop, fill = w_alt),
    width = .6, stat="identity") +
  scale_fill_manual(
    values = cc, 
    name = "WHO score", 
    labels = c("virus-free", 1:9, "died")) +
  ylab("proportion") +
  theme(panel.grid = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_text(size = 8, face = "bold")) + 
  scale_y_continuous(limits = c(-.50,.75), 
    breaks = c(-.50, -.25, 0, 0.25, .5, .75), 
    labels = c("50%", "25%","0%", "25%", "50%","75%")) +
  geom_hline(yintercept = 0, color = "grey96") +
  coord_flip()

In the end, it is difficult to say which approach will be preferable - I think it will depend on the actual data and the differences between the groups. Of course, not all journals permit color, so that would certainly influence the design. But this is a good start: it gives us plenty to play around with once the data do finally arrive.

How useful is it to show uncertainty in a plot comparing proportions?

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 02 Feb 2021 00:00:00 +0000

I recently created a simple plot for a paper describing a pilot study of an intervention targeting depression. This small study was largely conducted to assess the feasibility and acceptability of implementing an existing intervention in a new population. The primary outcome measure that was collected was the proportion of patients in each study arm who remained depressed following the intervention. The plot of the study results that we included in the paper looked something like this:

The motivation for showing the data in this form was simply to provide a general sense of the outcome patterns we observed, even though I would argue (and I have argued) that one shouldn’t try to use a small pilot to draw strong conclusions about a treatment effect, or maybe any conclusions at all. The data are simply too noisy. However, it does seem useful to show data that suggest an intervention might move things in the right direction (or at least not in the wrong direction). I would have been fine showing this plot along with a description of the feasibility outcomes and plans for a future bigger trial that is designed to actually measure the treatment effect and allow us to draw stronger conclusions.

Of course, some journals have different priorities and might want to make stronger statements about the research they publish. Perhaps with this in mind, a reviewer suggested that we include 95% confidence intervals around the point estimates to give a more complete picture. In that case the figure would have looked something like this:

When I shared this plot with my collaborators, it generated a bit of confusion. They had done a test comparing two proportions at period 2 and found a “significant” difference between the two arms. The p-value was 0.04, and the 95% confidence interval for the difference in proportions was [0.03, 0.57], which excludes 0:

prop.test(x = c(21, 12), n=c(30, 30), correct = TRUE)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(21, 12) out of c(30, 30)
## X-squared = 4, df = 1, p-value = 0.04
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.027 0.573
## sample estimates:
## prop 1 prop 2 
##    0.7    0.4

Does it make sense that the 95% CIs for the individual proportions overlap while at the same time there does appear to be a real difference between the two groups (at least in this crude way, without making adjustments for multiple testing or considering the possibility that there might be differences in the two groups)? Well - there’s actually no real reason to think that this is a paradox. The two different types of confidence intervals are measuring very different quantities - one set is looking at individual proportions, and the other is looking at the difference in proportions.
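A quick way to see both facts side by side is to compute the two individual intervals and the interval for the difference from the same counts (21 and 12 events out of 30). This is just a small check using `prop.test`; the object names are mine:

```r
# Individual 95% CIs for the two observed proportions (21/30 and 12/30)
ci_1 <- prop.test(x = 21, n = 30, correct = TRUE)$conf.int
ci_2 <- prop.test(x = 12, n = 30, correct = TRUE)$conf.int

# The intervals overlap if the upper limit for the smaller proportion
# reaches the lower limit for the larger proportion
overlap <- ci_2[2] >= ci_1[1]

# 95% CI and p-value for the difference in proportions
test_dif <- prop.test(x = c(21, 12), n = c(30, 30), correct = TRUE)

overlap
test_dif$conf.int
test_dif$p.value
```

The individual intervals each reflect the uncertainty of a single proportion, while the test of the difference pools the information from both arms, so the two comparisons need not agree.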

I thought a simple way to show this non-paradox would be to generate all the possible confidence intervals and p-values for a case where we have 30 patients per arm, and create a plot to show how overlapping individual confidence intervals for the proportions relate to p-values based on a comparison of those proportions.

I’ve created a data set that is a grid of events, where I am only interested in cases where the number of “events” (e.g. individuals with depression) in the intervention arm is less than the number in the control arm.

library(data.table)
library(ggplot2)

N <- 30
e0 <- c(5:(N-5))
e1 <- c(5:(N-5))

dd <- data.table(expand.grid(e0 = e0, e1 = e1))
dd <- dd[e1 < e0]
dd[, id := 1:.N]

dd

##      e0 e1  id
##   1:  6  5   1
##   2:  7  5   2
##   3:  8  5   3
##   4:  9  5   4
##   5: 10  5   5
##  ---          
## 206: 24 22 206
## 207: 25 22 207
## 208: 24 23 208
## 209: 25 23 209
## 210: 25 24 210

For each pair of possible outcomes, I am estimating the confidence interval for each proportion. If the upper limit of the intervention arm’s 95% CI is greater than or equal to the lower limit of the control arm’s 95% CI, the two intervals overlap. (Look back at the confidence interval plot to make sure this makes sense.)

# 95% CI (with continuity correction) for x events out of n, as a one-row table
ci <- function(x, n) {
  data.table(t(prop.test(x = x, n = n, correct = TRUE)$conf.int))
}  

de0 <- dd[, ci(e0, N), keyby = id]
de0 <- de0[, .(L_e0 = V1, U_e0 = V2)]

de1 <- dd[, ci(e1, N), keyby = id]
de1 <- de1[, .(L_e1 = V1, U_e1 = V2)]

dd <- cbind(dd, de0, de1)
dd[, overlap := factor(U_e1 >= L_e0, labels = c("no overlap", "overlap"))]

In the next and last step, I am getting the p-value for a comparison of the proportions in each pair. Any p-value less than the cutoff of 5% is considered significant.

# p-value for a two-sample comparison of proportions
cidif <- function(x, n) {
  prop.test(x = x, n = n, correct = TRUE)$p.value
}  

dd[, pval := cidif(x = c(e1, e0), n = c(N, N)), keyby = id]
dd[, sig := factor(pval < 0.05, labels = c("not significant","significant"))]

The plot shows each pair color coded as to whether there is overlap and the difference is statistically significant.

library(paletteer)

ggplot(data=dd, aes(x = e0, y = e1)) +
  geom_point(aes(color = interaction(overlap, sig)), size = 3, shape =15) +
  theme(panel.grid = element_blank(),
        legend.title = element_blank()) +
  scale_color_paletteer_d(
    palette = "wesanderson::Moonrise2",
    breaks=c("overlap.not significant", "overlap.significant", "no overlap.significant")
  ) +
  scale_x_continuous(limits=c(5, 25), name = "number of events in control arm") +
  scale_y_continuous(limits=c(5, 25), name = "number of events in treatment arm")

The blue points in the center represent outcomes that are relatively close; there is overlap in the individual 95% CIs and the results are not significant. The rust points in the lower right-hand corner represent outcomes where differences are quite large; there is no overlap and the results are significant. (It will always be the case that if there is no overlap in the individual 95% CIs, the differences will be significant, at least before making adjustments for multiplicity, etc.) The region of gold points is where the ambiguity lies, outcomes where there is overlap between the individual 95% CIs but the differences are indeed statistically significant.

The following plot focuses on a single row from the grid plot above. Fixing the number of events in the treatment arm to 10, the transition from (a) substantial overlap and non-significance to (b) less overlap and significance to (c) complete separation and significance is made explicit.

d10 <- dd[e1==10]
d10 <- melt(
  data = d10,
  id.vars = c("e0","e1", "sig", "overlap"), 
  measure.vars = list(c("L_e0", "L_e1"), c("U_e0", "U_e1")), 
  value.name = c("L","U")
)

ggplot(data = d10, aes(x = factor(e0), ymin = L, ymax = U)) +
  geom_errorbar(aes(group = "variable", 
                    color=interaction(overlap, sig)
                ),
                width = .4,
                size = 1,
                position = position_dodge2()) +
  theme(panel.grid = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(size = 10, face = "bold")) +
  scale_color_paletteer_d(
    palette = "wesanderson::Moonrise2",
    breaks=c("overlap.not significant", "overlap.significant", "no overlap.significant")
  ) +
  scale_y_continuous(limits = c(0, 1), name = "proportion with events") +
  xlab("number of events in control arm") +
  ggtitle("Comparison of 95% CIs and significance - treatment arm fixed at 10 events")

Where does this leave us? I think including the 95% CIs for the individual proportions is not really all that helpful, because there is that area of ambiguity. (Not to mention the fact that I think we should be de-emphasizing the p-values while reporting the results of a pilot.)

In this case, I am fine with the original plot, but it is possible to provide an alternative measure of uncertainty by including error bars defined by the sample standard deviation. While doing this is typically more interesting in the context of continuous outcomes, it does give a sense of the sampling variability, which in the case of proportions is largely driven by the sample size. If you do decide to go this route, make sure to label the plot clearly to indicate what the error bars represent (so readers don’t think they are something they are not, such as 95% CIs).

Finding answers faster for COVID-19: an application of Bayesian predictive probabilities

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 19 Jan 2021 00:00:00 +0000

As we evaluate therapies for COVID-19 to help improve outcomes during the pandemic, researchers need to be able to make recommendations as quickly as possible. There really is no time to lose. The Data & Safety Monitoring Board (DSMB) of COMPILE, a prospective individual patient data meta-analysis, recognizes this. They are regularly monitoring the data to determine if there is a sufficiently strong signal to indicate effectiveness of convalescent plasma (CP) for hospitalized patients not on ventilation.

How much data is enough to draw a conclusion? We know that at some point in the next few months, many if not all of the studies included in the meta-analysis will reach their target enrollment, and will stop recruiting new patients; at that point, the meta-analysis data set will be complete. Before that end-point, an interim DSMB analysis might indicate there is a high probability that CP is effective although it does not meet the pre-established threshold of 95%. If we know the specific number of patients that will ultimately be included in the final data set, we can predict the probability that the findings will put us over that threshold, and possibly enable a recommendation. If this probability is not too low, the DSMB may decide it is worth waiting for the complete results before drawing any conclusions.

Predicting the probability of success (or futility) is done using the most recent information collected from the study, which includes observed data, the parameter estimates, and the uncertainty surrounding these estimates (which is reflected in the posterior probability distribution).

This post provides an example using a simulated data set to show how this prediction can be made.

Determining success

In this example, the outcome is the WHO 11-point ordinal scale for clinical status at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. As in COMPILE, I’ll use a Bayesian proportional odds model to assess the effectiveness of CP. The measure of effectiveness is an odds ratio (OR) that compares the cumulative odds of having a worse outcome for the treated group compared to the cumulative odds for the control group:

\[ \text{Cumulative odds for level } k \text{ in treatment arm } j =\frac{P(Y_{ij} \ge k)}{P(Y_{ij} \lt k)}, \ k \in \{1,\dots, 10\} \]

The goal is to reduce the odds of having a bad outcome, so a successful therapy is one where $OR \lt 1$. In a Bayesian context, we estimate the posterior probability distribution of the $OR$ (based on prior assumptions before we have collected any data). We will recommend the therapy in the case that most of the probability density lies to the left of 1; in particular we will claim success only when $P(OR \lt 1) > 0.95$. For example, in the figure the posterior distribution on top would lead us to consider the therapy successful since 95% of the density falls below 1, whereas the distribution on the bottom would not:
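In terms of computation, the success criterion boils down to a proportion of posterior draws. Here is a minimal sketch using made-up draws of the log odds ratio; the normal distribution and its parameters are purely hypothetical stand-ins for the output of a fitted model:

```r
set.seed(87)

# Hypothetical posterior draws of the log odds ratio (delta); in practice
# these would come from the fitted Bayesian model
delta_draws <- rnorm(10000, mean = -0.3, sd = 0.15)
OR_draws <- exp(delta_draws)

# Posterior probability that the odds ratio is below 1
p_success <- mean(OR_draws < 1)

# Success is declared only if this probability exceeds 0.95
success <- p_success > 0.95

p_success
success
```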

Data set

The data set here is considerably simpler than the COMPILE data that has motivated all of this. Rather than structuring this example as a multi-study data set, I am assuming a rather simple two-arm design without any sort of clustering. I am including two binary covariates related to sex and age. The treatment in this case reduces the odds of worse outcomes (or increases the odds of better outcomes). For more detailed discussion of generating ordinal outcomes, see this earlier post (but note that I have flipped the direction of cumulative probability in the odds formula).

library(simstudy)
library(data.table)

def1 <- defDataAdd(varname="male", formula="0.7", dist = "binary")
def1 <- defDataAdd(def1, varname="over69", formula="0.6", dist = "binary")
def1 <- defDataAdd(def1, 
  varname="z", formula="0.2*male + 0.3*over69 - 0.3*rx", dist = "nonrandom")

baseprobs <-  c(0.10, 0.15, 0.08, 0.07, 0.08, 0.08, 0.11, 0.10, 0.09, 0.08, 0.06)

RNGkind("L'Ecuyer-CMRG")
set.seed(9121173)

dd <- genData(450)
dd <- trtAssign(dd, nTrt = 2, grpName = "rx")
dd <- addColumns(def1, dd)
dd <- genOrdCat(dd, adjVar = "z", baseprobs = baseprobs, catVar = "y")

Here is a plot of the cumulative proportions by treatment arm for the first 450 patients in the (simulated) trial. The treatment arm has more patients with lower WHO-11 scores, so for the most part lies above the control arm line. (This may be a little counter-intuitive, so it may be worthwhile to think about it for a moment.)

Estimate a Bayes ordinal cumulative model

With the data from 450 patients in hand, the first step is to estimate a Bayesian proportional odds model, which I am doing in Stan. I use the package cmdstanr to interface between R and Stan.

Here is the model:

\[ \text{logit}\left(P(y_{i} \ge k) \right) = \tau_k + \beta_1 I(\text{male}) + \beta_2 I(\text{over69}) + \delta T_i, \ \ \ k \in \{ 1,\dots,10 \} \]

where $T_i$ is the treatment indicator for patient $i$, and $T_i = 1$ when patient $i$ receives CP. $\delta$ represents the log odds ratio, so $OR = e^{\delta}$. I’ve included the Stan code for the model in the first addendum.
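To see how the pieces of the model fit together, here is a small numerical sketch that evaluates the cumulative probabilities for one patient profile. All of the parameter values are hypothetical, chosen only for illustration, and `p_ge()` and `odds()` are helper names I made up:

```r
# Hypothetical parameter values, chosen only for illustration
tau <- c(2.0, 1.2, 0.8, 0.5, 0.2, -0.2, -0.6, -1.1, -1.7, -2.5)  # tau_1, ..., tau_10
beta <- c(0.2, 0.3)    # coefficients for male and over69
delta <- -0.3          # log odds ratio for treatment

# P(y >= k), k = 1, ..., 10, for a single patient
p_ge <- function(male, over69, rx) {
  plogis(tau + beta[1] * male + beta[2] * over69 + delta * rx)
}

p_control <- p_ge(male = 1, over69 = 1, rx = 0)
p_treated <- p_ge(male = 1, over69 = 1, rx = 1)

# The odds ratio is the same at every level k and equals exp(delta)
odds <- function(p) p / (1 - p)
odds(p_treated) / odds(p_control)
exp(delta)
```

The constant ratio across all ten levels is exactly the proportional odds assumption.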

library(cmdstanr)

dt_to_list <- function(dx) {
  
  N <- nrow(dx)                               ## number of observations 
  L <- dx[, length(unique(y))]                ## number of levels of outcome 
  y <- as.numeric(dx$y)                       ## individual outcome 
  rx <- dx$rx                                 ## treatment arm for individual 
  x <- model.matrix(y ~ factor(male) + factor(over69), data = dx)[, -1]
  D <- ncol(x)
  
  list(N=N, L=L, y=y, rx=rx, x=x, D=D)
}

mod <- cmdstan_model("pprob.stan")

fit <- mod$sample(
  data = dt_to_list(dd),
  seed = 271263,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 2000,
  iter_sampling = 2500,
  step_size = 0.1
)

## Running MCMC with 4 parallel chains...
## 
## Chain 4 finished in 56.9 seconds.
## Chain 1 finished in 57.8 seconds.
## Chain 3 finished in 58.0 seconds.
## Chain 2 finished in 60.7 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 58.3 seconds.
## Total execution time: 61.0 seconds.

Extract posterior distribution

Once the model is fit, our primary interest is whether we can make a recommendation about the therapy. A quick check to verify if $P(OR < 1) > 0.95$ confirms that we are not there yet.

library(posterior)

draws_df <- as_draws_df(fit$draws())
draws_dt <- data.table(draws_df[-grep("^yhat", colnames(draws_df))])

mean(draws_dt[, OR < 1])

## [1] 0.89

A plot that shows the bottom 95% portion of the density in blue makes it clear that the threshold has not been met:

The elements of the predictive probability

We have collected complete data from 450 patients out of an expected 500, though we are not yet ready to declare success. An interesting question to ask at this point is “given what we have observed up until this point, what is the probability that we will declare success after 50 additional patients are included in the analysis?” If the probability is sufficiently high, we may decide to delay releasing the inconclusive results pending the updated data set. (On the other hand, if the probability is quite low, there may be no point in delaying.)

The prediction incorporates three potential sources of uncertainty. First, there is the uncertainty regarding the parameters, which is described by the posterior distribution. Second, even if we knew the parameters with certainty, the outcome remains stochastic (i.e. not pre-determined conditional on the parameters). Finally, we don’t necessarily know the characteristics of the remaining patients (though we may have some or all of that information if recruitment has been finished but complete follow-up has not).

The steps of the algorithm that follows map onto these three elements of uncertainty:

Generate 50 new patients by bootstrap sampling (with replacement) from the observed patients. The distribution of covariates of the new 50 patients will be based on the original 450 patients.
Make a single draw from the posterior distribution of our estimates to generate a set of parameters.
Using the combination of new patients and parameters, generate ordinal outcomes for each of the 50 patients.
Combine this new data set with the 450 existing patients to create a single analytic file.
Fit a new Bayesian model with the 500-patient data set.
Record $P(OR < 1)$ based on this new posterior distribution. If $P(OR < 1) > 95\%$, we consider the result to be a “success”; otherwise it is “not a success”.

We repeat this cycle, say 1000 times. The proportion of cycles that are counted as a success represents the predictive probability of success.
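To make the cycle concrete, here is a toy version that swaps the ordinal model for a much simpler conjugate beta-binomial model with a binary outcome, so that each model fit is instantaneous. This is only a sketch and not the analysis used in this post: the interim counts in y_obs and n_obs are hypothetical, and because there are no covariates the bootstrap step drops out. The remaining steps have the same structure as the algorithm above.

```r
set.seed(2021)

# Hypothetical interim data: number of events out of 225 patients per arm
n_obs <- c(ctl = 225, trt = 225)
y_obs <- c(ctl = 120, trt = 104)
n_new <- 25   # additional patients per arm (50 in total)

# Posterior P(p_trt < p_ctl) under independent Beta(1, 1) priors,
# estimated from Monte Carlo draws of the two event probabilities
post_prob <- function(y, n, n_draws = 4000) {
  p_ctl <- rbeta(n_draws, 1 + y[["ctl"]], 1 + n[["ctl"]] - y[["ctl"]])
  p_trt <- rbeta(n_draws, 1 + y[["trt"]], 1 + n[["trt"]] - y[["trt"]])
  mean(p_trt < p_ctl)
}

one_cycle <- function() {
  # draw a single set of parameters from the current posterior
  p <- rbeta(2, 1 + y_obs, 1 + n_obs - y_obs)
  # generate outcomes for the new patients given those parameters
  y_new <- rbinom(2, n_new, p)
  # combine with the observed data, "refit", and assess success
  post_prob(y_obs + y_new, n_obs + n_new) > 0.95
}

# proportion of cycles that end in success = predictive probability
pred_prob <- mean(replicate(1000, one_cycle()))
pred_prob
```

The conjugate model only stands in for the MCMC fit; in the real application each call to one_cycle requires sampling from a full Bayesian model, which is where the computing burden comes from.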

Step 1: new patients

library(glue)

dd_new <- dd[, .(id = sample(id, 25, replace = TRUE)), keyby = rx]
dd_new <- merge(dd[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id:= (nrow(dd) + 1):(nrow(dd) +.N)]

Step 2: draw set of parameters

draw <- as.data.frame(draws_dt[sample(.N, 1)])

The coefficients $\hat{\beta}_1$ (male), $\hat{\beta}_2$ (over69), and $\hat{\delta}$ (treatment effect) are extracted from the draw from the posterior:

D <- dt_to_list(dd)$D
beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))

coefs

##       [,1]
## [1,]  0.22
## [2,]  0.60
## [3,] -0.19

Using the draws of the $\tau_k$’s, I’ve calculated the corresponding probabilities that can be used to generate the ordinal outcome for the new observations:

tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)
cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop) 

baseline

##  [1] 0.117 0.136 0.123 0.076 0.089 0.101 0.114 0.102 0.054 0.040 0.048

Step 3: generate outcome using coefficients and baseline probabilities

zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs
setkey(dd_new, id)

dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")
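Conceptually, generating an ordinal outcome from a cumulative odds model amounts to shifting the cut-points by the linear predictor, converting the result to category probabilities, and sampling. The base R sketch below illustrates that idea with made-up cut-points and linear predictors. It is not simstudy's actual implementation, and it assumes the convention $\text{logit} \ P(y \le k) = \tau_k - z$, which is the one used by ordered_logistic in the Stan model in the addendum.

```r
set.seed(135)

tau <- c(-1, 0, 1.5)      # hypothetical cut-points (4 categories)
z <- c(0, 0.5, -0.8)      # hypothetical linear predictors for 3 patients

gen_ordinal <- function(z_i, tau) {
  cprob <- plogis(c(tau, Inf) - z_i)   # cumulative probabilities P(y <= k)
  prob <- diff(c(0, cprob))            # category probabilities
  sample(seq_along(prob), size = 1, prob = prob)
}

y <- vapply(z, gen_ordinal, integer(1), tau = tau)
y
```

A larger value of z pushes probability mass toward the higher categories, since every cumulative probability shrinks.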

Step 4: combine new with existing

dx <- rbind(dd, dd_new)

Step 5: fit model

fit_pp <- mod$sample(
  data = dt_to_list(dx),
  seed = 737163,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 2000,
  iter_sampling = 2500,
  step_size = 0.1
)

## Running MCMC with 4 parallel chains...
## 
## Chain 2 finished in 79.4 seconds.
## Chain 4 finished in 79.7 seconds.
## Chain 3 finished in 80.1 seconds.
## Chain 1 finished in 80.5 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 79.9 seconds.
## Total execution time: 80.6 seconds.

Step 6: assess success for single iteration

draws_pp <- data.table(as_draws_df(fit_pp$draws()))
draws_pp[, mean(OR < 1)]

## [1] 0.79

Estimating the predictive probability

The next step is to pull all these elements together in a single function that we can call repeatedly to estimate the predictive probability of success. This probability is estimated by calculating the proportion of iterations that result in a success.

Computing resources required for this estimation might be quite substantial. If we iterate 1000 times, we need to fit that many Bayesian models, which could be prohibitive on a single machine; a high performance computing cluster (HPC) may be necessary. (I touched on this earlier when I described exploring the characteristic properties of Bayesian models.) I have provided the code below in the second addendum in case any readers are interested in trying to implement this on an HPC.

I’ll conclude with a figure that shows how predictive probabilities can vary depending on the observed sample size and $P(OR < 1)$ for the interim data set. Based on the data generating process I’ve used here, if we observe a $P(OR < 1) = 90\%$ at an interim look after 250 patients, it is considerably more probable that we will end up over 95% than if we observe that same probability at an interim look after 450 patients. This makes sense, of course, since the estimate at 450 patients will have less uncertainty, and adding 50 patients is unlikely to change the results dramatically. The converse is true after 250 patients.

Ultimately, the interpretation of the predictive probability will depend on the urgency of making a recommendation, the costs of waiting, the costs of deciding too soon, and other factors specific to the trial and those making the decisions.

Addendum A: Stan code

data {
  int<lower=0> N;                // number of observations
  int<lower=2> L;                // number of WHO categories
  int<lower=1,upper=L> y[N];     // vector of categorical outcomes
  int<lower=0,upper=1> rx[N];    // treatment or control
  int<lower=1> D;                // number of covariates
  row_vector[D] x[N];            // matrix of covariates  N x D matrix
}

parameters {
  
  vector[D] beta;           // covariate estimates 
  real delta;               // overall treatment effect
  ordered[L-1] tau;         // cut-points for cumulative odds model ([L-1] vector)
  
}

transformed parameters{ 
  
  vector[N] yhat;

  for (i in 1:N){
    yhat[i] = x[i] * beta + rx[i] * delta;
  }
}

model {
  
  // priors
  
  beta ~ student_t(3, 0, 10);
  delta ~ student_t(3, 0, 2);
  tau ~ student_t(3, 0, 8);
      
  // outcome model
  
  for (i in 1:N)
    y[i] ~ ordered_logistic(yhat[i], tau);
}

generated quantities {
  real OR = exp(delta);
}

Addendum B: HPC code

library(slurmR)

est_from_draw <- function(n_draw, Draws, dd_obs, D, s_model) {
  
  set_cmdstan_path(path = "/.../cmdstan/2.25.0")
  
  dd_new <- dd_obs[, .(id = sample(id, 125, replace = TRUE)), keyby = rx]
  dd_new <- merge(dd_obs[, .(id, male, over69)], dd_new, by = "id")
  dd_new[, id:= (nrow(dd_obs) + 1):(nrow(dd_obs) +.N)]
  
  draw <- as.data.frame(Draws[sample(.N, 1)])
  
  beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
  delta <- draw$delta
  coefs <- as.matrix(c(beta, delta))
  
  tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
  tau <- c(tau, Inf)
  cprop <- plogis(tau)
  xprop <- diff(cprop)
  baseline <- c(cprop[1], xprop) 
  
  zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
  dd_new$z <- zmat %*% coefs
  setkey(dd_new, id)
  
  dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")
  
  dx <- rbind(dd_obs, dd_new)
  
  fit_pp <- s_model$sample(
    data = dt_to_list(dx),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 2000,
    iter_sampling = 2500,
    step_size = 0.1
  )
  
  draws_pp <- data.table(as_draws_df(fit_pp$draws()))
  return(data.table(n_draw, prop_success = draws_pp[, mean(OR < 1)]))
}

job <- Slurm_lapply(
  X = 1L:1080L, 
  FUN = est_from_draw, 
  Draws = draws_dt,
  dd_obs = dd,
  D = D,
  s_model = mod,
  njobs = 90L, 
  mc.cores = 4L,
  job_name = "i_pp",
  tmp_path = "/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("dt_to_list"),
  overwrite = TRUE
)

job
res <- Slurm_collect(job)
rbindlist(res)[, mean(prop_success >= 0.95)]

Coming soon: effortlessly generate ordinal data without assuming proportional odds

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 05 Jan 2021 00:00:00 +0000

I’m starting off 2021 with my 99th post ever to introduce a new feature that will be incorporated into simstudy soon to make it a bit easier to generate ordinal data without requiring an assumption of proportional odds. I should wait until this feature has been incorporated into the development version, but I want to put it out there in case anyone has any further suggestions. In any case, having this out in plain view will motivate me to get back to work on the package.

In the past, I have generated ordinal data without assuming proportional odds, but it was a bit cumbersome. I’ve wanted to simplify the approach by incorporating the functionality directly into genOrdCat. While this last step is not complete, it is very close. For now, there is a temporary function genOrdCatNP.

Starting with the proportional odds assumption

First, I am generating a data set using genOrdCat and an assumption of proportionality. This will be a large data set to eliminate any sampling issues when summarizing the odds ratios. (Please take a look here for a little more detail on how the odds and cumulative odds are defined.) The data will include a treatment or exposure indicator and an ordinal categorical outcome with four possible responses. In this case, the log odds-ratio comparing the odds for the treatment group with the control group will be -1.3, which translates to an odds ratio (OR) of 0.27.

d1 <- defDataAdd(varname = "z", formula = "rx * 1.3", dist = "nonrandom")

set.seed(382763)
dd <- genData(100000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)

baseprobs <- c(.4, .3, .2, .1)
dd_p <- genOrdCat(dtName = dd, adjVar = "z", baseprobs = baseprobs)

Here is a calculation of the cumulative proportions and odds, as well as the odds ratios, which are all close to 0.27 (a log odds-ratio of -1.3):

(prop <- prop.table(dd_p[, table(rx, cat)], 1))

##    cat
## rx     1    2    3    4
##   0 0.40 0.30 0.20 0.10
##   1 0.15 0.24 0.32 0.29

(cumprop <- data.table(apply(prop, 1, cumsum)))

##      0    1
## 1: 0.4 0.15
## 2: 0.7 0.39
## 3: 0.9 0.71
## 4: 1.0 1.00

(cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))])

##    odds0 odds1
## 1:  0.67  0.18
## 2:  2.32  0.63
## 3:  8.84  2.45
## 4:   Inf   Inf

cumodds[1:3, odds1/odds0]

## [1] 0.27 0.27 0.28
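As a check on the simulated proportions, the expected values under proportional odds can be worked out directly from the baseline probabilities and the log odds-ratio of -1.3. This is a small base R sketch that uses only the numbers given above; the variable names are mine. The expected cumulative proportions in the treatment arm come out at roughly 0.15, 0.39, and 0.71, in line with the table.

```r
baseprobs <- c(0.4, 0.3, 0.2, 0.1)
log_or <- -1.3

# cumulative probabilities for the first three categories in the control arm
cum0 <- cumsum(baseprobs)[1:3]

# shift each cumulative log-odds by the same amount (proportional odds)
cum1 <- plogis(qlogis(cum0) + log_or)

# implied category probabilities in the treatment arm
probs1 <- diff(c(0, cum1, 1))

# cumulative odds ratios: identical at every cut-point by construction
or <- (cum1 / (1 - cum1)) / (cum0 / (1 - cum0))

round(cum1, 2)
round(or, 2)
```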

To visualize cumulative proportional odds, here’s a plot using the really nice likert package:

library(likert)

item <- data.frame(dd_p[, "cat"])
names(item) <- "response"
bin.grp <- factor(dd_p[, rx])
likert.bin <- likert(item, grouping = bin.grp)

plot(likert.bin) + ggtitle("Proportional odds")

It’s now simple to relax the proportionality assumption

Two arguments are needed to relax the assumption: npVar and npAdj. The user specifies the first argument to indicate what variable is driving the lack of proportionality. In this case, it will be the treatment itself, since that is the only covariate in the data set. The second argument is a vector that represents the violation of the assumption at each level of the ordinal measure. The length of this vector is currently the same length as the number of ordinal categories, though the last value will always be 0.

In this example, the cumulative log-odds ratio for the first category will be 0.3 points more than the proportional assumption of -1.3, so should be quite close to -1.0 (or 0.37 on the OR scale). The cumulative log-odds ratio (and OR) for the second category will be the same as in the case of proportionality. And, the cumulative log-odds ratio for the third category will be 1.0 less than -1.3, or around -2.3 (which translates to 0.10 on the OR scale).

npAdj <- c(-0.3, 0.0, 1.0, 0)
dd_np <- genOrdCatNP(dtName = dd, adjVar = "z", baseprobs = baseprobs,
                  npVar = "rx", npAdj = npAdj)

cumprop <- data.table(apply(prop.table(dd_np[, table(rx, cat)], 1), 1, cumsum))
cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))]
cumodds[1:3, odds1/odds0]

## [1] 0.37 0.27 0.10
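The simulated odds ratios line up with what the adjustments imply. In the sketch below I assume the sign convention suggested by the description and the output above, namely that each element of npAdj is added to the 1.3 shift before the shift is subtracted from the baseline cumulative log-odds. That is my reading of the behavior, not a statement about the function's internals.

```r
baseprobs <- c(0.4, 0.3, 0.2, 0.1)
npAdj <- c(-0.3, 0.0, 1.0)          # adjustments for the first three categories

# implied cumulative log-odds ratios: -1.0, -1.3, -2.3
log_or <- -(1.3 + npAdj)
or <- round(exp(log_or), 2)
or

# implied cumulative probabilities in the treatment arm; these must remain
# increasing across categories for the adjustments to define a valid distribution
cum0 <- cumsum(baseprobs)[1:3]
cum1 <- plogis(qlogis(cum0) + log_or)
round(cum1, 2)
```

The monotonicity check is worth keeping in mind when choosing adjustments: a large enough violation of proportionality could make the cumulative probabilities cross.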

And here is what non-proportional cumulative odds looks like:

Constrained randomization to evaluate the vaccine rollout in nursing homes

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 22 Dec 2020 00:00:00 +0000

On an incredibly heartening note, two COVID-19 vaccines have been approved for use in the US and other countries around the world. More are possibly on the way. The big challenge, at least here in the United States, is to convince people that these vaccines are safe and effective; we need people to get vaccinated as soon as they are able to slow the spread of this disease. I for one will not hesitate for a moment to get a shot when I have the opportunity, though I don’t think biostatisticians are too high on the priority list.

Those who are at the top of the priority list are staff and residents of nursing homes. Unfortunately, within these communities, there are pockets of ambivalence or skepticism about vaccines in general, and the COVID vaccine in particular. In the past, influenza vaccine rates for residents at some facilities could be as low as 50%. I am part of an effort organized by researchers affiliated with the IMPACT Collaboratory to figure out a way to increase these numbers.

This effort involves a cluster randomized trial (CRT) to evaluate the effectiveness of a wide-ranging outreach program designed to encourage nursing home residents to get their shots. The focus is on facilities that have relatively high proportions of African-American and Latinx residents, because these facilities have been among the least successful in the past in convincing residents to get vaccinated. The outcome measure of the trial, which will be measured at the individual level, will be the probability of vaccination within 5 weeks of being available at the nursing home.

The nursing homes in the study are members of one of four national nursing home networks or corporations. In this CRT, the randomization will be stratified by these networks and by the proportion of African-American and Latinx residents. We are defining the race/ethnicity strata using cutoffs of proportions: <25%, 25% to 40%, and >40%. We want our randomization to provide balance with respect to racial/ethnic composition in the intervention and control arms within each individual stratum. However, we are concerned that the strata with fewer nursing homes have a high risk of imbalance just by chance. Constrained randomization is one possible way to mitigate this risk, which is the focus here in this post.

Constrained randomization

The basic idea is pretty simple. We generate a large number of possible randomization lists based on the requirements of the study design. For each randomization, we evaluate whether the balancing criteria have been met; by selecting only the subset of randomizations that pass this test, we create a sample of eligible randomizations. With this list of possible randomizations in hand, we randomly select one, which becomes the actual randomization. Because we have limited the final selection only to possible randomizations that have been vetted for balance (or whatever criteria we require), we are guaranteed to satisfy the pre-specified criteria.

Simulated data

I am using a single simulated data set to illustrate the constrained randomization process. Using the simstudy package, creating this data set is a two-step process of defining the data and then generating the data.

Defining the data

There will be a total of 200 nursing homes in 3 (rather than 4) networks. Just as in the real data, racial/ethnic composition will differ by network (because they are based in different parts of the country). And the networks are different sizes. The proportions of African-American/Latinx residents are generated using the beta distribution, which ranges from 0 to 1. In simstudy, the beta distribution is parameterized with a mean (specified in the formula argument) and dispersion (specified in the variance argument). See this for more details on the beta distribution.
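For intuition about the mean and dispersion parameterization, here is a base R sketch. It assumes the usual mapping from a mean $\mu$ and a precision-type parameter $\phi$ to the two shape parameters, $a = \mu\phi$ and $b = (1 - \mu)\phi$, which is my understanding of what simstudy does; please check the package documentation before relying on it. Under this mapping the variance is $\mu(1-\mu)/(1 + \phi)$, so larger values of the dispersion argument give less spread.

```r
set.seed(11)

mu <- 0.25    # mean, as in the definition for network 1
phi <- 10     # value passed in the variance argument for network 1

# assumed mapping to the standard shape parameters
x <- rbeta(1e5, shape1 = mu * phi, shape2 = (1 - mu) * phi)

var_theory <- mu * (1 - mu) / (1 + phi)
c(mean = mean(x), var = var(x), var_theory = var_theory)
```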

library(simstudy)
library(data.table)

def <- defData(varname = "network", formula = "0.3;0.5;0.2", 
  dist = "categorical", id = "site")

defC <- defCondition(condition = "network == 1", 
  formula = "0.25", variance = 10, dist = "beta")
defC <- defCondition(defC, condition = "network == 2", 
  formula = "0.3", variance = 7.5, dist = "beta")
defC <- defCondition(defC, condition = "network == 3", 
  formula = "0.35", variance = 5, dist = "beta")

Generating the data

set.seed(2323761)

dd <- genData(200, def, id = "site")
dd <- addCondition(defC, dd, newvar = "prop")

dd[, stratum := cut(prop, breaks = c(0, .25, .4, 1), 
  include.lowest = TRUE, labels = c(1, 2, 3))]

dd

##      site  prop network stratum
##   1:    1 0.340       2       2
##   2:    2 0.181       2       1
##   3:    3 0.163       2       1
##   4:    4 0.099       3       1
##   5:    5 0.178       2       1
##  ---                           
## 196:  196 0.500       2       3
## 197:  197 0.080       3       1
## 198:  198 0.479       3       3
## 199:  199 0.071       2       1
## 200:  200 0.428       2       3

Randomization

Now that we have a data set in hand, we can go ahead and randomize. I am using the simstudy function trtAssign, which allows us to specify the strata as well as the ratio of controls to intervention facilities. In this case, we have a limit on the number of sites at which we can implement the intervention. In this simulation, I assume that we’ll randomize 150 sites to control, and 50 to the intervention, a 3:1 ratio.

dr <- trtAssign(dd, nTrt = 2, balanced = TRUE, strata = c("network", "stratum"), 
  ratio = c(3, 1), grpName = "rx")

We want to inspect the average proportion of African-American/Latinx residents within each stratum (without adjusting for nursing home size, which is ignored here). First, we create a data table that includes the difference in average proportions between the facilities randomized to the intervention and those randomized to control:

dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
dc <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
dc[, dif := abs(`1` - `0`)]

Looking at both the table and the figure, Stratum 3 (>40%) in Network 1 jumps out as having the largest discrepancy, about 15 percentage points:

##    network stratum    0     1     dif
## 1:       1       1 0.16 0.192 0.03062
## 2:       1       2 0.31 0.305 0.00641
## 3:       1       3 0.45 0.600 0.14568
## 4:       2       1 0.15 0.143 0.00394
## 5:       2       2 0.32 0.321 0.00310
## 6:       2       3 0.49 0.547 0.05738
## 7:       3       1 0.13 0.085 0.04948
## 8:       3       2 0.33 0.331 0.00029
## 9:       3       3 0.56 0.607 0.04284

Constraining the randomization

We want to do better and ensure that the maximum difference within a stratum falls below some specified threshold, say a 3 percentage point difference. All we need to do is repeatedly randomize and then check balance. I’ve written a function randomize that will be called repeatedly. Here I generate 1000 randomization lists, but in some cases I might need to generate many more, particularly if it is difficult to achieve targeted balance in any particular randomization.

randomize <- function(dd) {
  
  dr <- trtAssign(dd, nTrt = 2, strata = c("network", "stratum"), balanced = TRUE,
    ratio = c(3, 1), grpName = "rx")
  
  dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
  dx <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
  dx[, dif := abs(`1` - `0`)]
  
  list(is_candidate = all(dx$dif < 0.03), randomization = dr[,.(site, rx)], 
    balance = dx)
  
}

rand_list <- lapply(1:1000, function(x) randomize(dd))

Here is one randomization that fails to meet the criteria as 5 of the 9 strata exceed the 3 percentage point threshold:

## [[1]]
## [[1]]$is_candidate
## [1] FALSE
## 
## [[1]]$randomization
##      site rx
##   1:    1  0
##   2:    2  1
##   3:    3  0
##   4:    4  1
##   5:    5  0
##  ---        
## 196:  196  0
## 197:  197  1
## 198:  198  0
## 199:  199  0
## 200:  200  0
## 
## [[1]]$balance
##    network stratum    0     1    dif
## 1:       1       1 0.16 0.207 0.0503
## 2:       1       2 0.30 0.334 0.0330
## 3:       1       3 0.45 0.600 0.1457
## 4:       2       1 0.15 0.138 0.0107
## 5:       2       2 0.32 0.330 0.0078
## 6:       2       3 0.50 0.514 0.0142
## 7:       3       1 0.13 0.085 0.0493
## 8:       3       2 0.34 0.311 0.0239
## 9:       3       3 0.55 0.647 0.0950

Here is another that passes, as all differences are below the 3 percentage point threshold:

## [[1]]
## [[1]]$is_candidate
## [1] TRUE
## 
## [[1]]$randomization
##      site rx
##   1:    1  1
##   2:    2  0
##   3:    3  1
##   4:    4  0
##   5:    5  1
##  ---        
## 196:  196  1
## 197:  197  0
## 198:  198  1
## 199:  199  1
## 200:  200  0
## 
## [[1]]$balance
##    network stratum    0    1    dif
## 1:       1       1 0.16 0.18 0.0168
## 2:       1       2 0.31 0.31 0.0039
## 3:       1       3 0.49 0.49 0.0041
## 4:       2       1 0.15 0.14 0.0144
## 5:       2       2 0.32 0.33 0.0064
## 6:       2       3 0.50 0.52 0.0196
## 7:       3       1 0.12 0.12 0.0095
## 8:       3       2 0.34 0.31 0.0239
## 9:       3       3 0.57 0.58 0.0134

All that remains is to identify all the randomization sets that met the criteria (in this case there are only 6, suggesting we should probably generate at least 100,000 randomizations to ensure we have enough to pick from).

candidate_indices <- sapply(rand_list, function(x) x[["is_candidate"]])
candidates <- rand_list[candidate_indices]
(n_candidates <- length(candidates))

## [1] 6

selected <- sample(x = n_candidates, size = 1)
ds <- candidates[[selected]][["randomization"]]

ds <- merge(dd, ds, by = "site")
dx <- ds[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]

And looking at the plot confirms that we have a randomization scheme that is balanced based on our target:

Of course, the selection criteria could be based on any combination of factors. We may have multiple means that we want to balance, or we might want the two arms to be similar with respect to the standard deviation of a measure. These additional criteria may require more randomization schemes to be generated just because balance is that much more difficult to achieve, but all that really costs is computing time, not programming effort.
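As an illustration of combining criteria, here is a base R sketch on toy data (not the nursing home data above) where a randomization is accepted only if both the difference in means and the ratio of standard deviations fall within limits. The thresholds max_dif and max_sd_ratio, the 1:1 allocation, and the function check_balance are all hypothetical choices made for the example.

```r
set.seed(77)

# toy data: 40 sites with a hypothetical site-level proportion
sites <- data.frame(site = 1:40, prop = rbeta(40, 2, 5))

# TRUE only if both balance criteria are satisfied
check_balance <- function(rx, x, max_dif = 0.03, max_sd_ratio = 1.25) {
  m <- c(mean(x[rx == 0]), mean(x[rx == 1]))
  s <- c(sd(x[rx == 0]), sd(x[rx == 1]))
  abs(m[2] - m[1]) < max_dif && max(s) / min(s) < max_sd_ratio
}

# generate 2000 candidate 1:1 randomizations and keep those that pass
rand_list <- replicate(2000, sample(rep(0:1, each = 20)), simplify = FALSE)
is_candidate <- vapply(rand_list, check_balance, logical(1), x = sites$prop)
candidates <- rand_list[is_candidate]

# randomly select the actual randomization from the vetted set
selected <- candidates[[sample(length(candidates), 1)]]
length(candidates)
```

Adding a criterion only changes the function that does the vetting; the generate, filter, and select logic stays the same.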

A Bayesian implementation of a latent threshold model

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 08 Dec 2020 00:00:00 +0000

In the previous post, I described a latent threshold model that might be helpful if we want to dichotomize a continuous predictor but we don’t know the appropriate cut-off point. This was motivated by a need to identify a threshold of antibody levels present in convalescent plasma that is currently being tested as a therapy for hospitalized patients with COVID in a number of RCTs, including those that are participating in the ongoing COMPILE meta-analysis.

Barring any specific scientific rationale, we could pick an arbitrary threshold and continue with our analysis. Unfortunately, our estimates would not reflect the uncertainty around the selection of that threshold point; an approach that incorporates this uncertainty would be more appropriate. Last time, I described a relatively simple scenario with a single continuous predictor, a latent threshold, and a continuous outcome; the estimates were generated using the R package chngpt. Because I want to be able to build more flexible models in the future that could accommodate multiple continuous predictors (and latent thresholds), I decided to implement a Bayesian version of the model.

The model

Before laying out the model (described in much more detail in the Stan User’s Guide), I should highlight two key features. First, we assume that the distribution of the outcome differs on either side of the threshold. In this example, we expect that the outcome data for antibody levels below the threshold are distributed as $N(\alpha, \sigma)$, and that data above the threshold are $N(\beta, \sigma)$. Second, since we do not know the threshold value, the likelihood is specified as a mixture across the range of all possible thresholds; the posterior distribution of the parameters $\alpha$ and $\beta$ reflects the uncertainty about where the threshold lies.

The observed data include the continuous outcome $\textbf{y}$ and a continuous antibody measure $\textbf{x}$. There are $M$ possible pre-specified thresholds that are reflected in the vector $\mathbf{c}$. Each candidate threshold is treated as a discrete quantity and a probability $\lambda_m$ is attached to each. Here is the model for the outcome conditional on the distribution parameters as well as the probability of the thresholds:

\[p(\textbf{y}|\alpha, \beta, \sigma, \mathbf{\lambda}) = \sum_{m=1}^M \lambda_m \left(\prod_{i: \; x_i < c[m]} \text{normal}(y_i | \alpha, \sigma) \prod_{i: \; x_i \ge c[m]} \text{normal}(y_i | \beta, \sigma)\right)\]

Implementation in Stan

Here is a translation of the model into Stan. The data for the model include the antibody level $x$, the outcome $y$, and the candidate thresholds included in the vector $\mathbf{c}$ which has length $M$. In this example, the candidate vector is based on the range of observed antibody levels.

data {
  int<lower=1> N;                // number of observations
  real x[N];                     // antibody measures
  real y[N];                     // outcomes
    
  int<lower=1> M;                // number of candidate thresholds
  real c[M];                     // candidate thresholds
}

At the outset, equal probability will be assigned to each of the $M$ candidate thresholds, which is $1/M$. Since Stan operates in log-probabilities, this is translated to $\text{log}(1/M) = -\text{log}(M)$:

transformed data {
  real lambda;
  lambda = -log(M);
}

The three parameters that define the two distributions (above and below the threshold) are $\alpha$, $\beta$, and $\sigma$:

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

This next block is really the implementation of the threshold model. $\mathbf{lp}$ is a vector of log probabilities, where each element represents the log of each summand in the model specified above.

transformed parameters {
  vector[M] lp;
  lp = rep_vector(lambda, M);
  
  for (m in 1:M)
    for (n in 1:N)
      lp[m] = lp[m] + normal_lpdf(y[n] | x[n] < c[m] ? alpha : beta, sigma);
}

The notation y[n] | x[n] < c[m] ? alpha : beta, sigma is Stan’s shorthand for an if-then-else statement (this is not Stan code!):

if x[n] < c[m] then 
  y ~ N(alpha, sigma)
else if x[n] >= c[m] then
  y ~ N(beta, sigma)

And finally, here is the specification of the priors and the full likelihood. On the log scale, the likelihood is the log of the sum of the exponentiated elements of $\mathbf{lp}$; the function log_sum_exp executes this summation across the $M$ candidate thresholds specified in the model above in a numerically stable way.

model {
  alpha ~ student_t(3, 0, 2.5);
  beta ~ student_t(3, 0, 2.5);
  sigma ~ exponential(1);
  
  target += log_sum_exp(lp);
}
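To see what the transformed parameters and model blocks are computing, here is a base R sketch that evaluates the same marginal log-likelihood at a single set of parameter values. The data are a simplified toy (a sharp change in the mean at -0.7 rather than the latent-status process used below), and the names c_m, lp, and log_sum_exp are mine; the log_sum_exp here is a hand-rolled version of the standard trick, not Stan's function.

```r
set.seed(4)

# toy data: the mean of y jumps from 0 to 3 when x crosses -0.7
x <- rnorm(200)
y <- rnorm(200, mean = ifelse(x < -0.7, 0, 3), sd = 1)

c_m <- seq(-2, 2, by = 0.1)     # candidate thresholds
M <- length(c_m)

# stable log(sum(exp(v))): factor out the largest element before exponentiating
log_sum_exp <- function(v) max(v) + log(sum(exp(v - max(v))))

# one element per candidate threshold: log(1/M) + log-likelihood given that threshold
lp <- function(alpha, beta, sigma) {
  vapply(c_m, function(cm) {
    -log(M) + sum(dnorm(y, mean = ifelse(x < cm, alpha, beta), sd = sigma, log = TRUE))
  }, numeric(1))
}

lp_vals <- lp(alpha = 0, beta = 3, sigma = 1)
log_lik <- log_sum_exp(lp_vals)   # the quantity added to target in the model block
log_lik
```

In the Stan program this calculation is repeated at every iteration for the current values of alpha, beta, and sigma.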

Data generation

The data generated to explore this model is based on the same data definitions I used in the last post to explore the MLE model.

library(simstudy)
set.seed(87654)

d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
              dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status", 
              variance = 1, dist = "normal")

dd <- genData(500, d1)

The threshold is quite apparent here. In the right hand plot, the latent classes are revealed.

Model fitting

We use the rstan package to access Stan, passing along the observed antibody data, outcome data, as well as the candidate thresholds:

library(rstan)

rt <- stanc("/.../threshold.stan");
sm <- stan_model(stanc_ret = rt, verbose=FALSE)

N <- nrow(dd)
y <- dd[, y]
x <- dd[, antibody] 
c <- seq(round(min(x), 1), round(max(x), 1), by = .1)
M <- length(c)

studydata3 <- list(N=N, x=x, y=y, M=M, c=c)
fit3 <-  sampling(sm, data = studydata3, iter = 3000, warmup = 500, 
                  cores = 4L, chains = 4, control = list(adapt_delta = 0.8))

The first order of business is to make sure that the MCMC algorithm sampled the parameter space in a well-behaved manner. Everything looks good here:

library(bayesplot)

posterior <- as.array(fit3) 
lp <- log_posterior(fit3)
np <- nuts_params(fit3)

color_scheme_set("mix-brightblue-gray")

mcmc_trace(posterior, pars = c("alpha","beta", "sigma"), 
                facet_args = list(nrow = 3), np = np) + 
  xlab("Post-warmup iteration")

The posterior distributions of the three parameters of interest ($\alpha$, $\beta$, and $\sigma$) are quite close to the values used in the data generation process:

mcmc_intervals(posterior, pars = c("alpha","beta", "sigma"))

The posterior probability of the threshold

Even though the distributions of $\alpha$, $\beta$, and $\sigma$ are marginal with respect to the candidate thresholds, we may still be interested in the posterior distribution of the thresholds. An approach to estimating this is described in the User’s Guide. I provide a little more detail and code for generating the plot in the addendum.

The plot shows the log-probability for each of the candidate thresholds considered, with a red dashed line drawn at $-0.7$, the true threshold used in the data generation process. In this case, the probability (and log-probability) peaks at this point. In fact, there is a pretty steep drop-off on either side, indicating that we can have a lot of confidence that the threshold is indeed $-0.7$.

When there is a single distribution

If we update the data definitions to generate a single distribution (i.e. the outcome is independent of the antibody measure), the threshold model struggles to identify a threshold, and the parameter estimates have more uncertainty.

d1 <- updateDef(d1, changevar = "y", newformula = "0")
dd <- genData(500, d1)

Here is a plot of the data based on the updated assumption:

And here are the posterior probabilities for the parameters - now with much wider credible intervals:

Here is the posterior distribution of thresholds, intentionally plotted to highlight the lack of distinction across the candidate thresholds:

Addendum - posterior probabilities of the threshold

Here’s a little more background on how the posterior probabilities for the threshold were calculated. As a reminder, $\textbf{c}$ is a vector of candidate thresholds of length $M$. We define a quantity $q(c_m | data)$ as

\[ q(c_m | data) = \frac{1}{R}\sum_{r=1}^R \text{exp}(lp_{rc_m}) \] where $lp_{rc_m}$ is the value of $lp$ from the $r$’th draw for threshold candidate $c_m$. We are actually interested in $p(c_m|data)$, which is related to $q$:

\[ p(c_m | data) = \frac{q(c_m | data)}{\sum_{m'=1}^M q(c_{m'}|data)} \]

The R code is a little bit involved, because the log-probabilities are so small that exponentiating them to recover the probabilities runs into floating point limitations. In the examples I have been using here, the log probabilities ranged from $-4400$ to $-700$. On my device the smallest value I can meaningfully exponentiate is $-745$; anything smaller results in a value of 0, rendering it impossible to estimate $q$.

To get around this problem, I used the mpfr function in the Rmpfr package. Here is a simple example to show how to exponentiate a highly negative variable $b$. A helper variable $a$ is specified to set the precision, which can then be used to derive the desired result, which is $\text{exp}(b)$.

Everything is fine if $b \ge -745$:

library(Rmpfr)

b <- -745
exp(b)

## [1] 4.94e-324

For $b<-745$, we have floating point issues:

b <- -746
exp(b)

## [1] 0

So, we turn to mpfr to get the desired result. First, specify $a$ with the proper precision:

(a <- mpfr(exp(-100), precBits=64))

## 1 'mpfr' number of precision  64   bits 
## [1] 3.72007597602083612001e-44

And now we can calculate $\text{exp}(b)$:

a^(-b/100)

## 1 'mpfr' number of precision  64   bits 
## [1] 1.03828480951583225515e-324

The code to calculate $\text{log}(p_{c_m})$ extracts the draws of $lp$ from the sample, exponentiates, and sums to get the desired result.

library(glue)
library(data.table)
library(ggplot2)

a <- mpfr(exp(-100), precBits=64)

qc <- NULL
for(m in 1:M) {
  lp.i <- glue("lp[{m}]")
  le <- rstan::extract(fit3, pars = lp.i)[[1]]
  q <- a^(-le/100)
  qc[m] <- sum(q)
}

qcs <- mpfr2array(qc, dim = M)
lps <- log(qcs/sum(qcs))
dps <- data.table(c, y=as.numeric(lps))

ggplot(data = dps, aes(x = c, y = y)) +
  geom_vline(xintercept = -0.7, color = "red", lty = 3) +
  geom_line(color = "grey60") +
  geom_point(size = 1) +
  theme(panel.grid = element_blank()) +
  ylab("log(probability)") +
  xlab("threshold from low to not low") +
  scale_y_continuous(limits = c(-800, 0))
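As a cross-check on the arbitrary-precision calculation, the same normalized log probabilities can be obtained in base R with the log-sum-exp trick, which factors out the largest log value before exponentiating. This is a different technique from the mpfr approach used above, shown only as a sketch: `lp_mat`, `R`, `M`, and `log_sum_exp` are hypothetical names, and the draws are simulated stand-ins rather than output from the fitted model.

```r
# Hypothetical stand-in for the extracted lp draws: R draws (rows) by M candidates (columns)
set.seed(1)
R <- 200
M <- 5
centers <- c(-4400, -2000, -900, -705, -1500)
lp_mat <- matrix(rnorm(R * M, mean = rep(centers, each = R), sd = 2), nrow = R)

# log(sum(exp(x))) without underflow: factor out max(x) before exponentiating
log_sum_exp <- function(x) {
  mx <- max(x)
  mx + log(sum(exp(x - mx)))
}

log_q <- apply(lp_mat, 2, log_sum_exp) - log(R)  # log q(c_m | data)
log_p <- log_q - log_sum_exp(log_q)              # log p(c_m | data)

round(log_p, 1)
```

Because everything stays on the log scale, no value ever needs to be exponentiated below the double-precision limit, and the result can be plotted directly as the log probability.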

A latent threshold model to dichotomize a continuous predictor

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 24 Nov 2020 00:00:00 +0000

This is the context. In the convalescent plasma pooled individual patient level meta-analysis we are conducting as part of the COMPILE study, there is great interest in understanding the impact of antibody levels on outcomes. (I’ve described various aspects of the analysis in previous posts, most recently here). In other words, not all convalescent plasma is equal.

If we had a clear measure of antibodies, we could model the relationship of these levels with the outcome of interest, such as health status as captured by the WHO 11-point scale or mortality, and call it a day. Unfortunately, at the moment, there is no single measure across the RCTs included in the meta-analysis (though that may change). Until now, the RCTs have used a range of measurement “platforms” (or technologies), which may measure different components of the convalescent plasma using different scales. Given these inconsistencies, it is challenging to build a straightforward model that simply estimates the relationship between antibody levels and clinical outcomes.

The study team is coalescing around the idea of comparing the outcomes of patients who received low levels of antibodies with patients who received not low levels (as well as with patients who received no antibodies). One thought (well, really my thought) is to use a model that can jointly estimate the latent threshold and, given that threshold, estimate a treatment effect. Importantly, this model would need to accommodate multiple antibody measures and their respective thresholds.

To tackle this problem, I have turned to a class of models called change point or threshold models. My ultimate goal is to fit a Bayesian model that can estimate threshold and effect-size parameters for any number of RCTs using any number of antibody measures. At this point we are a few steps removed from that, so in this post I’ll start with a simple case of a single RCT and a single antibody measure, and use a maximum likelihood estimation method implemented in the R package chngpt to estimate parameters from a simulated data set. In a subsequent post, I’ll implement a Bayesian version of this simple model, and perhaps in a third post, I’ll get to the larger model that incorporates more complexity.

Visualizing simple scenarios

Change point models appear to be most commonly used in the context of time series data where the focus is on understanding if a trend or average has shifted at a certain point in a sequence of measurements over time. In the case of COMPILE, the target would be a threshold for a continuous antibody measure across multiple patients; we are interested in measuring the average outcome for patients on either side of the threshold.

The following plots show three scenarios. On the left, there is no threshold; the distribution of continuous outcomes is the same across all values of the antibody measure. In the middle, there is a threshold at $-0.7$; patients with antibody levels below $-0.7$ have a lower average outcome than patients with antibodies above $-0.7$. On the right, the threshold is shifted to $0.5$.

The key here is that the outcome is solely a function of the latent categorical status - not the actual value of the antibody level. This may be a little simplistic, because we might expect the antibody level itself to be related to the outcome based on some sort of linear or non-linear relationship rather than the dichotomous relationship we are positing here. However, if we set our sights on detecting a difference in average clinical outcomes for patients categorized as having been exposed to low and not low antibody levels rather than on understanding the full nature of their relationship, this simplification may be reasonable.

Data generation

I think if you see the data generation process, the model and assumptions might make more sense. We start with an antibody level that, for simplicity’s sake, has a standard normal distribution. In this simulation, the latent group status (i.e. low vs. not low) is not determined completely by the threshold (though it certainly could be); here, the probability that latent status is not low is about $5\%$ for patients with antibody levels that fall below $-0.7$, but is about $95\%$ for patients who exceed the threshold.

library(simstudy)
set.seed(87654)

d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
              dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status", 
              variance = 1, dist = "normal")

dd <- genData(500, d1)
dd

##       id antibody latent_status       y
##   1:   1  -1.7790             0  0.5184
##   2:   2   0.2423             1  3.2174
##   3:   3  -0.4412             1  1.8948
##   4:   4  -1.2505             0  0.9816
##   5:   5  -0.0552             1  2.9251
##  ---                                   
## 496: 496  -0.4634             1  2.7298
## 497: 497   0.6862             0 -0.0507
## 498: 498  -1.0899             0  0.9680
## 499: 499   2.3395             1  1.9540
## 500: 500  -0.4874             1  3.5238

Simple model estimation

The chngptm function in the chngpt package provides an estimate of the threshold as well as the treatment effect of antibody lying above this latent threshold. The parameters in this simple case are recovered quite well. The fairly narrow $95\%$ confidence interval for the effect, (2.2, 2.8), just misses the true value of $3$. The very narrow $95\%$ CI for the threshold, (-0.73, -0.69), just barely includes the true value of $-0.7$.

library(chngpt)

fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
  data = dd, type="step", family="gaussian")

summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                   est Std. Error* (lower upper) p.value*
## (Intercept)     0.296       0.130 0.0547  0.563 2.26e-02
## antibody>chngpt 2.520       0.139 2.2416  2.787 1.99e-73
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##   -0.70261    0.00924   -0.72712   -0.69092

Alternative scenarios

When there is more ambiguity in the relationship between the antibody threshold and the classification into the two latent classes of low and not low, there is more uncertainty in both the effect and threshold estimates. Furthermore, the effect size estimate is attenuated, since the prediction of the latent class is less successful.

In the next simulation, the true threshold remains at $-0.7$, but the probability that a patient below the threshold actually does not have low levels of antibodies increases to about $21\%$, while the probability that a patient above the threshold does not have low levels of antibodies decreases to $79\%$. There is more uncertainty regarding the threshold, as the $95\%$ CI is (-0.81, -0.58). And the estimated effect, $1.4 \; (1.2, 1.9)$, is attenuated with more uncertainty. Given the added uncertainty in the data generation process, these estimates are what we would expect.

d1 <- updateDef(d1, changevar = "latent_status", 
  newformula = "-1.3 + 2.6 * (antibody > -0.7)")

dd <- genData(500, d1)
fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
  data = dd, type="step", family="gaussian")

summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                   est Std. Error* (lower upper) p.value*
## (Intercept)     0.881       0.159   0.50   1.12 3.05e-08
## antibody>chngpt 1.439       0.173   1.17   1.85 1.09e-16
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##    -0.6298     0.0579    -0.8083    -0.5814
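The membership probabilities quoted for the two scenarios, and the attenuation of the effect, follow directly from the logit formulas in the data definitions. Here is a small sketch of that arithmetic; the helper `scenario` is a hypothetical name, and the "implied difference" assumes the threshold is known exactly, so it is only a rough benchmark for the estimates.

```r
# Probability of "not low" status on either side of the threshold,
# given the intercept and jump in the logit formula for latent_status
scenario <- function(intercept, jump, effect = 3) {
  p_below <- plogis(intercept)          # antibody <= threshold
  p_above <- plogis(intercept + jump)   # antibody >  threshold
  c(p_below = p_below,
    p_above = p_above,
    implied_diff = effect * (p_above - p_below))  # difference in mean y across the threshold
}

s1 <- scenario(-3, 6)       # first simulation:  "-3 + 6 * (antibody > -0.7)"
s2 <- scenario(-1.3, 2.6)   # second simulation: "-1.3 + 2.6 * (antibody > -0.7)"

round(rbind(s1, s2), 3)
```

Under these assumptions the implied differences are roughly 2.7 and 1.7 rather than the latent effect of 3, and both values fall inside the corresponding 95% intervals in the summary output.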

The effect size has an impact on the estimation of a threshold. At the extreme case where there is no effect, the concept of a threshold is not meaningful; we would expect there to be great uncertainty with the estimate for the threshold. As the true effect size grows, we would expect the precision of the threshold estimate to increase as well (subject to the latent class membership probabilities just described). The subsequent plot shows the point estimates and $95\%$ CIs for thresholds at different effect sizes. The true threshold is $0.5$ and effect sizes range from 0 to 2:

This last figure shows that the uncertainty around the effect size estimate is higher at lower levels of true effectiveness. This higher level of uncertainty in the estimated effect is driven by the higher level of uncertainty in the estimate of the threshold at lower effect sizes (as just pointed out above).

With a fundamentally different data generating process

What happens when the underlying data process is quite different from the one we have been imagining? Is the threshold model useful? I would say “maybe not” in the case of a single antibody measurement. I alluded to this a bit earlier in the post, justifying the idea by arguing it might make more sense with multiple types of antibody measurements. We will hopefully find that out if I get to that point. Here, I briefly investigate the estimates we get from a threshold model when the outcome is linearly related to the antibody measurement, and there is in fact no threshold, as in this data set:

d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")  
d1 <- defData(d1, varname = "y", formula = "antibody", variance = 1, dist = "normal")

dd <- genData(500, d1)

The estimated threshold is near the center of the antibody data (which in this case is close to $0$), with a fairly narrow $95\%$ confidence interval. The effect size is essentially a comparison of the means for patients with measurements below $0$ compared to patients above $0$. If this were the actual data generation process, it might be preferable to model the relationship directly using simple linear regression without estimating a threshold.

fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
               data = dd, type="step", family="gaussian")
summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                    est Std. Error* (lower upper) p.value*
## (Intercept)     -0.972       0.162  -1.24 -0.607 2.19e-09
## antibody>chngpt  1.739       0.109   1.58  2.006 1.15e-57
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##    -0.0713     0.2296    -0.3832     0.5170
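For comparison, here is a minimal sketch of the direct alternative mentioned above: a simple linear regression on data with the same linear structure. To keep it self-contained, the data are regenerated with base R's `rnorm` rather than simstudy, so the numbers will not match the data set analyzed above; the seed and object names are arbitrary.

```r
set.seed(2020)
n <- 500
antibody <- rnorm(n)             # standard normal antibody measure
y <- antibody + rnorm(n)         # outcome linear in antibody, no threshold

# Direct model of the linear relationship
lmfit <- lm(y ~ antibody)
coef(lmfit)

# What a step model is effectively estimating when it splits near zero:
# the difference in mean outcome above versus below 0
step_diff <- mean(y[antibody > 0]) - mean(y[antibody <= 0])
step_diff
```

The slope from `lm` describes the relationship across the whole range of the antibody measure, whereas the step difference collapses it into a single contrast that depends on where the split happens to fall.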

Exploring the properties of a Bayesian model using high performance computing

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 10 Nov 2020 00:00:00 +0000

An obvious downside to estimating Bayesian models is that it can take a considerable amount of time merely to fit a model. And if you need to estimate the same model repeatedly, that considerable amount becomes a prohibitive amount. In this post, which is part of a series (last one here) where I’ve been describing various aspects of the Bayesian analyses we plan to conduct for the COMPILE meta-analysis of convalescent plasma RCTs, I’ll present a somewhat elaborate model to illustrate how we have addressed these computing challenges to explore the properties of these models.

While the concept of statistical power may not be part of the Bayesian analytic framework, there are many statisticians who would like to assess this property regardless of the modeling approach. These assessments require us to generate multiple data sets and estimate a model for each. In this case, we’ve found that each run through the MCMC algorithm required to sample from the posterior distribution of the Bayesian model can take anywhere from 7 to 15 minutes on our laptops or desktops. If we want to analyze 1,000 data sets using these methods, my laptop would need to run continuously for roughly five to ten days. And if we want to explore models under different assumptions about the data generating process or prior distributions - well, that would be impossible.

Fortunately, we have access to the High Performance Computing Core at NYU Langone Health (HPC), which enables us to analyze 1,000 data sets in about 90 minutes. Still pretty intensive, but clearly a huge improvement. This post describes how we adapted our simulation and modeling process to take advantage of the power and speed of the HPC.

COMPILE Study design

There are numerous randomized control trials being conducted around the world to evaluate the efficacy of antibodies in convalescent blood plasma in improving outcomes for patients who have been hospitalized with COVID. Because each trial lacks adequate sample size to allow us to draw any definitive conclusions, we have undertaken a project to pool individual level data from these various studies into a single analysis. (I described the general approach and some conceptual issues here and here). The outcome is an 11-point categorical score developed by the WHO, with 0 indicating no COVID infection and 10 indicating death. As the intensity of support increases, the scores increase. For the primary outcome, the score will be based on patient status 14 days after randomization.

This study is complicated by the fact that each RCT is using one of three different control conditions: (1) usual care (unblinded), (2) non-convalescent plasma, and (3) saline solution. The model conceptualizes the three control conditions as three treatments to be compared against the reference condition of convalescent plasma. The overall treatment effect will represent a quantity centered around the three control effects.

Because we have access to individual-level data, we will be able to adjust for important baseline characteristics that might be associated with the prognosis at day 14; these are baseline WHO-score, age, sex, and symptom duration prior to randomization. The primary analysis will adjust for these factors, and secondary analyses will go further to investigate if any of these factors modify the treatment effect. In the example in this post, I will present the model based on a secondary analysis that considers only a single baseline factor, symptom duration; we are allowing for the possibility that the treatment may be more or less effective depending on the symptom duration. (For example, patients who have been sick longer may not respond to the treatment, while those who are treated earlier may.)

Model

Here is the model that I am using (as I mentioned, the planned COMPILE analysis will be adjusting for additional baseline characteristics):

\[ \text{logit} \left(P\left(Y_{kis} \ge y\right)\right) = \tau_{yk} + \beta_s + I_{ki} \left( \gamma_{kcs} + \delta_{kc} \right), \; \; y \in \{1, \dots, L-1\} \text{ with } L \text{ response levels} \]

And here are the assumptions for the prior distributions:

\[ \begin{aligned} \tau_{yk} &\sim t_{\text{student}} \left( \text{df=} 3, \mu=0, \sigma = 5 \right), \; \; \text{monotone within } \text{site } k \\ \beta_s &\sim \text{Normal} \left( \mu=0, \sigma = 5 \right), \qquad \; \;s \in \{1,\dots, S \} \text{ for symptom duration strata}, \beta_1 = 0 \\ \gamma_{kcs} &\sim \text{Normal}\left( \gamma_{cs}, 1 \right), \qquad \qquad \;\;\;\;\;\;c \in \{0, 1, 2\} \text{ for control conditions }, \gamma_{kc1} = 0 \text{ for all } k \\ \gamma_{cs} &\sim \text{Normal}\left( \Gamma_s, 0.25 \right), \qquad \qquad \; \; \gamma_{c1} = 0 \text{ for all } c \\ \Gamma_s &\sim t_{\text{student}} \left( 3, 0, 1 \right), \qquad \qquad \qquad \Gamma_{1} = 0 \\ \delta_{kc} &\sim \text{Normal}\left( \delta_c, \eta \right)\\ \delta_c &\sim \text{Normal}\left( -\Delta, 0.5 \right) \\ \eta &\sim t_{\text{student}}\left(3, 0, 0.25 \right) \\ -\Delta &\sim t_{\text{student}} \left( 3, 0, 2.5 \right) \end{aligned} \]

There are $K$ RCTs. The outcome for the $i$th patient from the $k$th trial on the $L$-point scale at day 14 is $Y_{ki}=y$, $y=0,\dots,L-1$ (although the COMPILE study will have $L = 11$ levels, I will be using $L=5$ to speed up estimation times a bit). $I_{ki}$ indicates the treatment assignment for subject $i$ in the $k$th RCT, $I_{ki} = 0$ if patient $i$ received convalescent plasma (CP) and $I_{ki} = 1$ if the patient was in any control arm. There are three control conditions $C$: standard of care, $C=0$; non-convalescent plasma, $C=1$; saline/LR with coloring, $C=2$; each RCT $k$ is attached to a specific control condition. There are also $S=3$ symptom duration strata: short duration, $s=1$; moderate duration, $s=2$; and long duration, $s=3$. (COMPILE will use five symptom duration strata; again, I am simplifying.)

$\tau_{yk}$ corresponds to the $k$th RCT’s intercept associated with level $y$; the $\tau$’s represent the cumulative log odds for patients in the reference symptom duration group $s=1$ who received CP. Within a particular RCT, all $\tau_{yk}$’s satisfy the monotonicity requirements for the intercepts of the proportional odds model. $\beta_s$, $s \in \{2, 3\}$, is the main effect of symptom duration ($\beta_1 = 0$, where $s=1$ is the reference category). $\gamma_{kcs}$ is the moderating effect of stratum $s$ in RCT $k$; $\gamma_{kc1} = 0$, since $s=1$ is the reference category. $\delta_{kc}$ is the RCT-specific control effect, where RCT $k$ is using control condition $c$.

Each $\gamma_{kcs}$ is normally distributed around a control type/symptom duration mean $\gamma_{cs}$. And each $\gamma_{cs}$ is centered around a pooled mean $\Gamma_s$. The $\delta_{kc}$’s are assumed to be normally distributed around a control-type specific effect $\delta_c$, with variance $\eta$ that will be estimated; the $\delta_c$’s are normally distributed around $-\Delta$ (we take $-\Delta$ as the mean of the distribution to which $\delta_c$ belongs so that $\Delta$ will correspond to the cumulative log-odds ratio for CP relative to control, rather than for control relative to CP). (For an earlier take on these types of models, see here.)
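To make the structure of the outcome model concrete, here is a small sketch of how a single shift on the log-odds scale moves the whole distribution of an ordinal outcome. It assumes the form $\text{logit}(P(Y \ge y)) = \tau_y + \text{shift}$ for one RCT and one stratum; the baseline probabilities are the ones used later in the data generation, the levels are labeled 1 to 5 for convenience, and the shift of 0.7 is just an illustrative value.

```r
base <- c(0.10, 0.35, 0.25, 0.20, 0.10)   # reference-arm probabilities for L = 5 levels
n_levels <- length(base)

# P(Y >= y) for y = 2, ..., L in the reference arm, and the implied intercepts
p_ge <- rev(cumsum(rev(base)))[-1]
tau <- qlogis(p_ge)

# Apply a common shift on the log-odds scale (illustrative value)
shift <- 0.7
p_ge_shift <- plogis(tau + shift)

# Convert the shifted cumulative probabilities back to category probabilities
probs_shift <- -diff(c(1, p_ge_shift, 0))

round(rbind(reference = base, shifted = probs_shift), 3)
```

Because the same shift is added to every intercept, the odds of being at or above any given level are multiplied by the same factor, $\exp(0.7)$, which is the proportional odds assumption.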

Go or No-go

The focus of a Bayesian analysis is the estimated posterior probability distribution of a parameter or parameters of interest, for example the log-odds ratio from a cumulative proportional odds model. At the end of an analysis, we have credibility intervals, means, medians, quantiles - all concepts associated with a probability distribution.

A “Go/No-go” decision process like a hypothesis test is not necessarily baked into the Bayesian method. At some point, however, even if we are using a Bayesian model to inform our thinking, we might want to or have to make a decision. In this case, we might want to recommend (or not) the use of CP for patients hospitalized with COVID-19. Rather than use a hypothesis test to reject or fail to reject a null hypothesis of no effect, we can use the posterior probability to create a decision rule. In fact, this is what we have done.

In the proposed design of COMPILE, the CP therapy will be deemed a success if both of these criteria are met:

\[ P \left( \exp\left(-\Delta\right) < 1 \right) = P \left( OR < 1 \right) > 95\%\]

\[P \left( OR < 0.80 \right) > 50\%\]

The first statement ensures that the posterior probability of a good outcome is very high. If we want to be conservative, we can obviously increase the percentage threshold above $95\%$. The second statement says that there is a decent probability that the treatment effect is clinically meaningful. Again, we can modify the target OR and/or the percentage threshold based on our desired outcome.
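Given a set of posterior draws, the two criteria reduce to two proportions. Here is a minimal sketch under stated assumptions: `Delta_draws` is a hypothetical vector of simulated values standing in for posterior draws of $\Delta$ (it is not output from the model), and the thresholds are the ones given above.

```r
set.seed(123)

# Hypothetical stand-in for 4,000 posterior draws of Delta
Delta_draws <- rnorm(4000, mean = 0.4, sd = 0.15)
OR <- exp(-Delta_draws)

p_eff    <- mean(OR < 1)      # estimate of P(OR < 1)
p_clinic <- mean(OR < 0.80)   # estimate of P(OR < 0.80)

# "Go" only if both criteria are met
go <- (p_eff > 0.95) & (p_clinic > 0.50)

c(p_eff = p_eff, p_clinic = p_clinic, go = go)
```

Tightening either rule only requires changing the cutoffs in the last comparison, which makes it easy to see how sensitive a decision is to the chosen thresholds.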

Goals of the simulation

Since there are no Type I or Type II errors in the Bayesian framework, the concept of power (which is the probability of rejecting the null hypothesis when it is indeed not true) does not logically flow from a Bayesian analysis. However, if we substitute our decision rules for a hypothesis test, we can estimate the probability (call it Bayesian power, though I imagine some Bayesians would object) that we will make a “Go” decision given a specified treatment effect. (To be truly Bayesian, we should impose some uncertainty on what that specific treatment effect is, and calculate a probability distribution of Bayesian power. But I am keeping things simpler here.)

Hopefully, I have provided sufficient motivation for the need to simulate data and fit multiple Bayesian models. So, let’s do that now.

The simulation

I am creating four functions that will form the backbone of this simulation process: s_define, s_generate, s_estimate, and s_extract. Repeated calls to each of these functions will provide us with the data that we need to get an estimate of Bayesian power under our (static) data generating assumptions.

Data definitions

The first definition table, defC1, sets up the RCTs. Each RCT has a specific symptom duration interaction effect $a$ and control treatment effect $b$. To introduce a little variability in sample size, 1/3 of the studies will be larger (150 patients), and 2/3 will be smaller (75 patients).

The remaining tables, defC2, defS, and defC3, define patient-level data. defC2 adds the control group indicator (0 = CP, 1 = standard care, 2 = non-convalescent plasma, 3 = saline) and the symptom duration stratum. defS defines the interaction effect conditional on the stratum. defC3 defines the ordinal categorical outcome.

s_define <- function() {
  
  defC1 <- defDataAdd(varname = "a",formula = 0, variance = .005, dist = "normal")    
  defC1 <- defDataAdd(defC1,varname = "b",formula = 0, variance= .01, dist = "normal")
  defC1 <- defDataAdd(defC1,varname = "size",formula = "75+75*large", dist = "nonrandom") 
  
  defC2 <- defDataAdd(varname="C_rv", formula="C * control", dist = "nonrandom") 
  defC2 <- defDataAdd(defC2, varname = "ss", formula = "1/3;1/3;1/3", 
                      dist = "categorical")
  
  defS <- defCondition(
    condition = "ss==1",  
    formula = 0, 
    dist = "nonrandom")
  defS <- defCondition(defS,
    condition = "ss==2",  
    formula = "(0.09 + a) * (C_rv==1) + (0.10 + a) * (C_rv==2) + (0.11 + a) * (C_rv==3)", 
    dist = "nonrandom")
  defS <- defCondition(defS,
    condition = "ss==3",  
    formula = "(0.19 + a) * (C_rv==1) + (0.20 + a) * (C_rv==2) + (0.21 + a) * (C_rv==3)", 
    dist = "nonrandom")
  
  defC3 <- defDataAdd(
    varname = "z", 
    formula = "0.1*(ss-1)+z_ss+(0.6+b)*(C_rv==1)+(0.7+b)*(C_rv==2)+(0.8+b)*(C_rv==3)", 
    dist = "nonrandom")
  
  list(defC1 = defC1, defC2 = defC2, defS = defS, defC3 = defC3)
  
}

Data generation

The data generation process draws on the definition tables to create an instance of an RCT data base. This process includes a function genBaseProbs that I described previously.

s_generate <- function(deflist, nsites) {
  
  genBaseProbs <- function(n, base, similarity, digits = 2) {
    
    n_levels <- length(base)
    
    x <- gtools::rdirichlet(n, similarity * base) 
    
    x <- round(floor(x*1e8)/1e8, digits)
    xpart <- x[, 1:(n_levels-1)]    
    partsum <- apply(xpart, 1, sum)
    x[, n_levels] <- 1 - partsum
    
    return(x)
  }
  
  basestudy <- genBaseProbs(n = nsites,
                            base =  c(.10, .35, .25, .20, .10),
                            similarity = 100)
  
  dstudy <- genData(nsites, id = "study")   
  dstudy <- trtAssign(dstudy, nTrt = 3, grpName = "C")
  dstudy <- trtAssign(dstudy, nTrt = 2, strata = "C", grpName = "large", ratio = c(2,1))
  dstudy <- addColumns(deflist[['defC1']], dstudy)
  
  dind <- genCluster(dstudy, "study", numIndsVar = "size", "id")
  dind <- trtAssign(dind, strata="study", grpName = "control") 
  dind <- addColumns(deflist[['defC2']], dind)
  dind <- addCondition(deflist[["defS"]], dind, newvar = "z_ss")
  dind <- addColumns(deflist[['defC3']], dind)
  
  dl <- lapply(1:nsites, function(i) {
    b <- basestudy[i,]
    dx <- dind[study == i]
    genOrdCat(dx, adjVar = "z", b, catVar = "ordY")
  })
  
  rbindlist(dl)[]
}

Model estimation

The estimation involves creating a data set for Stan and sampling from the Bayesian model. The Stan model is included in the addendum.

s_estimate <- function(dd, s_model) {
  
  N <- nrow(dd)                               ## number of observations 
  L <- dd[, length(unique(ordY))]             ## number of levels of outcome 
  K <- dd[, length(unique(study))]            ## number of studies 
  y <- as.numeric(dd$ordY)                    ## individual outcome 
  kk <- dd$study                              ## study for individual 
  ctrl <- dd$control                          ## treatment arm for individual 
  cc <- dd[, .N, keyby = .(study, C)]$C       ## specific control arm for study 
  ss <- dd$ss
  x <- model.matrix(ordY ~ factor(ss), data = dd)[, -1] 
  
  studydata <- list(N=N, L= L, K=K, y=y, kk=kk, ctrl=ctrl, cc=cc, ss=ss, x=x)
  
  fit <-  sampling(s_model, data=studydata, iter = 4000, warmup = 500,
                   cores = 4L, chains = 4, control = list(adapt_delta = 0.8))
  fit
}

Estimate extraction

The last step is the extraction of summary data from the posterior probability distributions. I am collecting quantiles of the key parameters, including $\Delta$ and $OR = \exp(-\Delta)$. For the Bayesian power analysis, I am estimating the probability of falling below the two thresholds for each data set. And finally, I want to get a sense of the quality of each estimation process by recovering the number of divergent transitions that resulted from the MCMC algorithm (more on that here).

s_extract <- function(iternum, mcmc_res) {
  
  posterior <- as.array(mcmc_res)
  
  x <- summary(
    mcmc_res, 
    pars = c("Delta", "delta", "Gamma", "beta", "alpha", "OR"),
    probs = c(0.025, 0.5, 0.975)
  )
  
  dpars <- data.table(iternum = iternum, par = rownames(x$summary), x$summary)
  
  p.eff <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 1)
  p.clinic <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 0.8)
  dp <- data.table(iternum = iternum, p.eff = p.eff, p.clinic = p.clinic)
  
  sparams <- get_sampler_params(mcmc_res, inc_warmup=FALSE)
  n_divergent <- sum(sapply(sparams, function(x) sum(x[, 'divergent__'])))
  ddiv <- data.table(iternum, n_divergent)
  
  list(ddiv = ddiv, dpars = dpars, dp = dp)
}

Replication

Now we want to put all these pieces together and repeatedly execute those four functions and save the results from each. I’ve described using lapply to calculate power in a much more traditional setting. We’re going to take the same approach here, except on steroids, replacing lapply not with mclapply, the parallel version, but with Slurm_lapply, which is a function in the slurmR package.

Slurm (Simple Linux Utility for Resource Management) is a HPC cluster job scheduler. slurmR is a wrapper that mimics many of the R parallel package functions, but in a Slurm environment. The strategy here is to define a meta-function (iteration) that itself calls the four functions already described, and then call that function repeatedly. Slurm_lapply does that, and rather than allocating the iterations to different cores on a computer like mclapply does, it allocates the iterations to different nodes on the HPC, using what is technically called a job array. Each node is essentially its own computer. In addition to that, each node has multiple cores, so we can run the different MCMC chains in parallel within a node; we have parallel processes within a parallel process. I have access to 100 nodes at any one time, though I find I don’t get much performance improvement if I go over 90, so that is what I do here. Within each node, I am using 4 cores. I am running 1,980 iterations, so that is 22 iterations per node. As I mentioned earlier, all of this runs in about an hour and a half.

The following code includes the “meta-function” iteration, the compilation of the Stan model (which only needs to be done once, thankfully), the Slurm_lapply call, and the Slurm batch code that I need to execute to get the whole process started on the HPC, which is called Big Purple here at NYU. (All of the R code goes into a single .R file, the batch code is in a .slurm file, and the Stan code is in its own .stan file.)

iteration <- function(iternum, s_model, nsites) {

    s_defs <- s_define()
    s_dd <- s_generate(s_defs, nsites = nsites)
    s_est <- s_estimate(s_dd, s_model)
    s_res <- s_extract(iternum, s_est)

    return(s_res)
    
}

library(simstudy)
library(rstan)
library(data.table)
library(slurmR)

rt <- stanc("/.../r/freq_bayes.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)

job <- Slurm_lapply(
  X = 1:1980, 
  iteration, 
  s_model = sm,
  nsites = 9,
  njobs = 90, 
  mc.cores = 4,
  tmp_path = "/.../scratch",
  overwrite = TRUE,
  job_name = "i_fb",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_define", "s_generate", "s_estimate", "s_extract"),
  plan = "wait")

job
res <- Slurm_collect(job)

diverge <- rbindlist(lapply(res, function(l) l[["ddiv"]]))
ests <- rbindlist(lapply(res, function(l) l[["dpars"]]))
probs <- rbindlist(lapply(res, function(l) l[["dp"]]))

save(diverge, ests, probs, file = "/.../data/freq_bayes.rda")

#!/bin/bash
#SBATCH --job-name=fb_parent
#SBATCH --mail-type=END,FAIL                      # send email if the job end or fail
#SBATCH --mail-user=keith.goldfeld@nyulangone.org
#SBATCH --partition=cpu_short
#SBATCH --time=3:00:00                            # Time limit hrs:min:sec
#SBATCH --output=fb.out                           # Standard output and error log

module load r/3.6.3
cd /.../r

Rscript --vanilla fb.R
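Before submitting anything to the cluster, it can be worth checking the iterate-and-collect pattern locally with a plain lapply. The sketch below is only that: `iteration_stub` is a hypothetical stand-in for iteration that returns made-up values in a similar list-of-tables shape, and base R data frames are used in place of data.table to keep the example self-contained.

```r
# Hypothetical stub with a return shape similar to iteration(): a named list of tables
iteration_stub <- function(iternum) {
  list(
    ddiv = data.frame(iternum = iternum, n_divergent = 0L),
    dp   = data.frame(iternum = iternum, p.eff = runif(1), p.clinic = runif(1))
  )
}

set.seed(42)

# Locally the calls run one after another; on the HPC, Slurm_lapply
# distributes the same calls across nodes
res <- lapply(1:6, iteration_stub)

# Stack each component across iterations, as is done after Slurm_collect
diverge <- do.call(rbind, lapply(res, function(l) l[["ddiv"]]))
probs   <- do.call(rbind, lapply(res, function(l) l[["dp"]]))

probs
```

If the stacked tables look right for a handful of iterations, the same collection code should carry over once the stub is replaced by the real function and lapply by its Slurm counterpart.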

Results

Each of the three extracted data tables is combined across simulations and the results are saved to an .rda file, which can be loaded locally in R and summarized. In this case, we are particularly interested in the Bayesian power estimate, which is the proportion of data sets that would result in a “go” decision (a recommendation to strongly consider using the intervention).

However, before we consider that, we should first get a rough idea about how many replications had divergence issues, which we extracted into the diverge data table. For each replication, we used four chains of length 3,500 each (after the 500 warm-up samples), accounting for a total of 14,000 post-warm-up draws. Here is the proportion of replications with at least one divergent transition:

load("DataBayesCOMPILE/freq_bayes.rda")
diverge[, mean(n_divergent > 0)]

## [1] 0.102

While 10% of replications with at least one divergent transition might seem a little high, we can take some comfort from the fact that almost all replications had fewer than 35 (0.25%) divergent transitions:

diverge[, mean(n_divergent < 35)]

## [1] 0.985

To get a general sense of how well our model is working, we can plot the distribution of posterior medians. In particular, this will allow us to assess how well the model is recovering the values used in the data generating process. In this case, I am excluding the 29 replications with 35 or more divergent transitions:

Finally, we are ready to report the estimated Bayesian power (again, using the replications with a limited number of divergent transitions) and show the distribution of probabilities.

probs_d <- merge(probs, diverge, by = "iternum")[n_divergent < 35]
probs_d[, mean(p.eff > 0.95 & p.clinic > 0.50)]

## [1] 0.726
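Since the power estimate is itself a proportion computed from a finite number of simulated data sets, it comes with some Monte Carlo error. Here is a rough sketch of that calculation, assuming the replications are independent and using the binomial standard error; the count of retained replications is taken as the 1,980 iterations minus the 29 that were excluded.

```r
power_hat <- 0.726      # estimated Bayesian power reported above
n_kept <- 1980 - 29     # replications retained after excluding those with 35+ divergences

# Binomial standard error of a proportion and an approximate 95% interval
mc_se <- sqrt(power_hat * (1 - power_hat) / n_kept)
mc_ci <- power_hat + c(-1, 1) * qnorm(0.975) * mc_se

round(c(se = mc_se, lower = mc_ci[1], upper = mc_ci[2]), 3)
```

With close to 2,000 replications the simulation error is about one percentage point, so the number of replications is not the main source of uncertainty in the power estimate.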

So, given an actual effect $OR=\exp(-0.70) = 0.50$, we would conclude with a decision to go ahead with the therapy with 73% probability. However, a single estimate of power based on one effect size is a bit incomplete; it would be preferable to assess power under numerous scenarios of effect sizes and perhaps prior distribution assumptions to get a more complete picture. And if you have access to a HPC, this may actually be something you can do in a realistic period of time.

Addendum

The Stan model that implements the model described at the outset actually differs from that model in two key ways. First, there is a parameter $\alpha$ that appears in the outcome model, which represents an overall intercept across all studies. Ideally, we wouldn’t need to include this parameter since we want to fix it at zero, but the model behaves very poorly without it. We do include it, but with a highly restrictive prior that will constrain it to be very close to zero. The second difference is that standard normal priors appear in the model; this is to alleviate issues related to divergent transitions, which I described in a previous post.

data {
    int<lower=0> N;                // number of observations
    int<lower=2> L;                // number of WHO categories
    int<lower=1> K;                // number of studies
    int<lower=1,upper=L> y[N];     // vector of categorical outcomes
    int<lower=1,upper=K> kk[N];    // site for individual
    int<lower=0,upper=1> ctrl[N];  // treatment or control
    int<lower=1,upper=3> cc[K];    // specific control for site
    int<lower=1,upper=3> ss[N];    // strata
    row_vector[2] x[N];            // strata indicators  N x 2 matrix
}

parameters {
  
  real Delta;               // overall control effect
  vector[2] Gamma;          // overall strata effect
  real alpha;               // overall intercept for treatment
  ordered[L-1] tau[K];      // cut-points for cumulative odds model (K X [L-1] matrix)
  
  real<lower=0>  eta_0;     // sd of delta_k (around delta)

  // non-centered parameterization
  
  vector[K] z_ran_rx;      // site-specific effect 
  vector[2] z_phi[K];      // K X 2 matrix 
  vector[3] z_delta;
  vector[2] z_beta;
  vector[2] z_gamma[3];    // 3 X 2 matrix
}

transformed parameters{ 
  
  vector[3] delta;          // control-specific effect
  vector[K] delta_k;        // site specific treatment effect
  vector[2] gamma[3];       // control-specific duration strata effect (3 X 2 matrix)
  vector[2] beta;           // covariate estimates of ss
  vector[2] gamma_k[K];     // site-specific duration strata effect (K X 2 matrix)
  vector[N] yhat;

  
  delta = 0.5 * z_delta + Delta; // was 0.1
  beta = 5 * z_beta;
  
  for (c in 1:3) 
    gamma[c] = 0.25 * z_gamma[c] + Gamma;
  
  for (k in 1:K){
    delta_k[k] = eta_0 * z_ran_rx[k] + delta[cc[k]]; 
  }
  
  for (k in 1:K)
    gamma_k[k] = 1 * z_phi[k] + gamma[cc[k]];

  for (i in 1:N)  
    yhat[i] = alpha + x[i] * beta + ctrl[i] * (delta_k[kk[i]] + x[i]*gamma_k[kk[i]]);
}

model {
  
  // priors
  
  z_ran_rx ~ std_normal(); 
  z_delta ~ std_normal();
  z_beta ~ std_normal();

  alpha ~ normal(0, 0.25);
  eta_0 ~ student_t(3, 0, 0.25);
  
  Delta ~ student_t(3, 0, 2.5);
  Gamma ~ student_t(3, 0, 1);
  
  for (c in 1:3)
      z_gamma[c] ~ std_normal();
  
  for (k in 1:K)
      z_phi[k] ~ std_normal();
      
  for (k in 1:K)
      tau[k] ~ student_t(3, 0, 5);
  
  // outcome model
  
  for (i in 1:N)
    y[i] ~ ordered_logistic(yhat[i], tau[kk[i]]);
}

generated quantities {
  
  real OR;
  OR = exp(-Delta); 
  
}
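The transformed parameters block builds each site-level effect by shifting and scaling a standard normal variable rather than sampling the effect directly. The R sketch below, which uses illustrative values for delta and eta_0 that are not taken from any model fit, suggests why the two constructions describe the same distribution:

```r
set.seed(2021)

delta <- -0.7    # assumed control-specific effect (illustrative value only)
eta_0 <- 0.25    # assumed between-site sd (illustrative value only)
K <- 100000      # large number of draws, used only to compare distributions

# centered version: draw the site effects directly
delta_k_centered <- rnorm(K, mean = delta, sd = eta_0)

# non-centered version: draw standard normal z, then shift and scale,
# mirroring delta_k[k] = eta_0 * z_ran_rx[k] + delta[cc[k]] in the Stan code
z <- rnorm(K)
delta_k_noncentered <- eta_0 * z + delta

round(c(mean(delta_k_centered), mean(delta_k_noncentered)), 2)
round(c(sd(delta_k_centered), sd(delta_k_noncentered)), 2)
```

The sampler works with the z variables, whose scale does not depend on eta_0, which is the usual motivation for this parameterization when divergences appear.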

A refined brute force method to inform simulation of ordinal response data

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 27 Oct 2020 00:00:00 +0000

Francisco, a researcher from Spain, reached out to me with a challenge. He is interested in exploring various models that estimate correlation across multiple responses to survey questions. This is the context:

He doesn’t have access to actual data, so to explore analytic methods he needs to simulate responses.
It would be ideal if the simulated data reflect the properties of real-world responses, some of which can be gleaned from the literature.
The studies he’s found report only means and standard deviations of the ordinal data, along with the correlation matrices, but not probability distributions of the responses.
He’s considering simstudy for his simulations, but the function genOrdCat requires a set of probabilities for each response measure; it doesn’t seem like simstudy will be helpful here.

Ultimately, we needed to figure out whether we can use the empirical means and standard deviations to derive probabilities that will yield those same means and standard deviations when the data are simulated. I thought about this for a bit, and came up with a bit of a work-around; the approach seems to work decently and doesn’t require any outrageous assumptions.

I might have kept this between the two of us, but in the process of looking more closely at my solution, I generated a plot that was so beautiful and interesting that I needed to post it. And since I am posting the image, I thought I might as well go ahead and describe the solution in case anyone else might find it useful. But first, the plot:

A little more detail

In the simplest scenario, we want to simulate responses from a single survey question with responses ranging from 1 to 7, where 1 might signify totally disagree and 7 would mean totally agree, with gradations in between. Responses collected from a population will be distributed across the seven categories, and the proportion of responses that fall within each category represents the probability of a response.

To inform the simulation, we have a journal article that reports only a mean and standard deviation from responses to that same question collected in an earlier study. The idea is to find the probabilities for the possible responses that correspond to those observed means and standard deviations. That is, how do we go from the mean and standard deviation to a set of probabilities?

The reverse - going from known probabilities to a mean response and standard deviation - is much easier: we just calculate the weighted mean and weighted standard deviations, where the weights are the probabilities.

For example, say the probability distribution of the seven categorical responses is 21%, 20%, 18%, 15%, 13%, 9%, and 4% responding 1, 2, … , and 7, respectively, and represented by this histogram:

Under this distribution the weighted mean and standard deviation are 3.2 and 1.8, respectively:

weighted.mean(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))

## [1] 3.2

weighted.sd(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))

## [1] 1.8
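Note that weighted.sd is not a base R function, and its source is not shown in this post. A minimal version, assuming the weights are probabilities and that the population form of the variance is intended, might look like this:

```r
# Hypothetical helper (not the author's code): weighted standard deviation,
# treating the weights as probabilities
weighted.sd <- function(x, w) {
  w <- w / sum(w)                  # normalize in case of rounding
  mu <- sum(w * x)                 # weighted mean
  sqrt(sum(w * (x - mu)^2))        # square root of the weighted variance
}

p <- c(.21, .20, .18, .15, .13, .09, .04)
round(c(weighted.mean(1:7, p), weighted.sd(1:7, p)), 1)
```

With this definition the example distribution gives the same rounded values reported above.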

The brute force approach

My first thought about how to use $\mu$ and $\sigma$ was simple, if a bit crude. Generate a slew of probabilities (like a million or so) and calculate the weighted mean and standard deviation for each distribution. I would look for the probabilities that yielded values that were close to my target (i.e. those that had been reported in the literature).
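As a rough sketch of that idea (the generator used for the post is not shown, so the normalized exponential draws and the tolerance below are assumptions made for illustration), the search could be written in base R as:

```r
set.seed(8271)

n_draws <- 100000
K <- 7
x <- 1:K

# Random probability vectors: normalized exponential draws, which are
# uniform on the simplex (an assumption; any random generator would do)
e <- matrix(rexp(n_draws * K), nrow = n_draws, ncol = K)
p <- e / rowSums(e)

# weighted mean and sd for every candidate distribution
mu <- as.vector(p %*% x)
sigma <- sqrt(as.vector(p %*% x^2) - mu^2)

# keep the candidates that land within a small tolerance of the target
tol <- 0.02
hits <- which(abs(mu - 3.2) < tol & abs(sigma - 1.8) < tol)

length(hits)              # number of candidate distributions found
round(p[hits[1], ], 3)    # one of them
```

Typically more than one candidate survives the filter, which hints at the lack of uniqueness discussed next.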

There are a few drawbacks to this approach. First, it is not particularly systematic, since we are generating the probabilities randomly, and even though we have large numbers, we are not guaranteed to generate combinations that reflect our targets. Second, there is no reason to think that the randomly generated distributions will look like the true distribution. And third, there is no reason to think that, even if we do find a match, the distribution is unique.

I actually went ahead and implemented this approach and found two distributions that also yield $\mu$ = 3.2 and $\sigma$ = 1.8 (truth be told, I did this part first and then found the distribution above using the method I will describe in a second):

Here are the $\mu$’s and $\sigma$’s for the distributions on the left and right:

p_left <- c(0.218, 0.174, 0.170, 0.206, 0.134, 0.022, 0.077)
c(weighted.mean(1:7, p_left), weighted.sd(1:7, p_left))

## [1] 3.2 1.8

p_right <- c(0.185, 0.185, 0.247, 0.217, 0.011, 0.062, 0.092)
c(weighted.mean(1:7, p_right), weighted.sd(1:7, p_right))

## [1] 3.2 1.8

Drawing on the beta distribution

Thinking about probabilities always draws me to the beta family of distributions, continuous distributions defined on the interval from 0 to 1. These distributions are parameterized with two shape values, often referred to as $a$ and $b$. Here are a few probability density functions (pdf’s) for $(a,b)$ pairs of (1, 1.6) in yellow, (2, 4) in red, and (2, 2) in blue:
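The figure itself is not reproduced here, but curves like these can be drawn with dbeta. A minimal base R sketch (the plotting choices are mine, not the original figure code):

```r
x <- seq(0.001, 0.999, length.out = 500)

shapes <- list(c(1, 1.6), c(2, 4), c(2, 2))    # the (a, b) pairs named above
cols <- c("gold", "red", "blue")

# one column of density values per (a, b) pair
dens <- sapply(shapes, function(s) dbeta(x, s[1], s[2]))

matplot(x, dens, type = "l", lty = 1, lwd = 2, col = cols,
        xlab = "x", ylab = "density")
legend("topright", legend = c("(1, 1.6)", "(2, 4)", "(2, 2)"),
       col = cols, lty = 1, lwd = 2, title = "(a, b)")
```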

I had an idea that generating different pdf’s based on different values of $a$ and $b$ might provide a more systematic way of generating probabilities. If we carve the pdf into $K$ sections (where $K$ is the number of responses, in our case 7), then the area under the pdf in the $k$th slice could provide the probability for the $k$th response. Since each pdf is unique (determined by specific values of $a$ and $b$), this would ensure different (i.e. unique) sets of probabilities to search through.

Using the example from above where $a$ = 2 and $b$ = 4, here is how the slices look based on the seven categories:

The cumulative probability at each slice $x \in \{1, ..., 7\}$ is $P(X < x/7)$, and can be calculated in R with the function pbeta:

z <- pbeta((1:7)/7, 2, 4)
z

## [1] 0.15 0.44 0.71 0.89 0.97 1.00 1.00

The probability for each category is the difference between successive cumulative probabilities, $P(X < x/7) - P(X < (x-1)/7)$, and is calculated easily:

p <- z - c(0, z[-7])
p

## [1] 0.1518 0.2904 0.2684 0.1786 0.0851 0.0239 0.0018

This is the transformed probability distribution from the continuous beta scale to the discrete categorical scale:

And finally here are $\mu$ and $\sigma$ associated with these values of $a$ and $b$:

c(weighted.mean(1:7, p), weighted.sd(1:7, p))

## [1] 2.8 1.3

Brute force, refined

If we create a grid of $(a, b)$ values, there will be an associated, and unique, set of probabilities for each pair derived from slicing the pdf into $K$ sections. And for each of these sets of probabilities, we can calculate the means and standard deviations. We then find the $(\mu, \sigma)$ pair that is closest to our target. While this idea is not that much better than the brute force approach suggested above, at least it is now systematic. If we do it in two steps, first by searching for the general region and then zooming in to find a specific set of probabilities, we can really speed things up and use less memory.

Is limiting the search to beta-based distributions justifiable? It might depend on the nature of responses in a particular case, but it does seem reasonable; most importantly, it assures fairly well-behaved distributions that could plausibly reflect a wide range of response patterns. Barring any additional information about the distributions, then, I would have no qualms using this approach. (If it turns out that this is a common enough problem, I would even consider implementing the algorithm as a simstudy function.)

Now, it is time to reveal the secret of the plot (if you haven’t figured it out already). Each point is just the $(\mu, \sigma)$ pair generated by a specific $(a, b)$ pair. Here is the code, implementing the described algorithm:

get_abp <- function(a, b, size) {
  
  x <- 1:size
  
  z <- pbeta((1:size)/size, a, b)
  p <- z - c(0, z[-size])
  
  sigma <- weighted.sd(x, p)
  mu <- weighted.mean(x, p)
  
  data.table(a, b, mu, sigma, t(p))
}

get_p <- function(a, b, n) {
  ab <- asplit(expand.grid(a = a, b = b), 1)
  rbindlist(lapply(ab, function(x) get_abp(x[1], x[2],  n)))
}

a <- seq(.1, 25, .1)
b <- seq(.1, 25, .1)

ab_res <- get_p(a, b, 7)

We can fill in the plot with more points by increasing the range of $a$ and $b$ that we search, but creating such a huge table of look-up values is time consuming and starts to eat up memory. In any case, there is no need, because we will refine the search by zooming in on the area closest to our target.

Here is the plot again, based on $a$’s and $b$’s ranging from 0.1 to 25, with the superimposed target pair $(\mu =3.2, \sigma = 1.8)$:
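The image is not reproduced here. A self-contained base R sketch that draws a similar cloud of $(\mu, \sigma)$ points is shown below; it uses a coarser grid than the one above to keep it quick, and the helper mu_sigma is a hypothetical stand-in for get_abp:

```r
# Hypothetical helper: slice a beta(a, b) pdf into `size` equal-width pieces
# and return the implied mean and sd of the categories 1, ..., size
mu_sigma <- function(a, b, size = 7) {
  z <- pbeta((1:size) / size, a, b)
  p <- diff(c(0, z))
  mu <- sum(p * (1:size))
  c(mu = mu, sigma = sqrt(sum(p * ((1:size) - mu)^2)))
}

# coarser grid than in the post (steps of 0.5 rather than 0.1)
grid <- expand.grid(a = seq(.1, 25, .5), b = seq(.1, 25, .5))
ms <- t(mapply(mu_sigma, grid$a, grid$b))

plot(ms[, "mu"], ms[, "sigma"], pch = 16, cex = .3, col = "grey40",
     xlab = expression(mu), ylab = expression(sigma))
points(3.2, 1.8, pch = 16, cex = 1.5, col = "red")   # target pair
```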

To zoom in, we first find the point in the grid that is closest to our target (based on Euclidean distance). We then define a finer grid around this point, and search again for the closest point. We do have to be careful that we do not search for invalid values of $a$ and $b$ (i.e. $a \le 0$ or $b \le 0$). Once we find our point, we have the associated probabilities:

t_mu <- 3.2
t_s <- 1.8

ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
close_point <- ab_res[distance == min(distance), .(a, b, distance)]
  
a_zoom <- with(close_point, seq(a - .25, a + .25, length = 75))
b_zoom <- with(close_point, seq(b - .25, b + .25, length = 75))
  
a_zoom <- a_zoom[a_zoom > 0]
b_zoom <- b_zoom[b_zoom > 0]
  
res_zoom <- get_p(a_zoom, b_zoom, 7)

Here is the new search region:

And the selection of the point:

res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
res_zoom[distance == min(distance)]

##       a   b  mu sigma   V1  V2   V3   V4   V5   V6    V7 distance
## 1: 0.97 1.6 3.2   1.8 0.22 0.2 0.17 0.15 0.12 0.09 0.046   0.0021

Applying the beta-search to a bigger problem

To conclude, I’ll turn to Francisco’s more ambitious goal of simulating correlated responses to multiple questions. In this case, we will assume four questions, all with responses ranging from 1 to 7. The target ($\mu, \sigma$) pairs taken from the (hypothetical) journal article are:

targets <- list(c(2.4, 0.8), c(4.1, 1.2), c(3.4, 1.5), c(5.8, 0.8))

The correlation matrix taken from this same article is:

corMat <- matrix(c(
  1.00, 0.09, 0.11, 0.05,
  0.09, 1.00, 0.35, 0.16,
  0.11, 0.35, 1.00, 0.13,
  0.05, 0.16, 0.13, 1.00), nrow=4,ncol=4)

The get_target_prob function implements the search algorithm described above:

get_target_prob <- function(t_mu, t_s, ab_res) {
  
  ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
  close_point <- ab_res[distance == min(distance), .(a, b, distance)]
  
  a_zoom<- with(close_point, seq(a - .25, a + .25, length = 75))
  b_zoom<- with(close_point, seq(b - .25, b + .25, length = 75))
  
  a_zoom <- a_zoom[a_zoom > 0]
  b_zoom <- b_zoom[b_zoom > 0]
  
  res_zoom <- get_p(a_zoom, b_zoom, 7)
  
  res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
  baseprobs <- as.vector(res_zoom[distance == min(distance), paste0("V", 1:7)], "double")
  
  baseprobs
}

Calling the function conducts the search and provides probabilities for each question:

probs <- lapply(targets, function(x) get_target_prob(x[1], x[2], ab_res))
(probs <- do.call(rbind, probs))

##         [,1]    [,2]   [,3]  [,4]   [,5]    [,6]    [,7]
## [1,] 1.1e-01 4.8e-01 0.3334 0.077 0.0059 8.8e-05 3.7e-08
## [2,] 7.9e-03 8.5e-02 0.2232 0.306 0.2537 1.1e-01 1.3e-02
## [3,] 1.0e-01 2.1e-01 0.2345 0.209 0.1502 7.9e-02 1.7e-02
## [4,] 2.5e-08 5.4e-05 0.0036 0.051 0.2638 5.0e-01 1.8e-01
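Before simulating, the implied means and standard deviations can be checked directly from these probabilities. The sketch below re-enters the values as printed above (they are rounded, so each row is renormalized first) and computes the weighted mean and sd by hand:

```r
# probabilities as printed above (rounded, so rows sum only approximately to 1)
probs <- rbind(
  c(1.1e-01, 4.8e-01, 0.3334, 0.077, 0.0059, 8.8e-05, 3.7e-08),
  c(7.9e-03, 8.5e-02, 0.2232, 0.306, 0.2537, 1.1e-01, 1.3e-02),
  c(1.0e-01, 2.1e-01, 0.2345, 0.209, 0.1502, 7.9e-02, 1.7e-02),
  c(2.5e-08, 5.4e-05, 0.0036, 0.051, 0.2638, 5.0e-01, 1.8e-01)
)

implied <- t(apply(probs, 1, function(p) {
  p <- p / sum(p)                       # renormalize after rounding
  mu <- sum(p * 1:7)                    # weighted mean of the responses
  c(mu = mu, sigma = sqrt(sum(p * (1:7 - mu)^2)))
}))

round(implied, 1)
```

Each row should land close to the corresponding target pair in the targets list.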

At least in theory, Francisco can now conduct his simulation study. In this case, I am generating a huge sample size to minimize sampling variation with the hope that we can recover the means, standard deviations and correlations, which, of course, we do:

d_ind <- genData(100000) 
dx <- genOrdCat(d_ind, adjVar = NULL, baseprobs = probs, catVar = "y", 
                corMatrix = corMat, asFactor = FALSE)

apply(as.matrix(dx[, -1]), 2, mean)

## grp1 grp2 grp3 grp4 
##  2.4  4.1  3.4  5.8

apply(as.matrix(dx[, -1]), 2, sd)

## grp1 grp2 grp3 grp4 
##  0.8  1.2  1.5  0.8

cor(as.matrix(dx[, -1]))

##       grp1 grp2  grp3  grp4
## grp1 1.000 0.08 0.099 0.043
## grp2 0.080 1.00 0.331 0.142
## grp3 0.099 0.33 1.000 0.110
## grp4 0.043 0.14 0.110 1.000

In the end, Francisco seemed to be satisfied with the solution - at least satisfied enough to go to the trouble to have a bottle of wine sent to me in New York City, which was definitely above and beyond. While my wife and I will certainly enjoy the wine - and look forward to being able to travel again so maybe we can enjoy a glass in person - seeing that image emerge from a beta-distribution was really all I needed. Salud.

simstudy just got a little more dynamic: version 0.2.1

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 13 Oct 2020 00:00:00 +0000

simstudy version 0.2.1 has just been submitted to CRAN. Along with this release, the big news is that I’ve been joined by Jacob Wujciak-Jens as a co-author of the package. He initially reached out to me from Germany with some suggestions for improvements, we had a little back and forth, and now here we are. He has substantially reworked the underbelly of simstudy, making the package much easier to maintain, and positioning it for much easier extension. And he implemented an entire system of formalized tests using testthat and hedgehog; that was always my intention, but I never had the wherewithal to pull it off, and Jacob has done that. But, most importantly, it is much more fun to collaborate on this project than to toil away on my own.

You readers, though, are probably more interested in the changes that, as a user, you will notice. There are a number of bug fixes (hopefully you never encountered those, but I know some of you have, because you have pointed them out to me) and improved documentation, including some new vignettes. There is even a nice new website that is created with the help of pkgdown.

The most exciting extension of this new version is the ability to modify data definitions on the fly using externally defined variables. Often, we’d like to explore data generation and modeling under different scenarios. For example, we might want to understand the operating characteristics of a model given different variance or other parametric assumptions. There was already some functionality built into simstudy to facilitate this type of dynamic exploration, with updateDef and updateDefAdd, which allow users to edit lines of existing data definition tables. Now, there is an additional and, I think, more powerful mechanism - called double-dot reference - to access variables that do not already exist in a defined data set or data definition.

Double-dot external variable reference

It may be useful to think of an external reference variable as a type of hyperparameter of the data generation process. The reference is made directly in the formula itself, using a double-dot (“..”) notation before the variable name.

Here is a simple example:

library(simstudy)

def <- defData(varname = "x", formula = 0, 
  variance = 5, dist = "normal")
def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x", 
  variance = "..sigma2", dist = "normal")

def

##    varname         formula variance   dist     link
## 1:       x               0        5 normal identity
## 2:       y ..B0 + ..B1 * x ..sigma2 normal identity

B0, B1, and sigma2 are not part of the simstudy data definition, but will be set external to that process, either in the global environment or within the context of a function.

B0 <- 4
B1 <- 2
sigma2 <- 9

set.seed(716251)
dd <- genData(100, def)

fit <- summary(lm(y ~ x, data = dd))

coef(fit)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)        4       0.28      14  2.6e-25
## x                  2       0.13      15  5.9e-28

fit$sigma

## [1] 2.8

It is easy to create a new data set on the fly with different slope and variance assumptions without having to go to the trouble of updating the data definitions.

B1 <- 3
sigma2 <- 16

dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))

coef(fit)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      4.4       0.43      10  4.6e-17
## x                3.1       0.22      14  8.6e-26

fit$sigma

## [1] 4.2

Using with apply functions

Double-dot references can be flexibly applied using lapply (or the parallel version mclapply) to create a range of data sets under different assumptions:

library(data.table)   # for rbindlist
library(ggplot2)      # for the plot below

gen_data <- function(sigma2, d) {
  dd <- genData(400, d)
  dd[, sigma2 := sigma2]
  dd
}

sigma2s <- c(1:9)^2

dd_m <- lapply(sigma2s, function(s) gen_data(s, def))
dd_m <- rbindlist(dd_m)

ggplot(data = dd_m, aes(x = x, y = y)) +
  geom_point(size = .5, color = "#DDAA33") +
  facet_wrap(sigma2 ~ .) +
  theme(panel.grid = element_blank())

Using with vectors

Double-dot referencing is also vector friendly. For example, if we want to create a mixture distribution from a vector of values (which we can also do using a categorical distribution), we can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4:

defblk <- defData(varname = "blksize", 
   formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

defblk

##    varname                           formula variance    dist     link
## 1: blksize ..sizes[1] | .5 + ..sizes[2] | .5        0 mixture identity

sizes <- c(2, 4)
genData(1000, defblk)

##         id blksize
##    1:    1       2
##    2:    2       4
##    3:    3       2
##    4:    4       4
##    5:    5       4
##   ---             
##  996:  996       4
##  997:  997       2
##  998:  998       4
##  999:  999       4
## 1000: 1000       4

There are a few other changes to the package that are described here (but look for version 0.2.0 - we found a pretty major bug right away and fixed it, hence 0.2.1). Moving forward, we have some more things in the works, of course. And if you have suggestions of your own, you know where to find us.

Addendum

Here’s a more detailed example to show how double-dot references simplify things considerably in a case where I originally used the updateDef function. In a post where I described regression to the mean, there is an addendum that I adapt here using double-dot references. (I’m not going into the motivation for the code here - check out the post if you’d like to see more.)

Here’s the original code (both examples require the parallel package):

d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "x2", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ", 
             dist = "nonrandom")

rtomean <- function(n, d) {
  dd <- genData(n, d)
  data.table(x1 = dd[x1 >= h1, mean(x1)] , x2 = dd[x1 >= h1, mean(x2)])
}

repl <- function(xvar, nrep, ucoef, d) {
  
  d <- updateDef(d, "x1", newvariance = xvar)
  d <- updateDef(d, "x2", newvariance = xvar)
  
  dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d)))
  mudif <- unlist(lapply(dif, mean))
  data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
  
}

dres <- list()
i <- 0

for (ucoef in c(0, 1, 2, 3)) {
  
  i <- i + 1
  
  uform <- genFormula( c(0, ucoef), "U")
  
  d <- updateDef(d, "x1", newformula = uform)
  d <- updateDef(d, "x2", newformula = uform)
  
  dr <- mclapply(seq(1, 4, by = 1), function(x) repl(x, 1000, ucoef, d))
  dres[[i]] <- rbindlist(dr)
}

dres <- rbindlist(dres)

And here is the updated code:

d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "x2", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ", 
             dist = "nonrandom")

rtomean <- function(n, d, ucoef, xvar) {
  
  dd <- genData(n, d)
  data.table(x1 = dd[x1 >= h1, mean(x1)] , x2 = dd[x1 >= h1, mean(x2)])
  
}

repl <- function(nrep, d, ucoef, xvar) {
  
  dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d, ucoef, xvar)))
  mudif <- unlist(lapply(dif, mean))
  data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
  
}

ucoef <- c(0:3)
xvar <- c(1:4)
params <- asplit(expand.grid(ucoef = ucoef, xvar = xvar), 1)

dres <- rbindlist(mclapply(params, function(x) repl(1000, d, x["ucoef"], x["xvar"])))

The code is much cleaner and the data generating process doesn’t really lose any clarity. Importantly, this change has allowed me to take advantage of the apply approach (rather than using a loop). I’d conclude that double-dot references have the potential to simplify things quite a bit.

Permuted block randomization using simstudy

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 29 Sep 2020 00:00:00 +0000

Along with preparing power analyses and statistical analysis plans (SAPs), generating study randomization lists is something a practicing biostatistician is occasionally asked to do. While not a particularly interesting activity, it offers the opportunity to tackle a small programming challenge. The title is a little misleading because you should probably skip all this and just use the blockrand package if you want to generate randomization schemes; don’t try to reinvent the wheel. But, I can’t resist. Since I was recently asked to generate such a list, I’ve been wondering how hard it would be to accomplish this using simstudy. There are already built-in functions for simulating stratified randomization schemes, so maybe it could be a good solution. The key element that is missing from simstudy, of course, is the permuted block setup.

Why permuted block randomization?

What is permuted block randomization and why even bother? The block part indicates that patients will be randomized in sub-groups. If we have blocks of size two, patients will be paired sequentially, with the two patients in each pair randomized to different arms. This helps with balance over time and when recruitment is less than anticipated. If we were to randomize 100 patients to treatment and control (50 in each arm) without blocking, it is possible that treatment assignments will cluster close to the start of the trial, just by chance. If the time period is related to the outcome in some way, this would be undesirable. Furthermore, if trial recruitment lagged and had to stop early, there would be an actual lack of balance across the arms.
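To make the balance argument concrete, here is a small base R sketch (independent of simstudy, with a fixed block size of four chosen only for illustration) that compares the running imbalance between arms under simple and block randomization of 100 patients:

```r
set.seed(4521)

n <- 100

# simple randomization: a random ordering of 50 treatment (+1) and 50 control (-1)
simple <- sample(rep(c(1, -1), each = n / 2))

# block randomization with blocks of size 4 (two per arm in each block)
blocked <- unlist(lapply(seq_len(n / 4), function(i) sample(c(1, 1, -1, -1))))

# largest running difference (treatment minus control) as patients are enrolled
max_imbalance <- c(
  simple  = max(abs(cumsum(simple))),
  blocked = max(abs(cumsum(blocked)))
)

max_imbalance
```

With blocks of size four the running difference can never exceed two, whereas under simple randomization it can drift considerably before returning to zero at the end of enrollment.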

The argument for block randomization seems strong enough. But if coordinators in the field know that we are using this approach, there is a risk of influencing patient recruitment. If it is known that patients are randomized in blocks of size four, and the first two patients are randomized to drug therapy, the coordinator will know that the next two patients will be randomized to control. This could influence who the coordinator recruits into the study, particularly if they believe drug therapy is superior. They may actively or unconsciously recruit healthier patients when it is known that they are going to get the control. (This, of course, is much less of an issue when recruiters/coordinators are completely blinded to group assignment.) Changing the block sizes in an unpredictable manner, by permuting the sizes, addresses this problem. Hence, permuted block randomization.

simstudy code

I want to walk through the code that will generate permuted block randomization. In this scenario we are conducting a trial to compare a drug therapy with placebo in at least 120 patients. We would like to randomize within blocks of size two or four, and the order of the blocks will itself be randomized. We assume that each block size will have equal probability of being selected, though balance across different block sizes is not guaranteed. The preliminary code shown here implements these specifications:

library(simstudy)
library(data.table)

set.seed(1234)

n <- 120
levels <- c("D", "P")    # Drug/Placebo
blk_sizes <- c(2, 4)

n_arms <- length(levels)
p_blk <- 1/length(blk_sizes)

The first step is to generate a sequence of blocks with varying block sizes. We take advantage of the mixture distribution option in simstudy to generate blocks. This distribution is specified using a string with the format “$x_1|p_1 + \dots + x_k|p_k$”. In this case, there are $k=2$ block sizes; $x_1 = 2$, $p_1 = 0.5$, $x_2 = 4$, and $p_2 = 0.5$. We construct the mixture formula using the predefined block sizes, and use that formula to define the data that we need to generate:

v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))
mixformula <- paste(v, collapse = "+")

def <- defData(varname = "blk_size", formula = mixformula, 
  dist = "mixture", id = "blk_id")

def

##     varname     formula variance    dist     link
## 1: blk_size 2|0.5+4|0.5        0 mixture identity

Now, we need to generate enough blocks to support the target number of patients to be randomized; that is, the sum of the block sizes should be at least as large as the target. If all block sizes were the minimum block size (in this case $2$), then we would need at least $n/2$ blocks. Clearly, we will need fewer, but we will start with $n/2$ here and remove the excess:

maxblocks <- ceiling(n / min(blk_sizes))
dd <- genData(maxblocks, def)

#--- removing the excess

nblocks <- dd[, threshold := (cumsum(blk_size) >= n) * .I][threshold > 0]
dd <- dd[1:nblocks[1, threshold]]

tail(dd)

##    blk_id blk_size threshold
## 1:     36        4         0
## 2:     37        2         0
## 3:     38        2         0
## 4:     39        4         0
## 5:     40        4         0
## 6:     41        4        41

In the final step, we use cluster data generation to create the individual patients, defining each block as a cluster. Treatment assignment is stratified by each block:

di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size", level1ID = "id")
dtrt <- trtAssign(di, nTrt = n_arms, strata = "blk_id", grpName = "arm")
  
dtrt <- dtrt[, .(id, blk_id, blk_size, arm = factor(arm, labels = levels))]

Here are examples of the block randomization results for four blocks:

dtrt[blk_id == 5]

##    id blk_id blk_size arm
## 1: 15      5        4   P
## 2: 16      5        4   D
## 3: 17      5        4   P
## 4: 18      5        4   D

dtrt[blk_id == 8]

##    id blk_id blk_size arm
## 1: 25      8        2   D
## 2: 26      8        2   P

dtrt[blk_id == 19]

##    id blk_id blk_size arm
## 1: 59     19        2   P
## 2: 60     19        2   D

dtrt[blk_id == 26]

##    id blk_id blk_size arm
## 1: 73     26        4   D
## 2: 74     26        4   P
## 3: 75     26        4   P
## 4: 76     26        4   D

A real-world application

I’ve created a function blkRandom based on this code so that I can illustrate this functionality in a more realistic setting. In a current multi-site study that I’m working on (already did the power analysis, finalizing the SAP), we need to provide a randomization list so that subject recruitment can begin. Randomization will be stratified by each of the sites (1 through 7), by sex (M and F), and by location of recruitment (A or B); in total, there will be $7\times 2 \times 2 = 28$ strata. For each of the 28 strata we want to randomize 50 potential subjects using permuted block randomization; for particular strata, this is certainly too large a number, but it doesn’t hurt to overestimate as long as the clinical trial software system can handle it.
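The function itself is not shown in this post. One way it might be assembled from the steps above is sketched below; this is my reconstruction rather than the actual blkRandom source, and it assumes that the block sizes are multiples of the number of arms and that the mixture probabilities sum to one (as they do for two block sizes):

```r
library(simstudy)
library(data.table)

# Sketch of a permuted block randomization function (a reconstruction,
# not the original blkRandom source)
blkRandom <- function(n, levels, blk_sizes) {

  n_arms <- length(levels)
  p_blk <- 1 / length(blk_sizes)

  # mixture formula such as "2|0.5+4|0.5"
  v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))
  mixformula <- paste(v, collapse = "+")

  def <- defData(varname = "blk_size", formula = mixformula,
                 dist = "mixture", id = "blk_id")

  # generate the maximum number of blocks, then drop the excess
  maxblocks <- ceiling(n / min(blk_sizes))
  dd <- genData(maxblocks, def)
  dd[, threshold := (cumsum(blk_size) >= n) * .I]
  last_blk <- dd[threshold > 0][1, threshold]
  dd <- dd[1:last_blk]

  # one record per patient, with assignment stratified by block
  di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size",
                   level1ID = "id")
  dtrt <- trtAssign(di, nTrt = n_arms, strata = "blk_id", grpName = "arm")
  dtrt[, arm := factor(arm, labels = levels)]

  dtrt[]
}
```

Because the last block is kept whole, the returned list can contain slightly more than n patients.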

Here is how the function would work for a single stratum (just showing the first and last blocks):

blkRandom(n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))[c(1:4, 47:50)]

##    blk_id blk_size threshold id arm
## 1:      1        4         0  1   A
## 2:      1        4         0  2   B
## 3:      1        4         0  3   A
## 4:      1        4         0  4   B
## 5:     13        4        13 47   B
## 6:     13        4        13 48   A
## 7:     13        4        13 49   A
## 8:     13        4        13 50   B

Here is a wrapper function for blkRandom that incorporates a specific stratum $s$. This will enable us to do permuted block randomization within different subgroups of the population, such as males and females, or sites:

sBlkRandom <- function(s, n, levels, blk_sizes) {
  
  dB <- blkRandom(n, levels, blk_sizes)
  dB[, .(id = paste0(id, s), stratum = s, arm)]
  
}

sBlkRandom(s = "M1A", n = 30, levels = c("A", "B"), blk_sizes = c(2, 4))[1:5]

##      id stratum arm
## 1: 1M1A     M1A   B
## 2: 2M1A     M1A   A
## 3: 3M1A     M1A   A
## 4: 4M1A     M1A   B
## 5: 5M1A     M1A   B

All the pieces are now in place.

We need to create a list of strata, each of which requires its own permuted block randomization list:

library(tidyr)

#--- specify all strata variables

site <- c(1 : 7)
sex <- c("M", "F")
location <- c("A", "B")

#--- create strata

strata <- expand.grid(sex = sex,site = site, location = location)
strata <- unite(strata, "stratum", sep = "")$stratum 

head(strata)

## [1] "M1A" "F1A" "M2A" "F2A" "M3A" "F3A"
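If you would rather not load tidyr for this one step, the same labels can be built in base R. This is only a sketch of an alternative; the names grid and strata_base are mine and are not used elsewhere in the post.

```r
site <- 1:7
sex <- c("M", "F")
location <- c("A", "B")

# expand.grid varies the first variable fastest, matching the order above
grid <- expand.grid(sex = sex, site = site, location = location)
strata_base <- paste0(grid$sex, grid$site, grid$location)

length(strata_base)   # number of strata: 7 * 2 * 2
head(strata_base)
```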

With the list of strata in hand, we create the randomization lists using lapply to repeatedly call sBlkRandom:

rbindlist(lapply(
    strata, 
    function(s) sBlkRandom(s, n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))
  )
)

##          id stratum arm
##    1:  1M1A     M1A   B
##    2:  2M1A     M1A   A
##    3:  3M1A     M1A   A
##    4:  4M1A     M1A   B
##    5:  5M1A     M1A   B
##   ---                  
## 1420: 46F7B     F7B   B
## 1421: 47F7B     F7B   B
## 1422: 48F7B     F7B   A
## 1423: 49F7B     F7B   B
## 1424: 50F7B     F7B   A

Going with the tried and true

This has been fun and I think successful, but as I mentioned, you might want to stick with the established blockrand package that is designed around this very specific goal. I have written a simple wrapper function that rectifies one minor shortcoming (blockrand multiplies the block sizes it is given by the number of arms, so with two arms they must be specified as blk_size$/ 2$) and allows us to use lapply to make repeated calls across the strata:

library(blockrand)

rand_stratum <- function(stratum, n, levels, blk_sizes) {
  
  # blockrand multiplies block.sizes by the number of arms (2 in this example)
  blk_sizes <- blk_sizes / length(levels)
  
  dB <- data.table(blockrand(
    n = n, 
    num.levels = length(levels), 
    levels = levels,
    id.prefix = stratum, 
    block.prefix  = stratum,
    stratum = stratum,
    block.sizes = blk_sizes)
  )
  
  dB[, .(id, stratum, treatment)]
  
}

rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))

##          id stratum treatment
##    1: M1A01     M1A         B
##    2: M1A02     M1A         A
##    3: M1A03     M1A         B
##    4: M1A04     M1A         A
##    5: M1A05     M1A         B
##   ---                        
## 1420: F7B46     F7B         A
## 1421: F7B47     F7B         A
## 1422: F7B48     F7B         B
## 1423: F7B49     F7B         A
## 1424: F7B50     F7B         B

In case the elegance and simplicity (not to mention all the other features that it provides, but I didn’t show you) are not compelling enough, the speed comparison isn’t even close: in the benchmark below, blockrand is roughly 8 times faster than my simstudy solution (albeit on the millisecond scale, so you might not actually notice it).

library(microbenchmark)

microbenchmark(
 rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A", "B"), c(2, 4)))),
 rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))
)

## Warning in microbenchmark(rbindlist(lapply(strata, function(s) sBlkRandom(s, :
## less accurate nanosecond times to avoid potential integer overflows

## Unit: milliseconds
##                                                                                   expr
##    rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A",      "B"), c(2, 4))))
##  rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A",      "B"), c(2, 4))))
##  min  lq mean median  uq max neval cld
##  150 156  166    163 172 219   100   b
##   17  18   20     19  21  34   100  a

Addendum

An earlier version of the post did not include the blkRandom function. Here it is:

blkRandom <- function(n, levels, blk_sizes) {
  
  n_arms <- length(levels)
  p_blk <- 1/length(blk_sizes)
  
  v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))
  mixformula <- paste(v, collapse = "+")
  
  def <- defData(varname = "blk_size", formula = mixformula, 
                 dist = "mixture", id = "blk_id")
  
  maxblocks <- ceiling(n / min(blk_sizes))
  
  dd <- genData(maxblocks, def)
  nblocks <- dd[, threshold := (cumsum(blk_size) >= n) * .I][threshold > 0]
  dd <- dd[1:nblocks[1, threshold]]
  
  di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size", level1ID = "id")
  dtrt <- trtAssign(di, nTrt = n_arms, strata = "blk_id", grpName = "arm")
  
  dtrt[, arm := factor(arm, labels = levels)]
  
  dtrt[]
  
}

Generating probabilities for ordinal categorical data

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 15 Sep 2020 00:00:00 +0000

Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently here. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.

Motivation

The outcome for the analysis that we will be conducting is the WHO 11-point ordinal scale for clinical improvement at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. We plan to use a Bayesian proportional odds model to assess the effectiveness of the therapy. Since this is a meta-analysis, we will be including these data from a collection of studies being conducted around the world.

Typically, in a proportional odds model one has to make an assumption about proportionality. In this case, while we are willing to make that assumption within specific studies, we are not willing to make that assumption across the various studies. This means we need to generate a separate set of intercepts for each study that we simulate.

In the proportional odds model, we are modeling the log-cumulative odds at a particular level. The simplest model with a single exposure/treatment covariate for a specific study or cluster $k$ is

\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{xk} + \beta A,\] where $x$ ranges from 1 to 10, all the levels of the WHO score excluding the lowest level $x=0$. $A$ is the treatment indicator, and is $A=1$ for patients who receive the treatment. $\alpha_{xk}$ is the intercept for each study/cluster $k$. $\beta$ is interpreted as the log-odds ratio comparing the odds of the treated with the non-treated within each study. The proportionality assumption kicks in here when we note that $\beta$ is constant for all levels of $x$. In addition, in this particular model, we are assuming that the log-odds ratio is constant across studies (not something we will assume in a more complete model). We make no assumptions about how the study intercepts relate to each other.

To make clear what it would mean to make a stronger assumption about the odds across studies consider this model:

\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{x} + b_k + \beta A,\]

where the intercepts for each study are related, since they are defined as $\alpha_{x} + b_k$, and share $\alpha_x$ in common. If we compare the log-odds of the treated in one study $k$ with the log-odds of treated in another study $j$ (so $A=1$ in both cases), the log-odds ratio is $b_j - b_k$. The ratio is independent of $x$, which implies a strong proportional odds assumption across studies. In contrast, the same comparison across studies based on the first model is $\alpha_{xj} - \alpha_{xk}$, which is not necessarily constant across different levels of $x$.

This is a long way of explaining why we need to generate different sets of intercepts for each study. In short, we would like to make the more relaxed assumption that odds are not proportional across studies or clusters.

The Dirichlet distribution

In order to generate ordinal categorical data I use the genOrdCat function in the simstudy package. This function requires a set of baseline probabilities that sum to one; these probabilities map onto level-specific intercepts. There will be a distinct set of baseline probabilities for each study and I will create a data set for each study. The challenge is to be able to generate unique baseline probabilities as if I were sampling from a population of studies.
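To see how a set of baseline probabilities maps onto level-specific intercepts, here is a small sketch using the standard logistic relationship. The probability vector is illustrative, and this shows the general idea rather than the internals of genOrdCat, which I am not reproducing here. The intercept for level $x$ is the log-odds of the cumulative probability $P(\text{score} < x)$:

```r
# illustrative baseline probabilities for the 11 levels (0 through 10)
base <- c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05)

# cumulative probabilities P(score < x) for x = 1, ..., 10
cum_p <- cumsum(base)[-length(base)]

# intercepts on the log cumulative odds scale
alpha <- qlogis(cum_p)
round(alpha, 2)

# going back the other way recovers the baseline probabilities
recovered <- diff(c(0, plogis(alpha), 1))
all.equal(recovered, base)
```

There are 10 intercepts for 11 levels because the final cumulative probability is always 1.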

If I want to generate a single probability (i.e. a number between $0$ and $1$), a good solution is to draw a value from a beta distribution, which has two shape parameters $\alpha$ and $\beta$.

Here is a single draw from $beta(3, 3)$:

set.seed(872837)

rbeta(1, shape1 = 3, shape2 = 3)

## [1] 0.568

The mean of the beta distribution is $\alpha/(\alpha + \beta)$ and the variance is $\alpha\beta/\left[(\alpha+\beta)^2(\alpha + \beta + 1)\right]$. We can reduce the variance and maintain the same mean by increasing $\alpha$ and $\beta$ by a constant factor (see addendum for a pretty picture):

library(data.table)

d1 <- data.table(s = 1, value = rbeta(1000, shape1 = 1, shape2 = 2))
d2 <- data.table(s = 2, value = rbeta(1000, shape1 = 5, shape2 = 10))
d3 <- data.table(s = 3, value = rbeta(1000, shape1 = 100, shape2 = 200))

dd <- rbind(d1, d2, d3)
dd[, .(mean(value), sd(value)), keyby = s]

##    s    V1     V2
## 1: 1 0.338 0.2307
## 2: 2 0.336 0.1195
## 3: 3 0.333 0.0283
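The reason this works can be read directly off the variance formula. Replacing $\alpha$ and $\beta$ with $c\alpha$ and $c\beta$ for some constant $c > 0$ leaves the mean unchanged, and the $c^2$ terms cancel everywhere except in the last factor of the denominator:

\[\text{Var}(p) = \frac{(c\alpha)(c\beta)}{(c\alpha + c\beta)^2(c\alpha + c\beta + 1)} = \frac{\alpha\beta}{(\alpha+\beta)^2\left(c(\alpha + \beta) + 1\right)}.\]

With $\alpha = 1$ and $\beta = 2$, the theoretical standard deviations for $c = 1$, $5$, and $100$ are approximately $0.236$, $0.118$, and $0.027$, in line with the simulated values above.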

The Dirichlet distribution is a multivariate version of the beta distribution where $K$ values between $0$ and $1$ are generated, with the caveat that they sum to $1$. Instead of $\alpha$ and $\beta$, the Dirichlet is parameterized by a vector of length $K$

\[\boldsymbol{\alpha} = \left(\alpha_1,\dots, \alpha_K\right)^T,\]

where there are $K$ levels of the ordinal outcome. A draw from this distribution returns a vector $\boldsymbol{p} = ( p_1, \dots, p_K)^T$ where $\sum_{i=1}^K p_i = 1$ and

\[E(p_k)=\frac{\alpha_k}{\sum_{i=1}^K \alpha_i}.\] A draw from a Dirichlet distribution with $K=2$ is actually equivalent to a draw from a beta distribution where $\boldsymbol{\alpha} = (\alpha, \beta)^T$. Before, I generated data from a $beta(1, 2)$, and now here is a draw from $Dirichlet\left(\boldsymbol\alpha = (1,2)\right)$ using rdirichlet from the gtools package:

library(gtools)

dir <- rdirichlet(1000, alpha = c(1,2))
head(dir)

##        [,1]  [,2]
## [1,] 0.3606 0.639
## [2,] 0.4675 0.533
## [3,] 0.2640 0.736
## [4,] 0.0711 0.929
## [5,] 0.5643 0.436
## [6,] 0.0188 0.981

The first column has the same distribution as the $beta$ distribution from before; the mean and standard deviation are close to the values estimated above:

c(mean(dir[,1]), sd(dir[,1]))

## [1] 0.332 0.236
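As an aside, Dirichlet draws can also be generated in base R by normalizing independent gamma variates, which is a standard construction. The helper rdirichlet_base below is a hypothetical stand-in written for illustration; the rest of this post continues to use rdirichlet from gtools.

```r
# Hypothetical helper: n Dirichlet(alpha) draws from normalized gamma variates
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  # column j holds n draws from gamma(shape = alpha[j], rate = 1)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  # dividing by the row totals makes each row sum to 1
  g / rowSums(g)
}

set.seed(2020)
p <- rdirichlet_base(1000, alpha = c(1, 2))

range(rowSums(p))   # each row sums to 1 (up to floating point error)
colMeans(p)         # should be near 1/3 and 2/3
```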

To ramp things up a bit, say we have $K = 5$, and the target mean values for each level are $\boldsymbol{p} = \left(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9} \right)$. One way to specify this is:

dir_1 <- rdirichlet(1000, alpha = c(1, 2, 3, 2, 1))
head(dir_1)

##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] 0.1710 0.6637 0.0676 0.0633 0.0343
## [2,] 0.1130 0.1150 0.2803 0.4229 0.0689
## [3,] 0.1434 0.0678 0.3316 0.1721 0.2851
## [4,] 0.0250 0.1707 0.3841 0.2490 0.1712
## [5,] 0.0633 0.3465 0.4056 0.0853 0.0993
## [6,] 0.1291 0.1510 0.3993 0.2612 0.0593

Here are the observed means for each $p_k$, pretty close to the target:

apply(dir_1, 2, mean)

## [1] 0.111 0.221 0.328 0.229 0.112

Of course, we could generate data with a similar target $\boldsymbol{p}$ by multiplying $\boldsymbol\alpha$ by a constant $c$. In this case, we use $c=10$ and see that the average values for each $p_k$ are also close to the target:

dir_2 <- rdirichlet(1000, alpha = c(10, 20, 30, 20, 10))
apply(dir_2, 2, mean)

## [1] 0.113 0.222 0.334 0.220 0.111

There is a key difference between specifying $\boldsymbol{\alpha}$ and $c\boldsymbol{\alpha}$. Just as in the beta distribution, as $c$ grows larger, the variation within each $p_k$ decreases. This will be useful when generating the study-specific probabilities if we want to explore different levels of variation.

Here are the standard deviations from the two data sets just generated:

apply(dir_1, 2, sd)

## [1] 0.102 0.131 0.144 0.134 0.098

apply(dir_2, 2, sd)

## [1] 0.0333 0.0425 0.0508 0.0421 0.0333
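These simulated values can be checked against theory. For a Dirichlet distribution with $\alpha_0 = \sum_i \alpha_i$, the variance of $p_k$ is $\alpha_k(\alpha_0 - \alpha_k)/\left[\alpha_0^2(\alpha_0 + 1)\right]$. The small function below (dirichlet_sd is a name I made up for this sketch) computes the implied standard deviations:

```r
# Hypothetical helper: theoretical sd of each component of a Dirichlet(alpha)
dirichlet_sd <- function(alpha) {
  a0 <- sum(alpha)
  sqrt(alpha * (a0 - alpha) / (a0^2 * (a0 + 1)))
}

sd_1 <- dirichlet_sd(c(1, 2, 3, 2, 1))        # compare with dir_1
sd_2 <- dirichlet_sd(10 * c(1, 2, 3, 2, 1))   # compare with dir_2

round(sd_1, 3)
round(sd_2, 4)
```

The theoretical values should land close to the simulated ones, and every component of sd_2 is smaller than its counterpart in sd_1.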

Generating the baseline probabilities

A simple function that includes two key arguments - the base probabilities (which are really $\boldsymbol{\alpha}$) and a similarity index (which is really just the constant $c$) - implements these ideas to generate study-specific probabilities for each outcome level. As the similarity index increases, the variation across studies or sites decreases. The function includes an additional adjustment to ensure that the row totals sum exactly to $1$ and not to some value infinitesimally greater than $1$ as a result of rounding. Such a rounding error could cause problems for the function genOrdCat.

genBaseProbs <- function(n, base, similarity, digits = 8) {
  
  n_levels <- length(base)
  x <- rdirichlet(n, similarity * base) 
  
  #--- ensure that each vector of probabilities sums exactly to 1
  
  x <- round(floor(x*1e8)/1e8, digits)   # round the generated probabilities
  xpart <- x[, 1:(n_levels-1)]           # delete the base prob of the final level
  partsum <- apply(xpart, 1, sum)        # add the values of levels 1 to K-1
  x[, n_levels] <- 1 - partsum           # the base prob of the level K = 1 - sum(1:[K-1])
  
  return(x)
}

In this first example, I am generating 11 values (representing base probabilities) for each of 9 studies using a relatively low similarity index, showing you the first six studies:

basestudy <- genBaseProbs(
  n = 9,
  base =  c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 15
)

round(head(basestudy), 3)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
## [1,] 0.094 0.022 0.121 0.100 0.061 0.102 0.053 0.309 0.059 0.078 0.000
## [2,] 0.025 0.079 0.043 0.197 0.083 0.044 0.099 0.148 0.025 0.150 0.107
## [3,] 0.007 0.042 0.084 0.066 0.049 0.145 0.191 0.323 0.078 0.012 0.003
## [4,] 0.061 0.021 0.063 0.104 0.092 0.292 0.112 0.110 0.113 0.026 0.008
## [5,] 0.067 0.023 0.021 0.042 0.063 0.473 0.108 0.127 0.016 0.013 0.046
## [6,] 0.001 0.018 0.054 0.225 0.150 0.301 0.043 0.081 0.100 0.008 0.020

A great way to see the variability is a cumulative probability plot for each individual study. With a relatively low similarity index, you can generate quite a bit of variability across the studies. In order to create the plot, I need to first calculate the cumulative probabilities:

library(ggplot2)
library(viridis)

cumprobs <- data.table(t(apply(basestudy, 1, cumsum)))
n_levels <- ncol(cumprobs)
cumprobs[, id := .I]

dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]

ggplot(data  = dm, aes(x=level, y = value)) +
  geom_line(aes(group = id, color = id)) +
  scale_color_viridis( option = "D") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none")

Here is a plot of data generated using a similarity index of 150. Variation is reduced pretty dramatically:

Using base probabilities to generate ordinal data

Now that we have these base probabilities, the last step is to use them to generate ordinal outcomes. I am generating the simplest of data sets: 9 “studies” each with 500 subjects, without any covariates or even treatment assignment. Since genOrdCat requires an adjustment variable, I am adjusting everyone by 0. (This is something I need to fix - there should be no such requirement.)

library(simstudy)

d_study <- genData(9, id = "study")   
d_ind <- genCluster(d_study, "study", numIndsVar = 500, "id")
d_ind[, z := 0]

d_ind

##       study   id z
##    1:     1    1 0
##    2:     1    2 0
##    3:     1    3 0
##    4:     1    4 0
##    5:     1    5 0
##   ---             
## 4496:     9 4496 0
## 4497:     9 4497 0
## 4498:     9 4498 0
## 4499:     9 4499 0
## 4500:     9 4500 0

To generate the ordinal categorical outcome, we have to treat each study separately since they have unique baseline probabilities. This can be accomplished using lapply in the following way:

basestudy <- genBaseProbs(
  n = 9,
  base =  c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 50
)

list_ind  <- lapply(
  X = 1:9, 
  function(i) {
    b <- basestudy[i,]
    d_x <- d_ind[study == i]
    genOrdCat(d_x, adjVar = "z", b, catVar = "ordY")
  }
)

The output list_ind is a list of data tables, one for each study. For example, here is the 5th data table in the list:

list_ind[[5]]

##      study   id z ordY
##   1:     5 2001 0    7
##   2:     5 2002 0    9
##   3:     5 2003 0    5
##   4:     5 2004 0    9
##   5:     5 2005 0    9
##  ---                  
## 496:     5 2496 0    9
## 497:     5 2497 0    4
## 498:     5 2498 0    7
## 499:     5 2499 0    5
## 500:     5 2500 0   11

And here is a table of proportions for each study that we can compare with the base probabilities:

t(sapply(list_ind, function(x) x[, prop.table(table(ordY))]))

##           1     2     3     4     5     6     7     8     9    10    11
##  [1,] 0.106 0.048 0.086 0.158 0.058 0.162 0.092 0.156 0.084 0.028 0.022
##  [2,] 0.080 0.024 0.092 0.134 0.040 0.314 0.058 0.110 0.028 0.110 0.010
##  [3,] 0.078 0.050 0.028 0.054 0.148 0.172 0.162 0.134 0.058 0.082 0.034
##  [4,] 0.010 0.056 0.116 0.160 0.054 0.184 0.102 0.084 0.156 0.056 0.022
##  [5,] 0.010 0.026 0.036 0.152 0.150 0.234 0.136 0.084 0.120 0.026 0.026
##  [6,] 0.040 0.078 0.100 0.092 0.170 0.168 0.196 0.050 0.038 0.034 0.034
##  [7,] 0.006 0.064 0.058 0.064 0.120 0.318 0.114 0.068 0.082 0.046 0.060
##  [8,] 0.022 0.070 0.038 0.160 0.182 0.190 0.074 0.068 0.070 0.036 0.090
##  [9,] 0.054 0.046 0.052 0.128 0.100 0.290 0.102 0.092 0.080 0.030 0.026

Of course, the best way to compare is to plot the data for each study. Here is another cumulative probability plot, this time including the observed (generated) probabilities in black over the baseline probabilities used in the data generation in red:

Sometime soon, I plan to incorporate something like the function genBaseProbs into simstudy to make it easier to incorporate non-proportionality assumptions into simulation studies that use ordinal categorical outcomes.

Addendum

The variance of the beta distribution (and similarly the Dirichlet distribution) decreases as $\alpha$ and $\beta$ both increase proportionally (keeping the mean constant). I’ve plotted the variance of the beta distribution for $\alpha = 1$ and different levels of $\beta$ and $C$. It is clear that at any level of $\beta$ (I’ve drawn a line at $\beta = 1$), the variance decreases as $C$ increases. It is also clear that, holding $\alpha$ constant, the relationship of $\beta$ to variance is not strictly monotonic:

var_beta <- function(params) {
  a <- params[1]
  b <- params[2]
  (a * b) / ( (a + b)^2 * (a + b + 1))
}

loop_b <- function(C, b) {
  V <- sapply(C, function(x) var_beta(x*c(1, b)))
  data.table(b, V, C)
}

b <- seq(.1, 25, .1)
C <- c(0.01, 0.1, 0.25, 0.5, 1, 2, 4, 10, 100)

d_var <- rbindlist(lapply(b, function(x) loop_b(C, x)))

ggplot(data = d_var, aes(x = b, y = V, group = C)) +
  geom_vline(xintercept = 1, size = .5, color = "grey80") +
  geom_line(aes(color = factor(C))) +
  scale_y_continuous(name = expression("Var beta"~(alpha==1~","~beta))) +
  scale_x_continuous(name = expression(beta)) +
  scale_color_viridis(discrete = TRUE, option = "B", name = "C") +
  theme(panel.grid = element_blank(),
        legend.title.align=0.15)

Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 01 Sep 2020 00:00:00 +0000

The federal government recently granted emergency approval for the use of antibody rich blood plasma when treating hospitalized COVID-19 patients. This announcement is unfortunate, because we really don’t know if this promising treatment works. The best way to determine this, of course, is to conduct an experiment, though this approval makes this more challenging to do; with the general availability of convalescent plasma (CP), there may be resistance from patients and providers against participating in a randomized trial. The emergency approval sends the incorrect message that the treatment is definitively effective. Why would a patient take the risk of receiving a placebo when they have almost guaranteed access to the therapy?

This doesn’t obviate the fact that we still need to figure out if CP is effective. Last month, I described an approach to pooling data across multiple different, but similar, studies as a way to leverage information that is being generated around the country and world from ongoing trials. I was particularly cryptic about the nature of the research, because the paper describing the details of the proposed research had not yet been published. Now, the project has a name (COMPILE), a website, and most importantly, participating studies committed to sharing data.

In preparation for the analyses, we have been developing a statistical plan, which is based on a pooled Bayesian model similar to what I described earlier. The Bayesian approach offers much needed flexibility in this context when we must make a principled decision as quickly as possible. Indeed, now that the emergency approval has been granted, there is even more urgency. The study’s Data Safety and Monitoring Board will be evaluating the data frequently, which a Bayesian approach accommodates quite well. (I imagine I will have much to write about over the next few months as we try to better understand the implications and challenges of taking this path.)

In this post, I am describing a nitty-gritty issue related to Markov chain Monte Carlo (MCMC) estimation: stability. It may not sound super-exciting, but stable estimation is key to drawing correct inferences from the estimated posterior distribution. As a model becomes more complex, the MCMC estimation in stan can be plagued by degenerate sampling caused by divergent transitions. MCMC works well when the algorithm takes the sampler across the full posterior distribution without getting stuck, but all bets are off when the process breaks down.

Using a slightly simplified version of the data and model we are proposing for COMPILE, I want to show how to see if things have gotten stuck at some point, and then present a possible solution to getting things unstuck. (I highly encourage you to look here and here for many more details if this is indeed exciting to you.)

The model

The studies included in this meta-analysis will be similar in that they are all providing 1 or 2 units of plasma to the patients randomized to therapy. However, there will be differences with respect to the control arm: studies will use a saline solution, non-convalescent plasma, or usual care (in the last case, the study is not blinded). We need to account for the possibility that the treatment effect will vary slightly depending on the type of control. In this case, I am assuming a binary outcome (though in the actual study we are using an ordinal outcome and a proportional odds model). Here is the logistic model: \[\text{logodds}(P(y_{ik}=1)) = \alpha_0 + b_k + \delta_k x_{ik}\] where the observed data are $y_{ik}$, the indicator for the outcome (say death) of patient $i$ in study $k$, and $x_{ik}$, an indicator set to $1$ if the patient is in the control arm, $0$ otherwise. The parameters are $\alpha_0$, the global logodds of death (note here the intercept represents the treatment condition, not the control condition as is usually the case); $b_k$, the study-specific logodds of death given treatment; and $\delta_k$, the “control” effect (a logodds ratio) specific to study $k$.

The really interesting aspect of this model is $\delta_k$. This effect is a function of three components - the study site, the control-type, and the overall/global treatment effect. We are assuming that there is a tendency for studies to vary around a control-type effect average. So, with three controls $c\in \{1,2,3\}$: \[\delta_k \sim N(\delta_c, \eta_0),\]

determined by the control-type of study $k$. Furthermore, we assume that the control-type effects, the $\delta_c$’s, vary around a global effect $\Delta$: \[\delta_c \sim N(\Delta, \eta).\] We assume that $\eta$ is quite small; that is, the control effects will likely be very similar. We are not actually interested in $\eta$ so we do not attempt to estimate it. However $\eta_0$, the variability across studies, is important, so we will estimate that.
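To get a feel for this two-level structure, here is a minimal base R sketch that draws from the hierarchy once. The values of eta and eta_0 are hypothetical, chosen only for illustration, and I am treating them as standard deviations (an assumption); Delta and the number of studies per control type match the simulation described later in the post.

```r
set.seed(123)

Delta <- 0.5    # global effect on the logodds ratio scale
eta   <- 0.1    # hypothetical sd of control-type effects around Delta
eta_0 <- 0.25   # hypothetical sd of study effects around delta_c

n_types <- 3
n_per_type <- 7

# control-type effects vary around the global effect
delta_c <- rnorm(n_types, mean = Delta, sd = eta)

# study-specific effects vary around the effect for their control type
c_type <- rep(seq_len(n_types), each = n_per_type)
delta_k <- rnorm(length(c_type), mean = delta_c[c_type], sd = eta_0)

# average study effect within each control type
tapply(delta_k, c_type, mean)
```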

Generating data

To start, here are the R packages I am using to generate the data.

library(simstudy)
library(data.table)

Next, I define the study-level parameters: the study-specific intercept $b_k$ and the study-specific “control” effect $\delta_k$, which is a function of the control-type. Note that I do not specify a study-level “control” effect; there will just be natural variation across studies. The individual-level outcome $y_{ik}$ is defined as a function of study parameters. The overall treatment effect is $\Delta = 0.5$ on the logodds ratio scale.

def_s <- defDataAdd(varname = "b_k", formula = 0, variance = 0.025)
def_s <- defDataAdd(
    def_s, varname = "delta_k", 
    formula = "(c_type==1) * 0.4 + (c_type==2) * 0.5 + (c_type==3) * 0.6", 
    dist = "nonrandom"
  )
  
def_i <- defDataAdd(
    varname = "y", formula = "-1 + b_k + rx * delta_k", 
    dist = "binary", link = "logit")

I am generating 7 studies under each control type, for a total of 21 studies. Each study has 50 patients, 25 in each arm, for a total of 1050 patients.

dc <- genData(3, id = "c_type")
    
ds <- genCluster(dc, "c_type", numIndsVar = 7, level1ID = "site")
ds <- addColumns(def_s, ds)

di <- genCluster(ds, "site", 50, "id")
di <- trtAssign(di, 2, strata = "site", grp = "rx")
di <- addColumns(def_i, di)

Estimation using stan

I am using rstan directly to sample from the posterior distribution, as opposed to using a more user-friendly package like brms or rstanarm. I’ve actually been warned against taking this approach by folks at stan, because it can be more time consuming and could lead to problems of the sort that I am showing you how to fix. However, I find building the model using stan code very satisfying and illuminating; this process has really given me a much better appreciation of the Bayesian modeling. And besides, this model is odd enough that trying to shoehorn it into a standard brms model might be more trouble than it is worth.

library(rstan)
library(ggplot2)
library(bayesplot)

The stan code, which can reside in its own file, contains a number of blocks that define the model. The data block specifies the data that will be provided to the model; this can include summary data as well as raw data. The parameters block is where you specify the parameters of the model that need to be estimated. The transformed parameters block includes another set of parameters that will be estimated, but are a function of parameters defined in the previous block. And in this case, the last block is the model where prior distributions are specified as well as the likelihood (outcome) model. Rather than walk you through the details here, I will let you study a bit and see how this relates to the model I specified above.

data {
  
  int<lower=0> N;                // number of observations
  int<lower=0> C;                // number of control types
  int<lower=1> K;                // number of studies
  int y[N];                      // vector of categorical outcomes
  int<lower=1,upper=K> kk[N];    // site for individual
  int<lower=0,upper=1> ctrl[N];  // treatment or control
  int<lower=1,upper=C> cc[K];    // specific control for site
  
}

parameters {

  real alpha;               // overall intercept for treatment
  vector[K] beta_k;         // site specific intercept
  real<lower=0> sigma_b;    // sd of site intercepts

  vector[K] delta_k;        // site specific treatment effect
  real<lower=0>  eta_0;     // sd of delta_k (around delta_c)

  vector[C] delta_c;        // control-specific effect
  real Delta;               // overall control effect
  
}

transformed parameters{ 
  
  vector[N] yhat;

  for (i in 1:N)  
      yhat[i] = alpha +  beta_k[kk[i]] + (ctrl[i] * (delta_k[kk[i]]));

}

model {
  
  // priors
  
  alpha ~ student_t(3, 0, 2.5);
  beta_k ~ normal(0, sigma_b);
  sigma_b ~ cauchy(0, 1);
  eta_0 ~ cauchy(0, 1);

  for (k in 1:K)
      delta_k[k] ~ normal(delta_c[cc[k]], eta_0);

  delta_c ~ normal(Delta, 0.5);
  Delta ~ normal(0, 10);
  
  // likelihood/outcome model
  
  y ~ bernoulli_logit(yhat);
}

We need to compile the stan code so that it can be called from the R script:

rt_c <- stanc("binary_outcome.stan")
sm_c <- stan_model(stanc_ret = rt_c, verbose=FALSE)

And here is the R code that prepares the data for the stan program and then samples from the posterior distribution. In this case, I will be using 4 different Monte Carlo chains of 2500 draws each (after allowing for 500 warm-up draws), so we will have an actual sample size of 10,000.

N <- nrow(di)
C <- di[, length(unique(c_type))]     
K <- di[, length(unique(site))]       
y <- as.numeric(di$y)                 
kk <- di$site                         
ctrl <- di$rx                         
cc <- di[, .N, keyby = .(site, c_type)]$c_type  
    
sampdat <- list(N=N, C=C, K=K, y=y, kk=kk, ctrl=ctrl, cc=cc)
    
fit_c <-  sampling(
  sm_c, data = sampdat, iter = 3000, warmup = 500, 
  show_messages = FALSE, cores = 4, refresh = 0,
  control = list(adapt_delta = 0.8)
)

Inspecting the posterior distribution

Assuming that everything has run smoothly, the first thing to do is to make sure that the MCMC algorithm adequately explored the full posterior distribution in the sampling process. We are typically interested in understanding the properties of the distribution, like the mean or median, or 95% credible intervals. To have confidence that these properties reflect the true posterior probabilities, we need to be sure that the sample they are drawn from is a truly representative one.

A quick and effective way to assess the “representativeness” of the sample is to take a look at the trace plot for a particular parameter, which literally tracks the path of the MCMC algorithm through the posterior distribution. Below, I’ve included two trace plots, one for $\Delta$, the overall effect, and the other for $\eta_0$, the variability of a study’s control effect $\delta_k$ around $\delta_c$. On the left, the plots appear as they should, with lines jumping up and down. However, these particular plots include red indicators where the algorithm got stuck, where there were divergent transitions. We really don’t want to see any of these indicators, because that is a sign that our sample is not representative of the posterior distribution.

posterior_c <- as.array(fit_c) 
lp_c <- log_posterior(fit_c)
np_c <- nuts_params(fit_c)

color_scheme_set("mix-brightblue-gray")

mcmc_trace(posterior_c, pars = "Delta", np = np_c) + 
  xlab("Post-warmup iteration")

mcmc_trace(
  posterior_c, pars = "Delta", np = np_c, window = c(1500, 1700)
) + 
  xlab("Post-warmup iteration")

On the right in the figure above, I’ve zoomed in on steps 700 to 900 to see if we can see any patterns. And sure enough, we can. In the $\Delta$ plot, straight lines appear in the middle, evidence that the sampling for some of the chains did indeed get stuck. Likewise, the plot for $\eta_0$ shows flat lines near $0$.

There’s an additional plot that shows the same thing but in a slightly more dramatic and comprehensive way. This plot (shown below) has a line for each step connecting the parameter estimates of that step. The red lines represent divergent transitions. The important thing to note here is that in all cases with divergent transitions, $\eta_0$ found itself close to $0$. In other words, the sampling was getting stuck at this point, and this is the likely culprit for the sampling issues.

color_scheme_set("darkgray")

parcoord_c <- mcmc_parcoord(
  posterior_c, np = np_c, 
  pars = c("eta_0", "sigma_b", "alpha",
           "delta_c[1]", "delta_c[2]","delta_c[3]", "Delta")
  )

parcoord_c +
  scale_x_discrete(expand = c(0.01, 0.01)) +
  theme(panel.background = element_rect(fill = "grey90")) +
  ylim(-3, 3) +
  ggtitle("Original model specification")

A remedy for divergent transitions

In a moment, I will provide a brief illustration of the perils of divergent transitions, but before that I want to describe “non-centered parameterization,” an approach that can be used to mitigate divergence. The idea is that since sampling from a standard Gaussian or normal distribution is less likely to lead to problematic transitions, we should try to do this as much as possible. “In a non-centered parameterization we do not try to fit the group-level parameters directly, rather we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation.” (See here for the source of this quote and much more.)

For example, in the original model specification, we parameterized $\delta_k$ and $\delta_c$ as

\[ \delta_k \sim N(\delta_c, \eta_0) \\ \delta_c \sim N(\Delta, 0.5) \]

\[ \eta_0 \sim Cauchy(0, 1) \\ \Delta \sim N(0, 10), \]

whereas using “non-centered” parameterization, we incorporate two latent standard normal variables $\theta_{rx}$ and $\theta_c$ into the model. $\delta_k$ and $\delta_c$ have the same prior distributions as in the original model, but we are now sampling from standard normal distributions and then shifting and scaling the draws: \[ \delta_k = \delta_c + \eta_0 \theta_{rx} \\ \theta_{rx} \sim N(0, 1) \]

\[ \delta_c = \Delta + 0.5\theta_c\\ \theta_c \sim N(0, 1) \]

\[ \eta_0 \sim Cauchy(0, 1) \\ \Delta \sim N(0, 10). \]

This transformation makes the path through the posterior distribution much smoother and, at least in this case, eliminates the divergent transitions.
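To see that the two parameterizations describe the same prior, here is a minimal base R sketch (it is not part of the original analysis). The values of `Delta` and `eta_0` are fixed, hypothetical numbers chosen only for illustration; in the model they are parameters with their own priors.

```r
set.seed(2020)

n <- 1e5
Delta <- 1.0   # hypothetical fixed value, for illustration only
eta_0 <- 0.3   # hypothetical fixed value, for illustration only

# centered: draw delta_c, then delta_k, directly from their priors
delta_c_cen <- rnorm(n, mean = Delta, sd = 0.5)
delta_k_cen <- rnorm(n, mean = delta_c_cen, sd = eta_0)

# non-centered: draw standard normals, then shift and scale
theta_c  <- rnorm(n)
theta_rx <- rnorm(n)
delta_c_nc <- Delta + 0.5 * theta_c
delta_k_nc <- delta_c_nc + eta_0 * theta_rx

# both versions should have mean Delta and sd sqrt(0.5^2 + eta_0^2)
c(mean(delta_k_cen), mean(delta_k_nc))
c(sd(delta_k_cen), sd(delta_k_nc), sqrt(0.5^2 + eta_0^2))
```

The distributions match; what changes is the geometry the sampler has to explore, since the quantities being sampled (`theta_c`, `theta_rx`) no longer depend on `eta_0` a priori.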

Here is the Stan code using non-centered parameterization (again, feel free to linger and study):

data {
  
  int<lower=0> N;                 // number of observations
  int<lower=0> C;                 // number of control types
  int<lower=1> K;                 // number of studies
  int y[N];                       // vector of binary outcomes
  int<lower=1,upper=K> kk[N];     // site for individual
  int<lower=0,upper=1> ctrl[N];   // treatment or control
  int<lower=1,upper=C> cc[K];     // specific control for site

}

parameters {

  real alpha;               // overall intercept for treatment
   
  real<lower=0> sigma_b;
  real<lower=0>  eta_0;     // sd of delta_k (around delta)

  real Delta;               // overall control effect
  
   // non-centered parameterization
  
  vector[K] z_ran_rx;   // standardized site-level treatment effects
  vector[K] z_ran_int;  // standardized site-level intercepts
  vector[C] z_ran_c;    // standardized control-type effects

}

transformed parameters{ 
  
  vector[N] yhat;
  vector[K] beta_k;
  vector[K] delta_k;        // site specific treatment effect
  vector[C] delta_c;

  beta_k = sigma_b * z_ran_int + alpha;
  
  for (i in 1:C)
    delta_c[i] = 0.5 * z_ran_c[i] + Delta; 
  
  for (i in 1:K)
    delta_k[i] = eta_0 * z_ran_rx[i] + delta_c[cc[i]]; 
  
  for (i in 1:N)  
      yhat[i] = beta_k[kk[i]] + ctrl[i] * delta_k[kk[i]];

}

model {
  
  // priors
  
  alpha ~ student_t(3, 0, 2.5);

  z_ran_c ~ std_normal();
  z_ran_int ~ std_normal();  
  z_ran_rx ~ std_normal();  

  sigma_b ~ cauchy(0, 1);
  eta_0 ~ cauchy(0, 1);
  
  Delta ~ normal(0, 10);
  
  // outcome model
  
  y ~ bernoulli_logit(yhat);
}

Looking at the trace plots from the non-centered model makes it clear that divergent transitions are no longer a problem. There are no red indicators, and the patterns of straight lines have been eliminated:

Proceed with caution if you ignore the divergence warnings

If we don’t heed the warnings, how bad can things be? Well, it will probably depend on the situation, but after exploring with multiple data sets, I have convinced myself that it is probably a good idea to reduce the number of divergent transitions as close to 0 as possible.

I conducted an experiment by generating 100 data sets and fitting a model using both the original and non-centered parameterizations. I collected the posterior distribution for each data set and estimation method, and plotted the density curves. (In case you are interested, I used a parallel process running on a high-performance computing core to do this; running on my laptop, this would have taken about 5 hours, but on the HPC it ran in under 15 minutes.) The purpose of this was to explore the shapes of the densities across the different data sets. I know it is a little odd to use this frequentist notion of repeatedly sampling datasets to evaluate the performance of these two approaches, but I find it to be illuminating. (If you’re interested in the code for that, let me know.)

Below on the right, the plot of the posteriors from the non-centered parameterization shows variability in location, but is otherwise remarkably consistent in shape and scale. On the left, posterior densities show much more variation; some are quite peaked and others are even bi-modal. (While I am not showing this here, the densities from samples with more divergent transitions tend to diverge the most from the well-behaved densities on the right.)

Although the mean or median estimates from a divergent sample may not be too far off from its non-divergent counterpart, the more general description of the distribution may be quite far off the mark, making it likely that inferences too will be off the mark.

Generating data from a truncated distribution

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 18 Aug 2020 00:00:00 +0000

A researcher reached out to me the other day to see if the simstudy package provides a quick and easy way to generate data from a truncated distribution. Other than the noZeroPoisson distribution option (which is a very specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy. I’ve actually had it in the back of my mind to add something like this to simstudy, but have hesitated because it might mean changing (or at least adding to) the defData table structure.

However, it may be time to go for it. The process and coding are actually relatively straightforward, so there is no real reason not to. I have been developing a simple prototype for several probability distributions (though the concept can easily be applied to any distribution where the cumulative distribution function, or CDF, is readily accessible), and am sharing it here in case you need to do this before it is available in the package, or if you just want to implement it yourself.

What is a truncated distribution?

A truncated probability distribution is one derived from limiting the domain of an existing distribution. A picture is worth a thousand words. On the left, we have a histogram for 10,000 observations drawn from a full (non-truncated) Gaussian or normal distribution with mean 0 and standard deviation 3. In the middle, the histogram represents data drawn from the positive portion of the same distribution (i.e., it is truncated on the left at 0). And on the far right, the truncation is defined by the boundaries $(-3, 3.5)$:

Leveraging the uniform distribution and a CDF

A while back, I described a copula approach to generating correlated data from different distributions (ultimately implemented in functions genCorGen and addCorGen). I wrote about combining a draw from a uniform distribution with the CDF of any target distribution to facilitate random number generation from the target distribution. This is an approach that also works well for truncated distributions, where the truncated distribution is the target.

Again - visuals help to explain how this works. To start, here are several CDFs of normal distributions with different means and variances:

The CDF of a distribution (usually written as $F(x)$) effectively defines that distribution: $F(x) = P(X \le x)$. Since probabilities by definition range from $0$ to $1$, we know that $F(x)$ also ranges from $0$ to $1$. It is also the case that $F(x)$ is monotonically increasing (or at least non-decreasing) from $0$ to $1$.

Let’s say we want to generate a draw from $N(\mu =0, \sigma = 3)$ using the CDF. We can first generate a draw $u \sim Uniform(0,1)$. We then treat $u$ as a value of the CDF, and map it back to $x$ to get our draw from the target distribution. So, $x = F^{-1}(u)$. In R, the inverse CDF (or quantile function) of the normal distribution is provided by the qnorm function, where the first argument is a probability value between $0$ and $1$. This would be the R code to generate a single draw from $N(0, 3)$ using a random draw from $Uniform(0, 1)$:

(u <- runif(1))

## [1] 0.9

qnorm(u, mean = 0, sd = 3)

## [1] 3.9
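As a quick check that `pnorm` (the CDF) and `qnorm` (its inverse) undo one another, here is a small sketch. The value 0.9 is taken from the rounded output above, so the result is close to, but not exactly, the draw that was printed.

```r
u <- 0.9                          # rounded value from the output above
x <- qnorm(u, mean = 0, sd = 3)   # inverse CDF: probability -> value
x
pnorm(x, mean = 0, sd = 3)        # CDF: value -> probability, recovers u
```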

This is how $u = 0.9$ relates to the draw of $x=3.9$:

To generate a random sample of 10,000 draws from $N(0, 3)$, this process is replicated 10,000 times:

library(ggplot2)

u <- runif(10000)
x <- qnorm(u, mean = 0, sd = 3)

ggplot(data = data.frame(x), aes(x = x)) +
  geom_histogram(fill = "#CCC591", alpha = 1, binwidth = .2, boundary = 0) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank())

Extending the inverse process to generate truncation

Let’s say we are only interested in generating data from the middle portion of the $N(0,3)$ distribution, between $a$ and $b$. The trick is to use the corresponding CDF values, $F(a)$ and $F(b)$ as the basis of the randomization.

To generate data within the constraints $a$ and $b$, all we would need to do is generate a value from the uniform distribution with minimum equal to $F(a)$ and maximum $F(b)$. We then conduct the mapping as we did before when drawing from the full distribution. By constraining $u$ to be between $F(a)$ and $F(b)$, we force the values of the target distribution to lie between $a$ and $b$.

Now we are ready to create a simple function, rnormt, that implements this. The pnorm function provides the CDF at a particular value:

rnormt <- function(n, range, mu, s = 1) {
  
  # range is a vector of two values
  
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  
  qnorm(u, mean = mu, sd = s)
  
}
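One way to convince yourself that this works is to compare the sample mean with the known mean of a truncated normal distribution. The sketch below repeats the steps of the function inline (so it runs on its own) for truncation at $(0, \infty)$; the variable names `a`, `b`, `alpha`, and `beta` are mine.

```r
set.seed(1)

a <- 0; b <- Inf     # truncation limits
mu <- 0; s <- 3

# same steps as in rnormt: uniform draw between F(a) and F(b), then map back
u <- runif(10000, min = pnorm(a, mu, s), max = pnorm(b, mu, s))
x <- qnorm(u, mean = mu, sd = s)

range(x)             # every draw should fall inside (a, b)

# theoretical mean of a normal distribution truncated to (a, b)
alpha <- (a - mu) / s
beta  <- (b - mu) / s
mu_trunc <- mu + s * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))

c(theoretical = mu_trunc, observed = mean(x))
```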

Here, I am generating the data plotted above, showing the code this time around.

library(data.table)
library(simstudy)
library(paletteer)

defC <- defCondition(condition= "tt == 1", 
                     formula = "rnormt(10000, c(-Inf, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 2", 
                     formula = "rnormt(10000, c(0, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 3", 
                     formula = "rnormt(10000, c(-3, 3.5), mu = 0, s = 3)")

dd <- genData(30000)
dd <- trtAssign(dd, nTrt = 3, grpName = "tt")
dd <- addCondition(defC, dd, "x")

dd[, tt := factor(tt, 
     labels = c("No truncation", "Left truncation at 0", "Left and right truncation"))]

ggplot(data = dd, aes(x = x, group = tt)) +
  geom_histogram(aes(fill = tt), alpha = 1, binwidth = .2, boundary = 0) +
  facet_grid(~tt) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank(),
        legend.position = "none") +
  scale_fill_paletteer_d("wesanderson::Moonrise2")

Going beyond the normal distribution

With this simple approach, it is possible to generate a truncated distribution using any distribution available in R. Here is another example that allows us to generate truncated data from a gamma distribution:

rgammat <- function(n, range, shape, scale = 1) {
  
  F.a <- pgamma(min(range), shape = shape, scale = scale)
  F.b <- pgamma(max(range), shape = shape, scale = scale)
  
  u <- runif(n, min = F.a, max = F.b)

  qgamma(u, shape = shape, scale = scale)

}

To conclude, here is a plot of gamma-based distributions using rgammat. And I’ve added similar plots for beta and Poisson distributions - I’ll leave it to you to write the functions. But, if you don’t want to do that, simstudy will be updated at some point soon to help you out.
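If you would rather not write a separate function for each distribution, the same pattern can be wrapped once. This is only a sketch: `rtrunc` is a hypothetical name (it is not a simstudy function), and it takes the CDF and quantile functions as arguments. I have limited the example to a continuous distribution; for a discrete distribution such as the Poisson, the CDF jumps at each integer, so the lower bound needs more care.

```r
set.seed(1)

# pfun is the CDF, qfun is the matching quantile function;
# additional arguments in ... are passed to both
rtrunc <- function(n, range, pfun, qfun, ...) {

  F.a <- pfun(min(range), ...)
  F.b <- pfun(max(range), ...)

  u <- runif(n, min = F.a, max = F.b)

  qfun(u, ...)

}

# beta(2, 5) truncated to (0.2, 0.6)
x <- rtrunc(10000, c(0.2, 0.6), pbeta, qbeta, shape1 = 2, shape2 = 5)
range(x)
```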

A hurdle model for COVID-19 infections in nursing homes

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 04 Aug 2020 00:00:00 +0000

Late last year, I added a mixture distribution to the simstudy package, largely motivated to accommodate zero-inflated Poisson or negative binomial distributions. (I really thought I had added this two years ago - but time is moving so slowly these days.) These distributions are useful when modeling count data, but we anticipate observing more than the expected frequency of zeros that would arise from a non-inflated (i.e. “regular”) Poisson or negative binomial distribution.

There is a related, though subtly different model - the hurdle model - that I want to talk about here, because it has come up in the context of planning a new study to address COVID-19 infection rates in nursing homes using much more aggressive testing strategies.

A hurdle model is a modified count model that also assumes a relatively high frequency of zeros, but is set up as a two-stage data generation process rather than as a mixture distribution. In the first stage, the development of any (i.e. at least 1) new infections during the observation period is described by a binomial probability distribution. If there is at least one infection, the process has transitioned over the “hurdle” into the second stage; here, the number of new infections is described by a non-zero count distribution, such as a truncated Poisson or truncated negative binomial distribution.

Model specification

If $N_i$ is the number of new infections in nursing home $i$ over the observation period and we have two intervention arms $T_i \in \{0,1\}$, the intervention effect at each stage of the process can be modeled simply as:

\[ \text{logodds}\left[ P(N_i > 0) \right] = \beta_0 + \beta_1 T_i + \boldsymbol{X_i \beta_2}\]

\[\text{log}(N_i | N_i \ge 1) = \alpha_0 + \alpha_1 T_i + \boldsymbol{X_i \alpha_2} + \text{log}(D_i)\]

The intervention effect for the binomial stage is $\beta_1$ (on the logodds scale) and the intervention effect for the hurdle (count) stage is $\alpha_1$ (on the log scale). $\boldsymbol{X_i}$ are any covariates that are used for stratified randomization.

$D_i$ is the number of resident-days observed during the follow-up period, and $\text{log}(D_i)$ is the “offset”; we are effectively modeling a rate of infections $\text{log}(N_i/D_i)$. This will take into account the fact that residents will be observed for different lengths of time - some moving into the nursing home after the study has started, and others leaving or dying before the study is complete.
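To illustrate what the offset does, here is a minimal sketch using an ordinary Poisson regression in base R (not the hurdle model itself). The exposure values and sample size are hypothetical. Because $\text{log}(D_i)$ enters with a coefficient fixed at 1, the remaining coefficients describe rates per resident-day.

```r
set.seed(101)

n  <- 500
rx <- rep(0:1, each = n / 2)
D  <- rpois(n, 8000)                       # resident-days (hypothetical values)
y  <- rpois(n, (20 / 8000) * 0.8^rx * D)   # expected count is rate x exposure

fit <- glm(y ~ rx + offset(log(D)), family = poisson)

# intercept: baseline rate per resident-day (true value 20/8000 = 0.0025)
# rx: rate ratio (true value 0.8)
exp(coef(fit))
```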

Simulating a hurdle model

Simulating data from this model is relatively straightforward, complicated only by the need to generate varying observation periods. Essentially, we must generate two outcomes - a binary outcome and a non-zero count outcome (in this case it will be from a non-zero Poisson distribution), and the observed outcome is 0 if the binary outcome is actually 0, and the value of the count outcome if the binary outcome is 1.
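Stripped of the varying observation periods, the two-stage process can be sketched in a few lines of base R. The probability and Poisson mean below are hypothetical values chosen so the truncation is easy to see, and the non-zero Poisson draw uses an inverse-CDF trick (a uniform draw restricted to lie above $P(N = 0)$) rather than anything from simstudy.

```r
set.seed(1)

n      <- 1000
p_any  <- 0.8    # hypothetical P(at least one infection)
lambda <- 1.5    # hypothetical mean of the untruncated Poisson

# stage 1: does the nursing home have any infections at all?
any_inf <- rbinom(n, size = 1, prob = p_any)

# stage 2: non-zero Poisson count, drawn by restricting u to (P(N = 0), 1)
u   <- runif(n, min = dpois(0, lambda), max = 1)
cnt <- qpois(u, lambda)

# observed outcome: 0 if the hurdle is not crossed, the count otherwise
y <- any_inf * cnt

table(y)
```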

To get things going, here are the packages I will use. The pscl package provides a function hurdle to estimate the model parameters from our simulated data, and stargazer package outputs the model in a nice, readable format.

library(simstudy)
library(data.table)
library(ggplot2)
library(pscl)
library(stargazer)

Data generation

In this simulation the average observation time is 80 days (out of 90 maximum), and on average, each nursing home will have 100 residents. In the control arm, 95% of the nursing homes will have at least one infection, and 80% of the intervention arm will have at least one. The corresponding odds ratio is $(0.80/0.20)/(.95/0.05) = 0.21$.

The infection rate per 1000 resident-days for the control arm will be $\sim (20/8000)*1000 = 2.5$; for the intervention arm, the rate will be $\sim (20/8000) *0.8 * 1000 = 2.0$.
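The arithmetic behind these assumptions can be checked directly (the helper `odds` is my own shorthand):

```r
p_ctrl <- 0.95   # P(at least one infection), control arm
p_int  <- 0.80   # P(at least one infection), intervention arm

odds <- function(p) p / (1 - p)
round(odds(p_int) / odds(p_ctrl), 2)   # 0.21

rate_ctrl <- (20 / 8000) * 1000        # 2.5 per 1000 resident-days
rate_int  <- rate_ctrl * 0.8           # 2.0 per 1000 resident-days
c(rate_ctrl, rate_int)
```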

Here is the data definition table defHurdle created by the function defDataAdd that encodes these assumptions:

##    varname                             formula variance          dist     link
## 1:    nRes                                 100        0       poisson identity
## 2:   aDays                                  80        0       poisson identity
## 3:   nDays                     pmin(90, aDays)        0     nonrandom identity
## 4:   pDays                        nRes * nDays        0     nonrandom identity
## 5:    xBin                    0.95 - 0.15 * rx        0        binary identity
## 6:    xCnt log(20/8000)+log(0.8)*rx+log(pDays)        0 noZeroPoisson      log
## 7:       y                         xBin * xCnt        0     nonrandom identity
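The code that built this table is not shown, so what follows is a reconstruction from the printed rows, assuming the usual defDataAdd arguments (varname, formula, dist, link); treat it as a sketch rather than the original definition.

```r
library(simstudy)

# reconstructed row by row from the printed table above
defHurdle <- defDataAdd(varname = "nRes", formula = 100, dist = "poisson")
defHurdle <- defDataAdd(defHurdle, varname = "aDays", formula = 80,
                        dist = "poisson")
defHurdle <- defDataAdd(defHurdle, varname = "nDays",
                        formula = "pmin(90, aDays)", dist = "nonrandom")
defHurdle <- defDataAdd(defHurdle, varname = "pDays",
                        formula = "nRes * nDays", dist = "nonrandom")
defHurdle <- defDataAdd(defHurdle, varname = "xBin",
                        formula = "0.95 - 0.15 * rx", dist = "binary")
defHurdle <- defDataAdd(defHurdle, varname = "xCnt",
                        formula = "log(20/8000)+log(0.8)*rx+log(pDays)",
                        dist = "noZeroPoisson", link = "log")
defHurdle <- defDataAdd(defHurdle, varname = "y",
                        formula = "xBin * xCnt", dist = "nonrandom")

defHurdle
```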

The data generation is only at the nursing home level. In this example, we are assuming 500 nursing homes:

set.seed(29211)
dx <- genData(500)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(defHurdle, dx)
  
dx

##       id rx nRes aDays nDays pDays xBin xCnt  y
##   1:   1  1  113    86    86  9718    1   16 16
##   2:   2  0   89    66    66  5874    1   16 16
##   3:   3  1   83    82    82  6806    1   13 13
##   4:   4  0   91    95    90  8190    1   27 27
##   5:   5  1   97    70    70  6790    0   17  0
##  ---                                           
## 496: 496  1  116    85    85  9860    0   17  0
## 497: 497  1   89    94    90  8010    1   14 14
## 498: 498  0  112    92    90 10080    1   20 20
## 499: 499  1   97    71    71  6887    1   21 21
## 500: 500  0   92    68    68  6256    1   13 13

Data visualization

A plot of the data shows the effect at each stage of the hurdle process:

dx[, rate1000 := (y/pDays)*1000]
dx[ , rx := factor(rx, labels = c("No intervention", "Intervention"))]
dm <- dx[rate1000 != 0, .(mu = mean(rate1000)), keyby = rx]

ggplot(data = dx, aes(x = rate1000)) +
  geom_vline(aes(xintercept = mu), data = dm, lty = 3, color = "grey50") +
  geom_histogram(binwidth = .1,
                 aes(y = (..count..)/sum(..count..), fill = rx)) +
  facet_grid(rx ~ .) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  scale_y_continuous(labels = scales::percent, 
                     name = "proportion",
                     expand = c(0, 0),
                     breaks = c(c(.05, .10)),
                     limits = c(0, .11)) +
  scale_x_continuous(name = "# infections per 1000 resident-days") +
  scale_fill_manual(values = c("#EDCB64", "#B62A3D"))

Parameter estimation

I fit two models here. The first includes a possible intervention effect, and the second assumes no intervention effect. The purpose in fitting the second model is to provide a basis of comparison.

hfit1 <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx)
hfit1.0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)

The hurdle model returns two sets of estimates. The first component of the model shown here is the binomial model. The estimated intervention effect (odds ratio) is $exp(-1.570) = 0.21$, as expected. Note that the log-likelihood reported here is for the composite hurdle model (both stages).

stargazer(hfit1, hfit1.0, type = "text", zero.component = TRUE, 
          notes =" ", notes.append = FALSE, notes.label="", 
          dep.var.labels.include = FALSE, dep.var.caption = "",
          omit.stat = "n", object.names = TRUE, model.numbers = FALSE)

## 
## ====================================
##                  hfit1     hfit1.0  
## ------------------------------------
## rxIntervention -1.570***            
##                 (0.325)             
##                                     
## Constant        2.900***   1.900*** 
##                 (0.285)    (0.133)  
##                                     
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
##

The second component is the count model. The estimated intervention effect is $exp(-0.279) = 0.76$, which is close to the true value of $0.80$. (The reported log-likelihoods are the same as in the binomial model.)

stargazer(hfit1, hfit1.0, type = "text", zero.component = FALSE, 
          notes =" ", notes.append = FALSE, notes.label="", 
          dep.var.labels.include = FALSE, dep.var.caption = "",
          omit.stat = "n", object.names = TRUE, model.numbers = FALSE)

## 
## ====================================
##                  hfit1     hfit1.0  
## ------------------------------------
## rxIntervention -0.279***            
##                 (0.023)             
##                                     
## Constant       -5.980***  -6.090*** 
##                 (0.014)    (0.011)  
##                                     
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
##

In this particular case, the intervention alters both the binomial probability and the count distribution, but that will not necessarily always be the case. A log-likelihood ratio test (LRT) is a global test that compares the model that explicitly excludes an intervention effect (hfit1.0) with the model that includes an intervention effect. If the likelihoods under each are close enough, then the model that excludes the intervention effect is considered sufficient, and there is no reason to conclude that the intervention is effective. We can use the p-value based on the LRT as a measure of whether or not the intervention is generally effective, either because it changes the binomial probability, the count distribution, or both.

In this case, the p-value is quite low:

lrt1 <- -2*(logLik(hfit1.0) - logLik(hfit1))
1 - pchisq(lrt1, 2)

## 'log Lik.' 0 (df=2)
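The same calculation can be done by hand from the log-likelihoods in the tables above. Those values are rounded, so this is only an approximation of the statistic computed from the fitted objects. The full model has two more parameters than the null model (one intervention effect in each stage), which is where the 2 degrees of freedom come from.

```r
ll_full <- -1424   # hfit1, rounded value from the table above
ll_null <- -1511   # hfit1.0, rounded value from the table above

lrt <- -2 * (ll_null - ll_full)
lrt                                  # 174

1 - pchisq(lrt, df = 2)              # rounds to exactly 0 in floating point

# asking for the upper tail directly avoids the rounding;
# with df = 2 the upper tail probability is exp(-lrt / 2)
p <- pchisq(lrt, df = 2, lower.tail = FALSE)
c(p, exp(-lrt / 2))
```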

Alternative scenarios

Here are three additional scenarios that provide examples of ways the intervention can affect the outcome. In Scenario 2, the intervention no longer has an effect on the probability of having at least one infection, but still has an effect on the count. In Scenario 3, the intervention only affects the probability of having at least one infection, and not the count distribution. And in Scenario 4, the intervention has no effect at all at either stage.

defHurdle.V2 <- updateDef(defHurdle, "xBin", "0.95")
defHurdle.V3 <- updateDef(defHurdle, "xCnt", "log(20/8000) + log(pDays)")
defHurdle.V4 <- updateDef(defHurdle.V3, "xBin", "0.95")

The plots bear out the underlying parameters. We can see the probability of a zero is the same across treatment arms in Scenario 2, just as the distributions of the count variable in Scenario 3 appear equivalent. In Scenario 4, it is hard to distinguish between the two distributions across interventions.

Here are the model fits - the results are consistent with the plots:

## 
## ================================================================================
##                  hfit2     hfit2.0     hfit3     hfit3.0     hfit4     hfit4.0  
## --------------------------------------------------------------------------------
## rxIntervention   -0.099              -0.871***               0.058              
##                 (0.446)               (0.287)               (0.342)             
##                                                                                 
## Constant        3.180***   3.130***   2.440***   1.940***   2.500***   2.530*** 
##                 (0.323)    (0.223)    (0.233)    (0.135)    (0.239)    (0.171)  
##                                                                                 
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
##

## 
## ================================================================================
##                  hfit2     hfit2.0     hfit3     hfit3.0     hfit4     hfit4.0  
## --------------------------------------------------------------------------------
## rxIntervention -0.210***               0.010                 -0.029             
##                 (0.022)               (0.022)               (0.021)             
##                                                                                 
## Constant       -6.010***  -6.110***  -6.000***  -6.000***  -5.980***  -6.000*** 
##                 (0.015)    (0.011)    (0.015)    (0.011)    (0.015)    (0.010)  
##                                                                                 
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
##

And finally, the p-values from the LRTs of the models under each of the three scenarios are consistent with the underlying data generating processes. It is only in the last scenario where there is no reason to believe that the intervention has some sort of effect.

round(c(lrt2 = 1 - pchisq(lrt2, 2), 
        lrt3=1 - pchisq(lrt3, 2), 
        lrt4=1 - pchisq(lrt4, 2)), 4)

##   lrt2   lrt3   lrt4 
## 0.0000 0.0067 0.3839

Addendum - estimating power

If you’ve visited my blog before, you might have picked up on the fact that I like to use simulation to estimate sample size or power when planning a randomized trial. This allows me to be sure everyone understands the assumptions.

To estimate power, I generate multiple data sets under a specific set of assumptions and estimate intervention effects for each data set. The power of the study under this set of assumptions is the proportion of times we would conclude that the intervention is effective. In the context of a hurdle model, I use the p-value from the LRT as the arbiter of effectiveness; the proportion of p-values less than 0.05 is the power.

gData <- function(n, def) {
  
  # generate n nursing homes using the definitions passed in as def
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addColumns(def, dx)
  
  dx[]
  
}

estModel <- function(dx) {
  
  # fit the models with and without an intervention effect
  hfit <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx)
  hfit0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)
  lrt <- -2*(logLik(hfit0) - logLik(hfit))

  data.table(p.zero = coef(summary(hfit))$zero["rx", "Pr(>|z|)"],
             p.count = coef(summary(hfit))$count["rx", "Pr(>|z|)"],
             X2 = 1 - pchisq(lrt, 2))
  
}

iter <- function(n, def, i) {
  
  # one replication: generate a data set, then estimate the models
  dx <- gData(n, def)
  hfit <- estModel(dx)
  return(data.table(i = i, hfit))
  
}

diter <- rbindlist(lapply(1:1000, function(i) iter(50, defHurdle, i)))

Here are the results from the individual replications under the Scenario 1 effect assumptions with 50 nursing homes:

diter

##          i p.zero  p.count       X2
##    1:    1 0.9975 4.06e-04 0.000437
##    2:    2 0.0449 1.05e-03 0.000216
##    3:    3 0.0713 5.92e-03 0.002246
##    4:    4 0.0449 5.85e-04 0.000128
##    5:    5 0.1891 3.20e-02 0.034025
##   ---                              
##  996:  996 0.3198 7.04e-03 0.014600
##  997:  997 0.1891 1.13e-02 0.013579
##  998:  998 0.3198 8.16e-04 0.001973
##  999:  999 1.0000 4.45e-06 0.000023
## 1000: 1000 0.5590 2.34e-03 0.007866

And here is the estimate of power - in this case there is about 90% power that we will conclude that there is an effect of some type given the assumptions under Scenario 1:

diter[, mean(X2 <= 0.05)]

## [1] 0.898
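Since the power estimate is itself a proportion based on 1000 replications, it comes with some Monte Carlo error. A rough sketch of that uncertainty, using the usual normal approximation for a binomial proportion (my addition, not part of the original analysis):

```r
power_hat <- 0.898   # estimate reported above
n_rep     <- 1000    # number of simulated data sets

se <- sqrt(power_hat * (1 - power_hat) / n_rep)
se

# approximate 95% interval for the power estimate
round(power_hat + c(-1.96, 1.96) * se, 3)
```

With 1000 replications the standard error is just under 0.01, so the estimate of about 90% is precise to within a couple of percentage points.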

In conclusion, here is a power plot for a range of effect size assumptions, sample size assumptions, and control arm assumptions. In all of these cases, I assumed that the binomial probability under the control condition would be 70%. (If anyone wants to see the code for generating all of this data and the plot, I can post it on github. However, it is really just an extension of what is shown here.)

A Bayesian model for a simulated meta-analysis

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 Jul 2020 00:00:00 +0000

This is essentially an addendum to the previous post where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the nlme package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package rstan.

Create the data set

We’ll use the exact same data generating process as described in some detail in the previous post.

library(simstudy)
library(rstan)
library(data.table)

defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist= "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study", 
    formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
    dist = "poisson")

defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)

ds <- genData(12, defS)

dc <- genCluster(ds, "study", "n.study", "id")
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)

d.obs <- dc[, .(study, id, x, y)]

Build the Stan model

There are multiple ways to estimate a Stan model in R, but I choose to build the Stan code directly rather than using the brms or rstanarm packages. In the Stan code, we need to define the data structure, specify the parameters, specify any transformed parameters (which are just a function of the parameters), and then build the model - which includes laying out the prior distributions as well as the likelihood.

In this case, the model is slightly different from what was presented in the context of a mixed effects model. This is the mixed effects model:

\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\ \delta_k = \delta_0 + v_k \\ e_{ik} \sim N(0, \sigma_k^2), \ v_k \sim N(0,\tau^2) \]

In this Bayesian model, things are pretty much the same:

\[ y_{ik} \sim N(\alpha_k + \delta_k x_{ik}, \sigma_k^2) \\ \delta_k \sim N(\Delta, \tau^2) \]

The key difference is that there are prior distributions on $\Delta$ and $\tau$, introducing an additional level of uncertainty into the estimate. I would expect that the estimate of the overall treatment effect $\Delta$ will have a wider 95% CI (credible interval in this context) than the 95% CI (confidence interval) for $\delta_0$ in the mixed effects model. This added measure of uncertainty is a strength of the Bayesian approach.

data {
  int<lower=0> N;               // number of observations
  int<lower=1> K;               // number of studies
  real y[N];                    // vector of continuous outcomes
  int<lower=1,upper=K> kk[N];   // study for individual
  int<lower=0,upper=1> x[N];    // treatment arm for individual
}

parameters {
  vector[K] beta;               // study-specific intercept
  vector[K] delta;              // study effects
  real<lower=0> sigma[K];       // sd of outcome dist - study specific
  real Delta;                   // average treatment effect
  real <lower=0> tau;           // variation of treatment effects
}

transformed parameters{ 
  
  vector[N] yhat;
  
  for (i in 1:N)  
      yhat[i] = beta[kk[i]] + x[i] * delta[kk[i]];
}

model {
  
  // priors
  
  sigma ~ normal(0, 2.5);
  beta ~ normal(0, 10);
  
  tau ~ normal(0, 2.5);
  Delta ~ normal(0, 10);
  delta ~ normal(Delta, tau);


  // outcome model
  
  for (i in 1:N)
    y[i] ~ normal(yhat[i], sigma[kk[i]]);
}
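The generative structure encoded in the model block can be mimicked in base R, which can help in reading the Stan code. This is only a sketch: the parameter values are hypothetical (they loosely echo the data definitions above), every study is given the same size and outcome standard deviation for simplicity, and the study-level estimates are crude differences in means rather than posterior summaries.

```r
set.seed(77)

K   <- 12                      # number of studies
n_k <- 40                      # individuals per study (hypothetical, equal sizes)

Delta <- 3                     # hypothetical overall treatment effect
tau   <- sqrt(6)               # hypothetical sd of study-specific effects

beta_k  <- rnorm(K, mean = 3, sd = sqrt(2))   # study-specific intercepts
delta_k <- rnorm(K, mean = Delta, sd = tau)   # delta ~ normal(Delta, tau)
sigma_k <- rep(4, K)                          # study-specific outcome sd

kk <- rep(1:K, each = n_k)                       # study for individual
x  <- rep(rep(0:1, each = n_k / 2), times = K)   # treatment arm for individual

# y[i] ~ normal(yhat[i], sigma[kk[i]])
yhat <- beta_k[kk] + x * delta_k[kk]
y    <- rnorm(K * n_k, mean = yhat, sd = sigma_k[kk])

# crude study-level effect estimates: difference in arm means
est <- tapply(y[x == 1], kk[x == 1], mean) - tapply(y[x == 0], kk[x == 0], mean)
round(cbind(true = delta_k, estimate = est), 2)
```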

Generate the posterior distributions

With the model in place, we transform the data into a list so that Stan can make sense of it:

N <- nrow(d.obs)                               ## number of observations
K <- dc[, length(unique(study))]               ## number of studies
y <- d.obs$y                                   ## vector of continuous outcomes
kk <- d.obs$study                              ## study for individual
x <- d.obs$x                                   ## treatment arm for individual

ddata <- list(N = N, K = K, y = y, kk = kk, x = x)

And then we compile the Stan code:

rt <- stanc("model.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)

Finally, we can sample data from the posterior distribution:

fit <-  sampling(sm, data=ddata, seed = 3327, iter = 10000, warmup = 2500,
                 control=list(adapt_delta=0.9))

Check the diagnostic plots

Before looking at any of the output, it is imperative to convince ourselves that the MCMC process was a stable one. The trace plot is the most basic way to assess this. Here, I am only showing these plots for $\Delta$ and $\tau$, but the plots for the other parameters looked similar, which is to say everything looks good:

pname <- c("Delta", "tau")
stan_trace(object = fit, pars = pname)
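
A numeric companion to the trace plot is the Rhat statistic reported in the printed output. As a rough sketch of what that number measures, here is the classical split-chain version computed in base R on simulated stand-in chains. The helper name split_rhat and the simulated draws are my own illustrative assumptions; this is not rstan's internal implementation.

```r
# Hypothetical helper (not part of rstan): classical split-Rhat for one parameter.
# draws: matrix with one row per post-warmup iteration and one column per chain.
split_rhat <- function(draws) {
  n <- nrow(draws) %/% 2                      # length of each half-chain
  halves <- cbind(draws[1:n, , drop = FALSE],
                  draws[(n + 1):(2 * n), , drop = FALSE])
  W <- mean(apply(halves, 2, var))            # average within-chain variance
  B <- n * var(colMeans(halves))              # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
good <- matrix(rnorm(4000), ncol = 4)         # four well-mixed stand-in chains
bad  <- good
bad[, 1] <- bad[, 1] + 3                      # one chain sitting in a different region

rhat_good <- split_rhat(good)
rhat_bad  <- split_rhat(bad)
round(c(good = rhat_good, bad = rhat_bad), 3)
```

Values close to 1 indicate that the chains agree with each other; the shifted chain pushes the statistic well above 1, which is the kind of problem the trace plot would also reveal.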

Look at the results

It is possible to inspect the distribution of any or all parameters. In this case, I am particularly interested in the treatment effects at the study level, and overall. That is, the focus here is on $\Delta$, $\delta_k$, and $\tau$.

pname <- c("delta", "Delta","tau")
print(fit, pars=pname, probs = c(0.05, 0.5, 0.95))

## Inference for Stan model: model.
## 4 chains, each with iter=10000; warmup=2500; thin=1; 
## post-warmup draws per chain=7500, total post-warmup draws=30000.
## 
##            mean se_mean   sd    5%   50%  95% n_eff Rhat
## delta[1]   6.39    0.01 1.13  4.51  6.41 8.22 29562    1
## delta[2]  -0.78    0.01 1.62 -3.45 -0.78 1.85 28188    1
## delta[3]  -0.14    0.01 1.39 -2.37 -0.16 2.18 28909    1
## delta[4]   3.08    0.00 0.59  2.09  3.08 4.05 34277    1
## delta[5]  -0.16    0.01 1.01 -1.77 -0.18 1.52 27491    1
## delta[6]   3.87    0.00 0.86  2.47  3.87 5.27 35079    1
## delta[7]   4.04    0.01 1.11  2.21  4.03 5.87 32913    1
## delta[8]   5.23    0.01 1.29  3.12  5.23 7.36 33503    1
## delta[9]   1.79    0.01 1.25 -0.27  1.78 3.82 30709    1
## delta[10]  1.38    0.01 1.12 -0.46  1.38 3.21 30522    1
## delta[11]  4.47    0.01 1.25  2.43  4.47 6.54 34573    1
## delta[12]  0.79    0.01 1.45 -1.60  0.80 3.16 33422    1
## Delta      2.48    0.00 0.89  1.01  2.50 3.89 31970    1
## tau        2.72    0.00 0.71  1.72  2.64 4.01 24118    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 15:47:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The forest plot is quite similar to the one based on the mixed effects model, though as predicted, the 95% CI is considerably wider:

As a comparison, here is the plot from the mixed effects model estimated using the nlme package in the previous post. The bootstrapped estimates of uncertainty at the study level are quite close to the Bayesian measure of uncertainty; the difference really lies in the uncertainty around the global estimate.

Simulating multiple RCTs to simulate a meta-analysis

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 07 Jul 2020 00:00:00 +0000

I am currently involved with an RCT that is struggling to recruit eligible patients (by no means an unusual problem), increasing the risk that findings might be inconclusive. A possible solution to this conundrum is to find similar, ongoing trials with the aim of pooling data in a single analysis, to conduct a meta-analysis of sorts.

In an ideal world, this theoretical collection of sites would have joined forces to develop a single study protocol, but often there is no structure or funding mechanism to make that happen. However, this group of studies may be similar enough - based on the target patient population, study inclusion and exclusion criteria, therapy protocols, comparison or control condition, randomization scheme, and outcome measurement - that it might be reasonable to estimate a single treatment effect and some measure of uncertainty.

This pooling approach would effectively be a prospective meta-analysis using individual participant data. The goal is to estimate a single treatment effect for this intervention or therapy that has been evaluated by different groups under varying research conditions, with possibly different treatment effects in each study.

To explore how all of this works, I generated some data and fit some models. As usual I thought the code would be more useful sitting on this blog rather than hidden away on some secure server.

The model

In this simulation, I am using a generic continuous outcome $y_{ik}$, for individual $i$ who is participating in study $k$. The individual outcome is a function of the study itself and whether that individual received the experimental therapy ($x_{ik} = 1$ for patients in the experimental arm):

\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\ \\ \delta_k = \delta_0 + v_k \]

$\alpha_k$ is the intercept for study $k$, or the average outcome for patients in study $k$ in the control arm. $\delta_k$ is the treatment effect in study $k$ and can be decomposed into a common treatment effect across all studies $\delta_0$ and a study-specific effect $v_k$. $v_k$ is often assumed to be normally distributed, $v_k \sim N(0, \tau^2)$. An individual effect, $e_{ik}$, is also assumed to be normally distributed, $e_{ik} \sim N(0, \sigma_k^2)$. Note that the variance $\sigma_k^2$ of individual effects might differ across studies; that is, in some studies patients may be more similar to each other than in other studies.

The simulation assumptions

Before starting - here are the necessary libraries in case you want to follow along:

library(simstudy)
library(parallel)
library(nlme)
library(data.table)

In these simulations, there are 12 studies, each enrolling a different number of patients: some are smaller, some moderately sized, and some larger. We are not really interested in the variability of the intercepts ($\alpha_k$’s), but we generate them from a normal distribution $N(3, 2)$ (mean 3, variance 2). The overall treatment effect is set at $3$, and the study-specific effects are distributed as $N(0, 6)$. We use a gamma distribution to create the study-specific within-study variances $\sigma^2_k$: the average within-study variance is $16$, and the values will range roughly between $1$ and $64$ (the variance of the variances is $mean^2 \times dispersion = 16^2 \times 0.2 = 51.2$). The study-specific data are generated using these assumptions:

defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist= "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study", 
    formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
    dist = "poisson")
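
The gamma arithmetic quoted above can be checked directly in base R. This sketch assumes the usual mean/dispersion conversion (shape = 1/dispersion, rate = 1/(mean × dispersion)), which I take to be what simstudy does internally; the variable names are mine.

```r
mean_var   <- 16     # average within-study variance (from the text)
dispersion <- 0.2    # dispersion supplied in the definition

shape <- 1 / dispersion                 # assumed conversion: 5
rate  <- 1 / (mean_var * dispersion)    # assumed conversion: 1 / 3.2

theo_mean <- shape / rate               # mean of a gamma distribution
theo_var  <- shape / rate^2             # variance of a gamma distribution

set.seed(123)
s2 <- rgamma(1e5, shape = shape, rate = rate)   # simulated within-study variances
c(mean = mean(s2), var = var(s2))
quantile(s2, c(0.005, 0.995))                   # where most of the draws fall
```

The theoretical mean and variance reproduce the $16$ and $51.2$ given in the text, and the simulated values should land close to them.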

The individual outcomes are generated based on the model specified above:

defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")

Data generation

First, we generate the study level data:

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)

ds <- genData(12, defS)
ds

##     study  a.k d.0     v.k  s2.k size.study n.study
##  1:     1 2.51   3  2.7437  5.25          2      30
##  2:     2 1.51   3 -4.8894 30.48          2      37
##  3:     3 1.62   3 -4.1762 15.06          1      22
##  4:     4 3.34   3  0.2494  3.26          2      44
##  5:     5 2.34   3 -2.9078  5.59          1      15
##  6:     6 1.70   3  1.3498  7.42          2      44
##  7:     7 4.17   3 -0.4135 14.58          2      45
##  8:     8 2.14   3  0.7826 25.78          2      44
##  9:     9 2.54   3 -1.1197 15.72          1      28
## 10:    10 3.10   3 -2.1275 10.00          1      24
## 11:    11 2.62   3 -0.0812 32.76          2      40
## 12:    12 1.17   3 -0.5745 30.94          2      49

And then we generate individuals within each study, assign treatment, and add the outcome:

dc <- genCluster(ds, "study", "n.study", "id")
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)

The observed data set obviously does not include any underlying study data parameters. The figure based on this data set shows the individual-level outcomes by treatment arm for each of the 12 studies. The study-specific treatment effects and differences in within-study variation are readily apparent.

d.obs <- dc[, .(study, id, x, y)]

Initial estimates

If each study went ahead and analyzed its own data set separately, the emerging picture would be a bit confusing. We would have 12 different estimates, some concluding that the treatment is effective, and others not able to draw that conclusion. A plot of the 12 model estimates along with the 95% confidence intervals highlights the muddled picture. For additional reference, I’ve added points that represent the true (and unknown) study effects in blue, including a blue line at the value of the overall treatment effect.

lm.ind <- function(z, dx) {
  fit <- lm(y~x, data = dx)
  data.table(z, coef(fit)["x"], confint(fit, "x"))
}

res <- lapply(1:d.obs[, length(unique(study))], function(z) lm.ind(z, d.obs[study == z]))

The meta-analysis

The meta-analysis is an attempt to pool the findings from all the studies to try to draw an overall conclusion. Traditionally, meta-analysis has been conducted using only the summary information from each study - effect size estimates, standard errors, and sample sizes. More recently, researchers have started to use individual-level data to estimate an overall effect. There are advantages to this added level of detail, particularly in enhancing the ability to model patient-level and study-level characteristics that might influence the effect size; these adjustments could help reduce the variance of the effect size estimates.
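
For contrast with the individual-level approach, here is a minimal sketch of the traditional summary-level calculation in base R: inverse-variance pooling plus a DerSimonian-Laird estimate of the between-study variance. The effect estimates and standard errors are hypothetical values made up for illustration, not results from the simulated data.

```r
# Hypothetical study-level summaries (not taken from the simulated data)
est <- c(6.1, -0.9, 0.2, 3.1, 4.0, 1.5)   # estimated treatment effects
se  <- c(1.2,  1.8, 1.5, 0.6, 1.1, 1.3)   # their standard errors

# Fixed-effect (inverse-variance) pooled estimate
w_fe      <- 1 / se^2
pooled_fe <- sum(w_fe * est) / sum(w_fe)

# DerSimonian-Laird estimate of the between-study variance tau^2
k    <- length(est)
Q    <- sum(w_fe * (est - pooled_fe)^2)
tau2 <- max(0, (Q - (k - 1)) / (sum(w_fe) - sum(w_fe^2) / sum(w_fe)))

# Random-effects pooled estimate and its standard error
w_re      <- 1 / (se^2 + tau2)
pooled_re <- sum(w_re * est) / sum(w_re)
se_re     <- sqrt(1 / sum(w_re))

round(c(fixed = pooled_fe, random = pooled_re, se = se_re, tau2 = tau2), 2)
```

Only three numbers per study enter this calculation, which is why it cannot adjust for patient-level characteristics the way a model fit to individual-level data can.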

There are packages in R specifically designed to conduct meta-analysis, but I am doing it “manually” through the use of the nlme package, which estimates a mixed-effects model that mimics the underlying data process. (In a subsequent post, I will do the same thing using a Bayesian model implemented using rstan.) I opted for nlme over the lme4 package, because the former can accommodate the possibility of different within-study variation.

The model fit here includes a study specific (fixed) intercept, an overall treatment effect, and a study-specific treatment effect. And, as I just mentioned, the within-study variation is accommodated:

lmefit <- lme(y ~  factor(study) + x - 1,
               random  = ~ x - 1 | study, 
               weights = varIdent(form = ~ 1 | study),
               data = d.obs, method = 'REML'
              )

The model estimate for the overall treatment effect is 2.5, just under but close to the true value of 3.0:

round(coef(summary(lmefit))["x",], 3)

## Warning in pt(-abs(tVal), fDF): NaNs produced

##     Value Std.Error        DF   t-value   p-value 
##     2.481     0.851   410.000     2.915     0.004

Bootstrapping uncertainty

Every meta-analysis I’ve seen includes a forest plot that shows the individual study estimates along with the global estimate of primary interest. In my version of this plot, I wanted to show the estimated study-level effects from the model ($\delta_0 + v_k$) along with 95% confidence intervals. The model fit does not provide a variance estimate for each study-level treatment effect, so I have estimated the standard error using bootstrap methods. I repeatedly sample from the observed data (sampling stratified by study and treatment arm) and re-estimate the same mixed effects model. For each iteration, I keep the estimated study-specific treatment effect as well as the estimated pooled effect:

bootest <- function() {
  
  bootid <- d.obs[, .(id = sample(id, .N, replace = TRUE)), keyby = .(study, x)][, .(id)]
  dboot <- merge(bootid, d.obs, by = "id")
  
  bootfit <- tryCatch(
              { lme(y ~  factor(study) + x - 1,
                 random  = ~ x - 1 | study, 
                 weights = varIdent(form = ~ 1 | study),
                 data = dboot, method = 'REML')
              }, 
              error = function(e) {
                   return("error")
              }, 
              warning = function(w) {   # handler must be named "warning" to catch warnings
                   return("warning")
              }
  )
  
  # keep only the replicates where the model was fit without an error or warning
  if (inherits(bootfit, "lme")) {
    return(data.table(t(random.effects(bootfit) + fixed.effects(bootfit)["x"]),
                      pooled = fixed.effects(bootfit)["x"]))
  }
  
}

res <- mclapply(1:3000, function(x) bootest(), mc.cores = 4)
res <- rbindlist(res)
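
One common way to turn bootstrap replicates like these into standard errors and 95% intervals is to take column-wise standard deviations and percentiles. The sketch below uses a simulated stand-in matrix (three study columns plus a pooled column, with made-up means and spreads) rather than the actual res object, so the numbers are purely illustrative.

```r
set.seed(2020)
# Stand-in for `res`: 3000 replicates of three study effects and the pooled effect
boot <- cbind(`1` = rnorm(3000, 6.4, 1.1),
              `2` = rnorm(3000, -0.8, 1.6),
              `3` = rnorm(3000, -0.1, 1.4),
              pooled = rnorm(3000, 2.5, 0.5))

boot_se <- apply(boot, 2, sd)                                    # bootstrap standard errors
boot_ci <- t(apply(boot, 2, quantile, probs = c(0.025, 0.975)))  # percentile 95% intervals

round(cbind(se = boot_se, boot_ci), 2)
```

Each row of the resulting table corresponds to one line of a forest plot: a point estimate would be paired with the lower and upper percentile limits.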

The next plot shows the individual study estimates based on the pooled analysis along with the overall estimate in red, allowing us to bring a little clarity to what was an admittedly confusing picture. We might conclude from these findings that the intervention appears to be effective.

As an aside, it is interesting to compare the two forest plot figures in the post, because it is apparent that the point estimates for the individual studies in the second plot are “pulled” closer to the overall average. This is the direct result of the mixed effects model that imposes a structure in the variation of effect sizes across the 12 studies. In contrast, the initial plot shows individual effect sizes that were independently estimated without any such constraint or structure. Pooling across groups or clusters generally has an attenuating effect on estimates.
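
The "pulling" can be illustrated with the textbook normal-normal shrinkage formula, in which each study estimate is a weighted average of its own value and the overall mean. This is a simplified sketch with hypothetical inputs and with $\Delta$ and $\tau$ treated as known; the mixed effects model estimates those quantities jointly, so its results will not match this formula exactly.

```r
# Hypothetical inputs: separate study estimates, their standard errors,
# an overall mean effect (Delta), and a between-study sd (tau)
est   <- c(7.5, -2.0, 3.2, 0.5)
se    <- c(1.5,  2.0, 0.6, 1.2)
Delta <- 2.5
tau   <- 2.7

# Weight on each study's own estimate: more precise studies are shrunk less
w      <- tau^2 / (tau^2 + se^2)
shrunk <- w * est + (1 - w) * Delta

round(cbind(est, se, weight = w, shrunk), 2)
```

Studies with large standard errors get small weights and move furthest toward the overall mean, which is the pattern visible when comparing the two forest plots.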

Consider a permutation test for a small pilot study

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 23 Jun 2020 00:00:00 +0000

Recently I wrote about the challenges of trying to learn too much from a small pilot study, even if it is a randomized controlled trial. There are limitations on how much you can learn about a treatment effect given the small sample size and relatively high variability of the estimate. However, the temptation for researchers is usually just too great; it is only natural to want to see if there is any kind of signal of an intervention effect, even though the pilot study is focused on questions of feasibility and acceptability.

Through my work with the NIA IMPACT Collaboratory, I have been involved with planning a research initiative to test the feasibility of studying a highly innovative strategy in nursing homes to reduce the risk of Covid-19 infections among both residents and staff. Given that the strategy is so novel, there are big questions about whether it can even be implemented, and how the outcome measures can be collected. So, it may be premature to figure out if the approach will reduce infection. But still, it is hard not to try to gain a little insight into the potential effect of the intervention.

One of the lead investigators suggested a permutation test, because we know the sample is going to be small and we might not want to be forced to make parametric assumptions about the outcome. In the context of a pilot study, the permutation test might give a crude indication about the potential impact of an intervention. Would the full-blown follow-up study be conducted if there is no observed effect in the pilot? That is a bigger question. But, the suggestion of some sort of signal might provide additional motivation if feasibility was no longer a question; we would still need to be careful about how we incorporate these findings into planning for the bigger study.

Permutation test explained, briefly

Typically, if we are comparing outcomes for two treatment arms, we calculate a statistic that quantifies the comparison. For example, this could be a difference in group means, a risk ratio, or a log-odds ratio. For whatever statistic we use, there would be an underlying sampling distribution of that statistic under the assumption that there is no difference between the two groups. Typically, the sampling distribution would be estimated analytically using additional assumptions about the underlying distributions of the observed data, such as normal or Poisson. We then use the sampling distribution to calculate a p-value for the observed value of the statistic.

The permutation approach is an alternative way to generate the sampling distribution of the statistic under an assumption of no group differences without making any assumptions about the distributions of the data. If group membership does not influence the outcome, it wouldn’t matter if we re-arranged all the treatment assignments in the data. We could do that and estimate the statistic. In fact, we could do that for all the possible arrangements, and that would give us a distribution of the statistic for that sample under the assumption of no group effect. (If the number of possible arrangements is excessive, we could just take a large sample of those possible arrangements, which is what I do below.) To get a p-value, we compare the observed statistic to this manufactured distribution.

Now to the simulations.

The data generation process

In this proposed study, we are interested in measuring the rate of Covid-19 infection in 8 nursing homes. Given the nature of the spread of disease and the inter-relationship of the infections between residents, the nursing home is the logical unit of analysis. So, we will only have 8 observations - hardly anything to hang your hat on. But, can the permutation test provide any useful information?

The data generation process starts with generating a pool of residents at each site, about 15 per home. The study will run for 4 months (or 120 days), and residents will come and go during that period. We are going to assume that the average residence time is 50 days at each home, but there will be some variability. Based on the number of patients and average length of stay, we can calculate the number of patient-days per site. The number of infected patients $y$ at a site is a function of the intervention and the time of exposure (patient-days). We will be comparing the average rates ($y/patDays$) for the two groups.

In the first simulation, I am assuming no treatment effect, because I want to assess the Type 1 error (the probability of concluding there is an effect given we know there is no effect).

Here is a function to generate the data definitions and a second function to go through the simple data generation process:

library(simstudy)
library(parallel)

defs <- function() {
  
  def <- defDataAdd(varname = "nRes", formula = 15, dist = "poisson")
  def <- defDataAdd(def, varname = "nDays", formula = 50, dist = "poisson")
  def <- defDataAdd(def, varname = "patDays", 
                    formula = "nRes * pmin(120, nDays)",
                    dist = "nonrandom")
  def <- defDataAdd(def, varname = "y", 
                    formula = "-4 - 0.0 * rx + log(patDays)", 
                    variance = 1,
                    dist = "negBinomial", link = "log")
  def <- defDataAdd(def, varname = "rate", 
                    formula = "y/patDays", 
                    dist = "nonrandom")
  
  return(def[])

}

gData <- function(n, def) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addColumns(def, dx)
  
  dx[]
  
}

And here we actually generate a single data set:

RNGkind(kind = "L'Ecuyer-CMRG")
#set.seed(72456)
set.seed(82456)

def <- defs()
dx <- gData(8, def)
dx

##    id rx nRes nDays patDays  y        rate
## 1:  1  1   16    40     640  0 0.000000000
## 2:  2  0    9    59     531  8 0.015065913
## 3:  3  1   20    57    1140 58 0.050877193
## 4:  4  1   14    51     714 14 0.019607843
## 5:  5  1    7    59     413  4 0.009685230
## 6:  6  0   14    38     532  4 0.007518797
## 7:  7  0   10    40     400 19 0.047500000
## 8:  8  0   11    56     616 22 0.035714286

The observed difference in rates is quite close to 0:

dx[, mean(rate), keyby = rx]

##    rx         V1
## 1:  0 0.02644975
## 2:  1 0.02004257

obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]
obs.diff

## [1] -0.006407182

The permutation test

With 8 sites, there are $8!$ possible permutations, or a lot of ways to scramble the treatment assignments.

factorial(8)

## [1] 40320
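
Many of those $8!$ orderings lead to the same split of sites into two arms. With 4 sites per arm there are only choose(8, 4) = 70 distinct assignments, few enough to enumerate exactly. The following base R sketch does that using the values copied from the data set printed above; the variable names are mine, and this is an alternative to the sampling approach, not the method used in the rest of the post.

```r
# Values copied from the data set printed above
y       <- c(0, 8, 58, 14, 4, 4, 19, 22)
patDays <- c(640, 531, 1140, 714, 413, 532, 400, 616)
rx      <- c(1, 0, 1, 1, 1, 0, 0, 0)
rate    <- y / patDays

obs_diff <- mean(rate[rx == 1]) - mean(rate[rx == 0])

# Every way of choosing which 4 of the 8 sites are labeled "treated"
treated <- combn(8, 4)                     # 4 x 70 matrix of site indices
diffs <- apply(treated, 2,
               function(idx) mean(rate[idx]) - mean(rate[-idx]))

n_assign <- length(diffs)                           # number of distinct assignments
p_exact  <- mean(abs(diffs) >= abs(obs_diff))       # exact two-sided p-value
```

Because the observed assignment is itself one of the 70, using >= guarantees that the exact p-value can never be 0.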

I decided to implement this myself in a pretty rudimentary way, though there are R packages out there that can certainly do this better. Since I am comparing averages, I am creating a vector that represents the contrast.

rx <- dx$rx/sum(dx$rx)
rx[rx==0] <- -1/(length(dx$rx) - sum(dx$rx))

rx

## [1]  0.25 -0.25  0.25  0.25  0.25 -0.25 -0.25 -0.25

I’m taking a random sample of 5000 permutations of the contrast vector, storing the results in a matrix:

perm <- t(sapply(1:5000, function(x) sample(rx, 8, replace = FALSE)))
head(perm)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] -0.25 -0.25  0.25 -0.25  0.25  0.25 -0.25  0.25
## [2,] -0.25 -0.25  0.25 -0.25  0.25 -0.25  0.25  0.25
## [3,]  0.25 -0.25 -0.25  0.25 -0.25 -0.25  0.25  0.25
## [4,]  0.25 -0.25  0.25  0.25 -0.25  0.25 -0.25 -0.25
## [5,] -0.25 -0.25 -0.25  0.25  0.25  0.25  0.25 -0.25
## [6,]  0.25 -0.25  0.25 -0.25  0.25 -0.25  0.25 -0.25

Using a simple operation of matrix multiplication, I’m calculating a rate difference for each of the sampled permutations:

perm.diffs <- perm %*% dx$rate
head(perm.diffs)

##              [,1]
## [1,]  0.005405437
## [2,]  0.025396039
## [3,]  0.004918749
## [4,] -0.007490399
## [5,] -0.004336380
## [6,]  0.007538896

Here is an estimate of the 2-sided p-value:

mean(abs(perm.diffs) > abs(obs.diff))

## [1] 0.7166

And finally, here is a histogram of the permuted rate differences, with the observed rate difference overlaid as a red line. The observed value lies pretty much right in the middle of the distribution, which is what the p-value suggests:

library(ggplot2)

ggplot(data = data.frame(perm.diffs), aes(x = perm.diffs)) +
  geom_histogram(binwidth = .005, color = "white") +
  geom_vline(xintercept = obs.diff, color = "red") +
  scale_y_continuous(expand = c(0,0), breaks = seq(2500, 10000, 2500)) +
  theme(panel.grid = element_blank())

Operating characteristics

As in a power analysis by simulation, we can estimate the Type 1 error rate by generating many data sets, and for each one calculate a p-value using the permutation test. The proportion of p-values less than 0.05 would represent the Type 1 error rate, which should be close to 0.05.

iter <- function(n) {
  
  dx<- gData(n, defs())
  obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]

  rx <- dx$rx/sum(dx$rx)
  rx[rx==0] <- -1/(n - sum(dx$rx))
  
  perm <- t(sapply(1:20000, function(x) sample(rx, n, replace = FALSE)))
  
  perm.diffs <- perm %*% dx$rate
  mean(abs(perm.diffs) > abs(obs.diff))
}

Here we use 5000 data sets to estimate the Type 1 error rate under the data generating process we’ve been using all along, and for each of those data sets we use 20000 permutations (as set in iter) to estimate the p-value.

res <- unlist(mclapply(1:5000, function(x) iter(8), mc.cores = 4))
mean(res < .05)

## [1] 0.0542
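
To judge whether 0.0542 is "close to 0.05", it helps to attach a Monte Carlo standard error to the estimate. This is a quick base R sketch treating the 5000 simulated data sets as independent Bernoulli trials; the variable names are mine.

```r
p_hat  <- 0.0542   # estimated Type 1 error rate reported above
n_sims <- 5000     # number of simulated data sets

mc_se <- sqrt(p_hat * (1 - p_hat) / n_sims)          # binomial standard error
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * mc_se     # approximate 95% interval

round(c(se = mc_se, lower = ci[1], upper = ci[2]), 4)
```

The standard error is about 0.003, and the approximate interval contains 0.05, so the estimate is consistent with the nominal error rate.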

The risks of using a model (with assumptions)

If we go ahead and try to find a signal using a parametric model, there’s a chance we’ll be led astray. These data are count data, so it would not be strange to consider a Poisson regression model to estimate the treatment effect (in this case, the effect would be a rate ratio rather than a rate difference). Given that the data are quite limited, we may not really be in a position to verify whether the Poisson distribution is appropriate; as a result, it might be hard to actually select the right model. (In reality, I know that this model will lead us astray, because we used a negative binomial distribution, a distribution with more variance than the Poisson, to generate the count data.)

Just as before, we generate 5000 data sets. For each one we fit a generalized linear model with a Poisson distribution and a log link, and store the effect estimate along with the p-value.

chkglm <- function(n) {
  
  dx <- gData(n, defs())
  glmfit <- glm( y ~ rx + offset(log(patDays)), family = poisson, data = dx)
  
  data.table(t(coef(summary(glmfit))["rx",]))
  
}

glm.res <- rbindlist(mclapply(1:5000, function(x) chkglm(8)))

The estimated Type 1 error is far greater than 0.05; there is a pretty good chance that we would be over-enthusiastic about the potential success of our new nursing home strategy even when it is not actually effective.

glm.res[, .(mean(`Pr(>|z|)` < 0.05))]

##        V1
## 1: 0.6152
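
The size of the problem follows from how much extra variance the negative binomial carries. The sketch below assumes the mean/dispersion form $variance = \mu + d \mu^2$ (which I take to be the parameterization behind the data definition) and a hypothetical site with 600 patient-days.

```r
mu <- exp(-4) * 600          # expected count for a hypothetical site with 600 patient-days
d  <- 1                      # dispersion used in the data definition

var_pois <- mu               # Poisson: variance equals the mean
var_nb   <- mu + d * mu^2    # negative binomial (assumed form): grows with mu^2

set.seed(321)
y_nb <- rnbinom(1e5, size = 1 / d, mu = mu)   # base R parameterization of the same model
c(mean = mean(y_nb), var = var(y_nb), ratio = var(y_nb) / mean(y_nb))
```

With a mean of about 11, the counts have a variance roughly twelve times larger than a Poisson model assumes, so Poisson standard errors are far too small and the test rejects much too often.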

When there is a treatment effect

In the case where there is actually a treatment effect, the observed effect size is more likely to fall closer to one of the distribution’s tails, depending on the direction of the effect. If the treatment reduces the number of infections, we would expect the rate difference to be $< 0$, as it is in this particular case:

def <- updateDefAdd(def, changevar = "y", 
          newformula = "-4 - 1.2 * rx + log(patDays)" )

At the end of the day, if you feel like you must estimate the treatment effect in a pilot study before moving on to the larger trial, one option is to use a non-parametric approach like a permutation test that requires fewer assumptions to lead you astray.

In the end, though, we opted for a different model. If we do get the go ahead to conduct this study, we will fit a Bayesian model instead. We hope this will be flexible enough to accommodate a range of assumptions and give us a potentially more informative posterior probability of a treatment effect. If we actually get the opportunity to do this, I’ll consider describing that model here.

When proportional odds is a poor assumption, collapsing categories is probably not going to save you

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 09 Jun 2020 00:00:00 +0000

Continuing the discussion on cumulative odds models I started last time, I want to investigate a solution I always assumed would help mitigate a failure to meet the proportional odds assumption. I’ve believed that if there is a large number of categories and the relative cumulative odds between two groups don’t appear proportional across all categorical levels, then a reasonable approach is to reduce the number of categories. In other words, I assumed that fewer categories would translate to proportional odds. I’m not sure what led me to this conclusion, but in this post I’ve created some simulations that seem to throw cold water on that idea.

When the odds are proportional

I think it is illustrative to go through a base case where the odds are actually proportional. This will allow me to introduce the data generation and visualization that I’m using to explore this issue. I am showing a lot of code here, because I think it is useful to see how it is possible to visualize cumulative odds data and the model estimates.

The first function genDT generates a data set with two treatment arms and an ordinal outcome. genOrdCat uses a base set of probabilities for the control arm, and the experimental arm probabilities are generated under an assumption of proportional cumulative odds (see the previous post for more details on what cumulative odds are and what the model is).

library(simstudy)
library(data.table)

genDT <- function(nobs, baseprobs, defA) {
  
  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "rx")
  dT <- addColumns(defA, dT)
  
  dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
  dT[]
}

In this case, I’ve set the base probabilities for an ordinal outcome of 8 categories. The log of the cumulative odds ratio comparing experimental arm to control is 1.0 (parameterized as -1.0 in the definition below), so the proportional odds ratio should be about $e^{1} \approx 2.7$.

baseprobs <- c(0.05, 0.10, 0.15, 0.25, .20, 0.15, .05, .05)
defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")

set.seed(29672) # 19779
dT.prop <- genDT(200, baseprobs, defA)
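
To make the proportional odds assumption concrete, the experimental-arm probabilities implied by the base probabilities can be worked out by hand in base R. This sketch assumes the sign convention in which the experimental arm is shifted toward the lower categories (so its cumulative odds are higher), and the variable names are mine.

```r
baseprobs <- c(0.05, 0.10, 0.15, 0.25, 0.20, 0.15, 0.05, 0.05)
log_or    <- 1.0

cum0 <- cumsum(baseprobs)[1:7]            # control arm: P(r <= j), j = 1, ..., 7
cum1 <- plogis(qlogis(cum0) + log_or)     # shift every cumulative log-odds by the same amount

probs1 <- diff(c(0, cum1, 1))             # implied category probabilities, experimental arm
cum_or <- (cum1 / (1 - cum1)) / (cum0 / (1 - cum0))

round(probs1, 3)
round(cum_or, 2)
```

Every one of the seven cumulative odds ratios equals $e^1$, which is exactly what "proportional" means; the observed ratios in a sample of 200 will scatter around that value.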

Calculation of the observed cumulative odds ratio at each response level doesn’t provide an entirely clear picture about proportionality, but the sample size is relatively small given the number of categories.

codds <- function(cat, dx) {
  dcodds <- dx[, .(codds = mean(as.numeric(r) <= cat)/mean(as.numeric(r) > cat)), 
               keyby = rx]
  round(dcodds[rx == 1, codds] / dcodds[rx==0, codds], 2)
}

sapply(1:7, function(x) codds(x, dT.prop))

## [1] 1.48 3.81 3.12 1.83 2.05 3.59 2.02

A visual assessment

An excellent way to assess proportionality is to do a visual comparison of the observed cumulative probabilities with the estimated cumulative probabilities from the cumulative odds model that makes the assumption of proportional odds.

I’ve written three functions that help facilitate this comparison. getCumProbs converts the parameter estimates of cumulative odds from the model to estimates of cumulative probabilities.

getCumProbs <- function(coefs) {
  
  cumprob0 <- data.table(
    cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")])), 1),
    r = factor(1 : nrow(coefs)),
    rx = 0
  )
  
  cumprob1 <- data.table(
    cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")] + 
                             coefs["rx", 1])), 1),
    r = factor(1 : nrow(coefs)),
    rx = 1
  )
  
  rbind(cumprob0, cumprob1)[]
 
}

The function bootCumProb provides a single bootstrap from the data so that we can visualize the uncertainty of the estimated cumulative probabilities. In this procedure, a random sample is drawn (with replacement) from the data set, a clm model is fit, and the cumulative odds are converted to cumulative probabilities.

library(ordinal)

bootCumProb <- function(bootid, dx) {
  
  sampid <- dx[, .(srow = sample(.I, replace = TRUE)), keyby = rx][, srow]
  dtBoot <- dx[sampid,]
  
  bootFit <- clm(r ~ rx, data = dtBoot)
  bcoefs <- coef(summary(bootFit))
  
  bcumProbs <- getCumProbs(bcoefs)
  bcumProbs[, bootid := bootid]
  
  bcumProbs[]
}

The third function fitPlot fits a clm model to the original data set, collects the bootstrapped estimates, calculates the observed cumulative probabilities, converts the estimated odds to estimated probabilities, and generates a plot of the observed data, the model fit, and the bootstrap estimates.

library(ggplot2)
library(paletteer)

fitPlot <- function(dx) {
  
  clmFit <- clm(r ~ rx, data = dx)
  coefs <- coef(summary(clmFit))
  
  bootProbs <- rbindlist(lapply(1:500, function(x) bootCumProb(x, dx)))
  
  cumObsProbs <- dx[, .N, keyby = .(rx, r)]
  cumObsProbs[, cumprob := cumsum(N)/sum(N) , keyby = rx]
  
  cumModProbs <- getCumProbs(coefs)
  
  ggplot(data = cumObsProbs, aes(x = r, y = cumprob , color = factor(rx))) +
    geom_line(data = cumModProbs, alpha = 1, aes(group=rx)) +
    geom_line(data = bootProbs, alpha = .01, 
              aes(group = interaction(rx, bootid))) + 
    geom_point() +
    ylab("cumulative probability") +
    xlab("ordinal category") +
    theme(panel.grid = element_blank(),
          legend.position = "none") +
    scale_color_paletteer_d("jcolors::pal6")
}

Here is a plot based on the original data set of 200 observations. The observed values are quite close to the modeled estimates, and well within the range of the bootstrap estimates.

fitPlot(dT.prop)

Collapsing the categories

Continuing with the same data set, let’s see what happens when we collapse categories together. I’ve written a function collapseCat that takes a list of vectors of categories that are to be combined and returns a new, modified data set.

collapseCat <- function(dold, collapse) {
  
  dx <- copy(dold)
  for (i in 1:length(collapse)) {
    dx[r %in% collapse[[i]], r:= min(collapse[[i]])]
  }
  
  dx[, r := factor(r)]
  dx[, r := factor(r, labels = c(1:length(levels(r))))]
  
  dx[]
}

Here is the distribution of the original data set:

dT.prop[, table(rx, r)]

##    r
## rx   1  2  3  4  5  6  7  8
##   0  7  4 14 31 19 15  8  2
##   1 10 22 19 19 16 11  2  1

And if we combine categories 1, 2, and 3 together, as well as 7 and 8, the resulting distribution is based on the remaining five categories. A quick check confirms that the categories were properly combined:

collapse <- list(c(1,2,3), c(7,8))
collapseCat(dT.prop, collapse)[, table(rx, r)]

##    r
## rx   1  2  3  4  5
##   0 25 31 19 15 10
##   1 51 19 16 11  3

If we create four modified data sets based on different combinations of groups, we can fit models and plot the cumulative probabilities for all four of them. In all cases the proportional odds assumption still seems pretty reasonable.

collapse <- list(list(c(3, 4), c(6, 7)),
                 list(c(1,2,3), c(7,8)),
                 list(c(1,2,3), c(4, 5), c(7,8)),
                 list(c(1,2), c(3, 4, 5),  c(6, 7, 8))
)

dC.prop <- lapply(collapse, function(x) collapseCat(dT.prop, x))

cplots <- lapply(dC.prop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)
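
There is a simple reason the assumption survives collapsing when it holds to begin with: merging adjacent categories only removes cut-points, and the cumulative odds at the cut-points that remain are unchanged. Here is a base R sketch using the true probabilities rather than sampled data (variable names are mine; the experimental arm is built with a log cumulative odds ratio of 1, assuming a shift toward the lower categories):

```r
baseprobs <- c(0.05, 0.10, 0.15, 0.25, 0.20, 0.15, 0.05, 0.05)
cum0   <- cumsum(baseprobs)[1:7]
cum1   <- plogis(qlogis(cum0) + 1)        # proportional odds, log OR = 1
probs1 <- diff(c(0, cum1, 1))

# Collapse {1,2,3} and {7,8}: map each original category to its new label
new_cat <- c(1, 1, 1, 2, 3, 4, 5, 5)
coll0 <- as.numeric(tapply(baseprobs, new_cat, sum))
coll1 <- as.numeric(tapply(probs1, new_cat, sum))

# Cumulative odds at every cut-point of a probability vector
cum_odds <- function(p) { cp <- cumsum(p)[-length(p)]; cp / (1 - cp) }
or_collapsed <- cum_odds(coll1) / cum_odds(coll0)

round(or_collapsed, 2)
```

The four remaining cumulative odds ratios are all still $e^1$, so a proportional odds model fits the collapsed outcome just as well as the original one.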

Non-proportional cumulative odds

That was all just a set-up to explore what happens in the case of non-proportional odds. To do that, there’s just one more function to add - we need to generate data that does not assume proportional cumulative odds. I use the rdirichlet function in the gtools package to generate a set of values between 0 and 1 that sum to 1 for each arm. The key here is that there is no pattern in the data - so that the ratio of the cumulative odds will not be constant.

genDTnon <- function(nobs, ncat) {
  
  ps <- gtools::rdirichlet(2, rep(2, ncat))
  
  p0 <- paste(ps[1, -ncat], collapse = ";")
  p1 <- paste(ps[2, -ncat], collapse = ";")
  
  defc <- defCondition(condition = "rx == 0", formula = p0, 
    dist = "categorical")
  defc <- defCondition(defc, condition = "rx == 1", formula = p1, 
    dist = "categorical")
  
  dx <- genData(nobs)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addCondition(defc, dx, "r")
  
  dx[, r := factor(r)]

  dx[]
}
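
To see why two unrelated probability vectors break proportionality, here is a minimal base R sketch. It does not use gtools; instead it assumes the standard construction of Dirichlet draws as normalized independent gamma draws. The helper names rdirichlet_base and cumOdds are hypothetical and only for illustration.

```r
# Base R stand-in for gtools::rdirichlet: normalize independent gamma draws.
# (rdirichlet_base is a hypothetical helper, not part of any package.)
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n)
  g / rowSums(g)
}

# Cumulative odds at each of the K - 1 cut-points of a probability vector
cumOdds <- function(p) {
  cp <- cumsum(p)[-length(p)]
  cp / (1 - cp)
}

set.seed(2020)
ps <- rdirichlet_base(2, rep(2, 8))

# Ratio of cumulative odds (second row relative to first) at each cut-point
cor_by_cut <- cumOdds(ps[2, ]) / cumOdds(ps[1, ])
round(cor_by_cut, 2)
```

Under proportional odds the seven ratios would all be equal; with independently drawn probability vectors they generally are not.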

Again, we generate a data set with 200 observations and an ordinal categorical outcome with 8 levels. The plot of the observed and estimated cumulative probabilities suggests that the proportional odds assumption is not a good one here. Some of the observed probabilities are quite far from the fitted lines, particularly at the low end of the ordinal scale. It may not be a disaster to use a clm model here, but it is probably not a great idea.

dT.nonprop <- genDTnon(200, 8)
fitPlot(dT.nonprop)

The question remains - if we reduce the number of categories does the assumption of proportional odds come into focus? The four scenarios shown here do not suggest much improvement. The observed data still fall outside or at the edge of the bootstrap bands for some levels in each case.

dC.nonprop <- lapply(collapse, function(x) collapseCat(dT.nonprop, x))

cplots <- lapply(dC.nonprop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)

What should we do in this case? That is a tough question. The proportional odds model for the original data set with eight categories is probably just as reasonable as estimating a model using any of the combined data sets; there is no reason to think that any one of the alternatives with fewer categories will be an improvement. And, as we learned last time, we may actually lose power by collapsing some of the categories. So, it is probably best to analyze the data set using its original structure, and find the best model for that data set. Ultimately, that best model may need to relax the proportionality assumption; a post on that will need to be written another time.

Considering the number of categories in an ordinal outcome

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 26 May 2020 00:00:00 +0000

In two Covid-19-related trials I’m involved with, the primary or key secondary outcome is the status of a patient at 14 days based on a World Health Organization ordered rating scale. In this particular ordinal scale, there are 11 categories ranging from 0 (uninfected) to 10 (death). In between, a patient can be infected but well enough to remain at home, hospitalized with milder symptoms, or hospitalized with severe disease. If the patient is hospitalized with severe disease, there are different stages of oxygen support the patient can be receiving, such as high flow oxygen or mechanical ventilation.

It is common to analyze ordinal categorical outcomes like the WHO status measure using a cumulative proportional odds model. (I’ve described these models in a number of posts, starting here and here.) We’ve been wrestling with the question of whether to use the full 11-point scale or to collapse categories to create a simpler outcome of four or five groups. One issue that comes up is whether this reduction would increase or decrease our ability to detect a treatment effect, assuming of course that there is a treatment effect. To explore the issue, I turned to simulation.

A very quick recap of the model

In the cumulative proportional odds model, we are comparing a series of cumulative odds across two groups, and we make an assumption that the ratio of these cumulative odds for the two groups is consistent throughout, the proportional odds assumption.

The cumulative odds for the control group that the status is $x$ or lower is

\[ \text{cOdds}_{c}(x) = \frac{P(Status \le x | rx = Control)}{P(Status \gt x |rx = Control)} \]

And the cumulative odds ratio comparing Control to Treated is

\[ \text{COR}_{ct}(x) = \frac{\text{cOdds}_c(x)}{\text{cOdds}_t(x)} \]

In the proportional odds model, with a measure that has $K$ levels we make the assumption that

\[ \text{COR}_{ct}(1) = \text{COR}_{ct}(2) = \dots =\text{COR}_{ct}(K-1) \]

The model that we estimate is

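\[ \text{logit}(P(Status \le x)) = \alpha_x - \beta * I(rx = Treat) \] where $\alpha_x$ is the log cumulative odds for a particular level $x$ in the control group, and $\beta = \log \text{COR}_{ct}(x)$, the (proportional) log cumulative odds ratio, which is the same at every level $x$. Equivalently, $-\beta$ is the log cumulative odds ratio comparing Treated to Control.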

Conceptualizing the categories

I am comparing estimates of models for outcome scales that use a range of categories, from 2 to 16. (I expanded beyond 11 to get a better sense of the results when the gradations become quite fine.) The figure shows the 16-group structure collapsing into 2 groups. The first row depicts the distribution of the control group across 16 categories. The second row combines the 2 rightmost purple categories of the first row into a single category, resulting in 15 total categories. Moving downwards, a pair of adjacent categories is combined at each step, until only 2 categories remain at the bottom.

And here are the actual probabilities for the bottom seven rows, from 8 categories down to 2:

baseprobs[7:1]

## [[1]]
## [1] 0.075 0.075 0.075 0.075 0.175 0.175 0.175 0.175
## 
## [[2]]
## [1] 0.075 0.075 0.150 0.175 0.175 0.175 0.175
## 
## [[3]]
## [1] 0.075 0.075 0.150 0.350 0.175 0.175
## 
## [[4]]
## [1] 0.150 0.150 0.350 0.175 0.175
## 
## [[5]]
## [1] 0.15 0.15 0.35 0.35
## 
## [[6]]
## [1] 0.30 0.35 0.35
## 
## [[7]]
## [1] 0.3 0.7
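
The step-by-step collapsing shown in this output can be sketched in base R. This is not the code used to build baseprobs; it is an illustration that starts from the printed 8-category vector and merges one adjacent pair at a time. The helper mergeAdjacent is hypothetical, and the merge positions in mergeAt are read off the printed output.

```r
# Hypothetical helper: merge category i with category i + 1
mergeAdjacent <- function(p, i) {
  c(p[seq_len(i - 1)], p[i] + p[i + 1], p[-seq_len(i + 1)])
}

# The 8-category probabilities printed above
p8 <- c(0.075, 0.075, 0.075, 0.075, 0.175, 0.175, 0.175, 0.175)

# Position of the left member of each merged pair, from 8 categories down to 2
mergeAt <- c(3, 4, 1, 4, 1, 2)

# List of probability vectors: 8 categories first, 2 categories last
probs <- Reduce(mergeAdjacent, mergeAt, init = p8, accumulate = TRUE)
probs
```

Each merge preserves the total probability, so every vector in the list still sums to 1.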

Generating the data

To simulate the data, I use the function genOrdCat in simstudy, which takes the base probabilities and an adjustment variable on the log-odds scale, which in this case is $z$. (I introduced this function a while back.) In this case the log odds ratio $(-\beta)$ is 1, which translates to a cumulative odds ratio of $\exp(1) = 2.72$.

library(simstudy)

defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")

genDT <- function(nobs, baseprobs, defA) {
  
  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "rx")
  dT <- addColumns(defA, dT)
  
  dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
  dT[]
}
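
To get a feel for what this adjustment does to the category probabilities, here is a small base R sketch. It assumes the mechanism is a shift of each cumulative logit by $-z$, which is my reading of the model above rather than a copy of the simstudy internals; the object names are for illustration only.

```r
# 6-category control probabilities, as printed for baseprobs[[5]]
p0 <- c(0.075, 0.075, 0.150, 0.350, 0.175, 0.175)
z <- -1   # value of the adjustment variable in the treatment arm

# Assumed mechanism: shift each cumulative logit by -z
cp0 <- cumsum(p0)[-length(p0)]
cp1 <- plogis(qlogis(cp0) - z)

# Implied category probabilities in the treatment arm
p1 <- diff(c(0, cp1, 1))
round(p1, 3)

# Cumulative odds ratio (treated relative to control) at each cut-point
cOdds <- function(cp) cp / (1 - cp)
cOdds(cp1) / cOdds(cp0)
```

Under this assumption the ratio equals exp(1) at every cut-point, which is exactly the proportional odds property.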

A single data set of 5000 observations with 6 categories looks like this:

set.seed(7891237)
(dx <- genDT(5000, baseprobs[[5]], defA ))

##         id rx  z r
##    1:    1  0  0 1
##    2:    2  1 -1 3
##    3:    3  0  0 5
##    4:    4  0  0 4
##    5:    5  1 -1 1
##   ---             
## 4996: 4996  0  0 3
## 4997: 4997  1 -1 4
## 4998: 4998  0  0 3
## 4999: 4999  1 -1 3
## 5000: 5000  1 -1 4

Here are the distributions by treatment arm:

prop.table(dx[, table(rx, r)], margin = 1)

##    r
## rx       1      2      3      4      5      6
##   0 0.0644 0.0772 0.1524 0.3544 0.1772 0.1744
##   1 0.1792 0.1408 0.2204 0.2880 0.1012 0.0704

Here are the cumulative odds and the odds ratio for a response being 2 or less:

(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 2)/mean(as.numeric(r) > 2)), keyby = rx])

##    rx codds
## 1:  0 0.165
## 2:  1 0.471

dcodds[rx == 1, codds] / dcodds[rx==0, codds]

## [1] 2.85

And here are the cumulative odds and COR for a response being 4 or less.

(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 4)/mean(as.numeric(r) > 4)), keyby = rx])

##    rx codds
## 1:  0  1.84
## 2:  1  4.83

dcodds[rx == 1, codds] / dcodds[rx==0, codds]

## [1] 2.62

The CORs are both close to the true COR of 2.72.
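
The same check can be run at every cut-point at once. This base R sketch uses the observed proportions printed in the table above; cumOdds is a hypothetical helper.

```r
# Observed proportions by arm, copied from the table above
obs <- rbind(
  rx0 = c(0.0644, 0.0772, 0.1524, 0.3544, 0.1772, 0.1744),
  rx1 = c(0.1792, 0.1408, 0.2204, 0.2880, 0.1012, 0.0704)
)

# Cumulative odds at each of the K - 1 cut-points
cumOdds <- function(p) {
  cp <- cumsum(p)[-length(p)]
  cp / (1 - cp)
}

# Cumulative odds ratios, rx = 1 relative to rx = 0
cors <- cumOdds(obs["rx1", ]) / cumOdds(obs["rx0", ])
round(cors, 2)
```

The second and fourth elements reproduce the 2.85 and 2.62 shown above.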

Running the experiment

I was particularly interested in understanding the impact of increasing the number of categories $K$ on the probability of observing a treatment effect (i.e. the power). This required generating many (in this case 10,000) data sets under each scenario defined by the number of categories ranging from 2 to 16, and then estimating a cumulative odds model for each data set. I used the clm function in the ordinal package.

Two functions implement this iteration. analyzeDT generates a data set and returns a model fit. iterate repeatedly calls analyzeDT and estimates power for a particular scenario by calculating the proportion of p-values that are less than 0.05:

library(ordinal)

analyzeDT <- function(nobs, baseprobs, defA) {

  dT <- genDT(nobs, baseprobs, defA)
  clmFit <- clm(r ~ rx, data = dT)
  coef(summary(clmFit))
  
}

iterate <- function(niter, nobs, baseprobs, defA) {
  res <- lapply(1:niter, function(x) analyzeDT(nobs, baseprobs, defA))
  mean( sapply(res, function(x) x["rx",  "Pr(>|z|)"]) < 0.05) 
}

lapply is used here to cycle through each scenario (for enhanced speed mclapply in the parallel package could be used):

set.seed(1295)
power <- lapply(baseprobs, function(x) iterate(niter = 10000, 
            nobs = 100, x, defA))
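
Since each power estimate is a proportion based on a finite number of simulated data sets, it has its own Monte Carlo error. A rough sketch of that error, using the usual binomial standard error (mcse is a hypothetical helper name):

```r
# Monte Carlo standard error of an estimated power (a simple proportion)
mcse <- function(power, niter) sqrt(power * (1 - power) / niter)

# Error at a few plausible power levels with 10,000 iterations
mcse(c(0.5, 0.8, 0.9), niter = 10000)
```

With 10,000 iterations the standard error is at most 0.005, so differences in estimated power of a few percentage points between scenarios are unlikely to be simulation noise.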

Effect of K on power

A plot of the estimates suggests a strong relationship between the number of categories and power:

In this particular case, it seems apparent there are benefits to increasing from 2 to 6 categories. However, there are only slight gains to be had by increasing the number of categories much beyond this; indeed, extending to the full 16 categories may not be worth the trouble, as the gains in power are minimal.

These minimal gains need to be weighed against the potential difficulty in acquiring the finely grained categorical outcomes. In cases where the defined categories are completely objective and are naturally collected as part of an operating environment - as in the WHO scale that might be gathered from an electronic health record - there is no real added burden to maximizing the number of categories. However, if the outcome scores are based on patient responses to a survey, the quality of data collection may suffer. Adding additional categories may confuse the patient and make the data collection process more burdensome, resulting in unreliable responses or even worse, missing data. In this case, the potential gains in power may be offset by poor data quality.

To stratify or not? It might not actually matter...

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 12 May 2020 00:00:00 +0000

Continuing with the theme of exploring small issues that come up in trial design, I recently used simulation to assess the impact of stratifying (or not) in the context of a multi-site Covid-19 trial with a binary outcome. The investigators are concerned that baseline health status will affect the probability of an outcome event, and are interested in randomizing by health status. The goal is to ensure balance across the two treatment arms with respect to this important variable. This randomization would be paired with an estimation model that adjusts for health status.

An alternative strategy is to ignore health status in the randomization, but to pre-specify an outcome model that explicitly adjusts for health status, just as in the stratification scenario. The question is, how do the operating characteristics (e.g. power, variance, and bias) of each approach compare? Are the (albeit minimal) logistics necessary for stratification worth the effort?

Simulation

Simulations under a variety of scenarios suggest that stratification might not be necessary. (See this paper for a much deeper, richer discussion of these issues.)

Define the data

In these simulations, I assume that there are a small number of clusters. The proportion of high risk cases in each cluster varies (specified by p), as do the event rates (specified by a). The simulations vary the log odds of an outcome (baseLO), effect sizes/log-odds ratio (effLOR), and the effect of poor health status (xLOR):

library(simstudy)
library(parallel)

setDefs <- function(pX, precX, varRE, baseLO, effLOR, xLOR) {
  
  defc <- defData(varname = "p", formula = pX, variance = precX, 
                  dist = "beta", id = "site")
  defc <- defData(defc, varname = "a", formula = 0, variance = varRE)
  
  form <- genFormula(c(baseLO, effLOR, xLOR, 1), vars = c("rx", "x", "a"))
  
  defi1 <- defDataAdd(varname = "x", formula = "p", dist = "binary")
  defi2 <- defDataAdd(varname = "y", formula = form, dist = "binary", link = "logit")
  
  return(list(defc = defc, defi1 = defi1, defi2 = defi2))
  
}
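
To see how much the site-level proportion p can vary, here is a base R sketch of the beta distribution implied by a mean and a precision. It assumes the mean/precision parameterization described in the simstudy documentation, where the two shape parameters are the mean times the precision and one minus the mean times the precision; the values are the ones used in the scenarios specified later in this post.

```r
pX <- 0.35     # mean proportion of high-risk patients
precX <- 30    # precision parameter

# Assumed mapping from mean/precision to the usual beta shape parameters
shape1 <- pX * precX
shape2 <- (1 - pX) * precX

# Implied standard deviation of the site-level proportions
sdP <- sqrt(pX * (1 - pX) / (1 + precX))
round(sdP, 3)

# Eight site-level proportions drawn directly with rbeta
set.seed(1)
p <- rbeta(8, shape1, shape2)
round(p, 2)
```

Under this assumption the standard deviation is a little under 0.09, so site-level proportions mostly fall within roughly 0.2 to 0.5.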

Generate the data and estimates

Under each scenario, the data definitions are established by a call to setDefs and treatment is randomized, stratified by site only, or by site and health status x. (There is a slight bug in the trtAssign function that will generate an error if there is only a single observation in a site and particular strata - which explains my use of the try function to prevent the simulations from grinding to a halt. This should be fixed soon.)

For each generated data set under each scenario, we estimate a generalized linear model:

\[ \text{logit}(P(y_{ij} = 1)) = \beta_0 + \gamma_j + \beta_1r_i + \beta_2x_i \ , \] where $y_{ij}$ is the outcome for patient $i$ at site $j$, $r_i$ is the treatment indicator, and $x_i$ is the health status indicator. $\gamma_j$ is a fixed site-level effect. The function returns the parameter estimate for the log-odds ratio (the treatment effect $\beta_1$), as well as its standard error estimate and p-value.

genEsts <- function(strata, nclust, clustsize, pX, precX, 
                    varRE, baseLO, effLOR, xLOR) {
  
  defs <- setDefs(pX, precX, varRE, baseLO, effLOR, xLOR)
  
  dc <- genData(nclust, defs$defc)
  
  dx <- genCluster(dc, "site", clustsize , "id")
  dx <- addColumns(defs$defi1, dx)
  
  dx <- try(trtAssign(dx, strata = strata, grpName = "rx"), silent = TRUE)
  
  if ( (class(dx)[1]) == "try-error") {
    return(NULL)
  }
  
  dx <- addColumns(defs$defi2, dx)
  
  glmfit <- glm(y~factor(site) + rx + x, data = dx, family = "binomial")
  
  estrx <- t(coef(summary(glmfit))["rx", ])
  
  return(data.table(estrx))
}

Iterate through multiple scenarios

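We will “iterate” through different scenarios using the mclapply function in the parallel package. For each scenario, we generate 2500 data sets and parameter estimates. For each of these scenarios, we calculate the average estimated treatment effect, the observed standard error (the standard deviation of the estimates), the average estimated standard error, and the proportion of p-values below 0.05: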

forFunction <- function(strata, nclust, clustsize, pX, precX, 
                         varRE, baseLO, effLOR, xLOR) {
  
  res <- rbindlist(mclapply(1:2500, function(x) 
    genEsts(strata, nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))
  
  data.table(strata = length(strata), nclust, clustsize, pX, precX, 
             varRE, baseLO, effLOR, xLOR, 
             est = res[, mean(Estimate)],
             se.obs = res[, sd(Estimate)],
             se.est = res[, mean(`Std. Error`)],
             pval = res[, mean(`Pr(>|z|)` < 0.05)]
             )
}

Specify the scenarios

We specify all the scenarios by creating a data table of parameters. Each row of this table represents a specific scenario, for which 2500 data sets will be generated and parameters estimated. For the simulations reported here, I varied the strata for randomization, the cluster size, the baseline event rate, and the effect size, for a total of 336 scenarios ($2 \times 6 \times 4 \times 7$).

strata <- list("site", c("site", "x"))
nclust <- 8
clustsize <- c(30, 40, 50, 60, 70, 80)
pX <- 0.35
precX <- 30
varRE <- .5
baseLO <- c(-1.5, -1.25, -1.0, -0.5)
effLOR <- seq(0.5, 0.8, by = .05)
xLOR <- c(.75)

dparam <- data.table(expand.grid(strata, nclust, clustsize, pX, precX, 
              varRE, baseLO, effLOR, xLOR))

setnames(dparam, c("strata","nclust", "clustsize", "pX", "precX", 
                   "varRE", "baseLO", "effLOR", "xLOR"))

dparam

##      strata nclust clustsize   pX precX varRE baseLO effLOR xLOR
##   1:   site      8        30 0.35    30   0.5   -1.5    0.5 0.75
##   2: site,x      8        30 0.35    30   0.5   -1.5    0.5 0.75
##   3:   site      8        40 0.35    30   0.5   -1.5    0.5 0.75
##   4: site,x      8        40 0.35    30   0.5   -1.5    0.5 0.75
##   5:   site      8        50 0.35    30   0.5   -1.5    0.5 0.75
##  ---                                                            
## 332: site,x      8        60 0.35    30   0.5   -0.5    0.8 0.75
## 333:   site      8        70 0.35    30   0.5   -0.5    0.8 0.75
## 334: site,x      8        70 0.35    30   0.5   -0.5    0.8 0.75
## 335:   site      8        80 0.35    30   0.5   -0.5    0.8 0.75
## 336: site,x      8        80 0.35    30   0.5   -0.5    0.8 0.75

Run the simulation

Everything is now set up. We go through each row of the scenario table dparam to generate the summaries for each scenario by repeated calls to forFunction, again using mclapply.

resStrata <- mclapply(1:nrow(dparam), function(x) with(dparam[x,],  
    forFunction(strata[[1]], nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))

resStrata <- rbindlist(resStrata)
resStrata[, .(strata,  baseLO, effLOR, xLOR,  est, se.obs, se.est, pval)]

##      strata baseLO effLOR  est se.obs se.est pval
##   1:      1   -1.5    0.5 0.53   0.33   0.31 0.38
##   2:      2   -1.5    0.5 0.52   0.32   0.31 0.39
##   3:      1   -1.5    0.5 0.52   0.28   0.27 0.49
##   4:      2   -1.5    0.5 0.51   0.27   0.27 0.48
##   5:      1   -1.5    0.5 0.51   0.24   0.24 0.58
##  ---                                             
## 332:      2   -0.5    0.8 0.82   0.21   0.20 0.98
## 333:      1   -0.5    0.8 0.82   0.19   0.19 0.99
## 334:      2   -0.5    0.8 0.82   0.19   0.19 1.00
## 335:      1   -0.5    0.8 0.82   0.18   0.17 1.00
## 336:      2   -0.5    0.8 0.81   0.18   0.17 1.00

Plotting the results

The plots below compare the estimates of the two different stratification strategies. Each point represents a specific scenario under stratification by site alone and stratification by site and health status. If there are differences between the two strategies, we would expect to see the points diverge from the reference line. For all four plots, there appears to be little if any divergence, suggesting that, for these scenarios at least, there is little difference between the stratification strategies.

Power

In this first scatter plot, the estimated power under each stratification strategy is plotted. Power is estimated by the proportion of p-values in the 2500 iterations that were less than 0.05. Regardless of whether observed power for a particular scenario is high or low, we generally observe the same power under both strategies. The points do not diverge far from the red line, which represents perfect equality.
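
The plotting code is not shown here, but the comparison boils down to putting the two strategies side by side for each scenario. A minimal base R sketch, using only the first four rows of the summary output above (with cluster sizes taken from the matching rows of dparam); the column suffixes are names I made up:

```r
# Rows 1 to 4 of the summary output (strata: 1 = site, 2 = site and x)
res <- data.frame(
  strata = c(1, 2, 1, 2),
  clustsize = c(30, 30, 40, 40),
  baseLO = -1.5,
  effLOR = 0.5,
  pval = c(0.38, 0.39, 0.49, 0.48)
)

# One row per scenario, one power column per stratification strategy
keys <- c("clustsize", "baseLO", "effLOR")
wide <- merge(
  res[res$strata == 1, c(keys, "pval")],
  res[res$strata == 2, c(keys, "pval")],
  by = keys, suffixes = c(".site", ".site_x")
)

# Difference in estimated power between the two strategies
wide$diff <- wide$pval.site_x - wide$pval.site
wide
```

Plotting pval.site against pval.site_x and adding a line with intercept 0 and slope 1 would give a scatter plot of this kind, with points on the line indicating identical power under both strategies.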

Standard errors

There are two ways to look at the variability of the two strategies. First, we can look at the observed variability of the effect estimates across the 2500 iterations. And second, we can look at the average of the standard error estimates across the iterations. In general, the two randomization schemes appear quite similar with respect to both observed and estimated variation.

Treatment effects

In this last plot, the average estimated treatment effect is shown for each scenario. The two stratification strategies both appear to provide the same unbiased estimates of the treatment effect.

References:

Kernan, Walter N., Catherine M. Viscoli, Robert W. Makuch, Lawrence M. Brass, and Ralph I. Horwitz. “Stratified randomization for clinical trials.” Journal of clinical epidemiology 52, no. 1 (1999): 19-26.

Simulation for power in designing cluster randomized trials

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 28 Apr 2020 00:00:00 +0000

As a biostatistician, I like to be involved in the design of a study as early as possible. I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis. (I guess this is the appropriate time to pull out the quote made by the famous statistician Ronald Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”)

In these times, when researchers are scrambling to improve care for patients with Covid-19, there isn’t often time for those early conversations, or they are happening with many different people. Recently, I’ve been asked to help figure out what the sample size requirements are for four or five studies involving promising therapies for Covid-19 patients at various stages of the disease. In most cases, randomization is at the patient level, so power/sample size calculations are much simpler. In other situations, cluster randomization at the unit or hospital floor level is being considered, so the sample size estimates are a little more involved.

There are analytic/formula solutions for sample size estimates in non-clustered randomization. And if the outcome is continuous, adjustments can be made using an estimate of the design effect, which I wrote about recently. When the outcome is binary, or the number of clusters is small, or the cluster sizes themselves are small, I feel more comfortable using simulation methods. Indeed, the simstudy package grew out of my work to facilitate data generation for this very reason.

My intention here is to provide code to help others in case they want to conduct these relatively simple power analyses. One of the proposed studies is expected to have a large number of small-sized clusters, so that is what I’ll simulate here.

The data generation process

To estimate power under a range of scenarios, I’ve written two functions to define the data generation process, one to generate the data, and a final one to generate a single data set and estimate the parameters of a mixed effects model.

data definitions

The variance of the cluster-level random effect is based on a conversion of the intra-cluster correlation (ICC) to the logistic scale, which is done through a call to the function iccRE. The definition of the outcome is based on this random effect plus a log odds-ratio that is derived from the control proportion and the assumed percent change:

library(simstudy)
library(lme4)

defRE <- function(icc, dist = "binary", varW = NULL) {
  
  setVar <- iccRE(ICC = icc, dist = dist, varWithin = varW)
  def <- defData(varname = "a", formula = 0, variance = setVar, id = "cluster")
  
  return(def)
}

defBinOut <- function(p1, pctdelta) {
    
  p2 <- (1 - pctdelta) * p1
    
  int <- round(log( p1/(1-p1) ), 4)
  effect <- round(log( (p2/(1-p2)) / (p1/(1-p1) )), 4)
  formula <- genFormula( c(int, effect, 1), c("rx","a") )
    
  def <- defDataAdd(varname = "y", formula = formula, dist = "binary", 
                  link = "logit")
  return(def)
}

data generation

The data generation follows from the data definitions. First, cluster-level data are generated (along with treatment assignment), and then the individual patient level data.

genDataSet <- function(nclust, clustsize, re.def, out.def) {
  
  dClust <- genData(nclust, re.def)
  dClust <- trtAssign(dClust, grpName = "rx")
  
  dPat <- genCluster(dtClust = dClust, cLevelVar = "cluster", 
                     numIndsVar = clustsize, level1ID = "id")
  dPat <- addColumns(out.def, dPat)
  
  return(dPat)
}

model estimation

The p-values used for the power calculation are estimated using glmer of the lme4 package, a generalized mixed effects model. (If the outcome were continuous, we would use lmer instead.) Unfortunately, this can be relatively resource-intensive, so the repeated estimations over a wide range of scenarios can be rather time consuming.

One way to speed things up is to eliminate a step in the glmer algorithm that takes considerable time, but this has the side effect of excluding information about whether or not the model estimation has converged. Convergence can be a particular problem if variation across clusters is low, as when the ICC is low. The function below keeps track of whether an iteration has converged (but only if fast is set to FALSE). One might want to explore how frequently there is a failure to converge before turning on the fast flag.

This function returns the convergence status, the estimate of the random effects variance, and the effect parameter estimate, standard error, and p-value.

genBinEsts <- function(nclust, clustsize, re.def, out.def, fast = FALSE) {
  
  dP <- genDataSet(nclust, clustsize, re.def, out.def)

  mod.re <- glmer(y ~ rx + (1|cluster), data = dP, family = binomial,
      control = glmerControl( optimizer = "bobyqa", calc.derivs = !(fast) ))

  convStatus <- as.numeric(length(summary(mod.re)$optinfo$conv$lme4))
  
  res <- data.table(convStatus, re = VarCorr(mod.re)$cluster,
              t(coef(summary(mod.re))["rx",]))
  
  return(res)
}

Single data set

Here is an example setting the ICC at 0.025, the control proportion at 40%, and an effect size that translates to a 30% reduction (so that the treatment proportion will be 28%).

(defa <- defRE(icc = 0.025))

##    varname formula variance   dist     link
## 1:       a       0   0.0844 normal identity

(defy <- defBinOut(0.40, 0.30))

##    varname                       formula variance   dist  link
## 1:       y -0.4055 + -0.539 * rx + 1 * a        0 binary logit
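
Both definitions can be checked by hand in base R. The variance calculation below assumes the usual latent-variable definition of the ICC on the logistic scale, where the within-cluster variance is $\pi^2/3$; I am inferring that this is what iccRE does for a binary outcome because it reproduces the variance shown above.

```r
# Random effect variance implied by an ICC of 0.025 on the logistic scale
icc <- 0.025
varRE <- (icc / (1 - icc)) * (pi^2 / 3)
round(varRE, 4)             # 0.0844

# Intercept and effect (log odds ratio) for a 30% reduction from 40%
p1 <- 0.40
pctdelta <- 0.30
p2 <- (1 - pctdelta) * p1   # treatment proportion, 0.28

int <- log(p1 / (1 - p1))
effect <- log((p2 / (1 - p2)) / (p1 / (1 - p1)))
round(c(int, effect), 4)    # -0.4055 -0.5390
```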

And when we generate a single data set and estimate the parameters by calling the last function:

RNGkind("L'Ecuyer-CMRG")
set.seed(2711)

genBinEsts(40, 10, defa, defy)

##    convStatus re.(Intercept) Estimate Std. Error z value Pr(>|z|)
## 1:          0          0.284   -0.625      0.279   -2.24   0.0249

Estimating power

Everything is set up now to estimate power with repeated calls to this group of functions. This process can be done using the mclapply function in the parallel package, as I illustrated in an earlier post. Here, I am showing a for loop implementation.

The variables ICC, SS, nCLUST, ctlPROB, and pctDELTA are vectors containing all the possible scenarios for which power will be estimated. In this case, power will be based on 1000 iterations under each set of assumptions.

library(parallel)

nIters <- 1000
results <- NULL

for (icc in ICC) {
  for (ss in SS) {
    for (nclust in nCLUST) {
      for (p1 in ctlPROB) {
        for (pdelta in pctDELTA) {
          
          clustsize <- ss %/% nclust
          p2 <- p1 * (1 - pdelta)
          
          defa <- defRE(icc)
          defy <- defBinOut(p1, pdelta)
          
          res <- rbindlist(mclapply(1:nIters, 
                        function(x) genBinEsts(nclust, clustsize, defa, defy)))
          
          dres <- data.table(icc, ss, nclust, clustsize, p1, p2, pdelta,
                             converged = res[, mean(convStatus == 0)],
                             p.conv = res[convStatus == 0, mean(`Pr(>|z|)` < 0.05)],
                             p.all = res[convStatus != 2, mean(`Pr(>|z|)` < 0.05)])
          
          print(dres)
          results <- rbind(results, dres)
          
        }
      }
    }
  }
}

The first set of simulations evaluated power at a range of ICCs, sample sizes, and effect sizes (in terms of percentage reduction). The number of clusters was fixed at 40, so the cluster size increased along with sample size. The probability of an event for a patient in the control group was also fixed at 10%.

ICC <- seq(0.025, .10, 0.025)
SS <- seq(200, 1600, 200)
pctDELTA <- c(.2, .3, .4, .5, .6)
nCLUST <- 40
ctlPROB <- 0.10
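
Before launching the loop, it can help to count the scenarios and check the implied cluster sizes. This base R sketch mirrors the nesting of the for loop with expand.grid; the column names are my own.

```r
ICC <- seq(0.025, .10, 0.025)
SS <- seq(200, 1600, 200)
pctDELTA <- c(.2, .3, .4, .5, .6)
nCLUST <- 40
ctlPROB <- 0.10

# One row per combination visited by the nested for loops
scen <- expand.grid(icc = ICC, ss = SS, nclust = nCLUST,
                    p1 = ctlPROB, pdelta = pctDELTA)
scen$clustsize <- scen$ss %/% scen$nclust

nrow(scen)                      # number of scenarios
sort(unique(scen$clustsize))    # cluster sizes implied by 40 clusters
nrow(scen) * 1000               # total number of model fits at 1000 iterations
```

There are 160 scenarios with cluster sizes from 5 to 40, which at 1000 iterations each means 160,000 calls to glmer.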

The plot shows that at total sample sizes less than 800, we would only be able to detect effect sizes of 60% when the control proportion is 10%.

When the control proportion is at 40%, there is clearly a much higher probability that the study will detect an effect even at the smaller sample sizes. Under these scenarios, the ICC has a much greater impact on power than when the control proportion is much lower.

Other properties of the design

Of course, power is only one concern of many. For example, we may need to understand how a study’s design relates to bias and variance. In this case, I wondered how well the standard error estimates would compare to observed standard errors, particularly when the cluster sizes were on the lower end. Here are two plots comparing the two.

The lines represent the observed standard errors (the standard deviation of the measure of effect, or the parameter representing the log odds-ratio) at each sample size (assuming 40 clusters and an effect size of 30% reduction.) The points are the average estimate of the standard error with error bars that reflect $\pm$ 1 standard deviation.

In the first set of scenarios, the control probability is set at 10%. For all ICCs except perhaps 0.10, it appears that the standard error estimates are, on average, too large, though there is quite a bit of variability. The over-estimation declines as between cluster variance increases.

In the second set of scenarios, where the control probability is at 40%, there is less variation in the standard error estimates (as reflected in the shorter length error bars), and there appears to be a slight underestimate of variation, particularly with the larger ICCs.

I have only scratched the surface here in terms of the scenarios that can be investigated. In addition, there are other measurements to consider. Clearly, it would be useful to know if these observed biases in the standard error estimates disappear with larger cluster sizes, or how the number of clusters relates to this bias. And, I didn’t even look at whether effect size estimates are biased in different scenarios. The point is, while power is important, we must also understand the quality of our estimates.

Yes, unbalanced randomization can improve power, in some situations

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 14 Apr 2020 00:00:00 +0000

Last time I provided some simulations that suggested that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test.

Generating two groups with different means and variance

Using simstudy (the latest version 1.16 is now available on CRAN), it is possible to generate different levels of variance by specifying a formula in the data definition. In this example, the treatment group variance is five times the control group variance:

library(simstudy)
library(data.table)

def <- defDataAdd(varname = "y", formula = "1.1*rx", 
    variance = "1*(rx==0) + 5*(rx==1)", dist = "normal")

I have written a simple function to generate the data that can be used later in the power experiments:

genDataSet <- function(n, ratio, def) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
  dx <- addColumns(def, dx)
  return(dx[])
  
}

And now we can generate and look at some data.

RNGkind("L'Ecuyer-CMRG")
set.seed(383)

dx1 <- genDataSet(72, c(1, 2), def)
davg <- dx1[, .(avg = mean(y)), keyby = rx]

library(paletteer)
library(ggplot2)

ggplot(data = dx1, aes(x = factor(rx), y = y) ) +
  geom_jitter(width = .15, height = 0, aes(color = factor(rx))) +
  geom_hline(data= davg, lty = 3, size = .75,
    aes(yintercept = avg, color = factor(rx))) +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "grey98"),
        legend.position = "none",
        axis.title.x = element_blank()) +
  scale_x_discrete(labels = c("control", "treatment")) +
  scale_color_paletteer_d("jcolors::pal5")

Power analyses

The following function generates a data set, records the difference in means for the two groups, and estimates the p-value of a t-test that assumes different variances for the two groups.

genPvalue <- function(n, ratio, def) {
  
  dx <- genDataSet(n, ratio, def)
  
  mean.dif <- dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
  p.value <- t.test(y~rx, data = dx)$p.value
  
  data.table(r = paste0(ratio[1], ":", ratio[2]), mean.dif, p.value)
}

In this comparison, we are considering three different designs or randomization schemes. In the first, randomization will be 1 to 1, so that half of the sample of 72 (n = 36) is assigned to the treatment arm. In the second, randomization will be 1 to 2, so that 2/3 of the sample (n=48) is assigned to treatment. And in the last, randomization will be 1 to 3, where 3/4 of the patients (n = 54) will be randomized to treatment. For each scenario, we will estimate the mean difference between the groups, the standard deviation of differences, and the power. All of these estimates will be based on 5000 data sets each, and we are still assuming treatment variance is five times the control variance.

library(parallel)

ratios <- list(c(1, 1), c(1, 2), c(1, 3))

results <- mclapply(ratios, function(r) 
  rbindlist(mclapply(1:5000, function(x) genPvalue(72, r, def ))) 
)

results <- rbindlist(results)

All three schemes provide an unbiased estimate of the effect size, though the unbalanced designs have slightly less variability:

results[, .(avg.difference = mean(mean.dif), 
            sd.difference = sd(mean.dif)), keyby = r]

##      r avg.difference sd.difference
## 1: 1:1            1.1          0.41
## 2: 1:2            1.1          0.38
## 3: 1:3            1.1          0.39

The reduced variability translates into improved power for the unbalanced designs:

results[, .(power = mean(p.value < 0.05)), keyby = r]

##      r power
## 1: 1:1  0.75
## 2: 1:2  0.81
## 3: 1:3  0.81
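
These simulated results can be sanity-checked against a closed-form approximation. The sketch below is not part of the original simulation. It assumes the standard error of the difference in means is $\sqrt{\sigma_0^2/n_0 + \sigma_1^2/n_1}$ and uses a normal approximation to the power of a two-sided test at the 0.05 level, so it will be slightly optimistic relative to the Welch t-test used above. The function name approx_design is made up for this illustration.

```r
# Analytic check (normal approximation; an assumption, not the Welch t-test used above)
approx_design <- function(n, ratio, delta = 1.1, var0 = 1, var1 = 5, alpha = 0.05) {
  n0 <- n * ratio[1] / sum(ratio)    # control arm size
  n1 <- n * ratio[2] / sum(ratio)    # treatment arm size
  se <- sqrt(var0 / n0 + var1 / n1)  # SE of the difference in means
  power <- pnorm(delta / se - qnorm(1 - alpha / 2))
  data.frame(r = paste0(ratio[1], ":", ratio[2]), n0, n1, se, power)
}

ratios <- list(c(1, 1), c(1, 2), c(1, 3))
check <- do.call(rbind, lapply(ratios, approx_design, n = 72))
check
```

Under these assumptions, the standard errors land close to the simulated standard deviations, and the approximate power is lowest for the 1:1 design, which matches the pattern in the simulations.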

Benefits of imbalance under different variance assumptions

It seems reasonable to guess that if the discrepancy in variance between the two groups is reduced, there will be less advantage to over-allocating patients in the treatment arm. In fact, it may even be a disadvantage, as in the case of a binary outcome. Likewise, as the discrepancy increases, increased enthusiasm for unbalanced designs may be warranted.

Here is a plot (code available upon request) showing how the variation in the mean differences (shown as a standard deviation) relates to the design scheme and the underlying difference in the variance of the control and treatment groups. In all cases, the assumed variance for the control group was 1. The variance for the treatment group ranged from 1 to 9 in different sets of simulations. At each level of variance, four randomization schemes were evaluated: 1 to 1, 1 to 2, 1 to 3, and 1 to 4.

When variances are equal, there is no apparent benefit to using anything other than a 1:1 randomization scheme. Even when the variance of the treatment group increases to 3, there is little benefit to a 1:2 arrangement. At higher levels of variance - in this case 5 - there appears to be more of a benefit to randomizing more people to treatment. However, at all the levels shown here, it does not look like anything above 1:2 is warranted.
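
The same pattern can be approximated analytically. The sketch below is an illustration under stated assumptions rather than the code behind the plot: it takes a total sample size of 72 (the value used earlier in this post), a control variance of 1, and allows fractional arm sizes. The helper sd_diff is a hypothetical name.

```r
# Analytic SD of the mean difference for each treatment variance and allocation
# (assumes total n = 72 and control variance 1; fractional arm sizes are allowed)
sd_diff <- function(var1, k, n = 72, var0 = 1) {
  n0 <- n / (1 + k)        # control arm under 1:k allocation
  n1 <- n * k / (1 + k)    # treatment arm
  sqrt(var0 / n0 + var1 / n1)
}

grid <- expand.grid(k = 1:4, var1 = c(1, 3, 5, 7, 9))
grid$sd <- with(grid, sd_diff(var1, k))

# Rows: treatment variance; columns: allocation 1:k
tab <- round(xtabs(sd ~ var1 + k, data = grid), 3)
tab

# Allocation that minimizes the SD in theory: n1 / n0 = sqrt(var1 / var0)
sqrt(c(1, 3, 5, 7, 9))
```

The surface is quite flat near its minimum, so even when the theoretical optimum is more lopsided than 1:2, the gain from going further is small.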

So, before heading down the path of unbalanced randomization, make sure to take a look at your variance assumptions.

Can unbalanced randomization improve power?

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 31 Mar 2020 00:00:00 +0000

Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.

I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes.

Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to try to get possible treatments to as many people as possible, even in a randomized controlled trial. One question that came up in the design of this study was whether there would be efficiency gains from using a 1:2 randomization scheme. That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only lose efficiency if we use anything other than a 1:1 randomization.

Brief public service announcement: simstudy update

When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the simstudy package, which supports many of the simulations on this blog, could not readily handle anything other than 1:1 randomization. I had to quickly rectify that shortcoming. There is a new argument ratio in the trtAssign function where you can now specify any scheme for any number of treatment arms. This is available in version 1.16, which for the moment can be found only on github (kgoldfeld/simstudy).

Here is an example based on a 1:2:3 allocation. I’m not sure if that would ever be appropriate, but it shows the flexibility of the new argument. One counter-intuitive aspect of this implementation is that the balanced argument is set to TRUE, indicating that the allocation to the groups will be perfect, or as close as possible to the specified ratios. If balanced is FALSE, the ratios are used as relative probabilities instead.

library(simstudy)
library(data.table)
library(ggplot2)
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(16)

dx <- genData(600)
dx <- trtAssign(dx, nTrt = 3, balanced = TRUE, 
                ratio = c(1,2,3), grpName = "rx")

dx[, table(rx)]

## rx
##   1   2   3 
## 100 200 300
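
To make the distinction between the two allocation modes concrete, here is a minimal base R sketch. It does not call simstudy and is not how trtAssign is implemented. It only mimics the two behaviors described above: exact group sizes versus independent draws that use the ratio as relative probabilities.

```r
set.seed(16)

n <- 600
ratio <- c(1, 2, 3)

# Exact allocation: group sizes proportional to the ratio, assigned in random order
sizes <- n * ratio / sum(ratio)
rx_exact <- sample(rep(seq_along(ratio), times = sizes))

# Probabilistic allocation: each unit is drawn independently,
# with the ratio acting as relative probabilities
rx_prob <- sample(seq_along(ratio), size = n, replace = TRUE, prob = ratio)

table(rx_exact)
table(rx_prob)
```

The first table reproduces the 100/200/300 split exactly, while the second will typically be close to it but not equal.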

Unbalanced designs with a binary outcome

The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events has to occur within 30 days to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed in the usual care arm 40% of the patients would have a bad outcome and that the drug treatment would reduce the bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome).

If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters.

estCoef <- function(n, formula, ratio) {
  
  def <- defDataAdd(varname = "y", formula = formula, dist = "binary")
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
  dx <- addColumns(def, dx)
  
  coef(summary(glm(y~rx, family = binomial, data = dx)))
}

estCoef(n = 244*2, formula = "0.4 - 0.3 * 0.4 * rx", ratio = c(1, 1))

##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -0.4328641  0.1310474 -3.303111 0.0009561867
## rx          -0.4577476  0.1924533 -2.378487 0.0173838304

The probabilities of a bad outcome for the usual care group and drug treatment group are

c(usual = 1/(1 + exp(0.433)), drug = 1/(1+exp(0.433 + 0.458)))

##     usual      drug 
## 0.3934102 0.2909035
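
Base R’s plogis and qlogis perform the same inverse-logit and logit conversions, which makes it easy to compare the estimates with the values implied by the data generating process (probabilities of 0.40 and 0.28). A small sketch, using the rounded coefficients reported above:

```r
# Inverse logit of the fitted coefficients reported above
est <- c(usual = plogis(-0.433), drug = plogis(-0.433 - 0.458))

# Log-odds implied by the assumed probabilities of 0.40 and 0.28
true_int <- qlogis(0.40)
true_rx  <- qlogis(0.28) - qlogis(0.40)

round(est, 3)
round(c(intercept = true_int, rx = true_rx), 3)
```

On the log-odds scale the assumed effect is roughly -0.54, so the estimate of -0.458 from this single data set is within about half a standard error of the truth.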

Assessing power

In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05.

Here is the analytic solution for a 1:1 ratio.

power.prop.test(p1 = .4, p2 = .7*.4, power = .80)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 243.4411
##              p1 = 0.4
##              p2 = 0.28
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The sample size estimate based on 80% power suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated $n$ in a simulation for power (using 1000 datasets), we should be close to 80%:

est.power <- function(n, ratio, p1, reduction) {
  
  formula <- paste(p1, "* (1 -", reduction, "* rx)")
  p.val <- estCoef(n, formula, ratio)["rx", 4]
  return(p.val)
  
}

pvals <- unlist(mclapply(1:1000, 
    function(x) est.power(488, c(1, 1), 0.4, 0.3)))

mean(pvals < 0.05)

## [1] 0.814

The power experiment

Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Here is the code that will allow us to explore these questions:

res <- list()

for (p1 in c(.2, .3, .4, .5)) {
  for (r in c(.2, .25, .3)) {
    
    p2 <- (1- r) * p1
    n <- ceiling(power.prop.test(p1 = p1, p2 = p2, power = .80)$n)*2
    
    for (i in c(1:5)) {
      
      pvals <- mclapply(1:1000, function(x) est.power(n, c(1, i), p1, r))
      pvals <- unlist(pvals)
      
      dres <- data.table(n, control_p = p1, pct_reduction = r, 
                         control = 1, rx = i, power = mean( pvals < .05))
      
      res <- append(res, list(dres))
      
    }
  }
}

res <- rbindlist(res)

Repeating the power simulation for a variety of assumptions indicates that, at least in the case of a binary outcome, using an unbalanced design does not improve the quality of the research even though it might get more patients the drug treatment:

ggplot(data = res, aes(x = rx, y = power)) +
  geom_line(color = "blue") +
  facet_grid(control_p ~ pct_reduction, labeller = label_both) +
  theme(panel.grid = element_blank()) +
  scale_x_continuous(name = "ratio of treatment to control", 
                     breaks = c(1:5), labels = paste0(c(1:5),":1")) +
  scale_y_continuous(limits = c(.5,.9), breaks = c(.6, .7, .8))
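
A rough analytic check points the same way. The sketch below is an approximation under stated assumptions, not the method used in the simulations: it applies a normal approximation to a two-sided test of two proportions with unpooled variances and lets arm sizes be fractional. The function approx_power is a hypothetical helper.

```r
# Normal-approximation power for comparing two proportions under 1:k allocation
# (a rough analytic check; the simulations above use a logistic regression Wald test)
approx_power <- function(n, k, p1, reduction, alpha = 0.05) {
  p2 <- (1 - reduction) * p1
  n0 <- n / (1 + k)        # usual care arm
  n1 <- n * k / (1 + k)    # drug treatment arm
  se <- sqrt(p1 * (1 - p1) / n0 + p2 * (1 - p2) / n1)
  pnorm(abs(p1 - p2) / se - qnorm(1 - alpha / 2))
}

pw <- sapply(1:5, function(k) approx_power(n = 488, k = k, p1 = 0.4, reduction = 0.3))
names(pw) <- paste0("1:", 1:5)
round(pw, 3)
```

Because the variance of a binary outcome is tied to its mean, the two arms have similar variances here (0.24 versus about 0.20), so there is nothing to gain by shifting patients toward treatment and the approximate power falls steadily as the ratio moves away from 1:1.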

Continuous outcomes

In the case of binary outcomes, reducing sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I’ll show some simulations that indicate this might be the case.

Stay well.

When you want more than a chi-squared test, consider a measure of association

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 17 Mar 2020 00:00:00 +0000

In my last post, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the $\chi^2$ statistic.

I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal. This can arise, for example, when we want to compare Likert-type responses across multiple groups. The measure of effect I focused on - the cumulative proportional odds - is quite useful, but is potentially limited for two reasons. First, the proportional odds assumption may not be reasonable, potentially leading to biased estimates. Second, both factors may be nominal (i.e. not ordinal), in which case a cumulative odds model is inappropriate.

An alternative, non-parametric measure of association that can be broadly applied to any contingency table is Cramér’s V, which is calculated as

\[ V = \sqrt{\frac{\chi^2/N}{\min(r-1, c-1)}} \] where $\chi^2$ is from Pearson’s chi-squared test, $N$ is the total number of responses across all groups, $r$ is the number of rows in the contingency table, and $c$ is the number of columns. $V$ ranges from $0$ to $1$, with $0$ indicating no association, and $1$ indicating the strongest possible association. (In the addendum, I provide a little detail as to why $V$ cannot exceed $1$.)

Simulating independence

In this first example, the distribution of ratings is independent of the group membership. In the data generating process, the probability distribution for rating has no reference to grp, so we would expect similar distributions of the response across the groups:

library(simstudy)

def <- defData(varname = "grp", 
         formula = "0.3; 0.5; 0.2", dist = "categorical")
def <- defData(def, varname = "rating", 
         formula = "0.2;0.3;0.4;0.1", dist = "categorical")

set.seed(99)
dind <- genData(500, def)

And in fact, the distributions across the 4 rating options do appear pretty similar for each of the 3 groups:

In order to estimate $V$ from this sample, we use the $\chi^2$ formula (I explored the chi-squared test with simulations in a two-part post here and here):

\[ \chi^2 = \sum_{i,j} {\frac{(O_{ij} - E_{ij})^2}{E_{ij}}} \]

observed <- dind[, table(grp, rating)]
obs.dim <- dim(observed)

getmargins <- addmargins(observed, margin = seq_along(obs.dim), 
                         FUN = sum, quiet = TRUE)

rowsums <- getmargins[1:obs.dim[1], "sum"]
colsums <- getmargins["sum", 1:obs.dim[2]]

expected <- rowsums %*% t(colsums) / sum(observed)
X2 <- sum( ( (observed - expected)^2) / expected)

X2

## [1] 3.45

And to check our calculation, here’s a comparison with the estimate from the chisq.test function:

chisq.test(observed)

## 
##  Pearson's Chi-squared test
## 
## data:  observed
## X-squared = 3.5, df = 6, p-value = 0.8

With $\chi^2$ in hand, we can estimate $V$, which we expect to be quite low:

sqrt( (X2/sum(observed)) / (min(obs.dim) - 1) )

## [1] 0.05874

Again, to verify the calculation, here is an alternative estimate using the DescTools package, with a 95% confidence interval:

library(DescTools)

CramerV(observed, conf.level = 0.95)

## Cramer V   lwr.ci   upr.ci 
##  0.05874  0.00000  0.08426

Group membership matters

In this second scenario, the distribution of rating is specified directly as a function of group membership. This is an extreme example, designed to elicit a very high value of $V$:

def <- defData(varname = "grp", 
            formula = "0.3; 0.5; 0.2", dist = "categorical")

defc <- defCondition(condition = "grp == 1", 
            formula = "0.75; 0.15; 0.05; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 2", 
            formula = "0.05; 0.75; 0.15; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 3", 
            formula = "0.05; 0.05; 0.15; 0.75", dist = "categorical")

# generate the data

dgrp <- genData(500, def)
dgrp <- addCondition(defc, dgrp, "rating")

It is readily apparent that the structure of the data is highly dependent on the group:

And, as expected, the estimated $V$ is quite high:

observed <- dgrp[, table(grp, rating)]

CramerV(observed, conf.level = 0.95)

## Cramer V   lwr.ci   upr.ci 
##   0.7400   0.6744   0.7987

Interpretation of Cramér’s V using proportional odds

A key question is how we should interpret $V$. Some folks suggest that $V \le 0.10$ is very weak and anything over $0.25$ could be considered quite strong. I decided to explore this a bit by seeing how various cumulative odds ratios relate to estimated values of $V$.

To give a sense of what some log odds ratios (LORs) look like, I have plotted distributions generated from cumulative proportional odds models, using LORs ranging from 0 to 2. At 0.5, there is slight separation between the groups, and by the time we reach 1.0, the differences are considerably more apparent:

My goal was to see how estimated values of $V$ change with the underlying LORs. I generated 100 data sets for each LOR ranging from 0 to 3 (increasing by increments of 0.05) and estimated $V$ for each data set (of which there were 6100). The plot below shows the mean $V$ estimate (in yellow) at each LOR, with the individual estimates represented by the grey points. I’ll let you draw your own conclusions, but (in this scenario at least), it does appear that 0.25 (the dotted horizontal line) signifies a pretty strong relationship, as LORs larger than 1.0 generally have estimates of $V$ that exceed this threshold.

p-values and Cramér’s V

To end, I am just going to circle back to where I started at the beginning of the previous post, thinking about p-values and effect sizes. Here, I’ve generated data sets with a relatively small between-group difference, using a modest LOR of 0.40 that translates to a measure of association $V$ just over 0.10. I varied the sample size from 200 to 1000. For each data set, I estimated $V$ and recorded whether the p-value from a chi-squared test would have been deemed “significant” (i.e. p-value $< 0.05$). The key point here is that as the sample size increases and we rely solely on the chi-squared test, we are increasingly likely to attach importance to the findings even though the measure of association is quite small. However, if we actually consider a measure of association like Cramér’s $V$ (or some other measure that you might prefer) in drawing our conclusions, we are less likely to get over-excited about a result when perhaps we shouldn’t.

I should also comment that at smaller sample sizes, we will probably over-estimate the measure of association. Here, it would be important to consider some measure of uncertainty, like a 95% confidence interval, to accompany the point estimate. Otherwise, as in the case of larger sample sizes, we would run the risk of declaring success or finding a difference when it may not be warranted.
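
For readers who want to experiment with this idea, here is a minimal base R sketch. It is not the code behind the plots, and the setup is an assumption: two groups of equal size, four rating categories with baseline probabilities borrowed from the independence example, and ratings drawn from a proportional odds model through a latent logistic variable. The helper names cramers_v, gen_table, and sim are hypothetical, and the resulting values of $V$ should not be expected to match those described above.

```r
set.seed(2020)

# Cramér's V and the chi-squared p-value for a two-way table
cramers_v <- function(tab) {
  x2 <- suppressWarnings(chisq.test(tab, correct = FALSE))
  c(V = unname(sqrt((x2$statistic / sum(tab)) / (min(dim(tab)) - 1))),
    p = x2$p.value)
}

# Ordinal ratings for two groups under a proportional odds model (assumed setup)
gen_table <- function(n, lor, baseprobs = c(0.2, 0.3, 0.4, 0.1)) {
  grp <- rep(0:1, each = n / 2)
  cuts <- qlogis(cumsum(baseprobs)[-length(baseprobs)])  # baseline thresholds
  u <- rlogis(n) + lor * grp                              # latent score, shifted in group 1
  rating <- findInterval(u, cuts) + 1
  table(factor(grp), factor(rating, levels = seq_along(baseprobs)))
}

# Average V and proportion of "significant" tests at a given sample size
sim <- function(n, nrep = 200) {
  res <- replicate(nrep, cramers_v(gen_table(n, lor = 0.4)))
  c(n = n, mean_V = mean(res["V", ]), prop_sig = mean(res["p", ] < 0.05))
}

out <- do.call(rbind, lapply(c(200, 400, 600, 800, 1000), sim))
round(out, 3)
```

Comparing the mean_V and prop_sig columns across sample sizes shows how the share of “significant” results can change with $n$ even though the underlying association is held fixed.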

Addendum: Why is Cramér’s V $\le$ 1?

Cramér’s $V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}$, which cannot be lower than 0. $V=0$ when $\chi^2 = 0$, which will only happen when the observed cell counts for all cells equal the expected cell counts for all cells. In other words, $V=0$ only when there is complete independence.

It is also the case that $V$ cannot exceed $1$. I will provide some intuition for this using a relatively simple example and some algebra. Consider the following contingency table which represents complete separation of the three groups:

I would argue that this initial $3 \times 4$ table is equivalent to the following $3 \times 3$ table that collapses responses $1$ and $2$ - no information about the dependence has been lost or distorted. In this case $n_A = n_{A1} + n_{A2}$.

In order to calculate $\chi^2$, we need to derive the expected values based on this collapsed contingency table. If $p_{ij}$ is the probability for the cell in row $i$ and column $j$, and $p_{i\cdot}$ and $p_{\cdot j}$ are the row $i$ and column $j$ totals, respectively, then independence implies that $p_{ij} = p_{i\cdot}p_{\cdot j}$. In this example, under independence, the expected cell count for cell $i,j$ is $\frac{n_i}{N} \frac{n_j}{N} N = \frac{n_in_j}{N}$:

If we consider the contribution of group $A$ to $\chi^2$, we start with the $\sum_{group \ A} (O_j - E_j)^2/E_j$ and end up with $N - n_A$:

\[ \begin{aligned} \chi^2_{\text{rowA}} &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{\left ( \frac{n_An_B}{N} \right )^2}{\frac{n_An_B}{N}} + \frac{\left ( \frac{n_An_C}{N} \right )^2}{\frac{n_An_C}{N}} \\ \\ &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &=N \left ( \frac{n_A^2 - \frac{2n_A^3}{N} +\frac{n_A^4}{N^2}} {n_A^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &=N \left ( 1 - \frac{2n_A}{N} +\frac{n_A^2}{N^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &= N - 2n_A +\frac{n_A^2}{N} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\ &= N - 2n_A + \frac{n_A}{N} \left ( {n_A} + n_B + n_C \right ) \\ \\ &= N - 2n_A + \frac{n_A}{N} N \\ \\ &= N - n_A \end{aligned} \]

If we repeat this on rows 2 and 3 of the table, we will find that $\chi^2_{\text{rowB}} = N - n_B$, and $\chi^2_{\text{rowC}} = N - n_C$, so

\[ \begin{aligned} \chi^2 &= \chi^2_\text{rowA} +\chi^2_\text{rowB}+\chi^2_\text{rowC} \\ \\ &=(N - n_A) + (N - n_B) + (N - n_C) \\ \\ &= 3N - (n_A + n_B + n_C) \\ \\ &= 3N - N \\ \\ \chi^2 &= 2N \end{aligned} \]

And

\[ \frac{\chi^2}{2 N} = 1 \]

So, under this scenario of extreme separation between groups,

\[ V = \sqrt{\frac{\chi^2}{\text{min}(r-1, c-1) \times N}} = 1 \]

where $\text{min}(r - 1, c - 1) = \text{min}(2, 3) = 2$.
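
The algebra can be checked numerically. The counts in this sketch are made up for illustration. Any $3 \times 4$ table with the same pattern of complete separation (group A uses only responses 1 and 2, group B only response 3, group C only response 4) should give $\chi^2 = 2N$ and $V = 1$.

```r
# Hypothetical counts with complete separation of the three groups
tab <- matrix(c(30, 20,  0,  0,
                 0,  0, 60,  0,
                 0,  0,  0, 40),
              nrow = 3, byrow = TRUE,
              dimnames = list(grp = c("A", "B", "C"),
                              rating = as.character(1:4)))

N  <- sum(tab)
X2 <- unname(chisq.test(tab)$statistic)
V  <- sqrt((X2 / N) / (min(dim(tab)) - 1))

c(N = N, X2 = X2, V = V)
```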

Alternatives to reporting a p-value: the case of a contingency table

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 03 Mar 2020 00:00:00 +0000

I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the American Statistical Association has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. In 2016, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.” These principles are worth noting here in case you don’t make it over to the original paper:

p-values can indicate how incompatible the data are with a specified statistical model.
p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

More recently, the ASA elaborated on this, responding to those who thought the initial paper was too negative, little more than a list of things not to do. In this new paper, the ASA argues that “knowing what not to do with p-values is indeed necessary, but it does not suffice.” We also need to know what we should do. One of those things should be focusing on effect sizes (and some measure of uncertainty, such as a confidence or credible interval) in order to evaluate an intervention or exposure.

Applying principled thinking to a small problem

Recently, I was discussing the presentation of results for a pilot study. I was arguing that we should convey the findings in a way that highlighted the general trends without leading readers to make overly strong conclusions, which p-values might do. So, I was arguing that, rather than presenting p-values, we should display effect sizes and confidence intervals, and avoid drawing on the concept of “statistical significance.”

Generally, this is not a problem; we can estimate an effect size like a difference in means, a difference in proportions, a ratio of proportions, a ratio of odds, or even the log of a ratio of odds. In this case, the outcome was a Likert-type survey item with response options “none”, “a little”, and “a lot”, and there were three comparison groups, so we had a $3\times3$ contingency table with one ordinal (i.e. ordered) factor. Here, it is not so clear what the effect size measurement should be.

One option is to calculate a $\chi^2$ statistic, report the associated p-value, and call it a day. However, since the $\chi^2$ is not a measure of effect and the p-value is not necessarily a good measure of evidence, I considered estimating a cumulative odds model that would provide a measure of the association between group and response. However, I was a little concerned, because the typical version of this model makes an assumption of proportional odds, which I wasn’t sure would be appropriate here. (I’ve written about these models before, here and here, if you want to take a look.) It is possible to fit a cumulative odds model without the proportionality assumption, but then the estimates are harder to interpret since the effect size varies by group and response.

Fortunately, there is a more general measure of association for contingency tables with at least one, but possibly two, nominal factors: Cramer’s V. This measure makes no assumptions about proportionality.

My plan is to simulate contingency table data, and in this post, I will explore the cumulative odds models. Next time, I’ll describe the Cramer’s V measure of association.

Non-proportional cumulative odds

In the cumulative odds model (again, take a look here for a little more description of these models), we assume that all the log-odds ratios are proportional. This may actually not be an unreasonable assumption, but I wanted to start with a data set that is generated without explicitly assuming proportionality. In the following data definition, the distribution of survey responses (none, a little, and a lot) is specified uniquely for each of the three groups (1, 2, and 3):

library(simstudy)

# define the data

def <- defData(varname = "grp", 
            formula = "0.3; 0.5; 0.2", dist = "categorical")

defc <- defCondition(condition = "fgrp == 1", 
            formula = "0.70; 0.20; 0.10", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2", 
            formula = "0.10; 0.60; 0.30", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3", 
            formula = "0.05; 0.25; 0.70", dist = "categorical")

# generate the data

set.seed(99)

dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- addCondition(defc, dx, "rating")
dx <- genFactor(dx, "rating", replace = TRUE, 
         labels = c("none", "a little", "a lot"))

dx[]

##       id fgrp  frating
##   1:   1    2 a little
##   2:   2    3 a little
##   3:   3    3    a lot
##   4:   4    2 a little
##   5:   5    2 a little
##  ---                  
## 176: 176    2    a lot
## 177: 177    1     none
## 178: 178    3    a lot
## 179: 179    2 a little
## 180: 180    2 a little

A distribution plot based on these 180 observations indicates that the odds are not likely proportional; the “tell” is the large bulge for those in group 2 who respond a little.

library(likert)

items <- dx[, .(frating)]
names(items) <- c(frating = "rating")

likert.data <- likert(items = items, grouping = dx$fgrp)
plot(likert.data, wrap = 100, low.color = "#DAECED", 
  high.color = "#CECD7B")
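
The departure from proportionality can also be read directly off the probabilities used to generate the data. The sketch below is an illustration, not part of the original analysis. It converts each group’s response probabilities to cumulative log odds and compares groups 2 and 3 with group 1 at each threshold. Under proportional odds, the two differences in each row would be equal.

```r
# Response probabilities specified for each group (none, a little, a lot)
probs <- rbind(grp1 = c(0.70, 0.20, 0.10),
               grp2 = c(0.10, 0.60, 0.30),
               grp3 = c(0.05, 0.25, 0.70))

# Cumulative log odds at the two thresholds
cum_lo <- t(apply(probs, 1, function(p) qlogis(cumsum(p)[1:2])))
colnames(cum_lo) <- c("none|a little", "a little|a lot")

# Differences relative to group 1; proportional odds would make each row constant
lor <- sweep(cum_lo[-1, ], 2, cum_lo[1, ])
round(lor, 2)
```

For group 2 the two differences are far apart, which is the numerical counterpart of the bulge at “a little” in the plot.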

The $\chi^2$ test, not so surprisingly, indicates that it would be reasonable to conclude there are differences in responses across the three groups:

chisq.test(table(dx[, .(fgrp, frating)]))

## 
##  Pearson's Chi-squared test
## 
## data:  table(dx[, .(fgrp, frating)])
## X-squared = 84, df = 4, p-value <2e-16

But, since we are trying to provide a richer picture of the association that will be less susceptible to small sample sizes, here is the cumulative (proportional) odds model fit using the clm function in the ordinal package.

library(ordinal)

clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)

## formula: frating ~ fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -162.95 333.91 5(0)  4.61e-08 2.8e+01
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## fgrp2    2.456      0.410    5.98  2.2e-09 ***
## fgrp3    3.024      0.483    6.26  3.9e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## none|a little     0.335      0.305    1.10
## a little|a lot    2.945      0.395    7.45

A plot of the observed proportions (shown by the line) with the modeled proportions (shown as points) indicates that the model that makes the proportional assumption might not be doing a great job:

If we fit a model that does not make the proportionality assumption and compare the two using either the AIC statistic (lower is better) or a likelihood ratio test (a small p-value indicates that the saturated/non-proportional model is better), it is clear that the non-proportional odds model for this dataset is a better fit.

clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
summary(clmFit.sat)

## formula: frating ~ 1
## nominal: ~fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -149.54 311.08 7(0)  8.84e-11 4.7e+01
## 
## Threshold coefficients:
##                            Estimate Std. Error z value
## none|a little.(Intercept)     0.544      0.296    1.83
## a little|a lot.(Intercept)    1.634      0.387    4.23
## none|a little.fgrp2          -4.293      0.774   -5.54
## a little|a lot.fgrp2         -0.889      0.450   -1.98
## none|a little.fgrp3          -2.598      0.560   -4.64
## a little|a lot.fgrp3         -1.816      0.491   -3.70

anova(clmFit.prop, clmFit.sat)

## Likelihood ratio tests of cumulative link models:
##  
##             formula:       nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1       logit flexible  
## clmFit.sat  frating ~ 1    ~fgrp    logit flexible  
## 
##             no.par AIC logLik LR.stat df Pr(>Chisq)    
## clmFit.prop      4 334   -163                          
## clmFit.sat       6 311   -150    26.8  2    1.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
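
The AIC values and the likelihood ratio test can be reproduced by hand from the log-likelihoods and parameter counts printed above. This sketch uses the rounded values from the output, so small discrepancies in the last digit are possible.

```r
# Log-likelihoods and parameter counts taken from the model summaries above
ll_prop <- -162.95; k_prop <- 4
ll_sat  <- -149.54; k_sat  <- 6

# AIC = -2 * logLik + 2 * (number of parameters)
aic <- c(prop = -2 * ll_prop + 2 * k_prop, sat = -2 * ll_sat + 2 * k_sat)

# Likelihood ratio statistic, compared with a chi-squared on 2 degrees of freedom
lr_stat <- 2 * (ll_sat - ll_prop)
p_value <- pchisq(lr_stat, df = k_sat - k_prop, lower.tail = FALSE)

round(aic, 2)
c(LR = lr_stat, p = p_value)
```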

It is possible that the poor fit is just a rare occurrence. Below is a plot that shows the average result ($\pm 1 \ sd$) for 1000 model fits to 1000 data sets using the same data generation process. It appears those initial results were not an aberration - the proportional odds model yields biased estimates, particularly for groups 1 and 2. (The code to do this simulation is shown in the addendum.)

Proportional assumption fulfilled

Here the data generation process is modified so that the proportionality assumption is incorporated.

def <- defData(varname = "grp", formula = ".3;.5;.2", 
               dist = "categorical")
def <- defData(def, varname = "z", formula = "1*I(grp==2) + 2*I(grp==3)", 
               dist = "nonrandom")

baseprobs <- c(0.7, 0.2, 0.1)

dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- genOrdCat(dx, adjVar = "z", baseprobs, catVar = "rating")
dx <- genFactor(dx, "rating", replace = TRUE,
          labels = c("none", "a little", "a lot")
)
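
To see what this definition implies, the response probabilities for each group can be worked out directly. The sketch below assumes that the adjustment variable z is subtracted from the baseline cumulative log odds, which is consistent with the rightward shift and the sign of the fitted coefficients reported in this section. It is an illustration of the proportional odds mechanics, not a description of the genOrdCat internals.

```r
baseprobs <- c(0.7, 0.2, 0.1)
z <- c(grp1 = 0, grp2 = 1, grp3 = 2)

# Baseline cumulative log odds at the two thresholds
cum_logit <- qlogis(cumsum(baseprobs)[1:2])

# Shift the cumulative log odds by z (assumed direction), then difference
# the cumulative probabilities to recover the category probabilities
probs <- t(sapply(z, function(shift) {
  cum <- c(plogis(cum_logit - shift), 1)
  diff(c(0, cum))
}))
colnames(probs) <- c("none", "a little", "a lot")

round(probs, 3)
```

Each step up in group moves every cumulative log odds by the same amount, which is exactly what the proportionality assumption requires.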

This is what proportional odds looks like - there are no obvious bulges, just a general shift rightward as we move from group 1 to 3:

When we fit the proportional model and compare it to the saturated model, we see no reason to reject the assumption of proportionality (based on either the AIC or LR statistics).

clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)

## formula: frating ~ fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -176.89 361.77 4(0)  3.04e-09 2.7e+01
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## fgrp2    1.329      0.359    3.70  0.00022 ***
## fgrp3    2.619      0.457    5.73    1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## none|a little     0.766      0.299    2.56
## a little|a lot    2.346      0.342    6.86

clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
anova(clmFit.prop, clmFit.sat)

## Likelihood ratio tests of cumulative link models:
##  
##             formula:       nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1       logit flexible  
## clmFit.sat  frating ~ 1    ~fgrp    logit flexible  
## 
##             no.par AIC logLik LR.stat df Pr(>Chisq)
## clmFit.prop      4 362   -177                      
## clmFit.sat       6 365   -177    0.56  2       0.75

And here is a plot summarizing a second set of 1000 iterations, this one using the proportional odds assumption. The estimates appear to be unbiased:

I suspect that in many instances, Likert-type responses will look more like the second case than the first case, so that the cumulative proportional odds model could very well be useful in characterizing the association between group and response. Even if the assumption is not reasonable, the bias might not be terrible, and the estimate might still be useful as a measure of association. However, we might prefer a measure that is free of any assumptions, such as Cramer’s V. I’ll talk about that next time.

References:

Ronald L. Wasserstein & Nicole A. Lazar (2016) The ASA Statement on p-Values: Context, Process, and Purpose, The American Statistician, 70:2, 129-133.

Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019) Moving to a World Beyond “p < 0.05”, The American Statistician, 73:sup1, 1-19.

Addendum: code for replicated analysis

library(parallel)
RNGkind("L'Ecuyer-CMRG")  # to set seed for parallel process

dat.nonprop <- function(iter, n) {
  
  dx <- genData(n, def)
  dx <- genFactor(dx, "grp", replace = TRUE)
  dx <- addCondition(defc, dx, "rating")
  dx <- genFactor(dx, "rating", replace = TRUE,
            labels = c("none", "a little", "a lot")
  )

  clmFit <- clm(frating ~ fgrp, data = dx)
  
  dprob.obs <- data.table(iter, 
      prop.table(dx[, table(fgrp, frating)], margin = 1))
  
  setkey(dprob.obs, fgrp, frating)
  setnames(dprob.obs, "N", "p.obs")
  
  dprob.mod <- data.table(iter, fgrp = levels(dx$fgrp),
      predict(clmFit, newdata = data.frame(fgrp = levels(dx$fgrp)))$fit)
  
  dprob.mod <- melt(dprob.mod, id.vars = c("iter", "fgrp"), 
                    variable.name = "frating", value.name = "N")
  
  setkey(dprob.mod, fgrp, frating)
  setnames(dprob.mod, "N", "p.fit")
  
  dprob <- dprob.mod[dprob.obs]
  dprob[, frating := factor(frating, 
                        levels=c("none", "a little", "a lot"))]
  
  dprob[]
  
}

def <- defData(varname = "grp", formula = ".3;.5;.2", 
            dist = "categorical")

defc <- defCondition(condition = "fgrp == 1", 
            formula = "0.7;0.2;0.1", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2", 
            formula = "0.1;0.6;0.3", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3", 
            formula = "0.05;0.25;0.70", dist = "categorical")

res.nonp <- rbindlist(mclapply(1:1000, 
                        function(iter) dat.nonprop(iter,180)))

sum.nonp <- res.nonp[, .(mfit = mean(p.fit), sfit = sd(p.fit), 
              mobs = mean(p.obs), sobs = sd(p.obs)), 
              keyby = .(fgrp, frating)]

sum.nonp[, `:=`(lsd = mfit - sfit, usd = mfit + sfit)]

ggplot(data = sum.nonp, aes(x = frating, y = mobs)) +
  geom_line(aes(group = fgrp), color = "grey60") +
  geom_errorbar(aes(ymin = lsd, ymax = usd,  color = fgrp), 
                width = 0) +
  geom_point(aes(y = mfit, color = fgrp)) +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        axis.title.x = element_blank()) +
  facet_grid(fgrp ~ .) +
  scale_y_continuous(limits = c(0, 0.85), name = "probability") +
  scale_color_manual(values = c("#B62A3D", "#EDCB64", "#B5966D"))

Clustered randomized trials and the design effect

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 18 Feb 2020 00:00:00 +0000

I am always saying that simulation can help illuminate interesting statistical concepts or ideas. Such an exploration might provide some insight into the concept of the design effect, which underlies clustered randomized trial designs. I’ve written about cluster-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect.

When individual outcomes in a group are correlated, we learn less about the group from adding a new individual than we might think. Take an extreme example where every individual in a group is perfectly correlated with all the others: we will learn nothing new about the group by adding someone new. In fact, we might as well just look at a single member, since she is identical to all the others. The design effect is a value that in a sense quantifies how much information we lose (or, surprisingly, possibly gain) by this interdependence.

Let’s just jump right into it.

The context

Imagine a scenario where an underlying population of interest is structurally defined by a group of clusters. The classic case is students in schools or classrooms. I don’t really do any school-based education (I learned from debating my teacher-wife that is a dangerous area to tread), but this example seems so clear. (The ideas in this post were, in part, motivated by my involvement with the NIA IMPACT Collaboratory, which focuses at the opposite end of life, seeking to improve care and quality of life for people living with advanced dementia and their caregivers through research and pragmatic clinical trials.) We might be interested in measuring the effect of some intervention (it may or may not take place in school) on an educational attainment outcome of high school-aged kids in a city (I am assuming a continuous outcome here just because it is so much easier to visualize). It does not seem crazy to think that the outcomes of kids from the same school might be correlated, either because the school itself does such a good (or poor) job of teaching or similar types of kids tend to go to the same school.

The unit of randomization

We have at least three ways to design our study. We could just recruit kids out and about in the city and randomize each of them individually to intervention or control. In the second approach, we decide that it is easier to randomize the schools to intervention or control - and recruit kids from each of the schools. This means that all kids from one school will be in the same intervention arm. And for the third option, we can go halfway: we go to each school and recruit kids, randomizing half of the kids in each school to control, and the other half to the intervention. This last option assumes that we could ensure that the kids in the school exposed to the intervention would not influence their unexposed friends.

In all three cases the underlying assumptions are the same - there is a school effect on the outcome, an individual effect, and an intervention effect. But it turns out that the variability of the intervention effect depends entirely on how we randomize. And since variability of the outcome affects sample size, each approach has implications for sample size. (I’ll point you to a book by Donner & Klar, which gives a comprehensive and comprehensible overview of cluster randomized trials.)

Simulation of each design

Just to be clear about these different randomization designs, I’ll simulate 1500 students using each. I’ve set a seed in case you’d like to recreate the results shown here (and indicate the libraries I am using).

library(simstudy)
library(data.table)
library(ggplot2)
library(clusterPower)
library(parallel)
library(lmerTest)

RNGkind("L'Ecuyer-CMRG") # enables seed for parallel process
set.seed(987)

Randomization by student

I’ve written a function for each of the three designs to generate the data, because later I am going to need to generate multiple iterations of each design. In the first case, randomization is applied to the full group of students:

independentData <- function(N, d1) {
  
  di <- genData(N)
  di <- trtAssign(di, grpName = "rx")
  di <- addColumns(d1, di)
  di[]
  
}

The outcome is a function of intervention status and a combined effect of the student’s school and the student herself. We cannot disentangle the variance components, because we do not know the identity of the school:

defI1 <- defDataAdd(varname = "y", formula = "0.8 * rx", 
                variance = 10, dist = "normal")

dx <- independentData(N = 30 * 50, defI1)

The observed effect size and variance should be close to the specified parameters of 0.8 and 10, respectively:

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]

## [1] 0.597

dx[, var(y)]

## [1] 10.2

Here is a plot of the individual observations that highlights the group differences and individual variation:

Randomization by site

Next, the intervention status is assigned to each of the $k$ schools/clusters before generating $m$ students per cluster. In this case, the outcome (defined by defI2) is a function of the cluster effect, individual effect, and the intervention status. Note here, the variance components are disentangled, but together they sum to 10, suggesting that total variance should be the same as the first scenario:

clusteredData <- function(k, m, d1, d2) {
  
  dc <- genData(k, d1)
  dc <- trtAssign(dc, grpName = "rx")
  
  di <- genCluster(dc, "site", m, level1ID = "id")
  di <- addColumns(d2, di)
  di[]
  
}

defC <- defData(varname = "ceff", formula = 0, 
                variance = 0.5, id = "site", dist = "normal")

defI2 <- defDataAdd(varname = "y", formula = "ceff + 0.8 * rx", 
                   variance = 9.5, dist = "normal")

dx <- clusteredData(k = 30, m = 50, defC, defI2)

The effect size and variation across all observations should be quite similar to the previous design, though now the data has a structure that is determined by the clusters:

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]

## [1] 0.203

dx[, var(y)]

## [1] 10.5

Randomization within site

In the last design, the treatment assignment is made after both the clusters and individuals have been generated. Randomization of individuals within each site is specified using the strata argument:

withinData <- function(k, m, d1, d2) {
  
  dc <- genData(k, d1)
  
  di <- genCluster(dc, "site", m, "id")
  di <- trtAssign(di, strata="site", grpName = "rx")
  di <- addColumns(d2, di)
  
  di[]

}

dx <- withinData(30, 50, defC, defI2)

dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]

## [1] 0.813

dx[, var(y)]

## [1] 10.1

The design effect

There’s a really nice paper by Vierron & Giraudeau that describes many of the issues I am only touching on here. In particular, they define the design effect and then relate this definition to formulas that are frequently used to simplify the estimation of the design effect.

Consider the statistics $\sigma^2_{\Delta_{bc}}$ and $\sigma^2_{\Delta_{i}}$, which are the variance of the effect sizes under the cluster randomization and the individual randomization designs, respectively:

\[\sigma^2_{\Delta_{bc}} = Var(\bar{Y}_1^{bc} - \bar{Y}_0^{bc})\]

and

\[\sigma^2_{\Delta_{i}} =Var(\bar{Y}_1^{i} - \bar{Y}_0^{i})\]

These variances are never observed, since they are based on a very large (really, an infinite) number of repeated experiments. However, the theoretical variances can be derived (as they are in the paper), and can be simulated (as they will be here). The design effect $\delta_{bc}$ is defined as

\[\delta_{bc} = \frac{\sigma^2_{\Delta_{bc}}}{\sigma^2_{\Delta_{i}}}\]

This ratio represents the adjustment in sample size required to make the two designs equivalent in the sense that they provide the same amount of information. This will hopefully become clear with the simulations below.

I have decided to use $k = 50$ clusters to ensure a large enough sample size to estimate the proper variance. I need to know how many individuals per cluster are required for 80% power in the cluster randomized design, given the effect size and variance assumptions I’ve been using here. I’ll use the clusterPower package (which unfortunately defines the number of clusters in each arm as $m$, so don’t let that confuse you). Based on this, we should have 18 students per school, for a total sample of 900 students:

crtpwr.2mean(m = 50/2, d = 0.8, icc = 0.05, varw = 9.5)

##    n 
## 17.9

Now, I am ready to generate effect sizes for each of 2000 iterations of the experiment assuming randomization by cluster. With this collection of effect sizes in hand, I will be able to estimate their variance:

genDifFromClust <- function(k, m, d1, d2) {
  dx <- clusteredData(k, m, d1, d2)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)] 
}

niters <- 2000  # number of simulated experiments

resC <- unlist(mclapply(1:niters, 
   function(x) genDifFromClust(k= 50, m=18, defC, defI2)))

Here is an estimate of $\sigma^2_{\Delta_{bc}}$ based on the repeated experiments:

(s2.D_bc <- var(resC))

## [1] 0.0818

And here is the estimate of $\sigma^2_{\Delta_{i}}$ (the variance of the effect sizes based on individual-level randomization experiments):

genDifFromInd <- function(N, d1) {
  dx <- independentData(N, d1)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)] 
}

resI <- unlist(mclapply(1:niters, 
          function(x) genDifFromInd(N = 50*18, defI1)))

(s2.D_i <- var(resI))

## [1] 0.0432

So, now we can use these variance estimates to derive the estimate of the design effect $\delta_{bc}$, which, based on the earlier definition, is:

(d_bc <- s2.D_bc / s2.D_i)

## [1] 1.89

The Vierron & Giraudeau paper derives a simple formula for the design effect assuming equal cluster sizes and an ICC $\rho$. This (or some close variation, when cluster sizes are not equal) is quite commonly used:

\[\delta_{bc} = 1 + (m-1)*\rho\]

As the ICC increases, the design effect increases. Based on the parameters for $m$ and $\rho$ we have been using in these simulations (note that $\rho = 0.5/(0.5+9.5) = 0.05$), the standard formula gives us this estimate of $\delta_{bc.formula}$ that is quite close to our experimental value:

( d_bc_form <- 1 + (18-1) * (0.05) )

## [1] 1.85

But what is the design effect?

OK, finally, we can now see what the design effect actually represents. As before, we will generate repeated data sets; this time, we will estimate the treatment effect using an appropriate model. (In the case of the cluster randomization, this is a linear mixed effects model, and in the case of individual randomization, this is a linear regression model.) For each iteration, I am saving the p-value for the treatment effect parameter in the model. We expect close to 80% of the p-values to be lower than 0.05 (this is 80% power given a true treatment effect of 0.8).

First, here is the cluster randomized experiment and the estimate of power:

genEstFromClust <- function(k, m, d1, d2) {
  dx <- clusteredData(k, m, d1, d2)
  summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}

resCest <- unlist(mclapply(1:niters, 
          function(x) genEstFromClust(k=50, m = 18, defC, defI2)))

mean(resCest < 0.05) # power

## [1] 0.778

In just under 80% of the cases (77.8%), we would have rejected the null.

And here is the estimated power under the individual randomization experiment, but with a twist. Since the design effect is 1.85, the cluster randomized experiment needs a relative sample size 1.85 times higher than an equivalent (individual-level) RCT to provide the same information, or to have equivalent power. So, in our simulations, we will use a reduced sample size for the individual RCT. Since we used 900 individuals in the CRT, we need only $900/1.85 \approx 487$ individuals in the RCT:

( N.adj <- ceiling( 50 * 18 / d_bc_form ) )

## [1] 487

genEstFromInd <- function(N, d1) {
  dx <- independentData(N, d1)
  summary(lm(y ~ rx, data = dx))$coef["rx", 4]
}

resIest <- unlist(mclapply(1:niters, 
          function(x) genEstFromInd(N = N.adj, defI1)))

The power for this second experiment is also quite close to 80%:

mean(resIest < 0.05) # power

## [1] 0.794
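
As a rough cross-check on the two simulated power estimates, here is a sketch that uses a normal approximation for the test of the difference in means. The function `approx_power` is a hypothetical helper, not something from the packages used above, and the approximation ignores the t-based inference used in the model fits, so it should only be expected to land near the simulated values.

```r
# Normal-approximation power for a two-sided test at level alpha
approx_power <- function(delta, se, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(delta / se - z_crit) + pnorm(-delta / se - z_crit)
}

# Standard errors of the difference in means under each design
se_crt <- sqrt(4 * (0.5 + 9.5 / 18) / 50)   # 50 clusters of 18 students
se_rct <- sqrt(4 * 10 / 487)                # 487 individually randomized students

round(c(crt = approx_power(0.8, se_crt), rct = approx_power(0.8, se_rct)), 3)
```

Both approximations come out at roughly 80%, which is consistent with the idea that 900 students in the CRT and 487 students in the RCT carry about the same information.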

Within cluster randomization

It is interesting to see what happens when we randomize within the cluster. I think there may be some confusion here, because I have seen folks incorrectly apply the standard formula for $\delta_{bc}$, rather than this formula for $\delta_{wc}$ that is derived (again, under the assumption of equal cluster sizes) in the Vierron & Giraudeau paper as

\[ \delta_{wc} = 1- \rho\]

This implies that the sample size requirement actually declines as intra-cluster correlation increases! In this case, since $\rho = 0.05$, the total sample size for the within-cluster randomization needs to be only 95% of the sample size for the individual RCT.

As before, let’s see if the simulated data confirms this design effect based on the definition

\[ \delta_{wc} = \frac{\sigma^2_{\Delta_{wc}}}{\sigma^2_{\Delta_{i}}}\]

genDifFromWithin <- function(k, m, d1, d2) {
  dx <- withinData(k, m, d1, d2)
  dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)] 
}

resW <- unlist(mclapply(1:niters, 
   function(x) genDifFromWithin(k = 50, m = 18, defC, defI2)))

(s2.D_wc <- var(resW))

## [1] 0.0409

The estimated design effect is quite close to the expected design effect of 0.95:

(d_wc <- s2.D_wc / s2.D_i)

## [1] 0.947
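
The simulated variances can also be compared with their theoretical counterparts. The sketch below assumes equal cluster sizes, equal allocation to the two arms (within each cluster for the within-cluster design), and the variance components used in the simulations; the variable names are mine.

```r
# Simulation settings taken from the post
k <- 50          # number of clusters
m <- 18          # individuals per cluster
s2_c <- 0.5      # between-cluster variance
s2_e <- 9.5      # within-cluster (individual) variance
N <- k * m       # total sample size

# Individual randomization: each arm mean is based on N/2 independent outcomes
v_i <- 4 * (s2_c + s2_e) / N

# Cluster randomization: each arm mean is based on k/2 independent cluster means
v_bc <- 4 * (s2_c + s2_e / m) / k

# Within-cluster randomization: cluster effects cancel in the difference
v_wc <- 4 * s2_e / N

round(c(v_i = v_i, v_bc = v_bc, v_wc = v_wc), 4)
round(c(d_bc = v_bc / v_i, d_wc = v_wc / v_i), 2)
```

The three theoretical variances are about 0.044, 0.082, and 0.042, close to the simulated values of 0.0432, 0.0818, and 0.0409, and their ratios reproduce the design effects of 1.85 and 0.95.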

And to finish things off, we can estimate an adjusted cluster size based on the design effects. First, the cluster size $m=18$ for the cluster randomized trial is reduced by $\delta_{bc.formula}$ to derive the appropriate sample size for the RCT; that result is then adjusted by $\delta_{wc} = 0.95$ to get the appropriate cluster size for the within-cluster randomization, which is about 9 students. This study will have only 450 students, fewer than the RCT:

(m.adj <- round( (18 / d_bc_form) * 0.95, 0))

## [1] 9

genEstFromWithin <- function(k, m, d1, d2) {
  dx <- withinData(k, m, d1, d2)
  summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}

resWest <- unlist(mclapply(1:niters, 
  function(x) genEstFromWithin(k = 50, m = ceiling(m.adj), defC, defI2)))

mean(resWest < 0.05)

## [1] 0.779

References:

Donner, Allan, and Neil Klar. “Design and analysis of cluster randomization trials in health research.” New York (2010).

Vierron, Emilie, and Bruno Giraudeau. “Design effect in multicenter studies: gain or loss of power?.” BMC medical research methodology 9, no. 1 (2009): 39.

Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 04 Feb 2020 00:00:00 +0000

I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model. (If I have missed something obvious with respect to modeling options, please let me know.)

Two initiatives I am involved with, the HAS-QOL study and the IMPACT Collaboratory, are focused on improving quality of care for people living with Alzheimer’s disease and other dementias. Both are examples where the stepped-wedge study design can be an important tool to evaluate interventions in a real-world context. In an earlier post, I introduced a particular variation of the stepped-wedge design which includes an open cohort. I provided simulations of the data generating process we are assuming for the analysis presented here. Elsewhere (here and here), I described Bayesian models that can be used to analyze data with more complicated variance patterns; all of those examples were based on continuous outcomes.

Here, I am extending and combining these ideas. This post walks through the data generation process and describes a Bayesian model that addresses the challenges posed by the open cohort stepped-wedge study design.

The model

The process I use to simulate the data and then estimate the effects is based on a relatively straightforward logistic regression model with two random effects. To simplify things a bit, I intentionally make the assumption that there are no general time trends that affect the outcomes (though it would not be difficult to add them in). In the logistic model, the log-odds (or logit) of a binary outcome is a linear function of predictors and random effects:

\[ \text{logit}(P(y_{ict}=1)) = \beta_0 + \beta_1 I_{ct} + b_{ct} + b_i,\] where $\text{logit}(P(y_{ict}=1))$ is the log-odds for individual $i$ in cluster (or site) $c$ during time period $t$, and $I_{ct}$ is a treatment indicator for cluster $c$ during period $t$.

There are two random effects in this model. The first is a cluster-specific period random effect, $b_{ct}$ . For each cluster, there will actually be a vector of cluster effects $\mathbf{b_c} = (b_{c0}, b_{c1},...,b_{c,T-1})$, where $\mathbf{b_c}\sim MVN(\mathbf{0}, \sigma_{b_c}^2\mathbf{R})$, and $\mathbf{R}$ is

\[ \mathbf{R} = \left( \begin{matrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1 \end{matrix} \right ) \]
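
For readers who want to see this structure numerically, here is a small base R sketch that builds $\mathbf{R}$ and the implied covariance matrix. The helper `ar1_cor` is a hypothetical name, and the values of 12 periods, $\rho = 0.5$, and a standard deviation of 0.3 are the ones used in the simulation later in this post.

```r
# Build the AR-1 correlation matrix R for n_periods periods
ar1_cor <- function(n_periods, rho) {
  rho ^ abs(outer(seq_len(n_periods), seq_len(n_periods), "-"))
}

R <- ar1_cor(n_periods = 12, rho = 0.5)

# Covariance matrix of the cluster-period effects, sigma^2 * R
Sigma <- 0.3^2 * R

round(R[1:4, 1:4], 3)
```

Each step away from the diagonal multiplies the correlation by $\rho$, so effects for periods that are far apart within the same cluster are only weakly correlated.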

The second random effect is the individual or patient-level random intercept $b_i$, where $b_i \sim N(0,\sigma_{b_i}^2)$. We could assume a more structured relationship for individual patients over time (such as a decaying correlation), but in this application, patients will not have sufficient measurements to properly estimate this.

In the model, $\beta_0$ has the interpretation of the log-odds for the outcome when the cluster is still in the control state and the cluster-period and individual effects are both 0. $\beta_1$ is the average treatment effect conditional on the random effects, and is reported as a log odds ratio.

Simulating the study data

I am going to generate a single data set based on this model. If you want more explanation of the code, this earlier post provides the details. The only real difference here is that I am generating an outcome that is a function of cluster-period effects, individual effects, and treatment status.

Site level data

There will be 24 sites followed for 12 periods ($t=0$ through $t=11$), and the stepped-wedge design includes 6 waves of 4 sites in each wave. The first wave will start at $t=4$, and a new wave will be added each period, so that the last wave starts at $t=9$.

library(simstudy)

dsite <- genData(24, id = "site")

dper <- addPeriods(dsite, nPeriods = 12, idvars = "site", 
                   perName = "period")
  
dsw <- trtStepWedge(dper, "site", nWaves = 6, lenWaves = 1, 
                    startPer = 4, perName = "period",
                    grpName = "Ict")

Correlated site-level effects

The average site-level effect is 0, the standard deviation of site averages is $\sigma_{b_c} = 0.3$, and the correlation coefficient that determines the between-period, within-site correlation is $\rho = 0.5$. The correlation structure is “AR-1”, which means the between-period correlation decays over time (see the definition of $\mathbf{R}$ above).

siteDef <- defData(varname = "eff.mu", formula = 0, 
                   dist = "nonrandom", id = "site")
siteDef <- defData(siteDef, varname = "eff.s2", formula = 0.3^2, 
                   dist = "nonrandom")

dsw <- addColumns(siteDef, dsw)

dsw <- addCorGen(dsw, nvars = 12, idvar = "site", rho = 0.5, 
                 corstr = "ar1", dist = "normal", 
                 param1 = "eff.mu", param2 = "eff.s2", 
                 cnames = "eff.st")

dsw <- dsw[, .(site, period, startTrt, Ict, eff.st)]

Patient level data

We are generating 20 patients per period for each site, so there will be a total of 5760 individuals ($20\times24\times12$). The individual-level effect has standard deviation $\sigma_{b_i} = 0.3$. Each of the patients will be followed until they die, which is a function of their health status over time, defined by the Markov process and its transition matrix defined below. (This was described in more detail in an earlier post.)

dpat <- genCluster(dper, cLevelVar = "timeID", 
                   numIndsVar = 20, level1ID = "id")

patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
                     dist = "categorical")
patDef <- defDataAdd(patDef, varname = "eff.p", 
                     formula = 0, variance = 0.3^2)

dpat <- addColumns(patDef, dpat)

P <-t(matrix(c( 0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.5, 0.1,
                0.0, 0.1, 0.6, 0.3,
                0.0, 0.0, 0.0, 1.0),
             nrow = 4))

dpat <- addMarkov(dpat, transMat = P, 
                  chainLen = 12, id = "id", 
                  pername = "seq", start0lab = "S0",
                  trimvalue = 4)

dpat[, period := period + seq - 1]
dpat <- dpat[period < 12]

Individual outcomes

In this last step, the binary outcome $y_{ict}$ is generated based on treatment status and random effects. In this case, the treatment lowers the probability of $Y=1$.

dx <- merge(dpat, dsw, by = c("site","period"))
setkey(dx, id, period)

outDef <- defDataAdd(varname = "y", 
                     formula = "-0.5 - 0.8*Ict + eff.st + eff.p",
                     dist = "binary", link = "logit")

dx <- addColumns(outDef, dx)
dx <- dx[, .(site, period, id, Ict, y)]
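
To get a feel for the size of the treatment effect in the outcome definition above, the log-odds can be converted to probabilities. This is a small sketch for a patient whose cluster-period and individual random effects are both 0; the variable names are mine.

```r
# Fixed effects from the outcome definition in the post
beta_0 <- -0.5   # log-odds under control
beta_1 <- -0.8   # treatment effect (log odds ratio)

# Probabilities when both random effects are set to 0
p_control <- plogis(beta_0)
p_treated <- plogis(beta_0 + beta_1)

round(c(control = p_control, treated = p_treated, odds_ratio = exp(beta_1)), 3)
```

For such a patient, the probability of $Y=1$ falls from about 0.38 under control to about 0.21 under treatment, which corresponds to an odds ratio of about 0.45.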

Here are the site-level averages over time. The light blue indicates periods in which a site is still in the control condition, and the dark blue shows the transition to the intervention condition. The lines, which are grouped by wave starting period, show the proportion of $Y=1$ for each period. You should be able to see the slight drop following entry into treatment.

Estimating the treatment effect and variance components

Because none of the maximum likelihood methods implemented in R or SAS could estimate this specific variance structure using a mixed effects logistic regression model, I am fitting a Bayesian model using RStan and Stan, which requires a set of model definitions.

This model specification is actually quite similar to the model I estimated earlier, except of course the outcome is binary (modeled on the logit scale) rather than continuous. Another major change is the use of a “non-centered” parameterization, which actually reduced estimation times from hours to minutes (more precisely, about 12 hours to about 30 minutes). This reparameterization requires a Cholesky decomposition of the variance-covariance matrix $\Sigma$. One additional limitation is that proper convergence of the MCMC chains seems to require a restricted prior on $\rho$, so that $\rho \sim U(0,1)$ rather than $\rho \sim U(-1,1)$.

This particular code needs to be saved externally, and I have created a file named binary sw - ar ind effect - non-central.stan. This file is subsequently referenced in the call to RStan.

data {
  int<lower=1> I;              // number of unique individuals
  int<lower=1> N;              // number of records
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of sites
  int<lower=0> T;              // number of periods
  int<lower=1,upper=I> ii[N];  // id for individual
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1,upper=T> tt[N];  // period of individual
  matrix[N, K] x;              // matrix of predictors
  int<lower=0,upper=1> y[N];   // vector of binary outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigma_S;       // site-level sd
  real<lower=0,upper=1> rho;   // correlation
  real<lower=0> sigma_I;       // individual-level sd
  
  // non-centered parameterization
  
  vector[T] z_ran_S[J];   // site level random effects (by period)
  vector[I] z_ran_I;      // individual level random effects        
}

transformed parameters {

  cov_matrix[T] Sigma;
  matrix[T, T] L;         // for non-central parameterization
  vector[I] ran_I;        // individual level random effects
  vector[T] ran_S[J];     // site level random effects (by period)
  vector[N] yloghat;

  // Random effects with AR-1 correlation

  real sigma_S2 = sigma_S^2;

  for (j in 1:T)
    for (k in 1:T)
      Sigma[j,k] = sigma_S2 * pow(rho,abs(j-k));
  
  // for non-centered parameterization
  
  L = cholesky_decompose(Sigma);

  for(j in 1:J)
    ran_S[j] = L * z_ran_S[j];
    
  ran_I = sigma_I * z_ran_I;
  
  // defining mean on log-odds scale

  for (i in 1:N)
      yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];
      
}

model {
  
  sigma_I ~ exponential(0.25);
  sigma_S ~ exponential(0.25);
  
  rho ~ uniform(0, 1);
  
  for(j in 1:J) {
    z_ran_S[j] ~ std_normal();
  }

  z_ran_I ~ std_normal();
  
  y ~ bernoulli_logit(yloghat);

}
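
The non-centered parameterization in the Stan model can be illustrated outside of Stan. The base R sketch below is not part of the model fitting; it only shows, with illustrative values for `sigma_S` and `rho`, that multiplying standard normal draws by the Cholesky factor of $\Sigma$ yields effects with the intended AR-1 covariance.

```r
set.seed(2020)

n_periods <- 12
sigma_S <- 0.3
rho <- 0.5

# AR-1 covariance matrix, as constructed in the transformed parameters block
Sigma <- sigma_S^2 * rho ^ abs(outer(1:n_periods, 1:n_periods, "-"))

# chol() returns the upper triangle, so transpose to get lower-triangular L
L <- t(chol(Sigma))

# Non-centered draws: standard normal z, then scale and correlate with L
z <- matrix(rnorm(n_periods * 10000), nrow = n_periods)
b <- L %*% z          # each column is one draw of the 12 site-period effects

max(abs(L %*% t(L) - Sigma))     # reconstruction error, effectively zero
round(cov(t(b))[1:3, 1:3], 3)    # empirical covariance, close to Sigma[1:3, 1:3]
```

The sampler only has to explore the independent standard normal `z` values, while the dependence on `sigma_S` and `rho` enters through the deterministic transformation, which is the general reason this parameterization tends to sample more efficiently in hierarchical models.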

Set up the data and call stan from R

Just for completeness, I am providing the code that shows the interface between R and Stan using RStan. The data needs to be sent to Stan as a list of data elements, which here is called testdat. For the estimation of the posterior distributions, I am specifying 4 chains of 4000 iterations each, which includes 1000 warm-up iterations. I specified “adapt_delta = 0.90” to reduce the step-size a bit (default is 0.80); this slows things down a bit, but improves stability.

As I mentioned earlier, with this data set (and rather large number of effects to estimate), the running time is between 30 and 45 minutes. One of the downsides of this particular Bayesian approach is that it wouldn’t really be practical to do any kind of sample size estimate.

x <- as.matrix(dx[ ,.(1, Ict)])
I <- dx[, length(unique(id))]
N <- nrow(x)
K <- ncol(x)
J <- dx[, length(unique(site))]
T <- dx[, length(unique(period))]
ii <- dx[, id]
jj <- dx[, site]
tt <- dx[, period] + 1
y <- dx[, y]

testdat <- list(I=I, N=N, K=K, J=J, T=T, ii=ii, jj=jj, tt=tt, x=x, y=y)

library(rstan)
options(mc.cores = parallel::detectCores())

rt <- stanc("binary sw - ar ind effect - non-central.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)

fit.ar1 <- sampling(sm, data=testdat,
                    iter = 4000, warmup = 1000,
                    control=list(adapt_delta=0.90,
                        max_treedepth = 15),
                    chains = 4)

Diagnostics

After running the MCMC process to generate the probability distributions, the trace plots show that the mixing is quite adequate for the chains.

pars <- c("beta", "sigma_S", "sigma_I", "rho")

plot(fit.ar1, plotfun = "trace", pars = pars, 
  inc_warmup = FALSE, ncol = 1)

Extracting results

If we take a look at the posterior distributions, we can see that the 95% intervals contain the original values used to generate the data - so at least in this case, the model seems to recover the original data generation process quite well.

pars <- c("beta", "sigma_S","sigma_I","rho")
summary(fit.ar1, pars = pars, probs = c(0.025, 0.975))$summary

##           mean  se_mean     sd   2.5%  97.5% n_eff Rhat
## beta[1] -0.519 0.000687 0.0459 -0.609 -0.428  4470    1
## beta[2] -0.751 0.000844 0.0573 -0.864 -0.638  4618    1
## sigma_S  0.307 0.000394 0.0256  0.260  0.362  4223    1
## sigma_I  0.254 0.001548 0.0476  0.148  0.337   945    1
## rho      0.544 0.001594 0.0812  0.376  0.698  2599    1

One thing that is not working so well is my attempt to compare different models. For example, I might want to fit another model that does not assume between-period correlations decay and compare it to the current model. Previously, I used the bridgesampling package for the comparisons, but it does not seem to be able to accommodate these models. I will continue to explore the options for model comparison and will report back if I find something promising.

This study is supported by the National Institutes of Health National Institute on Aging under award numbers R61AG061904 and U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

A brief account (via simulation) of the ROC (and its AUC)

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 21 Jan 2020 00:00:00 +0000

The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC) quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.

A prediction problem

The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease. The population is comprised of two groups of individuals: those with the condition and those without. What we want is some sort of diagnostic tool (such as a blood test or diagnostic scan) that will identify which group a particular patient belongs to. The question is how well does that tool or measure help us distinguish between the two groups? The ROC (and AUC) is designed to help answer that question.

True and false predictions

While we might not know group membership for an individual, we assume that they do indeed belong to one of the two groups. When we predict group membership based on a test, we may or may not be right. There are four scenarios. It is possible that our prediction is (1) a true positive (the patient has the condition and that is what we predict), (2) a false positive (the patient does not have the condition, but we predict they do have it), (3) a false negative (the patient has the condition but we believe they are healthy), or (4) a true negative (the patient is healthy and that is our prediction). A “good” test is one that maximizes true positive predictions while minimizing false positive predictions.

We can actually only assess the quality of the test if we know the true group membership of the individuals. So, our plan is to take measurements on this known sample, make predictions based on the test, and see how our predictions match up to reality. The ROC is one way to characterize how well our test matches up to reality.

Binary decision informed by a continuous measure

While we make a binary decision about group membership - either we think they have the condition or they do not - the underlying measure that is used to make that determination may be continuous, like a score. For example, a hypothetical test might return a score between -10 and 10. We can pick a threshold anywhere along the continuum that will form the basis of our prediction. For example, we might say that any score > 0 indicates the condition is present, otherwise it is not. This simple test will be useful as a tool to discriminate between the disease and non-disease groups if that threshold indeed distinguishes the groups.

This is probably best demonstrated with a simple simulation. The sample we will generate has 100 individuals, around 40% of whom have the condition in question. The average score for the non-disease group is set at -5, and the average score for the disease group is 5. Both have variance 3.5:

library(simstudy)

# define data

defx <- defData(varname = "condition", formula = .4, dist = "binary")
defx <- defData(defx, "x", formula = "-5 + 10*condition", 
                variance = 3.5, dist = "normal")

# generate data

set.seed(1873)
dx <- genData(100, defx)
head(dx)

##    id condition     x
## 1:  1         0 -5.83
## 2:  2         1  4.66
## 3:  3         1  4.23
## 4:  4         0 -3.87
## 5:  5         1  1.78
## 6:  6         0 -4.87

Looking at the plot below, a threshold of zero appears to do an excellent job of distinguishing the groups. All of those with the condition (depicted in red) are above the threshold, whereas all of those without the condition (depicted in green) fall below the threshold:

The world is not always so neat and clean

Of course, we don’t usually have a measure or test that separates the groups so cleanly. Let’s say the average of the disease group is 2.5 and the non-disease group is -3. The threshold of zero still works pretty well, but it is not perfect. Some with the disease fall below the threshold (false negatives), and some without the disease lie above the threshold (false positives). In fact, only 87% of those with the disease are correctly identified (true positives), while 13% of those without the condition are incorrectly identified as having the disease (false positives).

defx <- updateDef(defx, changevar = "x", newformula="-3 + 5.5*condition",
                  newvariance = 6)
dx <- genData(100, defx)
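To make the true positive and false positive proportions concrete, here is a minimal base-R sketch of the calculation at a single threshold. The data are regenerated with `rbinom` and `rnorm` rather than simstudy, and the helper name `rates` is my own for illustration, so the proportions will differ somewhat from the 87% and 13% quoted above:

```r
# Sketch only: base-R stand-in for the simstudy data (not the same draws)
set.seed(1873)
n <- 100
condition <- rbinom(n, 1, 0.4)
x <- rnorm(n, mean = -3 + 5.5 * condition, sd = sqrt(6))

# Proportion predicted positive (x > threshold) within each true group
rates <- function(x, condition, threshold) {
  pred <- x > threshold
  c(true.pos  = mean(pred[condition == 1]),
    false.pos = mean(pred[condition == 0]))
}

rates(x, condition, threshold = 0)
```

Moving the `threshold` argument up or down trades one proportion off against the other.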

Generating the ROC

Zero isn’t the only possible threshold we could use for the diagnostic test. We can lower the threshold to below zero to ensure that we have 100% true positives, but we will have to sacrifice by increasing the proportion of false positives. Likewise, we could reduce the proportion of false positives by increasing the threshold above zero, but would reduce the proportion of true positives in the process.

There are, in fact, an infinite number of possible thresholds. Here is a sequence of plots of the same data with a number of different thresholds ranging from 8 to -8. The percent of true positives is shown on the top and the percent of false positives is shown on the bottom:

The ROC is really just a summarized version of this sequence of plots. The X-axis is the proportion of false positives at a particular threshold, and the Y-axis is the proportion of true positives. As we lower the threshold, we move from left to right. So, in the plot below, each point represents one of the sections above:

The ROC above is built from only 9 thresholds. If we consider all possible thresholds (continuous between -10 and 10), this is the more complete curve:
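One way such a curve could be constructed is to evaluate both proportions over a fine grid of thresholds. This is a sketch under my own assumptions (base-R data generation, a grid step of 0.05, and a data frame named `roc` with columns `false.pos` and `true.pos`), not necessarily the code used to draw the figure:

```r
# Sketch only: base-R stand-in for the simulated data (not the same draws)
set.seed(1873)
n <- 100
condition <- rbinom(n, 1, 0.4)
x <- rnorm(n, mean = -3 + 5.5 * condition, sd = sqrt(6))

# Thresholds from high to low, so the curve runs from lower left to upper right
thresholds <- seq(10, -10, by = -0.05)

roc <- data.frame(
  threshold = thresholds,
  false.pos = sapply(thresholds, function(th) mean(x[condition == 0] > th)),
  true.pos  = sapply(thresholds, function(th) mean(x[condition == 1] > th))
)

plot(roc$false.pos, roc$true.pos, type = "s",
     xlab = "Proportion of false positives",
     ylab = "Proportion of true positives")
abline(0, 1, lty = 2)   # diagonal: no separation between groups
```

Because each proportion can only grow as the threshold drops, both columns are non-decreasing from the first row to the last.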

Area under the ROC

The AUC is, well, the area under the ROC. The maximum AUC will be 1 when there is complete separation (there is an example of this below), and the minimum is 0.5 (depicted by the diagonal line) when there is no separation by the test measure (again, an example will follow). We can estimate this area by integrating an approximate function defined by the data between 0 and 1.

f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.957 with absolute error < 0.00011

There is actually a meaningful interpretation of the AUC, which is described in a classic 1982 paper by Hanley & McNeil (if you want a deeper understanding of the issues, this paper is not a bad place to start - there is, of course, a huge literature on the topic of ROCs). The AUC is actually equivalent to the probability that the test measure of a random draw from the diseased group will be greater than the test measure of a random draw from the healthy group. So, an AUC = 0.90 indicates that 90% of the time we draw a test measure from the disease group and non-disease group, the measure from the disease group will be greater.

Here is a simple function that returns a value of TRUE if the random draw from the disease group is greater:

randcomp <- function(ds) {
  ds[condition == 1, sample(x, 1)] > ds[condition == 0, sample(x, 1)]
}

And here is the proportion of 1000 draws where the measure drawn from the disease group is greater (this is expected to be close to the AUC, which was estimated above to be 0.957):

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 0.958
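The random draws approximate a quantity that can also be computed exactly by comparing every disease/non-disease pair, which is the Wilcoxon-Mann-Whitney statistic rescaled by the number of pairs. Here is a minimal sketch using base-R stand-in data (so the value will not match 0.958 exactly):

```r
# Sketch only: base-R stand-in for the simulated data (not the same draws)
set.seed(1873)
n <- 100
condition <- rbinom(n, 1, 0.4)
x <- rnorm(n, mean = -3 + 5.5 * condition, sd = sqrt(6))

x1 <- x[condition == 1]   # disease group
x0 <- x[condition == 0]   # non-disease group

# Proportion of all (disease, non-disease) pairs where the disease score is larger
auc_pairs <- mean(outer(x1, x0, ">"))

# Same quantity from the Wilcoxon-Mann-Whitney statistic
W <- unname(wilcox.test(x1, x0)$statistic)
auc_wmw <- W / (length(x1) * length(x0))

c(pairs = auc_pairs, wmw = auc_wmw)
```

The two numbers agree because, with no ties, `W` counts exactly the pairs in which the first group's value is larger.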

Of course, R has several packages that provide ROCs and calculate AUCs. I’m using package pROC here just to show you that my AUC estimate is not totally crazy:

library(pROC)
roc_obj <- roc(response = dx$condition, predictor = dx$x)
auc(roc_obj)

## Area under the curve: 0.958

Alternative scenarios

As I indicated above, the AUC can generally range from 0.5 to 1.0. There is no hard and fast rule about what is a “good” AUC - it will depend on the application. Certainly, anything below 0.7 or maybe even 0.8 is pretty weak. I am going to conclude by generating data at the two extremes.

Minimal separation

When the test measure for each group is identically distributed, there is unlikely to be any threshold for which the proportion of true positives exceeds the proportion of false positives. If this is the case, we should probably look for another test measure - or be prepared to make a lot of mistakes in the non-disease group.

defx <- updateDef(defx, changevar = "x", newformula="0+0*condition",
                  newvariance = 8)
dx <- genData(100, defx)

As we move the threshold lower, both the proportion of true positives and false positives steadily increase:

As a result, the ROC hangs fairly close to the diagonal lower bound.

We would expect the AUC to be fairly close to 0.5, which it is:

f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.623 with absolute error < 4.5e-05

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 0.613

Complete separation

At the other extreme, the mean of the disease group is high enough so that there is no overlap between the two groups. In this case, the curve follows the Y-axis up before running parallel to the X-axis along the top of the plot. We can achieve 100% true positives and no false positives if the threshold is set at some point that is below the minimum of the disease group and above the maximum of the non-disease group. Zero will be the ideal cut-off point for this example.

defx <- updateDef(defx, changevar = "x", newformula="-4+8*condition",
                  newvariance = 3.5)
dx <- genData(100, defx)

As expected, the AUC is essentially equal to 1:

f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)

## 0.996 with absolute error < 9.2e-05

mean(sapply(1:1000, function(x) randcomp(dx)))

## [1] 1

Logistic regression and the ROC

Just a quick note to conclude. The ROC is often used in conjunction with classification problems based on logistic regression modeling. In this case, we may not have a single underlying test measure, but rather we may have multiple predictors or measures. The group assignment decision then needs to be based on a summary of these multiple measures; one logical candidate is the individual’s predicted probability estimated by the model.

If the specified logistic regression model provides good separation between the two groups, the predicted probabilities will be quite different for each group (higher AUC). However, if the model is not a strong classifier, the predicted probabilities for the two groups will be much closer together (lower AUC).
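As a minimal sketch of that idea, the predicted probability from a fitted logistic model can stand in for the single test measure. Everything here is hypothetical (the predictors `z1` and `z2`, their coefficients, and the sample size are assumptions chosen for illustration):

```r
# Sketch only: hypothetical data with two predictors, z1 and z2
set.seed(2020)
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
condition <- rbinom(n, 1, plogis(-0.5 + 1.2 * z1 + 0.8 * z2))

# The predicted probability plays the role of the single test measure
fit <- glm(condition ~ z1 + z2, family = binomial)
phat <- fitted(fit)

# AUC as the proportion of (disease, non-disease) pairs ordered correctly
auc <- mean(outer(phat[condition == 1], phat[condition == 0], ">"))
auc
```

Shrinking the coefficients toward zero in the data generation step pulls the two sets of predicted probabilities together and the AUC toward 0.5.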

References:

Hanley, J.A. and McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), pp.29-36.

Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77.

Repeated measures can improve estimation when we only care about a single endpoint

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 10 Dec 2019 00:00:00 +0000

I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.

There is one issue that should be fairly generalizable to other studies. It is likely that individual measures will be collected repeatedly over time but the primary outcome of interest will be the measure collected during the last follow-up period. I wanted to explore what, if anything, can be gained by analyzing all of the available data rather than focusing only on the final end point.

Data generation

In this simulation scenario, there will be 200 subjects randomized at the individual level to one of two treatment arms, intervention ($rx = 1$) and control ($rx = 0$). Each person will be followed for 5 months, with a binary outcome measure collected at the end of each month. In the data, period 0 is the first month, and period 4 is the final month.

library(simstudy)

set.seed(281726)

dx <- genData(200)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = 5)

Here are the data for a single individual:

dx[id == 142]

##     id period rx timeID
## 1: 142      0  1    706
## 2: 142      1  1    707
## 3: 142      2  1    708
## 4: 142      3  1    709
## 5: 142      4  1    710

The probabilities of the five binary outcomes for each individual are a function of time and intervention status.

defP <- defDataAdd(varname = "p", 
                   formula = "-2 + 0.2*period + 0.5*rx", 
                   dist = "nonrandom", link = "logit")
  
dx <- addColumns(defP, dx)

The outcomes for a particular individual are correlated, with outcomes in two adjacent periods more highly correlated than outcomes collected further apart. (I use an auto-regressive correlation structure to generate these data.)
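For reference, an AR-1 structure sets the correlation between periods $s$ and $t$ to $\rho^{|s - t|}$. A minimal base-R sketch of the implied matrix for five periods, using the value $\rho = 0.6$ that appears in the call below (the row and column labels are my own):

```r
rho <- 0.6
periods <- 0:4

# AR-1: correlation between periods s and t is rho^|s - t|
R <- rho^abs(outer(periods, periods, "-"))
dimnames(R) <- list(paste0("period", periods), paste0("period", periods))

round(R, 3)
```

Adjacent periods have correlation 0.6, while the first and last periods have correlation $0.6^4 \approx 0.13$.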

dx <- addCorGen(dtOld = dx, idvar = "id", nvars = 5, rho = 0.6, 
                corstr = "ar1", dist = "binary", param1 = "p", 
                method = "ep", formSpec = "-2 + 0.2*period + 0.5*rx",
                cnames = "y")

dx[id == 142]

##     id period rx timeID    p y
## 1: 142      0  1    706 0.18 0
## 2: 142      1  1    707 0.21 0
## 3: 142      2  1    708 0.25 1
## 4: 142      3  1    709 0.29 0
## 5: 142      4  1    710 0.33 0
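The `p` column can be checked by hand, since it is just the inverse logit of the linear predictor. A small sketch for an individual in the intervention arm:

```r
period <- 0:4
rx <- 1   # individual 142 is in the intervention arm

# Inverse logit of the linear predictor used in the definition of p
p <- plogis(-2 + 0.2 * period + 0.5 * rx)
round(p, 2)

## [1] 0.18 0.21 0.25 0.29 0.33
```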

In the real world, there will be loss to follow up - not everyone will be observed until the end. In the first case, I will be assuming the data are missing completely at random (MCAR), where missingness is independent of all observed and unobserved variables. (I have mused on missingness before.)

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE
)

dm <- genMiss(dx, MCAR, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")

dObs[id == 142]

##     id period rx timeID    p  y
## 1: 142      0  1    706 0.18  0
## 2: 142      1  1    707 0.21  0
## 3: 142      2  1    708 0.25  1
## 4: 142      3  1    709 0.29 NA
## 5: 142      4  1    710 0.33 NA

In this data set only about 70% of the total sample is observed in the final period - though by chance there is different dropout for each of the treatment arms:

dObs[period == 4, .(prop.missing = mean(is.na(y))), keyby = rx]

##    rx prop.missing
## 1:  0         0.28
## 2:  1         0.38
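The figure of roughly 70% is what the MCAR definition would lead us to expect. As a back-of-the-envelope sketch, assuming dropout can begin independently in any of the five periods and is permanent once it starts:

```r
# Per-period dropout probability implied by the logit-scale intercept of -2.6
p_drop <- plogis(-2.6)

# Assumption: dropout can start in any of the 5 periods, independently,
# and is permanent once it starts (monotonic missingness)
p_observed_final <- (1 - p_drop)^5

round(c(p_drop = p_drop, p_observed_final = p_observed_final), 3)
```

This gives a per-period dropout probability of about 0.07 and a probability of about 0.70 of still being observed in the final period.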

Estimating the intervention effect

If we are really only interested in the probability of a successful outcome in the final period, we could go ahead and estimate the treatment effect using a simple logistic regression using individuals who were available at the end of the study. The true value is 0.5 (on the logistic scale), and the estimate here is close to 1.0 with a standard error just under 0.4:

fit.l <- glm(y ~ rx, data = dObs[period == 4], family = binomial)
coef(summary(fit.l))

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -1.25       0.28    -4.4  9.9e-06
## rx              0.99       0.38     2.6  9.3e-03

But, can we do better? Fitting a longitudinal model might provide a more stable and possibly less biased estimate, particularly if the specified model is the correct one. In this case, I suspect it will be an improvement, since the data was generated using a process that is amenable to a GEE (generalized estimating equation) model.

library(geepack)
  
fit.m <- geeglm(y ~ period + rx, id = id, family = binomial, 
         data = dObs, corstr = "ar1")

coef(summary(fit.m))

##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.33   0.259   81  0.00000
## period          0.30   0.072   17  0.00003
## rx              0.83   0.263   10  0.00152

And finally, it is reasonable to expect that a model that is based on a data set without any missing values would provide the most efficient estimate. And that does seem to be the case if we look at the standard error of the effect estimate.

fit.f <- geeglm(y ~ period + rx, id = id, family = binomial, 
         data = dx, corstr = "ar1")

coef(summary(fit.f))

##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.15   0.227 89.2  0.0e+00
## period          0.30   0.062 23.1  1.5e-06
## rx              0.54   0.233  5.4  2.1e-02

Of course, we can’t really learn much of anything from a single simulated data set. Below is a plot of the mean estimate under each modeling scenario (along with the blue line that represents $\pm 2$ sd) based on 2500 simulated data sets with missingness completely at random. (The code for these replications is included in the addendum.)

It is readily apparent that under an assumption of MCAR, all estimation models yield unbiased estimates (the true effect size is 0.5), though using the last period only is inherently more variable (given that there are fewer observations to work with).

Missing at random

When the data are MAR (missing at random), using the last period only no longer provides an unbiased estimate of the effect size. In this case, the probability of missingness is a function of time, intervention status, and the outcome from the prior period, all of which are observed. This is how I’ve defined the MAR process:

MAR <- defMiss(varname = "y", 
               formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE
)

The mean plots based on 2500 iterations reveal the bias of the last period only. It is interesting to see that the GEE model is not biased, because we have captured all of the relevant covariates in the model. (It is well known that a likelihood method can yield unbiased estimates in the case of MAR, and while GEE is not technically a likelihood, it is a quasi-likelihood method.)

Missing not at random

When missingness depends on unobserved data, such as the outcome itself, then GEE estimates are also biased. For the last set of simulations, I defined missingness of $y$ in any particular time period to be a function of itself. Specifically, if the outcome was successful and the subject was in the intervention, the subject would be more likely to be observed:

NMAR <- defMiss(varname = "y", 
                formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE
)

Under the assumption of missingness not at random (NMAR), both estimation approaches based on the observed data set with missing values yield a biased estimate, though using all of the data appears to reduce the bias somewhat:

Addendum: generating replications

iter <- function(n, np, defM) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addPeriods(dx, nPeriods = np)
  
  defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + .5*rx", 
                    dist = "nonrandom", link = "logit")
  
  dx <- addColumns(defP, dx)
  dx <- addCorGen(dtOld = dx, idvar = "id", nvars = np, rho = .6, 
                  corstr = "ar1", dist = "binary", param1 = "p", 
                  method = "ep", formSpec = "-2 + 0.2*period + .5*rx",
                  cnames = "y")
  
  dm <- genMiss(dx, defM, "id", repeated = TRUE, periodvar = "period")
  dObs <- genObs(dx, dm, idvars = "id")
  
  fit.f <- geeglm(y ~ period + rx, id = id, family = binomial, 
         data = dx, corstr = "ar1")
  
  fit.m <- geeglm(y ~ period + rx, id = id, family = binomial, 
         data = dObs, corstr = "ar1")
  
  fit.l <- glm(y ~ rx, data = dObs[period == (np - 1)], family = binomial)
  
  return(data.table(full = coef(fit.f)["rx"], 
                    miss = coef(fit.m)["rx"],
                    last = coef(fit.l)["rx"])
         )
}

## defM

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE
)

MAR <- defMiss(varname = "y", 
               formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE
)

NMAR <- defMiss(varname = "y", 
                formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE
)

##

library(parallel)
library(data.table)  # for data.table() and rbindlist()

niter <- 2500

resMCAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MCAR)))
resMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MAR)))
resNMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, NMAR)))

Adding a "mixture" distribution to the simstudy package

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 26 Nov 2019 00:00:00 +0000

I am contemplating adding a new distribution option to the package simstudy that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new mixture option is to step through a few examples and see it in action.

Specifying the “mixture” distribution

As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities. For example, if we have two variables, $x_1$ and $x_2$, we have a mixture if, for any particular observation, we take $x_1$ with probability $p_1$ and $x_2$ with probability $p_2$, where $\sum_i{p_i} = 1$, $i \in \{1, 2\}$. So, if we have already defined $x_1$ and $x_2$ using the defData function, we can create a third variable $x_{mix}$ with this definition:

def <- defData(def, varname = "xMix", 
               formula = "x1 | 0.4 + x2 | 0.6", 
               dist = "mixture")

In this example, we will draw $x_1$ with probability 0.4 and $x_2$ with probability 0.6. We are, however, not limited to mixing only two variables; to make that clear, I’ll start off with an example that shows a mixture of three normally distributed variables.

Mixture of 3 normal distributions

In this case, we have $x_1 \sim N(1,1)$, $x_2 \sim N(5,4)$, and $x_3 \sim N(9,1)$. The mixture will draw from $x_1$ 30% of the time, from $x_2$ 40%, and from $x_3$ 30%:

def <- defData(varname = "x1", formula = 1, variance = 1)
def <- defData(def, varname = "x2", formula = 5, variance = 4)
def <- defData(def, varname = "x3", formula = 9, variance = 1)
def <- defData(def, varname = "xMix", 
               formula = "x1 | .3 + x2 | .4 + x3 | .3", 
               dist = "mixture")

The data generation now proceeds as usual in simstudy:

set.seed(2716)
dx <- genData(1000, def)
dx

##         id     x1   x2    x3   xMix
##    1:    1  1.640 4.12  7.13  4.125
##    2:    2 -0.633 6.89  9.07 -0.633
##    3:    3  1.152 2.95  8.71  1.152
##    4:    4  1.519 5.53  8.82  5.530
##    5:    5  0.206 5.55  9.31  5.547
##   ---                              
##  996:  996  2.658 1.87  8.09  1.870
##  997:  997  2.604 4.44  9.09  2.604
##  998:  998  0.457 5.56 10.87 10.867
##  999:  999 -0.400 4.29  9.03 -0.400
## 1000: 1000  2.838 4.78  9.17  9.174

Here are two plots. The top shows the densities for the original distributions separately, and the bottom plot shows the mixture distribution (which is the distribution of xMix):

And it is easy to show that the mixture proportions are indeed based on the probabilities that were defined:

dx[, .(p1=mean(xMix == x1), p2=mean(xMix == x2), p3=mean(xMix == x3))]

##       p1    p2    p3
## 1: 0.298 0.405 0.297
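Conceptually, a mixture draw like this amounts to generating every component and then picking one per observation. Here is a base-R sketch of that idea; it illustrates the logic only and is not simstudy's internal implementation, and the name `pick` is my own:

```r
set.seed(2716)
n <- 1000

# Component draws, matching the three normal definitions above
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 5, sd = 2)   # variance 4
x3 <- rnorm(n, mean = 9, sd = 1)

# For each observation, choose which component to keep
pick <- sample(1:3, size = n, replace = TRUE, prob = c(0.3, 0.4, 0.3))

# Matrix indexing: row i, column pick[i]
xMix <- cbind(x1, x2, x3)[cbind(1:n, pick)]

prop.table(table(pick))
```

The observed proportions in `pick` should land close to 0.3, 0.4, and 0.3.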

Zero-inflated

One classic mixture model is the zero-inflated Poisson model. We can easily generate data from this model using a mixture distribution. In this case, the outcome is $0$ with probability $p$ and is a draw from a Poisson distribution with mean (and variance) $\lambda$ with probability $1-p$. As a result, there will be an over-representation of 0’s in the observed data set. In this example $p = 0.2$ and $\lambda = 2$:

def <- defData(varname = "x0", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "xPois", formula = 2, dist = "poisson")
def <- defData(def, varname = "xMix", formula = "x0 | .2 + xPois | .8", 
               dist = "mixture")

set.seed(2716)
dx <- genData(1000, def)

The figure below shows a histogram of the Poisson distributed $x_{pois}$ on top and a histogram of the mixture on the bottom. It is readily apparent that the mixture distribution has “too many” zeros relative to the Poisson distribution:

I am fitting a model below (using the pscl package) to see if it is possible to recover the assumptions I used in the data generation process. With 1000 observations, of course, it is easy:

library(pscl)
zfit <- zeroinfl(xMix ~ 1 | 1, data = dx)

summary(zfit)

## 
## Call:
## zeroinfl(formula = xMix ~ 1 | 1, data = dx)
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -1.035 -1.035 -0.370  0.296  4.291 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6959     0.0306    22.8   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.239      0.107   -11.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 9 
## Log-likelihood: -1.66e+03 on 2 Df

The estimated value of $\lambda$ from the model is the exponentiated value of the coefficient from the Poisson model: $e^{0.6959}$. The estimate is quite close to the true value $\lambda = 2$:

exp(coef(zfit)[1])

## count_(Intercept) 
##              2.01

And the estimated probability of drawing a zero (i.e. $\hat{p}$) is based on a simple transformation of the coefficient of the binomial model ($-1.239$), which is on the logit scale. Again, the estimate is quite close to the true value $p = 0.2$:

1/(1 + exp(-coef(zfit)[2]))

## zero_(Intercept) 
##            0.225
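The two estimates can be combined to see how many zeros the fitted model implies overall. In a zero-inflated Poisson, zeros arise either structurally (with probability $p$) or from the Poisson component (with probability $(1-p)e^{-\lambda}$). A small sketch using the coefficients printed above (the helper name `p_zero` is my own):

```r
# Coefficients as printed in the model summary above
lambda_hat <- exp(0.6959)
p_hat <- plogis(-1.239)

# P(Y = 0) in a zero-inflated Poisson: structural zeros plus Poisson zeros
p_zero <- function(p, lambda) p + (1 - p) * dpois(0, lambda)

round(c(true = p_zero(0.2, 2), estimated = p_zero(p_hat, lambda_hat)), 3)
```

Under the true parameters about 31% of observations are expected to be zero, compared with about 14% for a plain Poisson with $\lambda = 2$.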

Outlier in linear regression

In this final example, I use the mixture option to generate outliers in the context of a regression model. This is done first by generating outcomes $y$ as a function of a predictor $x$. Next, alternative outcomes $y_{outlier}$ are generated independent of $x$. The observed outcomes $y_{obs}$ are a mixture of the outliers $y_{outlier}$ and the predicted $y$’s. In this simulation, 2.5% of the observations will be drawn from the outliers:

def <- defData(varname = "x", formula = 0, variance = 9, 
               dist = "normal")
def <- defData(def, varname = "y", formula = "3+2*x", variance = 7, 
               dist = "normal")
def <- defData(def, varname = "yOutlier", formula = 12, variance = 6, 
               dist = "normal")
def <- defData(def, varname = "yObs", 
               formula = "y | .975 + yOutlier | .025", 
               dist = "mixture")

set.seed(2716)
dx <- genData(100, def)

This scatter plot shows the relationship between $y_{obs}$ and $x$; the red dots represent the observations drawn from the outlier distribution:

Once again, it is illustrative to fit a few models to estimate the linear relationship between $y$ and $x$. The model that includes the true value of $y$ (as opposed to the outliers) unsurprisingly recovers the true relationship. The model that includes the observed outcomes (the mixture distribution) underestimates the relationship. And a robust regression model (using the rlm function in the MASS package) provides a less biased estimate:

lm1 <- lm( y ~ x, data = dx)
lm2 <- lm( yObs ~ x, data = dx)

library(MASS)
rr <- rlm(yObs ~ x , data = dx)

library(stargazer)

stargazer(lm1, lm2, rr, type = "text",
          omit.stat = "all", omit.table.layout = "-asn",
          report = "vcs")

## 
## ================================
##            Dependent variable:  
##          -----------------------
##             y         yObs      
##            OLS     OLS   robust 
##                          linear 
##            (1)     (2)     (3)  
## x         2.210   2.030   2.150 
##          (0.093) (0.136) (0.111)
##                                 
## Constant  2.780   3.310   2.950 
##          (0.285) (0.417) (0.341)
##                                 
## ================================

The scatter plot below includes the fitted lines from the estimated models: the blue line is the true regression model, the red line is the biased estimate based on the data that includes outliers, and the black line is the robust regression line that is much closer to the truth:

The mixture option is still experimental, though it is available on github. One enhancement I hope to make is to allow the mixture probability to be a function of covariates. The next release on CRAN will certainly include some form of this new distribution option.

What can we really expect to learn from a pilot study?

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 12 Nov 2019 00:00:00 +0000

I am involved with a very interesting project - the NIA IMPACT Collaboratory - where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.

There is a rich literature on this topic. For example, these papers by Lancaster et al and Leon et al provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the feasibility of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?

Another thing the authors agree on is that the pilot study is not generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)

My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.

A small, two-arm pilot study

In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect $\Delta = 50$. The outcome is a continuous measure with a within-arm standard deviation $\sigma = 100$. In some fields of research, the effect size would be standardized as $d = \Delta / \sigma$. (This is also known as Cohen’s $d$.) So, in this case the true standardized effect size $d=0.5$.

If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate $\alpha = 0.05$. Using the pwr.t.test function in the pwr library, we specify the treatment effect (as $d$), significance level $\alpha$, and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):

library(pwr) 
pwr.t.test(n = NULL, d =  50/100, sig.level = 0.05, 
    power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 64
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
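
As a rough check on where the 64 comes from, the familiar normal-approximation formula for a two-arm comparison can be worked by hand. This is a textbook approximation (not what pwr.t.test computes internally), with $z_{1-\alpha/2} = 1.960$ and $z_{1-\beta} = 0.842$ for $\alpha = 0.05$ and 80% power:

\[ n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2} = \frac{2\,(1.960 + 0.842)^2}{0.5^2} \approx 62.8 \]

The $t$-based calculation adds a small correction for estimating $\sigma$, which is why the requirement reported above rounds up to 64 per arm rather than 63.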

If we do not have an estimate of $d$ or even of the individual components $\Delta$ and $\sigma$, we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator $rx$ and an outcome $y$:

library(simstudy)

defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)
ss <- 30

set.seed(22821)
dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
head(dd)

##    id rx    y
## 1:  1  0 -150
## 2:  2  1   48
## 3:  3  0 -230
## 4:  4  1  116
## 5:  5  1   91
## 6:  6  1  105

Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform $d$? We have a couple of options. In the first case, we can estimate both $\Delta$ and $\sigma$ from the data and use those results directly in power calculations:

lmfit <- lm(y ~ rx, data = dd) 
Delta <- coef(lmfit)["rx"]
Delta

## rx 
## 78

sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]

pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2)  / 2 )
pool.sd

## [1] 94
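
The simple average of the two variances used above is the equal-allocation special case of the general pooled estimate. Assuming arm sizes $n_1$ (intervention) and $n_0$ (control), with within-arm sample variances $s_1^2$ and $s_0^2$ (notation introduced here for illustration):

\[ s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2}{n_1 + n_0 - 2}} = \sqrt{\frac{s_1^2 + s_0^2}{2}} \quad \text{when } n_1 = n_0 \]

With unequal arms, the weighted version on the left would be the one to use.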

The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated $\hat{d}$ is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):

pwr.t.test(n = NULL, d =  Delta/pool.sd, sig.level = 0.05, 
    power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 24
##               d = 0.83
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):

pwr.t.test(n = NULL, d =  50/pool.sd, sig.level = 0.05, 
    power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 56
##               d = 0.53
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Speak truth to power

Now the question becomes: what is the true expected power of the RCT based on the sample size estimated in the pilot study? To estimate this true power, we use the true effect size and the true variance (i.e. the true $d$).

In the first case, where we actually used the true $d$ to get the sample size estimate, we just recover the 80% power estimate. No surprise there:

pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power

## [1] 0.8

In the second case, where we used $\hat{d} = \hat{\Delta} / \hat{\sigma}$ to get the sample size $n=24$, the true power of the larger RCT would be 40%:

pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power

## [1] 0.4

And if we had used $\hat{d} = 50 / \hat{\sigma}$ to get the sample size estimate $n=56$, the true power would have been 75%:

pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power

## [1] 0.75
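
The same three power values can be cross-checked without the pwr package. The sketch below uses power.t.test from base R's stats package, passing the true $\Delta = 50$ and $\sigma = 100$ directly. The vector names are my own labels for the three designs, and the results should land close to (though not necessarily identical to) the rounded values shown above:

```r
# Per-arm sample sizes from the three approaches discussed above
n_per_arm <- c(true_d = 64, pilot_d = 24, pilot_sd_only = 56)

# True power of each design, evaluated at the true effect and true SD
true_power <- sapply(n_per_arm, function(n) {
  power.t.test(n = n, delta = 50, sd = 100, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})

round(true_power, 2)
```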

Conservative estimate of standard deviation

While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (and more on that in the next and last section), this 1995 paper by R.H. Browne presents the idea that we can use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.

The confidence interval for the standard deviation is not routinely provided in R. Another paper analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula in the most straightforward case under assumptions of normality to estimate the $\gamma*100\%$ one-sided confidence interval for $\sigma^2$:

\[ \left( 0,\frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right) \]

where $\chi^2_{N-2;\gamma}$ is determined by $P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma$. So, if $\gamma = 0.95$ then we can get a one-sided 95% confidence interval for the standard deviation using that formulation:

gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi  )
ucl

## [1] 111

The point estimate $\hat{\sigma}$ is 94, and the one-sided 95% confidence interval is $(0, 111)$. (I’m happy to provide a simulation to demonstrate that this is in fact the case, but won’t do it here in the interest of space.)
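
For readers who want to see that check, here is a minimal base R sketch of such a simulation. It is my own illustration under the same assumptions as the pilot (normal outcomes, 30 per arm, $\sigma = 100$); the seed, the number of replications, and the variable names are arbitrary choices. If the interval behaves as advertised, the true $\sigma$ should fall below the upper limit in roughly 95% of the simulated pilots:

```r
set.seed(1234)

n_arm <- 30      # subjects per arm in each simulated pilot
sigma <- 100     # true within-arm standard deviation
gamma <- 0.95    # confidence level of the one-sided interval
nsim  <- 10000   # number of simulated pilot studies

covered <- replicate(nsim, {
  y0 <- rnorm(n_arm, mean = 0,  sd = sigma)    # control arm
  y1 <- rnorm(n_arm, mean = 50, sd = sigma)    # intervention arm
  pool_var <- (var(y0) + var(y1)) / 2          # pooled variance, equal arms
  df <- 2 * n_arm - 2
  ucl <- sqrt(df * pool_var / qchisq(gamma, df = df, lower.tail = FALSE))
  sigma <= ucl                                 # does (0, ucl) contain the truth?
})

mean(covered)   # empirical coverage of the one-sided interval
```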

If we use $\hat{\sigma}_{ucl} = 111$ to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate $\hat{\sigma} = 94$ (where the sample size requirement was 56):

pwr.t.test(n = NULL, d =  50/ucl, sig.level = 0.05, 
    power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 78
##               d = 0.45
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Ultimately, using $\gamma = 0.95$ might be too conservative in that it might lead to an excessively large sample size requirement. Browne’s paper uses simulation to evaluate a range of $\gamma$’s, from 0.5 to 0.9, which I also do in the next section.

Simulation of different approaches

At this point, we need to generate multiple iterations to see how the various approaches perform over repeated pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above.

As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or minimally clinically meaningful effect size to estimate $\Delta$ (in the figures below, this would be the columns labeled ‘truth’) with using the effect size point estimate from the pilot (labeled ‘pilot’). I also compare using a point estimate of $\sigma$ from the pilot (where $\gamma=0$) with using the upper limit of a one-sided confidence interval defined by $\gamma$. In these simulations I compare three levels of $\gamma$: $\gamma \in \{0, 0.5, 0.7\}$.

In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard deviation $\sigma = 100$ so that the true $d$ is 0.30 or 0.75.

The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios. Each scenario was simulated 5000 times. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the percent of simulations with power estimates at or above the target of 80%.

Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for $\hat{\sigma}$. In Browne’s paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is actually likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, perhaps using $\gamma$ higher than 0.6 or 0.7 might provide an adequate cushion.

Second, it appears that using the effect size estimate from the pilot as the basis for an RCT power analysis is risky. The box plots labeled ‘pilot’ exhibit much more variation than the ‘truth’ box plots. As a result, there is a high probability that the true power will fall considerably below 80%. And in many other cases, the true power will be unnecessarily large, because the RCTs have been designed to be larger than they need to be.

The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where variation seems to be reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an under-powered or overpowered study, neither of which is ideal.

A question remains about how best to determine what effect size to use for the power calculation if using the estimate from the pilot is risky. I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. And the pilot study should focus on other important feasibility issues that can help improve the design of the RCT.

References:

Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of evaluation in clinical practice, 10(2), pp.307-312.

Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of psychiatric research, 45(5), pp.626-629.

Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in medicine, 14(17), pp.1933-1940.

Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.

Support:

This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

Addendum

Below is the code I used to run the simulations and generate the plots:

library(simstudy)
library(data.table)
library(pwr)
library(ggplot2)

getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {
  
  estring <- paste0("rx * ", esize)
  defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)
  
  N <- ssize * 2
  
  dd <- genData(n = N)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(defd, dd)
  
  lmfit <- lm(y~rx, data = dd)

  sd.rx <- dd[rx==1, sd(y)]
  sd.ctl <- dd[rx==0, sd(y)]
  estsd <- sqrt( (sd.rx^2 + sd.ctl^2)  / 2 )   # pooled SD (equal arm sizes)
  
  qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)
  ucl <- sqrt( ( (N-2) * estsd^2 ) / qchi  )

  p.sd <- estsd * (gamma == 0) + ucl * (gamma > 0)
  p.eff <- esize * (use.est == FALSE) + 
              coef(lmfit)["rx"] * (use.est == TRUE)
  
  if (abs(p.eff/p.sd) < 0.0002) p.eff <- sign(p.eff) * .0002 * p.sd
  
  nstar <- round(pwr.t.test(n = NULL, d =  p.eff/p.sd, sig.level = 0.05, 
                            power = 0.80, type = "two.sample")$n,0)  
  
  power <- pwr.t.test(n=nstar, d = esize/100, sig.level = 0.05, 
                      type = "two.sample")
  
  return(data.table(ssize, esize, gamma, use.est,
    estsd = estsd, ucl = ucl, nstar, power = power$power,
    est = coef(lmfit)["rx"], 
    lcl.est = confint(lmfit)["rx",1] , 
    ucl.est = confint(lmfit)["rx",2])
  )
  
}

dres <- data.table()
  
for (i in c(30, 60)) {
 for (j in c(30, 75)) {
  for (k in c(0, .5, .7)) {
   for (l in c(FALSE, TRUE)) {
    dd <- rbindlist(lapply(1:5000, 
      function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l))
    )
    dres <- rbind(dres, dd)
}}}}

above80 <- dres[, .(x80 = mean(power >= 0.80)), 
                  keyby = .(ssize, esize, gamma, use.est)]
above80[, l80 := scales::percent(x80, accuracy = 1)]
  
g_labeller <- function(value) {
    paste("\U03B3", "=", value) # unicode for gamma
}

e_labeller <- function(value) {
  paste("\U0394", "=", value) # unicode for Delta
}
  
ggplot(data = dres[ssize == 30], 
       aes(x=factor(use.est, labels=c("'truth'", "pilot")), y=power)) +
  geom_hline(yintercept = 0.8, color = "white") +
  geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey92"),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 9, face = "bold")) +
  facet_grid(esize ~ gamma, 
    labeller = labeller(gamma = g_labeller, esize = e_labeller)) +
  scale_x_discrete(
    name = "\n source of effect size used for power calculation") +
  scale_y_continuous(limits = c(0,1), breaks = c(0, .8),
                      name = "distribution of power estimates \n") +
  ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +
  geom_text(data = above80[ssize == 30], 
            aes(label = l80), x=rep(c(0.63, 1.59), 6), y = 0.95,
            size = 2.5)

Anyone interested in a function to quickly generate data with many predictors?

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 29 Oct 2019 00:00:00 +0000

A couple of months ago, I was contacted about the possibility of creating a simple function in simstudy to generate a large dataset that could include tens or even hundreds of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other “regularization” methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.

I’m presenting a new function here as a work-in-progress. I am putting it out there in case other folks have opinions about what might be most useful; feel free to let me know if you do. If not, I am likely to include something very similar to this in the next iteration of simstudy, which will be version 0.1.16.

Function genMultPred

In its latest iteration, the new function has three interesting arguments. The first two are predNorm and predBin, which are each vectors of length 2. The first value indicates the number of predictors to generate with either a standard normal distribution or a binary distribution, respectively. The second value in each vector represents the number of variables that will actually be predictive of the outcome. (Obviously, the second value cannot be greater than the first value.)

The third interesting argument is corStrength, which is a non-negative number indicating the overall strength of the correlation between the predictors. When corStrength is set to 0 (which is the default), the variables are generated assuming independence. When corStrength is non-zero, a random correlation matrix is generated using package clusterGeneration [Weiliang Qiu and Harry Joe. (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).] The corStrength value is passed on to the argument ratioLambda in the function genPositiveDefMat. As the value of corStrength increases, higher levels of correlation are induced in the random correlation matrix for the predictors.

Currently, the outcome can only have one of three distributions: normal, binary, or Poisson.

One possible enhancement would be to allow the distributions of the predictors to have more flexibility. However, I’m not sure the added complexity would be worth it. Again, you could always take the more standard simstudy approach of function genData if you wanted more flexibility.

Here’s the function, in case you want to take a look under the hood:

genMultPred <- function(n, predNorm, predBin, 
                        dist = "normal", sdy = 1, corStrength = 0) {
  
  normNames <- paste0("n", 1:predNorm[1])
  binNames <- paste0("b", 1:predBin[1])
  
  ## Create the definition tables to be used by genData
  
  defn <- data.table(varname = normNames,
                     formula = 0,
                     variance = 1,
                     dist = "normal",
                     link = "identity")
  
  defb <- data.table(varname = binNames,
                     formula = 0.5,
                     variance = NA,
                     dist = "binary",
                     link = "identity")
  
  defx <- rbind(defn, defb)
  attr(defx, which = "id") <- "id"
  
  ## Create the coefficient values - all normally distributed
  
  ncoefs <- rnorm(predNorm[1], 0, 1)
  setzero <- sample(1:predNorm[1], (predNorm[1] - predNorm[2]), 
                    replace = FALSE)
  ncoefs[setzero] <- 0
  
  bcoefs <- rnorm(predBin[1], 0, 1)
  setzero <- sample(1:predBin[1], (predBin[1] - predBin[2]), 
                    replace = FALSE)
  bcoefs[setzero] <- 0
  
  coefs <- c(ncoefs, bcoefs)
  names(coefs) <- c(normNames, binNames)
  
  ## Generate the predictors
  
  if (corStrength <= 0) {     # predictors are independent
    
    dx <- genData(n, defx)
    
  } else {
    
    rLambda <- max(1, corStrength)
    covx <- cov2cor(genPositiveDefMat(nrow(defx), 
                        lambdaLow = 1, ratioLambda = rLambda)$Sigma)
    dx <- genCorFlex(n, defx, corMatrix = covx)
    
  }
  
  ## Generate the means (given the predictors)
  
  mu <- as.matrix(dx[,-"id"]) %*% coefs
  dx[, mu := mu]
  
  ## Generate the outcomes based on the means
  
  if (dist == "normal") {
    dx[, y := rnorm(n, mu, sdy)]
  } else if (dist == "binary") {
    dx[, y := rbinom(n, 1, 1/(1 + exp(-mu)))]  # link = logit
  } else if (dist == "poisson") {
    dx[, y := rpois(n, exp(mu))]               # link = log
  } 
  
  dx[, mu := NULL]
  
  return(list(data = dx[], coefs = coefs))
}

A brief example

Here is an example with 7 normally distributed covariates and 4 binary covariates. Only 3 of the continuous covariates and 2 of the binary covariates will actually be predictive.

library(simstudy)
library(clusterGeneration)

set.seed(732521)

dx <- genMultPred(250, c(7, 3), c(4, 2))

The function returns a list of two objects. The first is a data.table containing the generated predictors and outcome:

round(dx$data, 2)

##       id    n1    n2    n3    n4    n5    n6    n7 b1 b2 b3 b4     y
##   1:   1  0.15  0.12 -0.07 -1.38 -0.05  0.58  0.57  1  1  0  1 -1.07
##   2:   2  1.42 -0.64  0.08  0.83  2.01  1.18  0.23  1  1  0  0  4.42
##   3:   3 -0.71  0.77  0.94  1.59 -0.53 -0.05  0.26  0  0  0  0  0.09
##   4:   4  0.35 -0.80  0.90 -0.79 -1.72 -0.16  0.09  0  0  1  1 -0.58
##   5:   5 -0.22 -0.72  0.62  1.40  0.17  2.21 -0.45  0  1  0  1 -2.18
##  ---                                                                
## 246: 246 -1.04  1.62  0.40  1.46  0.80 -0.77 -1.27  0  0  0  0 -1.19
## 247: 247 -0.85  1.56  1.39 -1.25 -0.82 -0.63  0.13  0  1  0  0 -0.70
## 248: 248  0.72 -0.83 -0.04 -1.38  0.61 -0.71 -0.06  1  0  1  1  0.74
## 249: 249 -0.15  1.62 -1.01 -0.79 -0.53  0.44 -0.46  1  1  1  1  0.95
## 250: 250 -0.59  0.34 -0.31  0.18 -0.86 -0.90  0.22  1  0  1  0 -1.90

The second object is the set of coefficients that determine the average response conditional on the predictors:

round(dx$coefs, 2)

##    n1    n2    n3    n4    n5    n6    n7    b1    b2    b3    b4 
##  2.48  0.62  0.28  0.00  0.00  0.00  0.00  0.00  0.00  0.53 -1.21

Finally, we can “recover” the original coefficients with linear regression:

lmfit <- lm(y ~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + b1 + b2 + b3 + b4, 
            data = dx$data)
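
To compare the estimates with the truth, the confidence intervals from confint can be lined up against the known coefficients. The sketch below is a self-contained base R illustration of that comparison; it does not use genMultPred, and the data, coefficient values, and object names (true_coefs, res) are hypothetical stand-ins chosen only to show the bookkeeping:

```r
set.seed(1)

n <- 250
x <- matrix(rnorm(n * 4), ncol = 4,
            dimnames = list(NULL, paste0("n", 1:4)))

# Hypothetical true coefficients: two predictive, two with no association
true_coefs <- c(n1 = 1.5, n2 = -0.5, n3 = 0, n4 = 0)

d <- data.frame(x, y = as.vector(x %*% true_coefs) + rnorm(n))
fit <- lm(y ~ n1 + n2 + n3 + n4, data = d)

# Line up the truth, the point estimates, and the 95% confidence limits
ci <- confint(fit)[names(true_coefs), ]
res <- data.frame(truth = true_coefs,
                  est   = coef(fit)[names(true_coefs)],
                  lower = ci[, 1],
                  upper = ci[, 2])
res$covered <- res$truth >= res$lower & res$truth <= res$upper
res
```

A table like res is all that is needed to draw an interval plot with the true values overlaid.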

Here’s a plot showing the 95% confidence intervals of the estimates along with the true values. The yellow lines are covariates where there is truly no association.

Addendum: correlation among predictors

Here is a pair of examples using the corStrength argument. In the first case, the observed correlations are close to 0, whereas in the second case, the correlations range from -0.50 to 0.25. The impact of corStrength will vary depending on the number of potential predictors.

set.seed(291212)

# Case 1

dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 0)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)

##       n1    n2    n3    n4    b1    b2
## n1  1.00 -0.02  0.02  0.03 -0.01 -0.01
## n2 -0.02  1.00 -0.01  0.03 -0.03  0.00
## n3  0.02 -0.01  1.00  0.00 -0.04 -0.01
## n4  0.03  0.03  0.00  1.00  0.06 -0.01
## b1 -0.01 -0.03 -0.04  0.06  1.00 -0.01
## b2 -0.01  0.00 -0.01 -0.01 -0.01  1.00

# Case 2

dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 50)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)

##       n1    n2    n3    n4    b1    b2
## n1  1.00  0.09  0.08 -0.32  0.25  0.04
## n2  0.09  1.00 -0.29 -0.47 -0.05 -0.02
## n3  0.08 -0.29  1.00 -0.46 -0.01 -0.01
## n4 -0.32 -0.47 -0.46  1.00 -0.20 -0.05
## b1  0.25 -0.05 -0.01 -0.20  1.00 -0.04
## b2  0.04 -0.02 -0.01 -0.05 -0.04  1.00

Selection bias, death, and dying

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 15 Oct 2019 00:00:00 +0000

I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death.

A key challenge for researchers is figuring out how to analyze events that occur just before death. For example, it is not unusual to consider hospitalization in the week or month before death as a poor outcome. As an illustration, here is a paper in the Journal of Palliative Care Medicine that describes an association of homecare nursing and reduced hospitalizations in the week before death. While there is no denying the strength of the association, it is less clear how much of that association is causal.

In particular, there is the possibility of selection bias that may result when considering only patients who have died. In this post, I want to describe the concept of selection bias and simulate data that mimics the process of end-stage disease in order to explore how these issues might play out when we are actually evaluating the causal effect of an exposure or randomized intervention.

Selection bias

Selection bias is used to refer to different concepts by different researchers (see this article by Haneuse or this one by Hernán et al for really nice discussions of these issues). The terminology doesn’t matter as much as understanding the underlying data generating processes that distinguish the different ideas.

The key issue is to understand what is being selected. In one case, the exposure or intervention is the focus. And in the second case, it is how the patients or subjects are selected into the study more generally that induces the bias. The first selection process is typically referred to by epidemiologists as confounding bias (though it is also called treatment-selection bias), while the second is actually selection bias.

When I’ve written about these issues before (for example, see here), I’ve described how DAGs can be useful to illuminate the potential biases. Below, I have drawn a diagram to represent a simple case of selection bias. Say we are interested in measuring the causal relationship between income and blood pressure in some population in which the two are actually not causally related. If people with higher income are more likely to visit a doctor, and if people with higher blood pressure are also more likely to visit a doctor, the underlying causal relationship might be well represented by the DAG on the left in the figure below.

Let’s say we recruit participants for our study right outside of a medical facility. Choosing this location (as opposed to, say, a shopping mall where the causal model on the left would not be relevant), we are inducing a relationship between income and blood pressure. This can be seen in the DAG on the right, where we have effectively “controlled” for medical facility access in our selection process. The induced statistical relationship can be described in this way: if someone is at the medical center and they have relatively low income, they are more likely to have relatively high blood pressure. Conversely, if someone is there and they have relatively low blood pressure, they are more likely to have relatively high income. Based on this logic, we would expect to see a negative relationship between income and blood pressure in our study sample drawn from patients visiting a medical facility.

To explore by simulation, we can generate a large population of individuals with uncorrelated income and blood pressure. Selection will be a function of both:

library(data.table)

n <- 5000
set.seed(748347)

income <- rnorm(n)
bp <- rnorm(n)

logitSelect <- -2 + 2*income + 2*bp
pSelect <- 1/(1+exp(-logitSelect))
select <- rbinom(n, 1, pSelect)

dPop <- data.table(income, bp, select)
dSample <- dPop[select == 1]

The plot on the left below is the overall population of 5000; there is no obvious relationship between the income and blood pressure. The group that was recruited at the medical facility and enrolled in the study (a subset of the original population) is shown in purple in the plot on the right. In this subset, there does indeed appear to be a relationship between the two characteristics. An estimate of the association, which we know is zero, based on the sample would be biased; that bias is due to how we selected participants into the study.
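
Since the plots are a visual summary, a quick numerical check can make the same point. The sketch below repeats the data generation in base R (using plogis for the inverse logit and plain vectors instead of a data.table; the names selected, cor_pop, and cor_sample are mine) and compares the income and blood pressure correlation in the full population with the correlation among those selected:

```r
set.seed(748347)
n <- 5000

income <- rnorm(n)
bp <- rnorm(n)

# Selection into the study depends on both income and blood pressure
p_select <- plogis(-2 + 2 * income + 2 * bp)
selected <- rbinom(n, 1, p_select) == 1

cor_pop <- cor(income, bp)                           # full population
cor_sample <- cor(income[selected], bp[selected])    # recruited subset only

round(c(population = cor_pop, sample = cor_sample), 2)
```

The population correlation should be close to zero, while the correlation in the selected sample should be clearly negative, as the DAG argument predicts.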

Hospitalization before death

In the next simulation, let’s consider a somewhat more complex process, though with the same underlying structure and similar bias as the simpler case above. The next DAG (below) shows three different time periods. In this case there is an indicator of homecare by a nurse $N_1$, $N_2$, and $N_3$. (In this particular example, an individual has home nursing care in all three periods or they don’t have any home nursing care in any period. This is not a requirement.) In each period, each patient has an underlying time-dependent health status, ranging from $1$ (healthiest) to $4$ (sickest). In this simulated study, the underlying health status $U_1$, $U_2$, and $U_3$ are considered latent (i.e. unmeasured). The progression of health status is governed by a Markov process that is independent of any kind of treatment. (See here and here for a more detailed description of how this is done using simstudy.)

The probability of hospitalization is solely a function of the underlying health status, and nothing else. (I could make hospitalization a function of palliative care as well, but this just simplifies matters. In both cases the estimates will be biased; you can try for yourself.)

Death is a function of underlying health status and palliative care. While it does not seem to be the case in practice, I am assuming that less aggressive care results in shorter survival times. And the sicker the patient is in a particular period, the greater the risk of dying in that period. (There should be lines between death in various periods and all subsequent measures, but I have eliminated them for clarity’s sake.)

The code to generate the data starts with the definitions: first, I define an initial health state $S_0$ that can range from 1 to 3 and the transition matrix $P$ for the Markov process. Next, I define the hospitalization and death outcomes:

bDef <- defData(varname = "S0", formula = "0.4;0.4;0.2",
                     dist = "categorical")

P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
                 0.0, 0.4, 0.4, 0.2,
                 0.0, 0.0, 0.6, 0.4,
                 0.0, 0.0, 0.0, 1.0),
              nrow = 4))

pDef <- defDataAdd(varname = "hospital", formula = "-2 + u",
                   dist = "binary", link = "logit")
pDef <- defDataAdd(pDef, varname = "death", 
                   formula = "-2 + u + homenurse * 1.5",
                   dist = "binary", link = "logit")
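
To get a feel for the transition matrix defined above, it can help to verify that each row is a proper probability distribution and to see how the latent health states would spread out over time. The base R sketch below is only an illustration: it builds the same matrix with byrow = TRUE (equivalent to transposing the column-filled version), and the helper state_dist and the vector s0 are names I have introduced. It tracks the marginal distribution of the latent chain only and ignores deaths:

```r
P <- matrix(c(0.7, 0.2, 0.1, 0.0,
              0.0, 0.4, 0.4, 0.2,
              0.0, 0.0, 0.6, 0.4,
              0.0, 0.0, 0.0, 1.0),
            nrow = 4, byrow = TRUE)

# Each row holds the transition probabilities out of one state
stopifnot(all(abs(rowSums(P) - 1) < 1e-12))

# Initial distribution over states 1-4 (nobody starts in state 4)
s0 <- c(0.4, 0.4, 0.2, 0.0)

# Marginal distribution of the latent state after k transitions
state_dist <- function(k) {
  d <- s0
  for (i in seq_len(k)) d <- as.vector(d %*% P)
  d
}

# Columns are 0, 1, 2, and 3 transitions; rows are health states 1-4
round(sapply(0:3, state_dist), 3)
```

Because state 4 is absorbing and no state can improve, the mass shifts steadily toward the sickest state.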

The data generation process randomizes individuals to home nursing care (or care as usual) in the first period, and creates all of the health status measures and outcomes. The last step removes any data for an individual that was generated after they died. (The function trimData is new and only available in simstudy 0.1.15, which is available on CRAN as of 10/21/2019.)

set.seed(272872)

dd <- genData(10000, bDef)
dd <- trtAssign(dd, grpName = "homenurse")

dp <- addMarkov(dd, transMat = P, 
                chainLen = 4, id = "id", 
                pername = "seq", start0lab = "S0",
                varname = "u")

dp <- addColumns(pDef, dp)
dp <- trimData(dp, seqvar = "seq", eventvar = "death")

A short follow-up period

If we have a relatively short follow-up period in our randomized trial of supportive care at home (homenurse), only a portion of the sample will die; as a result, we can only compare the hospitalization before death for a subset of the sample. By selecting on death, we will induce a relationship between the intervention and the outcome where none truly exists. Inspecting the DAGs below, it is apparent that this is a classic case of selection bias. Since we cannot control for the unmeasured health status $U$, hospitalization and death are associated. And, since treatment and death are causally related, by selecting on death we are in the same situation as we were in the first example.

d1 <- dp[seq == 1]

If we consider only those who died in the first period, we will be including 61% of the sample:

d1[, mean(death)]

## [1] 0.6109

To get a sense of the bias, I am considering three models. The first model estimates the effect of the intervention on hospitalization for only those who died in the first period; we expect that this will have a negative bias. In the second model, we use the same subset of patients who died, but adjust for underlying health status; the homenurse coefficient should be close to zero. Finally, we estimate a model for everyone in period 1, regardless of whether they died. Again, we expect the effect size to be close to 0.

fit1 <- glm(hospital ~ homenurse, data=d1[death==1], 
            family = "binomial")
fit2 <- glm(hospital ~ homenurse + u, data=d1[death==1], 
            family = "binomial")
fit3 <- glm(hospital ~ homenurse, data=d1, 
            family = "binomial")

library(stargazer)

stargazer(fit1, fit2, fit3, type = "text",
          column.labels = c("died", "died - adjusted", "all"), 
          omit.stat = "all", omit.table.layout = "-asn")

## 
## =============================================
##                   Dependent variable:        
##           -----------------------------------
##                        hospital              
##             died    died - adjusted    all   
##              (1)          (2)          (3)   
## homenurse -0.222***     -0.061       -0.049  
##            (0.053)      (0.057)      (0.040) 
##                                              
## u                      1.017***              
##                         (0.039)              
##                                              
## Constant   0.108**     -2.005***    -0.149***
##            (0.042)      (0.092)      (0.028) 
##                                              
## =============================================

While these are estimates from a single data set (I should really do a more extensive experiment based on many different data sets), the estimates do seem to support our expectations. Indeed, if we cannot measure the underlying health status, the estimate of the intervention effect on hospitalization prior to death is biased; we would conclude that supportive care reduces the probability of hospitalization before death when we know (based on the data generation process used here) that it does not.
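
A rough version of that more extensive experiment can be sketched in base R. This is a simplified stand-in for the simstudy code rather than a replication of it: it covers only the first period, draws the health status directly from the initial distribution, and uses simple (unbalanced) randomization. The function name sim_once, the seed, the sample size, and the number of replications are all my own choices:

```r
set.seed(2019)

sim_once <- function(n = 5000) {
  d <- data.frame(
    homenurse = rbinom(n, 1, 0.5),   # randomized exposure
    u = sample(1:3, n, replace = TRUE, prob = c(0.4, 0.4, 0.2))  # latent health
  )
  # Hospitalization depends only on health status, never on the exposure
  d$hospital <- rbinom(n, 1, plogis(-2 + d$u))
  # Death depends on both health status and the exposure
  d$death <- rbinom(n, 1, plogis(-2 + d$u + 1.5 * d$homenurse))

  fit_died <- glm(hospital ~ homenurse, data = d[d$death == 1, ],
                  family = binomial)
  fit_all <- glm(hospital ~ homenurse, data = d, family = binomial)

  c(died = unname(coef(fit_died)["homenurse"]),
    all = unname(coef(fit_all)["homenurse"]))
}

est <- replicate(200, sim_once())   # 2 x 200 matrix of log odds ratios
round(rowMeans(est), 3)
```

Averaged over repeated data sets, the estimate restricted to those who died should be clearly negative, while the estimate based on everyone should hover around zero.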

Extended follow-up

We might think that if we could follow everyone up until death (and hence not select on death), the bias would be eliminated. However, this is not the case. The treatment effect is essentially an average of the effect over all time periods, and we know that for each time period, the effect estimate is biased due to selection. And averaging across biased estimates yields a biased estimate.

This issue is closely related to a general issue for causal survival analysis. It has been pointed out that it is not possible to estimate a causal treatment effect using hazard rates, as we do when we use Cox proportional hazard models. This is true even if treatment has been randomized and the two treatment arms are initially balanced with respect to underlying health status. The challenge is that after the first set of deaths, the treatment groups will no longer be balanced with respect to health status; some people survived because of the intervention, others because they were generally healthier. At each point in the survival analysis, the model for risk of death is conditioning on (i.e. selecting) those who did not die. So, there is built-in selection bias in the modelling. If you are interested in reading more about these issues, I recommend taking a look at these papers by Hernán and Aalen et al.
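
The imbalance that develops among survivors can be seen in a small base R simulation. This is only a sketch under assumed values: the variable names (u, trt, died) and all coefficients are hypothetical and are not taken from the data generation used in this post. Treatment is randomized and, by assumption, lowers the risk of death, while poorer underlying health (higher u) raises it.

```r
set.seed(2718)

n <- 100000
u <- rnorm(n)                # underlying health status (higher = sicker); assumed N(0, 1)
trt <- rbinom(n, 1, 0.5)     # randomized treatment indicator

# Assumed model for death in the first period (hypothetical coefficients)
p_death <- plogis(-1 + 1.5 * u - 1 * trt)
died <- rbinom(n, 1, p_death)

# Mean health status by arm for everyone (balanced by randomization)
u_all <- tapply(u, trt, mean)

# Mean health status by arm among survivors only
u_surv <- tapply(u[died == 0], trt[died == 0], mean)

round(rbind(all = u_all, survivors = u_surv), 2)
```

At baseline the two arms have nearly the same mean health status. Among survivors, the treated arm is sicker on average, because some treated patients survived only because of the treatment. Any comparison made within the survivors is therefore no longer a comparison of randomized groups.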

Now, back to the simulation. In this case, we analyze everyone who has died within 4 periods, which is about 97% of the initial sample, virtually everyone.

dDied <- dp[death == 1]
nrow(dDied)/nrow(d1)

## [1] 0.9658

The effect estimate based on this data set is only unbiased when we are able to control for underlying health status. Otherwise, extending follow-up does not help remove any bias.

fit4 <- glm(hospital ~ homenurse, data=dDied, family = "binomial")
fit5 <- glm(hospital ~ homenurse + u, data=dDied, family = "binomial")

stargazer(fit4, fit5, type = "text",
          omit.stat = "all", omit.table.layout = "-asn")

## 
## ==============================
##           Dependent variable: 
##           --------------------
##                 hospital      
##              (1)        (2)   
## homenurse -0.383***   -0.048  
##            (0.041)    (0.045) 
##                               
## u                    1.020*** 
##                       (0.028) 
##                               
## Constant   0.296***  -2.028***
##            (0.030)    (0.070) 
##                               
## ==============================

In the future, I hope to explore alternative ways to analyze these types of questions. In the case of survival analysis, models that do not condition on death have been proposed to get at causal estimates. This may not be possible when the outcome of interest (health care before death) is defined by conditioning on death. We may actually need to frame the question slightly differently to be able to get an unbiased estimate.

References:

Seow, H., Sutradhar, R., McGrail, K., Fassbender, K., Pataky, R., Lawson, B., Sussman, J., Burge, F. and Barbera, L., 2016. End-of-life cancer care: temporal association between homecare nursing and hospitalizations. Journal of palliative medicine, 19(3), pp.263-270.

Haneuse, S., 2016. Distinguishing selection bias and confounding bias in comparative effectiveness research. Medical care, 54(4), p.e23.

Hernán, M.A., Hernández-Díaz, S. and Robins, J.M., 2004. A structural approach to selection bias. Epidemiology, 15(5), pp.615-625.

Hernán, M.A., 2010. The hazards of hazard ratios. Epidemiology, 21(1), p.13.

Aalen, O.O., Cook, R.J. and Røysland, K., 2015. Does Cox analysis of a randomized survival study yield a causal treatment effect?. Lifetime data analysis, 21(4), pp.579-593.

Support:

This research is supported by the National Institutes of Health National Institute on Aging R33AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

There's always at least two ways to do the same thing: an example generating 3-level hierarchical data using simstudy

keith.goldfeld@nyumc.org (Keith Goldfeld) — Thu, 03 Oct 2019 00:00:00 +0000

“I am working on a simulation study that requires me to generate data for individuals within clusters, but each individual will have repeated measures (say baseline and two follow-ups). I’m new to simstudy and have been going through the examples in R this afternoon, but I wondered if this was possible in the package, and if so whether you could offer any tips to get me started with how I would do this?”

This question popped up in my in-box a couple of days ago. And since I always like an excuse to do a little coding early in the morning to get my brain going, I decided to create a little example, though in this case, there were at least two ways to go about it. I sent back both options, and am putting them up here, since I know this kind of data generation problem comes up frequently. In fact, the post I recently wrote on open cohorts in stepped-wedge designs had to deal with this same issue, though in a slightly more elaborate way.

Three-level hierarchical data

In this example, we want individuals clustered within groups, and measurements clustered within individuals, as depicted by this figure:

The hierarchical scheme represented implies that outcomes for individuals within groups are correlated, and that measurements over time for a particular individual are correlated. The structure of these two levels of correlation can take on a variety of forms. In the examples that follow, I am going to assume that the correlation between the individuals in a group is constant, as are the individual measurements over time. We could easily make the assumption that measurements closer in time will be more highly correlated than measurements further apart in time (such as auto-regressive correlation with 1 period of lag), but since we have only three measurements, it is not totally unreasonable to assume constant correlation.

Generating data explicitly with random effects

Enough with the preliminaries - let’s get to the data generation. In the first approach, both levels of correlation will be induced with group- and individual-level random effects using the following underlying model:

\[Y_{ijt} = \beta_t + \gamma_j + \alpha_i + \epsilon_{ijt},\]

where $Y_{ijt}$ is the outcome for person $i$ in group $j$ during time period $t$. $\beta_t$ is the mean outcome during period $t$, $t \in \{ 0,3, 6 \}$. $\gamma_j$ is the group-specific effect, and $\gamma_j \sim N(0,\sigma^2_\gamma)$. $\alpha_i$ is the individual-specific effect, and $\alpha_i \sim N(0,\sigma^2_\alpha)$. Finally, $\epsilon_{ijt}$ is the noise for each particular measurement, where $\epsilon_{ijt} \sim N(0,\sigma^2_\epsilon)$.

The group, individual, and outcome definitions are the first order of business. In this example, $\sigma^2_\gamma = 2$, $\sigma^2_\alpha = 1.3$, and $\sigma^2_\epsilon = 1.1$. In addition, the average outcomes at baseline, 3 months and 6 months, are 3, 4, and 6, respectively:

library(simstudy)

### Group definition

defg <- defData(varname = "gamma", formula=0, variance = 2, id = "cid")

### Individual definition

defi <- defDataAdd(varname = "alpha", formula = 0, variance = 1.3)

### Outcome definition

defC <- defCondition(condition = "period == 0", 
                     formula = "3 + gamma + alpha",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1", 
                     formula = "4 + gamma + alpha",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2", 
                     formula = "6 + gamma + alpha",
                     dist = "nonrandom")

defy <- defDataAdd(varname = "y", formula = "mu", variance = 1.1)

To generate the data, first we create the group level records, then the individual level records, and finally the repeated measurements for each individual:

set.seed(3483)
dgrp1 <- genData(100, defg)

dind1 <- genCluster(dgrp1, "cid", numIndsVar = 20, level1ID = "id")
dind1 <- addColumns(defi, dind1)

dper1 <- addPeriods(dind1, nPeriods = 3, idvars = "id")
dper1 <- addCondition(defC, dper1, newvar = "mu")

dper1 <- addColumns(defy, dper1)

Here is a plot of the outcome data by period, with the grey lines representing individuals, and the red lines representing the group averages:

Here is a calculation of the observed covariance matrix. The total variance for each outcome should be close to $\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon = 4.4$, and the observed covariance should be close to $\sigma^2_\gamma + \sigma^2_\alpha = 3.3$.
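
These two targets follow directly from the model, because $\gamma_j$, $\alpha_i$, and $\epsilon_{ijt}$ are mutually independent. As a short worked derivation, for two different periods $t \ne t'$ measured on the same individual:

\[
\begin{aligned}
\text{Var}(Y_{ijt}) &= \sigma^2_\gamma + \sigma^2_\alpha + \sigma^2_\epsilon = 2 + 1.3 + 1.1 = 4.4 \\
\text{Cov}(Y_{ijt}, Y_{ijt'}) &= \sigma^2_\gamma + \sigma^2_\alpha = 2 + 1.3 = 3.3
\end{aligned}
\]

The shared terms $\gamma_j$ and $\alpha_i$ are what remain in the covariance; the measurement noise drops out because it is independent across periods.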

dcor1 <- dcast(dper1, id + cid ~ period, value.var = "y")
setnames(dcor1, c("id", "cid", "y0", "y1", "y2"))

dcor1[, cov(cbind(y0, y1, y2))]

##     y0  y1  y2
## y0 4.5 3.2 3.4
## y1 3.2 4.2 3.2
## y2 3.4 3.2 4.6

The correlation $\rho$ should be close to

\[ \rho = \frac{\sigma^2_\gamma + \sigma^2_\alpha}{\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon} = \frac{3.3}{4.4} = 0.75\]

(For a more elaborate derivation of correlation coefficients, see this post on stepped-wedge designs.)

dcor1[, cor(cbind(y0, y1, y2))]

##      y0   y1   y2
## y0 1.00 0.73 0.75
## y1 0.73 1.00 0.73
## y2 0.75 0.73 1.00
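
For readers who want to see the mechanics without simstudy, the same random effects model can be sketched in base R. This is an illustration only, not a replacement for the code above: the object names are hypothetical, and I have assumed 1000 groups (rather than 100) so that the sample covariance is less noisy.

```r
set.seed(3483)

n_grp <- 1000   # assumed number of groups, larger than in the post
n_ind <- 20     # individuals per group
beta <- c(3, 4, 6)

gamma <- rnorm(n_grp, 0, sqrt(2))              # group effects
cid <- rep(seq_len(n_grp), each = n_ind)       # group id for each individual
alpha <- rnorm(n_grp * n_ind, 0, sqrt(1.3))    # individual effects

# One column per period: period mean + group effect + individual effect + noise
y <- sapply(beta, function(b) {
  b + gamma[cid] + alpha + rnorm(n_grp * n_ind, 0, sqrt(1.1))
})
colnames(y) <- c("y0", "y1", "y2")

round(cov(y), 1)
round(cor(y), 2)
```

The diagonal of the covariance matrix should land near 4.4 and the off-diagonal elements near 3.3, as in the simstudy version.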

Directly generating correlated data

In this second approach, the group-level correlation is once again generated using a group effect. However, the individual-level effect is replaced by noise that is explicitly correlated across time. The model here is

\[Y_{ijt} = \beta_t + \gamma_j + \phi_{ijt},\]

where the noise $\mathbf{\phi}_{ij}$ is a vector of noise components $\{\phi_{ij0},\phi_{ij3},\phi_{ij6}\} \sim N(\mathbf{0}, \Sigma)$, and

\[\Sigma = \left [ \begin{matrix} \sigma^2_\phi & \rho \sigma^2_\phi & \rho \sigma^2_\phi \\ \rho \sigma^2_\phi & \sigma^2_\phi & \rho \sigma^2_\phi \\ \rho \sigma^2_\phi & \rho \sigma^2_\phi & \sigma^2_\phi \end{matrix} \right ] \]

In this case $\sigma^2_\gamma$ is still 2, and $\sigma^2_\phi = 2.4$ to ensure that total variation is 4.4. We set $\rho = 0.54167$ so that $\rho \sigma^2_\phi = 1.3$, ensuring that the overall covariance of the observed outcome $y$ across periods is $3.3$ as in the first method.
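
The arithmetic behind these choices can be checked in a few lines of base R. The object names here are mine, chosen for illustration; the numbers are the ones given above.

```r
sigma2_gamma <- 2      # group-level variance
sigma2_phi <- 2.4      # variance of the correlated noise
rho <- 1.3 / sigma2_phi

# Compound symmetric covariance matrix of the noise vector
Sigma <- sigma2_phi * (rho * matrix(1, 3, 3) + (1 - rho) * diag(3))

# Marginal covariance of an individual's three outcomes: the shared group
# effect adds sigma2_gamma to every element
V <- Sigma + sigma2_gamma

round(rho, 5)
V
cov2cor(V)
```

The matrix V has 4.4 on the diagonal and 3.3 elsewhere, so the implied correlation across periods is again 0.75.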

defg <- defData(varname = "gamma", 
                formula = 0, variance = 2, id = "cid")
defg <- defData(defg, varname = "mu", 
                formula = 0, dist = "nonrandom")
defg <- defData(defg, varname = "phi", 
                formula = 2.4, dist = "nonrandom")

defC <- defCondition(condition = "period == 0", 
                     formula = "3 + gamma + e",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1", 
                     formula = "4 + gamma + e",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2", 
                     formula = "6 + gamma + e",
                     dist = "nonrandom")

In the data generation process, the function addCorGen is used to create the correlated noise across time:

set.seed(3483)
dgrp2 <- genData(100, defg)

dind2 <- genCluster(dgrp2, "cid", numIndsVar = 20, level1ID = "id")

dper2 <- addPeriods(dind2, nPeriods = 3, idvars = "id")
dper2 <- addCorGen(dper2, "id", nvars = 3, param1 = "mu", param2 = "phi",
            rho = .54167, dist = "normal", corstr = "cs", cnames = "e")
dper2 <- addCondition(defC, dper2, newvar = "y")

I won’t do a second plot, because it would look identical to the one above. But I am calculating the covariance and correlation matrices for the outcome to illustrate for you that the two slightly different approaches do indeed generate similarly distributed data.

dcor2 <- dcast(dper2, id + cid ~ period, value.var = "y")
setnames(dcor2, c("id", "cid", "y0", "y1", "y2"))

dcor2[, cov(cbind(y0, y1, y2))]

##     y0  y1  y2
## y0 4.4 3.4 3.3
## y1 3.4 4.4 3.4
## y2 3.3 3.4 4.5

dcor2[, cor(cbind(y0, y1, y2))]

##      y0   y1   y2
## y0 1.00 0.76 0.75
## y1 0.76 1.00 0.76
## y2 0.75 0.76 1.00

In the example here, I wouldn’t say either approach is better. For some, the purely random effects approach may be more intuitive, and for others the correlated noise might be. However, if we want a more complex correlation pattern, like the AR-1 pattern I mentioned earlier, one approach may in fact be a little more straightforward to implement.
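
For reference, an AR-1 structure replaces the constant off-diagonal correlation with one that decays with the distance between periods. The base R sketch below builds such a matrix and draws correlated noise from it using a Cholesky factor. The value rho = 0.6 is an arbitrary assumption for illustration, and this is not how addCorGen is implemented.

```r
set.seed(3483)

rho <- 0.6            # assumed lag-1 correlation (hypothetical value)
sigma2_phi <- 2.4
periods <- 0:2

# AR-1 correlation: rho raised to the number of periods separating two measurements
R_ar1 <- rho^abs(outer(periods, periods, "-"))
Sigma_ar1 <- sigma2_phi * R_ar1

# Draw correlated noise: rows are individuals, columns are periods
n <- 50000
z <- matrix(rnorm(n * 3), ncol = 3)
e <- z %*% chol(Sigma_ar1)

R_ar1
round(cor(e), 2)
```

Measurements one period apart have correlation rho, and measurements two periods apart have correlation rho squared.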

And no, I don’t respond so thoroughly to every question I get; sometimes it is better for you to struggle a bit to figure something out.

Simulating an open cohort stepped-wedge trial

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 17 Sep 2019 00:00:00 +0000

In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with stepped-wedge designs, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time. This means that for some patients, the data collection period straddles the transition from control to intervention.

Whenever I contemplate a simulation, I first think about the general structure of the data generating process before even thinking about the outcome model. In the case of a more standard two-arm randomized trial, that structure is quite simple and doesn’t require much, if any, thought. In this case, however, the overlaying of a longitudinal patient outcome process on top of a stepped-wedge design presents a little bit of a challenge.

Adding to the challenge is that, in addition to being a function of site- and individual-specific characteristics/effects, the primary outcome will likely be a function of time-varying factors. In particular, these include certain patient-level health-related factors that might contribute to the decision to prescribe anti-psychotic medications, as well as the time-varying intervention status, which is determined by the stepped-wedge randomization scheme. So, the simulation needs to accommodate the generation of both types of time-varying variables.

I’ve developed a bare-bones simulation of sites and patients to provide a structure that I can add to at some point in the future. While this is probably a pretty rare study design (though as stepped-wedge designs become more popular, it may be less rare than I am imagining), I thought the code could provide yet another example of how to approach a potentially vexing simulation in a relatively simple way.

Data definition

The focus here is on the structure of the data, so I am not generating any outcome data. However, in addition to generating the treatment assignment, I am creating the time-varying health status, which will affect the outcome process when I get to that.

In this simulation, there will be 5 sites, each followed for 25 weeks (starting with week 0). Each week, a site will have approximately 20 new patients, so we should expect to generate around $5 \times 25 \times 20 = 2500$ total patients.
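
As a rough check on that number, and assuming each weekly count is an independent Poisson draw with mean 20 (which is how the definitions that follow are set up), the total is itself Poisson with the summed mean. The object names in this sketch are hypothetical.

```r
n_sites <- 5
n_weeks <- 25
mean_per_week <- 20

# A sum of independent Poisson counts is Poisson with the summed mean
expected_total <- n_sites * n_weeks * mean_per_week
expected_total

# Central 95% range for the total number of patients
total_range <- qpois(c(0.025, 0.975), lambda = expected_total)
total_range
```

So a total that differs from 2500 by a few dozen patients is entirely expected.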

For each patient, we will be generating a series of health states, which range from 1 to 4, with 1 being healthiest, and 4 being death. I will use a Markov chain to generate this series. Two arguments required to simulate the Markov process are the starting state (which is created in S0) and the transition matrix P, which determines the probabilities of moving from one state to another.

NPER <- 25
perDef <- defDataAdd(varname = "npatient",  formula = 20,
                     dist = "poisson")

patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
                     dist = "categorical")

P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
                 0.1, 0.3, 0.4, 0.2,
                 0.0, 0.1, 0.5, 0.4,
                 0.0, 0.0, 0.0, 1.0),
              nrow = 4))
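
To make the role of the transition matrix concrete, here is a base R sketch that generates a single chain by hand. The helper sim_chain is hypothetical and written only for this illustration; it is not the simstudy implementation, which is used further on.

```r
set.seed(3837263)

P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.4, 0.2,
                0.0, 0.1, 0.5, 0.4,
                0.0, 0.0, 0.0, 1.0),
              nrow = 4))

# Each row holds the transition probabilities out of one state, so rows sum to 1
rowSums(P)

# sim_chain is a hypothetical helper written for this illustration
sim_chain <- function(P, s0, len) {
  states <- integer(len)
  states[1] <- s0
  for (t in seq_len(len)[-1]) {
    # Draw the next state using the row of P for the current state
    states[t] <- sample.int(nrow(P), size = 1, prob = P[states[t - 1], ])
  }
  states
}

chain <- sim_chain(P, s0 = 2, len = 25)
chain
```

Because the last row of P puts all of its probability on state 4, a chain that reaches state 4 stays there for every remaining period.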

Data generation

The data generation process starts with the sites and then proceeds to the patient level data. To begin, the five sites are generated (for now without any site-specific variables, but that could easily be modified in the future). Next, records for each site for each of the 25 periods (from week 0 to week 24) are generated; these site level records include the number of patients to be generated for each site, each week:

set.seed(3837263)

dsite <- genData(5, id = "site")

dper <- addPeriods(dsite, nPeriods = NPER, idvars = "site", 
                   timeid = "site.time", perName = "period")
dper <- addColumns(perDef, dper)

dper

##      site period site.time npatient
##   1:    1      0         1       17
##   2:    1      1         2       20
##   3:    1      2         3       25
##   4:    1      3         4       18
##   5:    1      4         5       23
##  ---                               
## 121:    5     20       121       17
## 122:    5     21       122       15
## 123:    5     22       123       16
## 124:    5     23       124       19
## 125:    5     24       125       20

Now, we assign each of the five sites to its own intervention “wave”. The first site starts at the beginning of the study, week 0. The second starts 4 weeks later at week 4, and so on, until the fifth and last site starts the intervention at week 16. (Obviously, a more realistic simulation would include many more sites, but all of this can easily be scaled up.) The intervention indicator is $I_{ct}$, and is set to 1 when cluster $c$ during week $t$ is in the intervention, and is 0 otherwise.

dsw <- trtStepWedge(dper, "site", nWaves = 5, lenWaves = 4, 
                    startPer = 0, perName = "period",
                    grpName = "Ict")

dsw <- dsw[, .(site, period, startTrt, Ict)]
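
The assignment rule itself is simple enough to write out by hand. The following base R sketch reproduces the rule as described in the text (one site per wave, waves 4 weeks apart); it is an illustration with hypothetical object names, not a description of what trtStepWedge does internally.

```r
n_sites <- 5
n_weeks <- 25
len_wave <- 4

# One row per site-week
sw <- expand.grid(period = 0:(n_weeks - 1), site = 1:n_sites)

# Site 1 starts in week 0, site 2 in week 4, ..., site 5 in week 16
sw$startTrt <- (sw$site - 1) * len_wave

# Intervention indicator: 1 once the site has reached its start week
sw$Ict <- as.integer(sw$period >= sw$startTrt)

# Number of intervention weeks per site
tapply(sw$Ict, sw$site, sum)
```

Each later wave spends four fewer weeks in the intervention than the one before it.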

Here are the intervention assignments for the first two sites during the first 8 weeks.

dsw[site %in% c(1,2) & period < 8]

##     site period startTrt Ict
##  1:    1      0        0   1
##  2:    1      1        0   1
##  3:    1      2        0   1
##  4:    1      3        0   1
##  5:    1      4        0   1
##  6:    1      5        0   1
##  7:    1      6        0   1
##  8:    1      7        0   1
##  9:    2      0        4   0
## 10:    2      1        4   0
## 11:    2      2        4   0
## 12:    2      3        4   0
## 13:    2      4        4   1
## 14:    2      5        4   1
## 15:    2      6        4   1
## 16:    2      7        4   1

To generate the patients, we start by generating the 2500 or so individual records. The single baseline factor that we include this time around is the starting health status S0.

dpat <- genCluster(dper, cLevelVar = "site.time", 
                   numIndsVar = "npatient", level1ID = "id")
dpat <- addColumns(patDef, dpat)
dpat

##       site period site.time npatient   id S0
##    1:    1      0         1       17    1  2
##    2:    1      0         1       17    2  1
##    3:    1      0         1       17    3  2
##    4:    1      0         1       17    4  2
##    5:    1      0         1       17    5  1
##   ---                                       
## 2524:    5     24       125       20 2524  3
## 2525:    5     24       125       20 2525  2
## 2526:    5     24       125       20 2526  1
## 2527:    5     24       125       20 2527  1
## 2528:    5     24       125       20 2528  1

Here is a visualization of the patients (it turns out there are 2528 of them) by site and starting point, with each point representing a patient. The color represents the intervention status: light blue is control (pre-intervention) and dark blue is intervention. Even though a patient may start in the pre-intervention period, they may actually receive services in the intervention period, as we will see further on down.

The patient health status series are generated using a Markov chain process. This particular transition matrix has an “absorbing” state, as indicated by the probability 1 in the last row of the matrix. Once a patient enters state 4, they will not transition to any other state. (In this case, state 4 is death.)

dpat <- addMarkov(dpat, transMat = P, 
                  chainLen = NPER, id = "id", 
                  pername = "seq", start0lab = "S0")

dpat

##        site period site.time npatient   id S0 seq state
##     1:    1      0         1       17    1  2   1     2
##     2:    1      0         1       17    1  2   2     3
##     3:    1      0         1       17    1  2   3     3
##     4:    1      0         1       17    1  2   4     3
##     5:    1      0         1       17    1  2   5     4
##    ---                                                 
## 63196:    5     24       125       20 2528  1  21     4
## 63197:    5     24       125       20 2528  1  22     4
## 63198:    5     24       125       20 2528  1  23     4
## 63199:    5     24       125       20 2528  1  24     4
## 63200:    5     24       125       20 2528  1  25     4

Now, we aren’t interested in the periods following the one where death occurs. So, we want to trim the data.table dpat to include only those periods leading up to state 4 and the first period in which state 4 is entered. We do this first by identifying the first time a state of 4 is encountered for each individual (and if an individual never reaches state 4, then all the individual’s records are retained, and the variable .last is set to the maximum number of periods NPER, in this case 25).

dlast <- dpat[, .SD[state == 4][1,], by = id][, .(id, .last = seq)]
dlast[is.na(.last), .last := NPER]

dlast

##         id .last
##    1:    1     5
##    2:    2    13
##    3:    3     2
##    4:    4     6
##    5:    5     3
##   ---           
## 2524: 2524     7
## 2525: 2525     5
## 2526: 2526    19
## 2527: 2527    20
## 2528: 2528     8

Next, we use the dlast data.table to “trim” dpat. We further trim the data set so that we do not have patient-level observations that extend beyond the overall follow-up period:

dpat <- dlast[dpat][seq <= .last][ , .last := NULL][]
dpat[, period := period + seq - 1]
dpat <- dpat[period < NPER]

dpat

##          id site period site.time npatient S0 seq state
##     1:    1    1      0         1       17  2   1     2
##     2:    1    1      1         1       17  2   2     3
##     3:    1    1      2         1       17  2   3     3
##     4:    1    1      3         1       17  2   4     3
##     5:    1    1      4         1       17  2   5     4
##    ---                                                 
## 12608: 2524    5     24       125       20  3   1     3
## 12609: 2525    5     24       125       20  2   1     2
## 12610: 2526    5     24       125       20  1   1     1
## 12611: 2527    5     24       125       20  1   1     1
## 12612: 2528    5     24       125       20  1   1     1
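
The two trimming steps are easier to follow on a single made-up patient. In this base R sketch the patient, the enrollment week, and the object names are all hypothetical; only NPER and the logic come from the code above.

```r
NPER <- 25

# A hypothetical patient who enrolls in week 22 and reaches state 4 in their 5th record
enroll <- 22
state <- c(2, 3, 3, 3, 4, rep(4, NPER - 5))
seq_id <- seq_along(state)

# Step 1: keep records up to and including the first period in state 4
last <- match(4, state)            # NA if the patient never reaches state 4
if (is.na(last)) last <- NPER
keep <- seq_id <= last

# Step 2: convert the patient's own sequence into calendar weeks
period <- enroll + seq_id - 1

# Step 3: drop records that fall after the end of the study (week NPER - 1)
keep <- keep & period < NPER

trimmed <- data.frame(seq = seq_id, period = period, state = state)[keep, ]
trimmed
```

For this patient, death would occur in week 26, after the study has ended, so only three records (weeks 22 through 24) remain and the death itself is never observed.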

And finally, we merge the patient data with the stepped-wedge treatment assignment data to create the final data set. The individual outcomes for each week could now be generated, because we would have all the baseline and time-varying information in a single data set.

dpat <- merge(dpat, dsw, by = c("site","period"))
setkey(dpat, id, period)
dpat <- delColumns(dpat, c("site.time", "seq", "npatient"))

dpat

##        site period   id S0 state startTrt Ict
##     1:    1      0    1  2     2        0   1
##     2:    1      1    1  2     3        0   1
##     3:    1      2    1  2     3        0   1
##     4:    1      3    1  2     3        0   1
##     5:    1      4    1  2     4        0   1
##    ---                                       
## 12608:    5     24 2524  3     3       16   1
## 12609:    5     24 2525  2     2       16   1
## 12610:    5     24 2526  1     1       16   1
## 12611:    5     24 2527  1     1       16   1
## 12612:    5     24 2528  1     1       16   1

Here is what the individual trajectories of health state look like. In the plot, each column represents a different site, and each row represents a different starting week. For example, the fifth row represents patients who appear for the first time in week 4. Sites 1 and 2 are already in the intervention in week 4, so none of these patients will transition. However, patients in sites 3 through 5 enter in the pre-intervention stage in week 4, and transition into the intervention at different points, depending on the site.

The basic structure is in place, so we are ready to extend this simulation to include more covariates, random effects, and outcomes. And once we’ve done that, we can explore analytic approaches.

This study is supported by the National Institutes of Health National Institute on Aging R61AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

Analyzing a binary outcome arising out of within-cluster, pair-matched randomization

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 03 Sep 2019 00:00:00 +0000

A key motivating factor for the simstudy package and much of this blog is that simulation can be super helpful in understanding how best to approach an unusual, or at least unfamiliar, analytic problem. About six months ago, I described the DREAM Initiative (Diabetes Research, Education, and Action for Minorities), a study that used a slightly innovative randomization scheme to ensure that two comparison groups were evenly balanced across important covariates. At the time, we hadn’t finalized the analytic plan. But, now that we have started actually randomizing and recruiting (yes, in that order, oddly enough), it is important that we do that, with the help of a little simulation.

The study design

The original post has the details about the design and matching algorithm (and code). The randomization is taking place at 20 primary care clinics, and patients within these clinics are matched based on important characteristics before randomization occurs. There is little or no risk that patients in the control arm will be “contaminated” or affected by the intervention that is taking place, which will minimize the effects of clustering. However, we may not want to ignore the clustering altogether.

Possible analytic solutions

Given that the primary outcome is binary, one reasonable procedure to assess whether or not the intervention is effective is McNemar’s test, which is typically used for paired dichotomous data. However, this approach has two limitations. First, McNemar’s test does not take into account the clustered nature of the data. Second, the test is just that, a test; it does not provide an estimate of effect size (and the associated confidence interval).

So, in addition to McNemar’s test, I considered four additional analytic approaches to assess the effect of the intervention: (1) Durkalski’s extension of McNemar’s test to account for clustering, (2) conditional logistic regression, which takes into account stratification and matching, (3) standard logistic regression with specific adjustment for the three matching variables, and (4) mixed effects logistic regression with matching covariate adjustment and a clinic-level random intercept. (In the mixed effects model, I assume the treatment effect does not vary by site, since I have also assumed that the intervention is delivered in a consistent manner across the sites. These may or may not be reasonable assumptions.)

While I was interested to see how the two tests (McNemar and the extension) performed, my primary goal was to see if any of the regression models was superior. In order to do this, I wanted to compare the methods in a scenario without any intervention effect, and in another scenario where there was an effect. I was interested in comparing bias, error rates, and variance estimates.

Data generation

The data generation process parallels the earlier post. The treatment assignment is made in the context of the matching process, which I am not showing this time around. Note that in this initial example, the outcome y depends on the intervention rx (i.e. there is an intervention effect).

library(simstudy)

### defining the data

defc <- defData(varname = "ceffect", formula = 0, variance = 0.4, 
                dist = "normal", id = "cid")

defi <- defDataAdd(varname = "male", formula = .4, dist = "binary")
defi <- defDataAdd(defi, varname = "age", formula = 0, variance = 40)
defi <- defDataAdd(defi, varname = "bmi", formula = 0, variance = 5)

defr <- defDataAdd(varname = "y", 
  formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0.45*rx + ceffect", 
  dist = "binary", link = "logit")

### generating the data

set.seed(547317)

dc <- genData(20, defc)

di <- genCluster(dc, "cid", 60, "id")
di <- addColumns(defi, di)

### matching and randomization within cluster (cid)

library(parallel)
library(Matching)

RNGkind("L'Ecuyer-CMRG")  # to set seed for parallel process

### See addendum for dmatch code

dd <- rbindlist(mclapply(1:nrow(dc), 
                         function(x) dmatch(di[cid == x]),
                         mc.set.seed = TRUE
                         )
                )

### generate outcome

dd <- addColumns(defr, dd)
setkey(dd, pair)
dd

##      cid ceffect  id male   age     bmi rx pair y
##   1:   1   1.168  11    1  4.35  0.6886  0 1.01 1
##   2:   1   1.168  53    1  3.85  0.2215  1 1.01 1
##   3:   1   1.168  51    0  6.01 -0.9321  0 1.02 0
##   4:   1   1.168  58    0  7.02  0.1407  1 1.02 1
##   5:   1   1.168  57    0  9.25 -1.3253  0 1.03 1
##  ---                                             
## 798:   9  -0.413 504    1 -8.72 -0.0767  1 9.17 0
## 799:   9  -0.413 525    0  1.66  3.5507  0 9.18 0
## 800:   9  -0.413 491    0  4.31  2.6968  1 9.18 0
## 801:   9  -0.413 499    0  7.36  0.6064  0 9.19 0
## 802:   9  -0.413 531    0  8.05  0.8068  1 9.19 0

Based on the outcomes of each individual, each pair can be assigned to a particular category that describes the outcomes. Either both fail, both succeed, or one fails and the other succeeds. These category counts can be represented in a $2 \times 2$ contingency table. The counts are the number of pairs in each of the four possible pairwise outcomes. For example, there were 173 pairs where the outcome was determined to be unsuccessful for both intervention and control arms.

dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(0,1), 
                          labels = c("no success", "success"))]
dpair[, rx := factor(`1`, levels = c(0, 1), 
                     labels = c("no success", "success"))]

dpair[, table(control,rx)]

##             rx
## control      no success success
##   no success        173     102
##   success            69      57

Here is a figure that depicts the $2 \times 2$ matrix, providing a visualization of how the treatment and control group outcomes compare. (The code is in the addendum in case anyone wants to see the lengths I took to make this simple graphic.)

McNemar’s test

McNemar’s test requires the data to be in table format, and the test really only takes into consideration the cells which represent disagreement between treatment arms. In terms of the matrix above, this would be the lower left and upper right quadrants.

ddc <- dcast(dd, pair ~ rx, value.var = "y")
dmat <- ddc[, .N, keyby = .(`0`,`1`)][, matrix(N, 2, 2, byrow = TRUE)]

mcnemar.test(dmat)

## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  dmat
## McNemar's chi-squared = 6, df = 1, p-value = 0.01

Based on the p-value = 0.01, we would reject the null hypothesis that the intervention has no effect.
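
Since only the two discordant cells matter, the statistic can be reproduced by hand from the $2 \times 2$ table shown earlier. This base R sketch uses the standard continuity-corrected formula; the object names are mine.

```r
tab <- matrix(c(173, 102,
                 69,  57), nrow = 2, byrow = TRUE,
              dimnames = list(control = c("no success", "success"),
                              rx = c("no success", "success")))

n_01 <- tab[1, 2]   # control failed, intervention succeeded
n_10 <- tab[2, 1]   # control succeeded, intervention failed

# Continuity-corrected McNemar statistic uses only the discordant pairs
stat <- (abs(n_01 - n_10) - 1)^2 / (n_01 + n_10)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

c(statistic = stat, p.value = p_value)
```

The statistic works out to 1024/171, or about 5.99, which rounds to the 6 shown above. The 230 concordant pairs do not enter the calculation at all.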

Durkalski extension of McNemar’s test

Durkalski’s test also requires the data to be in tabular form, though there essentially needs to be a table for each cluster. The clust.bin.pair function needs us to separate the table into vectors a, b, c, and d, where each element in each of the vectors is a count for a specific cluster. Vector a is a collection of counts for the upper left hand quadrants, b is for the upper right hand quadrants, etc. We have 20 clusters, so each of the four vectors has length 20. Much of the work done in the code below is just getting the data in the right form for the function.

library(clust.bin.pair)

ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
ddc[, ypair :=  2*`0` + 1*`1`]
dvec <- ddc[, .N, keyby=.(cid, ypair)]
allpossible <- data.table(expand.grid(1:20, 0:3))
setnames(allpossible, c("cid","ypair")) 

setkey(dvec, cid, ypair)
setkey(allpossible, cid, ypair)

dvec <- dvec[allpossible]
dvec[is.na(N), N := 0]

a <- dvec[ypair == 0, N]
b <- dvec[ypair == 1, N]
c <- dvec[ypair == 2, N]
d <- dvec[ypair == 3, N]

clust.bin.pair(a, b, c, d, method = "durkalski")

## 
##  Durkalski's Chi-square test
## 
## data:  a, b, c, d
## chi-square = 5, df = 1, p-value = 0.03

Again, the p-value, though larger, leads us to reject the null.
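The coding step is the least obvious part of the reshaping: 2 times the control outcome plus the intervention outcome maps each pair to a value from 0 to 3, and a cluster that never produces a given value still needs a zero in the corresponding vector. Here is a toy illustration in base R with three made-up clusters (the data and object names are hypothetical). Fixing the factor levels inside table() is one alternative way to get the zero-filled counts that the join against allpossible provides above:

```r
# hypothetical paired outcomes: y0 = control, y1 = intervention
toy <- data.frame(
  cid = c(1, 1, 1, 2, 2, 3, 3, 3),
  y0  = c(0, 0, 1, 1, 1, 0, 0, 0),
  y1  = c(0, 1, 1, 0, 1, 1, 1, 0)
)

# 0 = both fail, 1 = only intervention succeeds,
# 2 = only control succeeds, 3 = both succeed
toy$ypair <- 2 * toy$y0 + toy$y1

# fixing the factor levels keeps empty cluster/category cells as zeros
counts <- table(cid   = factor(toy$cid, levels = 1:3),
                ypair = factor(toy$ypair, levels = 0:3))

# one count per cluster in each vector
a <- as.vector(counts[, "0"])
b <- as.vector(counts[, "1"])
c <- as.vector(counts[, "2"])
d <- as.vector(counts[, "3"])

rbind(a, b, c, d)
```

Each column of the printed matrix is one cluster's $2 \times 2$ table laid out as (a, b, c, d), and the four vectors together account for every pair.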

Conditional logistic regression

Conditional logistic regression is conditional on the pair. Since the pair is similar with respect to the matching variables, no further adjustment (beyond specifying the strata) is necessary.

library(survival)
summary(clogit(y ~ rx + strata(pair), data = dd))$coef["rx",]

##      coef exp(coef)  se(coef)         z  Pr(>|z|) 
##    0.3909    1.4783    0.1559    2.5076    0.0122
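With one intervention and one control patient per pair and no other covariates, the conditional likelihood depends only on the discordant pairs, and the estimate has a closed form: the odds ratio is the ratio of the two discordant counts. As a check (not part of the original analysis, and with object names of my own), the numbers above can be recovered from the 102 and 69 discordant pairs in the $2 \times 2$ table:

```r
n_rx_only  <- 102  # pairs where only the intervention patient succeeded
n_ctl_only <- 69   # pairs where only the control patient succeeded

# conditional maximum likelihood estimate for 1:1 matched pairs
or_hat  <- n_rx_only / n_ctl_only
log_or  <- log(or_hat)

# large-sample standard error of the log odds ratio
se_log_or <- sqrt(1 / n_rx_only + 1 / n_ctl_only)

# Wald test
z <- log_or / se_log_or
p <- 2 * pnorm(-abs(z))

round(c(coef = log_or, "exp(coef)" = or_hat, "se(coef)" = se_log_or,
        z = z, p = p), 4)
```

These values agree with the clogit output to the printed precision, which also makes clear that this model and McNemar's test are drawing on exactly the same information.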

Logistic regression with matching covariates adjustment

Using logistic regression should in theory provide a reasonable estimate of the treatment effect, though given that there is clustering, I wouldn’t expect the standard error estimates to be correct. Although we are not specifically modeling the matching, by including covariates used in the matching, we are effectively estimating a model that is conditional on the pair.

summary(glm(y~rx + age + male + bmi, data = dd, 
            family = "binomial"))$coef["rx",]

##   Estimate Std. Error    z value   Pr(>|z|) 
##     0.3679     0.1515     2.4285     0.0152

Generalized mixed effects model with matching covariates adjustment

The mixed effects model merely improves on the logistic regression model by ensuring that any clustering effects are reflected in the estimates.

library(lme4)

summary(glmer(y ~ rx + age + male + bmi + (1|cid), data= dd, 
              family = "binomial"))$coef["rx",]

##   Estimate Std. Error    z value   Pr(>|z|) 
##     0.4030     0.1586     2.5409     0.0111

Comparing the analytic approaches

To compare the methods, I generated 1000 data sets under each scenario. As I mentioned, I wanted to conduct the comparison under two scenarios. The first when there is no intervention effect, and the second with an effect (I will use the effect size used to generate the first data set).

I’ll start with no intervention effect. In this case, the outcome definition sets the true parameter of rx to 0.

defr <- defDataAdd(varname = "y", 
  formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0*rx + ceffect", 
  dist = "binary", link = "logit")

Using the updated definition, I generate 1000 datasets, and for each one, I apply the five analytic approaches. The results from each iteration are stored in a large list. (The code for the iterative process is shown in the addendum below.) As an example, here are the contents from the 711th iteration:

res[[711]]

## $clr
##       coef exp(coef) se(coef)      z Pr(>|z|)
## rx -0.0263     0.974    0.162 -0.162    0.871
## 
## $glm
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.6583     0.1247  -5.279 1.30e-07
## rx           -0.0309     0.1565  -0.198 8.43e-01
## age          -0.0670     0.0149  -4.495 6.96e-06
## male         -0.5131     0.1647  -3.115 1.84e-03
## bmi           0.1308     0.0411   3.184 1.45e-03
## 
## $glmer
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.7373     0.1888   -3.91 9.42e-05
## rx           -0.0340     0.1617   -0.21 8.33e-01
## age          -0.0721     0.0156   -4.61 4.05e-06
## male         -0.4896     0.1710   -2.86 4.20e-03
## bmi           0.1366     0.0432    3.16 1.58e-03
## 
## $mcnemar
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  dmat
## McNemar's chi-squared = 0.007, df = 1, p-value = 0.9
## 
## 
## $durk
## 
##  Durkalski's Chi-square test
## 
## data:  a, b, c, d
## chi-square = 0.1, df = 1, p-value = 0.7

Summary statistics

To compare the five methods, I am first looking at the proportion of iterations where the p-value is less than 0.05, in which case we would reject the null hypothesis. (In the case where the null is true, the proportion is the Type 1 error rate; when there is truly an effect, then the proportion is the power.) I am less interested in the hypothesis test than the bias and standard errors, but the first two methods only provide a p-value, so that is all we can assess them on.

Next, I calculate the bias, which is the average effect estimate minus the true effect. And finally, I evaluate the standard errors by looking at the estimated standard error as well as the observed standard error (which is the standard deviation of the point estimates).

pval <- data.frame(
  mcnm = mean(sapply(res, function(x) x$mcnemar$p.value <= 0.05)),
  durk = mean(sapply(res, function(x) x$durk$p.value <= 0.05)),
  clr =mean(sapply(res, function(x) x$clr["rx","Pr(>|z|)"] <= 0.05)),
  glm = mean(sapply(res, function(x) x$glm["rx","Pr(>|z|)"] <= 0.05)),
  glmer = mean(sapply(res, function(x) x$glmer["rx","Pr(>|z|)"] <= 0.05))
)

bias <- data.frame(
  clr = mean(sapply(res, function(x) x$clr["rx", "coef"])),
  glm = mean(sapply(res, function(x) x$glm["rx", "Estimate"])),
  glmer = mean(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)

se <- data.frame(
  clr = mean(sapply(res, function(x) x$clr["rx", "se(coef)"])),
  glm = mean(sapply(res, function(x) x$glm["rx", "Std. Error"])),
  glmer = mean(sapply(res, function(x) x$glmer["rx", "Std. Error"]))
)

obs.se <- data.frame(
  clr = sd(sapply(res, function(x) x$clr["rx", "coef"])),
  glm = sd(sapply(res, function(x) x$glm["rx", "Estimate"])),
  glmer = sd(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)

sumstat <- round(plyr::rbind.fill(pval, bias, se, obs.se), 3)
rownames(sumstat) <- c("prop.rejected", "bias", "se.est", "se.obs")
sumstat

##                mcnm  durk   clr   glm glmer
## prop.rejected 0.035 0.048 0.043 0.038 0.044
## bias             NA    NA 0.006 0.005 0.005
## se.est           NA    NA 0.167 0.161 0.167
## se.obs           NA    NA 0.164 0.153 0.164

In this first case, where the true underlying effect size is 0, the Type 1 error rate should be 0.05. The Durkalski test, the conditional logistic regression, and the mixed effects model are below that level but closer than the other two methods. All three models provide unbiased point estimates, but the standard logistic regression (glm) underestimates the standard errors. The results from the conditional logistic regression and the mixed effects model are quite close across the board.
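When reading rejection rates like these, it helps to keep the simulation error in mind. With 1000 replications and a true rejection probability of 0.05, the Monte Carlo standard error of an estimated proportion is $\sqrt{0.05 \times 0.95 / 1000}$. The short calculation below (my own addition, using the rates from the table above) expresses each observed rate as a number of standard errors away from 0.05:

```r
nsim  <- 1000   # number of simulated data sets
alpha <- 0.05   # nominal Type 1 error rate

# Monte Carlo standard error of a proportion estimated from nsim replications
mcse <- sqrt(alpha * (1 - alpha) / nsim)

# observed rejection proportions, copied from the summary table
observed <- c(mcnm = 0.035, durk = 0.048, clr = 0.043,
              glm = 0.038, glmer = 0.044)

# distance from the nominal rate in standard error units
zdev <- (observed - alpha) / mcse

round(mcse, 4)
round(zdev, 2)
```

The standard error is roughly 0.007, so only the McNemar rate of 0.035 sits (slightly) more than two standard errors below 0.05. Differences among the other four methods in this row are small relative to the simulation noise.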

Here are the summary statistics for a data set with an intervention effect of 0.45. The results are consistent with the “no effect” simulations, except that the standard logistic regression model exhibits some bias. In reality, this is not necessarily bias, but a different estimand. The model that ignores clustering is a marginal model (with respect to the site), whereas the conditional logistic regression and mixed effects models are conditional on the site. (I’ve described this phenomenon here and here.) We are interested in the conditional effect here, so that argues for the conditional models.

The conditional logistic regression and the mixed effects model yielded similar estimates, though the mixed effects model had slightly higher power, which is the reason I opted to use this approach at the end of the day.

##                mcnm  durk   clr    glm  glmer
## prop.rejected 0.766 0.731 0.784  0.766  0.796
## bias             NA    NA 0.000 -0.033 -0.001
## se.est           NA    NA 0.164  0.156  0.162
## se.obs           NA    NA 0.165  0.152  0.162
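The difference between the marginal and conditional estimands can be seen without any simulation. In the sketch below, I average the success probability over a normally distributed site effect and then compute the log odds ratio from the averaged probabilities. The intercept of -1 and the conditional effect of 0.45 come from the text, but the site-level standard deviation of 1 is purely an assumption for illustration (the value used in the simulations is not shown in this section), covariates are ignored, and marg_p is a helper I am defining here:

```r
expit <- function(x) 1 / (1 + exp(-x))

# success probability averaged over site effects b ~ N(0, sigma^2)
marg_p <- function(eta, sigma) {
  integrate(function(b) expit(eta + b) * dnorm(b, 0, sigma),
            lower = -Inf, upper = Inf)$value
}

beta_cond <- 0.45  # conditional (site-specific) log odds ratio
intercept <- -1    # from the outcome definition, covariates set aside
sigma     <- 1     # ASSUMED site-level standard deviation, for illustration only

p0 <- marg_p(intercept, sigma)              # control, averaged over sites
p1 <- marg_p(intercept + beta_cond, sigma)  # intervention, averaged over sites

# marginal log odds ratio implied by the conditional model
beta_marg <- qlogis(p1) - qlogis(p0)

round(c(conditional = beta_cond, marginal = beta_marg), 3)
```

The marginal log odds ratio comes out smaller than 0.45 whenever the site effects vary, so a model that ignores the sites will look "biased" downward when judged against the conditional value, even though it is estimating its own target correctly.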

In this last case, the true underlying data generating process still includes an intervention effect but no clustering. In this scenario, all of the analytic approaches yield similar estimates. However, since there is no guarantee that clustering is not a factor, the mixed effects model will still be the preferred approach.

##                mcnm  durk    clr    glm  glmer
## prop.rejected 0.802 0.774  0.825  0.828  0.830
## bias             NA    NA -0.003 -0.002 -0.001
## se.est           NA    NA  0.159  0.158  0.158
## se.obs           NA    NA  0.151  0.150  0.150

The DREAM Initiative is supported by the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases R01DK11048. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

Addendum: multiple datasets and model estimates

library(parallel)  # provides mclapply(), used here and in the final call

gen <- function(nclust, m) {
  
  dc <- genData(nclust, defc)
  di <- genCluster(dc, "cid", m, "id")
  di <- addColumns(defi, di)
  
  dr <- rbindlist(mclapply(1:nrow(dc), function(x) dmatch(di[cid == x])))
  dr <- addColumns(defr, dr)
  
  dr[]
  
}

iterate <- function(ncluster, m) {
  
  dd <- gen(ncluster, m)
  
  clrfit <- summary(clogit(y ~ rx + strata(pair), data = dd))$coef
  glmfit <- summary(glm(y~rx + age + male + bmi, data = dd, 
                        family = binomial))$coef
  mefit <- summary(glmer(y~rx + age + male + bmi + (1|cid), data= dd, 
                         family = binomial))$coef
  
  ## McNemar
  
  ddc <- dcast(dd, pair ~ rx, value.var = "y")
  dmat <- ddc[, .N, keyby = .(`0`,`1`)][, matrix(N, 2, 2, byrow = T)]

  mc <- mcnemar.test(dmat)
  
  # Clustered McNemar
  
  ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
  ddc[, ypair :=  2*`0` + 1*`1`]
  dvec <- ddc[, .N, keyby=.(cid, ypair)]
  allpossible <- data.table(expand.grid(1:ncluster, 0:3))  # use the argument, not a hard-coded 20
  setnames(allpossible, c("cid","ypair")) 

  setkey(dvec, cid, ypair)
  setkey(allpossible, cid, ypair)

  dvec <- dvec[allpossible]
  dvec[is.na(N), N := 0]

  a <- dvec[ypair == 0, N]
  b <- dvec[ypair == 1, N]
  c <- dvec[ypair == 2, N]
  d <- dvec[ypair == 3, N]
  
  durk <- clust.bin.pair(a, b, c, d, method = "durkalski")
  
  list(clr = clrfit, glm = glmfit, glmer = mefit,
       mcnemar = mc, durk = durk)

}

res <- mclapply(1:1000, function(x) iterate(20, 60))

Code to generate figure

library(ggmosaic)

dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(1,0), 
                          labels = c("success", "no success"))]
dpair[, rx := factor(`1`, levels = c(0, 1), 
                     labels = c("no success", "success"))]

p <- ggplot(data = dpair) +
  geom_mosaic(aes(x = product(control, rx)))

pdata <- data.table(ggplot_build(p)$data[[1]])
pdata[, mcnemar := factor(c("diff","same","same", "diff"))]

textloc <- pdata[c(1,4), .(x=(xmin + xmax)/2, y=(ymin + ymax)/2)]

ggplot(data = pdata) +
  geom_rect(aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, 
                        fill = mcnemar)) +
  geom_label(data = pdata, 
          aes(x = (xmin+xmax)/2, y = (ymin+ymax)/2, label=.wt),
          size = 3.2) +
  scale_x_continuous(position = "top", 
                     breaks = textloc$x, 
                     labels = c("no success", "success"), 
                     name = "intervention",
                     expand = c(0,0)) +
  scale_y_continuous(breaks = textloc$y, 
                     labels = c("success", "no success"),
                     name = "control",
                     expand = c(0,0)) +
  scale_fill_manual(values = c("#6b5dd5", "grey80")) +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        axis.ticks = element_blank(),
        axis.text.x = element_text(angle = 0, hjust = 0.5),
        axis.text.y = element_text(angle = 90, hjust = 0.5)
  )

Original matching algorithm

library(Matching)  # provides Match()

dmatch <- function(dsamp) {
  
  dsamp[, rx := 0]
  dused <- NULL
  drand <- NULL
  dcntl <- NULL
  
  while (nrow(dsamp) > 1) {
    
    selectRow <- sample(1:nrow(dsamp), 1)
    
    dsamp[selectRow, rx := 1]
    
    myTr <- dsamp[, rx]
    myX <- as.matrix(dsamp[, .(male, age, bmi)])
    
    match.dt <- Match(Tr = myTr, X = myX, 
                      caliper = c(0, 0.50, .50), ties = FALSE)
    
    if (length(match.dt) == 1) {  # no match
      
      dused <- rbind(dused, dsamp[selectRow])
      dsamp <- dsamp[-selectRow, ]
      
    } else {                      # match
      
      trt <- match.dt$index.treated
      ctl <- match.dt$index.control
      
      drand <- rbind(drand, dsamp[trt])
      dcntl <- rbind(dcntl, dsamp[ctl])
      
      dsamp <- dsamp[-c(trt, ctl)]
      
    }
  }
  
  dcntl[, pair := paste0(cid, ".", formatC(1:.N, width=2, flag="0"))]
  drand[, pair := paste0(cid, ".", formatC(1:.N, width=2, flag="0"))]
  
  rbind(dcntl, drand)
  
}

simstudy updated to version 0.1.14: implementing Markov chains

keith.goldfeld@nyumc.org (Keith Goldfeld) — Tue, 20 Aug 2019 00:00:00 +0000

I’m developing study simulations that require me to generate a sequence of health statuses for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends only on the previous health status (in the immediate past). While there are packages to do this sort of thing (see for example the markovchain package), I hadn’t yet stumbled upon them while I was tackling my problem. So, I wrote my own functions, which I’ve now incorporated into the latest version of simstudy that is now available on CRAN. As a way of announcing the new release, here is a brief overview of Markov chains and the new functions. (See here for a more complete list of changes.)

Markov processes

The key “parameter” of a stochastic Markov process is the transition matrix, which defines the probability of moving from one state to another (or remaining in the same state). Each row of the matrix is indexed by the current state, while the columns are indexed by the target state. The values of the matrix represent the probabilities of transitioning from the current state to the target state. The sum of the probabilities across each row must equal one.

In the transition matrix below, there are three states $(1, 2, 3)$. The probability of moving from state 1 to state 3 is represented by $p_{13}$. Likewise the probability of moving from state 3 to state 2 is $p_{32}$. And $\sum_{j=1}^3 p_{ij} = 1$ for all $i \in (1,2,3)$.

\[ \left( \begin{matrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{matrix} \right ) \]

Here’s a possible $3 \times 3$ transition matrix:

\[ \left( \begin{matrix} 0.5 & 0.4 & 0.1 \\ 0.2 & 0.5 & 0.3 \\ 0.0 & 0.0 & 1.0 \end{matrix} \right ) \]

In this case, the probability of moving from state 1 to state 2 is $40\%$, whereas there is no possibility that you can move from 3 to 1 or 2. (State 3 is considered to be an “absorbing” state since it is not possible to leave; if we are talking about health status, state 3 could be death.)
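Because the next state depends only on the current one, the probabilities of being in each state several steps later are given by a power of the transition matrix. The sketch below checks that each row of the example matrix sums to one and then computes the five-step probabilities by repeated multiplication (n_step is a helper I am defining here for illustration, not a simstudy function):

```r
P <- matrix(c(0.5, 0.4, 0.1,
              0.2, 0.5, 0.3,
              0.0, 0.0, 1.0), 3, 3, byrow = TRUE)

# each row is a probability distribution over the next state
stopifnot(all(abs(rowSums(P) - 1) < 1e-12))

# k-step transition probabilities: multiply P by itself k times
n_step <- function(P, k) {
  out <- diag(nrow(P))
  for (i in seq_len(k)) out <- out %*% P
  out
}

P5 <- n_step(P, 5)
round(P5, 3)
```

Row 1 of P5 gives the distribution of the state five transitions after starting in state 1. Row 3 stays at (0, 0, 1), which is what makes state 3 absorbing, and the third column grows with each additional step as more individuals reach that state.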

function genMarkov

The new function genMarkov generates a random sequence for the specified number of individuals. (The sister function addMarkov is quite similar, though it allows users to add a Markov chain to an existing data set.) In addition to defining the transition matrix, you need to indicate the length of the chain to be generated for each simulated unit or person. The data can be returned either in long or wide form, depending on how you’d ultimately like to use the data. In the first case, I am generating wide format data for sequences of length 6 for 12 individuals:

library(simstudy)
set.seed(3928398)

tmatrix <- matrix(c(0.5, 0.4, 0.1,
                    0.2, 0.5, 0.3,
                    0.0, 0.0, 1.0), 3, 3, byrow = TRUE)

dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = TRUE)
dd

##     id S1 S2 S3 S4 S5 S6
##  1:  1  1  2  2  1  2  2
##  2:  2  1  1  2  2  2  3
##  3:  3  1  1  2  3  3  3
##  4:  4  1  2  2  1  1  2
##  5:  5  1  1  2  2  2  3
##  6:  6  1  1  1  1  1  1
##  7:  7  1  1  1  1  2  2
##  8:  8  1  1  1  1  1  1
##  9:  9  1  1  2  3  3  3
## 10: 10  1  1  2  3  3  3
## 11: 11  1  2  2  2  2  1
## 12: 12  1  2  1  1  2  1
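To make the mechanics concrete, here is a base R sketch of how a single chain can be simulated by hand: start in state 1 (as every sequence above does) and, at each step, draw the next state using the row of the transition matrix for the current state. This is only an illustration of the idea. It is not how genMarkov is implemented, sim_chain is a name I made up, and the draws will not match the output above:

```r
P <- matrix(c(0.5, 0.4, 0.1,
              0.2, 0.5, 0.3,
              0.0, 0.0, 1.0), 3, 3, byrow = TRUE)

# simulate one chain of length chain_len, starting in state `start`
sim_chain <- function(P, chain_len, start = 1L) {
  states <- integer(chain_len)
  states[1] <- start
  for (t in seq_len(chain_len)[-1]) {
    # the current state's row supplies the probabilities for the next state
    states[t] <- sample.int(ncol(P), size = 1, prob = P[states[t - 1], ])
  }
  states
}

set.seed(1)
chains <- t(replicate(12, sim_chain(P, 6)))  # one row per individual
colnames(chains) <- paste0("S", 1:6)
chains
```

Any simulated individual who reaches state 3 remains there for the rest of the sequence, just as in the genMarkov output.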

In the long format, the output is multiple records per id. This could be useful if you are going to be estimating longitudinal models, or as in this case, creating longitudinal plots:

set.seed(3928398)

dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = FALSE)

Here are the resulting data (for the first two individuals):

dd[id %in% c(1,2)]

##     id period state
##  1:  1      1     1
##  2:  1      2     2
##  3:  1      3     2
##  4:  1      4     1
##  5:  1      5     2
##  6:  1      6     2
##  7:  2      1     1
##  8:  2      2     1
##  9:  2      3     2
## 10:  2      4     2
## 11:  2      5     2
## 12:  2      6     3
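The two layouts hold the same information. As a quick illustration (using base R's reshape rather than anything in simstudy), the first two rows of the wide output shown earlier can be converted into exactly these twelve long-format records:

```r
# first two individuals from the wide-format output
wide <- data.frame(id = 1:2,
                   S1 = c(1, 1), S2 = c(2, 1), S3 = c(2, 2),
                   S4 = c(1, 2), S5 = c(2, 2), S6 = c(2, 3))

# stack the six state columns into one record per id and period
long <- reshape(wide, direction = "long",
                varying = list(paste0("S", 1:6)), v.names = "state",
                timevar = "period", times = 1:6, idvar = "id")

long <- long[order(long$id, long$period), ]
rownames(long) <- NULL
long
```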

And here’s a plot for each individual, showing their health status progressions over time:

I do plan on sharing the details of the simulation that inspired the creation of these new functions, though I am still working out a few things. In the meantime, as always, if anyone has any suggestions or questions about simstudy, definitely let me know.