ouR data generation
https://www.rdatagen.net/
Recent content on ouR data generationHugo -- gohugo.iokeith.goldfeld@nyumc.org (Keith Goldfeld)keith.goldfeld@nyumc.org (Keith Goldfeld)Tue, 19 Jan 2021 00:00:00 +0000Finding answers faster for COVID-19: an application of Bayesian predictive probabilities
https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/
Tue, 19 Jan 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/
<script src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/header-attrs/header-attrs.js"></script>
<p>As we evaluate therapies for COVID-19 to help improve outcomes during the pandemic, researchers need to be able to make recommendations as quickly as possible. There really is no time to lose. The Data & Safety Monitoring Board (DSMB) of <a href="https://bit.ly/3qhY2f5" target="_blank">COMPILE</a>, a prospective individual patient data meta-analysis, recognizes this. They are regularly monitoring the data to determine as quickly as possible if there is a sufficiently strong signal to indicate effectiveness of convalescent plasma (CP) for hospitalized patients not on ventilation.</p>
<p>How much data is enough to draw a conclusion? We know that at some point in the next few months, many if not all of the studies included in the meta-analysis will reach their target enrollment, and will stop recruiting new patients; at that point, the meta-analysis data set will be complete. Before that end-point, an interim DSMB analysis might indicate there is a high probability that CP is effective although it does not meet the pre-established threshold of 95%. If we know the specific number of patients that will ultimately be included in the final data set, we can predict the probability that the findings will put us over that threshold, and possibly enable a recommendation. If this probability is not too low, the DSMB may decide it is worth waiting for the complete results before drawing any conclusions.</p>
<p>Predicting the probability of success (or futility) is done using the most recent information collected from the study, which includes observed data, the parameter estimates, and the uncertainty surrounding these estimates (which is reflected in the posterior probability distribution).</p>
<p>This post provides an example using a simulated data set to show how this prediction can be made.</p>
<div id="determining-success" class="section level2">
<h2>Determining success</h2>
<p>In this example, the outcome is the WHO 11-point ordinal scale for clinical status at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. As in COMPILE, I’ll use a Bayesian proportional odds model to assess the effectiveness of CP. The measure of effectiveness is an odds ratio (OR) that compares the cumulative odds of having a worse outcome for the treated group compared to the cumulative odds for the control group:</p>
<p><span class="math display">\[
\text{Cumulative odds for level } k \text{ in treatment arm } j =\frac{P(Y_{ij} \ge k)}{P(Y_{ij} \lt k)}, \ k \in \{1,\dots, 10\}
\]</span></p>
<p>The goal is to reduce the odds of having a bad outcome, so a successful therapy is one where <span class="math inline">\(OR \lt 1\)</span>. In a Bayesian context, we estimate the posterior probability distribution of the <span class="math inline">\(OR\)</span> (based on prior assumptions before we have collected any data). We will recommend the therapy in the case that most of the probability density lies to the left of 1; in particular we will claim success only when <span class="math inline">\(P(OR \lt 1) > 0.95\)</span>. For example, in the figure the posterior distribution on top would lead us to consider the therapy successful since 95% of the density falls below 1, whereas the distribution on the bottom would not:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
</div>
<div id="data-set" class="section level2">
<h2>Data set</h2>
<p>This data set here is considerably simpler than the COMPILE data that has motivated all of this. Rather than structuring this example as a multi-study data set, I am assuming a rather simple two-arm design without any sort of clustering. I am including two binary covariates related to sex and age. The treatment in this case reduces the odds of worse outcomes (or increases the odds of better outcomes). For more detailed discussion of generating ordinal outcomes, see this earlier <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/" target="_blank">post</a> (but note that I have flipped direction of cumulative probability in the odds formula).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
def1 <- defDataAdd(varname="male", formula="0.7", dist = "binary")
def1 <- defDataAdd(def1, varname="over69", formula="0.6", dist = "binary")
def1 <- defDataAdd(def1,
varname="z", formula="0.2*male + 0.3*over69 - 0.3*rx", dist = "nonrandom")
baseprobs <- c(0.10, 0.15, 0.08, 0.07, 0.08, 0.08, 0.11, 0.10, 0.09, 0.08, 0.06)
RNGkind("L'Ecuyer-CMRG")
set.seed(9121173)
dd <- genData(450)
dd <- trtAssign(dd, nTrt = 2, grpName = "rx")
dd <- addColumns(def1, dd)
dd <- genOrdCat(dd, adjVar = "z", baseprobs = baseprobs, catVar = "y")</code></pre>
<p>Here is a plot of the cumulative proportions by treatment arm for the first 450 patients in the (simulated) trial. The treatment arm has more patients with lower WHO-11 scores, so for the most part lies above the control arm line. (This may be a little counter-intuitive, so it may be worthwhile to think about it for a moment.)</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="estimate-a-bayes-ordinal-cumulative-model" class="section level2">
<h2>Estimate a Bayes ordinal cumulative model</h2>
<p>With the data from 450 patients in hand, the first step is to estimate a Bayesian proportional odds model, which I am doing in <code>Stan</code>. I use the package <code>cmdstanr</code> to interface between <code>R</code> and <code>Stan</code>.</p>
<p>Here is the model:</p>
<p><span class="math display">\[
\text{logit}\left(P(y_{i}) \ge k \right) = \tau_k + \beta_1 I(\text{male}) + \beta_2 I(\text{over69}) + \delta T_i, \ \ \ k \in \{ 1,\dots,10 \}
\]</span></p>
<p>where <span class="math inline">\(T_i\)</span> is the treatment indicator for patient <span class="math inline">\(i\)</span>, and <span class="math inline">\(T_i = 1\)</span> when patient <span class="math inline">\(i\)</span> receives CP. <span class="math inline">\(\delta\)</span> represents the log odds ratio, so <span class="math inline">\(OR = e^{\delta}\)</span>. I’ve included the Stan code for the model in the the first <a href="#addendumA">addendum</a>.</p>
<pre class="r"><code>library(cmdstanr)
dt_to_list <- function(dx) {
N <- nrow(dx) ## number of observations
L <- dx[, length(unique(y))] ## number of levels of outcome
y <- as.numeric(dx$y) ## individual outcome
rx <- dx$rx ## treatment arm for individual
x <- model.matrix(y ~ factor(male) + factor(over69), data = dx)[, -1]
D <- ncol(x)
list(N=N, L=L, y=y, rx=rx, x=x, D=D)
}
mod <- cmdstan_model("pprob.stan")
fit <- mod$sample(
data = dt_to_list(dd),
seed = 271263,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 4 finished in 52.3 seconds.
## Chain 1 finished in 52.9 seconds.
## Chain 3 finished in 53.4 seconds.
## Chain 2 finished in 55.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 53.6 seconds.
## Total execution time: 55.8 seconds.</code></pre>
</div>
<div id="extract-posterior-distribution" class="section level2">
<h2>Extract posterior distribution</h2>
<p>Once the model is fit, our primary interest is whether we can make a recommendation about the therapy. A quick check to verify if <span class="math inline">\(P(OR < 1) > 0.95\)</span> confirms that we are not there yet.</p>
<pre class="r"><code>library(posterior)
draws_df <- as_draws_df(fit$draws())
draws_dt <- data.table(draws_df[-grep("^yhat", colnames(draws_df))])
mean(draws_dt[, OR < 1])</code></pre>
<pre><code>## [1] 0.89</code></pre>
<p>A plot that shows the bottom 95% portion of the density in blue makes it clear that the threshold has not been met:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="the-elements-of-the-predictive-probability" class="section level2">
<h2>The elements of the predictive probability</h2>
<p>We have collected complete data from 450 patients out of an expect 500, though we are not yet ready declare success. An interesting question to ask at this point is “<em>given what we have observed up until this point, what is the probability that we will declare success after 50 additional patients are included in the analysis?</em>” If the probability is sufficiently high, we may decide to delay releasing the inconclusive results pending the updated data set. (On the other hand, if the probability is quite low, there may be no point in delaying.)</p>
<p>The prediction incorporates three potential sources of uncertainty. First, there is the uncertainty regarding the parameters, which is described by the posterior distribution. Second, even if we knew the parameters with certainty, the outcome remains stochastic (i.e. not pre-determined conditional on the parameters). Finally, we don’t necessarily know the characteristics of the remaining patients (though we may have some or all of that information if recruitment has been finished but complete follow-up has not).</p>
<p>In the algorithm that follows - the steps follow from these three elements of uncertainty:</p>
<ol style="list-style-type: decimal">
<li>Generate 50 new patients by bootstrap sampling (with replacement) from the observed patients. The distribution of covariates of the new 50 patients will be based on the original 450 patients.</li>
<li>Make a single draw from the posterior distribution of our estimates to generate a set of parameters.</li>
<li>Using the combination of new patients and parameters, generate ordinal outcomes for each of the 50 patients.</li>
<li>Combine this new data set with the 450 existing patients to create a single analytic file.</li>
<li>Fit a new Bayesian model with the 500-patient data set.</li>
<li>Record <span class="math inline">\(P(OR < 1)\)</span> based on this new posterior distribution. If <span class="math inline">\(P(OR < 1)\)</span> is <span class="math inline">\(\gt 95\%\)</span>, we consider the result to be a “success”, otherwise it is “not a success”.</li>
</ol>
<p>We repeat this cycle, say 1000 times. <em>The proportion of cycles that are counted as a success represents the predictive probability of success</em>.</p>
<div id="step-1-new-patients" class="section level4">
<h4>Step 1: new patients</h4>
<pre class="r"><code>library(glue)
dd_new <- dd[, .(id = sample(id, 25, replace = TRUE)), keyby = rx]
dd_new <- merge(dd[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id:= (nrow(dd) + 1):(nrow(dd) +.N)]</code></pre>
</div>
<div id="step-2-draw-set-of-parameters" class="section level4">
<h4>Step 2: draw set of parameters</h4>
<pre class="r"><code>draw <- as.data.frame(draws_dt[sample(.N, 1)])</code></pre>
<p>The coefficients <span class="math inline">\(\hat{\beta}_1\)</span> (male), <span class="math inline">\(\hat{\beta}_2\)</span> (over69), and <span class="math inline">\(\hat{\delta}\)</span> (treatment effect) are extracted from the draw from the posterior:</p>
<pre class="r"><code>D <- dt_to_list(dd)$D
beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))
coefs</code></pre>
<pre><code>## [,1]
## [1,] 0.22
## [2,] 0.60
## [3,] -0.19</code></pre>
<p>Using the draws of the <span class="math inline">\(\tau_k\)</span>’s, I’ve calculated the corresponding probabilities that can be used to generate the ordinal outcome for the new observations:</p>
<pre class="r"><code>tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)
cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop)
baseline</code></pre>
<pre><code>## [1] 0.117 0.136 0.123 0.076 0.089 0.101 0.114 0.102 0.054 0.040 0.048</code></pre>
</div>
<div id="step-3-generate-outcome-using-coefficients-and-baseline-probabilities" class="section level4">
<h4>Step 3: generate outcome using coefficients and baseline probabilities</h4>
<pre class="r"><code>zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs
setkey(dd_new, id)
dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")</code></pre>
</div>
<div id="step-4-combine-new-with-existing" class="section level4">
<h4>Step 4: combine new with existing</h4>
<pre class="r"><code>dx <- rbind(dd, dd_new)</code></pre>
</div>
<div id="step-5-fit-model" class="section level4">
<h4>Step 5: fit model</h4>
<pre class="r"><code>fit_pp <- mod$sample(
data = dt_to_list(dx),
seed = 737163,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 66.5 seconds.
## Chain 4 finished in 67.3 seconds.
## Chain 3 finished in 67.5 seconds.
## Chain 1 finished in 68.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 67.4 seconds.
## Total execution time: 68.2 seconds.</code></pre>
</div>
<div id="step-6-assess-success-for-single-iteration" class="section level4">
<h4>Step 6: assess success for single iteration</h4>
<pre class="r"><code>draws_pp <- data.table(as_draws_df(fit_pp$draws()))
draws_pp[, mean(OR < 1)]</code></pre>
<pre><code>## [1] 0.79</code></pre>
</div>
</div>
<div id="estimating-the-predictive-probability" class="section level2">
<h2>Estimating the predictive probability</h2>
<p>The next step is to pull all these elements together in a single function that we can call repeatedly to estimate the predictive probability of success. This probability is estimated by calculating the proportion of iterations that result in a success.</p>
<p>Computing resources required for this estimation might be quite substantial. If we iterate 1000 times, we need to fit that many Bayesian models. And 1000 Bayesian model estimates could be prohibitive - a high performance computing cluster (HPC) may be necessary. (I touched on this <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">earlier</a> when I describe exploring the characteristic properties of Bayesian models.) I have provided the code below in the second <a href="#addendumB">addendum</a> in case any readers are interested in trying to implement on an HPC.</p>
<p>I’ll conclude with a figure that shows how predictive probabilities can vary depending on the observed sample size and <span class="math inline">\(P(OR < 1)\)</span> for the interim data set. Based on the data generating process I’ve used here, if we observe a <span class="math inline">\(P(OR < 1) = 90\%\)</span> at an interim look after 250 patients, it is considerably more probable that we will end up over 95% than if we observe that same probability at an interim look after 450 patients. This makes sense, of course, since the estimate at 450 patients will have less uncertainty, and adding 50 patients will not likely change to results dramatically. The converse is true after 250 patients.</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>Ultimately, the interpretation of the predictive probability will depend on the urgency of making a recommendation, the costs of waiting, the costs of deciding to soon, and other factors specific to the trial and those making the decisions.</p>
<p><a name="addendumA"></a></p>
<p> </p>
</div>
<div id="addendum-a-stan-code" class="section level2">
<h2>Addendum A: stan code</h2>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=2> L; // number of WHO categories
int<lower=1,upper=L> y[N]; // vector of categorical outcomes
int<lower=0,upper=1> rx[N]; // treatment or control
int<lower=1> D; // number of covariates
row_vector[D] x[N]; // matrix of covariates N x D matrix
}
parameters {
vector[D] beta; // covariate estimates
real delta; // overall control effect
ordered[L-1] tau; // cut-points for cumulative odds model ([L-1] vector)
}
transformed parameters{
vector[N] yhat;
for (i in 1:N){
yhat[i] = x[i] * beta + rx[i] * delta;
}
}
model {
// priors
beta ~ student_t(3, 0, 10);
delta ~ student_t(3, 0, 2);
tau ~ student_t(3, 0, 8);
// outcome model
for (i in 1:N)
y[i] ~ ordered_logistic(yhat[i], tau);
}
generated quantities {
real OR = exp(delta);
}</code></pre>
<p><a name="addendumB"></a></p>
<p> </p>
</div>
<div id="appendix-b-hpc-code" class="section level2">
<h2>Appendix B: HPC code</h2>
<pre class="r"><code>library(slurmR)
est_from_draw <- function(n_draw, Draws, dd_obs, D, s_model) {
set_cmdstan_path(path = "/.../cmdstan/2.25.0")
dd_new <- dd_obs[, .(id = sample(id, 125, replace = TRUE)), keyby = rx]
dd_new <- merge(dd_obs[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id:= (nrow(dd_obs) + 1):(nrow(dd_obs) +.N)]
draw <- as.data.frame(Draws[sample(.N, 1)])
beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))
tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)
cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop)
zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs
setkey(dd_new, id)
dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")
dx <- rbind(dd_obs, dd_new)
fit_pp <- s_model$sample(
data = dt_to_list(dx),
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)
draws_pp <- data.table(as_draws_df(fit_pp$draws()))
return(data.table(n_draw, prop_success = draws_pp[, mean(OR < 1)]))
}
job <- Slurm_lapply(
X = 1L:1080L,
FUN = est_from_draw,
Draws = draws_dt,
dd_obs = dd,
D = D,
s_model = mod,
njobs = 90L,
mc.cores = 4L,
job_name = "i_pp",
tmp_path = "/.../scratch",
plan = "wait",
sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
export = c("dt_to_list"),
overwrite = TRUE
)
job
res <- Slurm_collect(job)
rbindlist(res)[, mean(prop_success >= 0.95)]</code></pre>
</div>
Coming soon: effortlessly generate ordinal data without assuming proportional odds
https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/
Tue, 05 Jan 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I’m starting off 2021 with my 99th post ever to introduce a new feature that will be incorporated into <code>simstudy</code> soon to make it a bit easier to generate ordinal data without requiring an assumption of proportional odds. I should wait until this feature has been incorporated into the development version, but I want to put it out there in case any one has any further suggestions. In any case, having this out in plain view will motivate me to get back to work on the package.</p>
<p>In the <a href="https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/" target="_blank">past</a>, I have generated ordinal without the non-proportional odds assumption, but it was a bit cumbersome. I’ve wanted to simplify the approach by incorporating the functionality directly into <code>genOrdCat</code>. While this last step is not complete, it is very close. For now, there is a temporary function <code>genOrdCatNP</code>.</p>
<div id="starting-with-the-proportional-odds-assumption" class="section level2">
<h2>Starting with the proportional odds assumption</h2>
<p>First, I am generating a data set using <code>genOrdCat</code> and an assumption of proportionality. This will be a large data set to eliminate any sampling issues when summarizing the odds ratios. (Please take a look <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/" target="_blank">here</a> for a little more detail on how how the odds and cumulative odds are defined.) The data will include a treatment or exposure indicator and an ordinal categorical outcome with four possible responses. In this case, the log odds-ratio comparing the odds for the treatment group with the control group will be -1.3, which translates to an odds ratio (OR) of 0.27.</p>
<pre class="r"><code>d1 <- defDataAdd(varname = "z", formula = "rx * 1.3", dist = "nonrandom")
set.seed(382763)
dd <- genData(100000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)
baseprobs <- c(.4, .3, .2, .1)
dd_p <- genOrdCat(dtName = dd, adjVar = "z", baseprobs = baseprobs)</code></pre>
<p>Here is a calculation of the cumulative proportions and odds, as well as the odds ratios, which are all close to -1.3:</p>
<pre class="r"><code>(prop <- prop.table(dd_p[, table(rx, cat)], 1))</code></pre>
<pre><code>## cat
## rx 1 2 3 4
## 0 0.40 0.30 0.20 0.10
## 1 0.15 0.24 0.32 0.29</code></pre>
<pre class="r"><code>(cumprop <- data.table(apply(prop, 1, cumsum)))</code></pre>
<pre><code>## 0 1
## 1: 0.4 0.15
## 2: 0.7 0.39
## 3: 0.9 0.71
## 4: 1.0 1.00</code></pre>
<pre class="r"><code>(cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))])</code></pre>
<pre><code>## odds0 odds1
## 1: 0.67 0.18
## 2: 2.32 0.63
## 3: 8.84 2.45
## 4: Inf Inf</code></pre>
<pre class="r"><code>cumodds[1:3, odds1/odds0]</code></pre>
<pre><code>## [1] 0.27 0.27 0.28</code></pre>
<p>To visualize cumulative proportional odds, here’s a plot using the really nice <code>likert</code> package:</p>
<pre class="r"><code>library(likert)
item <- data.frame(dd_p[, "cat"])
names(item) <- "response"
bin.grp <- factor(dd_p[, rx])
likert.bin <- likert(item, grouping = bin.grp)
plot(likert.bin) + ggtitle("Proportional odds")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/index.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="its-now-simple-to-relax-the-proportionality-assumption" class="section level2">
<h2>It’s now simple to relax the proportionality assumption</h2>
<p>Two arguments are needed to relax the assumption: <code>npVar</code> and <code>npAdj</code>. The user specifies the first argument to indicate what variable is driving the lack of proportionality. In this case, it will be the treatment itself, since that is the only covariate in the data set. The second argument is a vector that represents the violation of the assumption at each level of the ordinal measure. The length of this vector is currently the same length as the number of ordinal categories, though the last value will always be 0.</p>
<p>In this example, the cumulative log-odds ratio for the first category will be 0.3 points more than the proportional assumption of -1.3, so should be quite close to -1.0 (or 0.37 on the OR scale). The cumulative log-odds ratio (and OR) for the second category will be the same as in the case of proportionality. And, the cumulative log-odds ratio for the third category will be 1.0 less than -1.3, or around -2.3 (which translates to 0.10 on the OR scale)</p>
<pre class="r"><code>npAdj <- c(-0.3, 0.0, 1.0, 0)
dd_np <- genOrdCatNP(dtName = dd, adjVar = "z", baseprobs = baseprobs,
npVar = "rx", npAdj = npAdj)
cumprop <- data.table(apply(prop.table(dd_np[, table(rx, cat)], 1), 1, cumsum))
cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))]
cumodds[1:3, odds1/odds0]</code></pre>
<pre><code>## [1] 0.37 0.27 0.10</code></pre>
<p>And here is what non-proportional cumulative odds looks like:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/index.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
Constrained randomization to evaulate the vaccine rollout in nursing homes
https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/
Tue, 22 Dec 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>On an incredibly heartening note, two COVID-19 vaccines have been approved for use in the US and other countries around the world. More are possibly on the way. The big challenge, at least here in the United States, is to convince people that these vaccines are safe and effective; we need people to get vaccinated as soon as they are able to slow the spread of this disease. I for one will not hesitate for a moment to get a shot when I have the opportunity, though I don’t think biostatisticians are too high on the priority list.</p>
<p>Those who <em>are</em> at the top of the priority list are staff and residents of nursing homes. Unfortunately, within these communities, there are pockets of ambivalence or skepticism about vaccines in general, and the COVID vaccine in particular. In the past, influenza vaccine rates for residents at some facilities could be as low as 50%. I am part of an effort organized by researchers affiliated with the <a href="https://impactcollaboratory.org/" target="_blank">IMPACT Collaboratory</a> to figure out a way to increase these numbers.</p>
<p>This effort involves a cluster randomized trial (CRT) to evaluate the effectiveness of a wide-ranging outreach program designed to encourage nursing home residents to get their shots. The focus is on facilities that have relatively high proportions of African-American and Latinx residents, because these facilities have been among the least successful in the past in convincing residents to get vaccinated. The outcome measure of the trial, which will be measured at the individual level, will be the probability of vaccination within 5 weeks of being available at the nursing home.</p>
<p>The nursing homes in the study are members of one of four national nursing home networks or corporations. In this CRT, the randomization will be stratified by these networks and by the proportion of African-American and Latinx residents. We are defining the race/ethnicity strata using cutoffs of proportions: <25%, 25% to 40%, and >40%. We want our randomization to provide balance with respect to racial/ethnic composition in the intervention and control arms within each individual stratum. However, we are concerned that the strata with fewer nursing homes have a high risk of imbalance just by chance. Constrained randomization is one possible way to mitigate this risk, which is the focus here in this post.</p>
<div id="constrained-randomization" class="section level3">
<h3>Constrained randomization</h3>
<p>The basic idea is pretty simple. We generate a large number of possible randomization lists based on the requirements of the study design. For each randomization, we evaluate whether the balancing criteria have been met; by selecting only the subset of randomizations that pass this test, we create a sample of eligible randomizations. With this list of possible randomizations is in hand, we randomly select one, which becomes the actual randomization. Because we have limited the final selection only to possible randomizations that have been vetted for balance (or whatever criteria we require), we are guaranteed to satisfy the pre-specified criteria.</p>
</div>
<div id="simulated-data" class="section level3">
<h3>Simulated data</h3>
<p>I am using a single simulated data set to illustrate the constrained randomization process. Using the <code>simstudy</code> package, creating this data set is a two-step process of <em>defining</em> the data and then <em>generating</em> the data.</p>
<div id="defining-the-data" class="section level4">
<h4>Defining the data</h4>
<p>There will be a total of 200 nursing homes in 3 (rather than 4) networks. Just as in the real data, racial/ethnic composition will differ by network (because they are based in different parts of the country). And the networks are different sizes. The proportions of African-American/Latinx residents are generated using the <code>beta</code> distribution, which ranges from 0 to 1. In <code>simstudy</code>, the beta distribution is parameterized with a mean (specified in the <code>formula</code> argument) and dispersion (specified in the <code>variance</code> argument. See <a href="https://kgoldfeld.github.io/simstudy/articles/simstudy.html" target="_blank">this</a> for more details on the <em>beta</em> distribution.</p>
<pre class="r"><code>library(simstudy)
library(data.table)
def <- defData(varname = "network", formula = "0.3;0.5;0.2",
dist = "categorical", id = "site")
defC <- defCondition(condition = "network == 1",
formula = "0.25", variance = 10, dist = "beta")
defC <- defCondition(defC, condition = "network == 2",
formula = "0.3", variance = 7.5, dist = "beta")
defC <- defCondition(defC, condition = "network == 3",
formula = "0.35", variance = 5, dist = "beta")</code></pre>
</div>
<div id="generating-the-data" class="section level4">
<h4>Generating the data</h4>
<pre class="r"><code>set.seed(2323761)
dd <- genData(200, def, id = "site")
dd <- addCondition(defC, dd, newvar = "prop")
dd[, stratum := cut(prop, breaks = c(0, .25, .4, 1),
include.lowest = TRUE, labels = c(1, 2, 3))]
dd</code></pre>
<pre><code>## site prop network stratum
## 1: 1 0.340 2 2
## 2: 2 0.181 2 1
## 3: 3 0.163 2 1
## 4: 4 0.099 3 1
## 5: 5 0.178 2 1
## ---
## 196: 196 0.500 2 3
## 197: 197 0.080 3 1
## 198: 198 0.479 3 3
## 199: 199 0.071 2 1
## 200: 200 0.428 2 3</code></pre>
</div>
</div>
<div id="randomization" class="section level3">
<h3>Randomization</h3>
<p>Now that we have a data set in hand, we can go ahead an randomize. I am using the <code>simstudy</code> function <code>trtAssign</code>, which allows us to specify the strata as well as the the ratio of controls to intervention facilities. In this case, we have a limit in the number of sites at which we can implement the intervention. In this simulation, I assume that we’ll randomize 150 sites to control, and 50 to the intervention, a 3:1 ratio.</p>
<pre class="r"><code>dr <- trtAssign(dd, nTrt = 2, balanced = TRUE, strata = c("network", "stratum"),
ratio = c(3, 1), grpName = "rx")</code></pre>
<p>We want to inspect an average proportion of African-American/Latinx residents within each strata (without adjusting for nursing home size, which is ignored here). First, we create a data table that includes the difference in average proportions between the facilities randomized to the intervention and those randomized to control:</p>
<pre class="r"><code>dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
dc <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
dc[, dif := abs(`1` - `0`)]</code></pre>
<p>Looking at both the table and the figure, Stratum 3 (>40%) in Network 1 jumps out as having the largest discrepancy, about 15 percentage points:</p>
<pre><code>## network stratum 0 1 dif
## 1: 1 1 0.16 0.192 0.03062
## 2: 1 2 0.31 0.305 0.00641
## 3: 1 3 0.45 0.600 0.14568
## 4: 2 1 0.15 0.143 0.00394
## 5: 2 2 0.32 0.321 0.00310
## 6: 2 3 0.49 0.547 0.05738
## 7: 3 1 0.13 0.085 0.04948
## 8: 3 2 0.33 0.331 0.00029
## 9: 3 3 0.56 0.607 0.04284</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/index.en_files/figure-html/unnamed-chunk-7-1.png" width="576" /></p>
</div>
<div id="constraining-the-randomization" class="section level3">
<h3>Constraining the randomization</h3>
<p>We want to do better and ensure that the maximum difference within a stratum falls below some specified threshold, say a 3 percentage point difference. All we need to do is repeatedly randomize and then check balance. I’ve written a function <code>randomize</code> that will be called repeatedly. Here I generate 1000 randomization lists, but in some cases I might need to generate many more, particularly if it is difficult to achieve targeted balance in any particular randomization.</p>
<pre class="r"><code>randomize <- function(dd) {
dr <- trtAssign(dd, nTrt = 2, strata = c("network", "stratum"), balanced = TRUE,
ratio = c(3, 1), grpName = "rx")
dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
dx <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
dx[, dif := abs(`1` - `0`)]
list(is_candidate = all(dx$dif < 0.03), randomization = dr[,.(site, rx)],
balance = dx)
}
rand_list <- lapply(1:1000, function(x) randomize(dd))</code></pre>
<p>Here is one randomization that fails to meet the criteria as 5 of the 9 strata exceed the 3 percentage point threshold:</p>
<pre><code>## [[1]]
## [[1]]$is_candidate
## [1] FALSE
##
## [[1]]$randomization
## site rx
## 1: 1 0
## 2: 2 1
## 3: 3 0
## 4: 4 1
## 5: 5 0
## ---
## 196: 196 0
## 197: 197 1
## 198: 198 0
## 199: 199 0
## 200: 200 0
##
## [[1]]$balance
## network stratum 0 1 dif
## 1: 1 1 0.16 0.207 0.0503
## 2: 1 2 0.30 0.334 0.0330
## 3: 1 3 0.45 0.600 0.1457
## 4: 2 1 0.15 0.138 0.0107
## 5: 2 2 0.32 0.330 0.0078
## 6: 2 3 0.50 0.514 0.0142
## 7: 3 1 0.13 0.085 0.0493
## 8: 3 2 0.34 0.311 0.0239
## 9: 3 3 0.55 0.647 0.0950</code></pre>
<p>Here is another that passes, as all differences are below the 3 percentage point threshold:</p>
<pre><code>## [[1]]
## [[1]]$is_candidate
## [1] TRUE
##
## [[1]]$randomization
## site rx
## 1: 1 1
## 2: 2 0
## 3: 3 1
## 4: 4 0
## 5: 5 1
## ---
## 196: 196 1
## 197: 197 0
## 198: 198 1
## 199: 199 1
## 200: 200 0
##
## [[1]]$balance
## network stratum 0 1 dif
## 1: 1 1 0.16 0.18 0.0168
## 2: 1 2 0.31 0.31 0.0039
## 3: 1 3 0.49 0.49 0.0041
## 4: 2 1 0.15 0.14 0.0144
## 5: 2 2 0.32 0.33 0.0064
## 6: 2 3 0.50 0.52 0.0196
## 7: 3 1 0.12 0.12 0.0095
## 8: 3 2 0.34 0.31 0.0239
## 9: 3 3 0.57 0.58 0.0134</code></pre>
<p>All that remains is to identify all the randomization sets that met the criteria (in this case there are only 6, suggesting we should probably generate at least 100,000 randomizations to ensure we have enough to pick from).</p>
<pre class="r"><code>candidate_indices <- sapply(rand_list, function(x) x[["is_candidate"]])
candidates <- rand_list[candidate_indices]
(n_candidates <- length(candidates))</code></pre>
<pre><code>## [1] 6</code></pre>
<pre class="r"><code>selected <- sample(x = n_candidates, size = 1)
ds <- candidates[[selected]][["randomization"]]
ds <- merge(dd, ds, by = "site")
dx <- ds[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]</code></pre>
<p>And looking at the plot confirms that we have a randomization scheme that is balanced based on our target:</p>
<p><img src="https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/index.en_files/figure-html/unnamed-chunk-12-1.png" width="576" /></p>
<p>Of course, the selection criteria could be based on any combination of factors. We may have multiple means that we want to balance, or we might want the two arms to be similar with respect to the standard deviation of a measure. These additional criteria may require more randomization schemes to be generated just because balance is that much more difficult to achieve, but all that really costs is computing time, not programming effort.</p>
<p>
<p><small><font color="darkkhaki">
Support:</p>
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
A Bayesian implementation of a latent threshold model
https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/
Tue, 08 Dec 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In the <a href="https://www.rdatagen.net/post/a-latent-threshold-model/" target="_blank">previous post</a>, I described a latent threshold model that might be helpful if we want to dichotomize a continuous predictor but we don’t know the appropriate cut-off point. This was motivated by a need to identify a threshold of antibody levels present in convalescent plasma that is currently being tested as a therapy for hospitalized patients with COVID in a number of RCTs, including those that are particpating in the ongoing <a href="https://bit.ly/3lTTc4Q" target="_blank">COMPILE meta-analysis</a>.</p>
<p>Barring any specific scientific rationale, we could pick an arbitrary threshold and continue with our analysis. Unfortunately, our estimates would not reflect the uncertainty around the selection of that threshold point; an approach that incorporates this uncertainty would be more appropriate. Last time, I described a relatively simple scenario with a single continuous predictor, a latent threshold, and a continuous outcome; the estimates were generated using the <code>R</code> package <code>chngppt</code>. Because I want to be able to build more flexible models in the future that could accommodate multiple continuous predictors (and latent thresholds), I decided to implement a Bayesian version of the model.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>Before laying out the model (described in much more detail in the <a href="https://bit.ly/3fYbd0M" target="_blank">Stan User’s Guide</a>), I should highlight two key features. First, we assume that the distribution of the outcome differs on either side of the threshold. In this example, we expect that the outcome data for antibody levels below the threshold are distributed as <span class="math inline">\(N(\alpha, \sigma)\)</span>, and that data above the threshold are <span class="math inline">\(N(\beta, \sigma)\)</span>. Second, since we do not know the threshold value, the likelihood is specified as a mixture across the range of all possible thresholds; the posterior distribution of the parameters <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> reflect the uncertainty where the threshold lies.</p>
<p>The observed data include the continuous outcome <span class="math inline">\(\textbf{y}\)</span> and a continuous antibody measure <span class="math inline">\(\textbf{x}\)</span>. There are <span class="math inline">\(M\)</span> possible pre-specified thresholds that are reflected in the vector <span class="math inline">\(\mathbf{c}\)</span>. Each candidate threshold is treated as a discrete quantity and a probability <span class="math inline">\(\lambda_m\)</span> is attached to each. Here is the model for the outcome conditional on the distribution parameters as well as the probability of the thresholds:</p>
<p><span class="math display">\[p(\textbf{y}|\alpha, \beta, \sigma, \mathbf{\lambda}) = \sum_{m=1}^M \lambda_m \left(\prod_{i: \; x_i < c[m]} \text{normal}(y_i | \alpha, \sigma) \prod_{i: \; x_i \ge c[m]} \text{normal}(y_i | \beta, \sigma)\right)\]</span></p>
</div>
<div id="implmentation-in-stan" class="section level3">
<h3>Implmentation in Stan</h3>
<p>Here is a translation of the model into <code>Stan</code>. The data for the model include the antibody level <span class="math inline">\(x\)</span>, the outcome <span class="math inline">\(y\)</span>, and the candidate thresholds included in the vector <span class="math inline">\(\mathbf{c}\)</span> which has length <span class="math inline">\(M\)</span>. In this example, the candidate vector is based on the <em>range</em> of observed antibody levels.</p>
<pre class="stan"><code>data {
int<lower=1> N; // number of observations
real x[N]; // antibody measures
real y[N]; // outcomes
int<lower=1> M; // number of candidate thresholds
real c[M]; // candidate thresholds
}</code></pre>
<p>At the outset, equal probability will be assigned to each of the <span class="math inline">\(M\)</span> candidate thresholds, which is <span class="math inline">\(1/M\)</span>. Since Stan operates in log-probabilities, this is translated to <span class="math inline">\(\text{log}(1/M) = \text{-log}(M)\)</span>:</p>
<pre class="stan"><code>transformed data {
real lambda;
lambda = -log(M);
}</code></pre>
<p>The three parameters that define the two distributions (above and below the threshold) are <span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span>:</p>
<pre class="stan"><code>parameters {
real alpha;
real beta;
real<lower=0> sigma;
}</code></pre>
<p>This next block is really the implementation of the threshold model. <span class="math inline">\(\mathbf{lp}\)</span> is a vector of log probabilities, where each element represents the log of each summand in the model specified above.</p>
<pre class="stan"><code>transformed parameters {
vector[M] lp;
lp = rep_vector(lambda, M);
for (m in 1:M)
for (n in 1:N)
lp[m] = lp[m] + normal_lpdf(y[n] | x[n] < c[m] ? alpha : beta, sigma);
}</code></pre>
<p>The notation <code>y[n] | x[n] < c[m] ? alpha : beta, sigma</code> is Stan’s shorthand for an if-then-else statement (<strong>this is note Stan code!</strong>):</p>
<pre><code>if x[n] < c[m] then
y ~ N(alpha, sigma)
else if x[n] >= c[m] then
y ~ N(beta, sigma)</code></pre>
<p>And finally, here is the specification of the priors and the full likelihood, which is the sum of the log-likelihoods across the candidate thresholds. The function <code>log_sum_exp</code> executes the summation across the <span class="math inline">\(M\)</span> candidate thresholds specified in the model above.</p>
<pre class="stan"><code>model {
alpha ~ student_t(3, 0, 2.5);
beta ~ student_t(3, 0, 2.5);
sigma ~ exponential(1);
target += log_sum_exp(lp);
}</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generated to explore this model is based on the same data definitions I used in the <a href="https://www.rdatagen.net/post/a-latent-threshold-model/" target="_blank">last post</a> to explore the MLE model.</p>
<pre class="r"><code>library(simstudy)
set.seed(87654)
d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status",
variance = 1, dist = "normal")
dd <- genData(500, d1)</code></pre>
<p>The threshold is quite apparent here. In the right hand plot, the latent classes are revealed.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/p3.png" style="width:90.0%" /></p>
</div>
<div id="model-fitting" class="section level3">
<h3>Model fitting</h3>
<p>We use the <code>rstan</code> package to access Stan, passing along the observed antibody data, outcome data, as well as the candidate thresholds:</p>
<pre class="r"><code>library(rstan)
rt <- stanc("/.../threshold.stan");
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
N <- nrow(dd3)
y <- dd3[, y]
x <- dd3[, antibody]
c <- seq(round(min(x), 1), round(max(x), 1), by = .1)
M <- length(c)
studydata3 <- list(N=N, x=x, y=y, M=M, c=c)
fit3 <- sampling(sm, data = studydata3, iter = 3000, warmup = 500,
cores = 4L, chains = 4, control = list(adapt_delta = 0.8))</code></pre>
<p>The first order of business is to make sure that the MCMC algorithm sampled the parameter space in a well-behave manner. Everything looks good here:</p>
<pre class="r"><code>library(bayesplot)
posterior <- as.array(fit3)
lp <- log_posterior(fit3)
np <- nuts_params(fit3)
color_scheme_set("mix-brightblue-gray")
mcmc_trace(posterior, pars = c("alpha","beta", "sigma"),
facet_args = list(nrow = 3), np = np) +
xlab("Post-warmup iteration")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/trace3.png" style="width:80.0%" /></p>
<p>The posterior distributions of the three parameters of interest (<span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span>) are quite close to the values used in the data generation process:</p>
<pre class="r"><code>mcmc_intervals(posterior, pars = c("alpha","beta", "sigma"))</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/estimates3.png" style="width:70.0%" /></p>
</div>
<div id="the-posterior-probability-of-the-threshold" class="section level3">
<h3>The posterior probability of the threshold</h3>
<p>Even though the distributions of <span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span> are marginal with respect to the candidate thresholds, we may still be interested in the posterior distribution of the thresholds. An approach to estimating this is described in the <a href="https://mc-stan.org/docs/2_25/stan-users-guide/change-point-section.html#posterior-distribution-of-the-discrete-change-point" target="_blank">User’s Guide</a>. I provide a little more detail and code for generating the plot in the <a href="#addendum">addendum</a>.</p>
<p>The plot shows the log-probability for each of the candidate thresholds considered, with a red dashed line drawn at <span class="math inline">\(-0.7\)</span>, the true threshold used in the data generation process. In this case, the probability (and log-probability) peaks at this point. In fact, there is a pretty steep drop-off on either side, indicating that we can have a lot of confidence that the threshold is indeed <span class="math inline">\(-0.7\)</span>.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/threshold3.png" style="width:80.0%" /></p>
</div>
<div id="when-there-is-a-single-distribution" class="section level3">
<h3>When there is a single distribution</h3>
<p>If we update the data definitions to generate a single distribution (<em>i.e.</em> the outcome is independent of the antibody measure), the threshold model with a struggles to identify a threshold, and the parameter estimates have more uncertainty.</p>
<pre class="r"><code>d1 <- updateDef(d1, changevar = "y", newformula = "0")
dd <- genData(500, d1)</code></pre>
<p>Here is a plot of the data based on the updated assumption:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/p0.png" style="width:90.0%" /></p>
<p>And here are the posterior probabilities for the parameters - now with much wider credible intervals:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/estimates0.png" style="width:70.0%" /></p>
<p>Here is the posterior distribution of thresholds, intentionally plotted to highlight the lack of distinction across the candidate thresholds:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/threshold0.png" style="width:80.0%" /></p>
<p id="addendum">
</p>
</div>
<div id="addendum---posterior-probabilties-of-the-threshold" class="section level3">
<h3>Addendum - posterior probabilties of the threshold</h3>
<p>Here’s a little more background on how the posterior probabilities for the threshold were calculated. As a reminder, <span class="math inline">\(\textbf{c}\)</span> is a vector of candidate thresholds of length <span class="math inline">\(M\)</span>. We define a quantity <span class="math inline">\(q(c_m | data)\)</span> as</p>
<p><span class="math display">\[
q(c_m | data) = \frac{1}{R}\sum_{r=1}^R \text{exp}(lp_{rc_m})
\]</span>
where <span class="math inline">\(lp_{cr_m}\)</span> is the value of <span class="math inline">\(lp\)</span> from the <em>r</em>’th draw for threshold candidate <span class="math inline">\(c_m\)</span>. We are actually interested in <span class="math inline">\(p(c_m|data\)</span>), which is related to <span class="math inline">\(q\)</span>:</p>
<p><span class="math display">\[
p(c_m | data) = \frac{q(c_m | data)}{\sum_{m'=1}^M q(c_{m'}|data)}
\]</span></p>
<p>The <code>R</code> code is a little bit involved, because the log-probabilities are so small that exponentiating them to recover the probabilities runs into floating point limitations. In the examples I have been using here, the log probabilities ranged from <span class="math inline">\(-4400\)</span> to <span class="math inline">\(-700\)</span>. On my device the smallest value I can meaningfully exponentiate is <span class="math inline">\(-745\)</span>; anything smaller results in a value of 0, rendering it impossible to estimate <span class="math inline">\(q\)</span>.</p>
<p>To get around this problem, I used the <code>mpfr</code> function in the <code>Rmfpr</code> package. Here is a simple example to show how exponentiate a hihgly negative variable <span class="math inline">\(b\)</span>. A helper variable <span class="math inline">\(a\)</span> is specified to set the precision, which can then be used to derive the desired result, which is <span class="math inline">\(\text{exp}(b)\)</span>.</p>
<p>Everything is fine if <span class="math inline">\(b \ge -745\)</span>:</p>
<pre class="r"><code>library(Rmpfr)
b <- -745
exp(b)</code></pre>
<pre><code>## [1] 4.94e-324</code></pre>
<p>For <span class="math inline">\(b<-745\)</span>, we have floating point issues:</p>
<pre class="r"><code>b <- -746
exp(b)</code></pre>
<pre><code>## [1] 0</code></pre>
<p>So, we turn to <code>mpfr</code> to get the desired result. First, specify <span class="math inline">\(a\)</span> with the proper precision:</p>
<pre class="r"><code>(a <- mpfr(exp(-100), precBits=64))</code></pre>
<pre><code>## 1 'mpfr' number of precision 64 bits
## [1] 3.72007597602083612001e-44</code></pre>
<p>And now we can calculate <span class="math inline">\(\text{exp}(b)\)</span>:</p>
<pre class="r"><code>a^(-b/100)</code></pre>
<pre><code>## 1 'mpfr' number of precision 64 bits
## [1] 1.03828480951583225515e-324</code></pre>
<p>The code to calculate <span class="math inline">\(\text{log}(p_{c_m})\)</span> extracts the draws of <span class="math inline">\(lp\)</span> from the sample, exponentiates, and sums to get the desired result.</p>
<pre class="r"><code>library(glue)
a <- mpfr(exp(-100), precBits=64)
qc <- NULL
for(m in 1:M) {
lp.i <- glue("lp[{m}]")
le <- rstan::extract(fit3, pars = lp.i)[[1]]
q <- a^(-le/100)
qc[m] <- sum(q)
}
qcs <- mpfr2array(qc, dim = M)
lps <- log(qcs/sum(qcs))
dps <- data.table(c, y=as.numeric(lps))
ggplot(data = dps, aes(x = c, y = y)) +
geom_vline(xintercept = -0.7, color = "red", lty = 3) +
geom_line(color = "grey60") +
geom_point(size = 1) +
theme(panel.grid = element_blank()) +
ylab("log(probability)") +
xlab("threshold from low to not low") +
scale_y_continuous(limits = c(-800, 0))</code></pre>
</div>
A latent threshold model to dichotomize a continuous predictor
https://www.rdatagen.net/post/a-latent-threshold-model/
Tue, 24 Nov 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-latent-threshold-model/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>This is the context. In the convalescent plasma pooled individual patient level meta-analysis we are conducting as part of the <a href="https://bit.ly/3nBxPXd" target="_blank">COMPILE</a> study, there is great interest in understanding the impact of antibody levels on outcomes. (I’ve described various aspects of the analysis in previous posts, most recently <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">here</a>). In other words, not all convalescent plasma is equal.</p>
<p>If we had a clear measure of antibodies, we could model the relationship of these levels with the outcome of interest, such as health status as captured by the WHO 11-point scale or mortality, and call it a day. Unfortunately, at the moment, there is no single measure across the RCTs included in the meta-analysis (though that may change). Until now, the RCTs have used a range of measurement “platforms” (or technologies), which may measure different components of the convalescent plasma using different scales. Given these inconsistencies, it is challenging to build a straightforward model that simply estimates the relationship between antibody levels and clinical outcomes.</p>
<p>The study team is coalescing around the idea of comparing the outcomes of patients who received <em>low</em> levels of antibodies with patients who received <em>not low</em> levels (as well as with patients who received no antibodies). One thought (well, really my thought) is to use a model that can jointly estimate the latent threshold and, given that threshold, estimate a treatment effect. Importantly, this model would need to accommodate multiple antibody measures and their respective thresholds.</p>
<p>To tackle this problem, I have turned to a class of models called change point or threshold models. My ultimate goal is to fit a Bayesian model that can estimate threshold and effect-size parameters for any number of RCTs using any number of antibody measures. At this point we are a few steps removed from that, so in this post I’ll start with a simple case of a single RCT and a single antibody measure, and use a maximum likelihood estimation method implemented in the <code>R</code> package <a href="https://cran.r-project.org/web/packages/chngpt/index.html" target="_blank">chngpt</a> to estimate parameters from a simulated data set. In a subsequent post, I’ll implement a Bayesian version of this simple model, and perhaps in a third post, I’ll get to the larger model that incorporates more complexity.</p>
<div id="visualizing-simple-scenarios" class="section level3">
<h3>Visualizing simple scenarios</h3>
<p>Change point models appear to be most commonly used in the context of time series data where the focus is on understanding if a trend or average has shifted at a certain point in a sequence of measurements over time. In the case of COMPILE, the target would be a threshold for a continuous antibody measure across multiple patients; we are interested in measuring the average outcome for patients on either side of the threshold.</p>
<p>The following plots show three scenarios. On the left, there is no threshold; the distribution of continuous outcomes is the same across all values of the the antibody measure. In the middle, there is a threshold at <span class="math inline">\(-0.7\)</span>; patients with antibody levels below <span class="math inline">\(-0.7\)</span> have a lower average outcome than patients with antibodies above <span class="math inline">\(-0.7\)</span>. On the right, the threshold is shifted to <span class="math inline">\(0.5\)</span>.</p>
<p>The key here is that the outcome is solely a function of the latent categorical status - not the actual value of the antibody level. This may be a little simplistic, because we might expect the antibody level itself to be related to the outcome based on some sort of linear or non-linear relationship rather than the dichotomous relationship we are positing here. However, if we set our sights on detecting a difference in average clinical outcomes for patients categorized as having been exposed to <em>low</em> and <em>not low</em> antibody levels rather than on understanding the full nature of their relationship, this simplification may be reasonable.</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-1-1.png" width="864" /></p>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I think if you see the data generation process, the model and assumptions might make more sense. We start with an antibody level that, for simplicity’s sake, has a standard normal distribution. In this simulation, the latent group status (i.e. <em>low</em> vs. <em>not low</em>) is not determined completely by the threshold (though it certainly could); here, the probability that latent status is <em>not low</em> is about <span class="math inline">\(5\%\)</span> for patients with antibody levels that fall below <span class="math inline">\(-0.7\)</span>, but is <span class="math inline">\(95\%\)</span> for patients that exceed threshold.</p>
<pre class="r"><code>library(simstudy)
set.seed(87654)
d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status",
variance = 1, dist = "normal")
dd <- genData(500, d1)
dd</code></pre>
<pre><code>## id antibody latent_status y
## 1: 1 -1.7790 0 0.5184
## 2: 2 0.2423 1 3.2174
## 3: 3 -0.4412 1 1.8948
## 4: 4 -1.2505 0 0.9816
## 5: 5 -0.0552 1 2.9251
## ---
## 496: 496 -0.4634 1 2.7298
## 497: 497 0.6862 0 -0.0507
## 498: 498 -1.0899 0 0.9680
## 499: 499 2.3395 1 1.9540
## 500: 500 -0.4874 1 3.5238</code></pre>
</div>
<div id="simple-model-estimation" class="section level3">
<h3>Simple model estimation</h3>
<p>The <code>chngptm</code> function in the <code>chngpt</code> package provides an estimate of the threshold as well as the treatment effect of antibody lying above this latent threshold. The parameters in this simple case are recovered quite well. The fairly narrow <span class="math inline">\(95\%\)</span> confidence interval (2.2, 2.8) just misses the true value. The very narrow <span class="math inline">\(95\%\)</span> CI for the threshold is (-0.73, -0.69) just does include the true value.</p>
<pre class="r"><code>library(chngpt)
fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
data = dd, type="step", family="gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) 0.296 0.130 0.0547 0.563 2.26e-02
## antibody>chngpt 2.520 0.139 2.2416 2.787 1.99e-73
##
## Threshold:
## est Std. Error (lower upper)
## -0.70261 0.00924 -0.72712 -0.69092</code></pre>
</div>
<div id="alternative-scenarios" class="section level3">
<h3>Alternative scenarios</h3>
<p>When there is more ambiguity in the relationship between the antibody threshold and the classification into the two latent classes of <em>low</em> and <em>not low</em>, there is more uncertainty in both the effect and threshold estimates. Furthermore, the effect size estimate is attenuated, since the prediction of the latent class is less successful.</p>
<p>In the next simulation, the true threshold remains at <span class="math inline">\(-0.7\)</span>, but the probability that a patient below the threshold actually does not have <em>low</em> levels of antibodies increases to about <span class="math inline">\(21\%\)</span>, while the probability of a patient above the threshold does not have <em>low</em> levels of antibodies decreases to <span class="math inline">\(79\%\)</span>. There is more uncertainty regarding the the threshold, as the <span class="math inline">\(95\%\)</span> CI is (-1.09, -0.62). And the estimated effect is <span class="math inline">\(1.5 \; (1.3, 2.0)\)</span> is attenuated with more uncertainty. Given the added uncertainty in the data generation process, these estimates are what we would expect.</p>
<pre class="r"><code>d1 <- updateDef(d1, changevar = "latent_status",
newformula = "-1.3 + 2.6 * (antibody > -0.7)")
dd <- genData(500, d1)
fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
data = dd, type="step", family="gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) 0.881 0.159 0.50 1.12 3.05e-08
## antibody>chngpt 1.439 0.173 1.17 1.85 1.09e-16
##
## Threshold:
## est Std. Error (lower upper)
## -0.6298 0.0579 -0.8083 -0.5814</code></pre>
<p>The effect size has an impact on the estimation of a threshold. At the extreme case where there is no effect, the concept of a threshold is not meaningful; we would expect there to be great uncertainty with the estimate for the threshold. As the true effect size grows, we would expect the precision of the threshold estimate to increase as well (subject to the latent class membership probabilities just described). The subsequent plot shows the point estimates and <span class="math inline">\(95\%\)</span> CIs for thresholds at different effect sizes. The true threshold is <span class="math inline">\(0.5\)</span> and effect sizes range from 0 to 2:</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>This last figure shows that the uncertainty around the effect size estimate is higher at lower levels of true effectiveness. This higher level of uncertainty in the estimated effect is driven by the higher level of uncertainty in the estimate of the threshold at lower effect sizes (as just pointed out above).</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="with-a-fundamentally-different-data-generating-process" class="section level3">
<h3>With a fundamentally different data generating process</h3>
<p>What happens when the underlying data process is quite different from the one we have been imagining? Is the threshold model useful? I would say “maybe not” in the case of a single antibody measurement. I alluded to this a bit earlier in the post, justifying the idea by arguing it might make more sense with multiple types of antibody measurements. We will hopefully find that out if I get to that point. Here, I briefly investigate the estimates we get from a threshold model when the outcome is linearly related to the antibody measurement, and there is in fact no threshold, as in this data set:</p>
<pre class="r"><code>d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "y", formula = "antibody", variance = 1, dist = "normal")
dd <- genData(500, d1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>The estimated threshold is near the center of the antibody data (which in this case is close to <span class="math inline">\(0\)</span>), with a fairly narrow <span class="math inline">\(95\%\)</span> confidence interval. The effect size is essentially a comparison of the means for patients with measurements below <span class="math inline">\(0\)</span> compared to patients above <span class="math inline">\(0\)</span>. If this were the actual data generation process, it might be preferable to model the relationship directly using simple linear regression without estimating a threshold.</p>
<pre class="r"><code>fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
data = dd, type="step", family="gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) -0.972 0.162 -1.24 -0.607 2.19e-09
## antibody>chngpt 1.739 0.109 1.58 2.006 1.15e-57
##
## Threshold:
## est Std. Error (lower upper)
## -0.0713 0.2296 -0.3832 0.5170</code></pre>
</div>
Exploring the properties of a Bayesian model using high performance computing
https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/
Tue, 10 Nov 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>An obvious downside to estimating Bayesian models is that it can take a considerable amount of time merely to fit a model. And if you need to estimate the same model repeatedly, that considerable amount becomes a prohibitive amount. In this post, which is part of a series (last one <a href="https://bit.ly/31kCCDV" target="_blank">here</a>) where I’ve been describing various aspects of the Bayesian analyses we plan to conduct for the <a href="https://bit.ly/31hDwB0" target="_blank">COMPILE</a> meta-analysis of convalescent plasma RCTs, I’ll present a somewhat elaborate model to illustrate how we have addressed these computing challenges to explore the properties of these models.</p>
<p>While concept of statistical power may not be part of the Bayesian analytic framework, there are many statisticians who would like to assess this property regardless of the modeling approach. These assessments require us to generate multiple data sets and estimate a model for each. In this case, we’ve found that each run through the MCMC-algorithm required to sample from the posterior probability of the Bayesian model can take anywhere from 7 to 15 minutes on our laptops or desktops. If we want to analyze 1,000 data sets using these methods, my laptop would need to run continuously for at least a week. And if we want to explore models under different assumptions about the data generating process or prior distributions - well, that would be impossible.</p>
<p>Fortunately, we have access to the <a href="https://bit.ly/37wJdPs" target="_blank">High Performance Computing Core at NYU Langone Health</a> (HPC), which enables us to analyze 1,000 data sets in about 90 minutes. Still pretty intensive, but clearly a huge improvement. This post describes how we adapted our simulation and modeling process to take advantage of the power and speed of the the HPC.</p>
<div id="compile-study-design" class="section level3">
<h3>COMPILE Study design</h3>
<p>There are numerous randomized control trials being conducted around the world to evaluate the efficacy of antibodies in convalescent blood plasma in improving outcomes for patients who have been hospitalized with COVID. Because each trial lacks adequate sample size to allow us to draw any definitive conclusions, we have undertaken a project to pool individual level data from these various studies into a single analysis. (I described the general approach and some conceptual issues <a href="https://bit.ly/3o8o8Rr" target="_blank">here</a> and <a href="https://bit.ly/2HcXCpr" target="_blank">here</a>). The outcome is an 11-point categorical score developed by the <a href="https://bit.ly/3m7h4CI" target="_blank">WHO</a>, with 0 indicating no COVID infection and 10 indicating death. As the intensity of support increases, the scores increase. For the primary outcome, the score will be based on patient status 14 days after randomization.</p>
<p>This study is complicated by the fact that each RCT is using one of three different control conditions: (1) usual care (unblinded), (2) non-convalescent plasma, and (3) saline solution. The model conceptualizes the three control conditions as three treatments to be compared against the reference condition of convalescent plasma. The overall treatment effect will represent a quantity centered around the three control effects.</p>
<p>Because we have access to individual-level data, we will be able to adjust for important baseline characteristics that might be associated with the prognosis at day 14; these are baseline WHO-score, age, sex, and symptom duration prior to randomization. The primary analysis will adjust for these factors, and secondary analyses will go further to investigate if any of these factors modify the treatment effect. In the example in this post, I will present the model based on a secondary analysis that considers only a single baseline factor symptom duration; we are allowing for the possibility that the treatment may be more or less effective depending on the symptom duration. (For example, patients who have been sicker longer may not respond to the treatment, while those who are treated earlier may.)</p>
</div>
<div id="model" class="section level3">
<h3>Model</h3>
<p>Here is the model that I am using (as I mentioned, the planned COMPILE analysis will be adjusting for additional baseline characteristics):</p>
<p><span class="math display">\[
\text{logit} \left(P\left(Y_{kis} \ge y\right)\right) = \tau_{yk} + \beta_s + I_{ki} \left( \gamma_{kcs} + \delta_{kc} \right), \; \; y \in \{1, \dots L-1\} \text{ with } L \text{ response levels}
\]</span>
And here are the assumptions for the <strong>prior distributions</strong>:</p>
<p><span class="math display">\[
\begin{aligned}
\tau_{yk} &\sim t_{\text{student}} \left( \text{df=} 3, \mu=0, \sigma = 5 \right), \; \; \text{monotone within } \text{site } k \\
\beta_s &\sim \text{Normal} \left( \mu=0, \sigma = 5 \right), \qquad \; \;s \in \{1,\dots, S \} \text{ for symptom duration strata}, \beta_1 = 0 \\
\gamma_{kcs} &\sim \text{Normal}\left( \gamma_{cs}, 1 \right), \qquad \qquad \;\;\;\;\;\;c \in \{0, 1, 2\} \text{ for control conditions }, \gamma_{kc1} = 0 \text{ for all } k \\
\gamma_{cs} &\sim \text{Normal}\left( \Gamma_s, 0.25 \right), \qquad \qquad \; \; \gamma_{c1} = 0 \text{ for all } c \\
\Gamma_s &\sim t_{\text{student}} \left( 3, 0, 1 \right), \qquad \qquad \qquad \Gamma_{1} = 0 \\
\delta_{kc} &\sim \text{Normal}\left( \delta_c, \eta \right)\\
\delta_c &\sim \text{Normal}\left( -\Delta, 0.5 \right) \\
\eta &\sim t_{\text{student}}\left(3, 0, 0.25 \right) \\
-\Delta &\sim t_{\text{student}} \left( 3, 0, 2.5 \right)
\end{aligned}
\]</span>
There are <span class="math inline">\(K\)</span> RCTs. The outcome for the <span class="math inline">\(i\)</span>th patient from the <span class="math inline">\(k\)</span>th trial on the <span class="math inline">\(L\)</span>-point scale at day 14 is <span class="math inline">\(Y_{ki}=y\)</span>, <span class="math inline">\(y=0,\dots,L-1\)</span> (although the COMPILE study will have <span class="math inline">\(L = 11\)</span> levels, I will be using <span class="math inline">\(L=5\)</span> to speed up estimation times a bit). <span class="math inline">\(I_{ki}\)</span> indicates the treatment assignment for subject <span class="math inline">\(i\)</span> in the <span class="math inline">\(k\)</span>th RCT, <span class="math inline">\(I_{ki} = 0\)</span> if patient <span class="math inline">\(i\)</span> received CP and <span class="math inline">\(I_{ki} = 1\)</span> if the patient was in <em>any</em> control arm. There are three control conditions <span class="math inline">\(C\)</span>: standard of care, <span class="math inline">\(C=0\)</span>; non-convalescent plasma, <span class="math inline">\(C=1\)</span>; saline/LR with coloring, <span class="math inline">\(C=2\)</span>; each RCT <span class="math inline">\(k\)</span> is attached to a specific control condition. There are also <span class="math inline">\(S=3\)</span> symptom duration strata: short duration, <span class="math inline">\(s=0\)</span>; moderate duration, <span class="math inline">\(s=1\)</span>; and long duration, <span class="math inline">\(s=2\)</span>. (COMPILE will use five symptom duration strata - again I am simplifying.)</p>
<p><span class="math inline">\(\tau_{yk}\)</span> corresponds to the <span class="math inline">\(k\)</span>th RCT’s intercept associated with level <span class="math inline">\(y\)</span>; the <span class="math inline">\(\tau\)</span>’s represent the cumulative log odds for patients with in symptom duration group <span class="math inline">\(s=0\)</span> and receiving CP treatment. Within a particular RCT, all <span class="math inline">\(\tau_{yk}\)</span>’s, satisfy the monotonicity requirements for the intercepts of the proportional odds model. <span class="math inline">\(\beta_s\)</span>, <span class="math inline">\(s \in {2, 3}\)</span>, is the main effect of symptom duration (<span class="math inline">\(\beta_1 = 0\)</span>, where <span class="math inline">\(s=1\)</span> is the reference category). <span class="math inline">\(\gamma_{kcs}\)</span> is the moderating effect of strata <span class="math inline">\(s\)</span> in RCT <span class="math inline">\(k\)</span>; <span class="math inline">\(\gamma_{kc1} = 0\)</span>, since <span class="math inline">\(s=1\)</span> is the reference category. <span class="math inline">\(\delta_{kc}\)</span> is the RCT-specific control effect, where RCT <span class="math inline">\(k\)</span> is using control condition <span class="math inline">\(c\)</span>.</p>
<p>Each <span class="math inline">\(\gamma_{kcs}\)</span> is normally distributed around a control type/symptom duration mean <span class="math inline">\(\gamma_{cs}\)</span>. And each <span class="math inline">\(\gamma_{cs}\)</span> is centered around a pooled mean <span class="math inline">\(\Gamma_s\)</span>. The <span class="math inline">\(\delta_{kc}\)</span>’s are assumed to be normally distributed around a control-type specific effect <span class="math inline">\(\delta_c\)</span>, with variance <span class="math inline">\(\eta\)</span> that will be estimated; the <span class="math inline">\(\delta_c\)</span>’s are normally distributed around <span class="math inline">\(-\Delta\)</span> (we take <span class="math inline">\(-\Delta\)</span> as the mean of the distribution to which <span class="math inline">\(\delta_c\)</span> belongs so that <span class="math inline">\(\exp(\Delta)\)</span> will correspond to the cumulative log-odds ratio for CP relative to control, rather than for control relative to CP.). (For an earlier take on these types of models, see <a href="https://bit.ly/34ila4Q" target="_blank">here</a>.)</p>
</div>
<div id="go-or-no-go" class="section level3">
<h3>Go or No-go</h3>
<p>The focus of a Bayesian analysis is the estimated posterior probability distribution of a parameter or parameters of interest, for example the log-odds ratio from a cumulative proportional odds model. At the end of an analysis, we have credibility intervals, means, medians, quantiles - all concepts associated a probability distribution.</p>
<p>A “Go/No-go” decision process like a hypothesis test is not necessarily baked into the Bayesian method. At some point, however, even if we are using a Bayesian model to inform our thinking, we might want to or have to make a decision. In this case, we might want recommend (or not) the use of CP for patients hospitalized with COVID-19. Rather than use a hypothesis test to reject or fail to reject a null hypothesis of no effect, we can use the posterior probability to create a decision rule. In fact, this is what we have done.</p>
<p>In the proposed design of COMPILE, the CP therapy will be deemed a success if both of these criteria are met:</p>
<p><span class="math display">\[ P \left( \exp\left(-\Delta\right) < 1 \right) = P \left( OR < 1 \right) > 95\%\]</span></p>
<p><span class="math display">\[P \left( OR < 0.80 \right) > 50\%\]</span>
The first statement ensures that the posterior probability of a good outcome is very high. If we want to be conservative, we can obviously increase the percentage threshold above <span class="math inline">\(95\%\)</span>. The second statement says that the there is decent probability that the treatment effect is clinically meaningful. Again, we can modify the target OR and/or the percentage threshold based on our desired outcome.</p>
</div>
<div id="goals-of-the-simulation" class="section level3">
<h3>Goals of the simulation</h3>
<p>Since there are no Type I or Type II errors in the Bayesian framework, the concept of power (which is the probability of rejecting the null hypothesis when it is indeed not true) does not logically flow from a Bayesian analysis. However, if we substitute our decision rules for a hypothesis test, we can estimate the probability (call it Bayesian power, though I imagine some Bayesians would object) that we will make a “Go” decision given a specified treatment effect. (To be truly Bayesian, we should impose some uncertainty on what that specific treatment effect is, and calculate a probability distribution of Bayesian power. But I am keeping things simpler here.)</p>
<p>Hopefully, I have provided sufficient motivation for the need to simulate data and fit multiple Bayesian models. So, let’s do that now.</p>
</div>
<div id="the-simulation" class="section level3">
<h3>The simulation</h3>
<p>I am creating four functions that will form the backbone of this simulation process: <code>s_define</code>, <code>s_generate</code>, <code>s_estimate</code>, and <code>s_extract</code>. Repeated calls to each of these functions will provide us with the data that we need to get an estimate of Bayesian power under our (static) data generating assumptions.</p>
<div id="data-definitions" class="section level4">
<h4>Data definitions</h4>
<p>The first definition table, <code>defC1</code>, sets up the RCTs. Each RCT has specific symptom duration interaction effect <span class="math inline">\(a\)</span> and control treatment effect <span class="math inline">\(b\)</span>. To introduce a little variability in sample size, 1/3 of the studies will be larger (150 patients), and 2/3 will be smaller (75 patients).</p>
<p>The remaining tables, <code>defC2</code>, <code>defS</code>, and <code>defC3</code>, define patient-level data. <code>defC2</code> adds the control group indicator (0 = CP, 1 = standard care, 2 = non-convalescent plasma, 3 = saline) and the symptom duration stratum. <code>defS</code> defines the interaction effect conditional on the stratum. <code>defC3</code> defines the ordinal categorical outcome.</p>
<pre class="r"><code>s_define <- function() {
defC1 <- defDataAdd(varname = "a",formula = 0, variance = .005, dist = "normal")
defC1 <- defDataAdd(defC1,varname = "b",formula = 0, variance= .01, dist = "normal")
defC1 <- defDataAdd(defC1,varname = "size",formula = "75+75*large", dist = "nonrandom")
defC2 <- defDataAdd(varname="C_rv", formula="C * control", dist = "nonrandom")
defC2 <- defDataAdd(defC2, varname = "ss", formula = "1/3;1/3;1/3",
dist = "categorical")
defS <- defCondition(
condition = "ss==1",
formula = 0,
dist = "nonrandom")
defS <- defCondition(defS,
condition = "ss==2",
formula = "(0.09 + a) * (C_rv==1) + (0.10 + a) * (C_rv==2) + (0.11 + a) * (C_rv==3)",
dist = "nonrandom")
defS <- defCondition(defS,
condition = "ss==3",
formula = "(0.19 + a) * (C_rv==1) + (0.20 + a) * (C_rv==2) + (0.21 + a) * (C_rv==3)",
dist = "nonrandom")
defC3 <- defDataAdd(
varname = "z",
formula = "0.1*(ss-1)+z_ss+(0.6+b)*(C_rv==1)+(0.7+b)*(C_rv==2)+(0.8+b)*(C_rv==3)",
dist = "nonrandom")
list(defC1 = defC1, defC2 = defC2, defS = defS, defC3 = defC3)
}</code></pre>
</div>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>The data generation process draws on the definition tables to create an instance of an RCT data base. This process includes a function <code>genBaseProbs</code> that I described <a href="https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/" target="_blank">previously</a>.</p>
<pre class="r"><code>s_generate <- function(deflist, nsites) {
genBaseProbs <- function(n, base, similarity, digits = 2) {
n_levels <- length(base)
x <- gtools::rdirichlet(n, similarity * base)
x <- round(floor(x*1e8)/1e8, digits)
xpart <- x[, 1:(n_levels-1)]
partsum <- apply(xpart, 1, sum)
x[, n_levels] <- 1 - partsum
return(x)
}
basestudy <- genBaseProbs(n = nsites,
base = c(.10, .35, .25, .20, .10),
similarity = 100)
dstudy <- genData(nsites, id = "study")
dstudy <- trtAssign(dstudy, nTrt = 3, grpName = "C")
dstudy <- trtAssign(dstudy, nTrt = 2, strata = "C", grpName = "large", ratio = c(2,1))
dstudy <- addColumns(deflist[['defC1']], dstudy)
dind <- genCluster(dstudy, "study", numIndsVar = "size", "id")
dind <- trtAssign(dind, strata="study", grpName = "control")
dind <- addColumns(deflist[['defC2']], dind)
dind <- addCondition(deflist[["defS"]], dind, newvar = "z_ss")
dind <- addColumns(deflist[['defC3']], dind)
dl <- lapply(1:nsites, function(i) {
b <- basestudy[i,]
dx <- dind[study == i]
genOrdCat(dx, adjVar = "z", b, catVar = "ordY")
})
rbindlist(dl)[]
}</code></pre>
</div>
<div id="model-estimation" class="section level4">
<h4>Model estimation</h4>
<p>The estimation involves creating a data set for <code>Stan</code> and sampling from the Bayesian model. The <code>Stan</code> model is included in the addendum.</p>
<pre class="r"><code>s_estimate <- function(dd, s_model) {
N <- nrow(dd) ## number of observations
L <- dd[, length(unique(ordY))] ## number of levels of outcome
K <- dd[, length(unique(study))] ## number of studies
y <- as.numeric(dd$ordY) ## individual outcome
kk <- dd$study ## study for individual
ctrl <- dd$control ## treatment arm for individual
cc <- dd[, .N, keyby = .(study, C)]$C ## specific control arm for study
ss <- dd$ss
x <- model.matrix(ordY ~ factor(ss), data = dd)[, -1]
studydata <- list(N=N, L= L, K=K, y=y, kk=kk, ctrl=ctrl, cc=cc, ss=ss, x=x)
fit <- sampling(s_model, data=studydata, iter = 4000, warmup = 500,
cores = 4L, chains = 4, control = list(adapt_delta = 0.8))
fit
}</code></pre>
</div>
<div id="estimate-extraction" class="section level4">
<h4>Estimate extraction</h4>
<p>The last step is the extraction of summary data from the posterior probability distributions. I am collecting quantiles of the key parameters, including <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(OR = \exp(-\Delta)\)</span>. For the Bayesian power analysis, I am estimating the probability of falling below the two thresholds for each data set. And finally, I want to get a sense of the quality of each estimation process by recovering the number of divergent chains that resulted from the MCMC algorithm (more on that <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">here</a>).</p>
<pre class="r"><code>s_extract <- function(iternum, mcmc_res) {
posterior <- as.array(mcmc_res)
x <- summary(
mcmc_res,
pars = c("Delta", "delta", "Gamma", "beta", "alpha", "OR"),
probs = c(0.025, 0.5, 0.975)
)
dpars <- data.table(iternum = iternum, par = rownames(x$summary), x$summary)
p.eff <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 1)
p.clinic <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 0.8)
dp <- data.table(iternum = iternum, p.eff = p.eff, p.clinic = p.clinic)
sparams <- get_sampler_params(mcmc_res, inc_warmup=FALSE)
n_divergent <- sum(sapply(sparams, function(x) sum(x[, 'divergent__'])))
ddiv <- data.table(iternum, n_divergent)
list(ddiv = ddiv, dpars = dpars, dp = dp)
}</code></pre>
</div>
<div id="replication" class="section level4">
<h4>Replication</h4>
<p>Now we want to put all these pieces together and repeatedly execute those four functions and save the results from each. I’ve <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/" target="_blank">described</a> using <code>lapply</code> to calculate power in a much more traditional setting. We’re going to take the same approach here, except on steroids, replacing <code>lapply</code> not with <code>mclapply</code>, the parallel version, but with <code>Slurm_lapply</code>, which is a function in the <code>slurmR</code> package.</p>
<p><a href="https://slurm.schedmd.com/documentation.html" target="_blank">Slurm</a> (Simple Linux Utility for Resource Management) is a HPC cluster job scheduler. <a href="https://uscbiostats.github.io/slurmR/" target="_blank">slurmR</a> is a wrapper that mimics many of the R <code>parallel</code> package functions, but in a Slurm environment. The strategy here is to define a meta-function (<code>iteration</code>) that itself calls the four functions already described, and then call that function repeatedly. <code>Slurm_lapply</code> does that, and rather than allocating the iterations to different <em>cores</em> on a computer like <code>mclapply</code> does, it allocates the iterations to different <em>nodes</em> on the HPC, using what is technically called a <em>job array</em>. Each node is essentially its own computer. In addition to that, each node has multiple cores, so we can run the different MCMC chains in parallel within a node; we have parallel processes within a parallel process. I have access to 100 nodes at any one time, though I find I don’t get much performance improvement if I go over 90, so that is what I do here. Within each node, I am using 4 cores. I am running 1,980 iterations, so that is 22 iterations per node. As I mentioned earlier, all of this runs in about an hour and a half.</p>
<p>The following code includes the “meta-function” <code>iteration</code>, the compilation of the <code>Stan</code> model (which only needs to be done once, thankfully), the <code>Slurm_lapply</code> call, and the <strong>Slurm</strong> batch code that I need to execute to get the whole process started on the HPC, which is called Big Purple here at NYU. (All of the R code goes into a single <code>.R</code> file, the batch code is in a <code>.slurm</code> file, and the Stan code is in its own <code>.stan</code> file.)</p>
<pre class="r"><code>iteration <- function(iternum, s_model, nsites) {
s_defs <- s_define()
s_dd <- s_generate(s_defs, nsites = nsites)
s_est <- s_estimate(s_dd, s_model)
s_res <- s_extract(iternum, s_est)
return(s_res)
}</code></pre>
<pre class="r"><code>library(simstudy)
library(rstan)
library(data.table)
library(slurmR)
rt <- stanc("/.../r/freq_bayes.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
job <- Slurm_lapply(
X = 1:1980,
iteration,
s_model = sm,
nsites = 9,
njobs = 90,
mc.cores = 4,
tmp_path = "/.../scratch",
overwrite = TRUE,
job_name = "i_fb",
sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
export = c("s_define", "s_generate", "s_estimate", "s_extract"),
plan = "wait")
job
res <- Slurm_collect(job)
diverge <- rbindlist(lapply(res, function(l) l[["ddiv"]]))
ests <- rbindlist(lapply(res, function(l) l[["dpars"]]))
probs <- rbindlist(lapply(res, function(l) l[["dp"]]))
save(diverge, ests, probs, file = "/.../data/freq_bayes.rda")</code></pre>
<pre><code>#!/bin/bash
#SBATCH --job-name=fb_parent
#SBATCH --mail-type=END,FAIL # send email if the job end or fail
#SBATCH --mail-user=keith.goldfeld@nyulangone.org
#SBATCH --partition=cpu_short
#SBATCH --time=3:00:00 # Time limit hrs:min:sec
#SBATCH --output=fb.out # Standard output and error log
module load r/3.6.3
cd /.../r
Rscript --vanilla fb.R</code></pre>
</div>
<div id="results" class="section level4">
<h4>Results</h4>
<p>Each of the three extracted data tables are combined across simulations and the results are saved to an <code>.rda</code> file, which can be loaded locally in R and summarized. In this case, we are particularly interested in the Bayesian power estimate, which is the proportion of data sets that would results in a “go” decision (a recommendation to strongly consider using the intervention).</p>
<p>However, before we consider that, we should first get a rough idea about how many replications had <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">divergence issues</a>, which we extracted into the <code>diverge</code> data table. For each replication, we used four chains of length 3,500 each (after the 500 warm-up samples), accounting for a total of 14,000 chains. Here are the proportion of replications with at least one divergent chain:</p>
<pre class="r"><code>load("DataBayesCOMPILE/freq_bayes.rda")
diverge[, mean(n_divergent > 0)]</code></pre>
<pre><code>## [1] 0.102</code></pre>
<p>While 10% of replications with at least 1 divergent chain might seem a little high, we can get a little more comfort from the fact that it appears that almost all replications had fewer than 35 (0.25%) divergent chains:</p>
<pre class="r"><code>diverge[, mean(n_divergent < 35)]</code></pre>
<pre><code>## [1] 0.985</code></pre>
<p>To get a general sense of how well our model is working, we can plot the distribution of posterior medians. In particular, this will allow us to assess how well the model is recovering the values used in the data generating process. In this case, I am excluding the 29 replications with 35 or more divergent chains:</p>
<p><img src="https://www.rdatagen.net/post/2020-11-10-a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models.en_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Finally, we are ready to report the estimated Bayesian power (again, using the replications with limited number of divergent chains) and show the distribution of probabilities.</p>
<pre class="r"><code>probs_d <- merge(probs, diverge, by = "iternum")[n_divergent < 35]
probs_d[, mean(p.eff > 0.95 & p.clinic > 0.50)]</code></pre>
<pre><code>## [1] 0.726</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-11-10-a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>So, given an actual effect <span class="math inline">\(OR=\exp(-0.70) = 0.50\)</span>, we would conclude with a decision to go ahead with the therapy with 73% probability. However, a single estimate of power based on one effect size is a bit incomplete; it would be preferable to assess power under numerous scenarios of effect sizes and perhaps prior distribution assumptions to get a more complete picture. And if you have access to a HPC, this may actually be something you can do in a realistic period of time.</p>
</div>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The <code>stan</code> model that implements the model described at the outset actually looks a little different than that model in two key ways. First, there is a parameter <span class="math inline">\(\alpha\)</span> that appears in the outcome model, which represents an overall intercept across all studies. Ideally, we wouldn’t need to include this parameter since we want to fix it at zero, but the model behaves very poorly without it. We do include it, but with a highly restrictive prior that will constrain it to be very close to zero. The second difference is that standard normal priors appear in the model - this is to alleviate issues related to divergent chains, which I described in a <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">previous post</a>.</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=2> L; // number of WHO categories
int<lower=1> K; // number of studies
int<lower=1,upper=L> y[N]; // vector of categorical outcomes
int<lower=1,upper=K> kk[N]; // site for individual
int<lower=0,upper=1> ctrl[N]; // treatment or control
int<lower=1,upper=3> cc[K]; // specific control for site
int<lower=1,upper=3> ss[N]; // strata
row_vector[2] x[N]; // strata indicators N x 2 matrix
}
parameters {
real Delta; // overall control effect
vector[2] Gamma; // overall strata effect
real alpha; // overall intercept for treatment
ordered[L-1] tau[K]; // cut-points for cumulative odds model (K X [L-1] matrix)
real<lower=0> eta_0; // sd of delta_k (around delta)
// non-central parameterization
vector[K] z_ran_rx; // site-specific effect
vector[2] z_phi[K]; // K X 2 matrix
vector[3] z_delta;
vector[2] z_beta;
vector[2] z_gamma[3]; // 3 X 2 matrix
}
transformed parameters{
vector[3] delta; // control-specific effect
vector[K] delta_k; // site specific treatment effect
vector[2] gamma[3]; // control-specific duration strata effect (3 X 2 matrix)
vector[2] beta; // covariate estimates of ss
vector[2] gamma_k[K]; // site-specific duration strata effect (K X 2 matrix)
vector[N] yhat;
delta = 0.5 * z_delta + Delta; // was 0.1
beta = 5 * z_beta;
for (c in 1:3)
gamma[c] = 0.25 * z_gamma[c] + Gamma;
for (k in 1:K){
delta_k[k] = eta_0 * z_ran_rx[k] + delta[cc[k]];
}
for (k in 1:K)
gamma_k[k] = 1 * z_phi[k] + gamma[cc[k]];
for (i in 1:N)
yhat[i] = alpha + x[i] * beta + ctrl[i] * (delta_k[kk[i]] + x[i]*gamma_k[kk[i]]);
}
model {
// priors
z_ran_rx ~ std_normal();
z_delta ~ std_normal();
z_beta ~ std_normal();
alpha ~ normal(0, 0.25);
eta_0 ~ student_t(3, 0, 0.25);
Delta ~ student_t(3, 0, 2.5);
Gamma ~ student_t(3, 0, 1);
for (c in 1:3)
z_gamma[c] ~ std_normal();
for (k in 1:K)
z_phi[k] ~ std_normal();
for (k in 1:K)
tau[k] ~ student_t(3, 0, 5);
// outcome model
for (i in 1:N)
y[i] ~ ordered_logistic(yhat[i], tau[kk[i]]);
}
generated quantities {
real OR;
OR = exp(-Delta);
}</code></pre>
</div>
A refined brute force method to inform simulation of ordinal response data
https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/
Tue, 27 Oct 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Francisco, a researcher from Spain, reached out to me with a challenge. He is interested in exploring various models that estimate correlation across multiple responses to survey questions. This is the context:</p>
<ul>
<li>He doesn’t have access to actual data, so to explore analytic methods he needs to simulate responses.</li>
<li>It would be ideal if the simulated data reflect the properties of real-world responses, some of which can be gleaned from the literature.</li>
<li>The studies he’s found report only means and standard deviations of the ordinal data, along with the correlation matrices, <em>but not probability distributions of the responses</em>.</li>
<li>He’s considering <code>simstudy</code> for his simulations, but the function <code>genOrdCat</code> requires a set of probabilities for each response measure; it doesn’t seem like simstudy will be helpful here.</li>
</ul>
<p>Ultimately, we needed to figure out if we can we use the empirical means and standard deviations to derive probabilities that will yield those same means and standard deviations when the data are simulated. I thought about this for a bit, and came up with a bit of a work-around; the approach seems to work decently and doesn’t require any outrageous assumptions.</p>
<p>I might have kept this between the two of us, but in the process of looking more closely at my solution, I generated a plot that was so beautiful and interesting that I needed to post it. And since I am posting the image, I thought I might as well go ahead and describe the solution in case any one else might find it useful. But first, the plot:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<div id="a-little-more-detail" class="section level3">
<h3>A little more detail</h3>
<p>In the simplest scenario, we want to simulate responses from a single survey question with responses ranging from 1 to 7, where 1 might signify <em>totally disagree</em> and 7 would mean <em>totally agree</em>, with gradations in between. Responses collected from a population will be distributed across the seven categories, and the proportion of responses that fall within each category represents the probability of a response.</p>
<p>To inform the simulation, we have a journal article that reports only a mean and standard deviation from responses to that same question collected in an earlier study. The idea is to find the probabilities for the possible responses that correspond to those observed means and standard deviations. That is, how do we go from the mean and standard deviation to a set of probabilities?</p>
<p>The reverse - going from known probabilities to a mean response and standard deviation - is much easier: we just calculate the weighted mean and weighted standard deviations, where the weights are the probabilities.</p>
<p>For example, say the probability distribution of the seven categorical responses is 21%, 20%, 18%, 15%, 13%, 9%, and 4% responding 1, 2, … , and 7, respectively, and represented by this histogram:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Under this distribution the weighted mean and standard deviation are 3.2 and 1.8, respectively:</p>
<pre class="r"><code>weighted.mean(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))</code></pre>
<pre><code>## [1] 3.2</code></pre>
<pre class="r"><code>weighted.sd(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))</code></pre>
<pre><code>## [1] 1.8</code></pre>
</div>
<div id="the-brute-force-approach" class="section level3">
<h3>The brute force approach</h3>
<p>My first thought about how use <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> was simple, if a bit crude. Generate a slew of probabilities (like a million or so) and calculate the weighted mean and standard deviation for each distribution. I would look for the probabilities that yielded values that were close to my target (i.e. those that had been reported in the literature).</p>
<p>There are a couple of drawbacks to this approach. First, it is not particularly systematic, since we generating the probabilities randomly, and even though we have large numbers, we are not guaranteed to generate combinations that reflect our targets. Second, there is no reason to think that the generated randomly generated distributions will look like the true distribution. And third, there is no reason to think that, even if we do find a match, the distribution is unique.</p>
<p>I actually went ahead and implemented this approach and found two distributions that also yield <span class="math inline">\(\mu\)</span> = 3.2 and and <span class="math inline">\(\sigma\)</span> = 1.8 (truth be told, I did this part first and then found the distribution above using the method I will describe in a second):</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Here are the target <span class="math inline">\(\mu\)</span>’s and <span class="math inline">\(\sigma\)</span>’s for the distributions on the right and left:</p>
<pre class="r"><code>p_left <- c(0.218, 0.174, 0.170, 0.206, 0.134, 0.022, 0.077)
c(weighted.mean(1:7, p_left), weighted.sd(1:7, p_left))</code></pre>
<pre><code>## [1] 3.2 1.8</code></pre>
<pre class="r"><code>p_right <- c(0.185, 0.185, 0.247, 0.217, 0.011, 0.062, 0.092)
c(weighted.mean(1:7, p_right), weighted.sd(1:7, p_right))</code></pre>
<pre><code>## [1] 3.2 1.8</code></pre>
</div>
<div id="drawing-on-the-beta-distribution" class="section level3">
<h3>Drawing on the <em>beta</em> distribution</h3>
<p>Thinking about probabilities always draws me to the <em>beta</em> family distribution, a continuous distribution from 0 to 1. Theses distributions are parameterized with two shape values, often referred to as <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. Here are a few probability density functions (pdf’s) for <span class="math inline">\((a,b)\)</span> pairs of (1, 1.6) in yellow, (2, 4) in red, and (2, 2) in blue:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>I had an idea that generating different pdf’s based on different values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> might provide a more systematic way of generating probabilities. If we carve the pdf into <span class="math inline">\(K\)</span> sections (where <span class="math inline">\(K\)</span> is the number of responses, in our case 7), then the area under the pdf in the <span class="math inline">\(k\)</span>th slice could provide the probability for the <span class="math inline">\(k\)</span>th response. Since each pdf is unique (determined by specific values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>), this would ensure different (i.e. unique) sets of probabilities to search through.</p>
<p>Using the example from above where <span class="math inline">\(a\)</span> = 1 and <span class="math inline">\(b\)</span> = 1.6, here is how the slices look based on the seven categories:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>The cumulative probability at each slice <span class="math inline">\(x \in \{1, ..., 7\}\)</span> is (<span class="math inline">\(P(X < x/7)\)</span>), and can be calculated in <code>R</code> with the function <code>pbeta</code>:</p>
<pre class="r"><code>z <- pbeta((1:7)/7, 2, 4)
z</code></pre>
<pre><code>## [1] 0.15 0.44 0.71 0.89 0.97 1.00 1.00</code></pre>
<p>The probability of for each category is <span class="math inline">\(P(X = x) = P(X < x) - P(X < (x-1))\)</span>, and is calculated easily:</p>
<pre class="r"><code>p <- z - c(0, z[-7])
p</code></pre>
<pre><code>## [1] 0.1518 0.2904 0.2684 0.1786 0.0851 0.0239 0.0018</code></pre>
<p>This is the transformed probability distribution from the continuous <em>beta</em> scale to the discrete categorical scale:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
<p>And finally here are <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> associated with these values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>:</p>
<pre class="r"><code>c(weighted.mean(1:7, p), weighted.sd(1:7, p))</code></pre>
<pre><code>## [1] 2.8 1.3</code></pre>
</div>
<div id="brute-force-refined" class="section level3">
<h3>Brute force, refined</h3>
<p>If we create a grid of <span class="math inline">\((a, b)\)</span> values, there will be an associated, and unique, set of probabilities for each pair derived from slicing the pdf into <span class="math inline">\(K\)</span> sections And for each of these sets of probabilities, we can calculate the means and standard deviations. We then find the <span class="math inline">\((\mu, \sigma)\)</span> pair that is closest to our target. While this idea is not that much better than the brute force approach suggested above, at least it is now systematic. If we do it in two steps, first by searching for the general region and then zooming in to find a specific set of probabilities, we can really speed things up and use less memory.</p>
<p>Is limiting the search to <em>beta</em>-based distributions justifiable? It might depend on the nature of responses in a particular case, but it does seem reasonable; most importantly, it assures fairly well-behaved distributions that could plausibly reflect a wide range of response patterns. Barring any additional information about the distributions, then, I would have no qualms using this approach. (If it turns out that this is a common enough problem, I would even consider implementing the algorithm as a <code>simstudy</code> function.)</p>
<p>Now, it is time to reveal the secret of the plot (if you haven’t figured it out already). Each point is just the <span class="math inline">\((\mu, \sigma)\)</span> pair generated by a specific <span class="math inline">\((a, b)\)</span> pair. Here is the code, implementing the described algorithm:</p>
<pre class="r"><code>get_abp <- function(a, b, size) {
x <- 1:size
z <- pbeta((1:size)/size, a, b)
p <- z - c(0, z[-size])
sigma <- weighted.sd(x, p)
mu <- weighted.mean(x, p)
data.table(a, b, mu, sigma, t(p))
}
get_p <- function(a, b, n) {
ab <- asplit(expand.grid(a = a, b = b), 1)
rbindlist(lapply(ab, function(x) get_abp(x[1], x[2], n)))
}
a <- seq(.1, 25, .1)
b <- seq(.1, 25, .1)
ab_res <- get_p(a, b, 7)</code></pre>
<p>We can fill in the plot with more points by increasing the range of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> that we search, but creating such a huge table of look-up values is time consuming and starts to eat up memory. In any case, there is no need, because we will refine the search by zooming in on the area closest to our target.</p>
<p>Here is the plot again, based on <span class="math inline">\(a\)</span>’s and <span class="math inline">\(b\)</span>’s ranging from 0.1 to 25, with the superimposed target pair <span class="math inline">\((\mu =3.2, \sigma = 1.8)\)</span></p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>To zoom in, we first find the point in the grid that is closest to our target (based on Euclidean distance). We then define a finer grid around this point in the grid, and re-search for the closest point. We do have to be careful that we do not search for invalid values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> (i.e. <span class="math inline">\(a \le 0\)</span> and <span class="math inline">\(b \le 0\)</span>). Once we find our point, we have the associated probabilities.:</p>
<pre class="r"><code>t_mu = 3.2
t_s = 1.8
ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
close_point <- ab_res[distance == min(distance), .(a, b, distance)]
a_zoom<- with(close_point, seq(a - .25, a + .25, length = 75))
b_zoom<- with(close_point, seq(b - .25, b + .25, length = 75))
a_zoom <- a_zoom[a_zoom > 0]
b_zoom <- b_zoom[b_zoom > 0]
res_zoom <- get_p(a_zoom, b_zoom, 7)</code></pre>
<p>Here is the new search region:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>And the selection of the point:</p>
<pre class="r"><code>res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
res_zoom[distance == min(distance)]</code></pre>
<pre><code>## a b mu sigma V1 V2 V3 V4 V5 V6 V7 distance
## 1: 0.97 1.6 3.2 1.8 0.22 0.2 0.17 0.15 0.12 0.09 0.046 0.0021</code></pre>
</div>
<div id="applying-the-beta-search-to-a-bigger-problem" class="section level3">
<h3>Applying the <em>beta</em>-search to a bigger problem</h3>
<p>To conclude, I’ll finish with Francisco’s more ambitious goal of simulating correlated responses to multiple questions. In this case, we will assume four questions, all with responses ranging from 1 to 7. The target (<span class="math inline">\(\mu, \sigma\)</span>) pairs taken from the (hypothetical) journal article are:</p>
<pre class="r"><code>targets <- list(c(2.4, 0.8), c(4.1, 1.2), c(3.4, 1.5), c(5.8, 0.8))</code></pre>
<p>The correlation matrix taken from this same article is:</p>
<pre class="r"><code>corMat <- matrix(c(
1.00, 0.09, 0.11, 0.05,
0.09, 1.00, 0.35, 0.16,
0.11, 0.35, 1.00, 0.13,
0.05, 0.16, 0.13, 1.00), nrow=4,ncol=4)</code></pre>
<p>The <code>get_target_prob</code> function implements the search algorithm described above:</p>
<pre class="r"><code>get_target_prob <- function(t_mu, t_s, ab_res) {
ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
close_point <- ab_res[distance == min(distance), .(a, b, distance)]
a_zoom<- with(close_point, seq(a - .25, a + .25, length = 75))
b_zoom<- with(close_point, seq(b - .25, b + .25, length = 75))
a_zoom <- a_zoom[a_zoom > 0]
b_zoom <- b_zoom[b_zoom > 0]
res_zoom <- get_p(a_zoom, b_zoom, 7)
res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
baseprobs <- as.vector(res_zoom[distance == min(distance), paste0("V", 1:7)], "double")
baseprobs
}</code></pre>
<p>Calling the function conducts the search and provides probabilities for each question:</p>
<pre class="r"><code>probs <- lapply(targets, function(x) get_target_prob(x[1], x[2], ab_res))
(probs <- do.call(rbind, probs))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.1e-01 4.8e-01 0.3334 0.077 0.0059 8.8e-05 3.7e-08
## [2,] 7.9e-03 8.5e-02 0.2232 0.306 0.2537 1.1e-01 1.3e-02
## [3,] 1.0e-01 2.1e-01 0.2345 0.209 0.1502 7.9e-02 1.7e-02
## [4,] 2.5e-08 5.4e-05 0.0036 0.051 0.2638 5.0e-01 1.8e-01</code></pre>
<p>At least in theory, Francisco can now conduct his simulation study. In this case, I am generating a huge sample size to minimize sampling variation with the hope that we can recover the means, standard deviations and correlations, which, of course, we do:</p>
<pre class="r"><code>d_ind <- genData(100000)
dx <- genOrdCat(d_ind, adjVar = NULL, baseprobs = probs, catVar = "y",
corMatrix = corMat, asFactor = FALSE)</code></pre>
<pre class="r"><code>apply(as.matrix(dx[, -1]), 2, mean)</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## 2.4 4.1 3.4 5.8</code></pre>
<pre class="r"><code>apply(as.matrix(dx[, -1]), 2, sd)</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## 0.8 1.2 1.5 0.8</code></pre>
<pre class="r"><code>cor(as.matrix(dx[, -1]))</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## grp1 1.000 0.08 0.099 0.043
## grp2 0.080 1.00 0.331 0.142
## grp3 0.099 0.33 1.000 0.110
## grp4 0.043 0.14 0.110 1.000</code></pre>
<p>In the end, Francisco seemed to be satisfied with the solution - at least satisfied enough to go to the trouble to have a bottle of wine sent to me in New York City, which was definitely above and beyond. While my wife and I will certainly enjoy the wine - and look forward to being able to travel again so maybe we can enjoy a glass in person - seeing that image emerge from a <em>beta</em>-distribution was really all I needed. Salud.</p>
</div>
simstudy just got a little more dynamic: version 0.2.1
https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/
Tue, 13 Oct 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p><code>simstudy</code> version 0.2.1 has just been submitted to <a href="https://cran.rstudio.com/web/packages/simstudy/" target="_blank">CRAN</a>. Along with this release, the big news is that I’ve been joined by Jacob Wujciak-Jens as a co-author of the package. He initially reached out to me from Germany with some suggestions for improvements, we had a little back and forth, and now here we are. He has substantially reworked the underbelly of <code>simstudy</code>, making the package much easier to maintain, and positioning it for much easier extension. And he implemented an entire system of formalized tests using <a href="https://testthat.r-lib.org/" target="_blank">testthat</a> and <a href="https://cran.r-project.org/web/packages/hedgehog/vignettes/hedgehog.html" target="_blank">hedgehog</a>; that was always my intention, but I never had the wherewithal to pull it off, and Jacob has done that. But, most importantly, it is much more fun to collaborate on this project than to toil away on my own.</p>
<p>You readers, though, are probably more interested in the changes that, as a user, you will notice. There are a number of bug fixes (hopefully you never encountered those, but I know some of you have, because you have pointed them out to me) and improved documentation, including some new vignettes. There is even a nice new <a href="https://kgoldfeld.github.io/simstudy/index.html" target="_blank">website</a> that is created with the help of <a href="https://pkgdown.r-lib.org/" target="_blank">pkgdown</a>.</p>
<p>The most exciting extension of this new version is the ability to modify data definitions on the fly using externally defined variables. Often, we’d like to explore data generation and modeling under different scenarios. For example, we might want to understand the operating characteristics of a model given different variance or other parametric assumptions. There was already some functionality built into <code>simstudy</code> to facilitate this type of dynamic exploration, with <code>updateDef</code> and <code>updateDefAdd</code>, that allows users to edit lines of existing data definition tables. Now, there is an additional and, I think, more powerful mechanism - called <em>double-dot</em> reference - to access variables that do not already exist in a defined data set or data definition.</p>
<div id="double-dot-external-variable-reference" class="section level3">
<h3>Double-dot external variable reference</h3>
<p>It may be useful to think of an external reference variable as a type of hyperparameter of the data generation process. The reference is made directly in the formula itself, using a double-dot (“..”) notation before the variable name.</p>
<p>Here is a simple example:</p>
<pre class="r"><code>library(simstudy)
def <- defData(varname = "x", formula = 0,
variance = 5, dist = "normal")
def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x",
variance = "..sigma2", dist = "normal")
def</code></pre>
<pre><code>## varname formula variance dist link
## 1: x 0 5 normal identity
## 2: y ..B0 + ..B1 * x ..sigma2 normal identity</code></pre>
<p><code>B0</code>, <code>B1</code>, and <code>sigma2</code> are not part of the simstudy data definition, but will be set external to that process, either in the global environment or within the context of a function.</p>
<pre class="r"><code>B0 <- 4;
B1 <- 2;
sigma2 <- 9
set.seed(716251)
dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))
coef(fit)</code></pre>
<pre><code>## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4 0.28 14 2.6e-25
## x 2 0.13 15 5.9e-28</code></pre>
<pre class="r"><code>fit$sigma</code></pre>
<pre><code>## [1] 2.8</code></pre>
<p>It is easy to create a new data set on the fly with different slope and variance assumptions without having to go to the trouble of updating the data definitions.</p>
<pre class="r"><code>B1 <- 3
sigma2 <- 16
dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))
coef(fit)</code></pre>
<pre><code>## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4 0.43 10 4.6e-17
## x 3.1 0.22 14 8.6e-26</code></pre>
<pre class="r"><code>fit$sigma</code></pre>
<pre><code>## [1] 4.2</code></pre>
</div>
<div id="using-with-apply-functions" class="section level3">
<h3>Using with <em>apply</em> functions</h3>
<p>Double-dot references can be flexibly applied using <code>lapply</code> (or the parallel version <code>mclapply</code>) to create a range of data sets under different assumptions:</p>
<pre class="r"><code>gen_data <- function(sigma2, d) {
dd <- genData(400, d)
dd[, sigma2 := sigma2]
dd
}
sigma2s <- c(1:9)^2
dd_m <- lapply(sigma2s, function(s) gen_data(s, def))
dd_m <- rbindlist(dd_m)
ggplot(data = dd_m, aes(x = x, y = y)) +
geom_point(size = .5, color = "#DDAA33") +
facet_wrap(sigma2 ~ .) +
theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-10-13-simstudy-just-got-a-little-more-dynamic-version-0-2-0.en_files/figure-html/unnamed-chunk-5-1.png" width="768" /></p>
</div>
<div id="using-with-vectors" class="section level3">
<h3>Using with vectors</h3>
<p>Double-dot referencing is also vector friendly. For example, if we want to create a mixture distribution from a vector of values (which we can also do using a <em>categorical</em> distribution), we can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4:</p>
<pre class="r"><code>defblk <- defData(varname = "blksize",
formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")
defblk</code></pre>
<pre><code>## varname formula variance dist link
## 1: blksize ..sizes[1] | .5 + ..sizes[2] | .5 0 mixture identity</code></pre>
<pre class="r"><code>sizes <- c(2, 4)
genData(1000, defblk)</code></pre>
<pre><code>## id blksize
## 1: 1 2
## 2: 2 4
## 3: 3 2
## 4: 4 4
## 5: 5 4
## ---
## 996: 996 4
## 997: 997 2
## 998: 998 4
## 999: 999 4
## 1000: 1000 4</code></pre>
<p>There are a few other changes to the package that are described <a href="https://kgoldfeld.github.io/simstudy/news/index.html" target="_blank">here</a> (but look for version <code>0.2.0</code> - we found a pretty major bug right away and fixed it, hence <code>0.2.1</code>). Moving forward, we have some more things in the works, of course. And if you have suggestions of your own, you know where to find us.</p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>Here’s a more detailed example to show how double-dot references simplify things considerably in a case where I originally used the <code>updateDef</code> function. In a post where I described <a href="https://www.rdatagen.net/post/regression-to-the-mean" target="_blank">regression to the mean</a>, there is an <a href="https://www.rdatagen.net/post/regression-to-the-mean/#addendum" target="_blank">addendum</a> that I adapt here using double-dot references. I’m not going into the motivation for the code here - check out the <a href="https://www.rdatagen.net/post/regression-to-the-mean" target="_blank">post</a> if you’d like to see more.)</p>
<p>Here’s the data original code (both examples require the <code>parallel</code> package):</p>
<pre class="r"><code>d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "x2", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
dist = "nonrandom")
rtomean <- function(n, d) {
dd <- genData(n, d)
data.table(x1 = dd[x1 >= h1, mean(x1)] , x2 = dd[x1 >= h1, mean(x2)])
}
repl <- function(xvar, nrep, ucoef, d) {
d <- updateDef(d, "x1", newvariance = xvar)
d <- updateDef(d, "x2", newvariance = xvar)
dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d)))
mudif <- unlist(lapply(dif, mean))
data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
}
dres <- list()
i <- 0
for (ucoef in c(0, 1, 2, 3)) {
i <- i + 1
uform <- genFormula( c(0, ucoef), "U")
d <- updateDef(d, "x1", newformula = uform)
d <- updateDef(d, "x2", newformula = uform)
dr <- mclapply(seq(1, 4, by = 1), function(x) repl(x, 1000, ucoef, d))
dres[[i]] <- rbindlist(dr)
}
dres <- rbindlist(dres)</code></pre>
<p>And here is the updated code:</p>
<pre class="r"><code>d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "x2", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
dist = "nonrandom")
rtomean <- function(n, d, ucoef, xvar) {
dd <- genData(n, d)
data.table(x1 = dd[x1 >= h1, mean(x1)] , x2 = dd[x1 >= h1, mean(x2)])
}
repl <- function(nrep, d, ucoef, xvar) {
dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d, ucoef, xvar)))
mudif <- unlist(lapply(dif, mean))
data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
}
ucoef <- c(0:3)
xvar <- c(1:4)
params <- asplit(expand.grid(ucoef = ucoef, xvar = xvar), 1)
dres <- rbindlist(mclapply(params, function(x) repl(1000, d, x["ucoef"], x["xvar"])))</code></pre>
<p>The code is much cleaner and the data generating process doesn’t really lose any clarity. Importantly, this change has allowed me to take advantage of the <code>apply</code> approach (rather than using a loop). I’d conclude that double-dot references have the potential to simplify things quite a bit.</p>
</div>
Permuted block randomization using simstudy
https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/
Tue, 29 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Along with preparing power analyses and statistical analysis plans (SAPs), generating study randomization lists is something a practicing biostatistician is occasionally asked to do. While not a particularly interesting activity, it offers the opportunity to tackle a small programming challenge. The title is a little misleading because you should probably skip all this and just use the <code>blockrand</code> package if you want to generate randomization schemes; don’t try to reinvent the wheel. But, I can’t resist. Since I was recently asked to generate such a list, I’ve been wondering how hard it would be to accomplish this using <code>simstudy</code>. There are already built-in functions for simulating stratified randomization schemes, so maybe it could be a good solution. The key element that is missing from simstudy, of course, is the permuted block setup.</p>
<div id="why-permuted-block-randomization" class="section level3">
<h3>Why permuted block randomization?</h3>
<p>What is <em>permuted block</em> randomization and why even bother? The <em>block</em> part indicates that patients will be randomized in sub-groups. If we have blocks of size two, patients will be paired sequentially, with each patient randomized to different arms. This helps with balance over time and when recruitment is less than anticipated. If we were to randomize 100 patients to treatment and control (50 in each arm), it is possible that treatment assignments will cluster close to the start of the trial, just by chance. If the time period is related to the outcome in some way, this would be undesirable. Furthermore, if trial recruitment lagged and had to stop early, there would be an actual lack of balance across the arms.</p>
<p>The argument for block randomization seems strong enough. But if coordinators in the field know that we are using this approach, there is a risk of influencing patient recruitment. If it is known that patients are randomized in blocks of size four, and the first two patients are randomized to drug therapy, the coordinator will know that the next two patients will be randomized to control. This could influence who the coordinator recruits into the study, particularly if they believe drug therapy is superior. They may actively or unconsciously recruit healthier patients when it is known that they are going to get the control. (This, of course, is much less of an issue when recruiters/coordinators are completely blinded to group assignment.) By changing the block sizes in an unpredictable manner, by <em>permuting</em> the sizes, this problem is solved. Hence, <em>permuted block randomization</em>.</p>
</div>
<div id="simstudy-code" class="section level3">
<h3>simstudy code</h3>
<p>I want to walk through the code that will generate permuted block randomization. In this scenario we are conducting a trial to compare a <em>drug</em> therapy with <em>placebo</em> in at least 120 patients. We would like to randomize within blocks of size two or four, and the order of the blocks will themselves be randomized. We assume that each block size will have equal probability of being selected, though balance across different block sizes is not guaranteed. The preliminary code shown here implements these specifications:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
set.seed(1234)
n <- 120
levels <- c("D", "P") # Drug/Placebo
blk_sizes <- c(2, 4)
n_arms <- length(levels)
p_blk <- 1/length(blk_sizes)</code></pre>
<p>The first step is to generate a sequence of blocks with varying block sizes. We take advantage of the <code>mixture</code> distribution option in <code>simstudy</code> to generate blocks. This distribution is specified using a string with the format “<span class="math inline">\(x_1|p_1 + \dots + x_k|p_k\)</span>”. In this case, there are <span class="math inline">\(k=2\)</span> block sizes; <span class="math inline">\(x_1 = 2\)</span>, <span class="math inline">\(p_1 = 0.5\)</span>, <span class="math inline">\(x_2 = 4\)</span>, and <span class="math inline">\(p_2 = 0.5\)</span>. We construct the mixture formula using the predefined block sizes, and use that formula to define the data that we need to generate:</p>
<pre class="r"><code>v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))
mixformula <- paste(v, collapse = "+")
def <- defData(varname = "blk_size", formula = mixformula,
dist = "mixture", id = "blk_id")
def</code></pre>
<pre><code>## varname formula variance dist link
## 1: blk_size 2|0.5+4|0.5 0 mixture identity</code></pre>
<p>Now, we need generate enough blocks to support the target number of patients to be randomized; that is, the sum of the block sizes should at least as large as the target. If all block sizes were the minimum block size (in this case <span class="math inline">\(2\)</span>), then we would need at least <span class="math inline">\(n/2\)</span> blocks. Clearly, we will need fewer, but we will start with <span class="math inline">\(n/2\)</span> here and remove the excess:</p>
<pre class="r"><code>maxblocks <- ceiling(n / min(blk_sizes))
dd <- genData(maxblocks, def)
#--- removing the excess
nblocks <- dd[, threshold := (cumsum(blk_size) >= n) * .I][threshold > 0]
dd <- dd[1:nblocks[1, threshold]]
tail(dd)</code></pre>
<pre><code>## blk_id blk_size threshold
## 1: 36 4 0
## 2: 37 2 0
## 3: 38 2 0
## 4: 39 4 0
## 5: 40 4 0
## 6: 41 4 41</code></pre>
<p>In the final step, we use cluster data generation to create the individual patients, defining each block as a cluster. Treatment assignment is stratified by each block:</p>
<pre class="r"><code>di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size", level1ID = "id")
dtrt <- trtAssign(di, nTrt = n_arms, strata = "blk_id", grpName = "arm")
dtrt <- dtrt[, .(id, blk_id, blk_size, arm = factor(arm, labels = levels))]</code></pre>
<p>Here are examples of the block randomization results for four blocks:</p>
<pre class="r"><code>dtrt[blk_id == 5]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 15 5 4 P
## 2: 16 5 4 D
## 3: 17 5 4 P
## 4: 18 5 4 D</code></pre>
<pre class="r"><code>dtrt[blk_id == 8]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 25 8 2 D
## 2: 26 8 2 P</code></pre>
<pre class="r"><code>dtrt[blk_id == 19]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 59 19 2 P
## 2: 60 19 2 D</code></pre>
<pre class="r"><code>dtrt[blk_id == 26]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 73 26 4 D
## 2: 74 26 4 P
## 3: 75 26 4 P
## 4: 76 26 4 D</code></pre>
</div>
<div id="a-real-world-application" class="section level3">
<h3>A real-world application</h3>
<p>I’ve created a function <code>blkRandom</code> based on this code so that I can illustrate this functionality in a more realistic setting. In a current multi-site study that I’m working on (already did the power analysis, finalizing the SAP), we need to provide a randomization list so that subject recruitment can begin. Randomization will be stratified by each of the sites (1 through 7), by sex (M and F), and by location of recruitment (A or B); in total, there will be <span class="math inline">\(7\times 2 \times 2 = 28\)</span> strata. For each of the 28 strata we want to randomize 50 potential subjects using permuted block randomization; for particular strata, this is certainly too large a number, but it doesn’t hurt to overestimate as long as the clinical trial software system can handle it.</p>
<p>Here is how the function would work for a single strata (just showing the first and last blocks):</p>
<pre class="r"><code>blkRandom(n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))[c(1:4, 47:50)]</code></pre>
<pre><code>## blk_id blk_size threshold id arm
## 1: 1 4 0 1 A
## 2: 1 4 0 2 B
## 3: 1 4 0 3 A
## 4: 1 4 0 4 B
## 5: 13 4 13 47 B
## 6: 13 4 13 48 A
## 7: 13 4 13 49 A
## 8: 13 4 13 50 B</code></pre>
<p>Here is a wrapper function for <code>blkRandom</code> that incorporates a specific strata <span class="math inline">\(s\)</span>. This will enable us to do permuted block randomization within different subgroups of the population, such as males and females, or sites:</p>
<pre class="r"><code>sBlkRandom <- function(s, n, levels, blk_sizes) {
dB <- blkRandom(n, levels, blk_sizes)
dB[, .(id = paste0(id, s), stratum = s, arm)]
}
sBlkRandom(s = "M1A", n = 30, levels = c("A", "B"), blk_sizes = c(2, 4))[1:5]</code></pre>
<pre><code>## id stratum arm
## 1: 1M1A M1A B
## 2: 2M1A M1A A
## 3: 3M1A M1A A
## 4: 4M1A M1A B
## 5: 5M1A M1A B</code></pre>
<p>All the pieces are now in place.</p>
<p>We need to create a list of strata, each of which requires its own permuted block randomization list:</p>
<pre class="r"><code>library(tidyr)
#--- specify all strata variables
site <- c(1 : 7)
sex <- c("M", "F")
location <- c("A", "B")
#--- create strata
strata <- expand.grid(sex = sex,site = site, location = location)
strata <- unite(strata, "stratum", sep = "")$stratum
head(strata)</code></pre>
<pre><code>## [1] "M1A" "F1A" "M2A" "F2A" "M3A" "F3A"</code></pre>
<p>With the list of strata in hand - we create the randomization lists using <code>lapply</code> to repeatedly call <code>sBlkRandom</code>:</p>
<pre class="r"><code>rbindlist(lapply(
strata,
function(s) sBlkRandom(s, n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))
)
)</code></pre>
<pre><code>## id stratum arm
## 1: 1M1A M1A B
## 2: 2M1A M1A A
## 3: 3M1A M1A A
## 4: 4M1A M1A B
## 5: 5M1A M1A B
## ---
## 1420: 46F7B F7B B
## 1421: 47F7B F7B B
## 1422: 48F7B F7B A
## 1423: 49F7B F7B B
## 1424: 50F7B F7B A</code></pre>
</div>
<div id="going-with-the-tried-and-true" class="section level3">
<h3>Going with the tried and true</h3>
<p>This has been fun and I think successful, but as I mentioned, you might want to stick with the established <a href="https://cran.r-project.org/web/packages/blockrand/index.html" target="_blank">blockrand</a> package that is designed around this very specific goal. I have written a simple wrapper function that rectifies one slightly minor shortcoming (block sizes are specified as blk_size<span class="math inline">\(/ 2\)</span>) and allows us to use <code>lapply</code> to make repeated calls across the strata:</p>
<pre class="r"><code>library(blockrand)
rand_stratum <- function(stratum, n, levels, blk_sizes) {
blk_sizes <- blk_sizes / 2
dB <- data.table(blockrand(
n = n,
num.levels = length(levels),
levels = levels,
id.prefix = stratum,
block.prefix = stratum,
stratum = stratum,
block.sizes = blk_sizes)
)
dB[, .(id, stratum, treatment)]
}
rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))</code></pre>
<pre><code>## id stratum treatment
## 1: M1A01 M1A B
## 2: M1A02 M1A A
## 3: M1A03 M1A B
## 4: M1A04 M1A A
## 5: M1A05 M1A B
## ---
## 1420: F7B46 F7B A
## 1421: F7B47 F7B A
## 1422: F7B48 F7B B
## 1423: F7B49 F7B A
## 1424: F7B50 F7B B</code></pre>
<p>In case the elegance and simplicity (not to mention all the other features that it provides, but I didn’t show you) are not compelling enough, the speed comparison isn’t even close: <code>blockrand</code> is 7 times faster than my <code>simstudy</code> solution (albeit on the millisecond scale - so you might not actually notice it).</p>
<pre class="r"><code>library(microbenchmark)
microbenchmark(
rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A", "B"), c(2, 4)))),
rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))
)</code></pre>
<pre><code>## Unit: milliseconds
## expr
## rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A", "B"), c(2, 4))))
## rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))
## min lq mean median uq max neval cld
## 276 290 331 304 344 698 100 b
## 38 40 48 44 49 120 100 a</code></pre>
</div>
Generating probabilities for ordinal categorical data
https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/
Tue, 15 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">here</a>. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.</p>
<div id="motivation" class="section level3">
<h3>Motivation</h3>
<p>The outcome for the analysis that we will be conducting is the WHO 11-point ordinal scale for clinical improvement at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. We plan to use a Bayesian proportional odds model to assess the effectiveness of the therapy. Since this is a meta-analysis, we will be including these data from a collection of studies being conducted around the world.</p>
<p>Typically, in a proportional odds model one has to make an <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/" target="{"_blank"">assumption about proportionality</a>. In this case, while we are willing to make that assumption <em>within</em> specific studies, we are not willing to make that assumption <em>across</em> the various studies. This means we need to generate a separate set of intercepts for each study that we simulate.</p>
<p>In the proportional odds model, we are modeling the log-cumulative odds at a particular level. The simplest model with a single exposure/treatment covariate for a specific study or cluster <span class="math inline">\(k\)</span> is</p>
<p><span class="math display">\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{xk} + \beta A,\]</span>
where <span class="math inline">\(x\)</span> ranges from 1 to 10, all the levels of the WHO score excluding the lowest level <span class="math inline">\(x=0\)</span>. <span class="math inline">\(A\)</span> is the treatment indicator, and is <span class="math inline">\(A=1\)</span> for patients who receive the treatment. <span class="math inline">\(\alpha_{xk}\)</span> is the intercept for each study/cluster <span class="math inline">\(k\)</span>. <span class="math inline">\(\beta\)</span> is interpreted as the log-odds ratio comparing the odds of the treated with the non-treated within each study. The proportionality assumption kicks in here when we note that <span class="math inline">\(\beta\)</span> is constant for all levels of <span class="math inline">\(x\)</span>. In addition, in this particular model, we are assuming that the log-odds ratio is constant across studies (not something we will assume in a more complete model). We make no assumptions about how the study intercepts relate to each other.</p>
<p>To make clear what it would mean to make a stronger assumption about the odds <em>across</em> studies consider this model:</p>
<p><span class="math display">\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{x} + b_k + \beta A,\]</span></p>
<p>where the intercepts for each study are related, since they are defined as <span class="math inline">\(\alpha_{x} + b_k\)</span>, and share <span class="math inline">\(\alpha_x\)</span> in common. If we compare the log-odds of the treated in one study <span class="math inline">\(k\)</span> with the log-odds of treated in another study <span class="math inline">\(j\)</span> (so <span class="math inline">\(A=1\)</span> in both cases), the log-odds ratio is <span class="math inline">\(b_j - b_k\)</span>. The ratio is independent of <span class="math inline">\(x\)</span>, which implies a strong proportional odds assumption across studies. In contrast, the same comparison across studies based on the first model is <span class="math inline">\(\alpha_{xj} - \alpha_{xk}\)</span>, which is <em>not</em> necessarily constant across different levels of <span class="math inline">\(x\)</span>.</p>
<p>This is a long way of explaining why we need to generate different sets of intercepts for each study. In short, we would like to make the more relaxed assumption that odds are not proportional across studies or clusters.</p>
</div>
<div id="the-dirichlet-distribution" class="section level3">
<h3>The Dirichlet distribution</h3>
<p>In order to generate ordinal categorical data I use the <a href="https://kgoldfeld.github.io/simstudy/reference/genOrdCat.html" target="_blank">genOrdCat</a> function in the <code>simstudy</code> package. This function requires a set of baseline probabilities that sum to one; these probabilities map onto level-specific intercepts. There will be a distinct set of baseline probabilities for each study and I will create a data set for each study. The challenge is to be able to generate unique baseline probabilities as if I were sampling from a population of studies.</p>
<p>If I want to generate a single probability (i.e. a number between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>), a good solution is to draw a value from a <em>beta</em> distribution, which has two shape parameters <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>.</p>
<p>Here is a single draw from <span class="math inline">\(beta(3, 3)\)</span>:</p>
<pre class="r"><code>set.seed(872837)
rbeta(1, shape1 = 3, shape2 = 3)</code></pre>
<pre><code>## [1] 0.568</code></pre>
<p>The mean of the <em>beta</em> distribution is <span class="math inline">\(\alpha/(\alpha + \beta)\)</span> and the variance is <span class="math inline">\(\alpha\beta/(\alpha+\beta)^2(\alpha + \beta + 1)\)</span>. We can reduce the variance and maintain the same mean by increasing <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> by a constant factor (see <a href="#addendum">addendum</a> for a pretty picture):</p>
<pre class="r"><code>library(data.table)
d1 <- data.table(s = 1, value = rbeta(1000, shape1 = 1, shape2 = 2))
d2 <- data.table(s = 2, value = rbeta(1000, shape1 = 5, shape2 = 10))
d3 <- data.table(s = 3, value = rbeta(1000, shape1 = 100, shape2 = 200))
dd <- rbind(d1, d2, d3)
dd[, .(mean(value), sd(value)), keyby = s]</code></pre>
<pre><code>## s V1 V2
## 1: 1 0.338 0.2307
## 2: 2 0.336 0.1195
## 3: 3 0.333 0.0283</code></pre>
<p>The <em>Dirichlet</em> distribution is a multivariate version of the <em>beta</em> distribution where <span class="math inline">\(K\)</span> values between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span> are generated, with the caveat that they sum to <span class="math inline">\(1\)</span>. Instead of <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>, the Dirichlet is parameterized by a vector of length <span class="math inline">\(K\)</span></p>
<p><span class="math display">\[\boldsymbol{\alpha} = \left(\alpha_1,\dots, \alpha_K\right)^T,\]</span></p>
<p>where there are <span class="math inline">\(K\)</span> levels of the ordinal outcome. A draw from this distribution returns a vector <span class="math inline">\(\boldsymbol{p} = ( p_1, \dots, p_K)^T\)</span> where <span class="math inline">\(\sum_{i=1}^K p_i = 1\)</span> and</p>
<p><span class="math display">\[E(p_k)=\frac{\alpha_k}{\sum_{i=1}^K \alpha_i}.\]</span>
A draw from a Dirichlet distribution with <span class="math inline">\(K=2\)</span> is actually equivalent to a draw from a <em>beta</em> distribution where <span class="math inline">\(\boldsymbol{\alpha} = (\alpha, \beta)^T\)</span>. Before, I generated data from a <span class="math inline">\(beta(1, 2)\)</span>, and now here is a draw from <span class="math inline">\(Dirichlet\left(\boldsymbol\alpha = (1,2)\right)\)</span> using <code>rdirichlet</code> from the <code>gtools</code> package:</p>
<pre class="r"><code>library(gtools)
dir <- rdirichlet(1000, alpha = c(1,2))
head(dir)</code></pre>
<pre><code>## [,1] [,2]
## [1,] 0.3606 0.639
## [2,] 0.4675 0.533
## [3,] 0.2640 0.736
## [4,] 0.0711 0.929
## [5,] 0.5643 0.436
## [6,] 0.0188 0.981</code></pre>
<p>The first column has the same distribution as the <span class="math inline">\(beta\)</span> distribution from before; the mean and standard deviation are close to the values estimated above:</p>
<pre class="r"><code>c(mean(dir[,1]), sd(dir[,1]))</code></pre>
<pre><code>## [1] 0.332 0.236</code></pre>
<p>To ramp things up a bit, say we have <span class="math inline">\(K = 5\)</span>, and the target mean values for each level are <span class="math inline">\(\boldsymbol{p} = \left(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9} \right)\)</span>, one way to specify this is:</p>
<pre class="r"><code>dir_1 <- rdirichlet(1000, alpha = c(1, 2, 3, 2, 1))
head(dir_1)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1710 0.6637 0.0676 0.0633 0.0343
## [2,] 0.1130 0.1150 0.2803 0.4229 0.0689
## [3,] 0.1434 0.0678 0.3316 0.1721 0.2851
## [4,] 0.0250 0.1707 0.3841 0.2490 0.1712
## [5,] 0.0633 0.3465 0.4056 0.0853 0.0993
## [6,] 0.1291 0.1510 0.3993 0.2612 0.0593</code></pre>
<p>Here are the observed means for each <span class="math inline">\(p_k\)</span>, pretty close to the target:</p>
<pre class="r"><code>apply(dir_1, 2, mean)</code></pre>
<pre><code>## [1] 0.111 0.221 0.328 0.229 0.112</code></pre>
<p>Of course, we could generate data with a similar target <span class="math inline">\(\boldsymbol{p}\)</span> by multiplying <span class="math inline">\(\boldsymbol\alpha\)</span> by a constant <span class="math inline">\(c\)</span>. In this case, we use <span class="math inline">\(c=10\)</span> and see that the average values for each <span class="math inline">\(p_k\)</span> are also close to the target:</p>
<pre class="r"><code>dir_2 <- rdirichlet(1000, alpha = c(10, 20, 30, 20, 10))
apply(dir_2, 2, mean)</code></pre>
<pre><code>## [1] 0.113 0.222 0.334 0.220 0.111</code></pre>
<p>There is a key difference between specifying <span class="math inline">\(\boldsymbol{\alpha}\)</span> and <span class="math inline">\(c\boldsymbol{\alpha}\)</span>. Just as in the <em>beta</em> distribution, as <span class="math inline">\(c\)</span> grows larger, the variation within each <span class="math inline">\(p_k\)</span> decreases. This will be useful when generating the study specific probabilities if we want explore different levels of variation.</p>
<p>Here’s the standard deviations from the two data sets just generated:</p>
<pre class="r"><code>apply(dir_1, 2, sd)</code></pre>
<pre><code>## [1] 0.102 0.131 0.144 0.134 0.098</code></pre>
<pre class="r"><code>apply(dir_2, 2, sd)</code></pre>
<pre><code>## [1] 0.0333 0.0425 0.0508 0.0421 0.0333</code></pre>
</div>
<div id="generating-the-baseline-probabilities" class="section level3">
<h3>Generating the baseline probabilities</h3>
<p>A simple function that includes two key arguments - the base probabilities (which are really <span class="math inline">\(\boldsymbol{\alpha}\)</span>) and a <em>similarity</em> index (which is really just the constant <span class="math inline">\(c\)</span>) - implements these ideas to generate study-specific probabilities for each outcome level. As the <em>similarity</em> index increases, the variation across studies or sites decreases. The function includes an additional adjustment to ensure that the row totals sum exactly to <span class="math inline">\(1\)</span> and not to some value infinitesimally greater than <span class="math inline">\(1\)</span> as a result of rounding. Such a rounding error could cause problems for the function <code>genOrdCat</code>.</p>
<pre class="r"><code>genBaseProbs <- function(n, base, similarity, digits = 8) {
n_levels <- length(base)
x <- rdirichlet(n, similarity * base)
#--- ensure that each vector of probabilities sums exactly to 1
x <- round(floor(x*1e8)/1e8, digits) # round the generated probabilities
xpart <- x[, 1:(n_levels-1)] # delete the base prob of the final level
partsum <- apply(xpart, 1, sum) # add the values of levels 1 to K-1
x[, n_levels] <- 1 - partsum # the base prob of the level K = 1 - sum(1:[K-1])
return(x)
}</code></pre>
<p>In this first example, I am generating 11 values (representing base probabilities) for each of 9 studies using a relatively low similarity index, showing you the first six studies:</p>
<pre class="r"><code>basestudy <- genBaseProbs(
n = 9,
base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
similarity = 15,
)
round(head(basestudy), 3)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 0.094 0.022 0.121 0.100 0.061 0.102 0.053 0.309 0.059 0.078 0.000
## [2,] 0.025 0.079 0.043 0.197 0.083 0.044 0.099 0.148 0.025 0.150 0.107
## [3,] 0.007 0.042 0.084 0.066 0.049 0.145 0.191 0.323 0.078 0.012 0.003
## [4,] 0.061 0.021 0.063 0.104 0.092 0.292 0.112 0.110 0.113 0.026 0.008
## [5,] 0.067 0.023 0.021 0.042 0.063 0.473 0.108 0.127 0.016 0.013 0.046
## [6,] 0.001 0.018 0.054 0.225 0.150 0.301 0.043 0.081 0.100 0.008 0.020</code></pre>
<p>A great way to see the variability is a cumulative probability plot for each individual study. With a relatively low <em>similarity</em> index, you can generate quite a bit of variability across the studies. In order to create the plot, I need to first calculate the cumulative probabilities:</p>
<pre class="r"><code>library(ggplot2)
library(viridis)
cumprobs <- data.table(t(apply(basestudy, 1, cumsum)))
n_levels <- ncol(cumprobs)
cumprobs[, id := .I]
dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]
ggplot(data = dm, aes(x=level, y = value)) +
geom_line(aes(group = id, color = id)) +
scale_color_viridis( option = "D") +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
legend.position = "none")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>Here is a plot of data generated using a <em>similarity</em> index of 150. Variation is reduced pretty dramatically:</p>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
</div>
<div id="using-base-probabilities-to-generate-ordinal-data" class="section level3">
<h3>Using base probabilities to generate ordinal data</h3>
<p>Now that we have these base probabilities, the last step is to use them to generate ordinal outcomes. I am generating the simplest of data sets: 9 “studies” each with 500 subjects, without any covariates or even treatment assignment. Since the <code>genOrdCat</code> requires an adjustment variable, I am adjusting everyone by 0. (This is something I need to fix - there should be no such requirement.)</p>
<pre class="r"><code>library(simstudy)
d_study <- genData(9, id = "study")
d_ind <- genCluster(d_study, "study", numIndsVar = 500, "id")
d_ind[, z := 0]
d_ind</code></pre>
<pre><code>## study id z
## 1: 1 1 0
## 2: 1 2 0
## 3: 1 3 0
## 4: 1 4 0
## 5: 1 5 0
## ---
## 4496: 9 4496 0
## 4497: 9 4497 0
## 4498: 9 4498 0
## 4499: 9 4499 0
## 4500: 9 4500 0</code></pre>
<p>To generate the ordinal categorical outcome, we have to treat each study separately since they have unique baseline probabilities. This can be accomplished using <code>lapply</code> in the following way:</p>
<pre class="r"><code>basestudy <- genBaseProbs(
n = 9,
base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
similarity = 50
)
list_ind <- lapply(
X = 1:9,
function(i) {
b <- basestudy[i,]
d_x <- d_ind[study == i]
genOrdCat(d_x, adjVar = "z", b, catVar = "ordY")
}
)</code></pre>
<p>The output <code>list_ind</code> is a list of data tables, one for each study. For example, here is the 5th data table in the list:</p>
<pre class="r"><code>list_ind[[5]]</code></pre>
<pre><code>## study id z ordY
## 1: 5 2001 0 7
## 2: 5 2002 0 9
## 3: 5 2003 0 5
## 4: 5 2004 0 9
## 5: 5 2005 0 9
## ---
## 496: 5 2496 0 9
## 497: 5 2497 0 4
## 498: 5 2498 0 7
## 499: 5 2499 0 5
## 500: 5 2500 0 11</code></pre>
<p>And here is a table of proportions for each study that we can compare with the base probabilities:</p>
<pre class="r"><code>t(sapply(list_ind, function(x) x[, prop.table(table(ordY))]))</code></pre>
<pre><code>## 1 2 3 4 5 6 7 8 9 10 11
## [1,] 0.106 0.048 0.086 0.158 0.058 0.162 0.092 0.156 0.084 0.028 0.022
## [2,] 0.080 0.024 0.092 0.134 0.040 0.314 0.058 0.110 0.028 0.110 0.010
## [3,] 0.078 0.050 0.028 0.054 0.148 0.172 0.162 0.134 0.058 0.082 0.034
## [4,] 0.010 0.056 0.116 0.160 0.054 0.184 0.102 0.084 0.156 0.056 0.022
## [5,] 0.010 0.026 0.036 0.152 0.150 0.234 0.136 0.084 0.120 0.026 0.026
## [6,] 0.040 0.078 0.100 0.092 0.170 0.168 0.196 0.050 0.038 0.034 0.034
## [7,] 0.006 0.064 0.058 0.064 0.120 0.318 0.114 0.068 0.082 0.046 0.060
## [8,] 0.022 0.070 0.038 0.160 0.182 0.190 0.074 0.068 0.070 0.036 0.090
## [9,] 0.054 0.046 0.052 0.128 0.100 0.290 0.102 0.092 0.080 0.030 0.026</code></pre>
<p>Of course, the best way to compare is to plot the data for each study. Here is another cumulative probability plot, this time including the observed (generated) probabilities in black over the baseline probabilities used in the data generation in red:</p>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-18-1.png" width="672" /></p>
<p>Sometime soon, I plan to incorporate something like the function <code>genBaseProbs</code> into <code>simstudy</code> to make it easier to incorporate non-proportionality assumptions into simulation studies that use ordinal categorical outcomes.</p>
<p id="addendum">
</p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The variance of the <em>beta</em> distribution (and similarly the <em>Dirichlet</em> distribution) decreases as <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> both increase proportionally (keeping the mean constant). I’ve plotted the variance of the <em>beta</em> distribution for <span class="math inline">\(\alpha = 1\)</span> and different levels of <span class="math inline">\(\beta\)</span> and <span class="math inline">\(C\)</span>. It is clear that at any level of <span class="math inline">\(\beta\)</span> (I’ve drawn a line at <span class="math inline">\(\beta = 1\)</span>), the variance decreases as <span class="math inline">\(C\)</span> increases. It is also clear that, holding <span class="math inline">\(\alpha\)</span> constant, the relationship of <span class="math inline">\(\beta\)</span> to variance is not strictly monotonic:</p>
<pre class="r"><code>var_beta <- function(params) {
a <- params[1]
b <- params[2]
(a * b) / ( (a + b)^2 * (a + b + 1))
}
loop_b <- function(C, b) {
V <- sapply(C, function(x) var_beta(x*c(1, b)))
data.table(b, V, C)
}
b <- seq(.1, 25, .1)
C <- c(0.01, 0.1, 0.25, 0.5, 1, 2, 4, 10, 100)
d_var <- rbindlist(lapply(b, function(x) loop_b(C, x)))
ggplot(data = d_var, aes(x = b, y = V, group = C)) +
geom_vline(xintercept = 1, size = .5, color = "grey80") +
geom_line(aes(color = factor(C))) +
scale_y_continuous(name = expression("Var beta"~(alpha==1~","~beta))) +
scale_x_continuous(name = expression(beta)) +
scale_color_viridis(discrete = TRUE, option = "B", name = "C") +
theme(panel.grid = element_blank(),
legend.title.align=0.15)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
</div>
Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis
https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/
Tue, 01 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>The federal government recently granted emergency approval for the use of antibody rich blood plasma when treating hospitalized COVID-19 patients. This announcement is <a href="https://www.statnews.com/2020/08/24/trump-opened-floodgates-convalescent-plasma-too-soon/" target="_blank">unfortunate</a>, because we really don’t know if this promising treatment works. The best way to determine this, of course, is to conduct an experiment, though this approval makes this more challenging to do; with the general availability of convalescent plasma (CP), there may be resistance from patients and providers against participating in a randomized trial. The emergency approval sends the incorrect message that the treatment is definitively effective. Why would a patient take the risk of receiving a placebo when they have almost guaranteed access to the therapy?</p>
<p>This doesn’t obviate the fact that we still need to figure out if CP is effective. Last month, I described an <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="_blank">approach</a> to pooling data across multiple different, but similar, studies as way to leverage information that is being generated around the country and world from ongoing trials. I was particularly cryptic about the nature of the research, because the paper describing the details of the proposed research had not yet been <a href="https://jamanetwork.com/journals/jama/fullarticle/2768851" target="_blank">published</a>. Now, the project has a name (<strong>COMPILE</strong>), a <a href="https://med.nyu.edu/departments-institutes/population-health/divisions-sections-centers/biostatistics/research/continuous-monitoring-pooled-international-trials-convalescent-plasma-covid19-hospitalized-patients" target="_blank">website</a>, and most importantly, participating studies committed to sharing data.</p>
<p>In preparation for the analyses, we have been developing a statistical plan, which is based on a pooled Bayesian model similar to what I <a href="https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/" target="_blank">described</a>) earlier. The Bayesian approach offers much needed flexibility in this context when we must make a principled decision as quickly as possible. Indeed, now that the <a href="https://www.fda.gov/news-events/press-announcements/fda-issues-emergency-use-authorization-convalescent-plasma-potential-promising-covid-19-treatment" target="_blank">emergency approval</a> has been granted, there is even more urgency. The study’s Data Safety and Monitoring Board will be evaluating the data frequently, which a Bayesian approach accommodates quite well. (I imagine I will have much to write about over the next few months as we try to better understand the implications and challenges of taking this path.)</p>
<p>In this post, I am describing a nitty-gritty issue related to Markov chain Monte Carlo (MCMC) estimation: stability. It may not sound super-exciting, but stable estimation is key to drawing correct inferences from the estimated posterior distribution. As a model becomes more complex, the MCMC estimation in <code>stan</code> can be plagued by degenerate sampling caused by divergent transitions. MCMC works well when the algorithm takes the sampler across the full posterior distribution without getting stuck, but all bets are off when the process breaks down.</p>
<p>Using a slightly simplified version of the data and model we are proposing for COMPILE, I want to show how to see if things have gotten stuck at some point, and then present a possible solution to getting things unstuck. (I highly encourage you to look <a href="http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html" target="_blank">here</a> and <a href="https://cran.r-project.org/web/packages/bayesplot/vignettes/visual-mcmc-diagnostics.html" target="_blank">here</a> for many more details if this is indeed exciting to you.)</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>The studies included in this meta-analysis will be similar in that they are all providing 1- or 2-units of plasma to the patients randomized to therapy. However, there will be differences with respect to the control arm: studies will use a saline solution, non-convalescent plasma, or usual care (in the last case, the study is not blinded). We need to account for the possibility that the treatment effect will vary slightly depending on the type of control. In this case, I am assuming a binary outcome (though in the actual study we are using an ordinal outcome and a proportional odds model). Here is the logistic model:
<span class="math display">\[\text{logodds}(P(y_{ik}=1)) = \alpha_0 + b_k + \delta_k x_{ik}\]</span>
where the observed data are <span class="math inline">\(y_{ik}\)</span>, the indicator for the outcome (say death) of patient <span class="math inline">\(i\)</span> in study <span class="math inline">\(k\)</span>, and <span class="math inline">\(x_{ik}\)</span>, an indicator set to <span class="math inline">\(1\)</span> if the patient is in the control arm, 0 otherwise. The parameters are <span class="math inline">\(\alpha_0\)</span>, the global logodds of death (note here the intercept represents the treatment condition, not the control condition as is usually the case); <span class="math inline">\(b_k\)</span>, the study-specific logodds of death given treatment; and <span class="math inline">\(\delta_k\)</span>, the “control” effect (a logodds ratio) specific to study <span class="math inline">\(k\)</span>.</p>
<p>The really interesting aspect of this model is <span class="math inline">\(\delta_k\)</span>. This effect is a function of three components - the study site, the control-type, and the the overall/global treatment effect. We are assuming that there is a tendency for studies to vary around a control-type effect average. So, with three controls <span class="math inline">\(c\in {1,2,3}\)</span>:
<span class="math display">\[\delta_k \sim N(\delta_c, \eta_0),\]</span></p>
<p>determined by the control-type of study <span class="math inline">\(k\)</span>. Furthermore, we assume that the control-type effects, the <span class="math inline">\(\delta_c\)</span>’s, vary around a global effect <span class="math inline">\(\Delta\)</span>:
<span class="math display">\[\delta_c \sim N(\Delta, \eta).\]</span>
We assume that <span class="math inline">\(\eta\)</span> is quite small; that is, the control effects will likely be very similar. We are not actually interested in <span class="math inline">\(\eta\)</span> so we do not attempt to estimate it. However <span class="math inline">\(\eta_0\)</span>, the variability across studies, is important, so we <em>will</em> estimate that.</p>
</div>
<div id="generating-data" class="section level3">
<h3>Generating data</h3>
<p>To start, here are the <code>R</code> packages I am using to generate the data.</p>
<pre class="r"><code>library(simstudy)
library(data.table)</code></pre>
<p>Next, I define the study level parameters: the study-specific intercept <span class="math inline">\(b_k\)</span> and the study-specific “control” effect <span class="math inline">\(\delta_k\)</span>, which is a function of the control-type. Note I do not specify a study-level “control” effect, there will just be natural variation across studies. The individual-level outcome <span class="math inline">\(y_{ik}\)</span> is defined as a function of study parameters. The overall treatment effect is <span class="math inline">\(\Delta = 0.5\)</span> on the logodds ratio scale.</p>
<pre class="r"><code>def_s <- defDataAdd(varname = "b_k", formula = 0, variance = 0.025)
def_s <- defDataAdd(
def_s, varname = "delta_k",
formula = "(c_type==1) * 0.4 + (c_type==2) * 0.5 + (c_type==3) * 0.6",
dist = "nonrandom"
)
def_i <- defDataAdd(
varname = "y", formula = "-1 + b_k + rx * delta_k",
dist = "binary", link = "logit")</code></pre>
<p>I am generating 7 studies with under each control type, for a total of 21 studies. Each study has 50 patients, 25 in each arm, for a total of 1050 patients.</p>
<pre class="r"><code>dc <- genData(3, id = "c_type")
ds <- genCluster(dc, "c_type", numIndsVar = 7, level1ID = "site")
ds <- addColumns(def_s, ds)
di <- genCluster(ds, "site", 50, "id")
di <- trtAssign(di, 2, strata = "site", grp = "rx")
di <- addColumns(def_i, di)</code></pre>
</div>
<div id="estimation-using-stan" class="section level3">
<h3>Estimation using stan</h3>
<p>I am using <code>rstan</code> directly to sample from the posterior distribution, as opposed to using a more user-friendly package like <code>brms</code> or <code>rstanarm</code>. I’ve actually been warned against taking this approach by folks at stan, because it can be more time consuming and could lead to problems of the sort that I am showing you how to fix. However, I find building the model using stan code very satisfying and illuminating; this process has really given me a much better appreciation of the Bayesian modeling. And besides, this model is odd enough that trying to shoehorn it into a standard <code>brms</code> model might be more trouble than it is worth.</p>
<pre class="r"><code>library(rstan)
library(ggplot)
library(bayesplot)</code></pre>
<p>The stan code, which can reside in its own file, contains a number of <em>blocks</em> that define the model. The <code>data</code> block specifies the data will be provided to the model; this can include summary data as well as raw data. The <code>parameters</code> block is where you specify the parameters of the model that need to be estimated. The <code>transformed parameters</code> block includes another set of parameters that will be estimated, but are a function of parameters defined in the previous block. And in this case, the last block is the <code>model</code> where prior distributions are specified as well as the likelihood (outcome) model. Rather than walk you through the details here, I will let you study a bit and see how this relates to the model I specified above.</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=0> C; // number of control types
int<lower=1> K; // number of studies
int y[N]; // vector of categorical outcomes
int<lower=1,upper=K> kk[N]; // site for individual
int<lower=0,upper=1> ctrl[N]; // treatment or control
int<lower=1,upper=C> cc[K]; // specific control for site
}
parameters {
real alpha; // overall intercept for treatment
vector[K] beta_k; // site specific intercept
real<lower=0> sigma_b; // sd of site intercepts
vector[K] delta_k; // site specific treatment effect
real<lower=0> eta_0; // sd of delta_k (around delta_c)
vector[C] delta_c; // control-specific effect
real Delta; // overall control effect
}
transformed parameters{
vector[N] yhat;
for (i in 1:N)
yhat[i] = alpha + beta_k[kk[i]] + (ctrl[i] * (delta_k[kk[i]]));
}
model {
// priors
alpha ~ student_t(3, 0, 2.5);
beta_k ~ normal(0, sigma_b);
sigma_b ~ cauchy(0, 1);
eta_0 ~ cauchy(0, 1);
for (k in 1:K)
delta_k[k] ~ normal(delta_c[cc[k]], eta_0);
delta_c ~ normal(Delta, 0.5);
Delta ~ normal(0, 10);
// likelihood/outcome model
y ~ bernoulli_logit(yhat);
}</code></pre>
<p>We need to compile the stan code so that it can be called from the <code>R</code> script:</p>
<pre class="r"><code>rt_c <- stanc("binary_outcome.stan")
sm_c <- stan_model(stanc_ret = rt_c, verbose=FALSE)</code></pre>
<p>And here is the <code>R</code> code that prepares the data for the stan program and then samples from the posterior distribution. In this case, I will be using 4 different Monte Carlo chains of 2500 draws each (after allowing for 500 warm-up draws), so we will have a actual sample size of 10,000.</p>
<pre class="r"><code>N <- nrow(di) ;
C <- di[, length(unique(c_type))]
K <- di[, length(unique(site))]
y <- as.numeric(di$y)
kk <- di$site
ctrl <- di$rx
cc <- di[, .N, keyby = .(site, c_type)]$c_type
sampdat <- list(N=N, C=C, K=K, y=y, kk=kk, ctrl=ctrl, cc=cc)
fit_c <- sampling(
sm_c, data = sampdat, iter = 3000, warmup = 500,
show_messages = FALSE, cores = 4, refresh = 0,
control = list(adapt_delta = 0.8)
)</code></pre>
</div>
<div id="inspecting-the-posterior-distribution" class="section level3">
<h3>Inspecting the posterior distribution</h3>
<p>Assuming that everything has run smoothly, the first thing to do is to make sure that the MCMC algorithm adequately explored the full posterior distribution in the sampling process. We are typically interested in understanding the properties of the distribution, like the mean or median, or 95% credible intervals. To have confidence that these properties reflect the true posterior probabilities, we need to be sure that the sample they are drawn from is a truly representative one.</p>
<p>A quick and effective way to assess the “representativeness” of the sample is to take a look at the trace plot for a particular parameter, which literally tracks the path of the MCMC algorithm through the posterior distribution. Below, I’ve included two trace plots, one for <span class="math inline">\(\Delta\)</span>, the overall effect, and the other for <span class="math inline">\(\eta_0\)</span>, the variability of a study’s control effect <span class="math inline">\(\delta_k\)</span> around <span class="math inline">\(\delta_c\)</span>. On the left, the plots appear as they should, with lines jumping up and down. However, these particular plots include red indicators where the algorithm got stuck, where there were <em>divergent transitions</em>. We really don’t want to see any of this indicators, because that is a sign that our sample is not representative of the posterior distribution.</p>
<pre class="r"><code>posterior_c <- as.array(fit_c)
lp_c <- log_posterior(fit_c)
np_c <- nuts_params(fit_c)
color_scheme_set("mix-brightblue-gray")
mcmc_trace(posterior_c, pars = "Delta", np = np_c) +
xlab("Post-warmup iteration")
mcmc_trace(
posterior_c, pars = "Delta", np = np_c, window = c(1500, 1700)
) +
xlab("Post-warmup iteration")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/trace_c.png" /></p>
<p>On the right in the figure above, I’ve zoomed in on steps 700 to 900 to see if we can see any patterns. And sure enough, we can. In the <span class="math inline">\(\Delta\)</span> plot, straight lines appear in the middle, evidence that the sampling for some of the chains did indeed get stuck. Likewise, the plot for <span class="math inline">\(\eta_0\)</span> shows flat lines near <span class="math inline">\(0\)</span>.</p>
<p>There’s an additional plot that shows the same thing but in a slightly more dramatic and comprehensive way. This plot (shown below) has a line for each step connecting the parameter estimates of that step. The red lines represent divergent transitions. The important thing to note here is that in all cases with divergent transitions, <span class="math inline">\(\eta_0\)</span> found itself close to <span class="math inline">\(0\)</span>. In other words, the sampling was getting stuck at this point, and this is the likely culprit for the sampling issues.</p>
<pre class="r"><code>color_scheme_set("darkgray")
parcoord_c <-mcmc_parcoord(
posterior_c, np = np_c,
pars = c("eta_0", "sigma_b", "alpha",
"delta_c[1]", "delta_c[2]","delta_c[3]", "Delta")
)
parcoord_c +
scale_x_discrete(expand = c(0.01, 0.01)) +
theme(panel.background = element_rect(fill = "grey90")) +
ylim(-3, 3) +
ggtitle("Original model specification")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/parc_c.png" id="id" class="class" style="width:75.0%;height:75.0%" /></p>
</div>
<div id="a-remedy-for-divergent-transitions" class="section level3">
<h3>A remedy for divergent transitions</h3>
<p>In a moment, I will provide a brief illustration of the perils of divergent transitions, but before that I want to describe “non-centered parameterization,” an approach that can be used to mitigate divergence. The idea is that since sampling from a standard Gaussian or normal distribution is less likely to lead to problematic transitions, we should try to do this as much as possible. “In a non-centered parameterization we do not try to fit the group-level parameters directly, rather we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation.” (See <a href="https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html" target="_blank">here</a> for the source of this quote and much more.)</p>
<p>For example, in the original model specification, we parameterized <span class="math inline">\(\delta_k\)</span> and <span class="math inline">\(\delta_c\)</span> as</p>
<p><span class="math display">\[
\delta_k \sim N(\delta_c, \eta_0) \\
\delta_c \sim N(\Delta, 0.5)
\]</span></p>
<p><span class="math display">\[
\eta_0 \sim Cauchy(0, 1) \\
\Delta \sim N(0, 10),
\]</span></p>
<p>whereas using “non-centered” parameterization, we would incorporate two latent standard normal variables <span class="math inline">\(\theta_{rx}\)</span> and <span class="math inline">\(\theta_c\)</span> into the model. <span class="math inline">\(\delta_k\)</span> and <span class="math inline">\(\delta_c\)</span> have the same prior distribution as the original model, but we are now sampling from standard normal prior distribution:
<span class="math display">\[
\delta_k = \delta_c + \eta_0 \theta_{rx} \\
\theta_{rx} \sim N(0, 1)
\]</span></p>
<p><span class="math display">\[
\delta_c = \Delta + 0.5\theta_c\\
\theta_c \sim N(0, 1)
\]</span></p>
<p><span class="math display">\[
\eta_0 \sim Cauchy(0, 1) \\
\Delta \sim N(0, 10).
\]</span></p>
<p>This transformation makes the path through the posterior distribution much smoother and, at least in this case, eliminates the divergent transitions.</p>
<p>Here is the stan code using non-centered parameterization (again, feel free to linger and study):</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=0> C; // number of control types
int<lower=1> K; // number of studies
int y[N]; // vector of categorical outcomes
int<lower=1,upper=K> kk[N]; // site for individual
int<lower=0,upper=1> ctrl[N]; // treatment or control
int<lower=1,upper=C> cc[K]; // specific control for site
}
parameters {
real alpha; // overall intercept for treatment
real<lower=0> sigma_b;
real<lower=0> eta_0; // sd of delta_k (around delta)
real Delta; // overall control effect
// non-centered parameterization
vector[K] z_ran_rx; // site level random effects (by period)
vector[K] z_ran_int; // individual level random effects
vector[C] z_ran_c; // individual level random effects
}
transformed parameters{
vector[N] yhat;
vector[K] beta_k;
vector[K] delta_k; // site specific treatment effect
vector[C] delta_c;
beta_k = sigma_b * z_ran_int + alpha;
for (i in 1:C)
delta_c[i] = 0.5 * z_ran_c[i] + Delta;
for (i in 1:K)
delta_k[i] = eta_0 * z_ran_rx[i] + delta_c[cc[i]];
for (i in 1:N)
yhat[i] = beta_k[kk[i]] + ctrl[i] * delta_k[kk[i]];
}
model {
// priors
alpha ~ student_t(3, 0, 2.5);
z_ran_c ~ std_normal();
z_ran_int ~ std_normal();
z_ran_rx ~ std_normal();
sigma_b ~ cauchy(0, 1);
eta_0 ~ cauchy(0, 1);
Delta ~ normal(0, 10);
// outcome model
y ~ bernoulli_logit(yhat);
}</code></pre>
<p>Looking at the trace plots from the non-centered model makes it clear that divergent transitions are no longer a problem. There are no red indicators, and the patterns of straight lines have been eliminated:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/trace_nc.png" /></p>
</div>
<div id="proceed-with-caution-if-you-ignore-the-divergence-warnings" class="section level3">
<h3>Proceed with caution if you ignore the divergence warnings</h3>
<p>If we don’t heed the warnings, how bad can things be? Well, it will probably depend on the situation, but after exploring with multiple data sets, I have convinced myself that it is probably a good idea to reduce the number of divergent transitions as close to 0 as possible.</p>
<p>I conducted an experiment by generating 100 data sets and fitting a model using both the original and non-centered parameterizations. I collected the posterior distribution for each data set and estimation method, and plotted the density curves. (In case you are interested, I used a parallel process running on a high-performance computing core to do this; running on my laptop, this would have taken about 5 hours, but on the HPC it ran in under 15 minutes.) The purpose of this was to explore the shapes of the densities across the different data sets. I know it is a little odd to use this frequentist notion of repeatedly sampling datasets to evaluate the performance of these two approaches, but I find it to be illuminating. (If you’re interested in the code for that, let me know.)</p>
<p>Below on the right, the plot of the posteriors from the non-centered parameterization shows variability in location, but is otherwise remarkably consistent in shape and scale. On the left, posterior densities show much more variation; some are quite peaked and others are even bi-modal. (While I am not showing this here, the densities from samples with more divergent transitions tend to diverge the most from the well-behaved densities on the right.)</p>
<p>Although the mean or median estimates from a divergent sample may not be too far off from its non-divergent counterpart, the more general description of the distribution may be quite far off the mark, making it likely that inferences too will be off the mark.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/post_plot.png" /></p>
</div>
Generating data from a truncated distribution
https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/
Tue, 18 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>A researcher reached out to me the other day to see if the <code>simstudy</code> package provides a quick and easy way to generate data from a truncated distribution. Other than the <code>noZeroPoisson</code> distribution option (which is a <em>very</em> specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy. I’ve actually had it in the back of my mind to add something like this to <code>simstudy</code>, but have hesitated because it might mean changing (or at least adding to) the <code>defData</code> table structure.</p>
<p>However, it may be time to go for it. The process and coding are actually relatively straightforward, so there is no real reason not to. I was developing a simple prototype for several probability distributions (though the concept can easily be applied to any distribution where the cumulative distribution function, or CDF, is readily accessible), and am sharing here in case you need to do this before it is available in the package, or if you just want to implement yourself.</p>
<div id="what-is-a-truncated-distribution" class="section level3">
<h3>What is a truncated distribution?</h3>
<p>A truncated probability distribution is one derived from limiting the domain of an existing distribution. A picture is worth a thousand words. On the left, we have a histogram for 10,000 observations drawn from a full (non-truncated) <strong>Gaussian</strong> or <strong>normal</strong> distribution with mean 0 and standard deviation 3. In the middle, the histogram represents data drawn from the positive portion of the same distribution (i.e. is truncated at the left by 0). And on the far right, the truncation is defined by the boundaries <span class="math inline">\((-3, 3.5)\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-2-1.png" width="768" /></p>
</div>
<div id="leveraging-the-uniform-distribution-and-a-cdf" class="section level3">
<h3>Leveraging the uniform distribution and a CDF</h3>
<p>A while back, I <a href="https://www.rdatagen.net/post/correlated-data-copula/" target="blank">described</a> a <em>copula</em> approach to generating correlated data from different distributions (ultimately implemented in functions <code>genCorGen</code> and <code>addCorGen</code>). I wrote about combining a draw from a uniform distribution with the CDF of any target distribution to facilitate random number generation from the target generation. This is an approach that works well for truncated distributions also, where the truncated distribution is the target.</p>
<p>Again - visuals help to explain how this works. To start, here are several CDFs of normal distributions with different means and variances:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>The CDF of a distribution (usually written as <span class="math inline">\(F(x)\)</span>) effectively defines that distribution: <span class="math inline">\(F(x) = P(X \le x)\)</span>. Since probabilities by definition range from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>, we know that <span class="math inline">\(F(x)\)</span> also ranges from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>. It is also the case, that <span class="math inline">\(F(x)\)</span> is monotonically increasing (or at least non-decreasing) from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>.</p>
<p>Let’s say we want to generate a draw from <span class="math inline">\(N(\mu =0, \sigma = 3)\)</span> using the the CDF. We can first generate a draw from <span class="math inline">\(u = Uniform(0,1)\)</span>. We then treat <span class="math inline">\(u\)</span> as a value of the CDF, and map it back <span class="math inline">\(x\)</span> to get our draw from the target distribution. So, <span class="math inline">\(x = F^{-1}(u)\)</span>. In <code>R</code>, the CDF for the normal distribution can be determined using the <code>qnorm</code> function, where the first argument is a probability value between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>. This would be the <code>R</code> code to generate a single draw from <span class="math inline">\(N(0, 3)\)</span> using a random draw from <span class="math inline">\(Uniform(0, 1)\)</span>:</p>
<pre class="r"><code>(u <- runif(1))</code></pre>
<pre><code>## [1] 0.9</code></pre>
<pre class="r"><code>qnorm(u, mean = 0, sd = 3)</code></pre>
<pre><code>## [1] 3.9</code></pre>
<p>This is how <span class="math inline">\(u = 0.9\)</span> relates to the draw of <span class="math inline">\(x=3.9\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>To generate a random sample of 10,000 draws from <span class="math inline">\(N(0, 3)\)</span>, this process is replicated 10,000 times:</p>
<pre class="r"><code>library(ggplot2)
u <- runif(10000)
x <- qnorm(u, mean = 0, sd = 3)
ggplot(data = data.frame(x), aes(x = x)) +
geom_histogram(fill = "#CCC591", alpha = 1, binwidth = .2, boundary = 0) +
theme(panel.grid = element_blank(),
axis.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="extending-the-inverse-process-to-generate-truncation" class="section level3">
<h3>Extending the inverse process to generate truncation</h3>
<p>Let’s say we are only interested in generating data from the middle portion of the <span class="math inline">\(N(0,3)\)</span> distribution, between <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. The trick is to use the corresponding CDF values, <span class="math inline">\(F(a)\)</span> and <span class="math inline">\(F(b)\)</span> as the basis of the randomization.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>To generate data within the constraints <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>, all we would need to do is generate a value from the uniform distribution with minimum equal to <span class="math inline">\(F(a)\)</span> and maximum <span class="math inline">\(F(b)\)</span>. We then conduct the mapping as we did before when drawing from the full distribution. By constraining <span class="math inline">\(u\)</span> to be between <span class="math inline">\(F(a)\)</span> and <span class="math inline">\(F(b)\)</span>, we force the values of the target distribution to lie between <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>.</p>
<p>Now, we are ready to create a simple function <code>rnormt</code> that implements this: The <code>pnorm</code> function provides the CDF at a particular value:</p>
<pre class="r"><code>rnormt <- function(n, range, mu, s = 1) {
# range is a vector of two values
F.a <- pnorm(min(range), mean = mu, sd = s)
F.b <- pnorm(max(range), mean = mu, sd = s)
u <- runif(n, min = F.a, max = F.b)
qnorm(u, mean = mu, sd = s)
}</code></pre>
<p>Here, I am generating the data plotted above, showing the code this time around.</p>
<pre class="r"><code>library(data.table)
library(simstudy)
library(paletteer)
defC <- defCondition(condition= "tt == 1",
formula = "rnormt(10000, c(-Inf, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 2",
formula = "rnormt(10000, c(0, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 3",
formula = "rnormt(10000, c(-3, 3.5), mu = 0, s = 3)")
dd <- genData(30000)
dd <- trtAssign(dd, nTrt = 3, grpName = "tt")
dd <- addCondition(defC, dd, "x")
dd[, tt := factor(tt,
labels = c("No truncation", "Left truncation at 0", "Left and right truncation"))]
ggplot(data = dd, aes(x = x, group = tt)) +
geom_histogram(aes(fill = tt), alpha = 1, binwidth = .2, boundary = 0) +
facet_grid(~tt) +
theme(panel.grid = element_blank(),
axis.title = element_blank(),
legend.position = "none") +
scale_fill_paletteer_d("wesanderson::Moonrise2")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-9-1.png" width="768" /></p>
</div>
<div id="going-beyond-the-normal-distribution" class="section level3">
<h3>Going beyond the normal distribution</h3>
<p>With this simple approach, it is possible to generate a truncated distribution using any distribution available in <code>R</code>. Here is another example that allows us to generate truncated data from a <strong>gamma</strong> distribution:</p>
<pre class="r"><code>rgammat <- function(n, range, shape, scale = 1) {
F.a <- pgamma(min(range), shape = shape, scale = scale)
F.b <- pgamma(max(range), shape = shape, scale = scale)
u <- runif(n, min = F.a, max = F.b)
qgamma(u, shape = shape, scale = scale)
}</code></pre>
<p>To conclude, here is a plot of gamma-based distributions using <code>rgammat</code>. And I’ve added similar plots for <strong>beta</strong> and <strong>Poisson</strong> distributions - I’ll leave it to you to write the functions. But, if you don’t want to do that, <code>simstudy</code> will be updated at some point soon to help you out.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-11-1.png" width="768" /></p>
</div>
A hurdle model for COVID-19 infections in nursing homes
https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/
Tue, 04 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Late last <a href="https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/" target="blank">year</a>, I added a <em>mixture</em> distribution to the <code>simstudy</code> package, largely motivated to accommodate <em>zero-inflated</em> Poisson or negative binomial distributions. (I really thought I had added this two years ago - but time is moving so slowly these days.) These distributions are useful when modeling count data, but we anticipate observing more than the expected frequency of zeros that would arise from a non-inflated (i.e. “regular”) Poisson or negative binomial distribution.</p>
<p>There is a related, though subtly different model - the <em>hurdle</em> model - that I want to talk about here, because it has come up in the context of planning a new study to address COVID-19 infection rates in nursing homes using much more aggressive testing strategies.</p>
<p>A hurdle model is a modified count model that also assumes a relatively high frequency of zeros, but is set up as a two-stage data generation process rather than as a mixture distribution. In the first stage, the development of any (i.e. at least 1) new infections during the observation period is described by binomial probability distribution. If there is at least one infection, the process has transitioned over the “hurdle” into the second stage; here, the number of new infections is described by a non-zero count distribution, such as a truncated Poisson or truncated negative binomial distribution.</p>
<div id="model-specification" class="section level3">
<h3>Model specification</h3>
<p>If <span class="math inline">\(N_i\)</span> is the number of new infections in nursing home <span class="math inline">\(i\)</span> over the observation period and we have two intervention arms <span class="math inline">\(T_i \in \{0,1\}\)</span>, the intervention effect at each stage of the process can be modeled simply as:</p>
<p><span class="math display">\[
\text{logodds}\left[ P(N_i > 0) \right] = \beta_0 + \beta_1 T_i + \boldsymbol{X_i \beta_2}\]</span></p>
<p><span class="math display">\[\text{log}(N_i | N_i \ge 1) = \alpha_0 + \alpha_1 T_i + \boldsymbol{X_i \alpha_2} + \text{log}(D_i)\]</span></p>
<p>The intervention effect for the binomial stage is <span class="math inline">\(\beta_1\)</span> (on the logodds scale) and the intervention effect for the hurdle (count) stage is <span class="math inline">\(\alpha_1\)</span> (on the log scale). <span class="math inline">\(\boldsymbol{X_i}\)</span> are any covariates that are used for stratified randomization.</p>
<p><span class="math inline">\(D_i\)</span> is the number of resident-days observed during the follow-up period, and <span class="math inline">\(\text{log}(D_i)\)</span> is the “offset”; we are effectively modeling a rate of infections <span class="math inline">\(\text{log}(N_i/D_i)\)</span>. This will take into account the fact that residents will be observed for different lengths of time - some moving into the nursing home after the study has started, and others leaving or dying before the study is complete.</p>
</div>
<div id="simulating-a-hurdle-model" class="section level3">
<h3>Simulating a hurdle model</h3>
<p>Simulating data from this model is relatively straightforward, complicated only by the need to generate varying observation periods. Essentially, we must generate two outcomes - a binary outcome and a non-zero count outcome (in this case it will be from a non-zero Poisson distribution), and the observed outcome is 0 if the binary outcome is actually 0, and the value of the count outcome if the binary outcome is 1.</p>
<p>To get things going, here are the packages I will use. The <code>pscl</code> package provides a function <code>hurdle</code> to estimate the model parameters from our simulated data, and <code>stargazer</code> package outputs the model in a nice, readable format.</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(pscl)
library(stargazer)</code></pre>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>In this simulation the average observation time is 80 days (out of 90 maximum), and on average, each nursing home will have 100 residents. In the control arm, 95% of the nursing homes will have at least one infection, and 80% of the intervention arm will have at least one. The corresponding odds ratio is <span class="math inline">\((0.80/0.20)/(.95/0.05) = 0.21\)</span>.</p>
<p>The infection rate per 1000 resident-days for the control arm will be <span class="math inline">\(\sim (20/8000)*1000 = 2.5\)</span>; for the intervention arm, the rate will be <span class="math inline">\(\sim (20/8000) *0.8 * 1000 = 2.0\)</span>.</p>
<p>Here is the data definition table <code>defHurdle</code> created by the function <code>defDataAdd</code> that encodes these assumptions:</p>
<pre><code>## varname formula variance dist link
## 1: nRes 100 0 poisson identity
## 2: aDays 80 0 poisson identity
## 3: nDays pmin(90, aDays) 0 nonrandom identity
## 4: pDays nRes * nDays 0 nonrandom identity
## 5: xBin 0.95 - 0.15 * rx 0 binary identity
## 6: xCnt log(20/8000)+log(0.8)*rx+log(pDays) 0 noZeroPoisson log
## 7: y xBin * xCnt 0 nonrandom identity</code></pre>
<p>The data generation is only at the nursing home level. In this example, we are assuming 500 nursing homes:</p>
<pre class="r"><code>set.seed(29211)
dx <- genData(500)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(defHurdle, dx)
dx</code></pre>
<pre><code>## id rx nRes aDays nDays pDays xBin xCnt y
## 1: 1 1 113 86 86 9718 1 16 16
## 2: 2 0 89 66 66 5874 1 16 16
## 3: 3 1 83 82 82 6806 1 13 13
## 4: 4 0 91 95 90 8190 1 27 27
## 5: 5 1 97 70 70 6790 0 17 0
## ---
## 496: 496 1 116 85 85 9860 0 17 0
## 497: 497 1 89 94 90 8010 1 14 14
## 498: 498 0 112 92 90 10080 1 20 20
## 499: 499 1 97 71 71 6887 1 21 21
## 500: 500 0 92 68 68 6256 1 13 13</code></pre>
</div>
<div id="data-visualization" class="section level4">
<h4>Data visualization</h4>
<p>A plot of the data shows the effect at each stage of the hurdle process:</p>
<pre class="r"><code>dx[, rate1000 := (y/pDays)*1000]
dx[ , rx := factor(rx, labels = c("No intervention", "Intervention"))]
dm <- dx[rate1000 != 0, .(mu = mean(rate1000)), keyby = rx]
ggplot(data = dx, aes(x = rate1000)) +
geom_vline(aes(xintercept = mu), data = dm, lty = 3, color = "grey50") +
geom_histogram(binwidth = .1,
aes(y = (..count..)/sum(..count..), fill = rx)) +
facet_grid(rx ~ .) +
theme(panel.grid = element_blank(),
legend.position = "none") +
scale_y_continuous(labels = scales::percent,
name = "proportion",
expand = c(0, 0),
breaks = c(c(.05, .10)),
limits = c(0, .11)) +
scale_x_continuous(name = "# infections per 1000 resident-days") +
scale_fill_manual(values = c("#EDCB64", "#B62A3D")) </code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-04-a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations.en_files/figure-html/unnamed-chunk-5-1.png" width="480" /></p>
</div>
<div id="parameter-estimation" class="section level4">
<h4>Parameter estimation</h4>
<p>I fit two models here. The first includes a possible intervention effect, and the second assumes no intervention effect. The purpose in fitting the second model is to provide a basis of comparison.</p>
<pre class="r"><code>hfit1 <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx)
hfit1.0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)</code></pre>
<p>The hurdle model returns two sets of estimates. The first component of the model shown here is binomial model. The estimated intervention effect (odds ratio) is <span class="math inline">\(exp(-1.570) = 0.21\)</span>, as expected. Note that the log-likelihood reported here is for the composite hurdle model (both stages).</p>
<pre class="r"><code>stargazer(hfit1, hfit1.0, type = "text", zero.component = TRUE,
notes =" ", notes.append = FALSE, notes.label="",
dep.var.labels.include = FALSE, dep.var.caption = "",
omit.stat = "n", object.names = TRUE, model.numbers = FALSE)</code></pre>
<pre><code>##
## ====================================
## hfit1 hfit1.0
## ------------------------------------
## rxIntervention -1.570***
## (0.325)
##
## Constant 2.900*** 1.900***
## (0.285) (0.133)
##
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
## </code></pre>
<p>The second component is the count model. The estimated intervention effect is <span class="math inline">\(exp(-0.279) = 0.76\)</span>, which is close to the true value of <span class="math inline">\(0.80\)</span>. (The reported log-likelihoods are the same as in the binomial model.)</p>
<pre class="r"><code>stargazer(hfit1, hfit1.0, type = "text", zero.component = FALSE,
notes =" ", notes.append = FALSE, notes.label="",
dep.var.labels.include = FALSE, dep.var.caption = "",
omit.stat = "n", object.names = TRUE, model.numbers = FALSE)</code></pre>
<pre><code>##
## ====================================
## hfit1 hfit1.0
## ------------------------------------
## rxIntervention -0.279***
## (0.023)
##
## Constant -5.980*** -6.090***
## (0.014) (0.011)
##
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
## </code></pre>
<p>In this particular case, the intervention alters both the binomial probability and the county distribution, but that will not necessarily always be the case. A log-likelihood ratio test (LRT) is a global test that compares the model that explicitly excludes an intervention effect (<code>hfit1.0</code>) with the model that includes an intervention effect. If the likelihoods under each are close enough, then the model that excludes the intervention effect is considered sufficient, and there is no reason to conclude that the intervention is effective. We can use the p-value based on the LRT as a measure of whether or not the intervention is generally effective, either because it changes the binomial probability, the count distribution, or both.</p>
<p>In this case, the p-value is quite low:</p>
<pre class="r"><code>lrt1 <- -2*(logLik(hfit1.0) - logLik(hfit1))
1 - pchisq(lrt1, 2)</code></pre>
<pre><code>## 'log Lik.' 0 (df=2)</code></pre>
</div>
</div>
<div id="alternative-scenarios" class="section level3">
<h3>Alternative scenarios</h3>
<p>Here are three additional scenarios that provide examples of ways the intervention can affect the outcome. In Scenario 2, the intervention no longer has an effect on the probability of having at least one infection, but still has an effect on the count. In Scenario 3, the intervention <em>only</em> effects the probability of having at least one infection, and not the count distribution. And in Scenario 4, the intervention has no effect at all at either stage.</p>
<pre class="r"><code>defHurdle.V2 <- updateDef(defHurdle, "xBin", "0.95")
defHurdle.V3 <- updateDef(defHurdle, "xCnt", "log(20/8000) + log(pDays)")
defHurdle.V4 <- updateDef(defHurdle.V3, "xBin", "0.95")</code></pre>
<p>The plots bear out the underlying parameters. We can see the probability of a zero is the same across treatment arms in Scenario 2, just as the distributions of the count variable in Scenario 3 appear equivalent. In Scenario 4, it is hard to distinguish between the two distributions across interventions.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-04-a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations.en_files/figure-html/unnamed-chunk-14-1.png" width="960" /></p>
<p>Here are the model fits - the results are consistent with the plots:</p>
<pre><code>##
## ================================================================================
## hfit2 hfit2.0 hfit3 hfit3.0 hfit4 hfit4.0
## --------------------------------------------------------------------------------
## rxIntervention -0.099 -0.871*** 0.058
## (0.446) (0.287) (0.342)
##
## Constant 3.180*** 3.130*** 2.440*** 1.940*** 2.500*** 2.530***
## (0.323) (0.223) (0.233) (0.135) (0.239) (0.171)
##
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
## </code></pre>
<pre><code>##
## ================================================================================
## hfit2 hfit2.0 hfit3 hfit3.0 hfit4 hfit4.0
## --------------------------------------------------------------------------------
## rxIntervention -0.210*** 0.010 -0.029
## (0.022) (0.022) (0.021)
##
## Constant -6.010*** -6.110*** -6.000*** -6.000*** -5.980*** -6.000***
## (0.015) (0.011) (0.015) (0.011) (0.015) (0.010)
##
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
## </code></pre>
<p>And finally, the p-values from the LRTs of the models under each of the three scenarios are consistent with the underlying data generating processes. It is only in the last scenario where there is no reason to believe that the intervention has some sort of effect.</p>
<pre class="r"><code>round(c(lrt2 = 1 - pchisq(lrt2, 2),
lrt3=1 - pchisq(lrt3, 2),
lrt4=1 - pchisq(lrt4, 2)), 4)</code></pre>
<pre><code>## lrt2 lrt3 lrt4
## 0.0000 0.0067 0.3839</code></pre>
<p>
<small><font color="darkkhaki">
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/" target="blank">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
<div id="addendum---estimating-power" class="section level3">
<h3>Addendum - estimating power</h3>
<p>If you’ve visited my blog before, you might have <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/" target="blank">picked up</a> on the fact that I like to use simulation to estimate sample size or power when planning a randomized trial. This allows me to be sure everyone understands the assumptions.</p>
<p>To estimate power, I generate multiple data sets under a specific set of assumptions and estimate intervention effects for each data set. The power of the study under this set of assumptions is the proportion of times we would conclude that the intervention is effective. In the context of a hurdle model, I use the p-value from the LRT as the arbiter of effectiveness; the proportion of p-values less than 0.05 is the power.</p>
<pre class="r"><code>gData <- function(n, def) {
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(defHurdle, dx)
dx[]
}
estModel <- function(dx) {
hfit <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx, )
hfit0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)
lrt <- -2*(logLik(hfit0) - logLik(hfit))
data.table(p.zero = coef(summary(hfit))$zero["rx", "Pr(>|z|)"],
p.count = coef(summary(hfit))$count["rx", "Pr(>|z|)"],
X2 = 1 - pchisq(lrt, 2))
}
iter <- function(n, defHurdle, i) {
dx <- gData(n, def)
hfit <- estModel(dx)
return(data.table(i = i, hfit))
}
diter <- rbindlist(lapply(1:1000, function(i) iter(50, defHurdle, i)))</code></pre>
<p>Here are the results from the individual replications Scenario 1 effect assumptions and 50 nursing homes:</p>
<pre class="r"><code>diter</code></pre>
<pre><code>## i p.zero p.count X2
## 1: 1 0.9975 4.06e-04 0.000437
## 2: 2 0.0449 1.05e-03 0.000216
## 3: 3 0.0713 5.92e-03 0.002246
## 4: 4 0.0449 5.85e-04 0.000128
## 5: 5 0.1891 3.20e-02 0.034025
## ---
## 996: 996 0.3198 7.04e-03 0.014600
## 997: 997 0.1891 1.13e-02 0.013579
## 998: 998 0.3198 8.16e-04 0.001973
## 999: 999 1.0000 4.45e-06 0.000023
## 1000: 1000 0.5590 2.34e-03 0.007866</code></pre>
<p>And here is the estimate of power - in this case there is about 90% power that we will conclude that there is an effect of some type given the assumptions under Scenario 1:</p>
<pre class="r"><code>diter[, mean(X2 <= 0.05)]</code></pre>
<pre><code>## [1] 0.898</code></pre>
<p>In conclusion, here is a power plot for a range of effect size assumptions, sample size assumptions, and control arm assumptions. In all of these cases, I assumed that the binomial probability under the control condition would be 70%, (If anyone wants to see the code for generating all of this data and the plot, I can post on github. However, it is really just an extension of what is shown here.)</p>
<p><img src="https://www.rdatagen.net/img/post-hurdle/power.png" id="id" class="class" style="width:95.0%;height:95.0%" /></p>
</div>
A Bayesian model for a simulated meta-analysis
https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/
Tue, 21 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>This is essentially an addendum to the previous <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="blank">post</a> where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the <code>nlme</code> package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package <code>rstan</code>.</p>
<div id="create-the-data-set" class="section level3">
<h3>Create the data set</h3>
<p>We’ll use the exact same data generating process as <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="blank">described</a> in some detail in the previous post.</p>
<pre class="r"><code>library(simstudy)
library(rstan)
library(data.table)</code></pre>
<pre class="r"><code>defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist= "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study",
formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
dist = "poisson")
defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)
ds <- genData(12, defS)
dc <- genCluster(ds, "study", "n.study", "id", )
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)
d.obs <- dc[, .(study, id, x, y)]</code></pre>
</div>
<div id="build-the-stan-model" class="section level3">
<h3>Build the Stan model</h3>
<p>There are multiple ways to estimate a <code>Stan</code> model in <code>R</code>, but I choose to build the Stan code directly rather than using the <code>brms</code> or <code>rstanarm</code> packages. In the Stan code, we need to define the data structure, specify the parameters, specify any transformed parameters (which are just a function of the parameters), and then build the model - which includes laying out the prior distributions as well as the likelihood.</p>
<p>In this case, the model is slightly different from what was presented in the context of a mixed effects model. This is the mixed effects model:</p>
<p><span class="math display">\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\
\\
\delta_k = \delta_0 + v_k \\
e_{ik} \sim N(0, \sigma_k^2), v_k \sim N(0,\tau^2)
\]</span>
In this Bayesian model, things are pretty much the same:
<span class="math display">\[ y_{ik} \sim N(\alpha_k + \delta_k x_{ik}, \sigma_k^2) \\
\\
\delta_k \sim N(\Delta, \tau^2)
\]</span></p>
<p>The key difference is that there are prior distributions on <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\tau\)</span>, introducing an additional level of uncertainty into the estimate. I would expect that the estimate of the overall treatment effect <span class="math inline">\(\Delta\)</span> will have a wider 95% CI (credible interval in this context) than the 95% CI (confidence interval) for <span class="math inline">\(\delta_0\)</span> in the mixed effects model. This added measure of uncertainty is a strength of the Bayesian approach.</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=1> K; // number of studies
real y[N]; // vector of continuous outcomes
int<lower=1,upper=K> kk[N]; // study for individual
int<lower=0,upper=1> x[N]; // treatment arm for individual
}
parameters {
vector[K] beta; // study-specific intercept
vector[K] delta; // study effects
real<lower=0> sigma[K]; // sd of outcome dist - study specific
real Delta; // average treatment effect
real <lower=0> tau; // variation of treatment effects
}
transformed parameters{
vector[N] yhat;
for (i in 1:N)
yhat[i] = beta[kk[i]] + x[i] * delta[kk[i]];
}
model {
// priors
sigma ~ normal(0, 2.5);
beta ~ normal(0, 10);
tau ~ normal(0, 2.5);
Delta ~ normal(0, 10);
delta ~ normal(Delta, tau);
// outcome model
for (i in 1:N)
y[i] ~ normal(yhat[i], sigma[kk[i]]);
}</code></pre>
</div>
<div id="generate-the-posterior-distributions" class="section level3">
<h3>Generate the posterior distributions</h3>
<p>With the model in place, we transform the data into a <code>list</code> so that Stan can make sense of it:</p>
<pre class="r"><code>N <- nrow(d.obs) ## number of observations
K <- dc[, length(unique(study))] ## number of studies
y <- d.obs$y ## vector of continuous outcomes
kk <- d.obs$study ## study for individual
x <- d.obs$x ## treatment arm for individual
ddata <- list(N = N, K = K, y = y, kk = kk, x = x)</code></pre>
<p>And then we compile the Stan code:</p>
<pre class="r"><code>rt <- stanc("model.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)</code></pre>
<p>Finally, we can sample data from the posterior distribution:</p>
<pre class="r"><code>fit <- sampling(sm, data=ddata, seed = 3327, iter = 10000, warmup = 2500,
control=list(adapt_delta=0.9))</code></pre>
</div>
<div id="check-the-diagonstic-plots" class="section level3">
<h3>Check the diagonstic plots</h3>
<p>Before looking at any of the output, it is imperative to convince ourselves that the MCMC process was a stable one. The <em>trace</em> plot is the most basic way to assess this. Here, I am only showing these plots for <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\tau\)</span>, but the plots for the other parameters looked similar, which is to say everything looks good:</p>
<pre class="r"><code>pname <- c("Delta", "tau")
stan_trace(object = fit, pars = pname)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
</div>
<div id="look-at-the-results" class="section level3">
<h3>Look at the results</h3>
<p>It is possible to look inspect the distribution of any or all parameters. In this case, I am particularly interested in the treatment effects at the study level, and overall. That is, the focus here is on <span class="math inline">\(\Delta\)</span>, <span class="math inline">\(\delta_k\)</span>, and <span class="math inline">\(\tau\)</span>.</p>
<pre class="r"><code>pname <- c("delta", "Delta","tau")
print(fit, pars=pname, probs = c(0.05, 0.5, 0.95))</code></pre>
<pre><code>## Inference for Stan model: model.
## 4 chains, each with iter=10000; warmup=2500; thin=1;
## post-warmup draws per chain=7500, total post-warmup draws=30000.
##
## mean se_mean sd 5% 50% 95% n_eff Rhat
## delta[1] 6.39 0.01 1.13 4.51 6.41 8.22 29562 1
## delta[2] -0.78 0.01 1.62 -3.45 -0.78 1.85 28188 1
## delta[3] -0.14 0.01 1.39 -2.37 -0.16 2.18 28909 1
## delta[4] 3.08 0.00 0.59 2.09 3.08 4.05 34277 1
## delta[5] -0.16 0.01 1.01 -1.77 -0.18 1.52 27491 1
## delta[6] 3.87 0.00 0.86 2.47 3.87 5.27 35079 1
## delta[7] 4.04 0.01 1.11 2.21 4.03 5.87 32913 1
## delta[8] 5.23 0.01 1.29 3.12 5.23 7.36 33503 1
## delta[9] 1.79 0.01 1.25 -0.27 1.78 3.82 30709 1
## delta[10] 1.38 0.01 1.12 -0.46 1.38 3.21 30522 1
## delta[11] 4.47 0.01 1.25 2.43 4.47 6.54 34573 1
## delta[12] 0.79 0.01 1.45 -1.60 0.80 3.16 33422 1
## Delta 2.48 0.00 0.89 1.01 2.50 3.89 31970 1
## tau 2.72 0.00 0.71 1.72 2.64 4.01 24118 1
##
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 15:47:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).</code></pre>
<p>The forest plot is quite similar to the one based on the mixed effects model, though as predicted, the 95% CI is considerably wider:</p>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
<p>As a comparison, here is the plot from the mixed effects model estimated using the <code>nlme</code> package in the previous post. The bootstrapped estimates of uncertainty at the study level are quite close to the Bayesian measure of uncertainty; the difference really lies in the uncertainty around the global estimate.</p>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
</div>
Simulating multiple RCTs to simulate a meta-analysis
https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/
Tue, 07 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I am currently involved with an RCT that is struggling to recruit eligible patients (by no means an unusual problem), increasing the risk that findings might be inconclusive. A possible solution to this conundrum is to find similar, ongoing trials with the aim of pooling data in a single analysis, to conduct a <em>meta-analysis</em> of sorts.</p>
<p>In an ideal world, this theoretical collection of sites would have joined forces to develop a single study protocol, but often there is no structure or funding mechanism to make that happen. However, this group of studies may be similar enough - based on the target patient population, study inclusion and exclusion criteria, therapy protocols, comparison or control condition, randomization scheme, and outcome measurement - that it might be reasonable to estimate a single treatment effect and some measure of uncertainty.</p>
<p>This pooling approach would effectively be a prospective meta-analysis using <em>individual participant data</em>. The goal is to estimate a single treatment effect for this intervention or therapy that has been evaluated by different groups under varying research conditions, with possibly different treatment effects in each study.</p>
<p>To explore how all of this works, I generated some data and fit some models. As usual I thought the code would be more useful sitting on this blog rather than hidden away on some secure server.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>In this simulation, I am using a generic continuous outcome <span class="math inline">\(y_{ik}\)</span>, for individual <span class="math inline">\(i\)</span> who is participating in study <span class="math inline">\(k\)</span>. The individual outcome is a function of the study itself and whether that individual received the experimental therapy (<span class="math inline">\(x_{ik} = 1\)</span> for patients in the experimental arm):</p>
<p><span class="math display">\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\
\\
\delta_k = \delta_0 + v_k
\]</span>
<span class="math inline">\(\alpha_k\)</span> is the intercept for study <span class="math inline">\(k\)</span>, or the average outcome for patients in study <span class="math inline">\(k\)</span> in the control arm. <span class="math inline">\(\delta_k\)</span> is the treatment effect in study <span class="math inline">\(k\)</span> and can be decomposed into a common treatment effect across all studies <span class="math inline">\(\delta_0\)</span> and a study-specific effect <span class="math inline">\(v_k\)</span>. <span class="math inline">\(v_k\)</span> is often assumed to be normally distributed, <span class="math inline">\(v_k \sim N(0, \tau^2)\)</span>. An individual effect, <span class="math inline">\(e_{ik}\)</span> is also assumed to be normally distributed, <span class="math inline">\(e_{ik} \sim N(0, \sigma_k^2)\)</span>. Note that the variance <span class="math inline">\(\sigma_k^2\)</span> of individual effects might differ across studies; that is, in some studies patients may be more similar to each other than in other studies.</p>
</div>
<div id="the-simulation-assumptions" class="section level3">
<h3>The simulation assumptions</h3>
<p>Before starting - here are the necessary libraries in case you want to follow along:</p>
<pre class="r"><code>library(simstudy)
library(parallel)
library(nlme)
library(data.table)</code></pre>
<p>In these simulations, there are 12 studies, each enrolling a different number of patients. There are a set of smaller studies, moderately sized studies, and larger studies. We are not really interested in the variability of the intercepts (<span class="math inline">\(\alpha_k\)</span>’s), but we generate based on a normal distribution <span class="math inline">\(N(3, 2)\)</span>. The overall treatment effect is set at <span class="math inline">\(3\)</span>, and the study-specific effects are distributed as <span class="math inline">\(N(0, 6)\)</span>. We use a gamma distribution to create the study-specific within study variation <span class="math inline">\(\sigma^2_k\)</span>: the average within-study variance is <span class="math inline">\(16\)</span>, and will range between <span class="math inline">\(1\)</span> and <span class="math inline">\(64\)</span> (the variance of the variances is <span class="math inline">\(mean^2 \times dispersion = 16^2 \times 0.2 = 51.2\)</span>). The study-specific data are generated using these assumptions:</p>
<pre class="r"><code>defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist= "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study",
formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
dist = "poisson")</code></pre>
<p>The individual outcomes are generated based on the model specified above:</p>
<pre class="r"><code>defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>First, we generate the study level data:</p>
<pre class="r"><code>RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)
ds <- genData(12, defS)
ds</code></pre>
<pre><code>## study a.k d.0 v.k s2.k size.study n.study
## 1: 1 2.51 3 2.7437 5.25 2 30
## 2: 2 1.51 3 -4.8894 30.48 2 37
## 3: 3 1.62 3 -4.1762 15.06 1 22
## 4: 4 3.34 3 0.2494 3.26 2 44
## 5: 5 2.34 3 -2.9078 5.59 1 15
## 6: 6 1.70 3 1.3498 7.42 2 44
## 7: 7 4.17 3 -0.4135 14.58 2 45
## 8: 8 2.14 3 0.7826 25.78 2 44
## 9: 9 2.54 3 -1.1197 15.72 1 28
## 10: 10 3.10 3 -2.1275 10.00 1 24
## 11: 11 2.62 3 -0.0812 32.76 2 40
## 12: 12 1.17 3 -0.5745 30.94 2 49</code></pre>
<p>And then we generate individuals within each study, assign treatment, and add the outcome:</p>
<pre class="r"><code>dc <- genCluster(ds, "study", "n.study", "id", )
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)</code></pre>
<p>The observed data set obviously does not include any underlying study data parameters. The figure based on this data set shows the individual-level outcomes by treatment arm for each of the 12 studies. The study-specific treatment effects and differences in within-study variation are readily apparent.</p>
<pre class="r"><code>d.obs <- dc[, .(study, id, x, y)]</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-8-1.png" width="528" /></p>
</div>
<div id="initial-estimates" class="section level3">
<h3>Initial estimates</h3>
<p>If each study went ahead and analyzed its own data set separately, the emerging picture would be a bit confusing. We would have 12 different estimates, some concluding that the treatment is effective, and others not able to draw that conclusion. A plot of the 12 model estimates along with the 95% confidence intervals highlights the muddled picture. For additional reference, I’ve added points that represent the true (and unknown) study effects in blue, including a blue line at the value of the overall treatment effect.</p>
<pre class="r"><code>lm.ind <- function(z, dx) {
fit <- lm(y~x, data = dx)
data.table(z, coef(fit)["x"], confint(fit, "x"))
}
res <- lapply(1:d.obs[, length(unique(study))], function(z) lm.ind(z, d.obs[study == z]))</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
</div>
<div id="the-meta-analysis" class="section level3">
<h3>The meta-analysis</h3>
<p>The meta-analysis is an attempt to pool the findings from all the studies to try to draw an overall conclusion. Traditionally, meta-analysis has been conducted using only the summary information from each study - effect size estimates, standard errors, and sample sizes. More recently, researchers have started to use individual-level data to estimate an overall effect. There are advantages to this added level of detail, particularly in enhancing the ability to model patient-level and study-level characteristics that might influence the effect size; these adjustments could help reduce the variance of the effect size estimates.</p>
<p>There are packages in <code>R</code> specifically designed to conduct meta-analysis, but I am doing it “manually” through the use of the <code>nlme</code> package, which estimates mixed-effects model that mimics the underlying data process. (In a subsequent post, I will do the same thing using a Bayesian model implement using <code>rstan</code>.) I opted for <code>nlme</code> over the <code>lme4</code> package, because the former can accommodate the possibility of different within-study variation.</p>
<p>The model fit here includes a study specific (fixed) intercept, an overall treatment effect, and a study-specific treatment effect. And, as I just mentioned, the within-study variation is accommodated:</p>
<pre class="r"><code>lmefit <- lme(y ~ factor(study) + x - 1,
random = ~ x - 1 | study,
weights = varIdent(form = ~ 1 | study),
data = d.obs, method = 'REML'
)</code></pre>
<p>The model estimate for the overall treatment effect is 2.5, just under but close to the true value of 3.0:</p>
<pre class="r"><code>round(coef(summary(lmefit))["x",], 3)</code></pre>
<pre><code>## Warning in pt(-abs(tVal), fDF): NaNs produced</code></pre>
<pre><code>## Value Std.Error DF t-value p-value
## 2.481 0.851 410.000 2.915 0.004</code></pre>
</div>
<div id="bootstrapping-uncertainty" class="section level3">
<h3>Bootstrapping uncertainty</h3>
<p>Every meta-analysis I’ve seen includes a forest plot that shows the individual study estimates along with the global estimate of primary interest. In my version of this plot, I wanted to show the estimated study-level effects from the model (<span class="math inline">\(\delta_0 + v_k\)</span>) along with 95% confidence intervals. The model fit does not provide a variance estimate for each study-level treatment effect, so I have estimated the standard error using bootstrap methods. I repeatedly sample from the observed data (sampling stratified by study and treatment arm) and estimate the same fixed effects model. For each iteration, I keep the estimated study-specific treatment effect as well as the estimated pooled effect:</p>
<pre class="r"><code>bootest <- function() {
bootid <- d.obs[, .(id = sample(id, .N, replace = TRUE)), keyby = .(study, x)][, .(id)]
dboot <- merge(bootid, d.obs, by = "id")
bootfit <- tryCatch(
{ lme(y ~ factor(study) + x - 1,
random = ~ x - 1 | study,
weights = varIdent(form = ~ 1 | study),
data = dboot, method = 'REML')
},
error = function(e) {
return("error")
},
warn = function(w) {
return("warning")
}
)
if (class(bootfit) == "lme") {
return(data.table(t(random.effects(bootfit) + fixed.effects(bootfit)["x"]),
pooled = fixed.effects(bootfit)["x"]))
}
}
res <- mclapply(1:3000, function(x) bootest(), mc.cores = 4)
res <- rbindlist(res)</code></pre>
<p>The next plot shows the individual study estimates based on the pooled analysis along with the overall estimate in red, allowing us to bring a little clarity to what was an admittedly confusing picture. We might conclude from these findings that the intervention appears to be effective.</p>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-14-1.png" width="576" /></p>
<p>As an aside, it is interesting to compare the two forest plot figures in the post, because it is apparent that the point estimates for the individual studies in the second plot are “pulled” closer to the overall average. This is the direct result of the mixed effects model that imposes a structure in the variation of effect sizes across the 12 studies. In contrast, the initial plot shows individual effect sizes that were independently estimated without any such constraint or structure. Pooling across groups or clusters generally has an attenuating effect on estimates.</p>
</div>
Consider a permutation test for a small pilot study
https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/
Tue, 23 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Recently I <a href="https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/">wrote</a> about the challenges of trying to learn too much from a small pilot study, even if it is a randomized controlled trial. There are limitations on how much you can learn about a treatment effect given the small sample size and relatively high variability of the estimate. However, the temptation for researchers is usually just too great; it is only natural to want to see if there is any kind of signal of an intervention effect, even though the pilot study is focused on questions of feasibility and acceptability.</p>
<p>Through my work with the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>, I have been involved with planning a research initiative to test the feasibility of studying a highly innovative strategy in nursing homes to reduce the risk of Covid-19 infections among both residents and staffs. Given that the strategy is so novel, there are big questions about whether it can even be implemented, and how the outcome measures can be collected. So, it may be premature to figure out if the approach will reduce infection. But still, it is hard not to try to gain a little insight into the potential effect of the intervention.</p>
<p>One of the lead investigators suggested a permutation test, because we know the sample is going to be small and we might not want to be forced to make parametric assumptions about the outcome. In the context of a pilot study, the permutation test might give a crude indication about the potential impact of an intervention. Would the full-blown follow-up study be conducted if there is no observed effect in the pilot? That is a bigger question. But, the suggestion of some sort of signal might provide additional motivation if feasibility was no longer a question; we would still need to be <a href="https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/">careful</a> about how we incorporate these findings into planning for the bigger study.</p>
<div id="permutation-test-explained-briefly" class="section level3">
<h3>Permutation test explained, briefly</h3>
<p>Typically, if we are comparing outcomes for two treatment arms, we calculate a statistic that quantifies the comparison. For example, this could be a difference in group means, a risk ratio, or a log-odds ratio. For whatever statistic we use, there would be an underlying sampling distribution of that statistic under the assumption that there is no difference between the two groups. Typically, the sampling distribution would be estimated analytically using additional assumptions about the underlying distributions of the observed data, such as normal or Poisson. We then use the sampling distribution to calculate a <em>p-value</em> for the observed value of the statistic.</p>
<p>The permutation approach is an alternative way to generate the sampling distribution of the statistic under an assumption of no group differences <em>without</em> making any assumptions about the distributions of the data. If group membership does not influence the outcome, it wouldn’t matter if we re-arranged all the treatment assignments in the data. We could do that and estimate the statistic. In fact, we could do that for all the possible arrangements, and that would give us a distribution of the statistic for that sample under the assumption of no group effect. (If the number of possible arrangements is excessive, we could just take a large sample of those possible arrangements, which is what I do below.) To get a <em>p-value</em>, we compare the observed statistic to this manufactured distribution.</p>
<p>Now to the simulations.</p>
</div>
<div id="the-data-generation-process" class="section level3">
<h3>The data generation process</h3>
<p>In this proposed study, we are interested in measuring the rate of Covid-19 infection in 8 nursing homes. Given the nature of the spread of disease and the inter-relationship of the infections between residents, the nursing home is the logical unit of analysis. So, we will only have 8 observations - hardly anything to hang your hat on. But, can the permutation provide any useful information?</p>
<p>The data generation process starts with generating a pool of residents at each site, about 15 per home. The study will run for followed for 4 months (or 120 days), and residents will come and go during that period. We are going to assume that the average residence time is 50 days at each home, but there will be some variability. Based on the number of patients and average length of stay, we can calculate the number of patient-days per site. The number of infected patients <span class="math inline">\(y\)</span> at a site is a function of the intervention and the time of exposure (patient-days). We will be comparing the average rates (<span class="math inline">\(y/patDays\)</span>) for the two groups.</p>
<p>In the first simulation, I am assuming no treatment effect, because I want to assess the Type 1 error (the probability of concluding there is an effect given we know there is no effect).</p>
<p>Here is a function to generate the data definitions and a second function to go through the simple data generation process:</p>
<pre class="r"><code>library(simstudy)
library(parallel)</code></pre>
<pre class="r"><code>defs <- function() {
def <- defDataAdd(varname = "nRes", formula = 15, dist = "poisson")
def <- defDataAdd(def, varname = "nDays", formula = 50, dist = "poisson")
def <- defDataAdd(def, varname = "patDays",
formula = "nRes * pmin(120, nDays)",
dist = "nonrandom")
def <- defDataAdd(def, varname = "y",
formula = "-4 - 0.0 * rx + log(patDays)",
variance = 1,
dist = "negBinomial", link = "log")
def <- defDataAdd(def, varname = "rate",
formula = "y/patDays",
dist = "nonrandom")
return(def[])
}</code></pre>
<pre class="r"><code>gData <- function(n, def) {
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(def, dx)
dx[]
}</code></pre>
<p>And here we actually generate a single data set:</p>
<pre class="r"><code>RNGkind(kind = "L'Ecuyer-CMRG")
#set.seed(72456)
set.seed(82456)
def <- defs()
dx <- gData(8, def)
dx</code></pre>
<pre><code>## id rx nRes nDays patDays y rate
## 1: 1 1 16 40 640 0 0.000000000
## 2: 2 0 9 59 531 8 0.015065913
## 3: 3 1 20 57 1140 58 0.050877193
## 4: 4 1 14 51 714 14 0.019607843
## 5: 5 1 7 59 413 4 0.009685230
## 6: 6 0 14 38 532 4 0.007518797
## 7: 7 0 10 40 400 19 0.047500000
## 8: 8 0 11 56 616 22 0.035714286</code></pre>
<p>The observed difference in rates is quite close to 0:</p>
<pre class="r"><code>dx[, mean(rate), keyby = rx]</code></pre>
<pre><code>## rx V1
## 1: 0 0.02644975
## 2: 1 0.02004257</code></pre>
<pre class="r"><code>obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]
obs.diff</code></pre>
<pre><code>## [1] -0.006407182</code></pre>
</div>
<div id="the-permutation-test" class="section level3">
<h3>The permutation test</h3>
<p>With 8 sites, there are <span class="math inline">\(8!\)</span> possible permutations, or a lot ways to scramble the treatment assignments.</p>
<pre class="r"><code>factorial(8)</code></pre>
<pre><code>## [1] 40320</code></pre>
<p>I decided to implement this myself in a pretty rudimentary way, though there are R packages out there that can certainly do this better. Since I am comparing averages, I am creating a vector that represents the contrast.</p>
<pre class="r"><code>rx <- dx$rx/sum(dx$rx)
rx[rx==0] <- -1/(length(dx$rx) - sum(dx$rx))
rx</code></pre>
<pre><code>## [1] 0.25 -0.25 0.25 0.25 0.25 -0.25 -0.25 -0.25</code></pre>
<p>I’m taking a random sample of 5000 permutations of the contrast vector, storing the results in a matrix:</p>
<pre class="r"><code>perm <- t(sapply(1:5000, function(x) sample(rx, 8, replace = FALSE)))
head(perm)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] -0.25 -0.25 0.25 -0.25 0.25 0.25 -0.25 0.25
## [2,] -0.25 -0.25 0.25 -0.25 0.25 -0.25 0.25 0.25
## [3,] 0.25 -0.25 -0.25 0.25 -0.25 -0.25 0.25 0.25
## [4,] 0.25 -0.25 0.25 0.25 -0.25 0.25 -0.25 -0.25
## [5,] -0.25 -0.25 -0.25 0.25 0.25 0.25 0.25 -0.25
## [6,] 0.25 -0.25 0.25 -0.25 0.25 -0.25 0.25 -0.25</code></pre>
<p>Using a simple operation of matrix multiplication, I’m calculating a rate difference for each of the sampled permutations:</p>
<pre class="r"><code>perm.diffs <- perm %*% dx$rate
head(perm.diffs)</code></pre>
<pre><code>## [,1]
## [1,] 0.005405437
## [2,] 0.025396039
## [3,] 0.004918749
## [4,] -0.007490399
## [5,] -0.004336380
## [6,] 0.007538896</code></pre>
<p>Here is an estimate of the 2-sided <em>p-value</em>:</p>
<pre class="r"><code>mean(abs(perm.diffs) > abs(obs.diff))</code></pre>
<pre><code>## [1] 0.7166</code></pre>
<p>And finally, here is a histogram of the permuted rate differences, with the observed rate different overlaid as a red line. The observed value lies pretty much right in the middle of the distribution, which is what the <em>p-value</em> suggests:</p>
<pre class="r"><code>ggplot(data = data.frame(perm.diffs), aes(x = perm.diffs)) +
geom_histogram(binwidth = .005, color = "white") +
geom_vline(xintercept = obs.diff, color = "red") +
scale_y_continuous(expand = c(0,0), breaks = seq(2500, 10000, 2500)) +
theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-23-permutation-test-for-a-covid-19-pilot-nursing-home-study.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
<div id="operating-characteristics" class="section level3">
<h3>Operating characteristics</h3>
<p>As in a power analysis by simulation, we can estimate the Type 1 error rate by generating many data sets, and for each one calculate a <em>p-value</em> using the permutation test. The proportion of <em>p-values</em> less than 0.05 would represent the Type 1 error rate, which should be close to 0.05.</p>
<pre class="r"><code>iter <- function(n) {
dx<- gData(n, defs())
obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]
rx <- dx$rx/sum(dx$rx)
rx[rx==0] <- -1/(n - sum(dx$rx))
perm <- t(sapply(1:20000, function(x) sample(rx, n, replace = FALSE)))
perm.diffs <- perm %*% dx$rate
mean(abs(perm.diffs) > abs(obs.diff))
}</code></pre>
<p>Here we use 5000 data sets to estimate the Type 1 error rate under the data generating process we’ve been using all along, and for each of those data sets we use 5000 permutations to estimate the p-value.</p>
<pre class="r"><code>res <- unlist(mclapply(1:5000, function(x) iter(8), mc.cores = 4))
mean(res < .05)</code></pre>
<pre><code>## [1] 0.0542</code></pre>
</div>
<div id="the-risks-of-using-a-model-with-assumptions" class="section level3">
<h3>The risks of using a model (with assumptions)</h3>
<p>If we go ahead and try to find a signal using a parametric model, there’s a chance we’ll be led astray. These data are count data, so it would not be strange to consider Poisson regression model to estimate the treatment effect (in this case, the effect would be a rate <em>ratio</em> rather than a rate <em>difference</em>). Given that the data are quite limited, we may not really be in a position to verify whether the Poisson distribution is appropriate; as a result, it might be hard to actually select the right model. (In reality, I know that this model <em>will</em> lead us astray, because we used a negative binomial distribution, a distribution with more variance than the Poisson, to generate the count data.)</p>
<p>Just as before, we generate 5000 data sets. For each one we fit a generalized linear model with a Poisson distribution and a log link, and store the effect estimate along with the <em>p-value</em>.</p>
<pre class="r"><code>chkglm <- function(n) {
dx <- gData(n, defs())
glmfit <- glm( y ~ rx + offset(log(patDays)), family = poisson, data = dx)
data.table(t(coef(summary(glmfit))["rx",]))
}
glm.res <- rbindlist(mclapply(1:5000, function(x) chkglm(8)))</code></pre>
<p>The estimated Type 1 error is far greater than 0.05; there would be a pretty good chance that we will be over-enthusiastic about the potential success of our new nursing home strategy if it was not actually effective.</p>
<pre class="r"><code>glm.res[, .(mean(`Pr(>|z|)` < 0.05))]</code></pre>
<pre><code>## V1
## 1: 0.6152</code></pre>
</div>
<div id="when-there-is-a-treatment-effect" class="section level3">
<h3>When there is a treatment effect</h3>
<p>In the case where there is actually a treatment effect, the observed effect size is more likely to fall closer to one of the distribution’s tails, depending on the direction of the effect. If the treatment reduces the number infections, we would expect the rate difference to be <span class="math inline">\(< 0\)</span>, as it is in this particular case:</p>
<pre class="r"><code>def <- updateDefAdd(def, changevar = "y",
newformula = "-4 - 1.2 * rx + log(patDays)" )</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-23-permutation-test-for-a-covid-19-pilot-nursing-home-study.en_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
<p>At the end of the day, if you feel like you must estimate the treatment effect in a pilot study before moving on to the larger trial, one option is to use a non-parametric approach like a permutation test that requires fewer assumptions to lead you astray.</p>
<p>In the end, though, we opted for a different model. If we do get the go ahead to conduct this study, we will fit a Bayesian model instead. We hope this will be flexible enough to accommodate a range of assumptions and give us a potentially more informative posterior probability of a treatment effect. If we actually get the opportunity to do this, I’ll consider describing that model here.</p>
<p>
<p><small><font color="darkkhaki">
Support:</p>
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
When proportional odds is a poor assumption, collapsing categories is probably not going to save you
https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/
Tue, 09 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Continuing the discussion on cumulative odds models I started <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">last time</a>, I want to investigate a solution I always assumed would help mitigate a failure to meet the proportional odds assumption. I’ve believed if there is a large number of categories and the relative cumulative odds between two groups don’t appear proportional across all categorical levels, then a reasonable approach is to reduce the number of categories. In other words, fewer categories translates to proportional odds. I’m not sure what led me to this conclusion, but in this post I’ve created some simulations that seem to throw cold water on that idea.</p>
<div id="when-the-odds-are-proportional" class="section level3">
<h3>When the odds <em>are</em> proportional</h3>
<p>I think it is illustrative to go through a base case where the odds are actually proportional. This will allow me to introduce the data generation and visualization that I’m using to explore this issue. I am showing a lot of code here, because I think it is useful to see how it is possible to visualize cumulative odds data and the model estimates.</p>
<p>The first function <code>genDT</code> generates a data set with two treatment arms and an ordinal outcome. <code>genOrdCat</code> uses a base set of probabilities for the control arm, and the experimental arm probabilities are generated under an assumption of proportional cumulative odds (see the <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">previous post</a> for more details on what cumulative odds are and what the model is).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
genDT <- function(nobs, baseprobs, defA) {
dT <- genData(nobs)
dT <- trtAssign(dT, grpName = "rx")
dT <- addColumns(defA, dT)
dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
dT[]
}</code></pre>
<p>In this case, I’ve set the base probabilities for an ordinal outcome of 8 categories. The log of the cumulative odds ratio comparing experimental arm to control is 1.0 (and is parameterized as -1.0). In this case, the proportional odds ratio should be about 2.7.</p>
<pre class="r"><code>baseprobs <- c(0.05, 0.10, 0.15, 0.25, .20, 0.15, .05, .05)
defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")
set.seed(29672) # 19779
dT.prop <- genDT(200, baseprobs, defA)</code></pre>
<p>Calculation of the observed cumulative odds ratio at each response level doesn’t provide an entirely clear picture about proportionality, but the sample size is relatively small given the number of categories.</p>
<pre class="r"><code>codds <- function(cat, dx) {
dcodds <- dx[, .(codds = mean(as.numeric(r) <= cat)/mean(as.numeric(r) > cat)),
keyby = rx]
round(dcodds[rx == 1, codds] / dcodds[rx==0, codds], 2)
}
sapply(1:7, function(x) codds(x, dT.prop))</code></pre>
<pre><code>## [1] 1.48 3.81 3.12 1.83 2.05 3.59 2.02</code></pre>
<div id="a-visual-assessment" class="section level4">
<h4>A visual assessment</h4>
<p>An excellent way to assess proportionality is to do a visual comparison of the <em>observed</em> cumulative probabilities with the <em>estimated</em> cumulative probabilities from the cumulative odds model that makes the assumption of proportional odds.</p>
<p>I’ve written three functions that help facilitate this comparison. <code>getCumProbs</code> converts the parameter estimates of cumulative odds from the model to estimates of cumulative probabilities.</p>
<pre class="r"><code>getCumProbs <- function(coefs) {
cumprob0 <- data.table(
cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")])), 1),
r = factor(1 : nrow(coefs)),
rx = 0
)
cumprob1 <- data.table(
cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")] +
coefs["rx", 1])), 1),
r = factor(1 : nrow(coefs)),
rx = 1
)
rbind(cumprob0, cumprob1)[]
}</code></pre>
<p>The function <code>bootCumProb</code> provides a single bootstrap from the data so that we can visualize the uncertainty of the estimated cumulative probabilities. In this procedure, a random sample is drawn (with replacement) from the data set, a <code>clm</code> model is fit, and the cumulative odds are converted to cumulative probabilities.</p>
<pre class="r"><code>library(ordinal)
bootCumProb <- function(bootid, dx) {
sampid <- dx[, .(srow = sample(.I, replace = TRUE)), keyby = rx][, srow]
dtBoot <- dx[sampid,]
bootFit <- clm(r ~ rx, data = dtBoot)
bcoefs <- coef(summary(bootFit))
bcumProbs <- getCumProbs(bcoefs)
bcumProbs[, bootid := bootid]
bcumProbs[]
}</code></pre>
<p>The third function <code>fitPlot</code> fits a <code>clm</code> model to the original data set, collects the bootstrapped estimates, calculates the observed cumulative probabilities, converts the estimated odds to estimated probabilities, and generates a plot of the observed data, the model fit, and the bootstrap estimates.</p>
<pre class="r"><code>library(ggplot2)
library(paletteer)
fitPlot <- function(dx) {
clmFit <- clm(r ~ rx, data = dx)
coefs <- coef(summary(clmFit))
bootProbs <- rbindlist(lapply(1:500, function(x) bootCumProb(x, dx)))
cumObsProbs <- dx[, .N, keyby = .(rx, r)]
cumObsProbs[, cumprob := cumsum(N)/sum(N) , keyby = rx]
cumModProbs <- getCumProbs(coefs)
ggplot(data = cumObsProbs, aes(x = r, y = cumprob , color = factor(rx))) +
geom_line(data = cumModProbs, alpha = 1, aes(group=rx)) +
geom_line(data = bootProbs, alpha = .01,
aes(group = interaction(rx, bootid))) +
geom_point() +
ylab("cumulative probability") +
xlab("ordinal category") +
theme(panel.grid = element_blank(),
legend.position = "none") +
scale_color_paletteer_d("jcolors::pal6")
}</code></pre>
<p>Here is a plot based on the original data set of 200 observations. The observed values are quite close to the modeled estimates, and well within the range of the bootstrap estimates.</p>
<pre class="r"><code>fitPlot(dT.prop)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-7-1.png" width="528" /></p>
</div>
<div id="collapsing-the-categories" class="section level4">
<h4>Collapsing the categories</h4>
<p>Continuing with the same data set, let’s see what happens when we collapse categories together. I’ve written a function <code>collapseCat</code> that takes a list of vectors of categories that are to be combined and returns a new, modified data set.</p>
<pre class="r"><code>collapseCat <- function(dold, collapse) {
dx <- copy(dold)
for (i in 1:length(collapse)) {
dx[r %in% collapse[[i]], r:= min(collapse[[i]])]
}
dx[, r := factor(r)]
dx[, r := factor(r, labels = c(1:length(levels(r))))]
dx[]
}</code></pre>
<p>Here is the distribution of the original data set:</p>
<pre class="r"><code>dT.prop[, table(rx, r)]</code></pre>
<pre><code>## r
## rx 1 2 3 4 5 6 7 8
## 0 7 4 14 31 19 15 8 2
## 1 10 22 19 19 16 11 2 1</code></pre>
<p>And if we combine categories 1, 2, and 3 together, as well as 7 and 8, here is the resulting distribution based on the remaining five categories. Here’s a quick check to see that the categories were properly combined:</p>
<pre class="r"><code>collapse <- list(c(1,2,3), c(7,8))
collapseCat(dT.prop, collapse)[, table(rx, r)]</code></pre>
<pre><code>## r
## rx 1 2 3 4 5
## 0 25 31 19 15 10
## 1 51 19 16 11 3</code></pre>
<p>If we create four modified data sets based on different combinations of groups, we can fit models and plot the cumulative probabilities for all for of them. In all cases the proportional odds assumption still seems pretty reasonable.</p>
<pre class="r"><code>collapse <- list(list(c(3, 4), c(6, 7)),
list(c(1,2,3), c(7,8)),
list(c(1,2,3), c(4, 5), c(7,8)),
list(c(1,2), c(3, 4, 5), c(6, 7, 8))
)
dC.prop <- lapply(collapse, function(x) collapseCat(dT.prop, x))
cplots <- lapply(dC.prop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
</div>
<div id="non-proportional-cumulative-odds" class="section level3">
<h3>Non-proportional cumulative odds</h3>
<p>That was all just a set-up to explore what happens in the case of non-proportional odds. To do that, there’s just one more function to add - we need to generate data that does not assume proportional cumulative odds. I use the <code>rdirichlet</code> in the <code>gtools</code> package to generate values between 0 and 1, which sum to 1. The key here is that there is no pattern in the data - so that the ratio of the cumulative odds will not be constant.</p>
<pre class="r"><code>genDTnon <- function(nobs, ncat) {
ps <- gtools::rdirichlet(2, rep(2, ncat))
p0 <- paste(ps[1, -ncat], collapse = ";")
p1 <- paste(ps[2, -ncat], collapse = ";")
defc <- defCondition(condition = "rx == 0", formula = p0,
dist = "categorical")
defc <- defCondition(defc, condition = "rx == 1", formula = p1,
dist = "categorical")
dx <- genData(nobs)
dx <- trtAssign(dx, grpName = "rx")
dx <- addCondition(defc, dx, "r")
dx[, r := factor(r)]
dx[]
}</code></pre>
<p>Again, we generate a data set with 200 observations and an ordinal categorical outcome with 8 levels. The plot of the observed and estimated cumulative probabilities suggests that the proportional odds assumption is not a good one here. Some of the observed probabilities are quite far from the fitted lines, particularly at the low end of the ordinal scale. It may not be a disaster to to use a <code>clm</code> model here, but it is probably not a great idea.</p>
<pre class="r"><code>dT.nonprop <- genDTnon(200, 8)
fitPlot(dT.nonprop)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-13-1.png" width="528" /></p>
<p>The question remains - if we reduce the number of categories does the assumption of proportional odds come into focus? The four scenarios shown here do not suggest much improvement. The observed data still fall outside or at the edge of the bootstrap bands for some levels in each case.</p>
<pre class="r"><code>dC.nonprop <- lapply(collapse, function(x) collapseCat(dT.nonprop, x))
cplots <- lapply(dC.nonprop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>What should we do in this case? That is a tough question. The proportional odds model for the original data set with eight categories is probably just as reasonable as estimating a model using any of the combined data sets; there is no reason to think that any one of the alternatives with fewer categories will be an improvement. And, as we learned <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">last time</a>, we may actually lose power by collapsing some of the categories. So, it is probably best to analyze the data set using its original structure, and find the best model for that data set. Ultimately, that best model may need to relax the proportionality assumption; a post on that will need to be written another time.</p>
</div>
Considering the number of categories in an ordinal outcome
https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/
Tue, 26 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In two Covid-19-related trials I’m involved with, the primary or key secondary outcome is the status of a patient at 14 days based on a World Health Organization ordered rating scale. In this particular ordinal scale, there are 11 categories ranging from 0 (uninfected) to 10 (death). In between, a patient can be infected but well enough to remain at home, hospitalized with milder symptoms, or hospitalized with severe disease. If the patient is hospitalized with severe disease, there are different stages of oxygen support the patient can be receiving, such as high flow oxygen or mechanical ventilation.</p>
<p>It is common to analyze ordinal categorical outcomes like the WHO status measure using a cumulative proportional odds model. (I’ve described these models in a number of posts, starting <a href="https://www.rdatagen.net/post/ordinal-regression/">here</a> and <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a>.) We’ve be been wrestling with the question of whether to use the full 11-point scale or to collapse categories to create a simpler outcome of four or five groups. One issue that comes up is whether this reduction would increase or decrease our ability to detect a treatment effect, assuming of course that there is a treatment effect. To explore the issue, I turned to simulation.</p>
<div id="a-very-quick-recap-of-the-model" class="section level3">
<h3>A very quick recap of the model</h3>
<p>In the cumulative proportional odds model, we are comparing a series of cumulative odds across two groups, and we make an assumption that the ratio of these cumulative odds for the two groups is consistent throughout, the proportional odds assumption.</p>
<p>The cumulative odds for the control group that the status is <span class="math inline">\(x\)</span> is</p>
<p><span class="math display">\[
\text{cOdds}_{c}(x) = \frac{P(Status \le x | rx = Control)}{P(Status \gt x |rx = Control)}
\]</span></p>
<p>And the cumulative odds <strong><em>ratio</em></strong> comparing <em>Control</em> to <em>Treated</em> is</p>
<p><span class="math display">\[
\text{COR}_{ct}(x) = \frac{\text{cOdds}_c(x)}{\text{cOdds}_t(x)}
\]</span></p>
<p>In the proportional odds model, with a measure that has <span class="math inline">\(K\)</span> levels we make the assumption that</p>
<p><span class="math display">\[
\text{COR}_{ct}(1) = \text{COR}_{ct}(2) = \dots =\text{COR}_{ct}(K)
\]</span></p>
<p>The model that we estimate is</p>
<p><span class="math display">\[
\text{logit}(P(Status \le x)) = \alpha_x - \beta * I(rx = Treat)
\]</span>
where <span class="math inline">\(\alpha_x\)</span> is the log cumulative odds for a particular levels <span class="math inline">\(x\)</span>, and <span class="math inline">\(-\beta = \text{COR}_{ct}(k)\)</span>, the (proportional) log odds ratio across all <span class="math inline">\(k\)</span> status levels.</p>
</div>
<div id="conceputalizing-the-categories" class="section level3">
<h3>Conceputalizing the categories</h3>
<p>I am comparing estimates of models for outcome scales that use a range of categories, from 2 to 16. (I expanded beyond 11 to get a better sense of the results when the gradations become quite fine.) The figure shows the 16-group structure collapsing into 2 groups. The first row depicts the distribution of the control group across 16 categories. The second row combines the 2 rightmost purple categories of the first row into a single category, resulting in 15 total categories. Moving downwards, a pair of adjacent categories are combined at each step, until only 2 categories remain at the bottom.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-26-the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome.en_files/figure-html/unnamed-chunk-3-1.png" width="768" /></p>
<p>And here are the actual probabilities for the bottom seven rows, from 8 categories down to 2:</p>
<pre class="r"><code>baseprobs[7:1]</code></pre>
<pre><code>## [[1]]
## [1] 0.075 0.075 0.075 0.075 0.175 0.175 0.175 0.175
##
## [[2]]
## [1] 0.075 0.075 0.150 0.175 0.175 0.175 0.175
##
## [[3]]
## [1] 0.075 0.075 0.150 0.350 0.175 0.175
##
## [[4]]
## [1] 0.150 0.150 0.350 0.175 0.175
##
## [[5]]
## [1] 0.15 0.15 0.35 0.35
##
## [[6]]
## [1] 0.30 0.35 0.35
##
## [[7]]
## [1] 0.3 0.7</code></pre>
</div>
<div id="generating-the-data" class="section level3">
<h3>Generating the data</h3>
<p>To simulate the data, I use the function <code>genOrdCat</code> in <code>simstudy</code> that uses the base probabilities and the log-odds ratio transforming variable, which in this case is <span class="math inline">\(z\)</span>. (I introduced this function a while <a href="https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/">back</a>.) In this case the log odds ratio <span class="math inline">\((-\beta)\)</span> is 1, which translates to a cumulative odds ratio of <span class="math inline">\(exp(1) = 2.72\)</span>.</p>
<pre class="r"><code>library(simstudy)
defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")
genDT <- function(nobs, baseprobs, defA) {
dT <- genData(nobs)
dT <- trtAssign(dT, grpName = "rx")
dT <- addColumns(defA, dT)
dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
dT[]
}</code></pre>
<p>A single data set of 5000 observations with 6 categories looks like this:</p>
<pre class="r"><code>set.seed(7891237)
(dx <- genDT(5000, baseprobs[[5]], defA ))</code></pre>
<pre><code>## id rx z r
## 1: 1 0 0 1
## 2: 2 1 -1 3
## 3: 3 0 0 5
## 4: 4 0 0 4
## 5: 5 1 -1 1
## ---
## 4996: 4996 0 0 3
## 4997: 4997 1 -1 4
## 4998: 4998 0 0 3
## 4999: 4999 1 -1 3
## 5000: 5000 1 -1 4</code></pre>
<p>Here are the distributions by treatment arm:</p>
<pre class="r"><code>prop.table(dx[, table(rx, r)], margin = 1)</code></pre>
<pre><code>## r
## rx 1 2 3 4 5 6
## 0 0.0644 0.0772 0.1524 0.3544 0.1772 0.1744
## 1 0.1792 0.1408 0.2204 0.2880 0.1012 0.0704</code></pre>
<p>Here are the cumulative odds and the odds ratio for a response being 2 or less:</p>
<pre class="r"><code>(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 2)/mean(as.numeric(r) > 2)), keyby = rx])</code></pre>
<pre><code>## rx codds
## 1: 0 0.165
## 2: 1 0.471</code></pre>
<pre class="r"><code>dcodds[rx == 1, codds] / dcodds[rx==0, codds]</code></pre>
<pre><code>## [1] 2.85</code></pre>
<p>And here are the cumulative odds and COR for a response being 4 or less.</p>
<pre class="r"><code>(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 4)/mean(as.numeric(r) > 4)), keyby = rx])</code></pre>
<pre><code>## rx codds
## 1: 0 1.84
## 2: 1 4.83</code></pre>
<pre class="r"><code>dcodds[rx == 1, codds] / dcodds[rx==0, codds]</code></pre>
<pre><code>## [1] 2.62</code></pre>
<p>The CORs are both close to the true COR of 2.72.</p>
</div>
<div id="running-the-experiment" class="section level3">
<h3>Running the experiment</h3>
<p>I was particularly interested in understanding the impact of increasing the number of categories <span class="math inline">\(K\)</span> on the probability of observing a treatment effect (i.e. the power). This required generating many (in this case 10,000) data sets under each scenario defined by the number of categories ranging from 2 to 16, and then estimating a cumulative odds model for each data set. I used the <code>clm</code> function in the <code>ordinal</code> package.</p>
<p>Two functions implement this iteration. <code>analyzeDT</code> generates a data set and returns a model fit. <code>iterate</code> repeatedly calls <code>analyzeDT</code> and estimates power for a particular scenario by calculating the proportion of p-values that are less than 0.05:</p>
<pre class="r"><code>library(ordinal)
analyzeDT <- function(nobs, baseprobs, defA) {
dT <- genDT(nobs, baseprobs, defA)
clmFit <- clm(r ~ rx, data = dT)
coef(summary(clmFit))
}
iterate <- function(niter, nobs, baseprobs, defA) {
res <- lapply(1:niter, function(x) analyzeDT(nobs, baseprobs, defA))
mean( sapply(res, function(x) x["rx", "Pr(>|z|)"]) < 0.05)
}</code></pre>
<p><code>lapply</code> is used here to cycle through each scenario (for enhanced speed <code>mclapply</code> in the <code>parallel</code> package could be used):</p>
<pre class="r"><code>set.seed(1295)
power <- lapply(baseprobs, function(x) iterate(niter = 10000,
nobs = 100, x, defA))</code></pre>
</div>
<div id="effect-of-k-on-power" class="section level3">
<h3>Effect of K on power</h3>
<p>A plot of the estimates suggests a strong relationship between the number of categories and power:</p>
<p><img src="https://www.rdatagen.net/post/2020-05-26-the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>In this particular case, it seems apparent there are benefits to increasing from 2 to 6 categories. However, there are slight gains to be had by increasing the number of categories much beyond this; indeed, extending to the full 16 categories may not be worth the trouble, as the gains in power are minimal.</p>
<p>These minimal gains need to be weighed against the potential difficulty in acquiring the finely grained categorical outcomes. In cases where the defined categories are completely objective and are naturally collected as part of an operating environment - as in the WHO scale that might be gathered from an electronic health record - there is no real added burden to maximizing the number of categories. However, if the outcome scores are based on patient responses to a survey, the quality of data collection may suffer. Adding additional categories may confuse the patient and make the data collection process more burdensome, resulting in unreliable responses or even worse, missing data. In this case, the potential gains in power may be offset by poor data quality.</p>
</div>
To stratify or not? It might not actually matter...
https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/
Tue, 12 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Continuing with the theme of <em>exploring small issues that come up in trial design</em>, I recently used simulation to assess the impact of stratifying (or not) in the context of a multi-site Covid-19 trial with a binary outcome. The investigators are concerned that baseline health status will affect the probability of an outcome event, and are interested in randomizing by health status. The goal is to ensure balance across the two treatment arms with respect to this important variable. This randomization would be paired with an estimation model that adjusts for health status.</p>
<p>An alternative strategy is to ignore health status in the randomization, but to pre-specify an outcome model that explicitly adjusts for health status, just as in the stratification scenario. The question is, how do the operating characteristics (e.g. <em>power</em>, <em>variance</em>, and <em>bias</em>) of each approach compare. Are the (albeit minimal) logistics necessary for stratification worth the effort?</p>
<div id="simulation" class="section level3">
<h3>Simulation</h3>
<p>Simulations under a variety of scenarios suggest that stratification might not be necessary. (See this <a href="https://www.sciencedirect.com/science/article/pii/S0895435698001383">paper</a> for a much deeper, richer discussion of these issues.)</p>
<div id="define-the-data" class="section level4">
<h4>Define the data</h4>
<p>In these simulations, I assume that there are a small number of clusters. The proportion of high risk cases in each cluster varies (specified by <code>p</code>), as do the event rates (specified by <code>a</code>). The simulations vary the log odds of an outcome (<code>baseLO</code>), effect sizes/log-odds ratio (<code>effLOR</code>), and the effect of poor health status <code>xLOR</code>):</p>
<pre class="r"><code>library(simstudy)
library(parallel)
setDefs <- function(pX, precX, varRE, baseLO, effLOR, xLOR) {
defc <- defData(varname = "p", formula = pX, variance = precX,
dist = "beta", id = "site")
defc <- defData(defc, varname = "a", formula = 0, variance = varRE)
form <- genFormula(c(baseLO, effLOR, xLOR, 1), vars = c("rx", "x", "a"))
defi1 <- defDataAdd(varname = "x", formula = "p", dist = "binary")
defi2 <- defDataAdd(varname = "y", formula = form, dist = "binary", link = "logit")
return(list(defc = defc, defi1 = defi1, defi2 = defi2))
}</code></pre>
</div>
<div id="generate-the-data-and-estimates" class="section level4">
<h4>Generate the data and estimates</h4>
<p>Under each scenario, the data definitions are established by a call to <code>setDefs</code> and treatment is randomized, stratified by <em>site</em> only, or by <em>site</em> <strong>and</strong> <em>health status</em> <code>x</code>. (There is a slight bug in the <code>trtAssign</code> function that will generate an error if there is only a single observation in a site and particular strata - which explains my use of the <code>try</code> function to prevent the simulations from grinding to a halt. This should be fixed soon.)</p>
<p>For each generated data set under each scenario, we estimate a <em>generalized linear model</em>:</p>
<p><span class="math display">\[
logit(y_{ij}) = \beta_0 + \gamma_j + \beta_1r_i + \beta_2x_i \ ,
\]</span>
where <span class="math inline">\(y_{ij}\)</span> is the outcome for patient <span class="math inline">\(i\)</span> at site <span class="math inline">\(j\)</span>, <span class="math inline">\(r_i\)</span> is the treatment indicator, and <span class="math inline">\(x_i\)</span> is the health status indicator. <span class="math inline">\(\gamma_j\)</span> is a fixed site-level effect. The function returns parameter estimate for the log-odds ratio (the treatment effect <span class="math inline">\(\beta_1\)</span>), as well as its standard error estimate and p-value.</p>
<pre class="r"><code>genEsts <- function(strata, nclust, clustsize, pX, precX,
varRE, baseLO, effLOR, xLOR) {
defs <- setDefs(pX, precX, varRE, baseLO, effLOR, xLOR)
dc <- genData(nclust, defs$defc)
dx <- genCluster(dc, "site", clustsize , "id")
dx <- addColumns(defs$defi1, dx)
dx <- try(trtAssign(dx, strata = strata, grpName = "rx"), silent = TRUE)
if ( (class(dx)[1]) == "try-error") {
return(NULL)
}
dx <- addColumns(defs$defi2, dx)
glmfit <- glm(y~factor(site) + rx + x, data = dx, family = "binomial")
estrx <- t(coef(summary(glmfit))["rx", ])
return(data.table(estrx))
}</code></pre>
</div>
<div id="iterate-through-multiple-scenarios" class="section level4">
<h4>Iterate through multiple scenarios</h4>
<p>We will “iterate” through different scenarios using the <code>mclapply</code> function the <code>parallel</code> package. For each scenario, we generate 2500 data sets and parameter estimates. For each of these scenarios, we calculate the</p>
<pre class="r"><code>forFunction <- function(strata, nclust, clustsize, pX, precX,
varRE, baseLO, effLOR, xLOR) {
res <- rbindlist(mclapply(1:2500, function(x)
genEsts(strata, nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))
data.table(strata = length(strata), nclust, clustsize, pX, precX,
varRE, baseLO, effLOR, xLOR,
est = res[, mean(Estimate)],
se.obs = res[, sd(Estimate)],
se.est = res[, mean(`Std. Error`)],
pval = res[, mean(`Pr(>|z|)` < 0.05)]
)
}</code></pre>
</div>
<div id="specify-the-scenarios" class="section level4">
<h4>Specify the scenarios</h4>
<p>We specify all the scenarios by creating a data table of parameters. Each row of this table represents a specific scenario, for which 2500 data sets will be generated and parameters estimated. For these simulations that I am reporting here, I varied the strata for randomization, the cluster size, the baseline event rate, and the effect size, for a total of 336 scenarios (<span class="math inline">\(2 \times 6 \times 4 \times 7\)</span>).</p>
<pre class="r"><code>strata <- list("site", c("site", "x"))
nclust <- 8
clustsize <- c(30, 40, 50, 60, 70, 80)
pX <- 0.35
precX <- 30
varRE <- .5
baseLO <- c(-1.5, -1.25, -1.0, -0.5)
effLOR <- seq(0.5, 0.8, by = .05)
xLOR <- c(.75)
dparam <- data.table(expand.grid(strata, nclust, clustsize, pX, precX,
varRE, baseLO, effLOR, xLOR))
setnames(dparam, c("strata","nclust", "clustsize", "pX", "precX",
"varRE", "baseLO", "effLOR", "xLOR"))
dparam</code></pre>
<pre><code>## strata nclust clustsize pX precX varRE baseLO effLOR xLOR
## 1: site 8 30 0.35 30 0.5 -1.5 0.5 0.75
## 2: site,x 8 30 0.35 30 0.5 -1.5 0.5 0.75
## 3: site 8 40 0.35 30 0.5 -1.5 0.5 0.75
## 4: site,x 8 40 0.35 30 0.5 -1.5 0.5 0.75
## 5: site 8 50 0.35 30 0.5 -1.5 0.5 0.75
## ---
## 332: site,x 8 60 0.35 30 0.5 -0.5 0.8 0.75
## 333: site 8 70 0.35 30 0.5 -0.5 0.8 0.75
## 334: site,x 8 70 0.35 30 0.5 -0.5 0.8 0.75
## 335: site 8 80 0.35 30 0.5 -0.5 0.8 0.75
## 336: site,x 8 80 0.35 30 0.5 -0.5 0.8 0.75</code></pre>
</div>
<div id="run-the-simulation" class="section level4">
<h4>Run the simulation</h4>
<p>Everything is now set up. We go through each row of the scenario table <code>dparam</code> to generate the summaries for each scenario by repeated calls to <code>forFunction</code>, again using <code>mclapply</code>.</p>
<pre class="r"><code>resStrata <- mclapply(1:nrow(dparam), function(x) with(dparam[x,],
forFunction(strata[[1]], nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))
resStrata <- rbindlist(resStrata)
resStrata[, .(strata, baseLO, effLOR, xLOR, est, se.obs, se.est, pval)]</code></pre>
<pre><code>## strata baseLO effLOR est se.obs se.est pval
## 1: 1 -1.5 0.5 0.53 0.33 0.31 0.38
## 2: 2 -1.5 0.5 0.52 0.32 0.31 0.39
## 3: 1 -1.5 0.5 0.52 0.28 0.27 0.49
## 4: 2 -1.5 0.5 0.51 0.27 0.27 0.48
## 5: 1 -1.5 0.5 0.51 0.24 0.24 0.58
## ---
## 332: 2 -0.5 0.8 0.82 0.21 0.20 0.98
## 333: 1 -0.5 0.8 0.82 0.19 0.19 0.99
## 334: 2 -0.5 0.8 0.82 0.19 0.19 1.00
## 335: 1 -0.5 0.8 0.82 0.18 0.17 1.00
## 336: 2 -0.5 0.8 0.81 0.18 0.17 1.00</code></pre>
</div>
</div>
<div id="plotting-the-results" class="section level3">
<h3>Plotting the results</h3>
<p>The plots below compare the estimates of the two different stratification strategies. Each point represents a specific scenario under stratification by site alone and stratification by site along and health status. If there are differences in the two strategies, we would expect to see the points diverge from the horizontal line. For all four plots, there appears to be little if any divergence, suggesting that, for these scenarios at least, little difference between stratification scenarios.</p>
<div id="power" class="section level4">
<h4>Power</h4>
<p>In this first scatter plot, the estimated power under each stratification strategy is plotted. Power is estimated by the proportion of p-values in the 2500 iterations that were less than 0.05. Regardless of whether observed power for a particular scenario is high or low, we generally observe the same power under both strategies. The points do not diverge far from the red line, which represents perfect equality.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-8-1.png" width="576" /></p>
</div>
<div id="standard-errors" class="section level4">
<h4>Standard errors</h4>
<p>There are two ways to look at the variability of the two strategies. First, we can look at the observed variability of the effect estimates across the 2500 iterations. And second, we can look at the average of the standard error estimates across the iterations. In general, the two randomization schemes appear quite similar with respect to both observed and estimated variation.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-9-1.png" width="576" /></p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
</div>
<div id="treatment-effects" class="section level4">
<h4>Treatment effects</h4>
<p>In this last plot, the average estimated treatment effect is shown for each scenario. The two stratification strategies both appear to provide the same unbiased estimates of the treatment effect.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p>
<p><small><font color="darkkhaki">
References:</p>
Kernan, Walter N., Catherine M. Viscoli, Robert W. Makuch, Lawrence M. Brass, and Ralph I. Horwitz. “Stratified randomization for clinical trials.” <em>Journal of clinical epidemiology</em> 52, no. 1 (1999): 19-26.
</font></small>
</p>
</div>
</div>
Simulation for power in designing cluster randomized trials
https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/
Tue, 28 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>As a biostatistician, I like to be involved in the design of a study as early as possible. I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis. (I guess this is the appropriate time to pull out the quote made by the famous statistician Ronald Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”)</p>
<p>In these times, when researchers are scrambling to improve care for patients the Covid-19, there isn’t often time for those early conversations, or they are happening with many different people. Recently, I’ve been asked to help figure out what the sample size requirements are for four or five studies involving promising therapies for Covid-19 patients at various stages of the disease. In most cases, randomization is at the patient, so power/sample size calculations are much simpler. In other situations, cluster randomization at the unit or hospital floor level is being considered, so the sample size estimates are a little more involved.</p>
<p>There are analytic/formula solutions for sample size estimates in non-clustered randomization. And if the outcome is continuous, adjustments can be made using an estimate of the <em>design effect</em>, which I wrote about <a href="https://www.rdatagen.net/post/what-exactly-is-the-design-effect/">recently</a>. When the outcome is binary, or the number of clusters is small, or the cluster sizes themselves are small, I feel more comfortable using simulation methods. Indeed, the <code>simstudy</code> package grew out of my work to facilitate data generation for this very reason.</p>
<p>My intention here is to provide code to help others in case they want to conduct these relatively simple power analyses. One of the proposed studies expected to have a large number of small-sized clusters, so that is what I’ll simulate here.</p>
<div id="the-data-generation-process" class="section level3">
<h3>The data generation process</h3>
<p>To estimate power under a range of scenarios, I’ve written two functions to define the data generation process, one to generate the data, and a final one to generate a single data set and estimate the parameters of a mixed effects model.</p>
<div id="data-definitions" class="section level4">
<h4>data definitions</h4>
<p>The variance of the cluster-level random effect is based on a conversion of the intra-cluster correlation (<em>ICC</em>) to the logistic scale, which is done through a call to the function <code>iccRE</code>. The definition of the outcome is based on this random effect plus a log odds-ratio that is derived from the control proportion and the assumed percent change:</p>
<pre class="r"><code>library(simstudy)
library(lme4)
defRE <- function(icc, dist = "binary", varW = NULL) {
setVar <- iccRE(ICC = icc, dist = dist, varWithin = varW)
def <- defData(varname = "a", formula = 0, variance = setVar, id = "cluster")
return(def)
}
defBinOut <- function(p1, pctdelta) {
p2 <- (1 - pctdelta) * p1
int <- round(log( p1/(1-p1) ), 4)
effect <- round(log( (p2/(1-p2)) / (p1/(1-p1) )), 4)
formula <- genFormula( c(int, effect, 1), c("rx","a") )
def <- defDataAdd(varname = "y", formula = formula, dist = "binary",
link = "logit")
return(def)
}</code></pre>
</div>
<div id="data-generation" class="section level4">
<h4>data generation</h4>
<p>The data generation follows from the data definitions. First, cluster-level data are generated (along with treatment assignment), and then the individual patient level data.</p>
<pre class="r"><code>genDataSet <- function(nclust, clustsize, re.def, out.def) {
dClust <- genData(nclust, re.def)
dClust <- trtAssign(dClust, grpName = "rx")
dPat <- genCluster(dtClust = dClust, cLevelVar = "cluster",
numIndsVar = clustsize, level1ID = "id")
dPat <- addColumns(out.def, dPat)
return(dPat)
}</code></pre>
</div>
<div id="model-estimation" class="section level4">
<h4>model estimation</h4>
<p>The <em>p-values</em> used for the power calculation are estimated using <code>glmer</code> of the <code>lme4</code> package, a generalized mixed effects model. (If the outcome were continuous, we would use <code>lmer</code> instead.) Unfortunately, this can be relatively resource-intensive, so the repeated estimations over a wide range of scenarios can be rather time consuming.</p>
<p>One way to speed things up is eliminate a step in the <code>glmer</code> algorithm to that takes considerable time, but has the side effect of excluding information about whether or not the model estimation has converged. Convergence can be a particular problem if variation across clusters is low, as when the <em>ICC</em> is low. The function below keeps track of whether an iteration has converged (but only if <code>fast</code> is set to FALSE). One might want to explore how frequently there is a failure to converge before turning on the <code>fast</code> flag.</p>
<p>This function returns the convergence status, the estimate of the random effects variance, and the effect parameter estimate, standard error, and p-value.</p>
<p><br></p>
<pre class="r"><code>genBinEsts <- function(nclust, clustsize, re.def, out.def, fast = FALSE) {
dP <- genDataSet(nclust, clustsize, re.def, out.def)
mod.re <- glmer(y ~ rx + (1|cluster), data = dP, family = binomial,
control = glmerControl( optimizer = "bobyqa", calc.derivs = !(fast) ))
convStatus <- as.numeric(length(summary(mod.re)$optinfo$conv$lme4))
res <- data.table(convStatus, re = VarCorr(mod.re)$cluster,
t(coef(summary(mod.re))["rx",]))
return(res)
}</code></pre>
</div>
</div>
<div id="single-data-set" class="section level3">
<h3>Single data set</h3>
<p>Here is an example setting the <em>ICC</em> at 0.025, the control proportion at 40%, and an effect size that translates to a 30% reduction (so that the treatment proportion will be 28%).</p>
<pre class="r"><code>(defa <- defRE(icc = 0.025))</code></pre>
<pre><code>## varname formula variance dist link
## 1: a 0 0.0844 normal identity</code></pre>
<pre class="r"><code>(defy <- defBinOut(0.40, 0.30))</code></pre>
<pre><code>## varname formula variance dist link
## 1: y -0.4055 + -0.539 * rx + 1 * a 0 binary logit</code></pre>
<p>And when we generate a single data set and estimate the parameters by calling the last function:</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG")
set.seed(2711)
genBinEsts(40, 10, defa, defy)</code></pre>
<pre><code>## convStatus re.(Intercept) Estimate Std. Error z value Pr(>|z|)
## 1: 0 0.284 -0.625 0.279 -2.24 0.0249</code></pre>
</div>
<div id="estimating-power" class="section level3">
<h3>Estimating power</h3>
<p>Everything is set up now to estimate power with repeated calls to this group of functions. This process can be done using the <code>mclapply</code> function in the <code>parallel</code> package, as I illustrated in earlier <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/">post</a>. Here, I am showing a <em>for loop</em> implementation.</p>
<p>The variables <code>ICC</code>, <code>SS</code>, <code>nClust</code>, <code>ctlPROB</code>, and <code>pctDELTA</code> are vectors containing all the possible scenarios for which power will be estimated. In this case, power will be based on 1000 iterations under set of assumptions.</p>
<p><br></p>
<pre class="r"><code>library(parallel)
nIters <- 1000
results <- NULL
for (icc in ICC) {
for (ss in SS) {
for (nclust in nCLUST) {
for (p1 in ctlPROB) {
for (pdelta in pctDELTA) {
clustsize <- ss %/% nclust
p2 <- p1 * (1 - pdelta)
defa <- defRE(icc)
defy <- defBinOut(p1, pdelta)
res <- rbindlist(mclapply(1:nIters,
function(x) genBinEsts(nclust, clustsize, defa, defy)))
dres <- data.table(icc, ss, nclust, clustsize, p1, p2, pdelta,
converged = res[, mean(convStatus == 0)],
p.conv = res[convStatus == 0, mean(`Pr(>|z|)` < 0.05)],
p.all = res[convStatus != 2, mean(`Pr(>|z|)` < 0.05)])
print(dres)
results <- rbind(results, dres)
}
}
}
}
}</code></pre>
<p>The first set of simulations evaluated power at a range of <em>ICC’s</em>, sample sizes, and effect sizes (in terms of percentage reduction). The number of clusters was fixed at 40, so the cluster size increased along with sample size. The probability of an event for a patient in the control group was also fixed at 10%.</p>
<pre class="r"><code>ICC <- seq(0.025, .10, 0.025)
SS <- seq(200, 1600, 200)
pctDELTA <- c(.2, .3, .4, .5, .6)
nCLUST <- 40
ctlPROB <- 0.10</code></pre>
<p>The plot shows that at total sample sizes less 800, we would only be able detect effect sizes of 60% when the control proportion is 10%.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-10-1.png" width="1056" /></p>
<p>When the control proportion is at 40%, there is clearly a much higher probability that the study will detect an effect even at the smaller sample sizes. Under these scenarios, the <em>ICC</em> has a much greater impact on power than when the control proportion is much lower.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-11-1.png" width="1056" /></p>
</div>
<div id="other-properties-of-the-design" class="section level3">
<h3>Other properties of the design</h3>
<p>Of course, power is only one concern of many. For example, we may need to understand how a study’s design relates to bias and variance. In this case, I wondered how well the standard error estimates would compare to observed standard errors, particularly when the cluster sizes were on the lower end. Here are two plots comparing the two.</p>
<p>The lines represent the observed standard errors (the standard deviation of the measure of effect, or the parameter representing the log odds-ratio) at each sample size (assuming 40 clusters and an effect size of 30% reduction.) The points are the average estimate of the standard error with error bars that reflect <span class="math inline">\(\pm\)</span> 1 standard deviation.</p>
<p>In the first set scenarios, the control probability is set at 10%. For all <em>ICCs</em> except perhaps 0.10, it appears that the standard error estimates are, on average, too large, though there is quite a bit of variability. The over-estimation declines as between cluster variance increases.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-12-1.png" width="1056" /></p>
<p>In the second set of scenarios, where the control probability is at 40%, there is less variation in the standard error estimates (as reflected in the shorter length error bars), and there appears to be a slight underestimate of variation, particularly with the larger <em>ICCs</em>.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-13-1.png" width="1056" /></p>
<p>I have only scratched the surface here in terms of the scenarios that can be investigated. In addition there are other measurements to consider. Clearly, it would be useful to know if these observed biases in the standard error estimates disappear with larger cluster sizes, or how the number of clusters relates to this bias. And, I didn’t even look at the whether effect size estimates are biased in different scenarios. The point is, while power is important, we must also understand the quality of our estimates.</p>
</div>
Yes, unbalanced randomization can improve power, in some situations
https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/
Tue, 14 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/
<p>Last time I provided some simulations that <a href="https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/">suggested</a> that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test.</p>
<div id="generating-two-groups-with-different-means-and-variance" class="section level3">
<h3>Generating two groups with different means and variance</h3>
<p>Using <code>simstudy</code> (the latest version 1.16 is now available on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a>), it is possible to generate different levels of variance by specifying a formula in the data definition. In this example, the treatment group variance is five times the control group variance:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
def <- defDataAdd(varname = "y", formula = "1.1*rx",
variance = "1*(rx==0) + 5*(rx==1)", dist = "normal")</code></pre>
<p>I have written a simple function to generate the data that can be used later in the power experiments:</p>
<pre class="r"><code>genDataSet <- function(n, ratio, def) {
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
dx <- addColumns(def, dx)
return(dx[])
}</code></pre>
<p>And now we can generate and look at some data.</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG")
set.seed(383)
dx1 <- genDataSet(72, c(1, 2), def)
davg <- dx1[, .(avg = mean(y)), keyby = rx]
library(paletteer)
library(ggplot2)
ggplot(data = dx1, aes(x = factor(rx), y = y) ) +
geom_jitter(width = .15, height = 0, aes(color = factor(rx))) +
geom_hline(data= davg, lty = 3, size = .75,
aes(yintercept = avg, color = factor(rx))) +
theme(panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_line(color = "grey98"),
legend.position = "none",
axis.title.x = element_blank()) +
scale_x_discrete(labels = c("control", "treatment")) +
scale_color_paletteer_d("jcolors::pal5")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-04-14-unbalanced-randomization-can-improve-power-in-some-situations.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="power-analyses" class="section level3">
<h3>Power analyses</h3>
<p>The following function generates a data set, records the difference in means for the two groups, and estimates the p-value of a <em>t-test</em> that assumes different variances for the two groups.</p>
<pre class="r"><code>genPvalue <- function(n, ratio, def) {
dx <- genDataSet(n, ratio, def)
mean.dif <- dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
p.value <- t.test(y~rx, data = dx)$p.value
data.table(r = paste0(ratio[1], ":", ratio[2]), mean.dif, p.value)
}</code></pre>
<p>In this comparison, we are considering three different designs or randomization schemes. In the first, randomization will be 1 to 1, so that half of the sample of 72 (n = 36) is assigned to the treatment arm. In the second, randomization will be 1 to 2, so that 2/3 of the sample (n=48) is assigned to treatment. And in the last, randomization will be 1 to 3, where 3/4 of the patients (n = 54) will be randomized to treatment. For each scenario, we will estimate the mean difference between the groups, the standard deviation of differences, and the power. All of these estimates will be based on 5000 data sets each, and we are still assuming treatment variance is five times the control variance.</p>
<pre class="r"><code>library(parallel)
ratios <- list(c(1, 1), c(1, 2), c(1, 3))
results <- mclapply(ratios, function(r)
rbindlist(mclapply(1:5000, function(x) genPvalue(72, r, def )))
)
results <- rbindlist(results)</code></pre>
<p>All three schemes provide an unbiased estimate of the effect size, though the unbalanced designs have slightly less variability:</p>
<pre class="r"><code>results[, .(avg.difference = mean(mean.dif),
sd.difference = sd(mean.dif)), keyby = r]</code></pre>
<pre><code>## r avg.difference sd.difference
## 1: 1:1 1.1 0.41
## 2: 1:2 1.1 0.38
## 3: 1:3 1.1 0.39</code></pre>
<p>The reduced variability translates into improved power for the unbalanced designs:</p>
<pre class="r"><code>results[, .(power = mean(p.value < 0.05)), keyby = r]</code></pre>
<pre><code>## r power
## 1: 1:1 0.75
## 2: 1:2 0.81
## 3: 1:3 0.81</code></pre>
</div>
<div id="benefits-of-imbalance-under-different-variance-assumptions" class="section level3">
<h3>Benefits of imbalance under different variance assumptions</h3>
<p>It seems reasonable to guess that if the discrepancy in variance between the two groups is reduced, there will be less advantage to over-allocating patients in the treatment arm. In fact, it may even be a disadvantage, as in the case of a binary outcome. Likewise, as the discrepancy increases, increased enthusiasm for unbalanced designs may be warranted.</p>
<p>Here is a plot (code available upon request) showing how the variation in the mean differences (shown as a standard deviation) relate to the design scheme and the underlying difference in the variance of the control and treatment groups. In all cases, the assumed variance for the control group was 1. The variance for the treatment group ranged from 1 to 9 in different sets of simulations. At each level of variance, four randomization schemes were evaluated: 1 to 1, 1 to 2, 1 to 3, and 1 to 4.</p>
<p>When variances are equal, there is no apparent benefit to using any other than a 1:1 randomization scheme. Even when the variance of the treatment group increases to 3, there is little benefit to a 1:2 arrangement. At higher levels of variance - in this case 5 - there appears to be more of a benefit to randomizing more people to treatment. However, at all the levels shown here, it does not look like anything above 1:2 is warranted.</p>
<p>So, before heading down the path of unbalanced randomization, make sure to take a look at your variance assumptions.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-14-unbalanced-randomization-can-improve-power-in-some-situations.en_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
</div>
Can unbalanced randomization improve power?
https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/
Tue, 31 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/
<p>Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.</p>
<p>I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes.</p>
<p>Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to try to get possible treatments to as many people as possible, even in a randomized control trial. One question that came up in the design of this study was whether there would be efficiency gains by using a 1:2 randomization scheme? That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only <em>lose</em> efficiency if we use anything other than a 1:1 randomization.</p>
<div id="brief-public-service-announcement-simstudy-update" class="section level3">
<h3>Brief public service announcement: simstudy update</h3>
<p>When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the <code>simstudy</code> package, which supports much of the simulations on this blog, could not readily handle anything other than 1:1 randomization. I had to quickly rectify that shortcoming. There is a new argument <code>ratio</code> in the <code>trtAssign</code> function where you can now specify any scheme for any number of treatment arms. This is available in version 1.16, which for the moment can be found only on github (kgoldfeld/simstudy).</p>
<p>Here is an example based on a 1:2:3 allocation. I’m not sure if that would ever be appropriate, but it shows the flexibility of the new argument. One counter-intuitive aspect of this implementation is that the <code>balance</code> argument is set to <code>TRUE</code>, indicating that the allocation to the groups will be perfect, or as close as possible to the specified ratios. If <code>balance</code> is <code>FALSE</code>, the ratios are used as relative probabilities instead.</p>
<pre class="r"><code>library(simstudy)
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(16)
dx <- genData(600)
dx <- trtAssign(dx, nTrt = 3, balanced = TRUE,
ratio = c(1,2,3), grpName = "rx")
dx[, table(rx)]</code></pre>
<pre><code>## rx
## 1 2 3
## 100 200 300</code></pre>
</div>
<div id="unbalanced-designs-with-a-binary-outcome" class="section level3">
<h3>Unbalanced designs with a binary outcome</h3>
<p>The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events has to occur within 30 days to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed in the usual care arm 40% of the patients would have a bad outcome and that the drug treatment would reduce the bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome).</p>
<p>If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters.</p>
<pre class="r"><code>estCoef <- function(n, formula, ratio) {
def <- defDataAdd(varname = "y", formula = formula, dist = "binary")
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
dx <- addColumns(def, dx)
coef(summary(glm(y~rx, family = binomial, data = dx)))
}
estCoef(n = 244*2, formula = "0.4 - 0.3 * 0.4 * rx", ratio = c(1, 1))</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4328641 0.1310474 -3.303111 0.0009561867
## rx -0.4577476 0.1924533 -2.378487 0.0173838304</code></pre>
<p>The probabilities of a bad outcome for the usual care group and drug treatment group are</p>
<pre class="r"><code>c(usual = 1/(1 + exp(0.433)), drug = 1/(1+exp(0.433 + 0.458))) </code></pre>
<pre><code>## usual drug
## 0.3934102 0.2909035</code></pre>
</div>
<div id="assessing-power" class="section level3">
<h3>Assessing power</h3>
<p>In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05.</p>
<p>Here is the analytic solution for a 1:1 ratio.</p>
<pre class="r"><code>power.prop.test(p1 = .4, p2 = .7*.4, power = .80)</code></pre>
<pre><code>##
## Two-sample comparison of proportions power calculation
##
## n = 243.4411
## p1 = 0.4
## p2 = 0.28
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<p>The sample size estimate based on 80% suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated <span class="math inline">\(n\)</span> in a simulation for power (using 1000 datasets), we should be close to 80%:</p>
<pre class="r"><code>est.power <- function(n, ratio, p1, reduction) {
formula = paste(p1, "* (1 -", reduction, "* rx)")
p.val <- estCoef(n, formula, ratio)["rx", 4]
return(p.val)
}
pvals <- unlist(mclapply(1:1000,
function(x) est.power(488, c(1, 1), 0.4, 0.3)))
mean(pvals < 0.05)</code></pre>
<pre><code>## [1] 0.814</code></pre>
</div>
<div id="the-power-experiment" class="section level3">
<h3>The power experiment</h3>
<p>Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Here is the code that will allow us to explore these questions:</p>
<pre class="r"><code>res <- list()
for (p1 in c(.2, .3, .4, .5)) {
for (r in c(.2, .25, .3)) {
p2 <- (1- r) * p1
n <- ceiling(power.prop.test(p1 = p1, p2 = p2, power = .80)$n)*2
for (i in c(1:5)) {
pvals <- mclapply(1:1000, function(x) est.power(n, c(1, i), p1, r))
pvals <- unlist(pvals)
dres <- data.table(n, control_p = p1, pct_reduction = r,
control = 1, rx = i, power = mean( pvals < .05))
res <- append(res, list(dres))
}
}
}
res <- rbindlist(res)</code></pre>
<p>Repeating the power simulation for a variety of assumptions indicates that, at least in the case of a binary outcome, using an unbalanced design does not improve the quality of the research even though it might get more patients the drug treatment:</p>
<pre class="r"><code>ggplot(data = res, aes(x = rx, y = power)) +
geom_line(color = "blue") +
facet_grid(control_p ~ pct_reduction, labeller = label_both) +
theme(panel.grid = element_blank()) +
scale_x_continuous(name = "ratio of treatment to control",
breaks = c(1:5), labels = paste0(c(1:5),":1")) +
scale_y_continuous(limits = c(.5,.9), breaks = c(.6, .7, .8))</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-03-31-can-unbalanced-randomization-improve-power.en_files/figure-html/unnamed-chunk-7-1.png" width="624" /></p>
</div>
<div id="continuous-outcomes" class="section level3">
<h3>Continuous outcomes</h3>
<p>In the case of binary outcomes, reducing sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I’ll show some simulations that indicate this might be the case.</p>
<p>Stay well.</p>
</div>
When you want more than a chi-squared test, consider a measure of association
https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/
Tue, 17 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In my last <a href="https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/">post</a>, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the <span class="math inline">\(\chi^2\)</span> statistic.</p>
<p>I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal. This can arise, for example, when we want to compare Likert-type responses across multiple groups. The measure of effect I focused on - the cumulative proportional odds - is quite useful, but is potentially limited for two reasons. First, the proportional odds assumption may not be reasonable, potentially leading to biased estimates. Second, both factors may be nominal (i.e. not ordinal), it which case cumulative odds model is inappropriate.</p>
<p>An alternative, non-parametric measure of association that can be broadly applied to any contingency table is <em>Cramér’s V</em>, which is calculated as</p>
<p><span class="math display">\[
V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}
\]</span>
where <span class="math inline">\(\chi^2\)</span> is from the Pearson’s chi-squared test, <span class="math inline">\(N\)</span> is the total number of responses across all groups, <span class="math inline">\(r\)</span> is the number of rows in the contingency table, and <span class="math inline">\(c\)</span> is the number of columns. <span class="math inline">\(V\)</span> ranges from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>, with <span class="math inline">\(0\)</span> indicating no association, and <span class="math inline">\(1\)</span> indicating the strongest possible association. (In the <a href="#addendum">addendum</a>, I provide a little detail as to why <span class="math inline">\(V\)</span> cannot exceed <span class="math inline">\(1\)</span>.)</p>
<div id="simulating-independence" class="section level3">
<h3>Simulating independence</h3>
<p>In this first example, the distribution of ratings is independent of the group membership. In the data generating process, the probability distribution for <code>rating</code> has no reference to <code>grp</code>, so we would expect similar distributions of the response across the groups:</p>
<pre class="r"><code>library(simstudy)
def <- defData(varname = "grp",
formula = "0.3; 0.5; 0.2", dist = "categorical")
def <- defData(def, varname = "rating",
formula = "0.2;0.3;0.4;0.1", dist = "categorical")
set.seed(99)
dind <- genData(500, def)</code></pre>
<p>And in fact, the distributions across the 4 rating options do appear pretty similar for each of the 3 groups:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-2-1.png" width="576" /></p>
<p>In order to estimate <span class="math inline">\(V\)</span> from this sample, we use the <span class="math inline">\(\chi^2\)</span> formula (I explored the chi-squared test with simulations in a two-part post <a href="https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/">here</a> and <a href="https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/">here</a>):</p>
<p><span class="math display">\[
\sum_{i,j} {\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}
\]</span></p>
<pre class="r"><code>observed <- dind[, table(grp, rating)]
obs.dim <- dim(observed)
getmargins <- addmargins(observed, margin = seq_along(obs.dim),
FUN = sum, quiet = TRUE)
rowsums <- getmargins[1:obs.dim[1], "sum"]
colsums <- getmargins["sum", 1:obs.dim[2]]
expected <- rowsums %*% t(colsums) / sum(observed)
X2 <- sum( ( (observed - expected)^2) / expected)
X2</code></pre>
<pre><code>## [1] 3.45</code></pre>
<p>And to check our calculation, here’s a comparison with the estimate from the <code>chisq.test</code> function:</p>
<pre class="r"><code>chisq.test(observed)</code></pre>
<pre><code>##
## Pearson's Chi-squared test
##
## data: observed
## X-squared = 3.5, df = 6, p-value = 0.8</code></pre>
<p>With <span class="math inline">\(\chi^2\)</span> in hand, we can estimate <span class="math inline">\(V\)</span>, which we expect to be quite low:</p>
<pre class="r"><code>sqrt( (X2/sum(observed)) / (min(obs.dim) - 1) )</code></pre>
<pre><code>## [1] 0.05874</code></pre>
<p>Again, to verify the calculation, here is an alternative estimate using the <code>DescTools</code> package, with a 95% confidence interval:</p>
<pre class="r"><code>library(DescTools)
CramerV(observed, conf.level = 0.95)</code></pre>
<pre><code>## Cramer V lwr.ci upr.ci
## 0.05874 0.00000 0.08426</code></pre>
<p> </p>
</div>
<div id="group-membership-matters" class="section level3">
<h3>Group membership matters</h3>
<p>In this second scenario, the distribution of <code>rating</code> is specified directly as a function of group membership. This is an extreme example, designed to elicit a very high value of <span class="math inline">\(V\)</span>:</p>
<pre class="r"><code>def <- defData(varname = "grp",
formula = "0.3; 0.5; 0.2", dist = "categorical")
defc <- defCondition(condition = "grp == 1",
formula = "0.75; 0.15; 0.05; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 2",
formula = "0.05; 0.75; 0.15; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 3",
formula = "0.05; 0.05; 0.15; 0.75", dist = "categorical")
# generate the data
dgrp <- genData(500, def)
dgrp <- addCondition(defc, dgrp, "rating")</code></pre>
<p>It is readily apparent that the structure of the data is highly dependent on the group:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-8-1.png" width="576" /></p>
<p>And, as expected, the estimated <span class="math inline">\(V\)</span> is quite high:</p>
<pre class="r"><code>observed <- dgrp[, table(grp, rating)]
CramerV(observed, conf.level = 0.95)</code></pre>
<pre><code>## Cramer V lwr.ci upr.ci
## 0.7400 0.6744 0.7987</code></pre>
<p> </p>
</div>
<div id="interpretation-of-cramérs-v-using-proportional-odds" class="section level3">
<h3>Interpretation of Cramér’s V using proportional odds</h3>
<p>A key question is how we should interpret V? Some folks suggest that <span class="math inline">\(V \le 0.10\)</span> is very weak and anything over <span class="math inline">\(0.25\)</span> could be considered quite strong. I decided to explore this a bit by seeing how various cumulative odds ratios relate to estimated values of <span class="math inline">\(V\)</span>.</p>
<p>To give a sense of what some log odds ratios (LORs) look like, I have plotted distributions generated from cumulative proportional odds models, using LORs ranging from 0 to 2. At 0.5, there is slight separation between the groups, and by the time we reach 1.0, the differences are considerably more apparent:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-10-1.png" width="288" /></p>
<p>My goal was to see how estimated values of <span class="math inline">\(V\)</span> change with the underlying LORs. I generated 100 data sets for each LOR ranging from 0 to 3 (increasing by increments of 0.05) and estimated <span class="math inline">\(V\)</span> for each data set (of which there were 6100). The plot below shows the mean <span class="math inline">\(V\)</span> estimate (in yellow) at each LOR, with the individual estimates represented by the grey points. I’ll let you draw you own conclusions, but (in this scenario at least), it does appear that 0.25 (the dotted horizontal line) signifies a pretty strong relationship, as LORs larger than 1.0 generally have estimates of <span class="math inline">\(V\)</span> that exceed this threshold.</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-11-1.png" width="480" /></p>
<p> </p>
</div>
<div id="p-values-and-cramérs-v" class="section level3">
<h3>p-values and Cramér’s V</h3>
<p>To end, I am just going to circle back to where I started at the beginning of the previous <a href="https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/">post</a>, thinking about p-values and effect sizes. Here, I’ve generated data sets with a relatively small between-group difference, using a modest LOR of 0.40 that translates to a measure of association <span class="math inline">\(V\)</span> just over 0.10. I varied the sample size from 200 to 1000. For each data set, I estimated <span class="math inline">\(V\)</span> and recorded whether or not the p-value from a chi-square test would have been deemed “significant” (i.e. p-value <span class="math inline">\(< 0.05\)</span>) or not. The key point here is that as the sample size increases and we rely solely on the chi-squared test, we are increasingly likely to attach importance to the findings even though the measure of association is quite small. However, if we actually consider a measure of association like Cramér’s <span class="math inline">\(V\)</span> (or some other measure that you might prefer) in drawing our conclusions, we are less likely to get over-excited about a result when perhaps we shouldn’t.</p>
<p>I should also comment that at smaller sample sizes, we will probably over-estimate the measure of association. Here, it would be important to consider some measure of uncertainty, like a 95% confidence interval, to accompany the point estimate. Otherwise, as in the case of larger sample sizes, we would run the risk of declaring success or finding a difference when it may not be warranted.</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-12-1.png" width="384" /></p>
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum-why-is-cramérs-v-le-1" class="section level3">
<h3>Addendum: Why is <em>Cramér’s V</em> <span class="math inline">\(\le\)</span> 1?</h3>
<p>Cramér’s <span class="math inline">\(V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}\)</span>, which cannot be lower than 0. <span class="math inline">\(V=0\)</span> when <span class="math inline">\(\chi^2 = 0\)</span>, which will only happen when the observed cell counts for all cells equal the expected cell counts for all cells. In other words, <span class="math inline">\(V=0\)</span> only when there is complete independence.</p>
<p>It is also the case that <span class="math inline">\(V\)</span> cannot exceed <span class="math inline">\(1\)</span>. I will provide some intuition for this using a relatively simple example and some algebra. Consider the following contingency table which represents complete separation of the three groups:</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-dep.png" height="225" /></p>
<p>I would argue that this initial <span class="math inline">\(3 \times 4\)</span> table is equivalent to the following <span class="math inline">\(3 \times 3\)</span> table that collapses responses <span class="math inline">\(1\)</span> and <span class="math inline">\(2\)</span> - no information about the dependence has been lost or distorted. In this case <span class="math inline">\(n_A = n_{A1} + n_{A2}\)</span>.</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-collapsed.png" height="200" /></p>
<p>In order to calculate <span class="math inline">\(\chi^2\)</span>, we need to derive the expected values based on this collapsed contingency table. If <span class="math inline">\(p_{ij}\)</span> is the probability for cell row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span>, and <span class="math inline">\(p_i.\)</span> and <span class="math inline">\(p._j\)</span> are the row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span> totals, respectively then independence implies that <span class="math inline">\(p_{ij} = p_i.p._j\)</span>. In this example, under independence, the expected cell count for cell <span class="math inline">\(i,j\)</span> is <span class="math inline">\(\frac{n_i}{N} \frac{n_j}{N} N = \frac{n_in_j}{N}\)</span>:</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-collapsed-ind.png" height="200" /></p>
<p>If we consider the contribution of group <span class="math inline">\(A\)</span> to <span class="math inline">\(\chi^2\)</span>, we start with the <span class="math inline">\(\sum_{group \ A} (O_j - E_j)^2/E_j\)</span> and end up with <span class="math inline">\(N - n_A\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
\chi^2_{\text{rowA}} &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{\left ( \frac{n_An_B}{N} \right )^2}{\frac{n_An_B}{N}} + \frac{\left ( \frac{n_An_C}{N} \right )^2}{\frac{n_An_C}{N}} \\ \\
&= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&=N \left ( \frac{n_A^2 - \frac{2n_A^3}{N} +\frac{n_A^4}{N^2}} {n_A^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&=N \left ( 1 - \frac{2n_A}{N} +\frac{n_A^2}{N^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&= N - 2n_A +\frac{n_A^2}{N} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&= N - 2n_A + \frac{n_A}{N} \left ( {n_A} + n_B + n_C \right ) \\ \\
&= N - 2n_A + \frac{n_A}{N} N \\ \\
&= N - n_A
\end{aligned}
\]</span></p>
<p>If we repeat this on rows 2 and 3 of the table, we will find that <span class="math inline">\(\chi^2_{\text{rowB}} = N - n_B\)</span>, and <span class="math inline">\(\chi^2_{\text{rowC}} = N - n_C\)</span>, so</p>
<p><span class="math display">\[
\begin{aligned}
\chi^2 &= \chi^2_\text{rowA} +\chi^2_\text{rowB}+\chi^2_\text{rowC} \\ \\
&=(N - n_A) + (N - n_B) + (N - n_C) \\ \\
&= 3N - (n_A + n_B + n_C) \\ \\
&= 3N - N \\ \\
\chi^2 &= 2N
\end{aligned}
\]</span></p>
<p>And</p>
<p><span class="math display">\[
\frac{\chi^2}{2 N} = 1
\]</span></p>
<p>So, under this scenario of extreme separation between groups,</p>
<p><span class="math display">\[
V = \sqrt{\frac{\chi^2}{\text{min}(r-1, c-1) \times N}} = 1
\]</span></p>
<p>where <span class="math inline">\(\text{min}(r - 1, c - 1) = \text{min}(2, 3) = 2\)</span>.</p>
</div>
Alternatives to reporting a p-value: the case of a contingency table
https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/
Tue, 03 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/
<p>I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the <a href="https://www.amstat.org/"><em>American Statistical Association</em></a> has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. In <a href="https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108">2016</a>, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.” These principles are worth noting here in case you don’t make it over to the original paper:</p>
<ul>
<li>p-values can indicate how incompatible the data are with a specified statistical model.</li>
<li>p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.<br />
</li>
<li>scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.</li>
<li>proper inference requires full reporting and transparency</li>
<li>a p-value, or statistical significance, does not measure the size of an effect or the importance of a result.</li>
<li>by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.</li>
</ul>
<p>More recently, the ASA <a href="https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913">elaborated</a> on this, responding to those who thought the initial paper was too negative, a list of many things <em>not</em> to do. In this new paper, the ASA argues that “knowing what not to do with p-values is indeed necessary, but it does not suffice.” We also need to know what we <em>should</em> do. One of those things should be focusing on effect sizes (and some measure of uncertainty, such as a confidence or credible interval) in order to evaluate an intervention or exposure.</p>
<div id="applying-principled-thinking-to-a-small-problem" class="section level3">
<h3>Applying principled thinking to a small problem</h3>
<p>Recently, I was discussing the presentation of results for a pilot study. I was arguing that we should convey the findings in a way that highlighted the general trends without leading readers to make overly strong conclusions, which p-values might do. So, I was arguing that, rather than presenting p-values, we should display effect sizes and confidence intervals, and avoid drawing on the concept of “statistical significance.”</p>
<p>Generally, this is not a problem; we can estimate an effect size like a difference in means, a difference in proportions, a ratio of proportions, a ratio of odds, or even the log of a ratio of odds. In this case, the outcome was a Likert-type survey where the response was “none”, “a little”, and “a lot”, and there were three comparison groups, so we had a <span class="math inline">\(3\times3\)</span> contingency table with one ordinal (i.e. ordered) factor. In this case, it is not so clear what the effect size measurement should be.</p>
<p>One option is to calculate a <span class="math inline">\(\chi^2\)</span> statistic, report the associated p-value, and call it a day. However, since the <span class="math inline">\(\chi^2\)</span> is not a measure of effect and the p-value is not necessarily a good measure of evidence, I considered estimating a cumulative odds model that would provide a measure of the association between group and response. However, I was a little concerned, because the typical version of this model makes an assumption of proportional odds, which I wasn’t sure would be appropriate here. (I’ve written about these models before, <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a> and <a href="https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/">here</a>, if you want to take a look.) It is possible to fit a cumulative odds model without the proportionality assumption, but then the estimates are harder to interpret since the effect size varies by group and response.</p>
<p>Fortunately, there is a more general measure of association for contingency tables with at least one, but possibly two, nominal factors: <em>Cramer’s V</em>. This measure which makes no assumptions about proportionality.</p>
<p>My plan is to simulate contingency table data, and in this post, I will explore the cumulative odds models. Next time, I’ll describe the <em>Cramer’s V</em> measure of association.</p>
</div>
<div id="non-proportional-cumulative-odds" class="section level3">
<h3>Non-proportional cumulative odds</h3>
<p>In the cumulative odds model (again, take a look <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a> for a little more description of these models), we assume that all the log-odds ratios are proportional. This may actually not be an unreasonable assumption, but I wanted to start with a data set that is generated without explicitly assuming proportionality. In the following data definition, the distribution of survey responses (<em>none</em>, <em>a little</em>, and <em>a lot</em>) across the three groups (<em>1</em>, <em>2</em>, and <em>3</em>) are specified uniquely for each group:</p>
<pre class="r"><code>library(simstudy)
# define the data
def <- defData(varname = "grp",
formula = "0.3; 0.5; 0.2", dist = "categorical")
defc <- defCondition(condition = "fgrp == 1",
formula = "0.70; 0.20; 0.10", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2",
formula = "0.10; 0.60; 0.30", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3",
formula = "0.05; 0.25; 0.70", dist = "categorical")
# generate the data
set.seed(99)
dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- addCondition(defc, dx, "rating")
dx <- genFactor(dx, "rating", replace = TRUE,
labels = c("none", "a little", "a lot"))
dx[]</code></pre>
<pre><code>## id fgrp frating
## 1: 1 2 a little
## 2: 2 3 a little
## 3: 3 3 a lot
## 4: 4 2 a little
## 5: 5 2 a little
## ---
## 176: 176 2 a lot
## 177: 177 1 none
## 178: 178 3 a lot
## 179: 179 2 a little
## 180: 180 2 a little</code></pre>
<p>A distribution plot based on these 180 observations indicates that the odds are not likely proportional; the “tell” is the large bulge for those in group <em>2</em> who respond <em>a little</em>.</p>
<pre class="r"><code>library(likert)
items <- dx[, .(frating)]
names(items) <- c(frating = "rating")
likert.data <- likert(items = items, grouping = dx$fgrp)
plot(likert.data, wrap = 100, low.color = "#DAECED",
high.color = "#CECD7B")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/pdata-1.png" width="672" /></p>
<p>The <span class="math inline">\(\chi^2\)</span> test, not so surprisingly, indicates that it would be reasonable to conclude there are differences in responses across the three groups:</p>
<pre class="r"><code>chisq.test(table(dx[, .(fgrp, frating)]))</code></pre>
<pre><code>##
## Pearson's Chi-squared test
##
## data: table(dx[, .(fgrp, frating)])
## X-squared = 84, df = 4, p-value <2e-16</code></pre>
<p>But, since we are trying to provide a richer picture of the association that will be less susceptible to small sample sizes, here is the cumulative (proportional) odds model fit using the <code>clm</code> function in the <code>ordinal</code> package.</p>
<pre class="r"><code>library(ordinal)
clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)</code></pre>
<pre><code>## formula: frating ~ fgrp
## data: dx
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 180 -162.95 333.91 5(0) 4.61e-08 2.8e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fgrp2 2.456 0.410 5.98 2.2e-09 ***
## fgrp3 3.024 0.483 6.26 3.9e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## none|a little 0.335 0.305 1.10
## a little|a lot 2.945 0.395 7.45</code></pre>
<p>A plot of the observed proportions (show by the line) with the modeled proportions (shown as points) indicates that the model that makes the proportional assumption might not be doing a great job:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-2-1.png" width="288" /></p>
<p>If we fit a model that does not make the proportionality assumption and compare using either AIC statistic (lower is better) or a likelihood ratio test (small p-value indicates that the saturated/non-proportional model is better), it is clear that the non-proportional odds model for this dataset is a better fit.</p>
<pre class="r"><code>clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
summary(clmFit.sat)</code></pre>
<pre><code>## formula: frating ~ 1
## nominal: ~fgrp
## data: dx
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 180 -149.54 311.08 7(0) 8.84e-11 4.7e+01
##
## Threshold coefficients:
## Estimate Std. Error z value
## none|a little.(Intercept) 0.544 0.296 1.83
## a little|a lot.(Intercept) 1.634 0.387 4.23
## none|a little.fgrp2 -4.293 0.774 -5.54
## a little|a lot.fgrp2 -0.889 0.450 -1.98
## none|a little.fgrp3 -2.598 0.560 -4.64
## a little|a lot.fgrp3 -1.816 0.491 -3.70</code></pre>
<pre class="r"><code>anova(clmFit.prop, clmFit.sat)</code></pre>
<pre><code>## Likelihood ratio tests of cumulative link models:
##
## formula: nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1 logit flexible
## clmFit.sat frating ~ 1 ~fgrp logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## clmFit.prop 4 334 -163
## clmFit.sat 6 311 -150 26.8 2 1.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<p>It is possible that the poor fit is just a rare occurrence. Below is a plot that shows the average result (<span class="math inline">\(\pm 1 \ sd\)</span>) for 1000 model fits for 1000 data sets using the same data generation process. It appears those initial results were not an aberration - the proportional odds model fits a biased estimate, particularly for groups <em>1</em> and <em>2</em>. (The code to do this simulation is shown in the addendum.)</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-4-1.png" width="288" /></p>
</div>
<div id="proportional-assumption-fulfilled" class="section level3">
<h3>Proportional assumption fulfilled</h3>
<p>Here the data generation process is modified so that the proportionality assumption is incorporated.</p>
<pre class="r"><code>def <- defData(varname = "grp", formula = ".3;.5;.2",
dist = "categorical")
def <- defData(def, varname = "z", formula = "1*I(grp==2) + 2*I(grp==3)",
dist = "nonrandom")
baseprobs <- c(0.7, 0.2, 0.1)
dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- genOrdCat(dx, adjVar = "z", baseprobs, catVar = "rating")
dx <- genFactor(dx, "rating", replace = TRUE,
labels = c("none", "a little", "a lot")
)</code></pre>
<p>This is what proportional odds looks like - there are no obvious bulges, just a general shift rightward as we move from group <em>1</em> to <em>3</em>:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>When we fit the proportional model and compare it to the saturated model, we see no reason to reject the assumption of proportionality (based on either the AIC or LR statistics).</p>
<pre class="r"><code>clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)</code></pre>
<pre><code>## formula: frating ~ fgrp
## data: dx
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 180 -176.89 361.77 4(0) 3.04e-09 2.7e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fgrp2 1.329 0.359 3.70 0.00022 ***
## fgrp3 2.619 0.457 5.73 1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## none|a little 0.766 0.299 2.56
## a little|a lot 2.346 0.342 6.86</code></pre>
<pre class="r"><code>clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
anova(clmFit.prop, clmFit.sat)</code></pre>
<pre><code>## Likelihood ratio tests of cumulative link models:
##
## formula: nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1 logit flexible
## clmFit.sat frating ~ 1 ~fgrp logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## clmFit.prop 4 362 -177
## clmFit.sat 6 365 -177 0.56 2 0.75</code></pre>
<p>And here is a plot summarizing a second set of 1000 iterations, this one using the proportional odds assumption. The estimates appear to be unbiased:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-8-1.png" width="288" /></p>
<p>I suspect that in many instances, Likert-type responses will look more like the second case than the first case, so that the cumulative proportional odds model could very well be useful in characterizing the association between group and response. Even if the assumption is not reasonable, the bias might not be terrible, and the estimate might still be useful as a measure of association. However, we might prefer a measure that is free of any assumptions, such as <em>Cramer’s V</em>. I’ll talk about that next time.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Ronald L. Wasserstein & Nicole A. Lazar (2016) The ASA Statement on p-Values: Context, Process, and Purpose, The American Statistician, 70:2, 129-133.</p>
Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019) Moving to a World Beyond “p < 0.05”, The American Statistician, 73:sup1, 1-19.
</font></small>
</p>
<p> </p>
</div>
<div id="addendum-code-for-replicated-analysis" class="section level2">
<h2>Addendum: code for replicated analysis</h2>
<pre class="r"><code>library(parallel)
RNGkind("L'Ecuyer-CMRG") # to set seed for parallel process
dat.nonprop <- function(iter, n) {
dx <- genData(n, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- addCondition(defc, dx, "rating")
dx <- genFactor(dx, "rating", replace = TRUE,
labels = c("none", "a little", "a lot")
)
clmFit <- clm(frating ~ fgrp, data = dx)
dprob.obs <- data.table(iter,
prop.table(dx[, table(fgrp, frating)], margin = 1))
setkey(dprob.obs, fgrp, frating)
setnames(dprob.obs, "N", "p.obs")
dprob.mod <- data.table(iter, fgrp = levels(dx$fgrp),
predict(clmFit, newdata = data.frame(fgrp = levels(dx$fgrp)))$fit)
dprob.mod <- melt(dprob.mod, id.vars = c("iter", "fgrp"),
variable.name = "frating", value.name = "N")
setkey(dprob.mod, fgrp, frating)
setnames(dprob.mod, "N", "p.fit")
dprob <- dprob.mod[dprob.obs]
dprob[, frating := factor(frating,
levels=c("none", "a little", "a lot"))]
dprob[]
}
def <- defData(varname = "grp", formula = ".3;.5;.2",
dist = "categorical")
defc <- defCondition(condition = "fgrp == 1",
formula = "0.7;0.2;0.1", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2",
formula = "0.1;0.6;0.3", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3",
formula = "0.05;0.25;0.70", dist = "categorical")
res.nonp <- rbindlist(mclapply(1:1000,
function(iter) dat.nonprop(iter,180)))
sum.nonp <- res.nonp[, .(mfit = mean(p.fit), sfit = sd(p.fit),
mobs = mean(p.obs), sobs = sd(p.obs)),
keyby = .(fgrp, frating)]
sum.nonp[, `:=`(lsd = mfit - sfit, usd = mfit + sfit)]
ggplot(data = sum.nonp, aes(x = frating, y = mobs)) +
geom_line(aes(group = fgrp), color = "grey60") +
geom_errorbar(aes(ymin = lsd, ymax = usd, color = fgrp),
width = 0) +
geom_point(aes(y = mfit, color = fgrp)) +
theme(panel.grid = element_blank(),
legend.position = "none",
axis.title.x = element_blank()) +
facet_grid(fgrp ~ .) +
scale_y_continuous(limits = c(0, 0.85), name = "probability") +
scale_color_manual(values = c("#B62A3D", "#EDCB64", "#B5966D"))</code></pre>
</div>
Clustered randomized trials and the design effect
https://www.rdatagen.net/post/what-exactly-is-the-design-effect/
Tue, 18 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-exactly-is-the-design-effect/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I am always saying that simulation can help illuminate interesting statistical concepts or ideas. The <em>design effect</em> that underlies much of clustered analysis is could benefit from a little exploration through simulation. I’ve written about clustered-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect.</p>
<p>When individual outcomes in a group are correlated, we learn less about the group from adding a new individual than we might think. Take an extreme example where every individual in a group is perfectly correlated with all the others: we will learn nothing new about the group by adding someone new. In fact, we might as well just look at a single member, since she is identical to all the others. The design effect is a value that in a sense quantifies how much information we lose (or, surprisingly, possibly gain) by this interdependence.</p>
<p>Let’s just jump right into it.</p>
<div id="the-context" class="section level3">
<h3>The context</h3>
<p>Imagine a scenario where an underlying population of interest is structurally defined by a group of clusters. The classic case is students in schools or classrooms. I don’t really do any school-based education (I learned from debating my teacher-wife that is a dangerous area to tread), but this example seems so clear. (The ideas in this post were, in part, motivated by my involvement with the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>, which focuses at the opposite end of life, seeking to improve care and quality of life for people living with advanced dementia and their caregivers through research and pragmatic clinical trials.) We might be interested in measuring the effect of some intervention (it may or may not take place in school) on an educational attainment outcome of high school-aged kids in a city (I am assuming a continuous outcome here just because it is so much easier to visualize). It does not seem crazy to think that the outcomes of kids from the same school might be correlated, either because the school itself does such a good (or poor) job of teaching or similar types of kids tend to go to the same school.</p>
</div>
<div id="the-unit-of-randomization" class="section level3">
<h3>The unit of randomization</h3>
<p>We have at least three ways to design our study. We could just recruit kids out and about in city and randomize them each individually to intervention or control. In the second approach, we decide that it is easier to randomize the schools to intervention or control - and recruit kids from each of the schools. This means that <em>all</em> kids from one school will be in the same intervention arm. And for the third option, we can go half way: we go to each school and recruit kids, randomizing half of the kids in each school to control, and the other half to the intervention. This last option assumes that we could ensure that the kids in the school exposed to the intervention would not influence their unexposed friends.</p>
<p>In all three cases the underlying assumptions are the same - there is a school effect on the outcome, an individual effect, and an intervention effect. But it turns out that the variability of the intervention effect depends entirely on how we randomize. And since variability of the outcome affects sample size, each approach has implications for sample size. (I’ll point you to a book by <a href="https://books.google.com/books/about/Design_and_Analysis_of_Cluster_Randomiza.html?id=QJZrQgAACAAJ&source=kp_cover">Donner & Klar</a>, which gives a comprehensive and comprehensible overview of cluster randomized trials.)</p>
</div>
<div id="simulation-of-each-design" class="section level3">
<h3>Simulation of each design</h3>
<p>Just to be clear about these different randomization designs, I’ll simulate 1500 students using each. I’ve set a seed in case you’d like to recreate the results shown here (and indicate the libraries I am using).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(clusterPower)
library(parallel)
library(lmerTest)
RNGkind("L'Ecuyer-CMRG") # enables seed for parallel process
set.seed(987)</code></pre>
<div id="randomization-by-student" class="section level4">
<h4>Randomization by student</h4>
<p>I’ve written a function for each of the three designs to generate the data, because later I am going to need to generate multiple iterations of each design. In the first case, randomization is applied to the full group of students:</p>
<pre class="r"><code>independentData <- function(N, d1) {
di <- genData(N)
di <- trtAssign(di, grpName = "rx")
di <- addColumns(d1, di)
di[]
}</code></pre>
<p>The outcome is a function of intervention status and a combined effect of the student’s school and the student herself. We cannot disentangle the variance components, because we do not know the identity of the school:</p>
<pre class="r"><code>defI1 <- defDataAdd(varname = "y", formula = "0.8 * rx",
variance = 10, dist = "normal")
dx <- independentData(N = 30 * 50, defI1)</code></pre>
<p>The observed effect size and variance should be close to the specified parameters of 0.8 and 10, respectively:</p>
<pre class="r"><code>dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.597</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.2</code></pre>
<p>Here is a plot of the individual observations that highlights the group differences and individual variation:</p>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-6-1.png" width="576" /></p>
</div>
<div id="randomization-by-site" class="section level4">
<h4>Randomization by site</h4>
<p>Next, the intervention status is assigned to each of the <span class="math inline">\(k\)</span> schools/clusters before generating <span class="math inline">\(m\)</span> students per cluster. In this case, the outcome (defined by <code>defI2</code>) is a function of the cluster effect, individual effect, and the intervention status. Note here, the variance components are disentangled, but together they sum to 10, suggesting that total variance should be the same as the first scenario:</p>
<pre class="r"><code>clusteredData <- function(k, m, d1, d2) {
dc <- genData(k, d1)
dc <- trtAssign(dc, grpName = "rx")
di <- genCluster(dc, "site", m, level1ID = "id")
di <- addColumns(d2, di)
di[]
}</code></pre>
<pre class="r"><code>defC <- defData(varname = "ceff", formula = 0,
variance = 0.5, id = "site", dist = "normal")
defI2 <- defDataAdd(varname = "y", formula = "ceff + 0.8 * rx",
variance = 9.5, dist = "normal")</code></pre>
<pre class="r"><code>dx <- clusteredData(k = 30, m = 50, defC, defI2)</code></pre>
<p>The effect size and variation across all observations should be be quite similar to the previous design, though now the data has a structure that is determined by the clusters:</p>
<pre class="r"><code>dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.203</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.5</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
<div id="randomization-within-site" class="section level4">
<h4>Randomization within site</h4>
<p>In the last design, the treatment assignment is made <em>after</em> both the clusters and individuals have been generated. Cluster randomization within site is specified using the <code>strata</code> argument:</p>
<pre class="r"><code>withinData <- function(k, m, d1, d2) {
dc <- genData(k, d1)
di <- genCluster(dc, "site", m, "id")
di <- trtAssign(di, strata="site", grpName = "rx")
di <- addColumns(d2, di)
di[]
}</code></pre>
<pre class="r"><code>dx <- withinData(30, 50, defC, defI2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.813</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.1</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
</div>
</div>
<div id="the-design-effect" class="section level3">
<h3>The design effect</h3>
<p>There’s a really nice paper by <a href="https://link.springer.com/article/10.1186/1471-2288-9-39">Vierron & Giraudeau</a> that describes many of the issues I am only touching on here. In particular, they define the design effect and then relate this definition to formulas that are frequently used simplify the estimation of the design effect.</p>
<p>Consider the statistics <span class="math inline">\(\sigma^2_{\Delta_{bc}}\)</span> and <span class="math inline">\(\sigma^2_{\Delta_{i}}\)</span>, which are the variance of the effect sizes under the cluster randomization and the individual randomization designs, respectively:</p>
<p><span class="math display">\[\sigma^2_{\Delta_{bc}} = Var(\bar{Y}_1^{bc} - \bar{Y}_0^{bc})\]</span></p>
<p>and</p>
<p><span class="math display">\[\sigma^2_{\Delta_{i}} =Var(\bar{Y}_1^{i} - \bar{Y}_0^{i})\]</span></p>
<p>These variances are never observed, since they are based on a very large (really, an infinite) number of repeated experiments. However, the theoretical variances can be derived (as they are in the paper), and can be simulated (as they will be here). The design effect <span class="math inline">\(\delta_{bc}\)</span> is defined as</p>
<p><span class="math display">\[\delta_{bc} = \frac{\sigma^2_{\Delta_{bc}}}{\sigma^2_{\Delta_{i}}}\]</span></p>
<p>This ratio represents the required adjustment in sample size required to make the two designs equivalent in the sense that they provide the same amount of information. This will hopefully become clear with the simulations below.</p>
<p>I have decided to use <span class="math inline">\(k = 50\)</span> simulations to ensure enough clusters to estimate the proper variance. I need to know how many individuals per cluster are required for 80% power in the cluster randomized design, given the effect size and variance assumptions I’ve been using here. I’ll use the <code>clusterPower</code> package (which unfortunately defines the number of clusters in each as <span class="math inline">\(m\)</span>, so don’t let that confuse you). Based on this, we should have 18 students per school, for a total sample of 900 students:</p>
<pre class="r"><code>crtpwr.2mean(m = 50/2, d = 0.8, icc = 0.05, varw = 9.5)</code></pre>
<pre><code>## n
## 17.9</code></pre>
<p>Now, I am ready to generate effect sizes for each of 2000 iterations of the experiment assuming randomization by cluster. With this collection of effect sizes in hand, I will be able to estimate their variance:</p>
<pre class="r"><code>genDifFromClust <- function(k, m, d1, d2) {
dx <- clusteredData(k, m, d1, d2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
resC <- unlist(mclapply(1:niters,
function(x) genDifFromClust(k= 50, m=18, defC, defI2)))</code></pre>
<p>Here is an estimate of <span class="math inline">\(\sigma^2_{\Delta_{bc}}\)</span> based on the repeated experiments:</p>
<pre class="r"><code>(s2.D_bc <- var(resC))</code></pre>
<pre><code>## [1] 0.0818</code></pre>
<p>And here is the estimate of <span class="math inline">\(\sigma^2_{\Delta_{i}}\)</span> (the variance of the effect sizes based on individual-level randomization experiments):</p>
<pre class="r"><code>genDifFromInd <- function(N, d1) {
dx <- independentData(N, d1)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
resI <- unlist(mclapply(1:niters,
function(x) genDifFromInd(N = 50*18, defI1)))
(s2.D_i <- var(resI))</code></pre>
<pre><code>## [1] 0.0432</code></pre>
<p>So, now we can use these variance estimates to derive the estimate of the design effect <span class="math inline">\(\delta_{bc}\)</span>, which, based on the earlier definition, is:</p>
<pre class="r"><code>(d_bc <- s2.D_bc / s2.D_i)</code></pre>
<pre><code>## [1] 1.89</code></pre>
<p>The Vierron & Giraudeau paper derives a simple formula for the design effect assuming equal cluster sizes and an ICC <span class="math inline">\(\rho\)</span>. This (or some close variation, when cluster sizes are not equal) is quite commonly used:</p>
<p><span class="math display">\[\delta_{bc} = 1 + (m-1)*\rho\]</span></p>
<p>As the ICC increases, the design effect increases. Based on the parameters for <span class="math inline">\(m\)</span> and <span class="math inline">\(\rho\)</span> we have been using in these simulations (note that <span class="math inline">\(\rho = 0.5/(0.5+9.5) = 0.05\)</span>), the standard formula gives us this estimate of <span class="math inline">\(\delta_{bc.formula}\)</span> that is quite close to our experimental value:</p>
<pre class="r"><code>( d_bc_form <- 1 + (18-1) * (0.05) )</code></pre>
<pre><code>## [1] 1.85</code></pre>
<p> </p>
</div>
<div id="but-what-is-the-design-effect" class="section level3">
<h3>But what is the design effect?</h3>
<p>OK, finally, we can now see what the design effect actually represents. As before, we will generate repeated data sets; this time, we will estimate the treatment effect using an appropriate model. (In the case of the cluster randomization, this is a linear mixed effects model, and in the case of individual randomization, this is linear regression model.) For each iteration, I am saving the p-value for the treatment effect parameter in the model. We expect close to 80% of the p-values to be lower than 0.05 (this is 80% power given a true treatment effect of 0.8).</p>
<p>First, here is the cluster randomized experiment and the estimate of power:</p>
<pre class="r"><code>genEstFromClust <- function(k, m, d1, d2) {
dx <- clusteredData(k, m, d1, d2)
summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}
resCest <- unlist(mclapply(1:niters,
function(x) genEstFromClust(k=50, m = 18, defC, defI2)))
mean(resCest < 0.05) # power</code></pre>
<pre><code>## [1] 0.778</code></pre>
<p>In just over 80% of the cases, we would have rejected the null.</p>
<p>And here is the estimated power under the individual randomization experiment, but with a twist. Since the design effect is 1.85, the cluster randomized experiment needs a relative sample size 1.85 times higher than an equivalent (individual-level) RCT to provide the same information, or to have equivalent power. So, in our simulations, we will use a reduced sample size for the individual RCT. Since we used 900 individuals in the CRT, we need only <span class="math inline">\(900/1.85 = 487\)</span> individuals in the RCT:</p>
<pre class="r"><code>( N.adj <- ceiling( 50 * 18 / d_bc_form ) )</code></pre>
<pre><code>## [1] 487</code></pre>
<pre class="r"><code>genEstFromInd <- function(N, d1) {
dx <- independentData(N, d1)
summary(lm(y ~ rx, data = dx))$coef["rx", 4]
}
resIest <- unlist(mclapply(1:niters,
function(x) genEstFromInd(N = N.adj, defI1)))</code></pre>
<p>The power for this second experiment is also quite close to 80%:</p>
<pre class="r"><code>mean(resIest < 0.05) # power</code></pre>
<pre><code>## [1] 0.794</code></pre>
</div>
<div id="within-cluster-randomization" class="section level3">
<h3>Within cluster randomization</h3>
<p>It is interesting to see what happens when we randomize within the cluster. I think there may be some confusion here, because I have seen folks incorrectly apply the standard formula for <span class="math inline">\(\delta_{bc}\)</span>, rather than this formula for <span class="math inline">\(\delta_{wc}\)</span> that is derived (again, under the assumption of equal cluster sizes) in the Vierron & Giraudeau paper as</p>
<p><span class="math display">\[ \delta_{wc} = 1- \rho\]</span></p>
<p>This implies that the sample size requirement actually declines as intra-cluster correlation increases! In this case, since <span class="math inline">\(\rho = 0.05\)</span>, the total sample size for the within-cluster randomization needs to be only 95% of the sample size for the individual RCT.</p>
<p>As before, let’s see if the simulated data confirms this design effect based on the definition</p>
<p><span class="math display">\[ \delta_{wc} = \frac{\sigma^2_{\Delta_{wc}}}{\sigma^2_{\Delta_{i}}}\]</span></p>
<pre class="r"><code>genDifFromWithin <- function(k, m, d1, d2) {
dx <- withinData(k, m, d1, d2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
resW <- unlist(mclapply(1:niters,
function(x) genDifFromWithin(k = 50, m = 18, defC, defI2)))
(s2.D_wc <- var(resW))</code></pre>
<pre><code>## [1] 0.0409</code></pre>
<p>The estimated design effect is quite close to the expected design effect of 0.95:</p>
<pre class="r"><code>(d_wc <- s2.D_wc / s2.D_i)</code></pre>
<pre><code>## [1] 0.947</code></pre>
<p>And to finish things off, if we estimate an adjusted cluster size based on the design effects (first reducing the cluster size <span class="math inline">\(m=18\)</span> for the cluster randomized trial by <span class="math inline">\(\delta_{bc.formula}\)</span> to derive the appropriate sample size for the RCT, and then adjusting by <span class="math inline">\(\delta_{wc} = 0.95\)</span>) to get the appropriate cluster size for the within cluster randomization, which is about 9 students. This study will only have 450 students, fewer than the RCT:</p>
<pre class="r"><code>(m.adj <- round( (18 / d_bc_form) * 0.95, 0))</code></pre>
<pre><code>## [1] 9</code></pre>
<pre class="r"><code>genEstFromWithin <- function(k, m, d1, d2) {
dx <- withinData(k, m, d1, d2)
summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}
resWest <- unlist(mclapply(1:niters,
function(x) genEstFromWithin(k = 50, m = ceiling(m.adj), defC, defI2)))
mean(resWest < 0.05)</code></pre>
<pre><code>## [1] 0.779</code></pre>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Donner, Allan, and Neil Klar. “Design and analysis of cluster randomization trials in health research.” New York (2010).</p>
<p>Vierron, Emilie, and Bruno Giraudeau. “Design effect in multicenter studies: gain or loss of power?.” BMC medical research methodology 9, no. 1 (2009): 39.</p>
<p>Support:</p>
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes
https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/
Tue, 04 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/
<p>I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model. (If I have missed something obvious with respect to modeling options, please let me know.)</p>
<p>Two initiatives I am involved with, the HAS-QOL study and <a href="https://impactcollaboratory.org/">the IMPACT Collaboratory</a>, are focused on improving quality of care for people living with Alzheimer’s disease and other dementias. Both are examples where the stepped-wedge study design can be an important tool to evaluate interventions in a real-world context. In an earlier post, I <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">introduced</a> a particular variation of the stepped-wedge design which includes an open cohort. I provided simulations of the data generating process we are assuming for the analysis presented here. Elsewhere (<a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">here</a> and <a href="https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/">here</a>), I described Bayesian models that can be used to analyze data with more complicated variance patterns; all of those examples were based on continuous outcomes.</p>
<p>Here, I am extending and combining these ideas. This post walks through the data generation process and describes a Bayesian model that addresses the challenges posed by the open cohort stepped-wedge study design.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>The process I use to simulate the data and then estimate to effects is based on a relatively straightforward logistic regression model with two random effects. To simplify things a bit, I intentionally make the assumption that there are no general time trends that affect that outcomes (though it would not be difficult to add in). In the logistic model, the log-odds (or logit) of a binary outcome is a linear function of predictors and random effects:</p>
<p><span class="math display">\[ \text{logit}(P(y_{ict}=1) = \beta_0 + \beta_1 I_{ct} + b_{ct} + b_i,\]</span>
where <span class="math inline">\(\text{logit}(P(y_{ict}=1))\)</span> is the log-odds for individual <span class="math inline">\(i\)</span> in cluster (or site) <span class="math inline">\(c\)</span> during time period <span class="math inline">\(t\)</span>, and <span class="math inline">\(I_{ct}\)</span> is a treatment indicator for cluster <span class="math inline">\(c\)</span> during period <span class="math inline">\(t\)</span>.</p>
<p>There are two random effects in this model. The first is a cluster-specific period random effect, <span class="math inline">\(b_{ct}\)</span> . For each cluster, there will actually be a vector of cluster effects <span class="math inline">\(\mathbf{b_c} = (b_{c0}, b_{c1},...,b_{c,T-1})\)</span>, where <span class="math inline">\(\mathbf{b_c}\sim MVN(\mathbf{0}, \sigma_{b_c}^2\mathbf{R})\)</span>, and <span class="math inline">\(\mathbf{R}\)</span> is</p>
<p><span class="math display">\[
\mathbf{R} =
\left(
\begin{matrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{matrix}
\right )
\]</span></p>
<p>The second random effect is the individual or patient-level random intercept <span class="math inline">\(b_i\)</span>, where <span class="math inline">\(b_i \sim N(0,\sigma_{b_i}^2)\)</span>. We could assume a more structured relationship for individual patients over time (such as a decaying correlation), but in this application, patients will not have sufficient measurements to properly estimate this.</p>
<p>In the model <span class="math inline">\(\beta_0\)</span> has the interpretation of the log-odds for the outcome when the the cluster is still in the control state and the cluster-period and individual effects are both 0. <span class="math inline">\(\beta_1\)</span> is the average treatment effect conditional on the random effects, and is reported as a log odds ratio.</p>
</div>
<div id="simulating-the-study-data" class="section level3">
<h3>Simulating the study data</h3>
<p>I am going to generate a single data set based on this model. If you want more explanation of the code, this <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">earlier post</a> provides the details. The only real difference here is that I am generating an outcome that is a function of cluster-period effects, individual effects, and treatment status.</p>
<div id="site-level-data" class="section level4">
<h4>Site level data</h4>
<p>There will be 24 sites followed for 12 periods (<span class="math inline">\(t=0\)</span> through <span class="math inline">\(t=11\)</span>), and the stepped-wedge design includes 6 waves of 4 sites in each wave. The first wave will start at <span class="math inline">\(t=4\)</span>, and a new wave will be added each period, so that the last wave starts at <span class="math inline">\(t=9\)</span>.</p>
<pre class="r"><code>library(simstudy)
dsite <- genData(24, id = "site")
dper <- addPeriods(dsite, nPeriods = 12, idvars = "site",
perName = "period")
dsw <- trtStepWedge(dper, "site", nWaves = 6, lenWaves = 1,
startPer = 4, perName = "period",
grpName = "Ict")</code></pre>
<p> </p>
</div>
<div id="correlated-site-level-effects" class="section level4">
<h4>Correlated site-level effects</h4>
<p>The average site-level effect is 0, the standard deviation of site averages is <span class="math inline">\(\sigma_{ct} = 0.3\)</span>, and the correlation coefficient that will determine between-period within site correlation is <span class="math inline">\(\rho = 0.5\)</span>. The correlation structure is “AR-1”, which means the between-period correlation decays over time (see definition of <span class="math inline">\(\mathbf{R}\)</span> above.)</p>
<pre class="r"><code>siteDef <- defData(varname = "eff.mu", formula = 0,
dist = "nonrandom", id = "site")
siteDef <- defData(siteDef, varname = "eff.s2", formula = 0.3^2,
dist = "nonrandom")
dsw <- addColumns(siteDef, dsw)
dsw <- addCorGen(dsw, nvars = 12, idvar = "site", rho = 0.5,
corstr = "ar1", dist = "normal",
param1 = "eff.mu", param2 = "eff.s2",
cnames = "eff.st")
dsw <- dsw[, .(site, period, startTrt, Ict, eff.st)]</code></pre>
<p> </p>
</div>
<div id="patient-level-data" class="section level4">
<h4>Patient level data</h4>
<p>We are generating 20 patients per period for each site, so there will be a total of 5760 individuals (<span class="math inline">\(20\times24\times12\)</span>). The individual level effect standard deviation <span class="math inline">\(\sigma_{b_i} = 0.3\)</span>. Each of the patients will be followed until they die, which is a function of their health status over time, defined by the Markov process and its transition matrix defined below. (This was described in more detail in an <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">earlier post</a>.</p>
<pre class="r"><code>dpat <- genCluster(dper, cLevelVar = "timeID",
numIndsVar = 20, level1ID = "id")
patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
dist = "categorical")
patDef <- defDataAdd(patDef, varname = "eff.p",
formula = 0, variance = 0.3^2)
dpat <- addColumns(patDef, dpat)
P <-t(matrix(c( 0.7, 0.2, 0.1, 0.0,
0.1, 0.3, 0.5, 0.1,
0.0, 0.1, 0.6, 0.3,
0.0, 0.0, 0.0, 1.0),
nrow = 4))
dpat <- addMarkov(dpat, transMat = P,
chainLen = 12, id = "id",
pername = "seq", start0lab = "S0",
trimvalue = 4)
dpat[, period := period + seq - 1]
dpat <- dpat[period < 12]</code></pre>
<p> </p>
</div>
<div id="individual-outcomes" class="section level4">
<h4>Individual outcomes</h4>
<p>In this last step, the binary outcome <span class="math inline">\(y_{ict}\)</span> is generated based on treatment status and random effects. In this case, the treatment lowers the probability of <span class="math inline">\(Y=1\)</span>.</p>
<pre class="r"><code>dx <- merge(dpat, dsw, by = c("site","period"))
setkey(dx, id, period)
outDef <- defDataAdd(varname = "y",
formula = "-0.5 - 0.8*Ict + eff.st + eff.p",
dist = "binary", link = "logit")
dx <- addColumns(outDef, dx)
dx <- dx[, .(site, period, id, Ict, y)]</code></pre>
<p>Here are the site-level averages over time. The light blue indicates periods in which a site is still in the control condition, and the dark blue shows the transition to the intervention condition. The lines, which are grouped by wave starting period, show the proportion of <span class="math inline">\(Y=1\)</span> for each period. You should be able to see the slight drop following entry into treatment.</p>
<p><img src="https://www.rdatagen.net/img/post-opensw/siteplot.png" height="400" /></p>
</div>
</div>
<div id="estimating-the-treatment-effect-and-variance-components" class="section level3">
<h3>Estimating the treatment effect and variance components</h3>
<p>Because none of the maximum likelihood methods implemented in <code>R</code> or <code>SAS</code> could estimate this specific variance structure using a mixed effects logistic regression model, I am fitting a Bayesian model using <a href="http://mc-stan.org">RStan and Stan</a>, which requires a set of model definitions.</p>
<p>This model specification is actually quite similar to the model I estimated <a href="https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/">earlier</a>, except of course the outcome distribution is logistic rather than continuous. Another major change is the use of a <a href="https://mc-stan.org/docs/2_21/stan-users-guide/reparameterization-section.html">“non-centered” parameterization</a>, which actually reduced estimation times from hours to minutes (more precisely, about 12 hours to about 30 minutes). This reparameterization requires a Cholesky decomposition of the variance-covariance matrix <span class="math inline">\(\Sigma\)</span>. One additional limitation is that proper convergence of the MCMC chains seems to require a limited prior on <span class="math inline">\(\rho\)</span>, so that <span class="math inline">\(\rho \sim U(0,1)\)</span> rather than <span class="math inline">\(\rho \sim U(-1,1)\)</span>.</p>
<p>This particular code needs to be saved externally, and I have created a file named <code>binary sw - ar ind effect - non-central.stan</code>. This file is subsequently referenced in the call to <code>RStan</code>.</p>
<pre class="stan"><code>data {
int<lower=1> I; // number of unique individuals
int<lower=1> N; // number of records
int<lower=1> K; // number of predictors
int<lower=1> J; // number of sites
int<lower=0> T; // number of periods
int<lower=1,upper=I> ii[N]; // id for individual
int<lower=1,upper=J> jj[N]; // group for individual
int<lower=1,upper=T> tt[N]; // period of indidvidual
matrix[N, K] x; // matrix of predictors
int<lower=0,upper=1> y[N]; // vector of binary outcomes
}
parameters {
vector[K] beta; // model fixed effects
real<lower=0> sigma_S; // site variance (sd)
real<lower=0,upper=1> rho; // correlation
real<lower=0> sigma_I; // individual level varianc (sd)
// non-centered paramerization
vector[T] z_ran_S[J]; // site level random effects (by period)
vector[I] z_ran_I; // individual level random effects
}
transformed parameters {
cov_matrix[T] Sigma;
matrix[T, T] L; // for non-central parameterization
vector[I] ran_I; // individual level random effects
vector[T] ran_S[J]; // site level random effects (by period)
vector[N] yloghat;
// Random effects with exchangeable correlation
real sigma_S2 = sigma_S^2;
for (j in 1:T)
for (k in 1:T)
Sigma[j,k] = sigma_S2 * pow(rho,abs(j-k));
// for non-centered parameterization
L = cholesky_decompose(Sigma);
for(j in 1:J)
ran_S[j] = L * z_ran_S[j];
ran_I = sigma_I * z_ran_I;
// defining mean on log-odds scale
for (i in 1:N)
yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];
}
model {
sigma_I ~ exponential(0.25);
sigma_S ~ exponential(0.25);
rho ~ uniform(0, 1);
for(j in 1:J) {
z_ran_S[j] ~ std_normal();
}
z_ran_I ~ std_normal();
y ~ bernoulli_logit(yloghat);
}</code></pre>
</div>
<div id="set-up-the-data-and-call-stan-from-r" class="section level3">
<h3>Set up the data and call stan from R</h3>
<p>Just for completeness, I am providing the code that shows the interface between <code>R</code> and <code>Stan</code> using <code>RStan</code>. The data needs to be sent to Stan as a list of data elements, which here is called <code>testdat</code>. For the estimation of the posterior probabilities, I am specifying 4 chains of 4000 iterations each, which includes 1000 warm-up iterations. I specified “adapt_delta = 0.90” to reduce the step-size a bit (default is 0.80); this slows things down a bit, but improves stability.</p>
<p>As I mentioned earlier, with this data set (and rather large number of effects to estimate), the running time is between 30 and 45 minutes. One of the downsides of this particular Bayesian approach is that it wouldn’t really be practical to do any kind of sample size estimate.</p>
<pre class="r"><code>x <- as.matrix(dx[ ,.(1, Ict)])
I <- dx[, length(unique(id))]
N <- nrow(x)
K <- ncol(x)
J <- dx[, length(unique(site))]
T <- dx[, length(unique(period))]
ii <- dx[, id]
jj <- dx[, site]
tt <- dx[, period] + 1
y <- dx[, y]
testdat <- list(I=I, N=N, K=K, J=J, T=T, ii=ii, jj=jj, tt=tt, x=x, y=y)
library(rstan)
options(mc.cores = parallel::detectCores())
rt <- stanc("binary sw - ar ind effect - non-central.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
fit.ar1 <- sampling(sm, data=testdat,
iter = 4000, warmup = 1000,
control=list(adapt_delta=0.90,
max_treedepth = 15),
chains = 4)</code></pre>
</div>
<div id="diagnostics" class="section level3">
<h3>Diagnostics</h3>
<p>After running the MCMC process to generate the probability distributions, the trace plots show that the mixing is quite adequate for the chains.</p>
<pre class="r"><code>plot(fit.ar1, plotfun = "trace", pars = pars,
inc_warmup = FALSE, ncol = 1)</code></pre>
<p><img src="https://www.rdatagen.net/img/post-opensw/diagplot.png" height="600" /></p>
</div>
<div id="extracting-results" class="section level3">
<h3>Extracting results</h3>
<p>If we take a look at the posterior probability distributions, we can see that they contain the original values used to generate the data - so at least in this case, the model seems to model the original data generation process quite well.</p>
<pre class="r"><code>pars <- c("beta", "sigma_S","sigma_I","rho")
summary(fit.ar1, pars = pars, probs = c(0.025, 0.975))$summary</code></pre>
<pre><code>## mean se_mean sd 2.5% 97.5% n_eff Rhat
## beta[1] -0.519 0.000687 0.0459 -0.609 -0.428 4470 1
## beta[2] -0.751 0.000844 0.0573 -0.864 -0.638 4618 1
## sigma_S 0.307 0.000394 0.0256 0.260 0.362 4223 1
## sigma_I 0.254 0.001548 0.0476 0.148 0.337 945 1
## rho 0.544 0.001594 0.0812 0.376 0.698 2599 1</code></pre>
<p><img src="https://www.rdatagen.net/img/post-opensw/postplot.png" height="500" /></p>
<p>One thing that is not working so well is my attempt to compare different models. For example, I might want to fit another model that does not assume between-period correlations decay and compare it to the current model. Previously, I used the <code>bridgesampling</code> package for the comparisons, but it does not seem to be able to accommodate these models. I will continue to explore the options more model comparison and will report back if I find something promising.</p>
<p>
<p><small><font color="darkkhaki"></p>
<p>This study is supported by the National Institutes of Health National Institute on Aging under award numbers R61AG061904 and U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</p>
</font></small>
</p>
</div>
A brief account (via simulation) of the ROC (and its AUC)
https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/
Tue, 21 Jan 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/
<p>The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC), quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.</p>
<div id="a-prediction-problem" class="section level2">
<h2>A prediction problem</h2>
<p>The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease. The population is comprised of two groups of individuals: those with the condition and those without. What we want is some sort of diagnostic tool (such as a blood test or diagnostic scan) that will identify which group a particular patient belongs to. The question is how well does that tool or measure help us distinguish between the two groups? The ROC (and AUC) is designed to help answer that question.</p>
<div id="true-and-false-predictions" class="section level3">
<h3>True and false predictions</h3>
<p>While we might not know group membership for an individual, we assume that they do indeed belong to one of the two groups. When we base a prediction of group membership based on a test, we may or may not be right. There are four scenarios. It is possible that our prediction is (1) a true positive (the patient has the condition and that is what we predict), (2) a false positive (the patient does not have the condition, but we predict they do have it), (3) a false negative (the patient has the condition but we believe they are healthy), or (4) a true negative (the patient is healthy and that is our prediction.) A “good” test is one that maximizes true positive predictions while minimizing false positive predictions.</p>
<p>We can actually only assess the quality of the test if we know the true group membership of the individuals. So, our plan is to take measurements on this known sample, make predictions based on the test, and see how our predictions match up to reality. The ROC is one way to characterize how well our test matches up to reality.</p>
</div>
<div id="binary-decision-informed-by-a-continuous-measure" class="section level3">
<h3>Binary decision informed by a continuous measure</h3>
<p>While we make a binary decision about group membership - either we think they have the condition or they do not - the underlying measure that is used to make that determination may be continuous, like a score. For example, a hypothetical test might return a score between -10 and 10. We can pick a threshold anywhere along the continuum that will form the basis of our prediction. For example, we might say that any score > 0 indicates the condition is present, otherwise it is not. This simple test will be useful as a tool to discriminate between the disease and non-disease groups if that threshold indeed distinguishes the groups.</p>
<p>This is probably best demonstrated with a simple simulation. The sample we will generate has 100 individuals, around 40% who have the condition in question. The average score for the non-disease group is set at -5, and the average score for the disease group is 5. Both have variance 3.5:</p>
<pre class="r"><code>library(simstudy)
# define data
defx <- defData(varname = "condition", formula = .4, dist = "binary")
defx <- defData(defx, "x", formula = "-5 + 10*condition",
variance = 3.5, dist = "normal")
# generate data
set.seed(1873)
dx <- genData(100, defx)
head(dx)</code></pre>
<pre><code>## id condition x
## 1: 1 0 -5.83
## 2: 2 1 4.66
## 3: 3 1 4.23
## 4: 4 0 -3.87
## 5: 5 1 1.78
## 6: 6 0 -4.87</code></pre>
<p>Looking at the plot below, a threshold of zero appears to do an excellent job of distinguishing the groups. All of those with the condition (depicted in red) are above the threshold, whereas all of those without the condition (depicted in green) fall below the threshold:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-3-1.png" width="192" /></p>
</div>
<div id="the-world-is-not-always-so-neat-and-clean" class="section level3">
<h3>The world is not always so neat and clean</h3>
<p>Of course, we don’t usually have a measure or test that separates the groups so cleanly. Let’s say the average of the disease group is 2.5 and the non-disease group is -3. The threshold of zero still works pretty well, but it is not perfect. Some with the disease fall below the threshold (false negatives), and some without the disease lie above the threshold (false positives). In fact, only 87% of those with the disease are correctly identified (true positives), while 13% of those without the condition are incorrectly identified has having the disease (false positives).</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula="-3 + 5.5*condition",
newvariance = 6)
dx <- genData(100, defx)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-5-1.png" width="192" /></p>
</div>
</div>
<div id="generating-the-roc" class="section level2">
<h2>Generating the ROC</h2>
<p>Zero isn’t the only possible threshold we could use for the diagnosis test. We can lower the threshold to below zero to ensure that we have 100% true positives, but we will have to sacrifice by increasing the proportion of false positives. Likewise, we could reduce the proportion of false positives by increasing the threshold above zero, but would reduce the proportion of true positives in the process.</p>
<p>There are, in fact, an infinite number of possible thresholds. Here is a sequence of plots of the same data with a number of different thresholds ranging from 8 to -8. The percent of true positives is shown on the top and the percent of false positives is shown on the bottom:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-6-1.png" width="576" /></p>
<p>The ROC is really just a summarized version of this sequence of plots. The X-axis is the proportion of false positives at a particular threshold, and the Y-axis is the proportion of true positives. As we lower the threshold, we move from left to right. So, in the plot below, each point represents one of the sections above:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-7-1.png" width="432" /></p>
<p>The ROC above is built from only 9 thresholds. If we consider all possible thresholds (continuous between -10 and 10), this is the the more complete curve:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-8-1.png" width="432" /></p>
<div id="area-under-the-roc" class="section level3">
<h3>Area under the ROC</h3>
<p>The AUC is, well, the area under the ROC. The maximum AUC will be 1 when there is complete separation (there is an example of this below), and the minimum is 0.5 (depicted by the diagonal line) when there is no separation by the test measure (again, an example will follow). We can estimate this area by integrating an approximate function defined by the data between 0 and 1.</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.957 with absolute error < 0.00011</code></pre>
<p>There is actually a meaningful interpretation of the AUC, that is described in a classic 1982 paper by <a href="https://pubs.rsna.org/doi/abs/10.1148/radiology.143.1.7063747">Hanley & McNeil</a> (if you want a deeper understanding of the issues, this paper is not a bad place to start - there is, of course, a huge literature on the topic of ROCs). The AUC is actually equivalent to the probability that the test measure of a random draw from the diseased group will be greater than the test measure of a random draw from the healthy group. So, an AUC = 0.90 indicates that 90% of the time we draw a test measure from the disease group and non-disease group, the measure from the disease group will be greater.</p>
<p>Here is a simple function that returns a value of <code>TRUE</code> if the random draw from the disease group is greater:</p>
<pre class="r"><code>randcomp <- function(ds) {
ds[condition == 1, sample(x, 1)] > ds[condition == 0, sample(x, 1)]
}</code></pre>
<p>And here is the proportion of 1000 draws where the measure from the disease group draws is greater (this is expected to be close to the AUC, which was estimated above to be 0.957):</p>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 0.958</code></pre>
<p>Of course, <code>R</code> has several packages that provide ROCs and calculate AUCs. I’m using package <a href="http://www.biomedcentral.com/1471-2105/12/77/">pROC</a> here just to show you that my AUC estimate is not totally crazy:</p>
<pre class="r"><code>library(pROC)
roc_obj <- roc(response = dx$condition, predictor = dx$x)
auc(roc_obj)</code></pre>
<pre><code>## Area under the curve: 0.958</code></pre>
</div>
</div>
<div id="alternative-scenarios" class="section level2">
<h2>Alternative scenarios</h2>
<p>As I indicated above, the AUC can generally range from 0.5 to 1.0. There is no hard and fast rule about what is a “good” AUC - it will depend on the application. Certainly, anything below 0.7 or maybe even 0.8 is pretty weak. I am going to conclude by generating data at the two extremes.</p>
<div id="minimal-separation" class="section level3">
<h3>Minimal separation</h3>
<p>When the test measure for each group is equally distributed, there is unlikely to be any threshold for which the proportion of true positives exceeds the proportion of false positives. If this is the case, we should probably look for another test measure - or be prepared to make a lot of mistakes in the non-disease group.</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula="0+0*condition",
newvariance = 8)
dx <- genData(100, defx)</code></pre>
<p>As we move the threshold lower, both the proportion of true positives and false positives steadily increase:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-14-1.png" width="576" /></p>
<p>As a result, the ROC hangs fairly close to the diagonal lower bound.</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-15-1.png" width="432" /></p>
<p>We would expect the AUC to be fairly close to 0.5, which it is:</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.623 with absolute error < 4.5e-05</code></pre>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 0.613</code></pre>
</div>
<div id="complete-separation" class="section level3">
<h3>Complete separation</h3>
<p>At the other extreme, the mean of the disease group is high enough so that there is no overlap between the two groups. In this case, the curve follows along Y-axis before going across the X-axis. We can achieve 100% true positives and no false positives if threshold is set at some point that is below the minimum of the disease group, and above the maximum of the non-disease group. Zero will be the ideal cut-off point for this example.</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula="-4+8*condition",
newvariance = 3.5)
dx <- genData(100, defx)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-19-1.png" width="576" /></p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-20-1.png" width="432" /></p>
<p>As expected the AUC is equal to 1:</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.996 with absolute error < 9.2e-05</code></pre>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 1</code></pre>
</div>
</div>
<div id="logistic-regression-and-the-roc" class="section level2">
<h2>Logistic regression and the ROC</h2>
<p>Just a quick note to conclude. The ROC is often used in conjunction with classification problems based on logistic regression modeling. In this case, we may not have a single underlying test measure, but rather we may have multiple predictors or measures. In this case, group assignment decision needs to be based on a summary of these multiple measures; one logical candidate is the individual’s predicted probability estimated by model.</p>
<p>If the specified logistic regression model provides good separation between the two groups, the predicted probabilities will be quite different for each group (higher AUC). However, if the model is not a strong classifier, the predicted probabilities for the two groups will be much closer together (lower AUC).</p>
<p> </p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Hanley, J.A. and McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), pp.29-36.</p>
Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77.
</font></small>
</p>
</div>
Repeated measures can improve estimation when we only care about a single endpoint
https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/
Tue, 10 Dec 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/
<p>I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.</p>
<p>There <em>is</em> one issue that should be fairly generalizable to other studies. It is likely that individual measures will be collected repeatedly over time but the primary outcome of interest will be the measure collected during the last follow-up period. I wanted to explore what, if anything, can be gained by analyzing all of the available data rather than focusing only the final end point.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>In this simulation scenario, there will be 200 subjects randomized at the individual level to one of two treatment arms, intervention (<span class="math inline">\(rx = 1\)</span>) and control (<span class="math inline">\(rx = 0\)</span>). Each person will be followed for 5 months, with a binary outcome measure collected at the end of each month. In the data, period 0 is the first month, and period 4 is the final month.</p>
<pre class="r"><code>library(simstudy)
set.seed(281726)
dx <- genData(200)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = 5)</code></pre>
<p>Here are the data for a single individual:</p>
<pre class="r"><code>dx[id == 142]</code></pre>
<pre><code>## id period rx timeID
## 1: 142 0 1 706
## 2: 142 1 1 707
## 3: 142 2 1 708
## 4: 142 3 1 709
## 5: 142 4 1 710</code></pre>
<p>The probabilities of the five binary outcomes for each individual are a function of time and intervention status.</p>
<pre class="r"><code>defP <- defDataAdd(varname = "p",
formula = "-2 + 0.2*period + 0.5*rx",
dist = "nonrandom", link = "logit")
dx <- addColumns(defP, dx)</code></pre>
<p>The outcomes for a particular individual are correlated, with outcomes in two adjacent periods are more highly correlated than outcomes collected further apart. (I use an auto-regressive correlation structure to generate these data.)</p>
<pre class="r"><code>dx <- addCorGen(dtOld = dx, idvar = "id", nvars = 5, rho = 0.6,
corstr = "ar1", dist = "binary", param1 = "p",
method = "ep", formSpec = "-2 + 0.2*period + 0.5*rx",
cnames = "y")
dx[id == 142]</code></pre>
<pre><code>## id period rx timeID p y
## 1: 142 0 1 706 0.18 0
## 2: 142 1 1 707 0.21 0
## 3: 142 2 1 708 0.25 1
## 4: 142 3 1 709 0.29 0
## 5: 142 4 1 710 0.33 0</code></pre>
<p>In the real world, there will be loss to follow up - not everyone will be observed until the end. In the first case, I will be assuming the data are missing completely at random (MCAR), where missingness is independent of all observed and unobserved variables. (I have <a href="https://www.rdatagen.net/post/musings-on-missing-data/">mused on missingess</a> before.)</p>
<pre class="r"><code>MCAR <- defMiss(varname = "y", formula = "-2.6",
logit.link = TRUE, monotonic = TRUE
)
dm <- genMiss(dx, MCAR, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")
dObs[id == 142]</code></pre>
<pre><code>## id period rx timeID p y
## 1: 142 0 1 706 0.18 0
## 2: 142 1 1 707 0.21 0
## 3: 142 2 1 708 0.25 1
## 4: 142 3 1 709 0.29 NA
## 5: 142 4 1 710 0.33 NA</code></pre>
<p>In this data set only about 70% of the total sample is observed - though by chance there is different dropout for each of the treatment arms:</p>
<pre class="r"><code>dObs[period == 4, .(prop.missing = mean(is.na(y))), keyby = rx]</code></pre>
<pre><code>## rx prop.missing
## 1: 0 0.28
## 2: 1 0.38</code></pre>
</div>
<div id="estimating-the-intervention-effect" class="section level3">
<h3>Estimating the intervention effect</h3>
<p>If we are really only interested in the probability of a successful outcome in the final period, we could go ahead and estimate the treatment effect using a simple logistic regression using individuals who were available at the end of the study. The true value is 0.5 (on the logistic scale), and the estimate here is close to 1.0 with a standard error just under 0.4:</p>
<pre class="r"><code>fit.l <- glm(y ~ rx, data = dObs[period == 4], family = binomial)
coef(summary(fit.l))</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.25 0.28 -4.4 9.9e-06
## rx 0.99 0.38 2.6 9.3e-03</code></pre>
<p>But, can we do better? Fitting a longitudinal model might provide a more stable and possibly less biased estimate, particularly if the specified model is the correct one. In this case, I suspect it will be an improvement, since the data was generated using a process that is amenable to a GEE (generalized estimating equation) model.</p>
<pre class="r"><code>library(geepack)
fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
data = dObs, corstr = "ar1")
coef(summary(fit.m))</code></pre>
<pre><code>## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.33 0.259 81 0.00000
## period 0.30 0.072 17 0.00003
## rx 0.83 0.263 10 0.00152</code></pre>
<p>And finally, it is reasonable to expect that a model that is based on a data set without any missing values would provide the most efficient estimate. And that does seem to be case if we look at the standard error of the effect estimate.</p>
<pre class="r"><code>fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
data = dx, corstr = "ar1")
coef(summary(fit.f))</code></pre>
<pre><code>## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.15 0.227 89.2 0.0e+00
## period 0.30 0.062 23.1 1.5e-06
## rx 0.54 0.233 5.4 2.1e-02</code></pre>
<p>Of course, we can’t really learn much of anything from a single simulated data set. Below is a plot of the mean estimate under each modeling scenario (along with the blue line that represents <span class="math inline">\(\pm 2\)</span> <em>sd</em>) based on 2500 simulated data sets with missingness completely at random. (The code for these replications is included in the addendum.)</p>
<p>It is readily apparent that under an assumption of MCAR, all estimation models yield unbiased estimates (the true effect size is 0.5), though using the last period only is inherently more variable (given that there are fewer observations to work with).</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-11-1.png" width="384" /></p>
</div>
<div id="missing-at-random" class="section level3">
<h3>Missing at random</h3>
<p>When the data are MAR (missing at random), using the last period only no longer provides an unbiased estimate of the effect size. In this case, the probability of missingness is a function of time, intervention status, and the outcome from the prior period, all of which are observed. This is how I’ve defined the MAR process:</p>
<pre class="r"><code>MAR <- defMiss(varname = "y",
formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
logit.link = TRUE, monotonic = TRUE
)</code></pre>
<p>The mean plots based on 2500 iterations reveal the bias of the last period only. It is interesting to see that the GEE model is <em>not</em> biased, because we have captured all of the relevant covariates in the model. (It is well known that a likelihood method can yield unbiased estimates in the case of MAR, and while GEE is not technically a likelihood, it is a <em>quasi</em>-likelihood method.)</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-13-1.png" width="384" /></p>
</div>
<div id="missing-not-at-random" class="section level3">
<h3>Missing not at random</h3>
<p>When missingness depends on unobserved data, such as the outcome itself, then GEE estimates are also biased. For the last set of simulations, I defined missingness of <span class="math inline">\(y\)</span> in any particular time period to be a function of itself. Specifically, if the outcome was successful and the subject was in the intervention, the subject would be more likely to be observed:</p>
<pre class="r"><code>NMAR <- defMiss(varname = "y",
formula = "-2.9 + 0.2*period - 2*rx*y",
logit.link = TRUE, monotonic = TRUE
)</code></pre>
<p>Under the assumption of missingness not at random (NMAR), both estimation approaches based on the observed data set with missing values yields an biased estimate, though using all of the data appears to reduce the bias somewhat:</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-15-1.png" width="384" /></p>
</div>
<div id="addendum-generating-replications" class="section level3">
<h3>Addendum: generating replications</h3>
<pre class="r"><code>iter <- function(n, np, defM) {
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = np)
defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + .5*rx",
dist = "nonrandom", link = "logit")
dx <- addColumns(defP, dx)
dx <- addCorGen(dtOld = dx, idvar = "id", nvars = np, rho = .6,
corstr = "ar1", dist = "binary", param1 = "p",
method = "ep", formSpec = "-2 + 0.2*period + .5*rx",
cnames = "y")
dm <- genMiss(dx, defM, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")
fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
data = dx, corstr = "ar1")
fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
data = dObs, corstr = "ar1")
fit.l <- glm(y ~ rx, data = dObs[period == (np - 1)], family = binomial)
return(data.table(full = coef(fit.f)["rx"],
miss = coef(fit.m)["rx"],
last = coef(fit.l)["rx"])
)
}
## defM
MCAR <- defMiss(varname = "y", formula = "-2.6",
logit.link = TRUE, monotonic = TRUE
)
MAR <- defMiss(varname = "y",
formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
logit.link = TRUE, monotonic = TRUE
)
NMAR <- defMiss(varname = "y",
formula = "-2.9 + 0.2*period - 2*rx*y",
logit.link = TRUE, monotonic = TRUE
)
##
library(parallel)
niter <- 2500
resMCAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MCAR)))
resMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MAR)))
resNMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, NMAR)))</code></pre>
</div>
Adding a "mixture" distribution to the simstudy package
https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/
Tue, 26 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/
<p>I am contemplating adding a new distribution option to the package <code>simstudy</code> that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new <em>mixture</em> option is to step through a few examples and see it in action.</p>
<div id="specifying-the-mixture-distribution" class="section level3">
<h3>Specifying the “mixture” distribution</h3>
<p>As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities. For example, if we have two variables, <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>, we have a mixture if, for any particular observation, we take <span class="math inline">\(x_1\)</span> with probability <span class="math inline">\(p_1\)</span> and <span class="math inline">\(x_2\)</span> with probability <span class="math inline">\(p_2\)</span>, where <span class="math inline">\(\sum_i{p_i} = 1\)</span>, <span class="math inline">\(i \in (1, 2)\)</span>. So, if we have already defined <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span> using the <code>defData</code> function, we can create a third variable <span class="math inline">\(x_{mix}\)</span> with this definition:</p>
<pre class="r"><code>def <- defData(def, varname = "xMix",
formula = "x1 | 0.4 + x2 | 0.6",
dist = "mixture")</code></pre>
<p>In this example, we will draw <span class="math inline">\(x_1\)</span> with probability 0.4 and <span class="math inline">\(x_2\)</span> with probability 0.6. We are, however, not limited to mixing only two variables; to make that clear, I’ll start off with an example that shows a mixture of three normally distributed variables.</p>
</div>
<div id="mixture-of-3-normal-distributions" class="section level3">
<h3>Mixture of 3 normal distributions</h3>
<p>In this case, we have <span class="math inline">\(x_1 \sim N(1,1)\)</span>, <span class="math inline">\(x_2 \sim N(5,4)\)</span>, and <span class="math inline">\(x_3 \sim N(9,1)\)</span>. The mixture will draw from <span class="math inline">\(x_1\)</span> 30% of the time, from <span class="math inline">\(x_2\)</span> 40%, and from <span class="math inline">\(x_3\)</span> 30%:</p>
<pre class="r"><code>def <- defData(varname = "x1", formula = 1, variance = 1)
def <- defData(def, varname = "x2", formula = 5, variance = 4)
def <- defData(def, varname = "x3", formula = 9, variance = 1)
def <- defData(def, varname = "xMix",
formula = "x1 | .3 + x2 | .4 + x3 | .3",
dist = "mixture")</code></pre>
<p>The data generation now proceeds as usual in <code>simstudy</code>:</p>
<pre class="r"><code>set.seed(2716)
dx <- genData(1000, def)
dx</code></pre>
<pre><code>## id x1 x2 x3 xMix
## 1: 1 1.640 4.12 7.13 4.125
## 2: 2 -0.633 6.89 9.07 -0.633
## 3: 3 1.152 2.95 8.71 1.152
## 4: 4 1.519 5.53 8.82 5.530
## 5: 5 0.206 5.55 9.31 5.547
## ---
## 996: 996 2.658 1.87 8.09 1.870
## 997: 997 2.604 4.44 9.09 2.604
## 998: 998 0.457 5.56 10.87 10.867
## 999: 999 -0.400 4.29 9.03 -0.400
## 1000: 1000 2.838 4.78 9.17 9.174</code></pre>
<p>Here are two plots. The top shows the densities for the original distributions separately, and the bottom plot shows the mixture distribution (which is the distribution of <code>xMix</code>):</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-5-1.png" width="576" /></p>
<p>And it is easy to show that the mixture proportions are indeed based on the probabilities that were defined:</p>
<pre class="r"><code>dx[, .(p1=mean(xMix == x1), p2=mean(xMix == x2), p3=mean(xMix == x3))]</code></pre>
<pre><code>## p1 p2 p3
## 1: 0.298 0.405 0.297</code></pre>
</div>
<div id="zero-inflated" class="section level3">
<h3>Zero-inflated</h3>
<p>One classic mixture model is the <em>zero-inflated Poisson</em> model. We can easily generate data from this model using a mixture distribution. In this case, the outcome is <span class="math inline">\(0\)</span> with probability <span class="math inline">\(p\)</span> and is a draw from a Poisson distribution with mean (and variance) <span class="math inline">\(\lambda\)</span> with probability <span class="math inline">\(1-p\)</span>. As a result, there will be an over-representation of 0’s in the observed data set.
In this example <span class="math inline">\(p\)</span> = 0.2 and <span class="math inline">\(\lambda = 2\)</span>:</p>
<pre class="r"><code>def <- defData(varname = "x0", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "xPois", formula = 2, dist = "poisson")
def <- defData(def, varname = "xMix", formula = "x0 | .2 + xPois | .8",
dist = "mixture")
set.seed(2716)
dx <- genData(1000, def)</code></pre>
<p>The figure below shows a histogram of the Poisson distributed <span class="math inline">\(x_{pois}\)</span> on top and a histogram of the mixture on the bottom. It is readily apparent that the mixture distribution has “too many” zeros relative to the Poisson distribution:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-8-1.png" width="480" /></p>
<p>I am fitting model below (using the <code>pscl</code> package) to see if it is possible to recover the assumptions I used in the data generation process. With 1000 observations, of course, it is easy:</p>
<pre class="r"><code>library(pscl)
zfit <- zeroinfl(xMix ~ 1 | 1, data = dx)
summary(zfit)</code></pre>
<pre><code>##
## Call:
## zeroinfl(formula = xMix ~ 1 | 1, data = dx)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.035 -1.035 -0.370 0.296 4.291
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6959 0.0306 22.8 <2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.239 0.107 -11.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 9
## Log-likelihood: -1.66e+03 on 2 Df</code></pre>
<p>The estimated value of <span class="math inline">\(lambda\)</span> from the model is the exponentiated value of the coefficient from the Poisson model: <span class="math inline">\(e^{0.6959}\)</span>. The estimate is quite close to the true value <span class="math inline">\(\lambda = 2\)</span>:</p>
<pre class="r"><code>exp(coef(zfit)[1])</code></pre>
<pre><code>## count_(Intercept)
## 2.01</code></pre>
<p>And the estimated probability of drawing a zero (i.e. <span class="math inline">\(\hat{p}\)</span>) is based on a simple transformation of the coefficient of the binomial model (<span class="math inline">\(-1.239\)</span>), which is on the logit scale. Again, the estimate is quite close to the true value <span class="math inline">\(p = 0.2\)</span>:</p>
<pre class="r"><code>1/(1 + exp(-coef(zfit)[2]))</code></pre>
<pre><code>## zero_(Intercept)
## 0.225</code></pre>
</div>
<div id="outlier-in-linear-regression" class="section level3">
<h3>Outlier in linear regression</h3>
<p>In this final example, I use the mixture option to generate outliers in the context of a regression model. This is done first by generating outcomes <span class="math inline">\(y\)</span> as a function of a predictor <span class="math inline">\(x\)</span>. Next, alternative outcomes <span class="math inline">\(y_{outlier}\)</span> are generated independent of <span class="math inline">\(x\)</span>. The observed outcomes <span class="math inline">\(y_{obs}\)</span> are a mixture of the outliers <span class="math inline">\(y_{outlier}\)</span> and the predicted <span class="math inline">\(y\)</span>’s. In this simulation, 2.5% of the observations will be drawn from the outliers:</p>
<pre class="r"><code>def <- defData(varname = "x", formula = 0, variance = 9,
dist = "normal")
def <- defData(def, varname = "y", formula = "3+2*x", variance = 7,
dist = "normal")
def <- defData(def, varname = "yOutlier", formula = 12, variance = 6,
dist = "normal")
def <- defData(def, varname = "yObs",
formula = "y | .975 + yOutlier | .025",
dist = "mixture")
set.seed(2716)
dx <- genData(100, def)</code></pre>
<p>This scatter plot shows the relationship between <span class="math inline">\(y_{obs}\)</span> and <span class="math inline">\(x\)</span>; the red dots represent the observations drawn from the outlier distribution:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p>Once again, it is illustrative to fit a few models to estimate the linear relationships between the <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span>. The model that includes the true value of <span class="math inline">\(y\)</span> (as opposed to the outliers) unsurprisingly recovers the true relationship. The model that includes the observed outcomes (the mixture distribution) underestimates the relationship. And a robust regression model (using the <code>rlm</code> function <code>MASS</code> package) provides a less biased estimate:</p>
<pre class="r"><code>lm1 <- lm( y ~ x, data = dx)
lm2 <- lm( yObs ~ x, data = dx)
library(MASS)
rr <- rlm(yObs ~ x , data = dx)</code></pre>
<pre class="r"><code>library(stargazer)
stargazer(lm1, lm2, rr, type = "text",
omit.stat = "all", omit.table.layout = "-asn",
report = "vcs")</code></pre>
<pre><code>##
## ================================
## Dependent variable:
## -----------------------
## y yObs
## OLS OLS robust
## linear
## (1) (2) (3)
## x 2.210 2.030 2.150
## (0.093) (0.136) (0.111)
##
## Constant 2.780 3.310 2.950
## (0.285) (0.417) (0.341)
##
## ================================</code></pre>
<p>The scatter plot below includes the fitted lines from the estimated models: the blue line is the true regression model, the red line is the biased estimate based on the data that includes outliers, and the black line is the robust regression line that is much closer to the truth:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>The mixture option is still experimental, though it is available on <a href="https://github.com/kgoldfeld/simstudy">github</a>. One enhancement I hope to make is to allow the mixture probability to be a function of covariates. The next release on CRAN will certainly include some form of this new distribution option.</p>
<p> </p>
</div>
What can we really expect to learn from a pilot study?
https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/
Tue, 12 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/
<p>I am involved with a very interesting project - the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a> - where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.</p>
<p>There is a rich literature on this topic. For example, these papers by <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j..2002.384.doc.x"><em>Lancaster et al</em></a> and <a href="https://www.sciencedirect.com/science/article/pii/S002239561000292X"><em>Leon et al</em></a> provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the <em>feasibility</em> of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?</p>
<p>Another thing the authors agree on is that the pilot study is <em>not</em> generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)</p>
<p>My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.</p>
<div id="a-small-two-arm-pilot-study" class="section level3">
<h3>A small, two-arm pilot study</h3>
<p>In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect <span class="math inline">\(\Delta = 50\)</span>. The outcome is a continuous measure with a within-arm standard deviation <span class="math inline">\(\sigma = 100\)</span>. In some fields of research, the effect size would be standardized as <span class="math inline">\(d = \Delta / \sigma\)</span>. (This is also known as <a href="https://rpsychologist.com/d3/cohend/">Cohen’s <span class="math inline">\(d\)</span></a>.) So, in this case the true standardized effect size <span class="math inline">\(d=0.5\)</span>.</p>
<p>If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate <span class="math inline">\(\alpha = 0.05\)</span>. Using the <code>pwr.t.test</code> function in the <code>pwr</code> library, we specify the treatment effect (as <span class="math inline">\(d\)</span>), significance level <span class="math inline">\(\alpha\)</span>, and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):</p>
<pre class="r"><code>library(pwr)
pwr.t.test(n = NULL, d = 50/100, sig.level = 0.05,
power = 0.80, type = "two.sample") </code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 64
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<p>If we do not have an estimate of <span class="math inline">\(d\)</span> or even of the individual components <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\sigma\)</span>, we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator <span class="math inline">\(rx\)</span> and an outcome <span class="math inline">\(y\)</span>:</p>
<pre class="r"><code>library(simstudy)
defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)
ss <- 30
set.seed(22821)
dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
head(dd)</code></pre>
<pre><code>## id rx y
## 1: 1 0 -150
## 2: 2 1 48
## 3: 3 0 -230
## 4: 4 1 116
## 5: 5 1 91
## 6: 6 1 105</code></pre>
<p>Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform <span class="math inline">\(d\)</span>? We have a couple of options. In the first case, we can estimate both <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\sigma\)</span> from the data and use those results directly in power calculations:</p>
<pre class="r"><code>lmfit <- lm(y ~ rx, data = dd)
Delta <- coef(lmfit)["rx"]
Delta</code></pre>
<pre><code>## rx
## 78</code></pre>
<pre class="r"><code>sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]
pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
pool.sd</code></pre>
<pre><code>## [1] 94</code></pre>
<p>The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated <span class="math inline">\(\hat{d}\)</span> is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = Delta/pool.sd, sig.level = 0.05,
power = 0.80, type = "two.sample") </code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 24
## d = 0.83
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<p>Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = 50/pool.sd, sig.level = 0.05,
power = 0.80, type = "two.sample") </code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 56
## d = 0.53
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
</div>
<div id="speak-truth-to-power" class="section level3">
<h3>Speak truth to power</h3>
<p>Now the question becomes, what is the true expected power of the RCT based on the sample size estimated in the pilot study. To estimate this true power, we use the true effect size and the true variance (i.e. the true <span class="math inline">\(d\)</span>)?</p>
<p>In the first case, where we actually used the true <span class="math inline">\(d\)</span> to get the sample size estimate, we just recover the 80% power estimate. No surprise there:</p>
<pre class="r"><code>pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.8</code></pre>
<p>In the second case, where we used <span class="math inline">\(\hat{d} = \hat{\Delta} / \hat{\sigma}\)</span> to get the sample size <span class="math inline">\(n=24\)</span>, the true power of the larger RCT would be 40%:</p>
<pre class="r"><code>pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.4</code></pre>
<p>And if we had used <span class="math inline">\(\hat{d} = 50 / \hat{\sigma}\)</span> to get the sample size estimate <span class="math inline">\(n=56\)</span>, the true power would have been 75%:</p>
<pre class="r"><code>pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.75</code></pre>
</div>
<div id="conservative-estimate-of-standard-deviation" class="section level3">
<h3>Conservative estimate of standard deviation</h3>
<p>While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (and more on that in the next and last section), this <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780141709">1995 paper</a> by R.H. Browne presents the idea that we <em>can</em> use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.</p>
<p>The confidence interval for the standard deviation is not routinely provided in R. Another <a href="https://www.sciencedirect.com/science/article/pii/S0378375810005070?via%3Dihub">paper</a> analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula in the most straightforward case under assumptions of normality to estimate the <span class="math inline">\(\gamma*100\%\)</span> one-sided confidence interval for <span class="math inline">\(\sigma^2\)</span>:</p>
<p><span class="math display">\[
\left( 0,\frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right)
\]</span></p>
<p>where <span class="math inline">\(\chi^2_{N-2;\gamma}\)</span> is determined by <span class="math inline">\(P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma\)</span>. So, if <span class="math inline">\(\gamma = 0.95\)</span> then we can get a one-sided 95% confidence interval for the standard deviation using that formulation:</p>
<pre class="r"><code>gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi )
ucl</code></pre>
<pre><code>## [1] 111</code></pre>
<p>The point estimate <span class="math inline">\(\hat{\sigma}\)</span> is 94, and the one-sided 95% confidence interval is <span class="math inline">\((0, 111)\)</span>. (I’m happy to provide a simulation to demonstrate that this is in fact the case, but won’t do it here in the interest of space.)</p>
<p>If we use <span class="math inline">\(\hat{\sigma}_{ucl} = 111\)</span> to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate <span class="math inline">\(\hat{\sigma} = 94\)</span> (where the sample size requirement was 56):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = 50/ucl, sig.level = 0.05,
power = 0.80, type = "two.sample") </code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 78
## d = 0.45
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<p>Ultimately, using <span class="math inline">\(\gamma = 0.95\)</span> might be too conservative in that it might lead to an excessively large sample size requirement. Browne’s paper uses simulation to to evaluate a range of <span class="math inline">\(\gamma\)</span>’s, from 0.5 to 0.9, which I also do in the next section.</p>
</div>
<div id="simulation-of-different-approaches" class="section level3">
<h3>Simulation of different approaches</h3>
<p>At this point, we need to generate multiple iterations to see how the various approaches perform over <em>repeated</em> pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above.</p>
<p>As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or minimally clinically meaningful effect size to estimate <span class="math inline">\(\Delta\)</span> (in the figures below, this would be the columns labeled <em>‘truth’</em>) with using the effect size point estimate from the pilot (labeled <em>pilot</em>). I also compare using a point estimate of <span class="math inline">\(\sigma\)</span> from the pilot (where <span class="math inline">\(\gamma=0\)</span>), with using the upper limit of a one-sided confidence interval defined by <span class="math inline">\(\gamma\)</span>. In these simulations I compare three levels of <span class="math inline">\(\gamma\)</span>: <span class="math inline">\(\gamma \in (0.5, 0.7, 0.9)\)</span>.</p>
<p>In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard error <span class="math inline">\(\sigma = 100\)</span> so that true <span class="math inline">\(d\)</span> is 0.30 or 0.75.</p>
<p>The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios. Each scenario was simulated 5000 times each. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the percent of simulations with power estimates at or above the target of 80%.</p>
<p><img src="https://www.rdatagen.net/img/post-pilot/pilot30.png" style="width:90.0%" /></p>
<p>Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for <span class="math inline">\(\hat{\sigma}\)</span>. In Browne’s paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is actually likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, perhaps using <span class="math inline">\(\gamma\)</span> higher than 0.6 or 0.7 might provide an adequate cushion.</p>
<p>Second, it appears <em>using the effect size estimate from the pilot as the basis for an RCT power analysis is risky</em>. The box plots labeled as <em>pilot</em> exhibit much more variation than the <em>‘true’</em> box plots. As a result, there is a high probability that the true power will fall considerably below 80%. And in many other cases, the true power will be unnecessarily large, due to the fact that they have been designed to be larger than they need to be.</p>
<p>The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where variation seems to be reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an under-powered or overpowered study, neither of which is ideal.</p>
<p><img src="https://www.rdatagen.net/img/post-pilot/pilot60.png" style="width:90.0%" /></p>
<p>A question remains about how best to determine what effect size to use for the power calculation if using the estimate from the pilot is risky. I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. And the pilot study should focus on other important feasibility issues that <em>can</em> help improve the design of the RCT.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of evaluation in clinical practice, 10(2), pp.307-312.</p>
<p>Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of psychiatric research, 45(5), pp.626-629.</p>
<p>Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in medicine, 14(17), pp.1933-1940.</p>
<p>Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.</p>
<p> </p>
<p>Support:</p>
This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.
</font></small>
</p>
<p> </p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>Below is the code I used to run the simulations and generate the plots</p>
<pre class="r"><code>getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {
estring <- paste0("rx * ", esize)
defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)
N <- ssize * 2
dd <- genData(n = N)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
lmfit <- lm(y~rx, data = dd)
sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]
pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)
ucl <- sqrt( ( (N-2) * pool.sd^2 ) / qchi )
p.sd <- estsd * (gamma == 0) + ucl * (gamma > 0)
p.eff <- esize * (use.est == FALSE) +
coef(lmfit)["rx"] * (use.est == TRUE)
if (abs(p.eff/p.sd) < 0.0002) p.eff <- sign(p.eff) * .0002 * p.sd
nstar <- round(pwr.t.test(n = NULL, d = p.eff/p.sd, sig.level = 0.05,
power = 0.80, type = "two.sample")$n,0)
power <- pwr.t.test(n=nstar, d = esize/100, sig.level = 0.05,
type = "two.sample")
return(data.table(ssize, esize, gamma, use.est,
estsd = estsd, ucl = ucl, nstar, power = power$power,
est = coef(lmfit)["rx"],
lcl.est = confint(lmfit)["rx",1] ,
ucl.est = confint(lmfit)["rx",2])
)
}</code></pre>
<pre class="r"><code>dres <- data.table()
for (i in c(30, 60)) {
for (j in c(30, 75)) {
for (k in c(0, .5, .7)) {
for (l in c(FALSE, TRUE)) {
dd <- rbindlist(lapply(1:5000,
function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l))
)
dres <- rbind(dres, dd)
}}}}</code></pre>
<pre class="r"><code>above80 <- dres[, .(x80 = mean(power >= 0.80)),
keyby = .(ssize, esize, gamma, use.est)]
above80[, l80 := scales::percent(x80, accuracy = 1)]
g_labeller <- function(value) {
paste("\U03B3", "=", value) # unicode for gamma
}
e_labeller <- function(value) {
paste("\U0394", "=", value) # unicdoe for Delta
}
ggplot(data = dres[ssize == 30],
aes(x=factor(use.est, labels=c("'truth'", "pilot")), y=power)) +
geom_hline(yintercept = 0.8, color = "white") +
geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "grey92"),
axis.ticks = element_blank(),
plot.title = element_text(size = 9, face = "bold")) +
facet_grid(esize ~ gamma,
labeller = labeller(gamma = g_labeller, esize = e_labeller)) +
scale_x_discrete(
name = "\n source of effect size used for power calculation") +
scale_y_continuous(limits = c(0,1), breaks = c(0, .8),
name = "distribution of power estimates \n") +
ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +
geom_text(data = above80[ssize == 30],
aes(label = l80), x=rep(c(0.63, 1.59), 6), y = 0.95,
size = 2.5)</code></pre>
</div>
Any one interested in a function to quickly generate data with many predictors?
https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/
Tue, 29 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/
<p>A couple of months ago, I was contacted about the possibility of creating a simple function in <code>simstudy</code> to generate a large dataset that could include possibly 10’s or 100’s of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other “regularization” methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.</p>
<p>I’m presenting a new function here as a work-in-progress. I am putting it out there in case other folks have opinions about what might be most useful; feel free to let me know if you do. If not, I am likely to include something very similar to this in the next iteration of <code>simstudy</code>, which will be version <code>0.1.16</code>.</p>
<div id="function-genmultpred" class="section level3">
<h3>Function genMultPred</h3>
<p>In its latest iteration, the new function has three interesting arguments. The first two are <code>predNorm</code> and <code>predBin</code>, which are each vectors of length 2. The first value indicates the number of predictors to generate with either a standard normal distribution or a binary distribution, respectively. The second value in each vector represents the number of variables that will actually be predictive of the outcome. (Obviously, the second value cannot be greater than the first value.)</p>
<p>The third interesting argument is <code>corStrength</code>, which is a non-negative number indicating the overall strength of the correlation between the predictors. When corStrength is set to 0 (which is the default), the variables are generated assuming independence. When corStrength is non-zero, a random correlation matrix is generated using package <code>clusterGeneration</code> [Weiliang Qiu and Harry Joe. (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).] The corStrength value is passed on to the argument <code>ratioLambda</code> in the function <code>genPositiveDefMat</code>. As the value of corStrength increases, higher levels of correlation are induced in the random correlation matrix for the predictors.</p>
<p>Currently, the outcome can only have one of three distributions: <em>normal</em>, <em>binomial</em>, or <em>Poisson</em>.</p>
<p>One possible enhancement would be to allow the distributions of the predictors to have more flexibility. However, I’m not sure the added complexity would be worth it. Again, you could always take the more standard <code>simstudy</code> approach of function <code>genData</code> if you wanted more flexibility.</p>
<p>Here’s the function, in case you want to take a look under the hood:</p>
<pre class="r"><code>genMultPred <- function(n, predNorm, predBin,
dist = "normal", sdy = 1, corStrength = 0) {
normNames <- paste0("n", 1:predNorm[1])
binNames <- paste0("b", 1:predBin[1])
## Create the definition tables to be used by genData
defn <- data.table(varname = normNames,
formula = 0,
variance = 1,
dist = "normal",
link = "identity")
defb <- data.table(varname = binNames,
formula = 0.5,
variance = NA,
dist = "binary",
link = "identity")
defx <- rbind(defn, defb)
attr(defx, which = "id") <- "id"
## Create the coefficient values - all normally distributed
ncoefs <- rnorm(predNorm[1], 0, 1)
setzero <- sample(1:predNorm[1], (predNorm[1] - predNorm[2]),
replace = FALSE)
ncoefs[setzero] <- 0
bcoefs <- rnorm(predBin[1], 0, 1)
setzero <- sample(1:predBin[1], (predBin[1] - predBin[2]),
replace = FALSE)
bcoefs[setzero] <- 0
coefs <- c(ncoefs, bcoefs)
names(coefs) <- c(normNames, binNames)
## Generate the predictors
if (corStrength <= 0) { # predictors are independent
dx <- genData(n, defx)
} else {
rLambda <- max(1, corStrength)
covx <- cov2cor(genPositiveDefMat(nrow(defx),
lambdaLow = 1, ratioLambda = rLambda)$Sigma)
dx <- genCorFlex(n, defx, corMatrix = covx)
}
## Generate the means (given the predictors)
mu <- as.matrix(dx[,-"id"]) %*% coefs
dx[, mu := mu]
## Generate the outcomes based on the means
if (dist == "normal") {
dx[, y := rnorm(n, mu, sdy)]
} else if (dist == "binary") {
dx[, y := rbinom(n, 1, 1/(1 + exp(-mu)))] # link = logit
} else if (dist == "poisson") {
dx[, y := rpois(n, exp(mu))] # link = log
}
dx[, mu := NULL]
return(list(data = dx[], coefs = coefs))
}</code></pre>
</div>
<div id="a-brief-example" class="section level2">
<h2>A brief example</h2>
<p>Here is an example with 7 normally distributed covariates and 4 binary covariates. Only 3 of the continuous covariates and 2 of the binary covariates will actually be predictive.</p>
<pre class="r"><code>library(simstudy)
library(clusterGeneration)
set.seed(732521)
dx <- genMultPred(250, c(7, 3), c(4, 2))</code></pre>
<p>The function returns a list of two objects. The first is a data.table containing the generated predictors and outcome:</p>
<pre class="r"><code>round(dx$data, 2)</code></pre>
<pre><code>## id n1 n2 n3 n4 n5 n6 n7 b1 b2 b3 b4 y
## 1: 1 0.15 0.12 -0.07 -1.38 -0.05 0.58 0.57 1 1 0 1 -1.07
## 2: 2 1.42 -0.64 0.08 0.83 2.01 1.18 0.23 1 1 0 0 4.42
## 3: 3 -0.71 0.77 0.94 1.59 -0.53 -0.05 0.26 0 0 0 0 0.09
## 4: 4 0.35 -0.80 0.90 -0.79 -1.72 -0.16 0.09 0 0 1 1 -0.58
## 5: 5 -0.22 -0.72 0.62 1.40 0.17 2.21 -0.45 0 1 0 1 -2.18
## ---
## 246: 246 -1.04 1.62 0.40 1.46 0.80 -0.77 -1.27 0 0 0 0 -1.19
## 247: 247 -0.85 1.56 1.39 -1.25 -0.82 -0.63 0.13 0 1 0 0 -0.70
## 248: 248 0.72 -0.83 -0.04 -1.38 0.61 -0.71 -0.06 1 0 1 1 0.74
## 249: 249 -0.15 1.62 -1.01 -0.79 -0.53 0.44 -0.46 1 1 1 1 0.95
## 250: 250 -0.59 0.34 -0.31 0.18 -0.86 -0.90 0.22 1 0 1 0 -1.90</code></pre>
<p>The second object is the set of coefficients that determine the average response conditional on the predictors:</p>
<pre class="r"><code>round(dx$coefs, 2)</code></pre>
<pre><code>## n1 n2 n3 n4 n5 n6 n7 b1 b2 b3 b4
## 2.48 0.62 0.28 0.00 0.00 0.00 0.00 0.00 0.00 0.53 -1.21</code></pre>
<p>Finally, we can “recover” the original coefficients with linear regression:</p>
<pre class="r"><code>lmfit <- lm(y ~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + b1 + b2 + b3 + b4,
data = dx$data)</code></pre>
<p>Here’s a plot showing the 95% confidence intervals of the estimates along with the true values. The yellow lines are covariates where there is truly no association.</p>
<p><img src="https://www.rdatagen.net/post/2019-10-29-any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors.en_files/figure-html/unnamed-chunk-7-1.png" width="576" /></p>
<p> </p>
<div id="addendum-correlation-among-predictors" class="section level3">
<h3>Addendum: correlation among predictors</h3>
<p>Here is a pair of examples using the <code>corStrength</code> argument. In the first case, the observed correlations are close to 0, whereas in the second case, the correlations range from -0.50 to 0.25. The impact of <code>corStrength</code> will vary depending on the number of potential predictors.</p>
<pre class="r"><code>set.seed(291212)
# Case 1
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 0)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)</code></pre>
<pre><code>## n1 n2 n3 n4 b1 b2
## n1 1.00 -0.02 0.02 0.03 -0.01 -0.01
## n2 -0.02 1.00 -0.01 0.03 -0.03 0.00
## n3 0.02 -0.01 1.00 0.00 -0.04 -0.01
## n4 0.03 0.03 0.00 1.00 0.06 -0.01
## b1 -0.01 -0.03 -0.04 0.06 1.00 -0.01
## b2 -0.01 0.00 -0.01 -0.01 -0.01 1.00</code></pre>
<pre class="r"><code># Case 2
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 50)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)</code></pre>
<pre><code>## n1 n2 n3 n4 b1 b2
## n1 1.00 0.09 0.08 -0.32 0.25 0.04
## n2 0.09 1.00 -0.29 -0.47 -0.05 -0.02
## n3 0.08 -0.29 1.00 -0.46 -0.01 -0.01
## n4 -0.32 -0.47 -0.46 1.00 -0.20 -0.05
## b1 0.25 -0.05 -0.01 -0.20 1.00 -0.04
## b2 0.04 -0.02 -0.01 -0.05 -0.04 1.00</code></pre>
</div>
</div>
Selection bias, death, and dying
https://www.rdatagen.net/post/selection-bias-death-and-dying/
Tue, 15 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/selection-bias-death-and-dying/
<p>I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death.</p>
<p>A key challenge for researchers is figuring out how to analyze events that occur just before death. For example, it is not unusual to consider hospitalization in the week or month before death as a poor outcome. For example, here is a <a href="https://www.liebertpub.com/doi/full/10.1089/jpm.2015.0229">paper</a> in the <em>Journal of Palliative Care Medicine</em> that describes an association of homecare nursing and reduced hospitalizations in the week before death. While there is no denying the strength of the association, it is less clear how much of that association is causal.</p>
<p>In particular, there is the possibility of <em>selection bias</em> that may be result when considering only patients who have died. In this post, I want to describe the concept of selection bias and simulate data that mimics the process of end-stage disease in order to explore how these issues might play out when we are actually evaluating the causal effect of an exposure or randomized intervention.</p>
<div id="selection-bias" class="section level3">
<h3>Selection bias</h3>
<p><em>Selection bias</em> is used to refer to different concepts by different researchers (see this article by <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4043938/"><em>Haneuse</em></a> or this one by <a href="https://journals.lww.com/epidem/Fulltext/2004/09000/Causation_of_Bias__The_Episcope.20.aspx"><em>Hernán et al</em></a> for really nice discussions of these issues). The terminology doesn’t matter as much as understanding the underlying data generating processes that distinguish the different ideas.</p>
<p>The key issue is to understand <em>what</em> is being selected. In one case, the exposure or intervention is the focus. And in the second case, it is how the patients or subjects are selected into the study more generally that induces the bias. The first selection process is typically referred to by epidemiologists as <em>confounding bias</em> (though it is also called <em>treatment-selection bias</em>), while the second is actually <em>selection bias</em>.</p>
<p>When I’ve written about these issues before (for example, see <a href="https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/">here</a>), I’ve described how DAGs can be useful to illuminate the potential biases. Below, I have drawn a diagram to represent a simple case of selection bias. Say we are interested in measuring the causal relationship between <em>income</em> and <em>blood pressure</em> in some population in which the two are actually not causally related. If people with higher income are more likely to visit a doctor, and if people with higher blood pressure are also more likely to visit a doctor, the underlying causal relationship might be well represented by the DAG on the left in the figure below.</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/selectionbias.png" style="width:65.0%" /></p>
<p>Let’s say we recruit participants for our study right outside of a medical facility. Choosing this location (as opposed to, say, a shopping mall where the causal model on the left would not be relevant), we are inducing a relationship between <em>income</em> and <em>blood pressure</em>. This can be seen in the DAG on the right, where we have effectively “controlled” for medical facility access in our selection process. The induced statistical relationship can be described in this way: if someone is at the medical center and they have relatively low income, they are more likely to have relatively high blood pressure. Conversely, if someone is there and they have relatively low blood pressure, they are more likely to have relatively high income. Based on this logic, we would expect to see a negative relationship between <em>income</em> and <em>blood pressure</em> in our study sample drawn from patients visiting a medical facility.</p>
<p>To explore by simulation, we can generate a large population of individuals with uncorrelated income and blood pressure. Selection will be a function of both:</p>
<pre class="r"><code>n = 5000
set.seed(748347)
income <- rnorm(n);
bp <- rnorm(n)
logitSelect <- -2 + 2*income + 2*bp
pSelect <- 1/(1+exp(-logitSelect))
select <- rbinom(n, 1, pSelect)
dPop <- data.table(income, bp, select)
dSample <- dPop[select == 1]</code></pre>
<p>The plot on the left below is the overall population of 5000; there is no obvious relationship between the <em>income</em> and <em>blood pressure</em>. The group that was recruited at the medical facility and enrolled in the study (a subset of the original population) is shown in purple in the plot on the right. In this subset, there does indeed appear to be a relationship between the two characteristics. An estimate of the association, which we know is zero, based on the sample would be biased; that bias is due to how we selected participants into the study.</p>
<p><img src="https://www.rdatagen.net/post/2019-10-15-selection-bias-death-and-dying.en_files/figure-html/unnamed-chunk-2-1.png" width="960" /></p>
</div>
<div id="hospitalization-before-death" class="section level3">
<h3>Hospitalization before death</h3>
<p>In the next simulation, let’s consider a somewhat more complex process, though with the same underlying structure and similar bias as the simpler case above. The next DAG (below) shows three different time periods. In this case there is an indicator of homecare by a nurse <span class="math inline">\(N_1\)</span>, <span class="math inline">\(N_2\)</span>, and <span class="math inline">\(N_3\)</span>. (In this particular example, an individual has home nursing care in all three periods or they don’t have any home nursing care in any period. This is not a requirement.) In each period, each patient has an underlying time-dependent health status, ranging from <span class="math inline">\(1\)</span> (healthiest) to <span class="math inline">\(4\)</span> (sickest). In this simulated study, the underlying health status <span class="math inline">\(U_1\)</span>, <span class="math inline">\(U_2\)</span>, and <span class="math inline">\(U_3\)</span> are considered latent (i.e. unmeasured). The progression of health status is governed by a Markov process that is independent of any kind of treatment. (See <a href="https://www.rdatagen.net/post/simstudy-1-14-update/">here</a> and <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">here</a> for a more detailed description of how this is done using <code>simstudy</code>.)</p>
<p>The probability of hospitalization is a solely a function of the underlying health status, and nothing else. (I could make hospitalization a function of palliative care as well, but this just simplifies matters. In both cases the estimates will be biased - you can try for yourself.)</p>
<p>Death is a function of underlying health status and palliative care. While it does not seem to be the case in practice, I am assuming that less aggressive care results in shorter survival times. And the sicker the patient is in a particular period, the greater risk of dying in that period. (There should be lines between death in various periods and all subsequent measures, but I have eliminated them for clarity sake.)</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/repeated.png" style="width:80.0%" /></p>
<p>The code to generate the data starts with the definitions: first, I define an initial health state <span class="math inline">\(S_0\)</span> that can range from 1 to 3 and the transition matrix <span class="math inline">\(P\)</span> for the Markov process. Next, I define the hospitalization and death outcomes:</p>
<pre class="r"><code>bDef <- defData(varname = "S0", formula = "0.4;0.4;0.2",
dist = "categorical")
P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
0.0, 0.4, 0.4, 0.2,
0.0, 0.0, 0.6, 0.4,
0.0, 0.0, 0.0, 1.0),
nrow = 4))
pDef <- defDataAdd(varname = "hospital", formula = "-2 + u",
dist = "binary", link = "logit")
pDef <- defDataAdd(pDef, varname = "death",
formula = "-2 + u + homenurse * 1.5",
dist = "binary", link = "logit")</code></pre>
<p>The data generation process randomizes individuals to nursing home care (or care as usual) in the first period, and creates all of the health status measures and outcomes. The last step removes any data for an individual that was generated after they died. (The function <code>trimData</code> is new and only available in <code>simstudy 0.1.15</code>, which is available on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a> - as of 10/21/2019)</p>
<pre class="r"><code>set.seed(272872)
dd <- genData(10000, bDef)
dd <- trtAssign(dd, grpName = "homenurse")
dp <- addMarkov(dd, transMat = P,
chainLen = 4, id = "id",
pername = "seq", start0lab = "S0",
varname = "u")
dp <- addColumns(pDef, dp)
dp <- trimData(dp, seqvar = "seq", eventvar = "death")</code></pre>
<p> </p>
<div id="a-short-follow-up-period" class="section level4">
<h4>A short follow-up period</h4>
<p>If we have a relatively short follow up period in our randomized trial of supportive care at home (nursecare), only a portion of the sample will die; as result, we can only compare the hospitalization before death for a subset of the sample. By selecting on death, we will induce a relationship between the intervention and the outcome where none truly exists. Inspecting the DAGs below, it is apparent that this is a classic case of selection bias. Since we cannot control for the unmeasured health status <span class="math inline">\(U\)</span>, hospitalization and death are associated. And, since treatment and death <em>are</em> causally related, by selecting on death we are in the same situation as we were in the first example.</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/singleprepost.png" style="width:100.0%" /></p>
<pre class="r"><code>d1 <- dp[seq == 1]</code></pre>
<p>If we consider only those who died in the first period, we will be including 61% of the sample:</p>
<pre class="r"><code>d1[, mean(death)]</code></pre>
<pre><code>## [1] 0.6109</code></pre>
<p>To get a sense of the bias, I am considering three models. The first model estimates the effect of the intervention on hospitalization for only those who died in the first period; we expect that this will have a negative bias. In the second model, we use the same subset of patients who died, but adjust for underlying health status; the hospitalization coefficient should be close to zero. Finally, we estimate a model for everyone in period 1, regardless of whether they died. again, we expect the effect size to be close to 0.</p>
<pre class="r"><code>fit1 <- glm(hospital ~ homenurse, data=d1[death==1],
family = "binomial")
fit2 <- glm(hospital ~ homenurse + u, data=d1[death==1],
family = "binomial")
fit3 <- glm(hospital ~ homenurse, data=d1,
family = "binomial")
library(stargazer)
stargazer(fit1, fit2, fit3, type = "text",
column.labels = c("died", "died - adjusted", "all"),
omit.stat = "all", omit.table.layout = "-asn")</code></pre>
<pre><code>##
## =============================================
## Dependent variable:
## -----------------------------------
## hospital
## died died - adjusted all
## (1) (2) (3)
## homenurse -0.222*** -0.061 -0.049
## (0.053) (0.057) (0.040)
##
## u 1.017***
## (0.039)
##
## Constant 0.108** -2.005*** -0.149***
## (0.042) (0.092) (0.028)
##
## =============================================</code></pre>
<p>While these are estimates from a single data set (I should really do more extensive experiment based on many different data sets), the estimates do seem to support our expectations. Indeed, if we cannot measure the underlying health status, the estimate of the intervention effect on hospitalization prior to death is biased; we would conclude that supportive care reduces the probability of hospitalization before death when we know (based on the data generation process used here) that it does not.</p>
</div>
<div id="extended-follow-up" class="section level4">
<h4>Extended follow-up</h4>
<p>We might think that if we could follow <em>everyone</em> up until death (and hence not select on death), the bias would be eliminated. However, this not the case. The treatment effect is essentially an average of the effect over all time periods, and we know that for each time period, the effect estimate is biased due to selection. And averaging across biased estimates yields a biased estimate.</p>
<p>This issue is closely related to a general issue for causal survival analysis. It has been pointed out that it is not possible to estimate a causal treatment effect using hazard rates, as we do when we use Cox proportional hazard models. This is true even if treatment has been randomized and the two treatment arms are initially balanced with respect to underlying health status. The challenge is that after the first set of deaths, the treatment groups will no longer be balanced with respect to health status; some people survived because of the intervention, others because they were generally healthier. At each point in the survival analysis, the model for risk of death is conditioning (i.e. selecting on) those who did not die. So, there is built in selection bias in the modelling. If you are interested in reading more about these issues, I recommend taking a look at these papers by <a href="https://journals.lww.com/epidem/fulltext/2010/01000/The_Hazards_of_Hazard_Ratios.4.aspx"><em>Hernán</em></a> and <a href="https://link.springer.com/article/10.1007/s10985-015-9335-y"><em>Aalen et al.</em></a>.</p>
<p>Now, back to the simulation. In this case, we analyze everyone who has died within 4 periods, which is about 97% of the initial sample, virtually everyone.</p>
<pre class="r"><code>dDied <- dp[death == 1]
nrow(dDied)/nrow(d1)</code></pre>
<pre><code>## [1] 0.9658</code></pre>
<p>The effect estimate based on this data set is only unbiased when we are able to control for underlying health status. Otherwise, extending follow-up does not help remove any bias.</p>
<pre class="r"><code>fit4 <- glm(hospital ~ homenurse, data=dDied, family = "binomial")
fit5 <- glm(hospital ~ homenurse + u, data=dDied, family = "binomial")
stargazer(fit4, fit5, type = "text",
omit.stat = "all", omit.table.layout = "-asn")</code></pre>
<pre><code>##
## ==============================
## Dependent variable:
## --------------------
## hospital
## (1) (2)
## homenurse -0.383*** -0.048
## (0.041) (0.045)
##
## u 1.020***
## (0.028)
##
## Constant 0.296*** -2.028***
## (0.030) (0.070)
##
## ==============================</code></pre>
<p>In the future, I hope to explore alternative ways to analyze these types of questions. In the case of survival analysis, models that do not condition on death have been proposed to get at causal estimates. This may not be possible when the outcome of interest (health care before death) is defined by conditioning on death. We may actually need to frame the question slightly differently to be able to get an unbiased estimate.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Seow, H., Sutradhar, R., McGrail, K., Fassbender, K., Pataky, R., Lawson, B., Sussman, J., Burge, F. and Barbera, L., 2016. End-of-life cancer care: temporal association between homecare nursing and hospitalizations. <em>Journal of palliative medicine</em>, 19(3), pp.263-270.</p>
<p>Haneuse, S., 2016. Distinguishing selection bias and confounding bias in comparative effectiveness research. <em>Medical care</em>, 54(4), p.e23.</p>
<p>Hernán, M.A., Hernández-Díaz, S. and Robins, J.M., 2004. A structural approach to selection bias. <em>Epidemiology</em>, 15(5), pp.615-625.</p>
<p>Hernán, M.A., 2010. The hazards of hazard ratios. <em>Epidemiology</em>, 21(1), p.13.</p>
<p>Aalen, O.O., Cook, R.J. and Røysland, K., 2015. Does Cox analysis of a randomized survival study yield a causal treatment effect?. <em>Lifetime data analysis</em>, 21(4), pp.579-593.</p>
<p>Support:</p>
This research is supported by the National Institutes of Health National Institute on Aging R33AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.
</font></small>
</p>
</div>
</div>
There's always at least two ways to do the same thing: an example generating 3-level hierarchical data using simstudy
https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/
Thu, 03 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/
<p>“I am working on a simulation study that requires me to generate data for individuals within clusters, but each individual will have repeated measures (say baseline and two follow-ups). I’m new to simstudy and have been going through the examples in R this afternoon, but I wondered if this was possible in the package, and if so whether you could offer any tips to get me started with how I would do this?”</p>
<p>This question popped up in my in-box a couple of days ago. And since I always like an excuse to do a little coding early in the morning to get my brain going, I decided to create a little example, though in this case, there were at least two ways to go about it. I sent back both options, and am putting them up here, since I know this kind of data generation problem comes up frequently. In fact, the post I recently wrote on <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">open cohorts in stepped-wedge designs</a> had to deal with this same issue, though in a slightly more elaborate way.</p>
<div id="three-level-hierarchical-data" class="section level3">
<h3>Three-level hierarchical data</h3>
<p>In this example, we want individuals clustered within groups, and measurements clustered within individual, as depicted by this figure:</p>
<p><img src="https://www.rdatagen.net/img/post-twoways/cluster.png" /></p>
<p>The hierarchical scheme represented implies that outcomes for individuals within groups are correlated, and that measurements over time for a particular individual are correlated. The structure of these two levels of correlation can take on a variety of forms. In the examples that follow, I am going to assume that the correlation between the individuals in a group is constant, as are the individual measurements over time. We could easily make the assumption that measurements closer in time will be more highly correlated than measurements further apart in time (such as auto-regressive correlation with 1 period of lag), but since we have only three measurements, it is not totally unreasonable to assume constant correlation.</p>
</div>
<div id="generating-data-explicitly-with-random-effects" class="section level3">
<h3>Generating data explicitly with random effects</h3>
<p>Enough with the preliminaries - let’s get to the data generation. In the first approach, both levels of correlation will be induced with group- and individual-level random effects using the following underlying model:</p>
<p><span class="math display">\[Y_{ijt} = \beta_t + \gamma_j + \alpha_i + \epsilon_{ijt},\]</span></p>
<p>where <span class="math inline">\(Y_{ijt}\)</span> is the outcome for person <span class="math inline">\(i\)</span> in group <span class="math inline">\(j\)</span> during time period <span class="math inline">\(t\)</span>. <span class="math inline">\(\beta_t\)</span> is the mean outcome during period <span class="math inline">\(t\)</span>, <span class="math inline">\(t \in \{ 0,3, 6 \}\)</span>. <span class="math inline">\(\gamma_j\)</span> is the group-specific effect, and <span class="math inline">\(\gamma_j \sim N(0,\sigma^2_\gamma)\)</span>. <span class="math inline">\(\alpha_i\)</span> is the individual-specific effect, and <span class="math inline">\(\alpha_i \sim N(0,\sigma^2_\alpha)\)</span>. Finally, <span class="math inline">\(\epsilon_{ijt}\)</span> is the noise for each particular measurement, where <span class="math inline">\(\epsilon_{ijt} \sim N(0,\sigma^2_\epsilon)\)</span>.</p>
<p>The group, individual, and outcome definitions are the first order of business. In this example, <span class="math inline">\(\sigma^2_\gamma = 2\)</span>, <span class="math inline">\(\sigma^2_\alpha = 1.3\)</span>, and <span class="math inline">\(\sigma^2_\epsilon = 1.1\)</span>. In addition, the average outcomes at baseline, 3 months and 6 months, are 3, 4, and 6, respectively:</p>
<pre class="r"><code>library(simstudy)
### Group defintion
defg <- defData(varname = "gamma", formula=0, variance = 2, id = "cid")
### Individal definition
defi <- defDataAdd(varname = "alpha", formula = 0, variance = 1.3)
### Outcome definition
defC <- defCondition(condition = "period == 0",
formula = "3 + gamma + alpha",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1",
formula = "4 + gamma + alpha",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2",
formula = "6 + gamma + alpha",
dist = "nonrandom")
defy <- defDataAdd(varname = "y", formula = "mu", variance = 1.1)</code></pre>
<p>To generate the data, first we create the group level records, then the individual level records, and finally the repeated measurements for each individual:</p>
<pre class="r"><code>set.seed(3483)
dgrp1 <- genData(100, defg)
dind1 <- genCluster(dgrp1, "cid", numIndsVar = 20, level1ID = "id")
dind1 <- addColumns(defi, dind1)
dper1 <- addPeriods(dind1, nPeriods = 3, idvars = "id")
dper1 <- addCondition(defC, dper1, newvar = "mu")
dper1 <- addColumns(defy, dper1)</code></pre>
<p>Here is a plot of the outcome data by period, with the grey lines representing individuals, and the red lines representing the group averages:</p>
<p><img src="https://www.rdatagen.net/post/2019-10-03-in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Here is a calculation of the observed covariance matrix. The total variance for each outcome should be close to <span class="math inline">\(\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon = 4.4\)</span>, and the observed covariance should be close to <span class="math inline">\(\sigma^2_\gamma + \sigma^2_\alpha = 3.3\)</span></p>
<pre class="r"><code>dcor1 <- dcast(dper1, id + cid ~ period, value.var = "y")
setnames(dcor1, c("id", "cid", "y0", "y1", "y2"))
dcor1[, cov(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 4.5 3.2 3.4
## y1 3.2 4.2 3.2
## y2 3.4 3.2 4.6</code></pre>
<p>The correlation <span class="math inline">\(\rho\)</span> show be close to</p>
<p><span class="math display">\[ \rho = \frac{\sigma^2_\gamma + \sigma^2_\alpha}{\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon} = \frac{3.3}{4.4} = 0.75\]</span></p>
<p>(For a more elaborate derivation of correlation coefficients, see this <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">post</a> on stepped-wedge designs.)</p>
<pre class="r"><code>dcor1[, cor(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 1.00 0.73 0.75
## y1 0.73 1.00 0.73
## y2 0.75 0.73 1.00</code></pre>
</div>
<div id="directly-generating-correlated-data" class="section level3">
<h3>Directly generating correlated data</h3>
<p>In this second approach, the group-level correlation is once again generated using a group effect. However, the individual-level effect is replaced by noise that is explicitly correlated across time. The model here is</p>
<p><span class="math display">\[Y_{ijt} = \beta_t + \gamma_j + \phi_{ijt},\]</span></p>
<p>where the noise <span class="math inline">\(\mathbf{\phi}_{ij}\)</span> is a vector of noise components <span class="math inline">\(\{\phi_{ij0},\phi_{ij3},\phi_{ij6}\} \sim N(\mathbf{0}, \Sigma)\)</span>, and</p>
<p><span class="math display">\[\Sigma =
\left [
\begin{matrix}
\sigma^2_\phi & \rho \sigma^2_\phi & \rho \sigma^2_\phi \\
\rho \sigma^2_\phi & \sigma^2_\phi & \rho \sigma^2_\phi \\
\rho \sigma^2_\phi & \rho \sigma^2_\phi & \sigma^2_\phi
\end{matrix}
\right ]
\]</span></p>
<p>In this case <span class="math inline">\(\sigma^2_\gamma\)</span> is still 2, and <span class="math inline">\(\sigma^2_\phi = 2.4\)</span> to ensure that total variation is 4.4. We set <span class="math inline">\(\rho = 0.54167\)</span> so that the <span class="math inline">\(\rho \sigma^2_\phi = 1.3\)</span>, ensuring that the overall covariance of the observed outcome <span class="math inline">\(y\)</span> across periods is <span class="math inline">\(3.3\)</span> as in the first method.</p>
<pre class="r"><code>defg <- defData(varname = "gamma",
formula = 0, variance = 2, id = "cid")
defg <- defData(defg, varname = "mu",
formula = 0, dist = "nonrandom")
defg <- defData(defg, varname = "phi",
formula = 2.4, dist = "nonrandom")
defC <- defCondition(condition = "period == 0",
formula = "3 + gamma + e",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1",
formula = "4 + gamma + e",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2",
formula = "6 + gamma + e",
dist = "nonrandom")</code></pre>
<p>In the data generation process, the function <code>addCorGen</code> is used to create the correlated noise across time:</p>
<pre class="r"><code>set.seed(3483)
dgrp2 <- genData(100, defg)
dind2 <- genCluster(dgrp2, "cid", numIndsVar = 20, level1ID = "id")
dper2 <- addPeriods(dind2, nPeriods = 3, idvars = "id")
dper2 <- addCorGen(dper2, "id", nvars = 3, param1 = "mu", param2 = "phi",
rho = .54167, dist = "normal", corstr = "cs", cnames = "e")
dper2 <- addCondition(defC, dper2, newvar = "y")</code></pre>
<p>I won’t do a second plot, because it would look identical to the one above. But I am calculating the covariance and correlation matrices for the outcome to illustrate for you that the two slightly different approaches do indeed generate similarly distributed data.</p>
<pre class="r"><code>dcor2 <- dcast(dper2, id + cid ~ period, value.var = "y")
setnames(dcor2, c("id", "cid", "y0", "y1", "y2"))
dcor2[, cov(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 4.4 3.4 3.3
## y1 3.4 4.4 3.4
## y2 3.3 3.4 4.5</code></pre>
<pre class="r"><code>dcor2[, cor(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 1.00 0.76 0.75
## y1 0.76 1.00 0.76
## y2 0.75 0.76 1.00</code></pre>
<p>In the example here, I wouldn’t say either approach is better. For some, the purely random effects approach may be more intuitive, and for others the correlated noise might be. However, if we want a more complex correlation pattern, like the AR-1 pattern I mentioned earlier, one approach may in fact be a little more straightforward to implement.</p>
<p>And no, I don’t respond so thoroughly to every question I get; sometimes it is better for you to struggle a bit to figure something out.</p>
</div>
Simulating an open cohort stepped-wedge trial
https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/
Tue, 17 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/
<p>In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with <a href="https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/">stepped-wedge designs</a>, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time. This means that for some patients, the data collection period straddles the transition from control to intervention.</p>
<p>Whenever I contemplate a simulation, I first think about the general structure of the data generating process before even thinking about outcome model. In the case of a more standard two-arm randomized trial, that structure is quite simple and doesn’t require much, if any, thought. In this case, however, the overlaying of a longitudinal patient outcome process on top of a stepped-wedge design presents a little bit of a challenge.</p>
<p>Adding to the challenge is that, in addition to being a function of site- and individual-specific characteristics/effects, the primary outcome will likely be a function of time-varying factors. In particular here, certain patient-level health-related factors that might contribute to the decision to prescribe anti-psychotic medications, and the time-varying intervention status, which is determined by the stepped-wedge randomization scheme. So, the simulation needs to accommodate the generation of both types of time-varying variables.</p>
<p>I’ve developed a bare-boned simulation of sites and patients to provide a structure that I can add to at some point in the future. While this is probably a pretty rare study design (though as stepped-wedge designs become more popular, it may be less rare than I am imagining), I thought the code could provide yet another example of how to approach a potentially vexing simulation in a relatively simple way.</p>
<div id="data-definition" class="section level3">
<h3>Data definition</h3>
<p>The focus here is on the structure of the data, so I am not generating any outcome data. However, in addition to generating the treatment assignment, I am creating the time-varying health status, which will affect the outcome process when I get to that.</p>
<p>In this simulation, there will be 5 sites, each followed for 25 weeks (starting with week 0). Each week, a site will have approximately 20 new patients, so we should expect to generate around <span class="math inline">\(5 \times 25 \times 20 = 2500\)</span> total patients.</p>
<p>For each patient, we will be generating a series of health status, which ranges from 1 to 4, with 1 being healthiest, and 4 being death. I will use a <a href="https://www.rdatagen.net/post/simstudy-1-14-update/">Markov chain</a> to generate this series. Two arguments required to simulate the Markov process are the starting state (which is created in <code>S0</code>) and the transition matrix <code>P</code>, which determines the probabilities of moving from one state to another.</p>
<pre class="r"><code>NPER <- 25
perDef <- defDataAdd(varname = "npatient", formula = 20,
dist = "poisson")
patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
dist = "categorical")
P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
0.1, 0.3, 0.4, 0.2,
0.0, 0.1, 0.5, 0.4,
0.0, 0.0, 0.0, 1.0),
nrow = 4))</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generation process starts with the sites and then proceeds to the patient level data. To begin, the five sites are generated (for now without any site-specific variables, but that could easily be modified in the future). Next, records for each site for each of the 25 periods (from week 0 to week 24) are generated; these site level records include the number patients to be generated for each site, each week:</p>
<pre class="r"><code>set.seed(3837263)
dsite <- genData(5, id = "site")
dper <- addPeriods(dsite, nPeriods = NPER, idvars = "site",
timeid = "site.time", perName = "period")
dper <- addColumns(perDef, dper)
dper</code></pre>
<pre><code>## site period site.time npatient
## 1: 1 0 1 17
## 2: 1 1 2 20
## 3: 1 2 3 25
## 4: 1 3 4 18
## 5: 1 4 5 23
## ---
## 121: 5 20 121 17
## 122: 5 21 122 15
## 123: 5 22 123 16
## 124: 5 23 124 19
## 125: 5 24 125 20</code></pre>
<p>Now, we assign each of the five sites to its own intervention “wave”. The first site starts at the beginning of the the study, week 0. The second starts 4 weeks later at week 4, and so on, until the fifth and last site starts the intervention at week 16. (Obviously, a more realistic simulation would include many more sites, but all of this can easily be scaled up.) The intervention indicator is <span class="math inline">\(I_{ct}\)</span>, and is set to 1 when cluster <span class="math inline">\(c\)</span> during week <span class="math inline">\(t\)</span> is in the intervention, and is 0 otherwise.</p>
<pre class="r"><code>dsw <- trtStepWedge(dper, "site", nWaves = 5, lenWaves = 4,
startPer = 0, perName = "period",
grpName = "Ict")
dsw <- dsw[, .(site, period, startTrt, Ict)]</code></pre>
<p>Here are the intervention assignments for the first two sites during the first 8 weeks.</p>
<pre class="r"><code>dsw[site %in% c(1,2) & period < 8]</code></pre>
<pre><code>## site period startTrt Ict
## 1: 1 0 0 1
## 2: 1 1 0 1
## 3: 1 2 0 1
## 4: 1 3 0 1
## 5: 1 4 0 1
## 6: 1 5 0 1
## 7: 1 6 0 1
## 8: 1 7 0 1
## 9: 2 0 4 0
## 10: 2 1 4 0
## 11: 2 2 4 0
## 12: 2 3 4 0
## 13: 2 4 4 1
## 14: 2 5 4 1
## 15: 2 6 4 1
## 16: 2 7 4 1</code></pre>
<p>To generate the patients, we start by generating the 2500 or so individual records. The single baseline factor that we include this time around is the starting health status <code>S0</code>.</p>
<pre class="r"><code>dpat <- genCluster(dper, cLevelVar = "site.time",
numIndsVar = "npatient", level1ID = "id")
dpat <- addColumns(patDef, dpat)
dpat</code></pre>
<pre><code>## site period site.time npatient id S0
## 1: 1 0 1 17 1 2
## 2: 1 0 1 17 2 1
## 3: 1 0 1 17 3 2
## 4: 1 0 1 17 4 2
## 5: 1 0 1 17 5 1
## ---
## 2524: 5 24 125 20 2524 3
## 2525: 5 24 125 20 2525 2
## 2526: 5 24 125 20 2526 1
## 2527: 5 24 125 20 2527 1
## 2528: 5 24 125 20 2528 1</code></pre>
<p>Here is a visualization of the patients (it turns out there are 2528 of them) by site and starting point, with each point representing a patient. The color represents the intervention status: light blue is control (pre-intervention) and dark blue is intervention. Even though a patient may start in the pre-intervention period, they may actually receive services in the intervention period, as we will see further on down.</p>
<p><img src="https://www.rdatagen.net/post/2019-09-17-simulating-an-open-cohort-stepped-wedge-trial.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>The patient health status series are generated using a Markov chain process. This particular transition matrix has an “absorbing” state, as indicated by the probability 1 in the last row of the matrix. Once a patient enters state 4, they will not transition to any other state. (In this case, state 4 is death.)</p>
<pre class="r"><code>dpat <- addMarkov(dpat, transMat = P,
chainLen = NPER, id = "id",
pername = "seq", start0lab = "S0")
dpat</code></pre>
<pre><code>## site period site.time npatient id S0 seq state
## 1: 1 0 1 17 1 2 1 2
## 2: 1 0 1 17 1 2 2 3
## 3: 1 0 1 17 1 2 3 3
## 4: 1 0 1 17 1 2 4 3
## 5: 1 0 1 17 1 2 5 4
## ---
## 63196: 5 24 125 20 2528 1 21 4
## 63197: 5 24 125 20 2528 1 22 4
## 63198: 5 24 125 20 2528 1 23 4
## 63199: 5 24 125 20 2528 1 24 4
## 63200: 5 24 125 20 2528 1 25 4</code></pre>
<p>Now, we aren’t interested in the periods following the one where death occurs. So, we want to trim the data.table <code>dpat</code> to include only those periods leading up to state 4 and the first period in which state 4 is entered. We do this first by identifying the first time a state of 4 is encountered for each individual (and if an individual never reaches state 4, then all the individual’s records are retained, and the variable <code>.last</code> is set to the maximum number of periods <code>NPER</code>, in this case 25).</p>
<pre class="r"><code>dlast <- dpat[, .SD[state == 4][1,], by = id][, .(id, .last = seq)]
dlast[is.na(.last), .last := NPER]
dlast</code></pre>
<pre><code>## id .last
## 1: 1 5
## 2: 2 13
## 3: 3 2
## 4: 4 6
## 5: 5 3
## ---
## 2524: 2524 7
## 2525: 2525 5
## 2526: 2526 19
## 2527: 2527 20
## 2528: 2528 8</code></pre>
<p>Next, we use the <code>dlast</code> data.table to “trim” <code>dpat</code>. We further trim the data set so that we do not have patient-level observations that extend beyond the overall follow-up period:</p>
<pre class="r"><code>dpat <- dlast[dpat][seq <= .last][ , .last := NULL][]
dpat[, period := period + seq - 1]
dpat <- dpat[period < NPER]
dpat</code></pre>
<pre><code>## id site period site.time npatient S0 seq state
## 1: 1 1 0 1 17 2 1 2
## 2: 1 1 1 1 17 2 2 3
## 3: 1 1 2 1 17 2 3 3
## 4: 1 1 3 1 17 2 4 3
## 5: 1 1 4 1 17 2 5 4
## ---
## 12608: 2524 5 24 125 20 3 1 3
## 12609: 2525 5 24 125 20 2 1 2
## 12610: 2526 5 24 125 20 1 1 1
## 12611: 2527 5 24 125 20 1 1 1
## 12612: 2528 5 24 125 20 1 1 1</code></pre>
<p>And finally, we merge the patient data with the stepped-wedge treatment assignment data to create the final data set. The individual outcomes for each week could now be generated, because would we have all the baseline and time-varying information in a single data set.</p>
<pre class="r"><code>dpat <- merge(dpat, dsw, by = c("site","period"))
setkey(dpat, id, period)
dpat <- delColumns(dpat, c("site.time", "seq", "npatient"))
dpat</code></pre>
<pre><code>## site period id S0 state startTrt Ict
## 1: 1 0 1 2 2 0 1
## 2: 1 1 1 2 3 0 1
## 3: 1 2 1 2 3 0 1
## 4: 1 3 1 2 3 0 1
## 5: 1 4 1 2 4 0 1
## ---
## 12608: 5 24 2524 3 3 16 1
## 12609: 5 24 2525 2 2 16 1
## 12610: 5 24 2526 1 1 16 1
## 12611: 5 24 2527 1 1 16 1
## 12612: 5 24 2528 1 1 16 1</code></pre>
<p>Here is what the individual trajectories of health state look like. In the plot, each column represents a different site, and each row represents a different starting week. For example the fifth row represents patients who appear for the first time in week 4. Sites 1 and 2 are already in the intervention in week 4, so none of these patients will transition. However, patients in sites 3 through 5 enter in the pre-intervention stage in week 4, and transition into the intervention at different points, depending on the site.</p>
<p><img src="https://www.rdatagen.net/post/2019-09-17-simulating-an-open-cohort-stepped-wedge-trial.en_files/figure-html/unnamed-chunk-11-1.png" width="1056" /></p>
<p>The basic structure is in place, so we are ready to extend this simulation to include more covariates, random effects, and outcomes. And once we’ve done that, we can explore analytic approaches.</p>
<p>
<small><font color="darkkhaki">This study is supported by the National Institutes of Health National Institute on Aging R61AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</font></small>
</p>
</div>
Analyzing a binary outcome arising out of within-cluster, pair-matched randomization
https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/
Tue, 03 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/
<p>A key motivating factor for the <code>simstudy</code> package and much of this blog is that simulation can be super helpful in understanding how best to approach an unusual, or least unfamiliar, analytic problem. About six months ago, I <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">described</a> the DREAM Initiative (Diabetes Research, Education, and Action for Minorities), a study that used a slightly innovative randomization scheme to ensure that two comparison groups were evenly balanced across important covariates. At the time, we hadn’t finalized the analytic plan. But, now that we have started actually randomizing and recruiting (yes, in that order, oddly enough), it is important that we do that, with the help of a little simulation.</p>
<div id="the-study-design" class="section level3">
<h3>The study design</h3>
<p>The <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">original post</a> has the details about the design and matching algorithm (and code). The randomization is taking place at 20 primary care clinics, and patients within these clinics are matched based on important characteristics before randomization occurs. There is little or no risk that patients in the control arm will be “contaminated” or affected by the intervention that is taking place, which will minimize the effects of clustering. However, we may not want to ignore the clustering altogether.</p>
</div>
<div id="possible-analytic-solutions" class="section level3">
<h3>Possible analytic solutions</h3>
<p>Given that the primary outcome is binary, one reasonable procedure to assess whether or not the intervention is effective is McNemar’s test, which is typically used for paired dichotomous data. However, this approach has two limitations. First, McNemar’s test does not take into account the clustered nature of the data. Second, the test is just that, a test; it does not provide an estimate of effect size (and the associated confidence interval).</p>
<p>So, in addition to McNemar’s test, I considered four additional analytic approaches to assess the effect of the intervention: (1) Durkalski’s extension of McNemar’s test to account for clustering, (2) conditional logistic regression, which takes into account stratification and matching, (3) standard logistic regression with specific adjustment for the three matching variables, and (4) mixed effects logistic regression with matching covariate adjustment and a clinic-level random intercept. (In the mixed effects model, I assume the treatment effect does not vary by site, since I have also assumed that the intervention is delivered in a consistent manner across the sites. These may or may not be reasonable assumptions.)</p>
<p>While I was interested to see how the two tests (McNemar and the extension) performed, my primary goal was to see if any of the regression models was superior. In order to do this, I wanted to compare the methods in a scenario without any intervention effect, and in another scenario where there <em>was</em> an effect. I was interested in comparing bias, error rates, and variance estimates.</p>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generation process parallels the earlier <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">post</a>. The treatment assignment is made in the context of the matching process, which I am not showing this time around. Note that in this initial example, the outcome <code>y</code> depends on the intervention <code>rx</code> (i.e. there <em>is</em> an intervention effect).</p>
<pre class="r"><code>library(simstudy)
### defining the data
defc <- defData(varname = "ceffect", formula = 0, variance = 0.4,
dist = "normal", id = "cid")
defi <- defDataAdd(varname = "male", formula = .4, dist = "binary")
defi <- defDataAdd(defi, varname = "age", formula = 0, variance = 40)
defi <- defDataAdd(defi, varname = "bmi", formula = 0, variance = 5)
defr <- defDataAdd(varname = "y",
formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0.45*rx + ceffect",
dist = "binary", link = "logit")
### generating the data
set.seed(547317)
dc <- genData(20, defc)
di <- genCluster(dc, "cid", 60, "id")
di <- addColumns(defi, di)
### matching and randomization within cluster (cid)
library(parallel)
library(Matching)
RNGkind("L'Ecuyer-CMRG") # to set seed for parallel process
### See addendum for dmatch code
dd <- rbindlist(mclapply(1:nrow(dc),
function(x) dmatch(di[cid == x]),
mc.set.seed = TRUE
)
)
### generate outcome
dd <- addColumns(defr, dd)
setkey(dd, pair)
dd</code></pre>
<pre><code>## cid ceffect id male age bmi rx pair y
## 1: 1 1.168 11 1 4.35 0.6886 0 1.01 1
## 2: 1 1.168 53 1 3.85 0.2215 1 1.01 1
## 3: 1 1.168 51 0 6.01 -0.9321 0 1.02 0
## 4: 1 1.168 58 0 7.02 0.1407 1 1.02 1
## 5: 1 1.168 57 0 9.25 -1.3253 0 1.03 1
## ---
## 798: 9 -0.413 504 1 -8.72 -0.0767 1 9.17 0
## 799: 9 -0.413 525 0 1.66 3.5507 0 9.18 0
## 800: 9 -0.413 491 0 4.31 2.6968 1 9.18 0
## 801: 9 -0.413 499 0 7.36 0.6064 0 9.19 0
## 802: 9 -0.413 531 0 8.05 0.8068 1 9.19 0</code></pre>
<p>Based on the outcomes of each individual, each pair can be assigned to a particular category that describes the outcomes. Either both fail, both succeed, or one fails and the other succeeds. These category counts can be represented in a <span class="math inline">\(2 \times 2\)</span> contingency table. The counts are the number of pairs in each of the four possible pairwise outcomes. For example, there were 173 pairs where the outcome was determined to be unsuccessful for both intervention and control arms.</p>
<pre class="r"><code>dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(0,1),
labels = c("no success", "success"))]
dpair[, rx := factor(`1`, levels = c(0, 1),
labels = c("no success", "success"))]
dpair[, table(control,rx)]</code></pre>
<pre><code>## rx
## control no success success
## no success 173 102
## success 69 57</code></pre>
<p>Here is a figure that depicts the <span class="math inline">\(2 \times 2\)</span> matrix, providing a visualization of how the treatment and control group outcomes compare. (The code is in the addendum in case anyone wants to see the lengths I took to make this simple graphic.)</p>
<p><img src="https://www.rdatagen.net/post/2019-09-03-analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization.en_files/figure-html/unnamed-chunk-4-1.png" width="576" /></p>
<div id="mcnemars-test" class="section level4">
<h4>McNemar’s test</h4>
<p>McNemar’s test requires the data to be in table format, and the test really only takes into consideration the cells which represent disagreement between treatment arms. In terms of the matrix above, this would be the lower left and upper right quadrants.</p>
<pre class="r"><code>ddc <- dcast(dd, pair ~ rx, value.var = "y")
dmat <- ddc[, .N, keyby = .(`0`,`1`)][, matrix(N, 2, 2, byrow = T)]
mcnemar.test(dmat)</code></pre>
<pre><code>##
## McNemar's Chi-squared test with continuity correction
##
## data: dmat
## McNemar's chi-squared = 6, df = 1, p-value = 0.01</code></pre>
<p>Based on the p-value = 0.01, we would reject the null hypothesis that the intervention has no effect.</p>
</div>
<div id="durkalski-extension-of-mcnemars-test" class="section level4">
<h4>Durkalski extension of McNemar’s test</h4>
<p>Durkalski’s test also requires the data to be in tabular form, though there essentially needs to be a table for each cluster. The <code>clust.bin.pair</code> function needs us to separate the table into vectors <code>a</code>, <code>b</code>, <code>c</code>, and <code>d</code>, where each element in each of the vectors is a count for a specific cluster. Vector <code>a</code> is collection of counts for the upper left hand quadrants, <code>b</code> is for the upper right hand quadrants, etc. We have 20 clusters, so each of the four vectors has length 20. Much of the work done in the code below is just getting the data in the right form for the function.</p>
<pre class="r"><code>library(clust.bin.pair)
ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
ddc[, ypair := 2*`0` + 1*`1`]
dvec <- ddc[, .N, keyby=.(cid, ypair)]
allpossible <- data.table(expand.grid(1:20, 0:3))
setnames(allpossible, c("cid","ypair"))
setkey(dvec, cid, ypair)
setkey(allpossible, cid, ypair)
dvec <- dvec[allpossible]
dvec[is.na(N), N := 0]
a <- dvec[ypair == 0, N]
b <- dvec[ypair == 1, N]
c <- dvec[ypair == 2, N]
d <- dvec[ypair == 3, N]
clust.bin.pair(a, b, c, d, method = "durkalski")</code></pre>
<pre><code>##
## Durkalski's Chi-square test
##
## data: a, b, c, d
## chi-square = 5, df = 1, p-value = 0.03</code></pre>
<p>Again, the p-value, though larger, leads us to reject the null.</p>
</div>
<div id="conditional-logistic-regression" class="section level4">
<h4>Conditional logistic regression</h4>
<p>Conditional logistic regression is conditional on the pair. Since the pair is similar with respect to the matching variables, no further adjustment (beyond specifying the strata) is necessary.</p>
<pre class="r"><code>library(survival)
summary(clogit(y ~ rx + strata(pair), data = dd))$coef["rx",]</code></pre>
<pre><code>## coef exp(coef) se(coef) z Pr(>|z|)
## 0.3909 1.4783 0.1559 2.5076 0.0122</code></pre>
<p> </p>
</div>
<div id="logistic-regression-with-matching-covariates-adjustment" class="section level4">
<h4>Logistic regression with matching covariates adjustment</h4>
<p>Using logistic regression should in theory provide a reasonable estimate of the treatment effect, though given that there is clustering, I wouldn’t expect the standard error estimates to be correct. Although we are not specifically modeling the matching, by including covariates used in the matching, we are effectively estimating a model that is conditional on the pair.</p>
<pre class="r"><code>summary(glm(y~rx + age + male + bmi, data = dd,
family = "binomial"))$coef["rx",]</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## 0.3679 0.1515 2.4285 0.0152</code></pre>
<p> </p>
</div>
<div id="generalized-mixed-effects-model-with-matching-covariates-adjustment" class="section level4">
<h4>Generalized mixed effects model with matching covariates adjustment</h4>
<p>The mixed effects model merely improves on the logistic regression model by ensuring that any clustering effects are reflected in the estimates.</p>
<pre class="r"><code>library(lme4)
summary(glmer(y ~ rx + age + male + bmi + (1|cid), data= dd,
family = "binomial"))$coef["rx",]</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## 0.4030 0.1586 2.5409 0.0111</code></pre>
<p> </p>
</div>
</div>
<div id="comparing-the-analytic-approaches" class="section level3">
<h3>Comparing the analytic approaches</h3>
<p>To compare the methods, I generated 1000 data sets under each scenario. As I mentioned, I wanted to conduct the comparison under two scenarios. The first when there is no intervention effect, and the second with an effect (I will use the effect size used to generate the first data set.</p>
<p>I’ll start with no intervention effect. In this case, the outcome definition sets the true parameter of <code>rx</code> to 0.</p>
<pre class="r"><code>defr <- defDataAdd(varname = "y",
formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0*rx + ceffect",
dist = "binary", link = "logit")</code></pre>
<p>Using the updated definition, I generate 1000 datasets, and for each one, I apply the five analytic approaches. The results from each iteration are stored in a large list. (The code for the iterative process is shown in the addendum below.) As an example, here are the contents from the 711th iteration:</p>
<pre class="r"><code>res[[711]]</code></pre>
<pre><code>## $clr
## coef exp(coef) se(coef) z Pr(>|z|)
## rx -0.0263 0.974 0.162 -0.162 0.871
##
## $glm
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6583 0.1247 -5.279 1.30e-07
## rx -0.0309 0.1565 -0.198 8.43e-01
## age -0.0670 0.0149 -4.495 6.96e-06
## male -0.5131 0.1647 -3.115 1.84e-03
## bmi 0.1308 0.0411 3.184 1.45e-03
##
## $glmer
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7373 0.1888 -3.91 9.42e-05
## rx -0.0340 0.1617 -0.21 8.33e-01
## age -0.0721 0.0156 -4.61 4.05e-06
## male -0.4896 0.1710 -2.86 4.20e-03
## bmi 0.1366 0.0432 3.16 1.58e-03
##
## $mcnemar
##
## McNemar's Chi-squared test with continuity correction
##
## data: dmat
## McNemar's chi-squared = 0.007, df = 1, p-value = 0.9
##
##
## $durk
##
## Durkalski's Chi-square test
##
## data: a, b, c, d
## chi-square = 0.1, df = 1, p-value = 0.7</code></pre>
</div>
<div id="summary-statistics" class="section level3">
<h3>Summary statistics</h3>
<p>To compare the five methods, I am first looking at the proportion of iterations where the p-value is less then 0.05, in which case we would reject the the null hypothesis. (In the case where the null is true, the proportion is the Type 1 error rate; when there is truly an effect, then the proportion is the power.) I am less interested in the hypothesis test than the bias and standard errors, but the first two methods only provide a p-value, so that is all we can assess them on.</p>
<p>Next, I calculate the bias, which is the average effect estimate minus the true effect. And finally, I evaluate the standard errors by looking at the estimated standard error as well as the observed standard error (which is the standard deviation of the point estimates).</p>
<pre class="r"><code>pval <- data.frame(
mcnm = mean(sapply(res, function(x) x$mcnemar$p.value <= 0.05)),
durk = mean(sapply(res, function(x) x$durk$p.value <= 0.05)),
clr =mean(sapply(res, function(x) x$clr["rx","Pr(>|z|)"] <= 0.05)),
glm = mean(sapply(res, function(x) x$glm["rx","Pr(>|z|)"] <= 0.05)),
glmer = mean(sapply(res, function(x) x$glmer["rx","Pr(>|z|)"] <= 0.05))
)
bias <- data.frame(
clr = mean(sapply(res, function(x) x$clr["rx", "coef"])),
glm = mean(sapply(res, function(x) x$glm["rx", "Estimate"])),
glmer = mean(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)
se <- data.frame(
clr = mean(sapply(res, function(x) x$clr["rx", "se(coef)"])),
glm = mean(sapply(res, function(x) x$glm["rx", "Std. Error"])),
glmer = mean(sapply(res, function(x) x$glmer["rx", "Std. Error"]))
)
obs.se <- data.frame(
clr = sd(sapply(res, function(x) x$clr["rx", "coef"])),
glm = sd(sapply(res, function(x) x$glm["rx", "Estimate"])),
glmer = sd(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)
sumstat <- round(plyr::rbind.fill(pval, bias, se, obs.se), 3)
rownames(sumstat) <- c("prop.rejected", "bias", "se.est", "se.obs")
sumstat</code></pre>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.035 0.048 0.043 0.038 0.044
## bias NA NA 0.006 0.005 0.005
## se.est NA NA 0.167 0.161 0.167
## se.obs NA NA 0.164 0.153 0.164</code></pre>
<p>In this first case, where the true underlying effect size is 0, the Type 1 error rate should be 0.05. The Durkalski test, the conditional logistical regression, and the mixed effects model are below that level but closer than the other two methods. All three models provide unbiased point estimates, but the standard logistic regression (glm) underestimates the standard errors. The results from the conditional logistic regression and the mixed effects model are quite close across the board.</p>
<p>Here are the summary statistics for a data set with an intervention effect of 0.45. The results are consistent with the “no effect” simulations, except that the standard linear regression model exhibits some bias. In reality, this is not necessarily bias, but a different estimand. The model that ignores clustering is a marginal model (with respect to the site), whereas the conditional logistic regression and mixed effects models are conditional on the site. (I’ve described this phenomenon <a href="https://www.rdatagen.net/post/marginal-v-conditional/">here</a> and <a href="https://www.rdatagen.net/post/mixed-effect-models-vs-gee/">here</a>.) We are interested in the conditional effect here, so that argues for the conditional models.</p>
<p>The conditional logistic regression and the mixed effects model yielded similar estimates, though the mixed effects model had slightly higher power, which is the reason I opted to use this approach at the end of the day.</p>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.766 0.731 0.784 0.766 0.796
## bias NA NA 0.000 -0.033 -0.001
## se.est NA NA 0.164 0.156 0.162
## se.obs NA NA 0.165 0.152 0.162</code></pre>
<p>In this last case, the true underlying data generating process still includes an intervention effect but <em>no clustering</em>. In this scenario, all of the analytic yield similar estimates. However, since there is no guarantee that clustering is not a factor, the mixed effects model will still be the preferred approach.</p>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.802 0.774 0.825 0.828 0.830
## bias NA NA -0.003 -0.002 -0.001
## se.est NA NA 0.159 0.158 0.158
## se.obs NA NA 0.151 0.150 0.150</code></pre>
<p>
<small><font color="darkkhaki">The DREAM Initiative is supported by the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases R01DK11048. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</font></small>
</p>
<p> </p>
</div>
<div id="addendum-multiple-datasets-and-model-estimates" class="section level3">
<h3>Addendum: multiple datasets and model estimates</h3>
<pre class="r"><code>gen <- function(nclust, m) {
dc <- genData(nclust, defc)
di <- genCluster(dc, "cid", m, "id")
di <- addColumns(defi, di)
dr <- rbindlist(mclapply(1:nrow(dc), function(x) dmatch(di[cid == x])))
dr <- addColumns(defr, dr)
dr[]
}
iterate <- function(ncluster, m) {
dd <- gen(ncluster, m)
clrfit <- summary(clogit(y ~ rx + strata(pair), data = dd))$coef
glmfit <- summary(glm(y~rx + age + male + bmi, data = dd,
family = binomial))$coef
mefit <- summary(glmer(y~rx + age + male + bmi + (1|cid), data= dd,
family = binomial))$coef
## McNemar
ddc <- dcast(dd, pair ~ rx, value.var = "y")
dmat <- ddc[, .N, keyby = .(`0`,`1`)][, matrix(N, 2, 2, byrow = T)]
mc <- mcnemar.test(dmat)
# Clustered McNemar
ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
ddc[, ypair := 2*`0` + 1*`1`]
dvec <- ddc[, .N, keyby=.(cid, ypair)]
allpossible <- data.table(expand.grid(1:20, 0:3))
setnames(allpossible, c("cid","ypair"))
setkey(dvec, cid, ypair)
setkey(allpossible, cid, ypair)
dvec <- dvec[allpossible]
dvec[is.na(N), N := 0]
a <- dvec[ypair == 0, N]
b <- dvec[ypair == 1, N]
c <- dvec[ypair == 2, N]
d <- dvec[ypair == 3, N]
durk <- clust.bin.pair(a, b, c, d, method = "durkalski")
list(clr = clrfit, glm = glmfit, glmer = mefit,
mcnemar = mc, durk = durk)
}
res <- mclapply(1:1000, function(x) iterate(20, 60))</code></pre>
<p> </p>
<div id="code-to-generate-figure" class="section level4">
<h4>Code to generate figure</h4>
<pre class="r"><code>library(ggmosaic)
dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(1,0),
labels = c("success", "no success"))]
dpair[, rx := factor(`1`, levels = c(0, 1),
labels = c("no success", "success"))]
p <- ggplot(data = dpair) +
geom_mosaic(aes(x = product(control, rx)))
pdata <- data.table(ggplot_build(p)$data[[1]])
pdata[, mcnemar := factor(c("diff","same","same", "diff"))]
textloc <- pdata[c(1,4), .(x=(xmin + xmax)/2, y=(ymin + ymax)/2)]
ggplot(data = pdata) +
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax,
fill = mcnemar)) +
geom_label(data = pdata,
aes(x = (xmin+xmax)/2, y = (ymin+ymax)/2, label=.wt),
size = 3.2) +
scale_x_continuous(position = "top",
breaks = textloc$x,
labels = c("no success", "success"),
name = "intervention",
expand = c(0,0)) +
scale_y_continuous(breaks = textloc$y,
labels = c("success", "no success"),
name = "control",
expand = c(0,0)) +
scale_fill_manual(values = c("#6b5dd5", "grey80")) +
theme(panel.grid = element_blank(),
legend.position = "none",
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 0, hjust = 0.5),
axis.text.y = element_text(angle = 90, hjust = 0.5)
)</code></pre>
<p> </p>
</div>
<div id="original-matching-algorithm" class="section level4">
<h4>Original matching algorithm</h4>
<pre class="r"><code>dmatch <- function(dsamp) {
dsamp[, rx := 0]
dused <- NULL
drand <- NULL
dcntl <- NULL
while (nrow(dsamp) > 1) {
selectRow <- sample(1:nrow(dsamp), 1)
dsamp[selectRow, rx := 1]
myTr <- dsamp[, rx]
myX <- as.matrix(dsamp[, .(male, age, bmi)])
match.dt <- Match(Tr = myTr, X = myX,
caliper = c(0, 0.50, .50), ties = FALSE)
if (length(match.dt) == 1) { # no match
dused <- rbind(dused, dsamp[selectRow])
dsamp <- dsamp[-selectRow, ]
} else { # match
trt <- match.dt$index.treated
ctl <- match.dt$index.control
drand <- rbind(drand, dsamp[trt])
dcntl <- rbind(dcntl, dsamp[ctl])
dsamp <- dsamp[-c(trt, ctl)]
}
}
dcntl[, pair := paste0(cid, ".", formatC(1:.N, width=2, flag="0"))]
drand[, pair := paste0(cid, ".", formatC(1:.N, width=2, flag="0"))]
rbind(dcntl, drand)
}</code></pre>
</div>
</div>
simstudy updated to version 0.1.14: implementing Markov chains
https://www.rdatagen.net/post/simstudy-1-14-update/
Tue, 20 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-1-14-update/
<p>I’m developing study simulations that require me to generate a sequence of health status for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends only on the previous health status (in the immediate past). While there are packages to do this sort of thing (see for example the <a href="https://cran.r-project.org/web/packages/markovchain/index.html">markovchain</a> package), I hadn’t yet stumbled upon them while I was tackling my problem. So, I wrote my own functions, which I’ve now incorporated into the latest version of <code>simstudy</code> that is now available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>. As a way of announcing the new release, here is a brief overview of Markov chains and the new functions. (See <a href="https://cran.r-project.org/web/packages/simstudy/news/news.html">here</a> for a more complete list of changes.)</p>
<div id="markov-processes" class="section level3">
<h3>Markov processes</h3>
<p>The key “parameter” of a stochastic Markov process is the transition matrix, which defines the probability of moving from one state to another (or remaining in the same state). Each row of the matrix is indexed by the current state, while the columns are indexed by the target state. The values of the matrix represent the probabilities of transitioning from the current state to the target state. The sum of the probabilities across each row must equal one.</p>
<p>In the transition matrix below, there are three states <span class="math inline">\((1, 2, 3)\)</span>. The probability of moving from state 1 to state 3 is represented by <span class="math inline">\(p_{13}\)</span>. Likewise the probability of moving from state 3 to state 2 is <span class="math inline">\(p_{32}\)</span>. And <span class="math inline">\(\sum_{j=1}^3 p_{ij} = 1\)</span> for all <span class="math inline">\(i \in (1,2,3)\)</span>.</p>
<p><span class="math display">\[
\left(
\begin{matrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{matrix}
\right )
\]</span></p>
<p>Here’s a possible <span class="math inline">\(3 \times 3\)</span> transition matrix:</p>
<p><span class="math display">\[
\left(
\begin{matrix}
0.5 & 0.4 & 0.1 \\
0.2 & 0.5 & 0.3 \\
0.0 & 0.0 & 1.0
\end{matrix}
\right )
\]</span></p>
<p>In this case, the probability of moving from state 1 to state 2 is <span class="math inline">\(40\%\)</span>, whereas there is no possibility that you can move from 3 to 1 or 2. (State 3 is considered to be an “absorbing” state since it is not possible to leave; if we are talking about health status, state 3 could be death.)</p>
</div>
<div id="function-genmarkov" class="section level3">
<h3>function genMarkov</h3>
<p>The new function <code>genMarkov</code> generates a random sequence for the specified number of individuals. (The sister function <code>addMarkov</code> is quite similar, though it allows users to add a Markov chain to an existing data set.) In addition to defining the transition matrix, you need to indicate the length of the chain to be generated for each simulated unit or person. The data can be returned either in long or wide form, depending on how you’d ultimately like to use the data. In the first case, I am generating wide format data for sequences of length of 6 for 12 individuals:</p>
<pre class="r"><code>library(simstudy)
set.seed(3928398)
tmatrix <- matrix(c(0.5, 0.4, 0.1,
0.2, 0.5, 0.3,
0.0, 0.0, 1.0), 3, 3, byrow = T)
dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = TRUE)
dd</code></pre>
<pre><code>## id S1 S2 S3 S4 S5 S6
## 1: 1 1 2 2 1 2 2
## 2: 2 1 1 2 2 2 3
## 3: 3 1 1 2 3 3 3
## 4: 4 1 2 2 1 1 2
## 5: 5 1 1 2 2 2 3
## 6: 6 1 1 1 1 1 1
## 7: 7 1 1 1 1 2 2
## 8: 8 1 1 1 1 1 1
## 9: 9 1 1 2 3 3 3
## 10: 10 1 1 2 3 3 3
## 11: 11 1 2 2 2 2 1
## 12: 12 1 2 1 1 2 1</code></pre>
<p>In the long format, the output is multiple records per id. This could be useful if you are going to be estimating longitudinal models, or as in this case, creating longitudinal plots:</p>
<pre class="r"><code>set.seed(3928398)
dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = FALSE)</code></pre>
<p>Here are the resulting data (for the first two individuals):</p>
<pre class="r"><code>dd[id %in% c(1,2)]</code></pre>
<pre><code>## id period state
## 1: 1 1 1
## 2: 1 2 2
## 3: 1 3 2
## 4: 1 4 1
## 5: 1 5 2
## 6: 1 6 2
## 7: 2 1 1
## 8: 2 2 1
## 9: 2 3 2
## 10: 2 4 2
## 11: 2 5 2
## 12: 2 6 3</code></pre>
<p>And here’s a plot for each individual, showing their health status progressions over time:</p>
<p><img src="https://www.rdatagen.net/post/2019-08-20-simstudy-0-1-14-update.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>I do plan on sharing the details of the simulation that inspired the creation of these new functions, though I am still working out a few things. In the meantime, as always, if anyone has any suggestions or questions about simstudy, definitely let me know.</p>
</div>
Bayes models for estimation in stepped-wedge trials with non-trivial ICC patterns
https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/
Tue, 06 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/
<p>Continuing a series of posts discussing the structure of intra-cluster correlations (ICC’s) in the context of a stepped-wedge trial, this latest edition is primarily interested in fitting Bayesian hierarchical models for more complex cases (though I do talk a bit more about the linear mixed effects models). The first two posts in the series focused on generating data to simulate various scenarios; the <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">third post</a> considered linear mixed effects and Bayesian hierarchical models to estimate ICC’s under the simplest scenario of constant between-period ICC’s. Throughout this post, I use code drawn from the previous one; I am not repeating much of it here for brevity’s sake. So, if this is all new, it is probably worth <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">glancing at</a> before continuing on.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generating model this time around is only subtly different from before, but that difference is quite important. Rather than a single cluster-specific effect <span class="math inline">\(b_c\)</span>, there is now a vector of cluster effects <span class="math inline">\(\mathbf{b_c} = \left( b_{c1}, b_{c2}, \ldots, b_{cT} \right)\)</span>, where <span class="math inline">\(b_c \sim MVN(\mathbf{0}, \sigma^2 \mathbf{R})\)</span> (see <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">this earlier post</a> for a description of the correlation matrix <span class="math inline">\(\mathbf{R}\)</span>.)</p>
<p><span class="math display">\[
Y_{ict} = \mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{ict}
\]</span></p>
<p>By altering the correlation structure of <span class="math inline">\(\mathbf{b_c}\)</span> (that is <span class="math inline">\(\mathbf{R}\)</span>), we can the change the structure of the ICC’s. (The data generation was the focus of the first two posts of this series, <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a> and <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">here</a>. The data generating function <code>genDD</code> includes an argument where you can specify the two correlation structures, <em>exchangeable</em> and <em>auto-regressive</em>:</p>
<pre class="r"><code>library(simstudy)
defc <- defData(varname = "mu", formula = 0,
dist = "nonrandom", id = "cluster")
defc <- defData(defc, "s2", formula = 0.15, dist = "nonrandom")
defc <- defData(defc, "m", formula = 15, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + 0.10 * period + 1 * rx + cteffect",
variance = 2, dist = "normal")</code></pre>
<pre class="r"><code>genDD <- function(defc, defa, nclust, nperiods,
waves, len, start, rho, corstr) {
dc <- genData(nclust, defc)
dp <- addPeriods(dc, nperiods, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = waves, lenWaves = len,
startPer = start)
dp <- addCorGen(dtOld = dp, nvars = nperiods, idvar = "cluster",
rho = rho, corstr = corstr, dist = "normal",
param1 = "mu", param2 = "s2", cnames = "cteffect")
dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = "m",
level1ID = "id")
dd <- addColumns(defa, dd)
dd[]
}</code></pre>
</div>
<div id="constant-between-period-iccs" class="section level3">
<h3>Constant between-period ICC’s</h3>
<p>In this first scenario, the assumption is that the within-period ICC’s are larger than the between-period ICC’s and the between-period ICC’s are constant. This can be generated with random effects that have a correlation matrix with compound symmetry (or is exchangeable). In this case, we will have 60 clusters and 7 time periods:</p>
<pre class="r"><code>set.seed(4119)
dcs <- genDD(defc, defa, 60, 7, 4, 1, 2, 0.6, "cs")
# correlation of "unobserved" random effects
round(cor(dcast(dcs[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>## 0 1 2 3 4 5
## 0 1.00 0.60 0.49 0.60 0.60 0.51
## 1 0.60 1.00 0.68 0.64 0.62 0.64
## 2 0.49 0.68 1.00 0.58 0.54 0.62
## 3 0.60 0.64 0.58 1.00 0.63 0.66
## 4 0.60 0.62 0.54 0.63 1.00 0.63
## 5 0.51 0.64 0.62 0.66 0.63 1.00</code></pre>
<p><br></p>
<div id="linear-mixed-effects-model" class="section level4">
<h4>Linear mixed-effects model</h4>
<p>It is possible to use <code>lmer</code> to correctly estimate the variance components and other parameters that underlie the data generating process used in this case. The cluster-level period-specific effects are specified in the model as “cluster/period”, which indicates that the period effects are <em>nested</em> within the cluster.</p>
<pre class="r"><code>library(lme4)
lmerfit <- lmer(Y ~ period + rx + (1 | cluster/period) , data = dcs)
as.data.table(VarCorr(lmerfit))</code></pre>
<pre><code>## grp var1 var2 vcov sdcor
## 1: period:cluster (Intercept) <NA> 0.05827349 0.2413990
## 2: cluster (Intercept) <NA> 0.07816476 0.2795796
## 3: Residual <NA> <NA> 2.02075355 1.4215321</code></pre>
<p>Reading from the <code>vcov</code> column in the <code>lmer</code> output above, we can extract the <em>period:cluster</em> variance (<span class="math inline">\(\sigma_w^2\)</span>), the <em>cluster</em> variance (<span class="math inline">\(\sigma^2_v\)</span>), and the <em>residual</em> (individual level) variance (<span class="math inline">\(\sigma^2_e\)</span>). Using these three variance components, we can estimate the correlation of the cluster level effects (<span class="math inline">\(\rho\)</span>), the within-period ICC (<span class="math inline">\(ICC_{tt}\)</span>), and the between-period ICC (<span class="math inline">\(ICC_{tt^\prime}\)</span>). (See the <a href="#addendum">addendum</a> below for a more detailed description of the derivations.)</p>
</div>
<div id="correlation-rho-of-cluster-specific-effects-over-time" class="section level4">
<h4>Correlation (<span class="math inline">\(\rho\)</span>) of cluster-specific effects over time</h4>
<p>In this post, don’t confuse <span class="math inline">\(\rho\)</span> with the ICC. <span class="math inline">\(\rho\)</span> is the correlation between the cluster-level period-specific random effects. Here I am just showing that it is function of the decomposed variance estimates provided in the <code>lmer</code> output:</p>
<p><span class="math display">\[
\rho = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w}
\]</span></p>
<pre class="r"><code>vs <- as.data.table(VarCorr(lmerfit))$vcov
vs[2]/sum(vs[1:2]) </code></pre>
<pre><code>## [1] 0.5728948</code></pre>
<p><br></p>
</div>
<div id="within-period-icc" class="section level4">
<h4>Within-period ICC</h4>
<p>The within-period ICC is the ratio of total cluster variance relative to total variance:</p>
<p><span class="math display">\[ICC_{tt} = \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>sum(vs[1:2])/sum(vs)</code></pre>
<pre><code>## [1] 0.06324808</code></pre>
<p><br></p>
</div>
<div id="between-period-icc" class="section level4">
<h4>Between-period ICC</h4>
<p>The between-period <span class="math inline">\(ICC_{tt^\prime}\)</span> is really just the within-period <span class="math inline">\(ICC_{tt}\)</span> adjusted by <span class="math inline">\(\rho\)</span> (see the <a href="#addendum">addendum</a>):</p>
<p><span class="math display">\[ICC_{tt^\prime} = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>vs[2]/sum(vs) </code></pre>
<pre><code>## [1] 0.0362345</code></pre>
<p><br></p>
</div>
<div id="bayesian-model" class="section level4">
<h4>Bayesian model</h4>
<p>Now, I’ll fit a Bayesian hierarchical model, as I did <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">earlier</a> with the simplest constant ICC data generation process. The specification of the model in <code>stan</code> in this instance is slightly more involved as the number of parameters has increased. In the simpler case, I only had to estimate a scalar parameter for <span class="math inline">\(\sigma_b\)</span> and a single ICC parameter. In this model definition (<code>nested_cor_cs.stan</code>) <span class="math inline">\(\mathbf{b_c}\)</span> is a vector so there is a need to specify the variance-covariance matrix <span class="math inline">\(\sigma^2 \mathbf{R}\)</span>, which has dimensions <span class="math inline">\(T \times T\)</span> (defined in the <code>transformed parameters</code> block). There are <span class="math inline">\(T\)</span> random effects for each cluster, rather than one. And finally, instead of one ICC value, there are two - the within- and between-period ICC’s (defined in the <code>generated quantities</code> block).</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of unique individuals
int<lower=1> K; // number of predictors
int<lower=1> J; // number of clusters
int<lower=0> T; // number of periods
int<lower=1,upper=J> jj[N]; // group for individual
int<lower=1> tt[N]; // period for individual
matrix[N, K] x; // matrix of predctors
vector[N] y; // matrix of outcomes
}
parameters {
vector[K] beta; // model fixed effects
real<lower=0> sigmalev1; // cluster variance (sd)
real<lower=-1,upper=1> rho; // correlation
real<lower=0> sigma; // individual level varianc (sd)
matrix[J, T] ran; // site level random effects (by period)
}
transformed parameters{
cov_matrix[T] Sigma;
vector[N] yhat;
vector[T] mu0;
for (t in 1:T)
mu0[t] = 0;
// Random effects with exchangeable correlation
for (j in 1:(T-1))
for (k in (j+1):T) {
Sigma[j,k] = pow(sigmalev1,2) * rho;
Sigma[k,j] = Sigma[j, k];
}
for (i in 1:T)
Sigma[i,i] = pow(sigmalev1,2);
for (i in 1:N)
yhat[i] = x[i]*beta + ran[jj[i], tt[i]];
}
model {
sigma ~ uniform(0, 10);
sigmalev1 ~ uniform(0, 10);
rho ~ uniform(-1, 1);
for (j in 1:J)
ran[j] ~ multi_normal(mu0, Sigma);
y ~ normal(yhat, sigma);
}
generated quantities {
real sigma2;
real sigmalev12;
real iccInPeriod;
real iccBetPeriod;
sigma2 = pow(sigma, 2);
sigmalev12 = pow(sigmalev1, 2);
iccInPeriod = sigmalev12/(sigmalev12 + sigma2);
iccBetPeriod = iccInPeriod * rho;
}</code></pre>
<p>Model estimation requires creating the data set (in the form of an <code>R list</code>), compiling the <code>stan</code> model, and then sampling from the posterior to generate distributions of all parameters and generated quantities. I should include conduct a diagnostic review (e.g. to assess convergence), but you’ll have to trust me that everything looked reasonable.</p>
<pre class="r"><code>library(rstan)
options(mc.cores = parallel::detectCores())
x <- as.matrix(dcs[ ,.(1, period, rx)])
K <- ncol(x)
N <- dcs[, length(unique(id))]
J <- dcs[, length(unique(cluster))]
T <- dcs[, length(unique(period))]
jj <- dcs[, cluster]
tt <- dcs[, period] + 1
y <- dcs[, Y]
testdat <- list(N=N, K=K, J=J, T=T, jj=jj, tt=tt, x=x, y=y)
rt <- stanc("nested_cor_cs.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
fit.cs <- sampling(sm, data=testdat, seed = 32748,
iter = 5000, warmup = 1000,
control = list(max_treedepth = 15))</code></pre>
<p>Here is a summary of results for <span class="math inline">\(\rho\)</span>, <span class="math inline">\(ICC_{tt}\)</span>, and <span class="math inline">\(ICC_{tt^\prime}\)</span>. I’ve included a comparison of the means of the posterior distributions with the <code>lmer</code> estimates, followed by a more complete (visual) description of the posterior distributions of the Bayesian estimates:</p>
<pre class="r"><code>mb <- sapply(
rstan::extract(fit.cs, pars=c("rho", "iccInPeriod", "iccBetPeriod")),
function(x) mean(x)
)
cbind(bayesian=round(mb,3),
lmer = round(c(vs[2]/sum(vs[1:2]),
sum(vs[1:2])/sum(vs),
vs[2]/sum(vs)),3)
)</code></pre>
<pre><code>## bayesian lmer
## rho 0.576 0.573
## iccInPeriod 0.065 0.063
## iccBetPeriod 0.037 0.036</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-12-1.png" width="480" /></p>
</div>
</div>
<div id="decaying-between-period-icc-over-time" class="section level3">
<h3>Decaying between-period ICC over time</h3>
<p>Now we enter somewhat uncharted territory, since there is no obvious way in <code>R</code> using the <code>lme4</code> or <code>nlme</code> packages to decompose the variance estimates when the random effects have correlation that decays over time. This is where we might have to rely on a Bayesian approach to do this. (I understand that <code>SAS</code> can accommodate this, but I can’t bring myself to go there.)</p>
<p>We start where we pretty much always do - generating the data. Everything is the same, except that the cluster-random effects are correlated over time; we specify a correlation structure of <em>ar1</em> (auto-regressive).</p>
<pre class="r"><code>set.seed(4119)
dar1 <- genDD(defc, defa, 60, 7, 4, 1, 2, 0.6, "ar1")
# correlation of "unobserved" random effects
round(cor(dcast(dar1[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>## 0 1 2 3 4 5
## 0 1.00 0.60 0.22 0.20 0.18 0.06
## 1 0.60 1.00 0.64 0.45 0.30 0.23
## 2 0.22 0.64 1.00 0.61 0.32 0.30
## 3 0.20 0.45 0.61 1.00 0.61 0.49
## 4 0.18 0.30 0.32 0.61 1.00 0.69
## 5 0.06 0.23 0.30 0.49 0.69 1.00</code></pre>
<p>The model file is similar to <code>nested_cor_cs.stan</code>, except that the specifications of the variance-covariance matrix and ICC’s are now a function of <span class="math inline">\(\rho^{|t^\prime - t|}\)</span>:</p>
<pre class="stan"><code>transformed parameters{
⋮
for (j in 1:T)
for (k in 1:T)
Sigma[j,k] = pow(sigmalev1,2) * pow(rho,abs(j-k));
⋮
}
generated quantities {
⋮
for (j in 1:T)
for (k in 1:T)
icc[j, k] = sigmalev12/(sigmalev12 + sigma2) * pow(rho,abs(j-k));
⋮
}</code></pre>
<p>The stan compilation and sampling code is not shown here - they are the same before. The posterior distribution of <span class="math inline">\(\rho\)</span> is similar to what we saw previously.</p>
<pre class="r"><code>print(fit.ar1, pars=c("rho"))</code></pre>
<pre><code>## Inference for Stan model: nested_cor_ar1.
## 4 chains, each with iter=5000; warmup=1000; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=16000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## rho 0.58 0 0.08 0.41 0.53 0.58 0.64 0.73 2302 1
##
## Samples were drawn using NUTS(diag_e) at Fri Jun 28 14:13:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).</code></pre>
<p>Now, however, we have to consider a full range of ICC estimates. Here is a plot of the posterior distribution of all ICC’s with the means of each posterior directly below. The diagonal represents the within-period (constant) ICCs, and the off-diagonals are the between-period ICC’s.</p>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-16-1.png" width="576" /></p>
</div>
<div id="an-alternative-bayesian-model-unstructured-correlation" class="section level3">
<h3>An alternative Bayesian model: unstructured correlation</h3>
<p>Now, there is no particular reason to expect that the particular decay model (with an AR1 structure) would be the best model. We could try to fit an even more general model, one with minimal structure. For example if we put no restrictions on the correlation matrix <span class="math inline">\(\mathbf{R}\)</span>, but assumed a constant variance of <span class="math inline">\(\sigma_b^2\)</span>, we might achieve a better model fit. (We could go even further and relax the assumption that the variance across time changes as well, but I’ll leave that to you if you want to try it.)</p>
<p>In this case, we need to define <span class="math inline">\(\mathbf{R}\)</span> and specify a prior distribution (I use the Lewandowski, Kurowicka, and Joe - LKJ prior, as suggested by <code>Stan</code> documentation) and define the ICC’s in terms of <span class="math inline">\(\mathbf{R}\)</span>. Here are the relevant snippets of the <code>stan</code> model (everything else is the same as before):</p>
<pre class="stan"><code>parameters {
⋮
corr_matrix[T] R; // correlation matrix
⋮
}
transformed parameters{
⋮
Sigma = pow(sigmalev1,2) * R;
⋮
}
model {
⋮
R ~ lkj_corr(1); // LKJ prior on the correlation matrix
⋮
}
generated quantities {
⋮
for (j in 1:T)
for (k in 1:T)
icc[j, k] = sigmalev12/(sigmalev12 + sigma2) * R[j, k];
⋮
}</code></pre>
<p>Here are the means of the ICC posterior distributions alongside the means from the previous <em>auto-regressive</em> model.</p>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-18-1.png" width="864" /></p>
<p>Looking at the unstructured model estimates on the right, it does appear that a decay model might be reasonable. (No surprise there, because in reality, it <em>is</em> totally reasonable; that’s how we generated the data.) We can use package <code>bridgesampling</code> which estimates marginal log likelihoods (across the prior distributions of the parameters). The marginal likelihoods are used in calculating the Bayes Factor, which is the basis for comparing two competing models. Here, the log-likelihood is reported. If the unstructured model is indeed an improvement (and it could very well be, because it has more parameters), the we would expect the marginal log-likelihood for the second model to be greater (less negative) than the log-likelihood for the auto-regressive model. If fact, the opposite true, suggesting the auto-regressive model is the preferred one (out of these two):</p>
<pre class="r"><code>library(bridgesampling)
bridge_sampler(fit.ar1, silent = TRUE)</code></pre>
<pre class="r"><code>## Bridge sampling estimate of the log marginal likelihood: -5132.277
## Estimate obtained in 6 iteration(s) via method "normal"</code></pre>
<pre class="r"><code>bridge_sampler(fit.ar1.nc, silent = TRUE)</code></pre>
<pre class="r"><code>## Bridge sampling estimate of the log marginal likelihood: -5137.081
## Estimate obtained in 269 iteration(s) via method "normal".</code></pre>
<p> </p>
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum---interpreting-lmer-variance-estimates" class="section level2">
<h2>Addendum - interpreting lmer variance estimates</h2>
<p>In order to show how the <code>lmer</code> variance estimates relate to the theoretical variances and correlations in the case of a constant between-period ICC, here is a simulation based on 1000 clusters. The key parameters are <span class="math inline">\(\sigma^2_b = 0.15\)</span>, <span class="math inline">\(\sigma^2_e = 2\)</span>, and <span class="math inline">\(\rho = 0.6\)</span>. And based on these values, the theoretical ICC’s are: <span class="math inline">\(ICC_{within} = 0.15/2.15 = 0.698\)</span>, and <span class="math inline">\(ICC_{bewteen} = 0.698 * 0.6 = 0.042\)</span>.</p>
<pre class="r"><code>set.seed(4119)
dcs <- genDD(defc, defa, 1000, 7, 4, 1, 2, 0.6, "cs")</code></pre>
<p>The underlying correlation matrix of the cluster-level effects is what we would expect:</p>
<pre class="r"><code>round(cor(dcast(dcs[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>## 0 1 2 3 4 5
## 0 1.00 0.59 0.59 0.61 0.61 0.61
## 1 0.59 1.00 0.61 0.60 0.61 0.64
## 2 0.59 0.61 1.00 0.59 0.61 0.61
## 3 0.61 0.60 0.59 1.00 0.59 0.62
## 4 0.61 0.61 0.61 0.59 1.00 0.60
## 5 0.61 0.64 0.61 0.62 0.60 1.00</code></pre>
<p>Here are the variance estimates from the mixed-effects model:</p>
<pre class="r"><code>lmerfit <- lmer(Y ~ period + rx + (1 | cluster/period) , data = dcs)
as.data.table(VarCorr(lmerfit))</code></pre>
<pre><code>## grp var1 var2 vcov sdcor
## 1: period:cluster (Intercept) <NA> 0.05779349 0.2404028
## 2: cluster (Intercept) <NA> 0.09143749 0.3023863
## 3: Residual <NA> <NA> 1.98894356 1.4102991</code></pre>
<p>The way <code>lmer</code> implements the nested random effects , the cluster period-specific effect <span class="math inline">\(b_{ct}\)</span> is decomposed into <span class="math inline">\(v_c\)</span>, a cluster level effect, and <span class="math inline">\(w_{ct}\)</span>, a cluster time-specific effect:</p>
<p><span class="math display">\[
b_{ct} = v_c + w_{ct}
\]</span></p>
<p>Since both <span class="math inline">\(v_c\)</span> and <span class="math inline">\(w_{ct}\)</span> are normally distributed (<span class="math inline">\(v_c \sim N(0,\sigma_v^2)\)</span> and <span class="math inline">\(w_{ct} \sim N(0,\sigma_w^2)\)</span>), <span class="math inline">\(var(b_{ct}) = \sigma^2_b = \sigma^2_v + \sigma^2_w\)</span>.</p>
<p>Here is the observed estimate of <span class="math inline">\(\sigma^2_v + \sigma^2_w\)</span>:</p>
<pre class="r"><code>vs <- as.data.table(VarCorr(lmerfit))$vcov
sum(vs[1:2])</code></pre>
<pre><code>## [1] 0.149231</code></pre>
<p>An estimate of <span class="math inline">\(\rho\)</span> can be extracted from the <code>lmer</code> model variance estimates:</p>
<p><span class="math display">\[
\begin{aligned}
\rho &= cov(b_{ct}, b_{ct^\prime}) \\
&= cov(v_{c} + w_{ct}, v_{c} + w_{ct^\prime}) \\
&= var(v_c) + cov(w_{ct}) \\
&= \sigma^2_v
\end{aligned}
\]</span></p>
<p><span class="math display">\[
\begin{aligned}
var(b_{ct}) &= var(v_{c}) + var(w_{ct}) \\
&= \sigma^2_v + \sigma^2_w
\end{aligned}
\]</span></p>
<p><span class="math display">\[
\begin{aligned}
cor(b_{ct}, b_{ct^\prime}) &= \frac{cov(b_{ct}, b_{ct^\prime})}{\sqrt{var(b_{ct}) var(b_{ct^\prime})} } \\
\rho &= \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w}
\end{aligned}
\]</span></p>
<pre class="r"><code>vs[2]/sum(vs[1:2])</code></pre>
<pre><code>## [1] 0.6127246</code></pre>
<p>And here are the estimates of within and between-period ICC’s:</p>
<p><span class="math display">\[ICC_{tt} = \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e} =\frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>sum(vs[1:2])/sum(vs)</code></pre>
<pre><code>## [1] 0.06979364</code></pre>
<p><span class="math display">\[
\begin{aligned}
ICC_{tt^\prime} &= \left( \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e}\right) \rho \\
\\
&= \left( \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\right) \rho \\
\\
&=\left( \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e} \right) \left( \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w} \right) \\
\\
&= \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w+\sigma^2_e}
\end{aligned}
\]</span></p>
<pre class="r"><code>vs[2]/sum(vs)</code></pre>
<pre><code>## [1] 0.04276428</code></pre>
</div>
Estimating treatment effects (and ICCs) for stepped-wedge designs
https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/
Tue, 16 Jul 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/
<p>In the last two posts, I introduced the notion of time-varying intra-cluster correlations in the context of stepped-wedge study designs. (See <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a> and <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">here</a>). Though I generated lots of data for those posts, I didn’t fit any models to see if I could recover the estimates and any underlying assumptions. That’s what I am doing now.</p>
<p>My focus here is on the simplest case, where the ICC’s are constant over time and between time. Typically, I would just use a mixed-effects model to estimate the treatment effect and account for variability across clusters, which is easily done in <code>R</code> using the <code>lme4</code> package; if the outcome is continuous the function <code>lmer</code> is appropriate. I thought, however, it would also be interesting to use the <code>rstan</code> package to fit a Bayesian hierarchical model.</p>
<p>While it is always fun to explore new methods, I have a better justification for trying this approach: as far as I can tell, <code>lme4</code> (or <code>nlme</code> for that matter) cannot handle the cases with more complex patterns of between-period intra-cluster correlation that I focused on last time. A Bayesian hierarchical model should be up to the challenge. I thought that it would be best to start with a simple case before proceeding to the situation where I have no clear option in <code>R</code>. I’ll do that next time.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I know I am repeating myself a little bit, but it is important to be clear about the data generation process that I am talking about here.</p>
<p><span class="math display">\[Y_{ic} = \mu + \beta_1X_{c} + b_c + e_{ic},\]</span></p>
<p>where <span class="math inline">\(Y_{ic}\)</span> is a continuous outcome for subject <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(c\)</span>, and <span class="math inline">\(X_c\)</span> is a treatment indicator for cluster <span class="math inline">\(c\)</span> (either 0 or 1). The underlying structural parameters are <span class="math inline">\(\mu\)</span>, the grand mean, and <span class="math inline">\(\beta_1\)</span>, the treatment effect. The unobserved random effects are, <span class="math inline">\(b_c \sim N(0, \sigma^2_b)\)</span>, the normally distributed group level effect, and <span class="math inline">\(e_{ic} \sim N(0, \sigma^2_e)\)</span>, the normally distributed individual-level effect.</p>
<pre class="r"><code>library(simstudy)
defc <- defData( varname = "ceffect", formula = 0.0, variance = 0.15,
dist = "normal", id = "cluster")
defc <- defData(defc, varname = "m", formula = 15, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + 0.10 * period + 1 * rx + ceffect",
variance = 2, dist = "normal")
genDD <- function(defc, defa, nclust, nperiods, waves, len, start) {
dc <- genData(nclust, defc)
dp <- addPeriods(dc, nperiods, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = waves, lenWaves = len,
startPer = start)
dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = "m",
level1ID = "id")
dd <- addColumns(defa, dd)
dd[]
}
set.seed(2822)
dx <- genDD(defc, defa, 60, 7, 4, 1, 2)
dx</code></pre>
<pre><code>## cluster period ceffect m timeID startTrt rx id Y
## 1: 1 0 -0.05348668 15 1 2 0 1 -0.1369149
## 2: 1 0 -0.05348668 15 1 2 0 2 -1.0030891
## 3: 1 0 -0.05348668 15 1 2 0 3 3.1169339
## 4: 1 0 -0.05348668 15 1 2 0 4 -0.8109585
## 5: 1 0 -0.05348668 15 1 2 0 5 0.2285751
## ---
## 6296: 60 6 0.10844859 15 420 5 1 6296 0.4171770
## 6297: 60 6 0.10844859 15 420 5 1 6297 1.5127632
## 6298: 60 6 0.10844859 15 420 5 1 6298 0.5194967
## 6299: 60 6 0.10844859 15 420 5 1 6299 -0.3120285
## 6300: 60 6 0.10844859 15 420 5 1 6300 2.0493244</code></pre>
</div>
<div id="using-lmer-to-estimate-treatment-effect-and-iccs" class="section level3">
<h3>Using lmer to estimate treatment effect and ICC’s</h3>
<p>As I <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">derived earlier</a>, the within- and between-period ICC’s under this data generating process are:</p>
<p><span class="math display">\[ICC = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_e}\]</span></p>
<p>Using a linear mixed-effects regression model we can estimate the fixed effects (the time trend and the treatment effect) as well as the random effects (cluster- and individual-level variation, <span class="math inline">\(\sigma^2_b\)</span> and <span class="math inline">\(\sigma^2_e\)</span>). The constant ICC can be estimated directly from the variance estimates.</p>
<pre class="r"><code>library(lme4)
library(sjPlot)
lmerfit <- lmer(Y ~ period + rx + (1 | cluster) , data = dx)
tab_model(lmerfit, show.icc = FALSE, show.dev = FALSE,
show.p = FALSE, show.r2 = FALSE,
title = "Linear mixed-effects model")</code></pre>
<table style="border-collapse:collapse; border:none;">
<caption style="font-weight: bold; text-align:left;">
Linear mixed-effects model
</caption>
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; ">
</th>
<th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">
Y
</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">
Predictors
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
Estimates
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
CI
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
(Intercept)
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.09
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
-0.03 – 0.21
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
period
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.08
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.05 – 0.11
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
rx
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
1.03
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.90 – 1.17
</td>
</tr>
<tr>
<td colspan="3" style="font-weight:bold; text-align:left; padding-top:.8em;">
Random Effects
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
σ<sup>2</sup>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
2.07
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
τ<sub>00</sub> <sub>cluster</sub>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
0.15
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
N <sub>cluster</sub>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
60
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">
Observations
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">
6300
</td>
</tr>
</table>
<p>Not surprisingly, this model recovers the parameters used in the data generation process. Here is the ICC estimate based on this sample:</p>
<pre class="r"><code>(vars <- as.data.frame(VarCorr(lmerfit))$vcov)</code></pre>
<pre><code>## [1] 0.1540414 2.0691434</code></pre>
<pre class="r"><code>(iccest <- round(vars[1]/(sum(vars)), 3))</code></pre>
<pre><code>## [1] 0.069</code></pre>
</div>
<div id="bayesian-hierarchical-model" class="section level3">
<h3>Bayesian hierarchical model</h3>
<p>To estimate the same model using Bayesian methods, I’m turning to <code>rstan</code>. If Bayesian methods are completely foreign to you or you haven’t used <code>rstan</code> before, there are obviously incredible resources out on the internet and in bookstores. (See <a href="https://mc-stan.org/users/interfaces/rstan">here</a>, for example.) While I have done some Bayesian modeling in the past and have read some excellent books on the topic (including <a href="https://xcelab.net/rm/statistical-rethinking/"><em>Statistical Rethinking</em></a> and <a href="https://sites.google.com/site/doingbayesiandataanalysis/what-s-new-in-2nd-ed"><em>Doing Bayesian Data Analysis</em></a>, though I have not read <a href="http://www.stat.columbia.edu/~gelman/book/"><em>Bayesian Data Analysis</em></a> and I know I should.)</p>
<p>To put things simplistically, the goal of this method is to generate a posterior distribution <span class="math inline">\(P(\theta | observed \ data)\)</span>, where <span class="math inline">\(\theta\)</span> is a vector of model parameters of interest. The <em>Bayes theorem</em> provides the underlying machinery for all of this to happen:</p>
<p><span class="math display">\[P(\theta | observed \ data) = \frac{P(observed \ data | \theta)}{P(observed \ data)} P(\theta)\]</span>
<span class="math inline">\(P(observed \ data | \theta)\)</span> is the data <em>likelihood</em> and <span class="math inline">\(P(\theta)\)</span> is the prior distribution. Both need to be specified in order to generate the desired posterior distribution. The general (again, highly simplistic) idea is that draws of <span class="math inline">\(\theta\)</span> are repeatedly made from the prior distribution, and each time the likelihood is estimated which updates the probability of <span class="math inline">\(\theta\)</span>. At the completion of the iterations, we are left with a posterior distribution of <span class="math inline">\(\theta\)</span> (conditional on the observed data).</p>
<p>This is my first time working with <code>Stan</code>, so it is a bit of an experiment. While things have worked out quite well in this case, I may be doing things in an unconventional (i.e. not quite correct) way, so treat this as more conceptual than tutorial - though it’ll certainly get you started.</p>
</div>
<div id="defining-the-model" class="section level3">
<h3>Defining the model</h3>
<p>In Stan, the model is specified in a separate <code>stan</code> program that is written using the Stan probabilistic programming language. The code can be saved as an external file and referenced when you want to sample data from the posterior distribution. In this case, I’ve save the following code in a file named <code>nested.stan</code>.</p>
<p>This <code>stan</code> file includes at least 3 “blocks”: <em>data</em>, <em>parameters</em>, and <em>model</em>. The data block defines the data that will be provided by the user, which includes the outcome and predictor data, as well as other information required for model estimation. The data are passed from <code>R</code> using a <code>list</code>.</p>
<p>The parameters of the model are defined explicitly in the parameter block; in this case, we have regression parameters, random effects, and variance parameters. The transformed parameter block provides the opportunity to create parameters that depend on data and pre-defined parameters. They have no prior distributions <em>per se</em>, but can be used to simplify model block statements, or perhaps make the model estimation more efficient.</p>
<p>Since this is a Bayesian model, each of the parameters will have a prior distribution that can be specified in the model block; if there is no explicit specification of a prior for a parameter, Stan will use a default (non- or minimally-informative) prior distribution. The outcome model is also defined here.</p>
<p>There is also the interesting possibility of defining derived values in a block called <em>generated quantities</em>. These quantities will be functions of previously defined parameters and data. In this case, we might be interested in estimating the ICC along with an uncertainty interval; since the ICC is a function of cluster- and individual-level variation, we can derive and ICC estimate for each of the iterations. At the end of the sequence of iterations, we will have a posterior distribution of the ICC.</p>
<p>Here is the <code>nested.stan</code> file used for this analysis:</p>
<pre class="stan"><code>data {
int<lower=0> N; // number of individuals
int<lower=1> K; // number of predictors
int<lower=1> J; // number of clusters
int<lower=1,upper=J> jj[N]; // group for individual
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
vector[K] beta; // intercept, time trend, rx effect
real<lower=0> sigmalev1; // cluster level standard deviation
real<lower=0> sigma; // individual level sd
vector[J] ran; // cluster level effects
}
transformed parameters{
vector[N] yhat;
for (i in 1:N)
yhat[i] = x[i]*beta + ran[jj[i]];
}
model {
ran ~ normal(0, sigmalev1);
y ~ normal(yhat, sigma);
}
generated quantities {
real<lower=0> sigma2;
real<lower=0> sigmalev12;
real<lower=0> icc;
sigma2 = pow(sigma, 2);
sigmalev12 = pow(sigmalev1, 2);
icc = sigmalev12/(sigmalev12 + sigma2);
}
</code></pre>
</div>
<div id="estimating-the-model" class="section level3">
<h3>Estimating the model</h3>
<p>Once the definition has been created, the next steps are to create the data set (as an R <code>list</code>) and call the functions to run the MCMC algorithm. The first function (<code>stanc</code>) converts the <code>.stan</code> file into <code>C++</code> code. The function <code>stan_model</code> converts the <code>C++</code> code into a stanmodel object. And the function <code>sampling</code> draws samples from the stanmodel object created in the second step.</p>
<pre class="r"><code>library(rstan)
options(mc.cores = parallel::detectCores())
x <- as.matrix(dx[ ,.(1, period, rx)])
K <- ncol(x)
N <- dx[, length(unique(id))]
J <- dx[, length(unique(cluster))]
jj <- dx[, cluster]
y <- dx[, Y]
testdat <- list(N, K, J, jj, x, y)
rt <- stanc("Working/stan_icc/nested.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
fit <- sampling(sm, data=testdat, seed = 3327, iter = 5000, warmup = 1000)</code></pre>
</div>
<div id="looking-at-the-diagnostics" class="section level3">
<h3>Looking at the diagnostics</h3>
<p>Once the posterior distribution has been generated, it is important to investigate to see how well-behaved the algorithm performed. One way to do this is look at a series of <em>trace</em> plots that provide insight into how stable the algorithm was as it moved around the parameter space. In this example, I used 5000 draws but threw out the first 1000. Typically, the early draws show much more variability, so it is usual to ignore the “burn-in” phase when analyzing the posterior distribution.</p>
<p>The process didn’t actually generate 5000 draws, but rather 20,000. The process was simultaneously run four separate times. The idea is if things are behaving well, the parallel processes (called chains) should mix quite well - it should be difficult to distinguish between the chains. In the plot below each chain is represented by a different color.</p>
<p>I think it is prudent to ensure that all parameters behaved reasonably, but here I am providing trace plots to the variance estimates, the effect size estimate, and the ICC.</p>
<pre class="r"><code>library(ggthemes)
pname <- c("sigma2", "sigmalev12", "beta[3]", "icc")
muc <- rstan::extract(fit, pars=pname, permuted=FALSE, inc_warmup=FALSE)
mdf <- data.table(melt(muc))
mdf[parameters == "beta[3]", parameters := "beta[3] (rx effect)"]
ggplot(mdf,aes(x=iterations, y=value, color=chains)) +
geom_line() +
facet_wrap(~parameters, scales = "free_y") +
theme(legend.position = "none",
panel.grid = element_blank()) +
scale_color_ptol()</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-07-16-estimating-treatment-effects-and-iccs-for-stepped-wedge-designs.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="evaluating-the-posterior-distribution" class="section level3">
<h3>Evaluating the posterior distribution</h3>
<p>Since these trace plots look fairly stable, it is reasonable to look at the posterior distribution. A summary of the distribution reports the means and percentiles for the parameters of interest. I am reprinting the results from <code>lmer</code> so you can see that the Bayesian estimates are pretty much identical to the mixed-effect model:</p>
<pre class="r"><code>print(fit, pars=c("beta", "sigma2", "sigmalev12", "icc"))</code></pre>
<pre><code>## Inference for Stan model: nested.
## 4 chains, each with iter=5000; warmup=1000; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=16000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] 0.09 0 0.06 -0.03 0.05 0.09 0.13 0.21 3106 1
## beta[2] 0.08 0 0.02 0.05 0.07 0.08 0.09 0.11 9548 1
## beta[3] 1.03 0 0.07 0.90 0.99 1.03 1.08 1.16 9556 1
## sigma2 2.07 0 0.04 2.00 2.05 2.07 2.10 2.14 24941 1
## sigmalev12 0.16 0 0.03 0.11 0.14 0.16 0.18 0.24 13530 1
## icc 0.07 0 0.01 0.05 0.06 0.07 0.08 0.11 13604 1
##
## Samples were drawn using NUTS(diag_e) at Wed Jun 26 16:18:31 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).</code></pre>
<table style="border-collapse:collapse; border:none;">
<caption style="font-weight: bold; text-align:left;">
Linear mixed-effects model
</caption>
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; ">
</th>
<th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">
Y
</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">
Predictors
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
Estimates
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
CI
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
(Intercept)
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.09
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
-0.03 – 0.21
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
period
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.08
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.05 – 0.11
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
rx
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
1.03
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.90 – 1.17
</td>
</tr>
<tr>
<td colspan="3" style="font-weight:bold; text-align:left; padding-top:.8em;">
Random Effects
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
σ<sup>2</sup>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
2.07
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
τ<sub>00</sub> <sub>cluster</sub>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
0.15
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
N <sub>cluster</sub>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">
60
</td>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">
Observations
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">
6300
</td>
</tr>
</table>
<p>The ability to produce a density plot that shows the posterior distribution of the ICC is a pretty compelling reason to use Bayesian methods. The density plot provides an quick way to assess uncertainty of estimates for parameters that might not even be directly included in a linear mixed-effects model:</p>
<pre class="r"><code>plot_dens <- function(fit, pars, p = c(0.05, 0.95),
fill = "grey80", xlab = NULL) {
qs <- quantile(extract(fit, pars = pars)[[1]], probs = p)
x.dens <- density(extract(fit, pars = pars)[[1]])
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p <- stan_dens(fit, pars = c(pars), fill = fill, alpha = .1) +
geom_area(data = subset(df.dens, x >= qs[1] & x <= qs[2]),
aes(x=x,y=y), fill = fill, alpha = .4)
if (is.null(xlab)) return(p)
else return(p + xlab(xlab))
}
plot_dens(fit, "icc", fill = "#a1be97")</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-07-16-estimating-treatment-effects-and-iccs-for-stepped-wedge-designs.en_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Next time, I will expand the <code>stan</code> model to generate parameter estimates for cases where the within-period and between-period ICC’s are not necessarily constant. I will also explore how we compare models in the context of Bayesian models, because we won’t always know the underlying data generating process!</p>
</div>
More on those stepped-wedge design assumptions: varying intra-cluster correlations over time
https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/
Tue, 09 Jul 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/
<p>In my last <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">post</a>, I wrote about <em>within-</em> and <em>between-period</em> intra-cluster correlations in the context of stepped-wedge cluster randomized study designs. These are quite important to understand when figuring out sample size requirements (and models for analysis, which I’ll be writing about soon.) Here, I’m extending the constant ICC assumption I presented last time around by introducing some complexity into the correlation structure. Much of the code I am using can be found in last week’s post, so if anything seems a little unclear, hop over <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a>.</p>
<div id="different-within--and-between-period-iccs" class="section level3">
<h3>Different within- and between-period ICC’s</h3>
<p>In a scenario with constant within- and between-period ICC’s, the correlated data can be induced using a single cluster-level effect like <span class="math inline">\(b_c\)</span> in this model:</p>
<p><span class="math display">\[
Y_{ict} = \mu + \beta_0t + \beta_1X_{ct} + b_{c} + e_{ict}
\]</span></p>
<p>More complexity can be added if, instead of a single cluster level effect, we have a vector of correlated cluster/time specific effects <span class="math inline">\(\mathbf{b_c}\)</span>. These cluster-specific random effects <span class="math inline">\((b_{c1}, b_{c2}, \ldots, b_{cT})\)</span> replace <span class="math inline">\(b_c\)</span>, and the slightly modified data generating model is</p>
<p><span class="math display">\[
Y_{ict} = \mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{ict}
\]</span></p>
<p>The vector <span class="math inline">\(\mathbf{b_c}\)</span> has a multivariate normal distribution <span class="math inline">\(N_T(0, \sigma^2_b \mathbf{R})\)</span>. This model assumes a common covariance structure across all clusters, <span class="math inline">\(\sigma^2_b \mathbf{R}\)</span>, where the general version of <span class="math inline">\(\mathbf{R}\)</span> is</p>
<p><span class="math display">\[
\mathbf{R} =
\left(
\begin{matrix}
1 & r_{12} & r_{13} & \cdots & r_{1T} \\
r_{21} & 1 & r_{23} & \cdots & r_{2T} \\
r_{31} & r_{32} & 1 & \cdots & r_{3T} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
r_{T1} & r_{T2} & r_{T3} & \cdots & 1
\end{matrix}
\right )
\]</span></p>
<div id="within-period-cluster-correlation" class="section level4">
<h4>Within-period cluster correlation</h4>
<p>The covariance of any two individuals <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> in the same cluster <span class="math inline">\(c\)</span> and same period <span class="math inline">\(t\)</span> is</p>
<p><span class="math display">\[
\begin{aligned}
cov(Y_{ict}, Y_{jct}) &= cor(\mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{ict}, \ \mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{jct}) \\
\\
&= cov(b_{ct}, b_{ct}) + cov(e_{ict}, e_{jct}) \\
\\
&=var(b_{ct}) + 0 \\
\\
&= \sigma^2_b r_{tt} \\
\\
&= \sigma^2_b \qquad \qquad \qquad \text{since } r_{tt} = 1, \ \forall t \in \ ( 1, \ldots, T)
\end{aligned}
\]</span></p>
<p>And I showed in the previous post that <span class="math inline">\(var(Y_{ict}) = var(Y_{jct}) = \sigma^2_b + \sigma^2_e\)</span>, so the within-period intra-cluster correlation is what we saw last time:</p>
<p><span class="math display">\[ICC_{tt} = \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e}\]</span></p>
</div>
<div id="between-period-cluster-correlation" class="section level4">
<h4>Between-period cluster correlation</h4>
<p>The covariance of any two individuals in the same cluster but two <em>different</em> time periods <span class="math inline">\(t\)</span> and <span class="math inline">\(t^{\prime}\)</span> is:</p>
<p><span class="math display">\[
\begin{aligned}
cov(Y_{ict}, Y_{jct^{\prime}}) &= cor(\mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{ict}, \ \mu + \beta_0t + \beta_1X_{ct^{\prime}} + b_{ct^{\prime}} + e_{jct^{\prime}}) \\
\\
&= cov(b_{ct}, b_{ct^{\prime}}) + cov(e_{ict}, e_{jct^{\prime}}) \\
\\
&= \sigma^2_br_{tt^{\prime}}
\end{aligned}
\]</span></p>
<p>Based on this, the between-period intra-cluster correlation is</p>
<p><span class="math display">\[ ICC_{tt^\prime} =\frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e} r_{tt^{\prime}}\]</span></p>
</div>
<div id="adding-structure-to-matrix-mathbfr" class="section level4">
<h4>Adding structure to matrix <span class="math inline">\(\mathbf{R}\)</span></h4>
<p>This paper by <a href="https://journals.sagepub.com/doi/full/10.1177/0962280217734981"><em>Kasza et al</em></a>, which describes various stepped-wedge models, suggests a structured variation of <span class="math inline">\(\mathbf{R}\)</span> that is a function of two parameters, <span class="math inline">\(r_0\)</span> and <span class="math inline">\(r\)</span>:</p>
<p><span class="math display">\[
\mathbf{R} = \mathbf{R}(r_0, r) =
\left(
\begin{matrix}
1 & r_0r & r_0r^2 & \cdots & r_0r^{T-1} \\
r_0r & 1 & r_0 r & \cdots & r_0 r^{T-2} \\
r_0r^2 & r_0 r & 1 & \cdots & r_0 r^{T-3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
r_0r^{T-1} & r_0r^{T-2} & r_0 r^{T-3} & \cdots & 1
\end{matrix}
\right )
\]</span></p>
<p>How we specify <span class="math inline">\(r_0\)</span> and <span class="math inline">\(r\)</span> reflects different assumptions about the between-period intra-cluster correlations. I describe two particular cases below.</p>
</div>
</div>
<div id="constant-correlation-over-time" class="section level3">
<h3>Constant correlation over time</h3>
<p>In this first case, the correlation between individuals in the same cluster but different time periods is less than the correlation between individuals in the same cluster and same time period. In other words, <span class="math inline">\(ICC_{tt} \ne ICC_{tt^\prime}\)</span>. However the between-period correlation is constant, or <span class="math inline">\(ICC_{tt^\prime}\)</span> are constant for all <span class="math inline">\(t\)</span> and <span class="math inline">\(t^\prime\)</span>. We have these correlations when <span class="math inline">\(r_0 = \rho\)</span> and <span class="math inline">\(r = 1\)</span>, giving</p>
<p><span class="math display">\[
\mathbf{R} = \mathbf{R}(\rho, 1) =
\left(
\begin{matrix}
1 & \rho & \rho & \cdots & \rho \\
\rho & 1 & \rho & \cdots & \rho \\
\rho & \rho & 1 & \cdots & \rho \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\rho & \rho & \rho & \cdots & 1
\end{matrix}
\right )
\]</span></p>
<p>To simulate under this scenario, I am setting <span class="math inline">\(\sigma_b^2 = 0.15\)</span>, <span class="math inline">\(\sigma_e^2 = 2.0\)</span>, and <span class="math inline">\(\rho = 0.6\)</span>. We would expect the following ICC’s:</p>
<p><span class="math display">\[
\begin{aligned}
ICC_{tt} &= \frac{0.15}{0.15+2.00} = 0.0698 \\
\\
ICC_{tt^\prime} &= \frac{0.15}{0.15+2.00}\times0.6 = 0.0419
\end{aligned}
\]</span></p>
<p>Here is the code to define and generate the data:</p>
<pre class="r"><code>defc <- defData(varname = "mu", formula = 0,
dist = "nonrandom", id = "cluster")
defc <- defData(defc, "s2", formula = 0.15, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + 0.10 * period + 1 * rx + cteffect",
variance = 2, dist = "normal")
dc <- genData(100, defc)
dp <- addPeriods(dc, 7, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = 4, lenWaves = 1, startPer = 2)
dp <- addCorGen(dtOld = dp, nvars = 7, idvar = "cluster",
rho = 0.6, corstr = "cs", dist = "normal",
param1 = "mu", param2 = "s2", cnames = "cteffect")
dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = 100,
level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p>As I did in my previous post, I’ve generated 200 data sets, estimated the <em>within-</em> and <em>between-period</em> ICC’s for each data set, and computed the average for each. The plot below shows the expected values in gray and the estimated values in purple and green.</p>
<p><img src="https://www.rdatagen.net/img/post-iccvary/p2.png" width="800" /></p>
</div>
<div id="declining-correlation-over-time" class="section level3">
<h3>Declining correlation over time</h3>
<p>In this second case, we make an assumption that the correlation between individuals in the same cluster degrades over time. Here, the correlation between two individuals in adjacent time periods is stronger than the correlation between individuals in periods further apart. That is <span class="math inline">\(ICC_{tt^\prime} > ICC_{tt^{\prime\prime}}\)</span> if <span class="math inline">\(|t^\prime - t| < |t^{\prime\prime} - t|\)</span>. This structure can be created by setting <span class="math inline">\(r_0 = 1\)</span> and <span class="math inline">\(r=\rho\)</span>, giving us an auto-regressive correlation matrix <span class="math inline">\(R\)</span>:</p>
<p><span class="math display">\[
\mathbf{R} = \mathbf{R}(1, \rho) =
\left(
\begin{matrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{matrix}
\right )
\]</span></p>
<p>I’ve generated data using the same variance assumptions as above. The only difference in this case is that the <code>corstr</code> argument in the call to <code>addCorGen</code> is “ar1” rather than “cs” (which was used above). Here are a few of the expected correlations:</p>
<p><span class="math display">\[
\begin{aligned}
ICC_{t,t} &= \frac{0.15}{0.15+2.00} = 0.0698 \\
\\
ICC_{t,t+1} &= \frac{0.15}{0.15+2.00}\times 0.6^{1} = 0.0419 \\
\\
ICC_{t,t+2} &= \frac{0.15}{0.15+2.00}\times 0.6^{2} = 0.0251 \\
\\
\vdots
\\
ICC_{t, t+6} &= \frac{0.15}{0.15+2.00}\times 0.6^{6} = 0.0032
\end{aligned}
\]</span></p>
<p>And here is the code:</p>
<pre class="r"><code>defc <- defData(varname = "mu", formula = 0,
dist = "nonrandom", id = "cluster")
defc <- defData(defc, "s2", formula = 0.15, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + 0.10 * period + 1 * rx + cteffect",
variance = 2, dist = "normal")
dc <- genData(100, defc)
dp <- addPeriods(dc, 7, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = 4, lenWaves = 1, startPer = 2)
dp <- addCorGen(dtOld = dp, nvars = 7, idvar = "cluster",
rho = 0.6, corstr = "ar1", dist = "normal",
param1 = "mu", param2 = "s2", cnames = "cteffect")
dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = 10,
level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p>And here are the observed average estimates (based on 200 datasets) alongside the expected values:</p>
<p><img src="https://www.rdatagen.net/img/post-iccvary/p3.png" width="800" /></p>
</div>
<div id="random-slope" class="section level3">
<h3>Random slope</h3>
<p>In this last case, I am exploring what the ICC’s look like in the context of random effects model that includes a cluster-specific intercept <span class="math inline">\(b_c\)</span> and a cluster-specific slope <span class="math inline">\(s_c\)</span>:</p>
<p><span class="math display">\[
Y_{ict} = \mu + \beta_0 t + \beta_1 X_{ct} + b_c + s_c t + e_{ict}
\]</span></p>
<p>Both <span class="math inline">\(b_c\)</span> and <span class="math inline">\(s_c\)</span> are normally distributed with mean 0, and variances <span class="math inline">\(\sigma_b^2\)</span> and <span class="math inline">\(\sigma_s^2\)</span>, respectively. (In this example <span class="math inline">\(\sigma_b^2\)</span> and <span class="math inline">\(\sigma_s^2\)</span> are uncorrelated, but that may not necessarily be the case.)</p>
<p>Because of the random slopes, the variance of the <span class="math inline">\(Y\)</span>’s increase over time:</p>
<p><span class="math display">\[
var(Y_{ict}) = \sigma^2_b + t^2 \sigma^2_s + \sigma^2_e
\]</span></p>
<p>The same is true for the within- and between-period covariances:</p>
<p><span class="math display">\[
\begin{aligned}
cov(Y_{ict}, Y_{jct}) &= \sigma^2_b + t^2 \sigma^2_s \\
\\
cov(Y_{ict}, Y_{jct^\prime}) &= \sigma^2_b + tt^\prime \sigma^2_s \\
\end{aligned}
\]</span></p>
<p>The ICC’s that follow from these various variances and covariances are:</p>
<p><span class="math display">\[
\begin{aligned}
ITT_{tt} &= \frac{\sigma^2_b + t^2 \sigma^2_s}{\sigma^2_b + t^2 \sigma^2_s + \sigma^2_e}\\
\\
ITT_{tt^\prime} & = \frac{\sigma^2_b + tt^\prime \sigma^2_s}{\left[(\sigma^2_b + t^2 \sigma^2_s + \sigma^2_e)(\sigma^2_b + {t^\prime}^2 \sigma^2_s + \sigma^2_e)\right]^\frac{1}{2}}
\end{aligned}
\]</span></p>
<p>In this example, <span class="math inline">\(\sigma^2_s = 0.01\)</span> (and the other variances remain as before), so</p>
<p><span class="math display">\[ ITT_{33} = \frac{0.15 + 3^2 \times 0.01}{0.15 + 3^2 \times 0.01 + 2} =0.1071\]</span>
and</p>
<p><span class="math display">\[ ITT_{36} = \frac{0.15 + 3 \times 6 \times 0.01}{\left[(0.15 + 3^2 \times 0.01 + 2)(0.15 + 6^2 \times 0.01 + 2)\right ]^\frac{1}{2}} =0.1392\]</span></p>
<p>Here’s the data generation:</p>
<pre class="r"><code>defc <- defData(varname = "ceffect", formula = 0, variance = 0.15,
dist = "normal", id = "cluster")
defc <- defData(defc, "cteffect", formula = 0, variance = 0.01,
dist = "normal")
defa <- defDataAdd(varname = "Y",
formula = "0 + ceffect + 0.10 * period + cteffect * period + 1 * rx",
variance = 2, dist = "normal")
dc <- genData(100, defc)
dp <- addPeriods(dc, 7, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = 4, lenWaves = 1, startPer = 2)
dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = 10,
level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p>And here is the comparison between observed and expected ICC’s. The estimates are quite variable, so there appears to be slight bias. However, if I generated more than 200 data sets, the mean would likely converge closer to the expected values.</p>
<p><img src="https://www.rdatagen.net/img/post-iccvary/p4.png" width="800" /></p>
<p>In the next post (or two), I plan on providing some examples of fitting models to the data I’ve generated here. In some cases, fairly standard linear mixed effects models in <code>R</code> may be adequate, but in others, we may need to look elsewhere.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Kasza, J., K. Hemming, R. Hooper, J. N. S. Matthews, and A. B. Forbes. “Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials.” <em>Statistical methods in medical research</em> (2017): 0962280217734981.</p>
</font></small>
</p>
</div>
Planning a stepped-wedge trial? Make sure you know what you're assuming about intra-cluster correlations ...
https://www.rdatagen.net/post/intra-cluster-correlations-over-time/
Tue, 25 Jun 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/intra-cluster-correlations-over-time/
<p>A few weeks ago, I was at the annual meeting of the <a href="https://rethinkingclinicaltrials.org/">NIH Collaboratory</a>, which is an innovative collection of collaboratory cores, demonstration projects, and NIH Institutes and Centers that is developing new models for implementing and supporting large-scale health services research. A study I am involved with - <em>Primary Palliative Care for Emergency Medicine</em> - is one of the demonstration projects in this collaboratory.</p>
<p>The second day of this meeting included four panels devoted to the design and analysis of embedded pragmatic clinical trials, and focused on the challenges of conducting rigorous research in the real-world context of a health delivery system. The keynote address that started off the day was presented by David Murray of NIH, who talked about the challenges and limitations of cluster randomized trials. (I’ve written before on issues related to clustered randomized trials, including <a href="https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/">here</a>.)</p>
<p>In particular, Dr. Murray talked a great deal about stepped-wedge designs, which have become a quite popular tool in health services research. (I described stepped-wedge designs <a href="https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/">here</a>.) A big takeaway from the talk was that we must be cognizant of the underlying assumptions of the models used to estimate treatment effects; being unaware can lead to biased estimates of treatment effects, or more likely, biased estimates of uncertainty.</p>
<div id="intra-cluster-correlations" class="section level3">
<h3>Intra-cluster correlations</h3>
<p>If outcomes of subjects in a study are correlated in any way (e.g. they received care from the same health care provider), we do not learn as much information from each individual study participant as we would in the case where there is no correlation. In a parallel designed cluster randomized trial (where half of the clusters receive an intervention and the other half do not), we expect that the outcomes will be correlated <em>within</em> each cluster, though not <em>across</em> clusters. (This is not true if the clusters are themselves clustered, in which case we would have a 2-level clustered study.) This intra-cluster correlation (ICC) increases sample size requirements and reduces precision/power.</p>
<p>A common way to model correlation explicitly in a cluster randomized trial is to conceive of a random effects model like this:</p>
<p><span class="math display">\[(1) \qquad \qquad Y_{ic} = \mu + \beta_1X_{c} + b_c + e_{ic},\]</span></p>
<p>where <span class="math inline">\(Y_{ic}\)</span> is a continuous outcome for subject <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(c\)</span>, and <span class="math inline">\(X_c\)</span> is a treatment indicator for cluster <span class="math inline">\(c\)</span> (either 0 or 1). The underlying structural parameters are <span class="math inline">\(\mu\)</span>, the grand mean, and <span class="math inline">\(\beta_1\)</span>, the treatment effect. The unobserved random effects are, <span class="math inline">\(b_c \sim N(0, \sigma^2_b)\)</span>, the normally distributed group level effect, and <span class="math inline">\(e_{ic} \sim N(0, \sigma^2_e)\)</span>, the normally distributed individual-level effect. (This is often referred to as the “error” term, but that doesn’t adequately describe what is really unmeasured individual variation.)</p>
<p>The correlation between any two subjects <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> in the <em>same</em> cluster <span class="math inline">\(c\)</span> is:</p>
<p><span class="math display">\[ cor(Y_{ic}, Y_{jc}) = \frac{cov(Y_{ic}, Y_{jc})} {\sqrt {var(Y_{ic})var(Y_{jc})}} \]</span></p>
<p><span class="math inline">\(cov(Y_{ic}, Y_{jc})\)</span> can be written in terms of the parameters in the underlying data generating process:</p>
<p><span class="math display">\[
\begin{aligned}
cov(Y_{ic}, Y_{jc}) &= cov(\mu + \beta_1X_c + b_c + e_{ic}, \mu + \beta_1X_c + b_c + e_{jc}) \\
&=cov(b_c, b_c) + cov(e_{ic},e_{jc} ) \\
&=\sigma^2_b + 0 \\
&=\sigma^2_b
\end{aligned}
\]</span></p>
<p>The terms simplify since the cluster level effects are independent of the individual level effects (and all the fixed effects in the model) and the individual level effects are independent of each other. The within-period intra-cluster co-variance depends only on the between cluster variation.</p>
<p>The total variance of the outcomes <span class="math inline">\(Y_{ic}\)</span> is:</p>
<p><span class="math display">\[
\begin{aligned}
var(Y_{ic}) &= var(\mu + \beta_1X_c + b_c + e_{ic}) \\
&= var(b_c) + var(e_{ic}) \\
&= \sigma^2_b + \sigma^2_e
\end{aligned}
\]</span></p>
<p>Substituting all of this into the original equation gives us the intra-cluster correlation for any two subjects in the cluster:</p>
<p><span class="math display">\[
\begin{aligned}
cor(Y_{ic}, Y_{jc}) &= \frac{cov(Y_{ic}, Y_{jc})} {\sqrt {var(Y_{ic})var(Y_{jc})}} \\
\\
ICC &= \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_e}
\end{aligned}
\]</span></p>
<p>So, the correlation between any two subjects in a cluster increases as the variation <em>between</em> clusters increases.</p>
</div>
<div id="cluster-randomization-when-time-matters" class="section level3">
<h3>Cluster randomization when time matters</h3>
<p>Moving beyond the parallel design to the stepped-wedge design, time starts to play a very important role. It is important to ensure that we do not confound treatment and time effects; we have to be careful that we do not attribute the general changes over time to the intervention. This is accomplished by introducing a time trend into the model. (Actually, it seems more common to include a time-specific effect so that each time period has its own effect. However, for simulation purposes, I will will assume a linear trend.)</p>
<p>In the stepped-wedge design, we are essentially estimating within-cluster treatment effects by comparing the cluster with itself pre- and post-intervention. To estimate sample size and precision (or power), it is no longer sufficient to consider a single ICC, because there are now multiple ICC’s - the within-period ICC and the between-period ICC’s. The within-period ICC is what we defined in the parallel design (since we effectively treated all observations as occurring in the same period.) Now we also need to consider the expected correlation of two individuals in the <em>same</em> cluster in <em>different</em> time periods.</p>
<p>If we do not properly account for within-period ICC and the between-period ICC’s in either the planning or analysis stages, we run the risk of generating biased estimates.</p>
<p>My primary aim is to describe possible data generating processes for the stepped wedge design and what implications they have for both the within-period and between-period ICC’s. I will generate data to confirm that observed ICC’s match up well with the theoretical expectations. This week I will consider the simplest model, one that is frequently used but whose assumptions may not be realistic in many applications. In a follow-up post, I will consider more flexible data generating processes.</p>
</div>
<div id="constant-iccs-over-time" class="section level3">
<h3>Constant ICC’s over time</h3>
<p>Here is probably the simplest model that can be conceived for a process underlying the stepped-wedge design:</p>
<p><span class="math display">\[
(2) \qquad \qquad Y_{ict} = \mu + \beta_0t + \beta_1X_{ct} + b_c + e_{ict}
\]</span></p>
<p>As before, the unobserved random effects are <span class="math inline">\(b_c \sim N(0, \sigma^2_b)\)</span> and <span class="math inline">\(e_{ict} \sim N(0, \sigma^2_e)\)</span>. The key differences between this model compared to the parallel design is the time trend and time-dependent treatment indicator. The time trend accounts for the fact that the outcome may change over time regardless of the intervention. And since the cluster will be in both the control and intervention states we need to have an time-dependent intervention indicator. (This model is a slight variation on the <em>Hussey and Hughes</em> model, which includes a time-specific effect <span class="math inline">\(\beta_t\)</span> rather than a linear time trend. This paper by <a href="https://journals.sagepub.com/doi/full/10.1177/0962280217734981"><em>Kasza et al</em></a> describes this stepped-wedge model, and several others, in much greater detail.)</p>
<p>The <em>within-period</em> ICC from this is model is:</p>
<p><span class="math display">\[
\begin{aligned}
cor(Y_{ict}, Y_{jct}) &= cor(\mu + \beta_0t + \beta_1X_{ct} + b_c + e_{ict}, \ \mu + \beta_0t + \beta_1X_{ct} + b_c + e_{jct}) \\
\\
ICC_{tt}&= \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_e}
\end{aligned}
\]</span></p>
<p>I have omitted the intermediary steps, but the logic is the same as in the parallel design case. The within-period ICC under this model is also the same as the ICC in the parallel design.</p>
<p>More importantly, in this case the <em>between-period</em> ICC turns out to be the same as the <em>within-period</em> ICC. For the <em>between-period</em> ICC, we are estimating the expected correlation between any two subjects <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> in cluster <span class="math inline">\(c\)</span>, one in time period <span class="math inline">\(t\)</span> and the other in time period <span class="math inline">\(t^\prime\)</span> <span class="math inline">\((t \ne t^\prime)\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
cor(Y_{ict}, Y_{jct^\prime}) &= cor(\mu + \beta_0t + \beta_1X_{ct} + b_c + e_{ict}, \ \mu + \beta_0t^\prime + \beta_1X_{ct^\prime} + b_c + e_{jct^\prime}) \\
\\
ICC_{tt^\prime}&= \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_e}
\end{aligned}
\]</span></p>
<p>Under this seemingly reasonable (and popular) model, we are making a big assumption that the within-period ICC and between-period ICC’s are equal and constant throughout the study. This may or may not be reasonable - but it is important to acknowledge the assumption and to make sure we justify that choice.</p>
</div>
<div id="generating-data-to-simulate-a-stepped-wedge-design" class="section level3">
<h3>Generating data to simulate a stepped-wedge design</h3>
<p>I’ve generated data from a stepped-wedge design <a href="https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/">before</a> on this blog, but will repeat the details here. For the data definitions, we define the variance of the cluster-specific effects, the cluster sizes, and the outcome model.</p>
<pre class="r"><code>defc <- defData(varname = "ceffect", formula = 0, variance = 0.15,
dist = "normal", id = "cluster")
defc <- defData(defc, "m", formula = 10, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + 0.10 * period + 1 * rx + ceffect",
variance = 2, dist = "normal")</code></pre>
<p>The data generation follows this sequence: cluster data, temporal data, stepped-wedge treatment assignment, and individual (within cluster) data:</p>
<pre class="r"><code>dc <- genData(100, defc)
dp <- addPeriods(dc, 7, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = 4, lenWaves = 1, startPer = 2)
dd <- genCluster(dp, cLevelVar = "timeID", "m", "id")
dd <- addColumns(defa, dd)
dd</code></pre>
<pre><code>## cluster period ceffect m timeID startTrt rx id Y
## 1: 1 0 -0.073 10 1 2 0 1 -2.12
## 2: 1 0 -0.073 10 1 2 0 2 -1.79
## 3: 1 0 -0.073 10 1 2 0 3 1.53
## 4: 1 0 -0.073 10 1 2 0 4 -1.44
## 5: 1 0 -0.073 10 1 2 0 5 2.25
## ---
## 6996: 100 6 0.414 10 700 5 1 6996 1.28
## 6997: 100 6 0.414 10 700 5 1 6997 0.30
## 6998: 100 6 0.414 10 700 5 1 6998 0.94
## 6999: 100 6 0.414 10 700 5 1 6999 1.43
## 7000: 100 6 0.414 10 700 5 1 7000 0.58</code></pre>
<p>It is always useful (and important) to visualize the data (regardless of whether they are simulated or real). This is the summarized cluster-level data. The clusters are grouped together in waves defined by starting point. In this case, there are 25 clusters per wave. The light blue represents pre-intervention periods, and the dark blue represents intervention periods.</p>
<p><img src="https://www.rdatagen.net/post/2019-06-25-intra-cluster-correlations-over-time.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="estimating-the-between-period-within-cluster-correlation" class="section level3">
<h3>Estimating the between-period within-cluster correlation</h3>
<p>I want to estimate the observed between-period within cluster correlation without imposing any pre-conceived structure. In particular, I want to see if the data generated by the process defined in equation (2) above does indeed lead to constant within- and between-period ICC’s. In a future post, I will estimate the ICC using a model, but for now, I’d prefer to estimate the ICC’s directly from the data.</p>
<p>A 1982 paper by <a href="https://academic.oup.com/aje/article/116/4/722/52694"><em>Bernard Rosner</em></a> provides a non-parametric estimate of the <em>between-period</em> ICC. He gives this set of equations to find the correlation coefficient <span class="math inline">\(\rho_{tt^\prime}\)</span> for two time periods <span class="math inline">\(t\)</span> and <span class="math inline">\(t^\prime\)</span>. In the equations, <span class="math inline">\(m_{ct}\)</span> represents the cluster size for cluster <span class="math inline">\(c\)</span> in time period <span class="math inline">\(t\)</span>, and <span class="math inline">\(K\)</span> represents the number of clusters:</p>
<p><span class="math display">\[
\rho_{tt^\prime} = \frac{\sum_{c=1}^K \sum_{i=1}^{m_{ct}} \sum_{j=1}^{m_{ct^\prime}} (Y_{ict}-\mu_t)(Y_{jct^\prime}-\mu_{t^\prime})} {\left[ \left ( \sum_{c=1}^K m_{ct^\prime} \sum_{i=1}^{m_{ct}} (Y_{ict}-\mu_t)^2 \right ) \left ( \sum_{c=1}^K m_{ct} \sum_{j=1}^{m_{ct^\prime}} (Y_{jct^\prime}-\mu_{t^\prime})^2 \right )\right] ^ \frac {1}{2}}
\]</span></p>
<p><span class="math display">\[
\mu_t = \frac{\sum_{c=1}^K m_{ct} m_{ct^\prime} \mu_{ct}}{\sum_{c=1}^K m_{ct} m_{ct^\prime}} \ \ , \ \ \mu_{t^\prime} = \frac{\sum_{c=1}^K m_{ct} m_{ct^\prime} \mu_{ct^\prime}}{\sum_{c=1}^K m_{ct} m_{ct^\prime}}
\]</span></p>
<p><span class="math display">\[
\mu_{ct} = \frac{\sum_{i=1}^{m_{ct}} Y_{ict}}{m_{ct}} \ \ , \ \ \mu_{ct^\prime} = \frac{\sum_{j=1}^{m_{ct^\prime}} Y_{jct^\prime}}{m_{ct^\prime}}
\]</span></p>
<p>I’ve implemented the algorithm in <code>R</code>, and the code is included in the addendum. One issue that came up is that as the intervention is phased in over time, the treatment effect is present for each at different times. The algorithm breaks down as a result. However, the between-period ICC can be calculated for each wave, and then we can average across the four waves.</p>
<p>The <em>within-period</em> ICC is estimated using a linear mixed effects model applied to each period separately, so that we estimate period-specific within-period ICC’s. The expected (constant) ICC is <span class="math inline">\(0.07 = \left(\frac{0.15}{0.15 + 2}\right)\)</span>.</p>
<p>The function <code>iccs</code> (shown below in the addendum) returns both the estimated <em>within-</em> and <em>between-cluster</em> ICC’s for a single data set. Here is the within-period ICC for the first period (actually period 0) and the between-period ICC’s using period 0:</p>
<pre class="r"><code>set.seed(47463)
iccs(dd, byWave = T)[,c(22, 0:6)]</code></pre>
<pre><code>## wp0 bp01 bp02 bp03 bp04 bp05 bp06
## 1: 0.041 0.068 0.073 0.08 0.067 0.054 0.053</code></pre>
<p>ICC estimates are quite variable and we can’t tell anything about the distribution from any single data set. Generating multiple replications lets us see if the estimates are close, on average, to our assumption of constant ICC’s. Here is a function to generate a single data set:</p>
<pre class="r"><code>genDD <- function(defc, defa, nclust, nperiods, waves, len, start) {
dc <- genData(nclust, defc)
dp <- addPeriods(dc, nperiods, "cluster")
dp <- trtStepWedge(dp, "cluster", nWaves = waves,
lenWaves = len, startPer = start)
dd <- genCluster(dp, cLevelVar = "timeID", "m", "id")
dd <- addColumns(defa, dd)
return(dd[])
}</code></pre>
<p>And here is a function to estimate 200 sets of ICC’s for 200 data sets:</p>
<pre class="r"><code>icc <- mclapply(1:200,
function(x) iccs(genDD(defc, defa, 100, 7, 4, 1, 2), byWave = T),
mc.cores = 4
)
observed <- sapply(rbindlist(icc), function(x) mean(x))</code></pre>
<p>Averages of all the <em>within-</em> and <em>between-period</em> ICC’s were in fact quite close to the “true” value of 0.07 based on a relatively small number of replications. The plot shows the observed averages along side the expected value (shown in gray) for each of the periods generated in the data. There is little variation across both the <em>within-</em> and <em>between-period</em> ICC’s.</p>
<p><img src="https://www.rdatagen.net/img/post-iccvary/p1.png" width="800" /></p>
<p>I’ll give you a little time to absorb this. Next time, I will consider alternative data generating processes where the the ICC’s are not necessarily constant.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Kasza, J., K. Hemming, R. Hooper, J. N. S. Matthews, and A. B. Forbes. “Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials.” <em>Statistical methods in medical research</em> (2017): 0962280217734981.</p>
<p>Rosner, Bernard. “On the estimation and testing of inter-class correlations: the general case of multiple replicates for each variable.” <em>American journal of epidemiology</em> 116, no. 4 (1982): 722-730.</p>
</font></small>
</p>
<p> </p>
</div>
<div id="addendum-r-code-for-simulations" class="section level3">
<h3>Addendum: R code for simulations</h3>
<pre class="r"><code>library(lme4)
library(parallel)
Covar <- function(dx, clust, period1, period2, x_0, x_1) {
v0 <- dx[ctemp == clust & period == period1, Y - x_0]
v1 <- dx[ctemp == clust & period == period2, Y - x_1]
sum(v0 %*% t(v1))
}
calcBP <- function(dx, period1, period2) {
# dx <- copy(d2)
# create cluster numbers starting from 1
tt <- dx[, .N, keyby = cluster]
nclust <- nrow(tt)
dx[, ctemp := rep(1:nclust, times = tt$N)]
dx <- dx[period %in% c(period1, period2)]
## Grand means
dg <- dx[, .(m=.N, mu = mean(Y)), keyby = .(ctemp, period)]
dg <- dcast(dg, formula = ctemp ~ period, value.var = c("m","mu"))
setnames(dg, c("ctemp", "m_0", "m_1", "mu_0", "mu_1"))
x_0 <- dg[, sum(m_0 * m_1 * mu_0)/sum(m_0 * m_1)]
x_1 <- dg[, sum(m_0 * m_1 * mu_1)/sum(m_0 * m_1)]
## Variance (denominator)
dss_0 <- dx[period == period1, .(ss_0 = sum((Y - x_0)^2)),
keyby = ctemp]
dss_0[, m_1 := dg[, m_1]]
v_0 <- dss_0[, sum(m_1 * ss_0)]
dss_1 <- dx[period == period2, .(ss_1 = sum((Y - x_1)^2)),
keyby = ctemp]
dss_1[, m_0 := dg[, m_0]]
v_1 <- dss_1[, sum(m_0 * ss_1)]
## Covariance
v0v1 <- sapply(1:nclust,
function(x) Covar(dx, x, period1, period2, x_0, x_1))
bp.icc <- sum(v0v1)/sqrt(v_0 * v_1)
bp.icc
}
btwnPerICC <- function(dd, period1, period2, byWave = FALSE) {
if (byWave) {
waves <- dd[, unique(startTrt)]
bpICCs <- sapply(waves, function(x)
calcBP(dd[startTrt==x], period1, period2))
return(mean(bpICCs))
} else {
calcBP(dd, period1, period2)
}
}
withinPerICC <- function(dx) {
lmerfit <- lmer(Y~rx + (1|cluster), data = dx)
vars <- as.data.table(VarCorr(lmerfit))[, vcov]
vars[1]/sum(vars)
}
genPairs <- function(n) {
x <- combn(x = c(1:n-1), 2)
lapply(seq_len(ncol(x)), function(i) x[,i])
}
iccs <- function(dd, byWave = FALSE) {
nperiods <- dd[, length(unique(period))]
bperiods <- genPairs(nperiods)
names <-
unlist(lapply(bperiods, function(x) paste0("bp", x[1], x[2])))
bp.icc <- sapply(bperiods,
function(x) btwnPerICC(dd, x[1], x[2], byWave))
system(paste("echo ."))
bdd.per <- lapply(1:nperiods - 1, function(x) dd[period == x])
wp.icc <- lapply(bdd.per,
function(x) withinPerICC(x))
wp.icc <- unlist(wp.icc)
nameswp <- sapply(1:nperiods - 1, function(x) paste0("wp", x))
do <- data.table(t(c(bp.icc, wp.icc)))
setnames(do, c(names, nameswp))
return(do[])
}</code></pre>
</div>
Don't get too excited - it might just be regression to the mean
https://www.rdatagen.net/post/regression-to-the-mean/
Tue, 11 Jun 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/regression-to-the-mean/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>It is always exciting to find an interesting pattern in the data that seems to point to some important difference or relationship. A while ago, one of my colleagues shared a figure with me that looked something like this:</p>
<p><img src="https://www.rdatagen.net/post/2019-06-11-regression-to-the-mean.en_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>It looks like something is going on. On average low scorers in the first period increased a bit in the second period, and high scorers decreased a bit. Something <strong>is</strong> going on, but nothing specific to the data in question; it is just probability working its magic.</p>
<p>What my colleague had shown me is a classic example of <em>regression to the mean</em>. In the hope of clarifying the issue, I created a little simulation for her to show I could recreate this scenario with arbitrary data. And now I share it with you.</p>
<div id="what-is-regression-to-the-mean" class="section level3">
<h3>What <em>is</em> regression to the mean?</h3>
<p>A simple picture may clarify what underlies regression to the mean. An individual’s measured responses over time are a function of various factors. In this first scenario, the responses are driven entirely by short term factors:</p>
<p><img src="https://www.rdatagen.net/img/post-regression-to-mean/shortcauses.png" width="500" /></p>
<p>Responses in the two different time periods depend only on proximal causes. These could include an individual’s mood (which changes over time) or maybe something unrelated to the individual that would induce measurement error. (If the short term factor is not measured, this what is typically considered random noise or maybe “error”; I prefer to refer to this quantity as something like unexplained variation or individual level effects.) When these are the only factors influencing the responses, we would expect the responses in each period to be uncorrelated.</p>
<p>Regression to the mean manifests itself when we focus on sub-groups at extreme ends of the distribution. Here, we consider a sub-group of individuals with high levels of response in the first period. Since factors that led to these high values will not necessarily be present in the second period, we would expect the distribution of values for the sub-group in the <strong>second</strong> period to look like the distribution in the <em>full sample</em> (including high, moderate, and low responders) from the <strong>first</strong> period. Alternatively, if we think about the second period alone, we would expect the high value sub-group (from the first period) to look just like the rest of the sample. Either way we look at it, the sub-group mean in the second period will necessarily be lower than the mean of that same sub-group in the first period.</p>
<p>A simulation might clarify this. <span class="math inline">\(p_1\)</span> and <span class="math inline">\(p_2\)</span> are the short term factors influencing the period one outcome <span class="math inline">\(x_1\)</span> and period two outcome <span class="math inline">\(x_2\)</span>, respectively. The indicator <span class="math inline">\(h_1 = 1\)</span> if the period one response falls in the top <span class="math inline">\(20\%\)</span> of responses:</p>
<pre class="r"><code>d <- defData(varname = "p1", formula = 0, variance = 1, dist = "normal")
d <- defData(d, varname = "p2", formula = 0, variance = 1, dist = "normal")
d <- defData(d, varname = "x1", formula = "0 + p1", dist = "nonrandom")
d <- defData(d, varname = "x2", formula = "0 + p2", dist = "nonrandom")
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
dist = "nonrandom")</code></pre>
<pre class="r"><code>set.seed(2371)
dd <- genData(1000, d)</code></pre>
<p>The average (and sd) for the full sample in period one and period two are pretty much the same:</p>
<pre class="r"><code>dd[, .(mu.x1 = mean(x1), sd.x1 = sd(x1),
mu.x2 = mean(x2), sd.x2 = sd(x2))]</code></pre>
<pre><code>## mu.x1 sd.x1 mu.x2 sd.x2
## 1: 0.02 1 -0.07 1</code></pre>
<p>The mean of the sub-group of the sample who scored in the top 20% in period one is obviously higher than the full sample period one average since this is how we defined the sub-group. However, the period two distribution for this sub-group looks like the <em>overall</em> sample in period two. Again, this is due to the fact that the distribution of <span class="math inline">\(p_2\)</span> is the <em>same</em> for the period one high scoring sub-group and everyone else:</p>
<pre class="r"><code>cbind(dd[h1 == TRUE, .(muh.x1 = mean(x1), sdh.x1 = sd(x1),
muh.x2 = mean(x2), sdh.x2 = sd(x2))],
dd[, .(mu.x2 = mean(x2), sd.x2 = sd(x2))])</code></pre>
<pre><code>## muh.x1 sdh.x1 muh.x2 sdh.x2 mu.x2 sd.x2
## 1: 1 0.5 -0.08 1 -0.07 1</code></pre>
</div>
<div id="a-more-realistic-scenario" class="section level3">
<h3>A more realistic scenario</h3>
<p>It is unlikely that the repeated measures <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span> will be uncorrelated, and more plausible that they share some common factor or factors; someone who tends to score high in the first may tend to score high in the second. For example, an individual’s underlying health status could influence outcomes over both measurement periods. Here is the updated DAG:</p>
<p><img src="https://www.rdatagen.net/img/post-regression-to-mean/causes.png" width="500" /></p>
<p>Regression to the mean is really a phenomenon driven by the relative strength of the longer term underlying factors and shorter term proximal factors. If the underlying factors dominate the more proximal ones, then the we would expect to see less regression to the mean. (In the extreme case where there no proximal factors, only longer term, underlying ones, there will be no regression to the mean.)</p>
<p>Back to the simulation. (This time <span class="math inline">\(p_1\)</span> and <span class="math inline">\(p_2\)</span> are reflected in the variance of the two responses, so they do not appear explicitly in the data definitions.)</p>
<pre class="r"><code>library(parallel)
d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "x2", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
dist = "nonrandom")
set.seed(2371)
dd <- genData(1000, d)</code></pre>
<p>When we look at the means of the period one high scoring sub-group in periods one and two, it appears that there is at least <em>some</em> regression to the mean, but it is not absolute, because the underlying factors <span class="math inline">\(U\)</span> have a fairly strong effect on the responses in both periods:</p>
<pre><code>## muh.x1 sdh.x1 muh.x2 sdh.x2 mu.x2 sd.x2
## 1: 2 0.6 1 1 -0.02 1</code></pre>
</div>
<div id="regression-to-the-mean-under-different-scenarios" class="section level3">
<h3>Regression to the mean under different scenarios</h3>
<p>To conclude, I want to illustrate how the relative strength of <span class="math inline">\(U\)</span>, <span class="math inline">\(p_1\)</span>, and <span class="math inline">\(p_2\)</span> affect the regression to the mean. (The code to generate the plot immediately follows.) Under each simulation scenario I generated 1000 data sets of 200 individuals each, and averaged across the 1000 replications to show the mean <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span> measurements <em>for the high scorers only in period one</em>. In all cases, period one scores are to the right and the arrow points to the period two scores. The longer the arrow, the more extensive the regression to the mean.</p>
<p><img src="https://www.rdatagen.net/post/2019-06-11-regression-to-the-mean.en_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
<p>As the effect of <span class="math inline">\(U\)</span> grows (moving down from box to box in the plot), regression to the mean decreases. And within each box, as we decrease the strength of the proximal <span class="math inline">\(p\)</span> factors (by decreasing the variance of the <span class="math inline">\(p_1\)</span> and <span class="math inline">\(p_2\)</span>), regression to the mean also decreases.</p>
<p id="addendum">
</p>
</div>
<div id="addendum-code-to-generate-replications-and-plot" class="section level3">
<h3>Addendum: code to generate replications and plot</h3>
<pre class="r"><code>rtomean <- function(n, d) {
dd <- genData(n, d)
data.table(x1 = dd[x1 >= h1, mean(x1)] , x2 = dd[x1 >= h1, mean(x2)])
}
repl <- function(xvar, nrep, ucoef, d) {
d <- updateDef(d, "x1", newvariance = xvar)
d <- updateDef(d, "x2", newvariance = xvar)
dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d)))
mudif <- unlist(lapply(dif, mean))
data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
}
dres <- list()
i <- 0
for (ucoef in c(0, 1, 2, 3)) {
i <- i + 1
uform <- genFormula( c(0, ucoef), "U")
d <- updateDef(d, "x1", newformula = uform)
d <- updateDef(d, "x2", newformula = uform)
dr <- mclapply(seq(1, 4, by = 1), function(x) repl(x, 1000, ucoef, d))
dres[[i]] <- rbindlist(dr)
}
dres <- rbindlist(dres)
ggplot(data = dres, aes(x = x1, xend = x2, y = xvar, yend = xvar)) +
geom_point(aes(x=x1, y = xvar), color = "#824D99", size = 1) +
geom_segment(arrow = arrow(length = unit(.175, "cm")),
color = "#824D99") +
scale_y_continuous(limits = c(0.5, 4.5), breaks = 1:4,
name = "variance of measurements") +
scale_x_continuous(limits = c(-0.1, 3), name = "mean") +
facet_grid(ucoef ~ .) +
theme(panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank())</code></pre>
</div>
simstudy update - stepped-wedge design treatment assignment
https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/
Tue, 28 May 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/
<p><code>simstudy</code> has just been updated (version 0.1.13 on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a>), and includes one interesting addition (and a couple of bug fixes). I am working on a post (or two) about intra-cluster correlations (ICCs) and stepped-wedge study designs (which I’ve written about <a href="https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/">before</a>), and I was getting tired of going through the convoluted process of generating data from a time-dependent treatment assignment process. So, I wrote a new function, <code>trtStepWedge</code>, that should simplify things.</p>
<p>I will take the opportunity of this brief announcement to provide a quick example.</p>
<div id="data-definition" class="section level3">
<h3>Data definition</h3>
<p>Stepped-wedge designs are a special class of cluster randomized trial where each cluster is observed in both treatment arms (as opposed to the classic parallel design where only some of the clusters receive the treatment). This is a special case of a cross-over design, where the cross-over is only in one direction: control (or pre-intervention) to intervention.</p>
<p>In this example, the data generating process looks like this:</p>
<p><span class="math display">\[Y_{ict} = \beta_0 + b_c + \beta_1 * t + \beta_2*X_{ct} + e_{ict}\]</span></p>
<p>where <span class="math inline">\(Y_{ict}\)</span> is the outcome for individual <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(c\)</span> in time period <span class="math inline">\(t\)</span>, <span class="math inline">\(b_c\)</span> is a cluster-specific effect, <span class="math inline">\(X_{ct}\)</span> is the intervention indicator that has a value 1 during periods where the cluster is under the intervention, and <span class="math inline">\(e_{ict}\)</span> is the individual-level effect. Both <span class="math inline">\(b_c\)</span> and <span class="math inline">\(e_{ict}\)</span> are normally distributed with mean 0 and variances <span class="math inline">\(\sigma^2_{b}\)</span> and <span class="math inline">\(\sigma^2_{e}\)</span>, respectively. <span class="math inline">\(\beta_1\)</span> is the time trend, and <span class="math inline">\(\beta_2\)</span> is the intervention effect.</p>
<p>We need to define the cluster-level variables (i.e. the cluster effect and the cluster size) as well as the individual specific outcome. In this case each cluster will have 15 individuals per period, and <span class="math inline">\(\sigma^2_b = 0.20\)</span>. In addition, <span class="math inline">\(\sigma^2_e = 1.75\)</span>.</p>
<pre class="r"><code>library(simstudy)
library(ggplot2)
defc <- defData(varname = "ceffect", formula = 0, variance = 0.20,
dist = "normal", id = "cluster")
defc <- defData(defc, "m", formula = 15, dist = "nonrandom")
defa <- defDataAdd(varname = "Y",
formula = "0 + ceffect + 0.1*period + trt*1.5",
variance = 1.75, dist = "normal")</code></pre>
<p>In this case, there will be 30 clusters and 24 time periods. With 15 individuals per cluster per period, there will be 360 observations for each cluster, and 10,800 in total. (There is no reason the cluster sizes need to be deterministic, but I just did that to simplify things a bit.)</p>
<p>Cluster-level intervention assignment is done after generating the cluster-level and time-period data. The call to <code>trtStepWedge</code> includes 3 key arguments that specify the number of waves, the length of each wave, and the period during which the first clusters begin the intervention.</p>
<p><code>nWaves</code> indicates how many clusters share the same starting period for the intervention. In this case, we have 5 waves, with 6 clusters each. <code>startPer</code> is the first period of the first wave. The earliest starting period is 0, the first period. Here, the first wave starts the intervention during period 4. <code>lenWaves</code> indicates the length between starting points for each wave. Here, a length of 4 means that the starting points will be 4, 8, 12, 16, and 20.</p>
<p>Once the treatment assignments are made, the individual records are created and the outcome data are generated in the last step.</p>
<pre class="r"><code>set.seed(608477)
dc <- genData(30, defc)
dp <- addPeriods(dc, 24, "cluster", timevarName = "t")
dp <- trtStepWedge(dp, "cluster", nWaves = 5, lenWaves = 4,
startPer = 4, grpName = "trt")
dd <- genCluster(dp, cLevelVar = "timeID", "m", "id")
dd <- addColumns(defa, dd)
dd</code></pre>
<pre><code>## cluster period ceffect m timeID startTrt trt id Y
## 1: 1 0 0.628 15 1 4 0 1 1.52
## 2: 1 0 0.628 15 1 4 0 2 0.99
## 3: 1 0 0.628 15 1 4 0 3 -0.12
## 4: 1 0 0.628 15 1 4 0 4 2.09
## 5: 1 0 0.628 15 1 4 0 5 -2.34
## ---
## 10796: 30 23 -0.098 15 720 20 1 10796 1.92
## 10797: 30 23 -0.098 15 720 20 1 10797 5.92
## 10798: 30 23 -0.098 15 720 20 1 10798 4.12
## 10799: 30 23 -0.098 15 720 20 1 10799 4.57
## 10800: 30 23 -0.098 15 720 20 1 10800 3.66</code></pre>
<p>It is easiest to understand the stepped-wedge design by looking at it. Here, we average the outcomes by each cluster for each period and plot the results.</p>
<pre class="r"><code>dSum <- dd[, .(Y = mean(Y)), keyby = .(cluster, period, trt, startTrt)]
ggplot(data = dSum,
aes(x = period, y = Y, group = interaction(cluster, trt))) +
geom_line(aes(color = factor(trt))) +
facet_grid(factor(startTrt, labels = c(1 : 5)) ~ .) +
scale_x_continuous(breaks = seq(0, 23, by = 4), name = "week") +
scale_color_manual(values = c("#b8cce4", "#4e81ba")) +
theme(panel.grid = element_blank(),
legend.position = "none") </code></pre>
<p><img src="https://www.rdatagen.net/post/2019-05-28-simstudy-update-stepped-wedge-treatment-assignment.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Key elements of the data generation process are readily appreciated by looking at the graph: (1) the cluster-specific effects, reflected in the variable starting points at period 0, (2) the general upward time trend, and (3), the stepped-wedge intervention scheme.</p>
<p>Since <code>trtStepWedge</code> is a new function, it is still a work in progress. Feel free to get in touch to give me feedback on any enhancements that folks might find useful.</p>
</div>
Generating and modeling over-dispersed binomial data
https://www.rdatagen.net/post/overdispersed-binomial-data/
Tue, 14 May 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/overdispersed-binomial-data/
<p>A couple of weeks ago, I was inspired by a study to <a href="https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/">write</a> about a classic design issue that arises in cluster randomized trials: should we focus on the number of clusters or the size of those clusters? This trial, which is concerned with preventing opioid use disorder for at-risk patients in primary care clinics, has also motivated this second post, which concerns another important issue - over-dispersion.</p>
<div id="a-count-outcome" class="section level3">
<h3>A count outcome</h3>
<p>In this study, one of the primary outcomes is the number of days of opioid use over a six-month follow-up period (to be recorded monthly by patient-report and aggregated for the six-month measure). While one might get away with assuming that the outcome is continuous, it really is not; it is a <em>count</em> outcome, and the possible range is 0 to 180. There are two related questions here - what model will be used to analyze the data once the study is complete? And, how should we generate simulated data to estimate the power of the study?</p>
<p>In this particular study, the randomization is at the physician level so that all patients in a particular physician practice will be in control or treatment. (For the purposes of simplification here, I am going to assume there is no treatment effect, so that all variation in the outcome is due to physicians and patients only.) One possibility is to assume the outcome <span class="math inline">\(Y_{ij}\)</span> for patient <span class="math inline">\(i\)</span> in group <span class="math inline">\(j\)</span> has a binomial distribution with 180 different “experiments” - every day we ask did the patient use opioids? - so that we say <span class="math inline">\(Y_{ij} \sim Bin(180, \ p_{ij})\)</span>.</p>
</div>
<div id="the-probability-parameter" class="section level3">
<h3>The probability parameter</h3>
<p>The key parameter here is <span class="math inline">\(p_{ij}\)</span>, the probability that patient <span class="math inline">\(i\)</span> (in group <span class="math inline">\(j\)</span>) uses opioids on any given day. Given the binomial distribution, the number of days of opioid use we expect to observe for patient <span class="math inline">\(i\)</span> is <span class="math inline">\(180p_{ij}\)</span>. There are at least three ways to think about how to model this probability (though there are certainly more):</p>
<ul>
<li><span class="math inline">\(p_{ij} = p\)</span>: everyone shares the same probability The collection of all patients will represent a sample from <span class="math inline">\(Bin(180, p)\)</span>.</li>
<li><span class="math inline">\(p_{ij} = p_j\)</span>: the probability of the outcome is determined by the cluster or group alone. The data within the cluster will have a binomial distribution, but the collective data set will <em>not</em> have a strict binomial distribution and will be over-dispersed.</li>
<li><span class="math inline">\(p_{ij}\)</span> is unique for each individual. Once again the collective data are over-dispersed, potentially even more so.</li>
</ul>
</div>
<div id="modeling-the-outcome" class="section level3">
<h3>Modeling the outcome</h3>
<p>The correct model depends, of course, on the situation at hand. What data generation process fits what we expect to be the case? Hopefully, there are existing data to inform the likely model. If not, it may by most prudent to be conservative, which usually means assuming more variation (unique <span class="math inline">\(p_{ij}\)</span>) rather than less (<span class="math inline">\(p_{ij} = p\)</span>).</p>
<p>In the first case, the probability (and counts) can be estimated using a generalized linear model (GLM) with a binomial distribution. In the second, one solution (that I will show here) is a generalized linear mixed effects model (GLMM) with a binomial distribution and a group level random effect. In the third case, a GLMM with a negative a <em>negative binomial</em> distribution would be more likely to properly estimate the variation. (I have described other ways to think about these kind of data <a href="https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/">here</a> and <a href="https://www.rdatagen.net/post/binary-beta-beta-binomial/">here</a>.)</p>
</div>
<div id="case-1-binomial-distribution" class="section level3">
<h3>Case 1: binomial distribution</h3>
<p>Even though there is no clustering effect in this first scenario, let’s assume there are clusters. Each individual will have a probability of 0.4 of using opioids on any given day (log odds = -0.405):</p>
<pre class="r"><code>def <- defData(varname = "m", formula = 100, dist = "nonrandom", id = "cid")
defa <- defDataAdd(varname = "x", formula = -.405, variance = 180,
dist = "binomial", link = "logit")</code></pre>
<p>Generate the data:</p>
<pre class="r"><code>set.seed(5113373)
dc <- genData(200, def)
dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p>Here is a plot of 20 of the 100 groups:</p>
<pre class="r"><code>dplot <- dd[cid %in% c(1:20)]
davg <- dplot[, .(avgx = mean(x)), keyby = cid]
ggplot(data=dplot, aes(y = x, x = factor(cid))) +
geom_jitter(size = .5, color = "grey50", width = 0.2) +
geom_point(data = davg, aes(y = avgx, x = factor(cid)),
shape = 21, fill = "firebrick3", size = 2) +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
) +
xlab("Group") +
scale_y_continuous(limits = c(0, 185), breaks = c(0, 60, 120, 180))</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-05-14-overdispersed-binomial-data.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Looking at the plot, we can see that a mixed effects model is probably not relevant.</p>
</div>
<div id="case-2-over-dispersion-from-clustering" class="section level3">
<h3>Case 2: over-dispersion from clustering</h3>
<pre class="r"><code>def <- defData(varname = "ceffect", formula = 0, variance = 0.08,
dist = "normal", id = "cid")
def <- defData(def, varname = "m", formula = "100", dist = "nonrandom")
defa <- defDataAdd(varname = "x", formula = "-0.405 + ceffect",
variance = 100, dist = "binomial", link = "logit")
dc <- genData(200, def)
dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-05-14-overdispersed-binomial-data.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>This plot suggests that variation <em>within</em> the groups is pretty consistent, though there is variation <em>across</em> the groups. This suggests that a binomial GLMM with a group level random effect would be appropriate.</p>
</div>
<div id="case-3-added-over-dispersion-due-to-individual-differences" class="section level3">
<h3>Case 3: added over-dispersion due to individual differences</h3>
<pre class="r"><code>defa <- defDataAdd(varname = "ieffect", formula = 0,
variance = .25, dist = "normal")
defa <- defDataAdd(defa, varname = "x",
formula = "-0.405 + ceffect + ieffect",
variance = 180, dist = "binomial", link = "logit")
dd <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m", level1ID = "id")
dd <- addColumns(defa, dd)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-05-14-overdispersed-binomial-data.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>In this last case, it is not obvious what model to use. Since there is variability within and between groups, it is probably safe to use a negative binomial model, which is most conservative.</p>
</div>
<div id="estimating-the-parameters-under-a-negative-binomial-assumption" class="section level3">
<h3>Estimating the parameters under a negative binomial assumption</h3>
<p>We can fit the data we just generated (with a 2-level mixed effects model) using a <em>single-level</em> mixed effects model with the assumption of a negative binomial distribution to estimate the parameters we can use for one last simulated data set. Here is the model fit:</p>
<pre class="r"><code>nbfit <- glmer.nb(x ~ 1 + (1|cid), data = dd,
control = glmerControl(optimizer="bobyqa"))
broom::tidy(nbfit)</code></pre>
<pre><code>## # A tibble: 2 x 6
## term estimate std.error statistic p.value group
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 4.29 0.0123 347. 0 fixed
## 2 sd_(Intercept).cid 0.172 NA NA NA cid</code></pre>
<p>And to generate the negative binomial data using <code>simstudy</code>, we need a dispersion parameter, which can be extracted from the estimated model:</p>
<pre class="r"><code>(theta <- 1/getME(nbfit, "glmer.nb.theta"))</code></pre>
<pre><code>## [1] 0.079</code></pre>
<pre class="r"><code>revar <- lme4::getME(nbfit, name = "theta")^2
revar</code></pre>
<pre><code>## cid.(Intercept)
## 0.03</code></pre>
<p>Generating the data from the estimated model allows us to see how well the negative binomial model fit the dispersed binomial data that we generated. A plot of the two data sets should look pretty similar, at least with respect to the distribution of the cluster means and within-cluster individual counts.</p>
<pre class="r"><code>def <- defData(varname = "ceffect", formula = 0, variance = revar,
dist = "normal", id = "cid")
def <- defData(def, varname = "m", formula = "100", dist = "nonrandom")
defa <- defDataAdd(varname = "x", formula = "4.28 + ceffect",
variance = theta, dist = "negBinomial", link = "log")
dc <- genData(200, def)
ddnb <- genCluster(dc, cLevelVar = "cid", numIndsVar = "m",
level1ID = "id")
ddnb <- addColumns(defa, ddnb)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-05-14-overdispersed-binomial-data.en_files/figure-html/unnamed-chunk-12-1.png" width="960" /></p>
<p>The two data sets do look like they came from the same distribution. The one limitation of the negative binomial distribution is that the sample space is not limited to numbers between 0 and 180; in fact, the sample space is all non-negative integers. For at least two clusters shown, there are some individuals with counts that exceed 180 days, which of course is impossible. Because of this, it might be safer to use the over-dispersed binomial data as the generating process for a power calculation, but it would be totally fine to use the negative binomial model as the analysis model (in both the power calculation and the actual data analysis).</p>
</div>
<div id="estimating-power" class="section level3">
<h3>Estimating power</h3>
<p>One could verify that power is indeed reduced as we move from <em>Case 1</em> to <em>Case 3</em>. (I’ll leave that as an exercise for you - I think I’ve provided many examples in the past on how one might go about doing this. If, after struggling for a while, you aren’t successful, feel free to get in touch with me.)</p>
</div>
What matters more in a cluster randomized trial: number or size?
https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/
Tue, 30 Apr 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/
<p>I am involved with a trial of an intervention designed to prevent full-blown opioid use disorder for patients who may have an incipient opioid use problem. Given the nature of the intervention, it was clear the only feasible way to conduct this particular study is to randomize at the physician rather than the patient level.</p>
<p>There was a concern that the number of patients eligible for the study might be limited, so that each physician might only have a handful of patients able to participate, if that many. A question arose as to whether we can make up for this limitation by increasing the number of physicians who participate? That is, what is the trade-off between number of clusters and cluster size?</p>
<p>This is a classic issue that confronts any cluster randomized trial - made more challenging by the potentially very small cluster sizes. A primary concern of the investigators is having sufficient power to estimate an intervention effect - how would this trade-off impact that? And as a statistician, I have concerns about bias and variance, which could have important implications depending on what you are interested in measuring.</p>
<div id="clustering-in-a-nutshell" class="section level2">
<h2>Clustering in a nutshell</h2>
<p>This is an immense topic - I won’t attempt to point you to the best resources, because there are so many out there. For me, there are two salient features of cluster randomized trials that present key challenges.</p>
<p>First, individuals in a cluster are not providing as much information as we might imagine. If we take an extreme example of a case where the outcome of everyone in a cluster is identical, we learn absolutely nothing by taking an additional subject from that cluster; in fact, all we need is one subject per cluster, because all the variation is across clusters, not within. Of course, that is overly dramatic, but the same principal is in play even when the outcomes of subjects in a cluster are moderately correlated. The impact of this phenomenon depends on the within cluster correlation relative to the between cluster correlation. The relationship of these two correlations is traditionally characterized by the intra-class coefficient (ICC), which is the ratio of the between-cluster variation to total variation.</p>
<p>Second, if there is high variability across clusters, that gets propagated to the variance of the estimate of the treatment effect. From study to study (which is what we are conceiving of in a frequentist frame of mind), we are not just sampling individuals from the clusters, but we are changing the sample of clusters that we are selecting from! So much variation going on. Of course, if all clusters are exactly the same (i.e. no variation between clusters), then it doesn’t really matter what clusters we are choosing from each time around, and we have no added variability as a result of sampling from different clusters. But, as we relax this assumption of no between-cluster variability, we add over-all variability to the process, which gets translated to our parameter estimates.</p>
<p>The cluster size/cluster number trade-off is driven largely by these two issues.</p>
</div>
<div id="simulation" class="section level2">
<h2>Simulation</h2>
<p>I am generating data from a cluster randomized trial that has the following underlying data generating process:</p>
<p><span class="math display">\[ Y_{ij} = 0.35 * R_j + c_j + \epsilon_{ij}\ ,\]</span>
where <span class="math inline">\(Y_{ij}\)</span> is the outcome for patient <span class="math inline">\(i\)</span> who is being treated by physician <span class="math inline">\(j\)</span>. <span class="math inline">\(R_j\)</span> represents the treatment indicator for physician <span class="math inline">\(j\)</span> (0 for control, 1 for treatment). <span class="math inline">\(c_j\)</span> is the physician-level random effect that is normally distributed <span class="math inline">\(N(0, \sigma^2_c)\)</span>. <span class="math inline">\(\epsilon_{ij}\)</span> is the individual-level effect, and <span class="math inline">\(\epsilon_{ij} \sim N(0, \sigma^2_\epsilon)\)</span>. The expected value of <span class="math inline">\(Y_{ij}\)</span> for patients treated by physicians in the control group is <span class="math inline">\(0\)</span>. And for the patients treated by physicians in the intervention <span class="math inline">\(E(Y_{ij}) = 0.35\)</span>.</p>
<div id="defining-the-simulation" class="section level3">
<h3>Defining the simulation</h3>
<p>The entire premise of this post is that we have a target number of study subjects (which in the real world example was set at 480), and the question is should we spread those subjects across a smaller or larger number of clusters? In all the simulations that follow, then, we have fixed the total number of subjects at 480. That means if we have 240 clusters, there will be only 2 in each one; and if we have 10 clusters, there will be 48 patients per cluster.</p>
<p>In the first example shown here, we are assuming an ICC = 0.10 and 60 clusters of 8 subjects each:</p>
<pre class="r"><code>library(simstudy)
Var <- iccRE(0.10, varWithin = 0.90, dist = "normal")
defC <- defData(varname = "ceffect", formula = 0, variance = Var,
dist = "normal", id = "cid")
defC <- defData(defC, "nperc", formula = "8",
dist = "nonrandom" )
defI <- defDataAdd(varname = "y", formula = "ceffect + 0.35 * rx",
variance = 0.90)</code></pre>
</div>
<div id="generating-a-single-data-set-and-estimating-parameters" class="section level3">
<h3>Generating a single data set and estimating parameters</h3>
<p>Based on the data definitions, I can now generate a single data set:</p>
<pre class="r"><code>set.seed(711216)
dc <- genData(60, defC)
dc <- trtAssign(dc, 2, grpName = "rx")
dd <- genCluster(dc, "cid", numIndsVar = "nperc", level1ID = "id" )
dd <- addColumns(defI, dd)
dd</code></pre>
<pre><code>## cid rx ceffect nperc id y
## 1: 1 0 0.71732 8 1 0.42
## 2: 1 0 0.71732 8 2 0.90
## 3: 1 0 0.71732 8 3 -1.24
## 4: 1 0 0.71732 8 4 2.37
## 5: 1 0 0.71732 8 5 0.71
## ---
## 476: 60 1 -0.00034 8 476 -1.12
## 477: 60 1 -0.00034 8 477 0.88
## 478: 60 1 -0.00034 8 478 0.47
## 479: 60 1 -0.00034 8 479 0.28
## 480: 60 1 -0.00034 8 480 -0.54</code></pre>
<p>We use a linear mixed effect model to estimate the treatment effect and variation across clusters:</p>
<pre class="r"><code>library(lmerTest)
lmerfit <- lmer(y~rx + (1 | cid), data = dd)</code></pre>
<p>Here are the estimates of the random and fixed effects:</p>
<pre class="r"><code>as.data.table(VarCorr(lmerfit))</code></pre>
<pre><code>## grp var1 var2 vcov sdcor
## 1: cid (Intercept) <NA> 0.14 0.38
## 2: Residual <NA> <NA> 0.78 0.88</code></pre>
<pre class="r"><code>coef(summary(lmerfit))</code></pre>
<pre><code>## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.008 0.089 58 0.09 0.929
## rx 0.322 0.126 58 2.54 0.014</code></pre>
<p>And here is the estimated ICC, which happens to be close to the “true” ICC of 0.10 (which is definitely not a sure thing given the relatively small sample size):</p>
<pre class="r"><code>library(sjstats)
icc(lmerfit)</code></pre>
<pre><code>##
## Intraclass Correlation Coefficient for Linear mixed model
##
## Family : gaussian (identity)
## Formula: y ~ rx + (1 | cid)
##
## ICC (cid): 0.1540</code></pre>
</div>
</div>
<div id="a-deeper-look-at-the-variation-of-estimates" class="section level2">
<h2>A deeper look at the variation of estimates</h2>
<p>In these simulations, we are primarily interested in investigating the effect of different numbers of clusters and different cluster sizes on power, variation, bias (and mean square error, which is a combined measure of variance and bias). This means replicating many data sets and studying the distribution of the estimates.</p>
<p>To do this, it is helpful to create a functions that generates the data:</p>
<pre class="r"><code>reps <- function(nclust) {
dc <- genData(nclust, defC)
dc <- trtAssign(dc, 2, grpName = "rx")
dd <- genCluster(dc, "cid", numIndsVar = "nperc", level1ID = "id" )
dd <- addColumns(defI, dd)
lmerTest::lmer(y ~ rx + (1 | cid), data = dd)
}</code></pre>
<p>And here is a function to check if p-values from model estimates are less than 0.05, which will come in handy later when estimating power:</p>
<pre class="r"><code>pval <- function(x) {
coef(summary(x))["rx", "Pr(>|t|)"] < 0.05
}</code></pre>
<p>Now we can generate 1000 data sets and fit a linear fixed effects model to each one, and store the results in an R <em>list</em>:</p>
<pre class="r"><code>library(parallel)
res <- mclapply(1:1000, function(x) reps(60))</code></pre>
<p>Extracting information from all 1000 model fits provides an estimate of power:</p>
<pre class="r"><code>mean(sapply(res, function(x) pval(x)))</code></pre>
<pre><code>## [1] 0.82</code></pre>
<p>And here are estimates of bias, variance, and root mean square error of the treatment effect estimates. We can see in this case, the estimated treatment effect is not particularly biased:</p>
<pre class="r"><code>RX <- sapply(res, function(x) getME(x, "fixef")["rx"])
c(true = 0.35, avg = mean(RX), var = var(RX),
bias = mean(RX - 0.35), rmse = sqrt(mean((RX - 0.35)^2)))</code></pre>
<pre><code>## true avg var bias rmse
## 0.35000 0.35061 0.01489 0.00061 0.12197</code></pre>
<p>And if we are interested in seeing how well we measure the between cluster variation, we can evaluate that as well. The true variance (used to generate the data), was 0.10, and the average of the estimates was 0.099, quite close:</p>
<pre class="r"><code>RE <- sapply(res, function(x) as.numeric(VarCorr(x)))
c(true = Var, avg = mean(RE), var = var(RE),
bias = mean(RE - Var), rmse = sqrt(mean((RE - Var)^2)))</code></pre>
<pre><code>## true avg var bias rmse
## 0.10000 0.10011 0.00160 0.00011 0.03996</code></pre>
</div>
<div id="replications-under-different-scenarios" class="section level2">
<h2>Replications under different scenarios</h2>
<p>Now we are ready to put all of this together for one final experiment to investigate the effects of the ICC and cluster number/size on power, variance, and bias. I generated 2000 data sets for each combination of assumptions about cluster sizes (ranging from 10 to 240) and ICC’s (ranging from 0 to 0.15). For each combination, I estimated the variance and bias for the treatment effect parameter estimates and the between-cluster variance. (I include the code in case any one needs to do something similar.)</p>
<pre class="r"><code>ps <- list()
pn <- 0
nclust <- c(10, 20, 30, 40, 48, 60, 80, 96, 120, 160, 240)
iccs <- c(0, 0.02, 0.05 , 0.10, 0.15)
for (s in seq_along(nclust)) {
for (i in seq_along(iccs)) {
newvar <- iccRE(iccs[i], varWithin = .90, dist = "normal")
newperc <- 480/nclust[s]
defC <- updateDef(defC, "ceffect", newvariance = newvar)
defC <- updateDef(defC, "nperc", newformula = newperc)
res <- mclapply(1:2000, function(x) reps(nclust[s]))
RX <- sapply(res, function(x) getME(x, "fixef")["rx"])
RE <- sapply(res, function(x) as.numeric(VarCorr(x)))
power <- mean(sapply(res, function(x) pval(x)))
pn <- pn + 1
ps[[pn]] <- data.table(nclust = nclust[s],
newperc,
icc=iccs[i],
newvar,
power,
biasRX = mean(RX - 0.35),
varRX = var(RX),
rmseRX = sqrt(mean((RX - 0.35)^2)),
avgRE = mean(RE),
biasRE = mean(RE - newvar),
varRE = var(RE),
rmseRE = sqrt(mean((RE - newvar)^2))
)
}
}
ps <- data.table::rbindlist(ps)</code></pre>
<p>First, we can take a look at the power. Clearly, for lower ICC’s, there is little marginal gain after a threshold between 60 and 80 clusters; with the higher ICC’s, a study might benefit with respect to power from adding more clusters (and reducing cluster size):</p>
<pre class="r"><code>library(ggthemes) # for Paul Tol's Color Schemes
library(scales)
ggplot(data = ps, aes(x = nclust, y = power, group = icc)) +
geom_smooth(aes(color = factor(icc)), se = FALSE) +
theme(panel.grid.minor = element_blank()) +
scale_color_ptol(name = "ICC", labels = number(iccs, accuracy = .01)) +
scale_x_continuous(name = "number of clusters", breaks = nclust)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
<p>Not surprisingly, the same picture emerges (only in reverse) when looking at the variance of the estimate for treatment effect. Variance declines quite dramatically as we increase the number of clusters (again, reducing cluster size) up to about 60 or so, and little gain in precision beyond that:</p>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>If we are interested in measuring the variation across clusters (which was <span class="math inline">\(\sigma^2_c\)</span> in the model), then a very different picture emerges. First, the plot of RMSE (which is <span class="math inline">\(E[(\hat{\theta} - \theta)^2]^{\frac{1}{2}}\)</span>, where <span class="math inline">\(\theta = \sigma^2_c\)</span>), indicates that after some point, actually increasing the number of clusters after a certain point may be a bad idea.</p>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
<p>The trends of RMSE are mirrored by the variance of <span class="math inline">\(\hat{\sigma^2_c}\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-18-1.png" width="672" /></p>
<p>I show the bias of the variance estimate, because it highlights the point that it is very difficult to get an unbiased estimate of <span class="math inline">\(\sigma^2_c\)</span> when the ICC is low, particularly with a large number of clusters with small cluster sizes. This may not be so surprising, because with small cluster sizes it may be more difficult to estimate the within-cluster variance, an important piece of the total variation.</p>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
</div>
<div id="almost-an-addendum" class="section level2">
<h2>Almost an addendum</h2>
<p>I’ve focused entirely on the direct trade-off between the number of clusters and cluster size, because that was the question raised by the study that motivated this post. However, we may have a fixed number of clusters, and we might want to know if it makes sense to recruit more subjects from each cluster. To get a picture of this, I re-ran the simulations with 60 clusters, by evaluated power and variance of the treatment effect estimator at cluster sizes ranging from 5 to 60.</p>
<p>Under the assumptions used here, it also looks like there is a point after which little can be gained by adding subjects to each cluster (at least in terms of both power and precision of the estimate of the treatment effect):</p>
<p><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-21-1.png" width="672" /><img src="https://www.rdatagen.net/post/2019-04-30-what-matters-more-in-a-cluster-randomized-trial-number-or-size.en_files/figure-html/unnamed-chunk-21-2.png" width="672" /></p>
</div>
Even with randomization, mediation analysis can still be confounded
https://www.rdatagen.net/post/even-with-randomization-mediation-analysis-can-still-be-confounded/
Tue, 16 Apr 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/even-with-randomization-mediation-analysis-can-still-be-confounded/
<p>Randomization is super useful because it usually eliminates the risk that confounding will lead to a biased estimate of a treatment effect. However, this only goes so far. If you are conducting a meditation analysis in the hopes of understanding the underlying causal mechanism of a treatment, it is important to remember that the mediator has <em>not</em> been randomized, only the treatment. This means that the estimated mediation effect <em>is</em> still at risk of being confounded.</p>
<p>I never fail to mention this when a researcher tells me they are interested in doing a mediation analysis (and it seems like more and more folks are interested in including this analysis as part of their studies). So, when my son brought up the fact that the lead investigator on his experimental psychology project wanted to include a mediation analysis, I, of course, had to pipe up. “You have to be careful, you know.”</p>
<p>But, he wasn’t buying it, wondering why randomization didn’t take care of the confounding; surely, the potential confounders would be balanced across treatment groups. Maybe I’d had a little too much wine, as I considered he might have a point. But no - I’d quickly come to my senses - it doesn’t matter that the confounder is balanced across treatment groups (which it very well could be), it would still be unbalanced across the different levels of the mediator, which is what really matters if we are estimating the effect of the mediator.</p>
<p>I proposed to do a simulation of this phenomenon. My son was not impressed, but I went ahead and did it anyways, and I am saving it here in case he wants to take a look. Incidentally, this is effectively a brief follow-up to an <a href="https://www.rdatagen.net/post/causal-mediation/">earlier post</a> on mediation. So, if the way in which I am generating the data seems a bit opaque, you might want to take a <a href="https://www.rdatagen.net/post/causal-mediation/">look</a> at what I did earlier.</p>
<div id="the-data-generating-process" class="section level2">
<h2>The data generating process</h2>
<p>Here is a DAG that succinctly describes how I will generate the data. You can see clearly that <span class="math inline">\(U_2\)</span> is a confounder of the relationship between the mediator <span class="math inline">\(M\)</span> and the outcome <span class="math inline">\(Y\)</span>. (It should be noted that if we were only interested in is the causal effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(Y\)</span>, <span class="math inline">\(U_2\)</span> is <em>not</em> a confounder, so we wouldn’t need to control for <span class="math inline">\(U_2\)</span>.)</p>
<p><img src="https://www.rdatagen.net/img/post-confoundmed/DAGmediation.png" /></p>
<p>As I did in the earlier simulation of mediation, I am simulating the potential outcomes so that we can see the “truth” that we are trying to measure.</p>
<pre class="r"><code>defU <- defData(varname = "U2", formula = 0,
variance = 1.5, dist = "normal")
defI <- defDataAdd(varname = "M0", formula = "-2 + U2",
dist = "binary", link = "logit")
defI <- defDataAdd(defI, varname = "M1", formula = "-1 + U2",
dist = "binary", link = "logit")
defA <- defReadAdd("DataConfoundMediation/mediation def.csv")</code></pre>
<table class="table table-condensed">
<thead>
<tr>
<th style="text-align:right;">
varname
</th>
<th style="text-align:right;">
formula
</th>
<th style="text-align:right;">
variance
</th>
<th style="text-align:right;">
dist
</th>
<th style="text-align:right;">
link
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">normal </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y0M0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">2 + M0*2 + U2 + e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y0M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">2 + M1*2 + U2 + e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">normal </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y1M0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">8 + M0*5 + U2 + e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y1M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">8 + M1*5 + U2 + e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">M </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">(A==0) * M0 + (A==1) * M1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">(A==0) * Y0M0 + (A==1) * Y1M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
</tbody>
</table>
<div id="getting-the-true-causal-effects" class="section level3">
<h3>Getting the “true”" causal effects</h3>
<p>With the definitions set, we can generate a very, very large data set (not infinite, but pretty close) to get at the “true” causal effects that we will try to recover using smaller (finite) data sets. I am calculating the causal mediated effects (for the treated and controls) and the causal direct effects (also for the treated and controls).</p>
<pre class="r"><code>set.seed(184049)
du <- genData(1000000, defU)
dtrue <- addCorFlex(du, defI, rho = 0.6, corstr = "cs")
dtrue <- trtAssign(dtrue, grpName = "A")
dtrue <- addColumns(defA, dtrue)
truth <- round(dtrue[, .(CMEc = mean(Y0M1 - Y0M0), CMEt= mean(Y1M1 - Y1M0),
CDEc = mean(Y1M0 - Y0M0), CDEt= mean(Y1M1 - Y0M1))], 2)
truth</code></pre>
<pre><code>## CMEc CMEt CDEc CDEt
## 1: 0.29 0.72 6.51 6.95</code></pre>
<p>And here we can see that although <span class="math inline">\(U_2\)</span> is balanced across treatment groups <span class="math inline">\(A\)</span>, <span class="math inline">\(U_2\)</span> is still associated with the mediator <span class="math inline">\(M\)</span>:</p>
<pre class="r"><code>dtrue[, mean(U2), keyby = A]</code></pre>
<pre><code>## A V1
## 1: 0 -0.00220
## 2: 1 -0.00326</code></pre>
<pre class="r"><code>dtrue[, mean(U2), keyby = M]</code></pre>
<pre><code>## M V1
## 1: 0 -0.287
## 2: 1 0.884</code></pre>
<p>Also - since <span class="math inline">\(U_2\)</span> is a confounder, we would expect it to be associated with the outcome <span class="math inline">\(Y\)</span>, which it is:</p>
<pre class="r"><code>dtrue[, cor(U2, Y)]</code></pre>
<pre><code>## [1] 0.42</code></pre>
</div>
<div id="recovering-the-estimate-from-a-small-data-set" class="section level3">
<h3>Recovering the estimate from a small data set</h3>
<p>We generate a smaller data set using the same process:</p>
<pre class="r"><code>du <- genData(1000, defU)
dd <- addCorFlex(du, defI, rho = 0.6, corstr = "cs")
dd <- trtAssign(dd, grpName = "A")
dd <- addColumns(defA,dd)</code></pre>
<p>We can estimate the causal effects using the <code>mediation</code> package, by specifying a “mediation” model and an “outcome model”. I am going to compare two approaches, one that controls for <span class="math inline">\(U_2\)</span> in both models, and a second that ignores the confounder in both.</p>
<pre class="r"><code>library(mediation)
### models that control for confounder
med.fitc <- glm(M ~ A + U2, data = dd, family = binomial("logit"))
out.fitc <- lm(Y ~ M*A + U2, data = dd)
med.outc <- mediate(med.fitc, out.fitc, treat = "A", mediator = "M",
robustSE = TRUE, sims = 500)
### models that ignore confounder
med.fitx <- glm(M ~ A, data = dd, family = binomial("logit"))
out.fitx <- lm(Y ~ M*A, data = dd)
med.outx <- mediate(med.fitx, out.fitx, treat = "A", mediator = "M",
robustSE = TRUE, sims = 500)</code></pre>
<p>It appears that the approach that adjusts for <span class="math inline">\(U_2\)</span> (middle row) provides a set of estimates closer to the truth (top row) than the approach that ignores <span class="math inline">\(U_2\)</span> (bottom row):</p>
<pre class="r"><code>dres <- rbind(
truth,
data.table(CMEc = med.outc$d0, CMEt = med.outc$d1,
CDEc = med.outc$z0, CDEt = med.outc$z1) ,
data.table(CMEc = med.outx$d0, CMEt = med.outx$d1,
CDEc = med.outx$z0, CDEt = med.outx$z1)
)
round(dres,2)</code></pre>
<pre><code>## CMEc CMEt CDEc CDEt
## 1: 0.29 0.72 6.51 6.95
## 2: 0.32 0.84 6.51 7.03
## 3: 0.53 1.07 6.32 6.85</code></pre>
<p>Of course, it is not prudent to draw conclusions from a single simulation. So, I generated 1000 data sets and recorded all the results. A visual summary of the results shows that the approach that ignores <span class="math inline">\(U_2\)</span> is biased with respect to the four causal effects, whereas including <span class="math inline">\(U_2\)</span> in the analysis yields unbiased estimates. In the plot, the averages of the estimates are the black points, the segments represent <span class="math inline">\(\pm \ 2 \ sd\)</span>, and the blue vertical lines represent the truth:</p>
<p><img src="https://www.rdatagen.net/img/post-confoundmed/estMediation.png" /></p>
<p>Almost as an addendum, using the almost infinitely large “true” data set, we can see that the total treatment effect of <span class="math inline">\(A\)</span> can be estimated from observed data <em>ignoring</em> <span class="math inline">\(U_2\)</span>, because as we saw earlier, <span class="math inline">\(U_2\)</span> is indeed balanced across both levels of <span class="math inline">\(A\)</span> due to randomization:</p>
<pre class="r"><code>c( est = coef(lm(Y ~ A, data = dtrue))["A"],
truth = round(dtrue[, .(TotalEff = mean(Y1M1 - Y0M0))], 2))</code></pre>
<pre><code>## $est.A
## [1] 7.24
##
## $truth.TotalEff
## [1] 7.24</code></pre>
</div>
</div>
Musings on missing data
https://www.rdatagen.net/post/musings-on-missing-data/
Tue, 02 Apr 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/musings-on-missing-data/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I’ve been meaning to share an analysis I recently did to estimate the strength of the relationship between a young child’s ability to recognize emotions in others (e.g. teachers and fellow students) and her longer term academic success. The study itself is quite interesting (hopefully it will be published sometime soon), but I really wanted to write about it here as it involved the challenging problem of missing data in the context of heterogeneous effects (different across sub-groups) and clustering (by schools).</p>
<p>As I started to develop simulations to highlight key issues, I found myself getting bogged down in the data generation process. Once I realized I needed to be systematic about thinking how to generate various types of missingness, I thought maybe DAGs would help to clarify some of the issues (I’ve written a bit about DAGS <a href="https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/">before</a> and provided some links to some good references). I figured that I probably wasn’t the first to think of this, and a quick search confirmed that there is indeed a pretty rich literature on the topic. I first found this <a href="http://jakewestfall.org/blog/index.php/2017/08/22/using-causal-graphs-to-understand-missingness-and-how-to-deal-with-it/">blog post</a> by Jake Westfall, which, in addition to describing many of the key issues that I want to address here, provides some excellent references, including this paper by <a href="https://journals.sagepub.com/doi/pdf/10.1177/0962280210394469"><em>Daniel et al</em></a> and this one by <a href="http://papers.nips.cc/paper/4899-graphical-models-for-inference-with-missing-data.pdf"><em>Mohan et al</em></a>.</p>
<p>I think the value I can add here is to provide some basic code to get the data generation processes going, in case you want to explore missing data methods for yourself.</p>
<div id="thinking-systematically-about-missingness" class="section level2">
<h2>Thinking systematically about missingness</h2>
<p>In the world of missing data, it has proved to be immensely useful to classify different types of missing data. That is, there could various explanations of how the missingness came to be in a particular data set. This is important, because as in any other modeling problem, having an idea about the data generation process (in this case the missingness generation process) informs how you should proceed to get the “best” estimate possible using the data at hand.</p>
<p>Missingness can be recorded as a binary characteristic of a particular data point for a particular individual; the data point is missing or it is not. It seems to be the convention that the missingness indicator is <span class="math inline">\(R_{p}\)</span> (where <span class="math inline">\(p\)</span> is the variable), and <span class="math inline">\(R_{p} = 1\)</span> if the data point <span class="math inline">\(p\)</span> is missing and is <span class="math inline">\(0\)</span> otherwise.</p>
<p>We say data are <em>missing completely at random</em> (MCAR) when <span class="math inline">\(P(R)\)</span> is independent of all data, observed and missing. For example, if missingness depends on the flip of a coin, the data would be MCAR. Data are <em>missing at random</em> when <span class="math inline">\(P(R \ | \ D_{obs})\)</span> is independent of <span class="math inline">\(D_{mis}\)</span>, the missing data. In this case, if older people tend to have more missing data, and we’ve recorded age, then the data are MAR. And finally, data are <em>missing not at random</em> (MNAR) when <span class="math inline">\(P(R \ | \ D_{obs}) = f(D_{mis})\)</span>, or missingness is related to the unobserved data even after conditioning on observed data. If missingness is related to the health of a person at follow-up and the outcome measurement reflects the health of a person, then the data are MNAR.</p>
</div>
<div id="the-missingness-taxonomy-in-3-dags" class="section level2">
<h2>The missingness taxonomy in 3 DAGs</h2>
<p>The <a href="http://papers.nips.cc/paper/4899-graphical-models-for-inference-with-missing-data.pdf"><em>Mohan et al</em></a> paper suggests including the missing indicator <span class="math inline">\(R_p\)</span> directly in the DAG to clarify the nature of dependence between the variables and the missingness. If we have missingness in the outcome <span class="math inline">\(Y\)</span> (so that for at least one individual <span class="math inline">\(R_y = 1\)</span>), there is an induced observed variable <span class="math inline">\(Y^*\)</span> that equals <span class="math inline">\(Y\)</span> if <span class="math inline">\(R_y = 0\)</span>, and is missing if <span class="math inline">\(R_y = 1\)</span>. <span class="math inline">\(Y\)</span> represents the complete outcome data, which don’t observe if there is any missingness. The question is, can we estimate the joint distribution <span class="math inline">\(P(A, Y)\)</span> (or really any characteristic of the distribution, such as the mean of <span class="math inline">\(Y\)</span> at different levels of <span class="math inline">\(A\)</span>, which would give us a measure of causal effect) using the observed data <span class="math inline">\((A, R_y, Y^*)\)</span>? (For much of what follows, I am drawing directly from the <em>Mohan et al</em> paper.)</p>
<div id="mcar" class="section level3">
<h3>MCAR</h3>
<p><img src="https://www.rdatagen.net/img/post-missing/MCAR.png" /></p>
<p>First, consider when the missingness is MCAR, as depicted above. From the DAG, <span class="math inline">\(A \cup Y \perp \! \! \! \perp R_y\)</span>, since <span class="math inline">\(Y^*\)</span> is a “collider”. It follows that <span class="math inline">\(P(A, Y) = P(A, Y \ | \ R_y)\)</span>, or more specifically <span class="math inline">\(P(A, Y) = P(A, Y \ | \ R_y=0)\)</span>. And when <span class="math inline">\(R_y = 0\)</span>, by definition <span class="math inline">\(Y = Y^*\)</span>. So we end up with <span class="math inline">\(P(A, Y) = P(A, Y^* \ | \ R_y = 0)\)</span>. Using observed data only, we can “recover” the underlying relationship between <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span>.</p>
<p>A simulation my help to see this. First, we use the <code>simstudy</code> functions to define both the data generation and missing data processes:</p>
<pre class="r"><code>def <- defData(varname = "a", formula = 0, variance = 1, dist = "normal")
def <- defData(def, "y", formula = "1*a", variance = 1, dist = "normal")
defM <- defMiss(varname = "y", formula = 0.2, logit.link = FALSE)</code></pre>
<p>The complete data are generated first, followed by the missing data matrix, and ending with the observed data set.</p>
<pre class="r"><code>set.seed(983987)
dcomp <- genData(1000, def)
dmiss <- genMiss(dcomp, defM, idvars = "id")
dobs <- genObs(dcomp, dmiss, "id")
head(dobs)</code></pre>
<pre><code>## id a y
## 1: 1 0.171 0.84
## 2: 2 -0.882 0.37
## 3: 3 0.362 NA
## 4: 4 1.951 1.62
## 5: 5 0.069 -0.18
## 6: 6 -2.423 -1.29</code></pre>
<p>In this replication, about 22% of the <span class="math inline">\(Y\)</span> values are missing:</p>
<pre class="r"><code>dmiss[, mean(y)]</code></pre>
<pre><code>## [1] 0.22</code></pre>
<p>If <span class="math inline">\(P(A, Y) = P(A, Y^* \ | \ R_y = 0)\)</span>, then we would expect that the mean of <span class="math inline">\(Y\)</span> in the complete data set will equal the mean of <span class="math inline">\(Y^*\)</span> in the observed data set. And indeed, they appear quite close:</p>
<pre class="r"><code>round(c(dcomp[, mean(y)], dobs[, mean(y, na.rm = TRUE)]), 2)</code></pre>
<pre><code>## [1] 0.03 0.02</code></pre>
<p>Going beyond the mean, we can characterize the joint distribution of <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span> using a linear model (which we know is true, since that is how we generated the data). Since the outcome data are missing completely at random, we would expect that the relationship between <span class="math inline">\(A\)</span> and <span class="math inline">\(Y^*\)</span> to be very close to the true relationship represented by the complete (and not fully observed) data.</p>
<pre class="r"><code>fit.comp <- lm(y ~ a, data = dcomp)
fit.obs <- lm(y ~ a, data = dobs)
broom::tidy(fit.comp)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.00453 0.0314 -0.144 8.85e- 1
## 2 a 0.964 0.0313 30.9 2.62e-147</code></pre>
<pre class="r"><code>broom::tidy(fit.obs)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0343 0.0353 -0.969 3.33e- 1
## 2 a 0.954 0.0348 27.4 4.49e-116</code></pre>
<p>And if we plot those lines over the actual data, they should be quite close, if not overlapping. In the plot below, the red points represent the true values of the missing data. We can see that missingness is scattered randomly across values of <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span> - this is what MCAR data looks like. The solid line represents the fitted regression line based on the full data set (assuming no data are missing) and the dotted line represents the fitted regression line using complete cases only.</p>
<pre class="r"><code>dplot <- cbind(dcomp, y.miss = dmiss$y)
ggplot(data = dplot, aes(x = a, y = y)) +
geom_point(aes(color = factor(y.miss)), size = 1) +
scale_color_manual(values = c("grey60", "#e67c7c")) +
geom_abline(intercept = coef(fit.comp)[1],
slope = coef(fit.comp)[2]) +
geom_abline(intercept = coef(fit.obs)[1],
slope = coef(fit.obs)[2], lty = 2) +
theme(legend.position = "none",
panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-04-02-musings-on-missing-data.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="mar" class="section level3">
<h3>MAR</h3>
<p><img src="https://www.rdatagen.net/img/post-missing/MAR.png" /></p>
<p>This DAG is showing a MAR pattern, where <span class="math inline">\(Y \perp \! \! \! \perp R_y \ | \ A\)</span>, again because <span class="math inline">\(Y^*\)</span> is a collider. This means that <span class="math inline">\(P(Y | A) = P(Y | A, R_y)\)</span>. If we decompose <span class="math inline">\(P(A, Y) = P(Y | A)P(A)\)</span>, you can see how that independence is useful. Substituting <span class="math inline">\(P(Y | A, R_y)\)</span> for <span class="math inline">\(P(Y | A)\)</span> , <span class="math inline">\(P(A, Y) = P(Y | A, R_y)P(A)\)</span>. Going further, <span class="math inline">\(P(A, Y) = P(Y | A, R_y=0)P(A)\)</span>, which is equal to <span class="math inline">\(P(Y^* | A, R_y=0)P(A)\)</span>. Everything in this last decomposition is observable - <span class="math inline">\(P(A)\)</span> from the full data set and <span class="math inline">\(P(Y^* | A, R_y=0)\)</span> from the records with observed <span class="math inline">\(Y\)</span>’s only.</p>
<p>This implies that, conceptually at least, we can estimate the conditional probability distribution of observed-only <span class="math inline">\(Y\)</span>’s for each level of <span class="math inline">\(A\)</span>, and then pool the distributions across the fully observed distribution of <span class="math inline">\(A\)</span>. That is, under an assumption of data MAR, we can recover the joint distribution of the full data using observed data only.</p>
<p>To simulate, we keep the data generation process the same as under MCAR; the only thing that changes is the missingness generation process. <span class="math inline">\(P(R_y)\)</span> now depends on <span class="math inline">\(A\)</span>:</p>
<pre class="r"><code>defM <- defMiss(varname = "y", formula = "-2 + 1.5*a", logit.link = TRUE)</code></pre>
<p>After generating the data as before, the proportion of missingness is unchanged (though the pattern of missingness certainly is):</p>
<pre class="r"><code>dmiss[, mean(y)]</code></pre>
<pre><code>## [1] 0.22</code></pre>
<p>We do not expect the marginal distribution of <span class="math inline">\(Y\)</span> and <span class="math inline">\(Y^*\)</span> to be the same (only the distributions conditional on <span class="math inline">\(A\)</span> are close), so the means should be different:</p>
<pre class="r"><code>round(c(dcomp[, mean(y)], dobs[, mean(y, na.rm = TRUE)]), 2)</code></pre>
<pre><code>## [1] 0.03 -0.22</code></pre>
<p>However, since the conditional distribution of <span class="math inline">\((Y|A)\)</span> is equivalent to <span class="math inline">\((Y^*|A, R_y = 0)\)</span>, we would expect estimates from a regression model of <span class="math inline">\(E[Y] = \beta_0 + \beta_1A)\)</span> would yield estimates very close to <span class="math inline">\(E[Y^*] = \beta_0^{*} + \beta_1^{*}A\)</span>. That is, we would expect <span class="math inline">\(\beta_1^{*} \approx \beta_1\)</span>.</p>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.00453 0.0314 -0.144 8.85e- 1
## 2 a 0.964 0.0313 30.9 2.62e-147</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00756 0.0369 0.205 8.37e- 1
## 2 a 0.980 0.0410 23.9 3.57e-95</code></pre>
<p>The overlapping lines in the plot confirm the close model estimates. In addition, you can see here that missingness is associated with higher values of <span class="math inline">\(A\)</span>.</p>
<p><img src="https://www.rdatagen.net/post/2019-04-02-musings-on-missing-data.en_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
</div>
<div id="mnar" class="section level3">
<h3>MNAR</h3>
<p><img src="https://www.rdatagen.net/img/post-missing/MNAR.png" /></p>
<p>In MNAR, there is no way to separate <span class="math inline">\(Y\)</span> from <span class="math inline">\(R_y\)</span>. Reading from the DAG, <span class="math inline">\(P(Y) \neq P(Y^* | R_y)\)</span>, and <span class="math inline">\(P(Y|A) \neq P(Y^* | A, R_y)\)</span>. There is no way to recover the joint probability of <span class="math inline">\(P(A,Y)\)</span> with observed data. <em>Mohan et al</em> do show that under some circumstances, it <em>is</em> possible to use observed data to recover the true distribution under MNAR (particularly when there is missingness related to the exposure measurement <span class="math inline">\(A\)</span>), but not in this particular case.</p>
<p><a href="https://journals.sagepub.com/doi/pdf/10.1177/0962280210394469"><em>Daniel et al</em></a> have a different approach to determine whether the causal relationship of <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span> is identifiable under the different mechanisms. They do not use a variable like <span class="math inline">\(Y^*\)</span>, but introduce external nodes <span class="math inline">\(U_a\)</span> and <span class="math inline">\(U_y\)</span> representing unmeasured variability related to both exposure and outcome (panel <em>a</em> of the diagram below).</p>
<p><img src="https://www.rdatagen.net/img/post-missing/MNAR%20Daniel.png" /></p>
<p>In the case of MNAR, when you use complete cases only, you are effectively controlling for <span class="math inline">\(R_y\)</span> (panel <em>b</em>). Since <span class="math inline">\(Y\)</span> is a collider (and <span class="math inline">\(U_y\)</span> is an ancestor of <span class="math inline">\(Y\)</span>), this has the effect of inducing an association between <span class="math inline">\(A\)</span> and <span class="math inline">\(U_y\)</span>, the common causes of <span class="math inline">\(Y\)</span>. By doing this, we have introduced unmeasured confounding that cannot be corrected, because <span class="math inline">\(U_y\)</span>, by definition, always represents the portion of unmeasured variation of <span class="math inline">\(Y\)</span>.</p>
<p>In the simulation, I explicitly generate <span class="math inline">\(U_y\)</span>, so we can see if we observe this association:</p>
<pre class="r"><code>def <- defData(varname = "a", formula = 0, variance = 1, dist = "normal")
def <- defData(def, "u.y", formula = 0, variance = 1, dist = "normal")
def <- defData(def, "y", formula = "1*a + u.y", dist = "nonrandom")</code></pre>
<p>This time around, we generate missingness of <span class="math inline">\(Y\)</span> as a function of <span class="math inline">\(Y\)</span> itself:</p>
<pre class="r"><code>defM <- defMiss(varname = "y", formula = "-3 + 2*y", logit.link = TRUE)</code></pre>
<p>And this results in just over 20% missingness:</p>
<pre class="r"><code>dmiss[, mean(y)]</code></pre>
<pre><code>## [1] 0.21</code></pre>
<p>Indeed, <span class="math inline">\(A\)</span> and <span class="math inline">\(U_y\)</span> are virtually uncorrelated in the full data set, but are negatively correlated in the cases where <span class="math inline">\(Y\)</span> is not missing, as theory would suggest:</p>
<pre class="r"><code>round(c(dcomp[, cor(a, u.y)], dobs[!is.na(y), cor(a, u.y)]), 2)</code></pre>
<pre><code>## [1] -0.04 -0.23</code></pre>
<p>The plot generated from these data shows diverging regression lines, the divergence a result of the induced unmeasured confounding.</p>
<p><img src="https://www.rdatagen.net/post/2019-04-02-musings-on-missing-data.en_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
<p>In this MNAR example, we see that the missingness is indeed associated with higher values of <span class="math inline">\(Y\)</span>, although the proportion of missingness remains at about 21%, consistent with the earlier simulations.</p>
</div>
</div>
<div id="there-may-be-more-down-the-road" class="section level2">
<h2>There may be more down the road</h2>
<p>I’ll close here, but in the near future, I hope to explore various (slightly more involved) scenarios under which complete case analysis is adequate, or where something like multiple imputation is more useful. Also, I would like to get back to the original motivation for writing about missingness, which was to describe how I went about analyzing the child emotional intelligence data. Both of these will be much easier now that we have the basic tools to think about how missing data can be generated in a systematic way.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Daniel, Rhian M., Michael G. Kenward, Simon N. Cousens, and Bianca L. De Stavola. “Using causal diagrams to guide analysis in missing data problems.” Statistical methods in medical research 21, no. 3 (2012): 243-256.</p>
<p>Mohan, Karthika, Judea Pearl, and Jin Tian. “Graphical models for inference with missing data.” In Advances in neural information processing systems, pp. 1277-1285. 2013.</p>
Westfall, Jake. “Using causal graphs to understand missingness and how to deal with it.” Cookie Scientist (blog). August 22, 2017. Accessed March 25, 2019. <a href="http://jakewestfall.org/blog/" class="uri">http://jakewestfall.org/blog/</a>.
</font></small>
</p>
</div>
A case where prospective matching may limit bias in a randomized trial
https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/
Tue, 12 Mar 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/
<p>Analysis is important, but study design is paramount. I am involved with the Diabetes Research, Education, and Action for Minorities (DREAM) Initiative, which is, among other things, estimating the effect of a group-based therapy program on weight loss for patients who have been identified as pre-diabetic (which means they have elevated HbA1c levels). The original plan was to randomize patients at a clinic to treatment or control, and then follow up with those assigned to the treatment group to see if they wanted to participate. The primary outcome is going to be measured using medical records, so those randomized to control (which basically means nothing special happens to them) will not need to interact with the researchers in any way.</p>
<p>The concern with this design is that only those patients randomized to the intervention arm of the study have an opportunity to make a choice about participating. In fact, in a pilot study, it was quite difficult to recruit some patients, because the group therapy sessions were frequently provided during working hours. So, even if the groups are balanced after randomization with respect to important (and unimportant characteristics) like age, gender, weight, baseline A1c levels, etc., the patients who actually receive the group therapy might look quite different from the patients who receive treatment as usual. The decision to actually participate in group therapy is not randomized, so it is possible (maybe even likely) that the group getting the therapy is older and more at risk for diabetes (which might make them more motivated to get involved) than those in the control group.</p>
<p>One solution is to analyze the outcomes for everyone randomized, regardless of whether or not they participate (as an <em>intent-to-treat</em> analysis). This estimate would answer the question about how effective the therapy would be in a setting where the intervention is made available; this intent-to-treat estimate does not say how effective the therapy is for the patients who actually choose to receive it. To answer this second question, some sort of <em>as-treated</em> analysis could be used. One analytic solution would be to use an instrumental variable approach. (I wrote about non-compliance in a series of posts starting <a href="https://www.rdatagen.net/post/cace-explored/">here</a>.)</p>
<p>However, we decided to address the issue of differential non-participation in the actual design of the study. In particular, we have modified the randomization process with the aim of eliminating any potential bias. The post-hoc IV analysis is essentially a post-hoc matched analysis (it estimates the treatment effect only for the compliers - those randomized to treatment who actually participate in treatment); we hope to construct the groups <em>prospectively</em> to arrive at the same estimate.</p>
<div id="the-matching-strategy" class="section level2">
<h2>The matching strategy</h2>
<p>The idea is quite simple. We will generate a list of patients based on a recent pre-diabetes diagnosis. From that list, we will draw a single individual and then find a match from the remaining individuals. The match will be based on factors that the researchers think might be related to the outcome, such as age, gender, and one or two other relevant baseline measures. (If the number of matching characteristics grows too large, matching may turn out to be difficult.) If no match is found, the first individual is removed from the study. If a match is found, the first individual is assigned to the therapy group, and the second to the control group. Now we repeat the process, drawing another individual from the list (which excludes the first pair and any patients who have been unmatched), and finding a match. The process is repeated until everyone on the list has been matched or placed on the unmatched list.</p>
<p>After the pairs have been created, the research study coordinators reach out to the individuals who have been randomized to the therapy group in an effort to recruit participants. If a patient declines, she and her matched pair are removed from the study (i.e. their outcomes will not be included in the final analysis). The researchers will work their way down the list until enough people have been found to participate.</p>
<p>We try to eliminate the bias due to differential dropout by removing the matched patient every time a patient randomized to therapy declines to participate. We are making a key assumption here: the matched patient of someone who agrees to participate would have also agreed to participate. We are also assuming that the matching criteria are sufficient to predict participation. While we will not completely remove bias, it may be the best we can do given the baseline information we have about the patients. It would be ideal if we could ask both members of the pair if they would be willing to participate, and remove them both if one declines. However, in this particular study, this is not feasible.</p>
</div>
<div id="the-matching-algorithm" class="section level2">
<h2>The matching algorithm</h2>
<p>I implemented this algorithm on a sample data set that includes gender, age, and BMI, the three characteristics we want to match. The data is read directly into an <code>R</code> data.table <code>dsamp</code>. I’ve printed the first six rows:</p>
<pre class="r"><code>dsamp <- fread("DataMatchBias/eligList.csv")
setkey(dsamp, ID)
dsamp[1:6]</code></pre>
<pre><code>## ID female age BMI
## 1: 1 1 24 27.14
## 2: 2 0 29 31.98
## 3: 3 0 47 25.28
## 4: 4 0 40 24.27
## 5: 5 1 29 30.61
## 6: 6 1 38 25.69</code></pre>
<p>The loop below selects a single record from dsamp and searches for a match. If a match is found, the selected record is added to <code>drand</code> (randomized to therapy) and the match is added to <code>dcntl</code>. If no match is found, the single record is added to <code>dused</code>, and nothing is added to <code>drand</code> or <code>dcntl</code>. Anytime a record is added to any of the three data tables, it is removed from <code>dsamp</code>. This process continues until <code>dsamp</code> has one or no records remaining.</p>
<p>The actual matching is done by a call to function <code>Match</code> from the <code>Matching</code> package. This function is typically used to match a group of exposed to unexposed (or treated to untreated) individuals, often using a propensity score. In this case, we are matching simultaneously on the three columns in <code>dsamp</code>. Ideally, we would want to have exact matches, but this is unrealistic for continuous measures. So, for age and BMI, we set the matching range to be 0.5 standard deviations. (We do match exactly on gender.)</p>
<pre class="r"><code>library(Matching)
set.seed(3532)
dsamp[, rx := 0]
dused <- NULL
drand <- NULL
dcntl <- NULL
while (nrow(dsamp) > 1) {
selectRow <- sample(1:nrow(dsamp), 1)
dsamp[selectRow, rx := 1]
myTr <- dsamp[, rx]
myX <- as.matrix(dsamp[, .(female, age, BMI)])
match.dt <- Match(Tr = myTr, X = myX,
caliper = c(0, 0.50, .50), ties = FALSE)
if (length(match.dt) == 1) { # no match
dused <- rbind(dused, dsamp[selectRow])
dsamp <- dsamp[-selectRow, ]
} else { # match
trt <- match.dt$index.treated
ctl <- match.dt$index.control
drand <- rbind(drand, dsamp[trt])
dcntl <- rbind(dcntl, dsamp[ctl])
dsamp <- dsamp[-c(trt, ctl)]
}
}</code></pre>
</div>
<div id="matching-results" class="section level2">
<h2>Matching results</h2>
<p>Here is a plot of all the pairs that were generated (connected by the blue segment), and includes the individuals without a match (red circles). We could get shorter line segments if we reduced the caliper values, but we would certainly increase the number of unmatched patients.</p>
<p><img src="https://www.rdatagen.net/post/2019-03-12-a-case-where-prospecitve-matching-may-limit-bias.en_files/figure-html/unnamed-chunk-3-1.png" width="960" /></p>
<p>The distributions of the matching variables (or least the means and standard deviations) appear quite close, as we can see by looking at the males and females separately.</p>
<div id="males" class="section level5">
<h5>Males</h5>
<pre><code>## rx N mu.age sd.age mu.bmi sd.bmi
## 1: 0 77 44.8 12.4 28.6 3.65
## 2: 1 77 44.6 12.4 28.6 3.71</code></pre>
</div>
<div id="females" class="section level5">
<h5>Females</h5>
<pre><code>## rx N mu.age sd.age mu.bmi sd.bmi
## 1: 0 94 47.8 11.1 29.7 4.63
## 2: 1 94 47.8 11.3 29.7 4.55</code></pre>
</div>
</div>
<div id="incorporating-the-design-into-the-analysis-plan" class="section level2">
<h2>Incorporating the design into the analysis plan</h2>
<p>The study - which is formally named <em>Integrated Community-Clinical Linkage Model to Promote Weight Loss among South Asians with Pre-Diabetes</em> - is still in its early stages, so no outcomes have been collected. But when it comes time to analyzing the results, the models used to estimate the effect of the intervention will have to take into consideration two important design factors: (1) the fact that the individuals in the treatment and control groups are not independent, because they were assigned to their respective groups in pairs, and (2) the fact that the individuals in the treatment groups will not be independent of each other, since the intervention is group-based, so this a partially cluster randomized trial. In a future post, I will explore this model in a bit more detail.</p>
<p>
<small><font color="darkkhaki">This study is supported by the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases R01DK11048. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</font></small>
</p>
</div>
A example in causal inference designed to frustrate: an estimate pretty much guaranteed to be biased
https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/
Tue, 26 Feb 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/
<p>I am putting together a brief lecture introducing causal inference for graduate students studying biostatistics. As part of this lecture, I thought it would be helpful to spend a little time describing directed acyclic graphs (DAGs), since they are an extremely helpful tool for communicating assumptions about the causal relationships underlying a researcher’s data.</p>
<p>The strength of DAGs is that they help us think how these underlying relationships in the data might lead to biases in causal effect estimation, and suggest ways to estimate causal effects that eliminate these biases. (For a real introduction to DAGs, you could take a look at this <a href="http://ftp.cs.ucla.edu/pub/stat_ser/r251.pdf">paper</a> by <em>Greenland</em>, <em>Pearl</em>, and <em>Robins</em> or better yet take a look at Part I of this <a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/2015/">book</a> on causal inference by <em>Hernán</em> and <em>Robins</em>.)</p>
<p>As part of this lecture, I plan on including a (frustrating) example that illustrates a scenario where it may in fact be impossible to get an unbiased estimate of the causal effect of interest based on the data that has been collected. I thought I would share this little example here.</p>
<div id="the-scenario" class="section level2">
<h2>The scenario</h2>
<p>In the graph below we are interested in the causal effect of <span class="math inline">\(A\)</span> on an outcome <span class="math inline">\(Y\)</span>. We have also measured a covariate <span class="math inline">\(L\)</span>, thinking it might be related to some unmeasured confounder (in this case <span class="math inline">\(U_2\)</span>). Furthermore, there is another unmeasured variable <span class="math inline">\(U_1\)</span> unrelated to <span class="math inline">\(A\)</span>, but related to the measure <span class="math inline">\(L\)</span> and outcome <span class="math inline">\(Y\)</span>. These relationships are captured in this DAG:</p>
<p><img src="https://www.rdatagen.net/img/post-dag/firstDAG.png" /></p>
<p>It may help to be a bit more concrete about what these variables might represent. Say we are conducting an epidemiological study focused on whether or not exercise between the age of 50 and 60 has an effect on hypertension after 60. (So, <span class="math inline">\(A\)</span> is exercise and <span class="math inline">\(Y\)</span> is a measure of hypertension.) We are concerned that there might be confounding by some latent (unmeasured) factor related to an individual’s conscientiousness about their health; those who are more conscientious may exercise more, but they will also do other things to improve their health. In this case, we are able to measure whether or not the individual has a healthy diet (<span class="math inline">\(L\)</span>), and we hope that will address the issue of confounding. (Note we are making the assumption that conscientiousness is related to hypertension only through exercise or diet, probably not very realistic.)</p>
<p>But, it also turns out that an individual’s diet is also partly determined by where the individual lives; that is, characteristics of the area may play a role. Unfortunately, the location of the individual (or characteristics of the location) was not measured (<span class="math inline">\(U_1\)</span>). These same characteristics also affect location-specific hypertension levels.</p>
<p>Inspecting the original DAG, we see that <span class="math inline">\(U_2\)</span> is indeed confounding the relationship between <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span>. There is a back-door path <span class="math inline">\(A \rightarrow U_2 \rightarrow L \rightarrow Y\)</span> that needs to be blocked. We cannot just ignore this path. If we generate data and estimate the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(Y\)</span>, we will see that the estimate is quite biased. First, we generate data based on the DAG, assuming <span class="math inline">\(L\)</span>, and <span class="math inline">\(A\)</span> are binary, and <span class="math inline">\(Y\)</span> is continuous (though this is by no means necessary):</p>
<pre class="r"><code>d <- defData(varname = "U1", formula = 0.5,
dist = "binary")
d <- defData(d, varname = "U2", formula = 0.4,
dist = "binary")
d <- defData(d, varname = "L", formula = "-1.6 + 1 * U1 + 1 * U2",
dist = "binary", link = "logit")
d <- defData(d, varname = "A", formula = "-1.5 + 1.2 * U2",
dist = "binary", link="logit")
d <- defData(d, varname = "Y", formula = "0 + 1 * U1 + 1 * L + 0.5 * A",
variance = .5, dist = "normal")
set.seed(20190226)
dd <- genData(2500, d)
dd</code></pre>
<pre><code>## id U1 U2 L A Y
## 1: 1 0 1 1 1 1.13
## 2: 2 0 0 1 0 1.31
## 3: 3 1 0 0 0 1.20
## 4: 4 0 1 1 0 1.04
## 5: 5 0 0 0 0 -0.67
## ---
## 2496: 2496 0 0 0 0 0.29
## 2497: 2497 0 0 0 0 -0.24
## 2498: 2498 1 0 1 0 1.32
## 2499: 2499 1 1 1 1 3.44
## 2500: 2500 0 0 0 0 -0.78</code></pre>
<p>And here is the unadjusted model. The effect of <span class="math inline">\(A\)</span> is overestimated (the true effect is 0.5):</p>
<pre class="r"><code>broom::tidy(lm(Y ~ A, data = dd))</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.826 0.0243 34.0 2.54e-208
## 2 A 0.570 0.0473 12.0 1.53e- 32</code></pre>
</div>
<div id="adjusting-for-a-potential-confounder-that-is-also-a-collider" class="section level2">
<h2>Adjusting for a potential confounder that is also a collider</h2>
<p>While we are not able to measure <span class="math inline">\(U_2\)</span>, we have observed <span class="math inline">\(L\)</span>. We might think we are OK. But, alas, we are not. If we control for diet (<span class="math inline">\(L\)</span>), we are controlling a “collider”, which will open up an association between <span class="math inline">\(U_1\)</span> and <span class="math inline">\(U_2\)</span>. (I wrote about this before <a href="https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/">here</a>.)</p>
<p><img src="https://www.rdatagen.net/img/post-dag/firstDAGcontrol1.png" /></p>
<p>The idea is that if I have a healthy diet but I am not particularly conscientious about my health, I probably live in an area encourages or provides access to better food. Therefore, conditioning on diet induces a (negative, in this case) correlation between location type and health conscientiousness. So, by controlling <span class="math inline">\(L\)</span> we’ve created a back-door path <span class="math inline">\(A \rightarrow U_2 \rightarrow U_1 \rightarrow Y\)</span>. Confounding remains, though it may be reduced considerably if the induced link between <span class="math inline">\(U_2\)</span> and <span class="math inline">\(U_1\)</span> is relatively weak.</p>
<pre class="r"><code>broom::tidy(lm(Y ~ L+ A, data = dd))</code></pre>
<pre><code>## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.402 0.0231 17.4 6.58e- 64
## 2 L 1.26 0.0356 35.6 2.10e-224
## 3 A 0.464 0.0386 12.0 2.46e- 32</code></pre>
</div>
<div id="more-systematic-exploration-of-bias-and-variance-of-estimates" class="section level2">
<h2>More systematic exploration of bias and variance of estimates</h2>
<p>If we repeatedly generate samples (this time of size 500), we get a much better picture of the consequences of using different models to estimate the causal effect. The function below generates the data (using the same definitions as before), and then estimating three different models: (1) no adjustment, (2) incorrect adjustment for <span class="math inline">\(L\)</span>, the confounder/collider, and (3) the correct adjustment of the unmeasured confounder <span class="math inline">\(U_2\)</span>, which should be unbiased. The function returns the three estimates of the causal effect of <span class="math inline">\(A\)</span>:</p>
<pre class="r"><code>repFunc <- function(n, def) {
dd <- genData(n, def)
c1 <- coef(lm(Y ~ A, data = dd))["A"]
c2 <- coef(lm(Y ~ L + A, data = dd))["A"]
c3 <- coef(lm(Y ~ U2 + A, data = dd))["A"]
return(data.table(c1, c2, c3))
}</code></pre>
<p>This following code generates 2500 replications of the “experiment” and stores the final results in data.table <code>rdd</code>:</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG") # to set seed for parallel process
reps <- parallel::mclapply(1:2500,
function(x) repFunc(500, d),
mc.set.seed = TRUE)
rdd <- rbindlist(reps)
rdd[, rep := .I]
rdd</code></pre>
<pre><code>## c1 c2 c3 rep
## 1: 0.46 0.45 0.40 1
## 2: 0.56 0.45 0.41 2
## 3: 0.59 0.46 0.50 3
## 4: 0.74 0.68 0.61 4
## 5: 0.45 0.43 0.41 5
## ---
## 2496: 0.42 0.42 0.37 2496
## 2497: 0.57 0.54 0.53 2497
## 2498: 0.56 0.49 0.51 2498
## 2499: 0.53 0.45 0.43 2499
## 2500: 0.73 0.63 0.69 2500</code></pre>
<pre class="r"><code>rdd[, .(mean(c1 - 0.5), mean(c2 - 0.5), mean(c3-0.5))]</code></pre>
<pre><code>## V1 V2 V3
## 1: 0.062 -0.015 -0.0016</code></pre>
<pre class="r"><code>rdd[, .(var(c1), var(c2), var(c3))]</code></pre>
<pre><code>## V1 V2 V3
## 1: 0.011 0.0074 0.012</code></pre>
<p>As expected, the first two models are biased, whereas the third is not. Under these parameter and distribution assumptions, the variance of the causal effect estimate is larger for the unbiased estimate than for the model that incorrectly adjusts for diet (<span class="math inline">\(L\)</span>). So, we seem to have a bias/variance trade-off. In other cases, where we have binary outcome <span class="math inline">\(Y\)</span> or continuous exposures, this trade-off may be more or less extreme.</p>
<p>Here, we end with a look at the estimates, with the dashed line indicated at the true causal effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(Y\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2019-02-26-dags-colliders-and-an-example-of-variance-bias-tradeoff.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
</div>
Using the uniform sum distribution to introduce probability
https://www.rdatagen.net/post/a-fun-example-to-explore-probability/
Tue, 05 Feb 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-fun-example-to-explore-probability/
<p>I’ve never taught an intro probability/statistics course. If I ever did, I would certainly want to bring the underlying wonder of the subject to life. I’ve always found it almost magical the way mathematical formulation can be mirrored by computer simulation, the way proof can be guided by observed data generation processes, and the way DGPs can confirm analytic solutions.</p>
<p>I would like to begin such a course with a somewhat unusual but accessible problem that would evoke these themes from the start. The concepts would not necessarily be immediately comprehensible, but rather would pique the interest of the students.</p>
<p>I recently picked up copy of John Allen Paulos’ fun sort-of-memoir <a href="https://www.goodreads.com/book/show/24940376-a-numerate-life"><em>A Numerate Life</em></a>, and he reminded me of an interesting problem that might provide a good starting point for my imaginary course. The problem is great, because it is easy to understand, but challenging enough to raise some interesting issues. In this post, I sketch out a set of simulations and mathematical derivations that would motivate the ideas of marginal and conditional probability distributions.</p>
<div id="the-problem" class="section level3">
<h3>The problem</h3>
<p>Say we make repeated draws of independent uniform variables between 0 and 1, and add them up as we go along. The question is, on average, how many draws do we need to make so that the cumulative sum is greater than 1? We definitely need to make at least 2 draws, but it is possible (though almost certainly not the case) that we won’t get to 1 with even 100 draws. It turns that if we did this experiment over and over, the average number of draws would approach <span class="math inline">\(exp(1) = 2.718\)</span>. Can we prove this and confirm by simulation?</p>
</div>
<div id="re-formulating-the-question-more-formally" class="section level3">
<h3>Re-formulating the question more formally</h3>
<p>If <span class="math inline">\(U_k\)</span> is one draw of a uniform random variable (where <span class="math inline">\(k \in (1, 2, ...)\)</span>), and <span class="math inline">\(N\)</span> represents the number of draws, the probability of <span class="math inline">\(N\)</span> taking on a specific value (say <span class="math inline">\(n\)</span>) can be characterized like this:</p>
<p><span class="math display">\[
\footnotesize{P(N = n) = P\left( \; \sum_{k=1}^{n-1} {U_k} < 1 \; \& \; \sum_{k=1}^{n} {U_k} > 1\right)}
\]</span></p>
<p>Now, we need to understand this a little better to figure out how to figure out how to characterize the distribution of <span class="math inline">\(N\)</span>, which is what we are really interested in.</p>
</div>
<div id="cdf" class="section level3">
<h3>CDF</h3>
<p>This is where I would start to describe the concept of a probability distribution, and start by looking at the cumulative distribution function (CDF) <span class="math inline">\(P \left( \; \sum_{k=1}^n {U_k} < 1 \right)\)</span> for a fixed <span class="math inline">\(n\)</span>. It turns out that the CDF for this distribution (which actually has at least two names - <em>Irwin-Hall distribution</em> or <em>uniform sum distribution</em>) can be estimated with the following equation. (I certainly wouldn’t be bold enough to attempt to derive this formula in an introductory class):</p>
<p><span class="math display">\[
\footnotesize {
P\left(\sum_{k=1}^{n} {U_k} < x \right) = \frac{1}{n!} \sum_{k=0}^{\lfloor x \rfloor} (-1)^k {n \choose k} (x - k)^n
}
\]</span></p>
<p>Our first simulation would confirm that this specification of the CDF is correct. Before doing the simulation, we create an <code>R</code> function to calculate the theoretical cumulative probability for the range of <span class="math inline">\(x\)</span>.</p>
<pre class="r"><code>psumunif <- function(x, n) {
k <- c(0:floor(x))
(1/factorial(n)) * sum( (-1)^k * choose(n, k) * (x - k)^n )
}
ddtheory <- data.table(x = seq(0, 3, by = .1))
ddtheory[, cump := psumunif(x, 3), keyby = x]
ddtheory[x %in% seq(0, 3, by = .5)]</code></pre>
<pre><code>## x cump
## 1: 0.0 0.000
## 2: 0.5 0.021
## 3: 1.0 0.167
## 4: 1.5 0.500
## 5: 2.0 0.833
## 6: 2.5 0.979
## 7: 3.0 1.000</code></pre>
<p>Now, we generate some actual data. In this case, we are assuming three draws for each experiment, and we are “conducting” 200 different experiments. We generate non-correlated uniform data for each experiment:</p>
<pre class="r"><code>library(simstudy)
set.seed(02012019)
dd <- genCorGen(n = 200, nvars = 3, params1 = 0, params2 = 1,
dist = "uniform", rho = 0.0, corstr = "cs", cnames = "u")
dd</code></pre>
<pre><code>## id period u
## 1: 1 0 0.98
## 2: 1 1 0.44
## 3: 1 2 0.58
## 4: 2 0 0.93
## 5: 2 1 0.80
## ---
## 596: 199 1 0.44
## 597: 199 2 0.77
## 598: 200 0 0.21
## 599: 200 1 0.45
## 600: 200 2 0.83</code></pre>
<p>For each experiment, we calculate the sum the of the three draws, so that we have a data set with 200 observations:</p>
<pre class="r"><code>dsum <- dd[, .(x = sum(u)), keyby = id]
dsum</code></pre>
<pre><code>## id x
## 1: 1 2.0
## 2: 2 1.8
## 3: 3 1.6
## 4: 4 1.4
## 5: 5 1.6
## ---
## 196: 196 1.7
## 197: 197 1.3
## 198: 198 1.4
## 199: 199 1.7
## 200: 200 1.5</code></pre>
<p>We can plot the theoretical CDF versus the observed empirical CDF:</p>
<pre class="r"><code>ggplot(dsum, aes(x)) +
stat_ecdf(geom = "step", color = "black") +
geom_line(data=ddtheory, aes(x=x, y = cump), color ="red") +
scale_x_continuous(limits = c(0, 3)) +
ylab("cumulative probability") +
theme(panel.grid.minor = element_blank(),
axis.ticks = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-02-05-a-fun-example-to-explore-probability_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>And here is another pair of curves using a set of experiments with only two draws for each experiment:</p>
<p><img src="https://www.rdatagen.net/post/2019-02-05-a-fun-example-to-explore-probability_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="more-specifically-exploring-psum-1" class="section level3">
<h3>More specifically, exploring P(sum < 1)</h3>
<p>The problem at hand specifically asks us to evaluate <span class="math inline">\(\footnotesize{P\left(\sum_{k=1}^{n} {U_k} < 1 \right)}\)</span>. So this will be our first algebraic manipulation to derive the probability in terms of <span class="math inline">\(n\)</span>, starting with the analytic solution for the <em>CDF</em> I introduced above without derivation:</p>
<p><span class="math display">\[
\footnotesize{
\begin{aligned}
P\left(\sum_{k=1}^{n} {U_k} < 1 \right) &= \frac{1}{n!} \sum_{k=0}^1 (-1)^k {n \choose k} (x - k)^n \\
\\
&= \frac{1}{n!} \left [ (-1)^0 {n \choose 0} (1 - 0)^n + (-1)^1 {n \choose 1} (1 - 1)^n \right] \\
\\
&= \frac{1}{n!} \left [ 1 + 0 \right] \\
\\
&= \frac{1}{n!}
\end{aligned}
}
\]</span></p>
<p>We can look back at the plots to confirm that this solution is matched by the theoretical and empirical CDFs. For <span class="math inline">\(n=3\)</span>, we expect <span class="math inline">\(\footnotesize{P\left(\sum_{k=1}^{3} {U_k} < 1 \right)} = \frac{1}{3!} = 0.167\)</span>. And for <span class="math inline">\(n=2\)</span>, the expected probability is <span class="math inline">\(\frac{1}{2}\)</span>.</p>
</div>
<div id="deriving-pn" class="section level3">
<h3>Deriving <span class="math inline">\(P(N)\)</span></h3>
<p>I think until this point, things would be generally pretty accessible to a group of students thinking about these things for the first time. This next step, deriving <span class="math inline">\(P(N)\)</span> might present more of a challenge, because we have to deal with joint probabilities as well as conditional probabilities. While I don’t do so here, I think in a classroom setting I would delve more into the simulated data to illustrate each type of probability. The joint probability is merely a probability of multiple events occurring simultaneously. And the the conditional probability is a probability of an event for a subset of the data (the subset defined by the group of observations where another - conditional - event actually happened). Once those concepts were explained a bit, I would need a little courage to walk through the derivation. However, I think it would be worth it, because moving through each step highlights an important concept.</p>
<p><span class="math inline">\(N\)</span> is a new random variable that takes on the value <span class="math inline">\(n\)</span> if <span class="math inline">\(\sum_{k=1}^{n-1} {U_k} < 1\)</span> <em>and</em> <span class="math inline">\(\sum_{k=1}^{n} {U_k} > 1\)</span>. That is, the <span class="math inline">\(n{\text{th}}\)</span> value in a sequence uniform random variables <span class="math inline">\(\left [ U(0,1) \right]\)</span> is the threshold where the cumulative sum exceeds 1. <span class="math inline">\(P(N=n)\)</span> can be derived as follows:</p>
<p><span class="math display">\[
\footnotesize {
\begin{aligned}
P(N = n) &= P\left( \; \sum_{k=1}^{n-1} {U_k} < 1 \; \& \; \sum_{k=1}^{n} {U_k} > 1\right) \\
\\
&= P\left( \; \sum_{k=1}^{n} {U_k} > 1 \; \middle | \;\sum_{k=1}^{n-1} {U_k} < 1 \right) P\left( \; \sum_{k=1}^{n-1} {U_k} < 1 \right) \\
\\
&= \frac{1}{(n-1)!}P\left( \; \sum_{k=1}^{n} {U_k} > 1 \; \middle | \;\sum_{k=1}^{n-1} {U_k} < 1 \right) \\
\\
&= \frac{1}{(n-1)!}\left [ 1 - P\left( \; \sum_{k=1}^{n} {U_k} < 1 \; \middle | \;\sum_{k=1}^{n-1} {U_k} < 1 \right) \right ] \\
\\
&= \frac{1}{(n-1)!}\left [ 1 - \frac{P\left( \; \sum_{k=1}^{n} {U_k} < 1 \; \& \;\sum_{k=1}^{n-1} {U_k} < 1 \right)}{P\left( \; \sum_{k=1}^{n-1} {U_k} < 1 \right)} \right ] \\
\\
&= \frac{1}{(n-1)!}\left [ 1 - \frac{P\left( \; \sum_{k=1}^{n} {U_k} < 1 \; \right)}{P\left( \; \sum_{k=1}^{n-1} {U_k} < 1 \right)} \right ] \\
\\
&= \frac{1}{(n-1)!}\left [ 1 - \frac{1/n!}{1/(n-1)!} \right ]
\end{aligned}
}
\]</span></p>
<p>Now, once we get to this point, it is just algebraic manipulation to get to the final formulation. It is a pet peeve of mine when papers say that it is quite easily shown that some formula can be simplified into another without showing it; sometimes, it is not so simple. In this case, however, I actually think it is. So, here is a final solution of the probability:</p>
<p><span class="math display">\[
P(N = n) = \frac{n-1}{n!}
\]</span></p>
</div>
<div id="simulating-the-distribution-of-n" class="section level3">
<h3>Simulating the distribution of <span class="math inline">\(N\)</span></h3>
<p>We are almost there. To simulate <span class="math inline">\(P(N=n)\)</span>, we generate 1000 iterations of 7 draws. For each iteration, we check to see which draw pushes the cumulative sum over 1, and this is the observed value of <span class="math inline">\(N\)</span> for each iteration. Even though <span class="math inline">\(N\)</span> can conceivably be quite large, we stop at 7, since the probability of observing <span class="math inline">\(n=8\)</span> is vanishingly small, <span class="math inline">\(P(N=8) < 0.0002\)</span>.</p>
<pre class="r"><code>dd <- genCorGen(n = 1000, nvars = 7, params1 = 0, params2 = 1,
dist = "uniform", rho = 0.0, corstr = "cs", cnames = "U")
dd[, csum := cumsum(U), keyby = id]
dd[, under := 1*(csum < 1)]
dc <- dcast(dd, id ~ (period + 1), value.var = c("under", "csum" ))
dc[, n := 2 + sum(under_2, under_3, under_4, under_5, under_6, under_7),
keyby = id]
dc[, .(id, n, csum_1, csum_2, csum_3, csum_4, csum_5, csum_6, csum_7)]</code></pre>
<pre><code>## id n csum_1 csum_2 csum_3 csum_4 csum_5 csum_6 csum_7
## 1: 1 2 0.39 1.13 1.41 1.94 2.4 3.0 3.4
## 2: 2 3 0.31 0.75 1.25 1.74 2.0 2.5 3.0
## 3: 3 2 0.67 1.61 1.98 2.06 2.1 2.6 2.8
## 4: 4 3 0.45 0.83 1.15 1.69 2.6 2.7 3.7
## 5: 5 3 0.27 0.81 1.81 2.46 2.9 3.5 3.9
## ---
## 996: 996 2 0.64 1.26 2.24 2.70 3.1 4.0 4.6
## 997: 997 4 0.06 0.80 0.81 1.05 1.2 1.6 1.8
## 998: 998 5 0.32 0.53 0.71 0.73 1.2 1.2 1.2
## 999: 999 2 0.91 1.02 1.49 1.75 2.4 2.4 2.5
## 1000: 1000 2 0.87 1.10 1.91 2.89 3.3 4.1 4.4</code></pre>
<p>And this is what the data look like. On the left is the cumulative sum of each iteration (color coded by the threshold value), and on the right is the probability for each level of <span class="math inline">\(n\)</span>.</p>
<p><img src="https://www.rdatagen.net/post/2019-02-05-a-fun-example-to-explore-probability_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>Here are the observed and expected probabilities:</p>
<pre class="r"><code>expProbN <- function(n) {
(n-1)/(factorial(n))
}
rbind(prop.table(dc[, table(n)]), # observed
expProbN(2:7) # expected
)</code></pre>
<pre><code>## 2 3 4 5 6 7
## [1,] 0.49 0.33 0.14 0.039 0.0100 0.0010
## [2,] 0.50 0.33 0.12 0.033 0.0069 0.0012</code></pre>
</div>
<div id="expected-value-of-n" class="section level3">
<h3>Expected value of N</h3>
<p>The final piece to the puzzle requires a brief introduction to expected value, which for a discrete outcome (which is what we are dealing with even though the underlying process is a sum of potentially infinite continuous outcomes) is <span class="math inline">\(\sum_{n=0}^\infty \; n\times P(n)\)</span>:</p>
<p><span class="math display">\[
\footnotesize{
\begin{aligned}
E[N] &= \sum_{n=1}^{\infty} \; n P(n) \\
\\
&= \sum_{n=1}^{\infty} \; n \left ( \frac{n-1}{n!} \right) \\
\\
&= \sum_{n=2}^{\infty} \; n \left ( \frac{n-1}{n!} \right) \\
\\
&= \sum_{n=2}^{\infty} \; \frac{1}{(n-2)!} \\
\\
&= \sum_{n=0}^{\infty} \; \frac{1}{n!} \\
\\
&= \sum_{n=0}^{\infty} \; \frac{1^n}{n!} \\
\\
E[N] &= exp(1) \; \; \text{(since} \sum_{n=0}^{\infty} \; \frac{a^n}{n!} = e^a \text{)}
\end{aligned}
}
\]</span></p>
<p>We are now in a position to see if our observed average is what is predicted by theory:</p>
<pre class="r"><code>c(observed = dc[, mean(n)], expected = exp(1))</code></pre>
<pre><code>## observed expected
## 2.8 2.7</code></pre>
<p>I am assuming that all the students in the class will think this is pretty cool when they see this final result. And that will provide motivation to really learn all of these concepts (and more) over the subsequent weeks of the course.</p>
<p>One final note: I have evaded uncertainty or variability in all of this, which is obviously a key omission, and something I would need to address if I really had an opportunity to do something like this in a class. However, simulation provides ample opportunity to introduce that as well, so I am sure I could figure out a way to weave that in. Or maybe that could be the second class, though I probably won’t do a follow-up post.</p>
</div>
Correlated longitudinal data with varying time intervals
https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/
Tue, 22 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/
<p>I was recently contacted to see if <code>simstudy</code> can create a data set of correlated outcomes that are measured over time, but at different intervals for each individual. The quick answer is there is no specific function to do this. However, if you are willing to assume an “exchangeable” correlation structure, where measurements far apart in time are just as correlated as measurements taken close together, then you could just generate individual-level random effects (intercepts and/or slopes) and pretty much call it a day. Unfortunately, the researcher had something more challenging in mind: he wanted to generate auto-regressive correlation, so that proximal measurements are more strongly correlated than distal measurements.</p>
<p>As is always the case with <code>R</code>, there are certainly multiple ways to do tackle this problem. I came up with this particular solution, which I thought I’d share. The idea is pretty simple: first, generate the time data with varying intervals, which <em>can</em> be done using <code>simstudy</code>; second, create an alternate data set of “latent” observations that include all time points, also doable with <code>simstudy</code>; last, merge the two in a way that gives you what you want.</p>
<div id="step-1-varying-time-intervals" class="section level3">
<h3>Step 1: varying time intervals</h3>
<p>The function <code>addPeriods</code> can create intervals of varying lengths. The function determines if the input data set includes the special fields <code>mInterval</code> and <code>vInterval</code>. If so, a <code>time</code> value is generated from a gamma distribution with mean <code>mInterval</code> and dispersion <code>vInterval</code>.</p>
<pre class="r"><code>maxTime <- 180 # limit follow-up time to 180 days
def1 <- defData(varname = "nCount", dist = "noZeroPoisson",
formula = 20)
def1 <- defData(def1, varname = "mInterval", dist = "nonrandom",
formula = 20)
def1 <- defData(def1, varname = "vInterval", dist = "nonrandom",
formula = 0.4)
set.seed(20190101)
dt <- genData(1000, def1)
dtPeriod <- addPeriods(dt)
dtPeriod <- dtPeriod[time <= maxTime]</code></pre>
<p>Here is a plot if time intervals for a small sample of the data set:</p>
<p><img src="https://www.rdatagen.net/post/2019-01-22-correlated-longitudinal-data-with-varying-time-intervals_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
<div id="step-2-generate-correlated-data" class="section level3">
<h3>Step 2: generate correlated data</h3>
<p>In this step, I am creating 181 records for each individual (from period = 0 to period = 180). In order to create correlated data, I need to specify the mean and variance for each observation; in this example, the mean is a quadratic function of <code>time</code> and the variance is fixed at 9. I generate the correlated data using the <code>addCorGen</code> function, and specify an <em>AR-1</em> correlation structure with <span class="math inline">\(\rho = 0.4\)</span>,</p>
<pre class="r"><code>def2 <- defDataAdd(varname = "mu", dist = "nonrandom",
formula = "2 + (1/500) * (time) * (180 - time)")
def2 <- defDataAdd(def2, varname = "var", dist = "nonrandom", formula = 9)
dtY <- genData(1000)
dtY <- addPeriods(dtY, nPeriod = (maxTime + 1) )
setnames(dtY, "period", "time")
dtY <- addColumns(def2, dtY)
dtY <- addCorGen(dtOld = dtY, idvar = "id", nvars = (maxTime + 1),
rho = .4, corstr = "ar1", dist = "normal",
param1 = "mu", param2 = "var", cnames = "Y")
dtY[, `:=`(timeID = NULL, var = NULL, mu = NULL)]</code></pre>
<p>Here is a plot of a sample of individuals that shows the values of <span class="math inline">\(Y\)</span> at every single time point (not just the time points generated in step 1). The <span class="math inline">\(Y\)</span>’s are correlated within individual.</p>
<p><img src="https://www.rdatagen.net/post/2019-01-22-correlated-longitudinal-data-with-varying-time-intervals_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
</div>
<div id="step-3" class="section level3">
<h3>Step 3</h3>
<p>Now we just do an inner-join, or perhaps it is a left join - hard to tell, because one data set is a subset of the other. In any case, the new data set includes all the rows from step 1 and the ones that match from step 2.</p>
<pre class="r"><code>setkey(dtY, id, time)
setkey(dtPeriod, id, time)
finalDT <- mergeData(dtY, dtPeriod, idvars = c("id", "time"))</code></pre>
<p>Here is a plot of the observed data for a sample of individuals:</p>
<p><img src="https://www.rdatagen.net/post/2019-01-22-correlated-longitudinal-data-with-varying-time-intervals_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>To verify that the data are indeed correlated with an <em>AR-1</em> structure, I first convert the complete (latent) data from step 2 from its <em>long</em> format to a <em>wide</em> format. The correlation is calculated from this <span class="math inline">\(1000 \times 181\)</span> matrix, where each row is an individual and each column is a value of <span class="math inline">\(Y\)</span> at a different time point. And since the correlation matrix, which has dimensions <span class="math inline">\(181 \times 181\)</span>, is too big to show, what you see is only the upper left hand corner of the matrix:</p>
<pre class="r"><code>round(cor(as.matrix(dcast(dtY, id ~ time,
value.var = "Y")[, -1]))[1:13, 1:13], 1)</code></pre>
<pre><code>## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 0 1.0 0.4 0.2 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0 0.0
## 1 0.4 1.0 0.4 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 2 0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.0
## 3 0.1 0.1 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.0 0.0 0.0 0.1
## 4 0.0 0.0 0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.1 0.0 0.0 0.0
## 5 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.0 0.0
## 6 0.1 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.0 0.0 0.0 0.0
## 7 0.0 0.0 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.1 0.1 0.0
## 8 0.0 0.0 0.1 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.1 0.0 0.0
## 9 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.2 0.4 1.0 0.4 0.2 0.0
## 10 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.1 0.4 1.0 0.4 0.2
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.2 0.4 1.0 0.4
## 12 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.4 1.0</code></pre>
</div>
Considering sensitivity to unmeasured confounding: part 2
https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/
Thu, 10 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/
<p>In <a href="https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/">part 1</a> of this 2-part series, I introduced the notion of <em>sensitivity to unmeasured confounding</em> in the context of an observational data analysis. I argued that an estimate of an association between an observed exposure <span class="math inline">\(D\)</span> and outcome <span class="math inline">\(Y\)</span> is sensitive to unmeasured confounding if we can conceive of a reasonable alternative data generating process (DGP) that includes some unmeasured confounder that will generate the same observed distribution the observed data. I further argued that reasonableness can be quantified or parameterized by the two correlation coefficients <span class="math inline">\(\rho_{UD}\)</span> and <span class="math inline">\(\rho_{UY}\)</span>, which measure the strength of the relationship of the unmeasured confounder <span class="math inline">\(U\)</span> with each of the observed measures. Alternative DGPs that are characterized by high correlation coefficients can be viewed as less realistic, and the observed data could be considered less sensitive to unmeasured confounding. On the other hand, DGPs characterized by lower correlation coefficients would be considered more sensitive.</p>
<p>I need to pause here for a moment to point out that something similar has been described much more thoroughly by a group at NYU’s <a href="https://steinhardt.nyu.edu/priism/">PRIISM</a> (see <a href="https://www.tandfonline.com/doi/abs/10.1080/19345747.2015.1078862">Carnegie, Harada & Hill</a> and <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/sim.6973">Dorie et al</a>). In fact, this group of researchers has actually created an <code>R</code> package called <a href="https://cran.r-project.org/web/packages/treatSens/index.html">treatSens</a> to facilitate sensitivity analysis. I believe the discussion in these posts here is consistent with the PRIISM methodology, except <code>treatSens</code> is far more flexible (e.g. it can handle binary exposures) and provides more informative output than what I am describing. I am hoping that the examples and derivation of an equivalent DGP that I show here provide some additional insight into what sensitivity means.</p>
<p>I’ve been wrestling with these issues for a while, but the ideas for the derivation of an alternative DGP were actually motivated by this recent <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/sim.7904">note</a> by <em>Fei Wan</em> on an unrelated topic. (Wan shows how a valid instrumental variable may appear to violate a key assumption even though it does not.) The key element of Wan’s argument for my purposes is how the coefficient estimates of an observed model relate to the coefficients of an alternative (possibly true) data generation process/model.</p>
<p>OK - now we are ready to walk through the derivation of alternative DGPs for an observed data set.</p>
<div id="two-dgps-same-data" class="section level3">
<h3>Two DGPs, same data</h3>
<p>Recall from Part 1 that we have an observed data model</p>
<p><span class="math display">\[ Y = k_0 + k_1D + \epsilon_Y\]</span>
where <span class="math inline">\(\epsilon_Y \sim N\left(0, \sigma^2_Y\right)\)</span>. We are wondering if there is another DGP that could have generated the data that we have actually observed:</p>
<p><span class="math display">\[
\begin{aligned}
D &= \alpha_0 + \alpha_1 U + \epsilon_D \\
Y &= \beta_0 + \beta_1 D + \beta_2 U + \epsilon_{Y^*},
\end{aligned}
\]</span></p>
<p>where <span class="math inline">\(U\)</span> is some unmeasured confounder, and <span class="math inline">\(\epsilon_D \sim N\left(0, \sigma^2_D\right)\)</span> and <span class="math inline">\(\epsilon_{Y^*} \sim N\left(0, \sigma^2_{Y^*}\right)\)</span>. Can we go even further and find an alternative DGP where <span class="math inline">\(D\)</span> has no direct effect on <span class="math inline">\(Y\)</span> at all?</p>
<p><span class="math display">\[
\begin{aligned}
D &= \alpha_0 + \alpha_1 U + \epsilon_D \\
Y &= \beta_0 + \beta_2 U + \epsilon_{Y^*},
\end{aligned}
\]</span></p>
</div>
<div id="alpha_1-and-sigma_epsilon_d2-derived-from-rho_ud" class="section level3">
<h3><span class="math inline">\(\alpha_1\)</span> (and <span class="math inline">\(\sigma_{\epsilon_D}^2\)</span>) derived from <span class="math inline">\(\rho_{UD}\)</span></h3>
<p>In a simple linear regression model with a single predictor, the coefficient <span class="math inline">\(\alpha_1\)</span> can be specified directly in terms <span class="math inline">\(\rho_{UD}\)</span>, the correlation between <span class="math inline">\(U\)</span> and <span class="math inline">\(D\)</span>:</p>
<p><span class="math display">\[ \alpha_1 = \rho_{UD} \frac{\sigma_D}{\sigma_U}\]</span>
We can estimate <span class="math inline">\(\sigma_D\)</span> from the observed data set, and we can reasonably assume that <span class="math inline">\(\sigma_U = 1\)</span> (since we could always normalize the original measurement of <span class="math inline">\(U\)</span>). Finally, we can specify a range of <span class="math inline">\(\rho_{UD}\)</span> (I am keeping everything positive here for simplicity), such that <span class="math inline">\(0 < \rho_{UD} < 0.90\)</span> (where I assume a correlation of <span class="math inline">\(0.90\)</span> is at or beyond the realm of reasonableness). By plugging these three parameters into the formula, we can generate a range of <span class="math inline">\(\alpha_1\)</span>’s.</p>
<p>Furthermore, we can derive an estimate of the variance for <span class="math inline">\(\epsilon_D\)</span> ( <span class="math inline">\(\sigma_{\epsilon_D}^2\)</span>) at each level of <span class="math inline">\(\rho_{UD}\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
Var(D) &= Var(\alpha_0 + \alpha_1 U + \epsilon_D) \\
\\
\sigma_D^2 &= \alpha_1^2 \sigma_U^2 + \sigma_{\epsilon_D}^2 \\
\\
\sigma_{\epsilon_D}^2 &= \sigma_D^2 - \alpha_1^2 \; \text{(since } \sigma_U^2=1)
\end{aligned}
\]</span></p>
<p>So, for each value of <span class="math inline">\(\rho_{UD}\)</span> that we generated, there is a corresponding pair <span class="math inline">\((\alpha_1, \; \sigma_{\epsilon_D}^2)\)</span>.</p>
</div>
<div id="determine-beta_2" class="section level3">
<h3>Determine <span class="math inline">\(\beta_2\)</span></h3>
<p>In the <a href="#addendum">addendum</a> I go through a bit of an elaborate derivation of <span class="math inline">\(\beta_2\)</span>, the coefficient of <span class="math inline">\(U\)</span> in the alternative outcome model. Here is the bottom line:</p>
<p><span class="math display">\[
\beta_2 = \frac{\alpha_1}{1-\frac{\sigma_{\epsilon_D}^2}{\sigma_D^2}}\left( k_1 - \beta_1\right)
\]</span></p>
<p>In the equation, we have <span class="math inline">\(\sigma^2_D\)</span> and <span class="math inline">\(k_1\)</span>, which are both estimated from the observed data and the pair of derived parameters <span class="math inline">\(\alpha_1\)</span> and <span class="math inline">\(\sigma_{\epsilon_D}^2\)</span> based on <span class="math inline">\(\rho_{UD}\)</span>. <span class="math inline">\(\beta_1\)</span>, the coefficient of <span class="math inline">\(D\)</span> in the outcome model is a free parameter, set externally. That is, we can choose to evaluate all <span class="math inline">\(\beta_2\)</span>’s the are generated when <span class="math inline">\(\beta_1 = 0\)</span>. More generally, we can set <span class="math inline">\(\beta_1 = pk_1\)</span>, where <span class="math inline">\(0 \le p \le 1\)</span>. (We could go negative if we want, though I won’t do that here.) If <span class="math inline">\(p=1\)</span> , <span class="math inline">\(\beta_1 = k_1\)</span> and <span class="math inline">\(\beta_2 = 0\)</span>; we end up with the original model with no confounding.</p>
<p>So, once we specify <span class="math inline">\(\rho_{UD}\)</span> and <span class="math inline">\(p\)</span>, we get the corresponding triplet <span class="math inline">\((\alpha_1, \; \sigma_{\epsilon_D}^2, \; \beta_2)\)</span>.</p>
</div>
<div id="determine-rho_uy" class="section level3">
<h3>Determine <span class="math inline">\(\rho_{UY}\)</span></h3>
<p>In this last step, we can identify the correlation of <span class="math inline">\(U\)</span> and <span class="math inline">\(Y\)</span>, <span class="math inline">\(\rho_{UY}\)</span>, that is associated with all the observed, specified, and derived parameters up until this point. We start by writing the alternative outcome model, and then replace <span class="math inline">\(D\)</span> with the alternative exposure model, and do some algebraic manipulation to end up with a re-parameterized alternative outcome model that has a single predictor:</p>
<p><span class="math display">\[
\begin{aligned}
Y &= \beta_0 + \beta_1 D + \beta_2 U + \epsilon_Y^* \\
&= \beta_0 + \beta_1 \left( \alpha_0 + \alpha_1 U + \epsilon_D \right) + \beta_2 U + \epsilon_Y^* \\
&=\beta_0 + \beta_1 \alpha_0 + \beta_1 \alpha_1 U + \beta_1 \epsilon_D + \beta_2 U +
\epsilon_Y^* \\
&=\beta_0^* + \left( \beta_1 \alpha_1 + \beta_2 \right)U + \epsilon_Y^+ \\
&=\beta_0^* + \beta_1^*U + \epsilon_Y^+ , \\
\end{aligned}
\]</span></p>
<p>where <span class="math inline">\(\beta_0^* = \beta_0 + \beta_1 \alpha_0\)</span>, <span class="math inline">\(\beta_1^* = \left( \beta_1 \alpha_1 + \beta_2 \right)\)</span>, and <span class="math inline">\(\epsilon_Y^+ = \beta_1 \epsilon_D + \epsilon_Y*\)</span>.</p>
<p>As before, the coefficient in a simple linear regression model with a single predictor is related to the correlation of the two variables as follows:</p>
<p><span class="math display">\[
\beta_1^* = \rho_{UY} \frac{\sigma_Y}{\sigma_U}
\]</span></p>
<p>Since <span class="math inline">\(\beta_1^* = \left( \beta_1 \alpha_1 + \beta_2 \right)\)</span>,</p>
<p><span class="math display">\[
\begin{aligned}
\beta_1 \alpha_1 + \beta_2 &= \rho_{UY} \frac{\sigma_Y}{\sigma_U} \\
\\
\rho_{UY} &= \frac{\sigma_U}{\sigma_Y} \left( \beta_1 \alpha_1 + \beta_2 \right) \\
\\
&= \frac{\left( \beta_1 \alpha_1 + \beta_2 \right)}{\sigma_Y}
\end{aligned}
\]</span></p>
</div>
<div id="determine-sigma2_y" class="section level3">
<h3>Determine <span class="math inline">\(\sigma^2_{Y*}\)</span></h3>
<p>In order to simulate data from the alternative DGPs, we need to derive the variation for the noise of the alternative model. That is, we need an estimate of <span class="math inline">\(\sigma^2_{Y*}\)</span>.</p>
<p><span class="math display">\[
\begin{aligned}
Var(Y) &= Var(\beta_0 + \beta_1 D + \beta_2 U + \epsilon_{Y^*}) \\
\\
&= \beta_1^2 Var(D) + \beta_2^2 Var(U) + 2\beta_1\beta_2Cov(D, U) + Var(\epsilon_{y*}) \\
\\
&= \beta_1^2 \sigma^2_D + \beta_2^2 + 2\beta_1\beta_2\rho_{UD}\sigma_D + \sigma^2_{Y*} \\
\end{aligned}
\]</span></p>
<p>So,</p>
<p><span class="math display">\[
\sigma^2_{Y*} = Var(Y) - (\beta_1^2 \sigma^2_D + \beta_2^2 + 2\beta_1\beta_2\rho_{UD}\sigma_D),
\]</span></p>
<p>where <span class="math inline">\(Var(Y)\)</span> is the variation of <span class="math inline">\(Y\)</span> from the observed data. Now we are ready to implement this in R.</p>
</div>
<div id="implementing-in-r" class="section level3">
<h3>Implementing in <code>R</code></h3>
<p>If we have an observed data set with observed <span class="math inline">\(D\)</span> and <span class="math inline">\(Y\)</span>, and some target <span class="math inline">\(\beta_1\)</span> determined by <span class="math inline">\(p\)</span>, we can calculate/generate all the quantities that we just derived.</p>
<p>Before getting to the function, I want to make a brief point about what we do if we have other <em>measured</em> confounders. We can essentially eliminate measured confounders by regressing the exposure <span class="math inline">\(D\)</span> on the confounders and conducting the entire sensitivity analysis with the residual exposure measurements derived from this initial regression model. I won’t be doing this here, but if anyone wants to see an example of this, let me know, and I can do a short post.</p>
<p>OK - here is the function, which essentially follows the path of the derivation above:</p>
<pre class="r"><code>altDGP <- function(dd, p) {
# Create values of rhoUD
dp <- data.table(p = p, rhoUD = seq(0.0, 0.9, length = 1000))
# Parameters estimated from data
dp[, `:=`(sdD = sd(dd$D), s2D = var(dd$D), sdY = sd(dd$Y))]
dp[, k1:= coef(lm(Y ~ D, data = dd))[2]]
# Generate b1 based on p
dp[, b1 := p * k1]
# Determine a1
dp[, a1 := rhoUD * sdD ]
# Determine s2ed
dp[, s2ed := s2D - (a1^2)]
# Determine b2
dp[, g:= s2ed/s2D]
dp <- dp[g != 1]
dp[, b2 := (a1 / (1 - g) ) * ( k1 - b1 )]
# Determine rhoUY
dp[, rhoUY := ( (b1 * a1) + b2 ) / sdY ]
# Eliminate impossible correlations
dp <- dp[rhoUY > 0 & rhoUY <= .9]
# Determine s2eyx
dp[, s2eyx := sdY^2 - (b1^2 * s2D + b2^2 + 2 * b1 * b2 * rhoUD * sdD)]
dp <- dp[s2eyx > 0]
# Determine standard deviations
dp[, sdeyx := sqrt(s2eyx)]
dp[, sdedx := sqrt(s2ed)]
# Finished
dp[]
}</code></pre>
</div>
<div id="assessing-sensitivity" class="section level3">
<h3>Assessing sensitivity</h3>
<p>If we generate the same data set we started out with last post, we can use the function to assess the sensitivity of this association.</p>
<pre class="r"><code>defO <- defData(varname = "D", formula = 0, variance = 1)
defO <- defData(defO, varname = "Y", formula = "1.5 * D", variance = 25)
set.seed(20181201)
dtO <- genData(1200, defO)</code></pre>
<p>In this first example, I am looking for the DGP with <span class="math inline">\(\beta_1 = 0\)</span>, which is implemented as <span class="math inline">\(p = 0\)</span> in the call to function <code>altDGP</code>. Each row of output represents an alternative set of parameters that will result in a DGP with <span class="math inline">\(\beta_1 = 0\)</span>.</p>
<pre class="r"><code>dp <- altDGP(dtO, p = 0)
dp[, .(rhoUD, rhoUY, k1, b1, a1, s2ed, b2, s2eyx)]</code></pre>
<pre><code>## rhoUD rhoUY k1 b1 a1 s2ed b2 s2eyx
## 1: 0.295 0.898 1.41 0 0.294 0.904 4.74 5.36
## 2: 0.296 0.896 1.41 0 0.295 0.903 4.72 5.50
## 3: 0.297 0.893 1.41 0 0.296 0.903 4.71 5.63
## 4: 0.298 0.890 1.41 0 0.297 0.902 4.69 5.76
## 5: 0.299 0.888 1.41 0 0.298 0.902 4.68 5.90
## ---
## 668: 0.896 0.296 1.41 0 0.892 0.195 1.56 25.35
## 669: 0.897 0.296 1.41 0 0.893 0.193 1.56 25.35
## 670: 0.898 0.296 1.41 0 0.894 0.191 1.56 25.36
## 671: 0.899 0.295 1.41 0 0.895 0.190 1.56 25.36
## 672: 0.900 0.295 1.41 0 0.896 0.188 1.55 25.37</code></pre>
<p>Now, I am creating a data set that will be based on four levels of <span class="math inline">\(\beta_1\)</span>. I do this by creating a vector <span class="math inline">\(p = \; <0.0, \; 0.2, \; 0.5, \; 0.8>\)</span>. The idea is to create a plot that shows the curve for each value of <span class="math inline">\(p\)</span>. The most extreme curve (in this case, the curve all the way to the right, since we are dealing with positive associations only) represents the scenario where <span class="math inline">\(p = 0\)</span> (i.e. <span class="math inline">\(\beta_1 = 0\)</span>). The curves moving to the left reflect increasing sensitivity as <span class="math inline">\(p\)</span> increases.</p>
<pre class="r"><code>dsenO <- rbindlist(lapply(c(0.0, 0.2, 0.5, 0.8),
function(x) altDGP(dtO, x)))</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-01-10-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii_files/figure-html/unnamed-chunk-6-1.png" width="720" /></p>
<p>I would say that in this first case the observed association is moderately sensitive to unmeasured confounding, as correlations as low as 0.5 would enough to erase the association.</p>
<p>In the next case, if the association remains unchanged but the variation of <span class="math inline">\(Y\)</span> is considerably reduced, the observed association is much less sensitive. However, it is still quite possible that the observed overestimation is at least partially overstated, as relatively low levels of correlation could reduce the estimated association.</p>
<pre class="r"><code>defA1 <- updateDef(defO, changevar = "Y", newvariance = 4)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-01-10-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii_files/figure-html/unnamed-chunk-8-1.png" width="720" /></p>
<p>In this last scenario, variance is the same as the initial scenario, but the association is considerably weaker. Here, we see that the estimate of the association is extremely sensitive to unmeasured confounding, as low levels of correlation are required to entirely erase the association.</p>
<pre class="r"><code>defA2 <- updateDef(defO, changevar = "Y", newformula = "0.25 * D")</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-01-10-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii_files/figure-html/unnamed-chunk-10-1.png" width="720" /></p>
</div>
<div id="treatsens-package" class="section level3">
<h3><code>treatSens</code> package</h3>
<p>I want to show output generated by the <code>treatSens</code> package I referenced earlier. <code>treatSens</code> requires a formula that includes an outcome vector <span class="math inline">\(Y\)</span>, an exposure vector <span class="math inline">\(Z\)</span>, and at least one vector of measured of confounders <span class="math inline">\(X\)</span>. In my examples, I have included no measured confounders, so I generate a vector of independent noise that is not related to the outcome.</p>
<pre class="r"><code>library(treatSens)
X <- rnorm(1200)
Y <- dtO$Y
Z <- dtO$D
testsens <- treatSens(Y ~ Z + X, nsim = 5)
sensPlot(testsens)</code></pre>
<p>Once <code>treatSens</code> has been executed, it is possible to generate a sensitivity plot, which looks substantively similar to the ones I have created. The package uses sensitivity parameters <span class="math inline">\(\zeta^Z\)</span> and <span class="math inline">\(\zeta^Y\)</span>, which represent the coefficients of <span class="math inline">\(U\)</span>, the unmeasured confounder. Since <code>treatSens</code> normalizes the data (in the default setting), these coefficients are actually equivalent to the correlations <span class="math inline">\(\rho_{UD}\)</span> and <span class="math inline">\(\rho_{UY}\)</span> that are the basis of my sensitivity analysis. A important difference in the output is that <code>treatSens</code> provides uncertainty bands, and extends into regions of negative correlation. (And of course, a more significant difference is that <code>treatSens</code> is flexible enough to handle binary exposures, whereas I have not yet extended my analytic approach in that direction, and I suspect it is no possible for me to do so due to non-collapsibility of logistic regression estimands - I hope to revisit this in the future.)</p>
<div id="observed-data-scenario-1-smally-sim-n1.50z-25" class="section level4">
<h4>Observed data scenario 1: <span class="math inline">\(\small{Y \sim N(1.50Z, \; 25)}\)</span></h4>
<p><img src="https://www.rdatagen.net/img/post-treatSens/Var25.png" width="550" /></p>
</div>
<div id="observed-data-scenario-2-smally-sim-n1.50z-4" class="section level4">
<h4>Observed data scenario 2: <span class="math inline">\(\small{Y \sim N(1.50Z, \; 4)}\)</span></h4>
<p><img src="https://www.rdatagen.net/img/post-treatSens/Var04.png" width="550" /></p>
</div>
<div id="observed-data-scenario-3-smally-sim-n0.25z-25" class="section level4">
<h4>Observed data scenario 3: <span class="math inline">\(\small{Y \sim N(0.25Z, \; 25)}\)</span></h4>
<p><img src="https://www.rdatagen.net/img/post-treatSens/V25025.png" width="550" /></p>
<p><a name="addendum"></a></p>
</div>
</div>
<div id="addendum-derivation-of-beta_2" class="section level2">
<h2>Addendum: Derivation of <span class="math inline">\(\beta_2\)</span></h2>
<p>In case you want more detail on how we derive <span class="math inline">\(\beta_2\)</span> from the observed data model and assumed correlation parameters, here it is. We start by specifying the simple observed outcome model:</p>
<p><span class="math display">\[ Y = k_0 + k_1D + \epsilon_Y\]</span></p>
<p>We can estimate the parameters <span class="math inline">\(k_0\)</span> and <span class="math inline">\(k_1\)</span> using this standard matrix solution:</p>
<p><span class="math display">\[ <k_0, \; k_1> \; = (W^TW)^{-1}W^TY,\]</span></p>
<p>where <span class="math inline">\(W\)</span> is the <span class="math inline">\(n \times 2\)</span> design matrix:</p>
<p><span class="math display">\[ W = [\mathbf{1}, D]_{n \times 2}.\]</span></p>
<p>We can replace <span class="math inline">\(Y\)</span> with the alternative outcome model:</p>
<p><span class="math display">\[
\begin{aligned}
<k_0, \; k_1> \; &= (W^TW)^{-1}W^T(\beta_0 + \beta_1 D + \beta_2 U + \epsilon_Y^*) \\
&= \;<\beta_0, 0> + <0, \beta_1> +\; (W^TW)^{-1}W^T(\beta_2U) + \mathbf{0} \\
&= \;<\beta_0, \beta_1> +\; (W^TW)^{-1}W^T(\beta_2U)
\end{aligned}
\]</span></p>
<p>Note that</p>
<p><span class="math display">\[
\begin{aligned}
(W^TW)^{-1}W^T(\beta_0) &= \; <\beta_0,\; 0> \; \; and\\
\\
(W^TW)^{-1}W^T(\beta_1D) &= \; <0,\; \beta_1>.
\end{aligned}
\]</span></p>
<p>Now, we need to figure out what <span class="math inline">\((W^TW)^{-1}W^T(\beta_2U)\)</span> is. First, we rearrange the alternate exposure model:
<span class="math display">\[
\begin{aligned}
D &= \alpha_0 + \alpha_1 U + \epsilon_D \\
\alpha_1 U &= D - \alpha_0 - \epsilon_D \\
U &= \frac{1}{\alpha_1} \left( D - \alpha_0 - \epsilon_D \right) \\
\beta_2 U &= \frac{\beta_2}{\alpha_1} \left( D - \alpha_0 - \epsilon_D \right)
\end{aligned}
\]</span></p>
<p>We can replace <span class="math inline">\(\beta_2 U\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
(W^TW)^{-1}W^T(\beta_2U) &= (W^TW)^{-1}W^T \left[ \frac{\beta_2}{\alpha_1} \left( D - \alpha_0 - \epsilon_D \right) \right] \\
&= <-\frac{\beta_2}{\alpha_1}\alpha_0, 0> + <0,\frac{\beta_2}{\alpha_1}>-\;\frac{\beta_2}{\alpha_1}(W^TW)^{-1}W^T \epsilon_D \\
&= <-\frac{\beta_2}{\alpha_1}\alpha_0, \frac{\beta_2}{\alpha_1}>-\;\frac{\beta_2}{\alpha_1}(W^TW)^{-1}W^T \epsilon_D \\
\end{aligned}
\]</span></p>
<p>And now we get back to <span class="math inline">\(<k_0,\; k_1>\)</span> :</p>
<p><span class="math display">\[
\begin{aligned}
<k_0,\; k_1> \; &= \;<\beta_0,\; \beta_1> +\; (W^TW)^{-1}W^T(\beta_2U) \\
&= \;<\beta_0-\frac{\beta_2}{\alpha_1}\alpha_0, \; \beta_1 + \frac{\beta_2}{\alpha_1}>-\;\frac{\beta_2}{\alpha_1}(W^TW)^{-1}W^T \epsilon_D \\
&= \;<\beta_0-\frac{\beta_2}{\alpha_1}\alpha_0, \; \beta_1 + \frac{\beta_2}{\alpha_1}>-\;\frac{\beta_2}{\alpha_1}<\gamma_0, \; \gamma_1>
\end{aligned}
\]</span></p>
<p>where <span class="math inline">\(\gamma_0\)</span> and <span class="math inline">\(\gamma_1\)</span> come from regressing <span class="math inline">\(\epsilon_D\)</span> on <span class="math inline">\(D\)</span>:</p>
<p><span class="math display">\[ \epsilon_D = \gamma_0 + \gamma_1 D\]</span>
so,</p>
<p><span class="math display">\[
\begin{aligned}
<k_0,\; k_1> \; &= \;<\beta_0-\frac{\beta_2}{\alpha_1}\alpha_0 - \frac{\beta_2}{\alpha_1}\gamma_0, \; \beta_1 + \frac{\beta_2}{\alpha_1} - \frac{\beta_2}{\alpha_1}\gamma_1 > \\
&= \;<\beta_0-\frac{\beta_2}{\alpha_1}\left(\alpha_0 + \gamma_0\right), \; \beta_1 + \frac{\beta_2}{\alpha_1}\left(1 - \gamma_1 \right) >
\end{aligned}
\]</span></p>
<p>Since we can center all the observed data, we can easily assume that <span class="math inline">\(k_0 = 0\)</span>. All we need to worry about is <span class="math inline">\(k_1\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
k_1 &= \beta_1 + \frac{\beta_2}{\alpha_1}\left(1 - \gamma_1 \right) \\
\frac{\beta_2}{\alpha_1}\left(1 - \gamma_1 \right) &= k_1 - \beta_1 \\
\beta_2 &= \frac{\alpha_1}{1-\gamma_1}\left( k_1 - \beta_1\right)
\end{aligned}
\]</span></p>
<p>We have generated <span class="math inline">\(\alpha_1\)</span> based on <span class="math inline">\(\rho_{UD}\)</span>, <span class="math inline">\(k_1\)</span> is a estimated from the data, and <span class="math inline">\(\beta_1\)</span> is fixed based on some <span class="math inline">\(p, \; 0 \le p \le 1\)</span> such that <span class="math inline">\(\beta_1 = pk_1\)</span>. All that remains is <span class="math inline">\(\gamma_1\)</span>:</p>
<p><span class="math display">\[
\gamma_1 = \rho_{\epsilon_D D} \frac{\sigma_{\epsilon_D}}{\sigma_D}
\]</span></p>
<p>Since <span class="math inline">\(D = \alpha_0 + \alpha_1 U + \epsilon_D\)</span> (and <span class="math inline">\(\epsilon_D \perp \! \! \! \perp U\)</span>)</p>
<p><span class="math display">\[
\begin{aligned}
\rho_{\epsilon_D D} &= \frac{Cov(\epsilon_D, D)}{\sigma_{\epsilon_D} \sigma_D} \\
\\
&=\frac{Cov(\epsilon_D, \;\alpha_0 + \alpha_1 U + \epsilon_D )}{\sigma_{\epsilon_D} \sigma_D} \\
\\
&= \frac{\sigma_{\epsilon_D}^2}{\sigma_{\epsilon_D} \sigma_D} \\
\\
&= \frac{\sigma_{\epsilon_D}}{\sigma_D}
\end{aligned}
\]</span></p>
<p>It follows that</p>
<p><span class="math display">\[
\begin{aligned}
\gamma_1 &= \rho_{\epsilon_D D} \frac{\sigma_{\epsilon_D}}{\sigma_D} \\
\\
&=\frac{\sigma_{\epsilon_D}}{\sigma_D} \times \frac{\sigma_{\epsilon_D}}{\sigma_D} \\
\\
&=\frac{\sigma_{\epsilon_D}^2}{\sigma_D^2}
\end{aligned}
\]</span></p>
<p>So, now, we have all the elements to generate <span class="math inline">\(\beta_2\)</span> for a range of <span class="math inline">\(\alpha_1\)</span>’s and <span class="math inline">\(\sigma_{\epsilon_D}^2\)</span>’s:</p>
<p><span class="math display">\[
\beta_2 = \frac{\alpha_1}{1-\frac{\sigma_{\epsilon_D}^2}{\sigma_D^2}}\left( k_1 - \beta_1\right)
\]</span></p>
</div>
Considering sensitivity to unmeasured confounding: part 1
https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/
Wed, 02 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/
<p>Principled causal inference methods can be used to compare the effects of different exposures or treatments we have observed in non-experimental settings. These methods, which include matching (with or without propensity scores), inverse probability weighting, and various g-methods, help us create comparable groups to simulate a randomized experiment. All of these approaches rely on a key assumption of <em>no unmeasured confounding</em>. The problem is, short of subject matter knowledge, there is no way to test this assumption empirically.</p>
<p>The general approach to this problem has been to posit a level of unmeasured confounding that would be necessary to alter the conclusions of a study. The classic example (which also is probably the first) comes from the debate on the effects of smoking on lung cancer. There were some folks who argued that there was a genetic factor that was leading people to smoke and was simultaneously the cause of cancer. The great statistician Jerome Cornfield (who, I just saw on <a href="https://en.wikipedia.org/wiki/Jerome_Cornfield">Wikipedia</a>, happens to have shared my birthday), showed that an unobserved confounder (like a particular genetic factor) would need to lead to a 9-fold increase in the odds of smoking to explain away the association between smoking and cancer. Since such a strong factor was not likely to exist, he argued, the observed association was most likely real. (For a detailed discussion on various approaches to these kinds of sensitivity analyses, look at this paper by <a href="https://link.springer.com/content/pdf/10.1007%2Fs11121-012-0339-5.pdf"><em>Liu, Kuramoto, and Stuart</em></a>.)</p>
<p>My goal here is to think a bit more about what it means for a measured association to be sensitive to unmeasured confounding. When I originally started thinking about this, I thought that an association will be sensitive to unmeasured confounding if the underlying data generation process (DGP) <em>actually includes</em> an unmeasured confounder. Sure, if this is the case - that there actually is unmeasured confounding - then it is more likely that a finding will be sensitive to unmeasured confounding. But, this isn’t really that interesting, because we can’t observe the underlying DGP. And it is not necessarily the case that data sensitive to unmeasured confounding was in fact generated through some process with an unmeasured confounder.</p>
<div id="is-there-room-for-an-alternative-data-generation-process" class="section level3">
<h3>Is there room for an alternative data generation process?</h3>
<p>When considering sensitivity, it may be more useful to talk about the plausibility of alternative models. In this context, sensitivity is inherently related to the (1) the strength of the association of the observed exposure and outcome, and (2) the uncertainty (i.e. variability) around that association. Put succinctly, a relatively weak association with a lot of variability will be much more sensitive to unmeasured confounding than a strong association with little uncertainty. If you think in visual terms, when thinking about sensitivity, you might ask “do the data provide room for an alternative model?”</p>
</div>
<div id="an-alternative-model" class="section level3">
<h3>An alternative model</h3>
<p>Let’s say we observe some exposure <span class="math inline">\(D\)</span> and we are interested in its causal relationship with an outcome <span class="math inline">\(Y\)</span>, which we also observe. I am assuming <span class="math inline">\(D\)</span> and <span class="math inline">\(Y\)</span> are both continuous and normally distributed, which makes all of this work, but also limits how far I can take this. (To be more general, we will ultimately need more powerful tools, such as the <code>R</code> package <code>treatSens</code>, but more on that later.) Also, let’s assume for simplicity’s sake that there are no <em>measured</em> confounders - though that is not a requirement here.</p>
<p>With this observed data, we can go ahead and fit a simple linear regression model:</p>
<p><span class="math display">\[ Y = k_0 + k_1D,\]</span>
where <span class="math inline">\(k_1\)</span> is the parameter of interest, the measured association of exposure <span class="math inline">\(D\)</span> with the outcome <span class="math inline">\(Y\)</span>. (Again for simplicity, I am assuming <span class="math inline">\(k_1 > 0\)</span>.)</p>
<p>The question is, is there a possible underlying data generating process where <span class="math inline">\(D\)</span> plays a minor role or none at all? For example, is there a possible DGP that looks like this:</p>
<p><span class="math display">\[
\begin{aligned}
D &= \alpha_0 + \alpha_1 U + \epsilon_d \\
Y &= \beta_0 + \beta_1 D + \beta_2 U + \epsilon_y,
\end{aligned}
\]</span></p>
<p>where <span class="math inline">\(\beta_1 << k_1\)</span>, or perhaps <span class="math inline">\(\beta_1 = 0\)</span>? That is, is there a process that generates the same observed distribution even though <span class="math inline">\(D\)</span> is not a cause of <span class="math inline">\(Y\)</span>? If so, how can we characterize that process, and is it plausible?</p>
</div>
<div id="simulation" class="section level3">
<h3>Simulation</h3>
<p>The observed DGP can be defined using <code>simstudy</code>. We can assume that the continuous exposure <span class="math inline">\(D\)</span> can always be normalized (by centering and dividing by the standard deviation). In this example, the coefficients <span class="math inline">\(k_0 = 0\)</span> and <span class="math inline">\(k_1 = 1.5\)</span>, so that a unit change in the normalized exposure leads, on average, to a positive change in <span class="math inline">\(Y\)</span> of 1.5 units:</p>
<pre class="r"><code>defO <- defData(varname = "D", formula = 0, variance = 1)
defO <- defData(defO, varname = "ey", formula = 0, variance = 25)
defO <- defData(defO, varname = "Y", formula = "1.5 * D + ey",
dist = "nonrandom")</code></pre>
<p>We can generate the data and take a look at it:</p>
<pre class="r"><code>set.seed(20181201)
dtO <- genData(1200, defO)</code></pre>
<p> </p>
<p><img src="https://www.rdatagen.net/post/2019-01-02-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding_files/figure-html/unnamed-chunk-4-1.png" width="336" /></p>
<p>Can we specify another DGP that removes <span class="math inline">\(D\)</span> from the process that defines <span class="math inline">\(Y\)</span>? The answer in this case is “yes.” Here is one such example where both <span class="math inline">\(D\)</span> and <span class="math inline">\(Y\)</span> are a function of some unmeasured confounder <span class="math inline">\(U\)</span>, but <span class="math inline">\(Y\)</span> is a function of <span class="math inline">\(U\)</span> alone. The variance and coefficient specifications for this DGP may seem a bit arbitrary (and maybe even lucky), but how I arrived at these quantities will be the focus of the second part of this post, coming soon. (My real goal here is to pique your interest.)</p>
<pre class="r"><code>defA1 <- defData(varname = "U", formula = 0, variance = 1)
defA1 <- defData(defA1, varname = "ed", formula = 0, variance = 0.727)
defA1 <- defData(defA1, varname = "D", formula = "0.513 * U + ed",
dist = "nonrandom")
defA1 <- defData(defA1, varname = "ey", formula = 0, variance = 20.412)
defA1 <- defData(defA1, varname = "Y", formula = "2.715 * U + ey",
dist = "nonrandom")</code></pre>
<p>After generating this second data set, we can see that they look pretty similar to each other:</p>
<pre class="r"><code>set.seed(20181201)
dtO <- genData(1200, defO)
dtA1 <- genData(1200, defA1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-01-02-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>If the data are indeed similar, the covariance matrices generated by each of the data sets should also be similar, and they do appear to be:</p>
<pre class="r"><code>dtO[, round(var(cbind(Y, D)), 1)]</code></pre>
<pre><code>## Y D
## Y 27.8 1.4
## D 1.4 1.0</code></pre>
<pre class="r"><code>dtA1[, round(var(cbind(Y, D)), 1)]</code></pre>
<pre><code>## Y D
## Y 26.8 1.3
## D 1.3 1.0</code></pre>
</div>
<div id="non-unique-data-generating-process" class="section level3">
<h3>Non-unique data generating process</h3>
<p>The DGP defined by <code>defA1</code> is not a unique alternative. There are actually an infinite number of alternatives - here are two more, what I am calling “Alternative 2” and “Alternative 3” to go along with the first.</p>
<pre class="r"><code>defA2 <- defData(varname = "U", formula = 0, variance = 1)
defA2 <- defData(defA2, varname = "ed", formula = 0, variance = 0.794)
defA2 <- defData(defA2, varname = "D", formula = "0.444 * U + ed",
dist = "nonrandom")
defA2 <- defData(defA2, varname = "ey", formula = 0, variance = 17.939)
defA2 <- defData(defA2, varname = "Y", formula = "3.138 * U + ey",
dist = "nonrandom")</code></pre>
<pre class="r"><code>defA3 <- defData(varname = "U", formula = 0, variance = 1)
defA3 <- defData(defA3, varname = "ed", formula = 0, variance = 0.435)
defA3 <- defData(defA3, varname = "D", formula = "0.745 * U + ed",
dist = "nonrandom")
defA3 <- defData(defA3, varname = "ey", formula = 0, variance = 24.292)
defA3 <- defData(defA3, varname = "Y", formula = "1.869 * U + ey",
dist = "nonrandom")</code></pre>
<p>Rather than looking at plots of the four data sets generated by these equivalent processes, I fit four linear regression models based on the observed <span class="math inline">\(D\)</span> and <span class="math inline">\(Y\)</span>. The parameter estimates and residual standard error estimates are quite close for all four:</p>
<table style="text-align:center">
<caption>
<strong>Comparison of different data generating processes</strong>
</caption>
<tr>
<td colspan="5" style="border-bottom: 1px solid black">
</td>
</tr>
<tr>
<td style="text-align:left">
</td>
<td>
Observed
</td>
<td>
Alt 1
</td>
<td>
Alt 2
</td>
<td>
Alt 3
</td>
</tr>
<tr>
<td style="text-align:left">
D
</td>
<td>
1.41<sup></sup>
</td>
<td>
1.41<sup></sup>
</td>
<td>
1.41<sup></sup>
</td>
<td>
1.37<sup></sup>
</td>
</tr>
<tr>
<td style="text-align:left">
</td>
<td>
(0.15)
</td>
<td>
(0.15)
</td>
<td>
(0.15)
</td>
<td>
(0.15)
</td>
</tr>
<tr>
<td style="text-align:left">
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td style="text-align:left">
Constant
</td>
<td>
0.38<sup></sup>
</td>
<td>
-0.33<sup></sup>
</td>
<td>
-0.32<sup></sup>
</td>
<td>
-0.33<sup></sup>
</td>
</tr>
<tr>
<td style="text-align:left">
</td>
<td>
(0.15)
</td>
<td>
(0.14)
</td>
<td>
(0.14)
</td>
<td>
(0.15)
</td>
</tr>
<tr>
<td style="text-align:left">
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td colspan="5" style="border-bottom: 1px solid black">
</td>
</tr>
<tr>
<td style="text-align:left">
Residual Std. Error (df = 1198)
</td>
<td>
5.08
</td>
<td>
4.99
</td>
<td>
4.98
</td>
<td>
5.02
</td>
</tr>
<tr>
<td colspan="5" style="border-bottom: 1px solid black">
</td>
</tr>
</table>
<p> </p>
</div>
<div id="characterizing-each-data-generation-process" class="section level3">
<h3>Characterizing each data generation process</h3>
<p>While each of the alternate DGPs lead to the same (or very similar) observed data distribution, the underlying relationships between <span class="math inline">\(U\)</span>, <span class="math inline">\(D\)</span>, and <span class="math inline">\(Y\)</span> are quite different. In particular, if we inspect the correlations, we can see that they are quite different for each of the three alternatives. In fact, as you will see next time, all we need to do is specify a range of correlations for <span class="math inline">\(U\)</span> and <span class="math inline">\(D\)</span> to derive a curve that defines all the alternatives for a particular value of <span class="math inline">\(\beta_1\)</span>.</p>
<pre class="r"><code>dtA1[, .(cor(U, D), cor(U, Y))]</code></pre>
<pre><code>## V1 V2
## 1: 0.511 0.496</code></pre>
<pre class="r"><code>dtA2[, .(cor(U, D), cor(U, Y))]</code></pre>
<pre><code>## V1 V2
## 1: 0.441 0.579</code></pre>
<pre class="r"><code>dtA3[, .(cor(U, D), cor(U, Y))]</code></pre>
<pre><code>## V1 V2
## 1: 0.748 0.331</code></pre>
</div>
<div id="less-sensitivity" class="section level3">
<h3>Less sensitivity</h3>
<p>So, what does it mean for an observed data set to be sensitive to unmeasured confounding? I would suggest that if an equivalent derived alternative DGP is based on “lower” correlations of <span class="math inline">\(U\)</span> and <span class="math inline">\(D\)</span> and/or <span class="math inline">\(U\)</span> and <span class="math inline">\(Y\)</span>, then the observed data are more sensitive. What “low” correlation is will probably depend on the subject matter. I would say that the data we have been looking at above is moderately sensitive to unmeasured confounding.</p>
<p>Here is an example of an observed data that might be considerably less sensitive:</p>
<pre class="r"><code>defS <- updateDef(defO, changevar = "ey", newvariance = 4)
defAS <- defData(varname = "U", formula = 0, variance = 1)
defAS <- defData(defAS, varname = "ed", formula = 0, variance = 0.414)
defAS <- defData(defAS, varname = "D", formula = "0.759 * U + ed",
dist = "nonrandom")
defAS <- defData(defAS, varname = "ey", formula = 0, variance = 2.613)
defAS <- defData(defAS, varname = "Y", formula = "1.907 * U + ey",
dist = "nonrandom")
set.seed(20181201)
dtS <- genData(1200, defS)
dtAS <- genData(1200, defAS)</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-01-02-what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
<p>The plots look similar, as do the covariance matrix describing the observed data:</p>
<pre class="r"><code>dtS[, round(var(cbind(Y, D)), 1)]</code></pre>
<pre><code>## Y D
## Y 6.3 1.4
## D 1.4 1.0</code></pre>
<pre class="r"><code>dtAS[, round(var(cbind(Y, D)), 1)]</code></pre>
<pre><code>## Y D
## Y 6.0 1.4
## D 1.4 1.0</code></pre>
<p>In this case, the both the correlations in the alternative DGP are quite high, suggesting a higher bar is needed to remove the association between <span class="math inline">\(D\)</span> and <span class="math inline">\(Y\)</span> entirely:</p>
<pre class="r"><code>dtAS[, .(cor(U, D), cor(U, Y))]</code></pre>
<pre><code>## V1 V2
## 1: 0.762 0.754</code></pre>
<p>In the second part of this post I will show how I derived the alternative DGPs, and then use that derivation to create an <code>R</code> function to generate sensitivity curves that allow us to visualize sensitivity in terms of the correlation parameters <span class="math inline">\(\rho_{UD}\)</span> and <span class="math inline">\(\rho_{UY}\)</span>.</p>
</div>
Parallel processing to add a little zip to power simulations (and other replication studies)
https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/
Mon, 10 Dec 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/
<p>It’s always nice to be able to speed things up a bit. My <a href="https://www.rdatagen.net/post/first-blog-entry/">first blog post ever</a> described an approach using <code>Rcpp</code> to make huge improvements in a particularly intensive computational process. Here, I want to show how simple it is to speed things up by using the R package <code>parallel</code> and its function <code>mclapply</code>. I’ve been using this function more and more, so I want to explicitly demonstrate it in case any one is wondering.</p>
<p>I’m using a very simple power calculation as the motivating example here, but parallel processing can be useful in any problem where multiple replications are required. Monte Carlo simulation for experimentation and bootstrapping for variance estimation are other cases where computation times can grow long particularly fast.</p>
<div id="a-simple-two-sample-experiment" class="section level3">
<h3>A simple, two-sample experiment</h3>
<p>In this example, we are interested in estimating the probability of an experiment to show some sort of treatment effect given that there <em>actually is an effect</em>. In this example, I am comparing two group means with an unknown but true difference of 2.7; the standard deviation within each group is 5.0. Furthermore, we know we will be limited to a sample size of 100.</p>
<p>Here is the straightforward data generation process: (1) create 100 individual records, (2) assign 50 to treatment (<em>rx</em>) and 50 to control, and (3) generate an outcome <span class="math inline">\(y\)</span> for each individual, with <span class="math inline">\(\bar{y}_{rx=0} = 10.0\)</span> and <span class="math inline">\(\bar{y}_{rx=1} = 12.7\)</span>, both with standard deviation <span class="math inline">\(5\)</span>.</p>
<pre class="r"><code>set.seed(2827129)
defA <- defDataAdd(varname = "y", formula ="10 + rx*2.7", variance = 25)
DT <- genData(100)
DT <- trtAssign(DT, grpName = "rx")
DX <- addColumns(defA, DT)
ggplot(data = DX, aes(factor(rx), y)) +
geom_boxplot(fill = "red", alpha = .5) +
xlab("rx") +
theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-12-10-parallel-processing-to-add-a-little-zip-to-power-simulations_files/figure-html/unnamed-chunk-2-1.png" width="360" /></p>
<p>A simple linear regression model can be used to compare the group means for this particular data set. In this case, since <span class="math inline">\(p < 0.05\)</span>, we would conclude that the treatment effect is indeed different from <span class="math inline">\(0\)</span>. However, in other samples, this will not necessarily be the case.</p>
<pre class="r"><code>rndTidy(lm(y ~ rx, data = DX))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1: (Intercept) 9.8 0.72 13.7 0.00
## 2: rx 2.2 1.01 2.2 0.03</code></pre>
</div>
<div id="the-for-loop" class="section level3">
<h3>The <em>for</em> loop</h3>
<p>The single sample above yielded a <span class="math inline">\(p < 0.05\)</span>. The question is, would this be a rare occurrence based on a collection of related experiments. That is, if we repeated the experiment over and over again, what proportion of the time would <span class="math inline">\(p < 0.05\)</span>? To find this out, we can repeatedly draw from the same distributions and for each draw we can estimate the p-value. (In this simple power analysis, we would normally use an analytic solution (i.e., an equation), because that is obviously much faster; but, the analytic solution is not always so straightforward or even available.)</p>
<p>To facilitate this replication process, it is often easier to create a function that both generates the data and provides the estimate that is needed (in this case, the <em>p-value</em>). This is the purposed of function <code>genAndEst</code>:</p>
<pre class="r"><code>genAndEst <- function(def, dx) {
DX <- addColumns(def, dx)
coef(summary(lm(y ~ rx, data = DX)))["rx", "Pr(>|t|)"]
}</code></pre>
<p>Just to show that this function does indeed provide the same <em>p-value</em> as before, we can call based on the same seed.</p>
<pre class="r"><code>set.seed(2827129)
DT <- genData(100)
DT <- trtAssign(DT, grpName = "rx")
(pvalue <- genAndEst(defA, DT))</code></pre>
<pre><code>## [1] 0.029</code></pre>
<p>OK - now we are ready to estimate, using 2500 replications. Each time, we store the results in a vector called <code>pvals</code>. After the replications have been completed, we calculate the proportion of replications where the p-value was indeed below the <span class="math inline">\(5\%\)</span> threshold.</p>
<pre class="r"><code>forPower <- function(def, dx, reps) {
pvals <- vector("numeric", reps)
for (i in 1:reps) {
pvals[i] <- genAndEst(def, dx)
}
mean(pvals < 0.05)
}
forPower(defA, DT, reps = 2500)</code></pre>
<pre><code>## [1] 0.77</code></pre>
<p>The estimated power is 0.77. That is, given the underlying data generating process, we can expect to find a significant result <span class="math inline">\(77\%\)</span> of the times we conduct the experiment.</p>
<p>As an aside, here is the R function <code>power.t.test</code>, which uses the analytic (formulaic) approach:</p>
<pre class="r"><code>power.t.test(50, 2.7, 5)</code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 50
## delta = 2.7
## sd = 5
## sig.level = 0.05
## power = 0.76
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<p>Reading along here, you can’t tell how much time the <em>for</em> loop took on my MacBook Pro. It was not exactly zippy, maybe 5 seconds or so. (The result from <code>power.t.test</code> was instantaneous.)</p>
</div>
<div id="lapply" class="section level3">
<h3><em>lapply</em></h3>
<p>The R function <code>lapply</code> offers a second approach that might be simpler to code, but maybe less intuitive to understand. The whole replication process can be coded with a single call to <code>lapply</code>. This call also references the <code>genAndEst</code> function.</p>
<p>In this application of <code>lapply</code>, the argument <span class="math inline">\(X\)</span> is really a dummy argument, as the function call in argument <span class="math inline">\(FUN\)</span> essentially ignores the argument <span class="math inline">\(x\)</span>. <code>lapply</code> executes the function for each element of the vector <span class="math inline">\(X\)</span>; in this case, the function will be executed <span class="math inline">\(n=\text{length}(X)\)</span> times. That is, we get <span class="math inline">\(n\)</span> replications of the function <code>genAndEst</code>, just as we did with the <em>for</em> loop.</p>
<pre class="r"><code>lappPower <- function(def, dx, reps = 1000) {
plist <- lapply(X = 1:reps, FUN = function(x) genAndEst(def, dx))
mean(unlist(plist) < 0.05)
}
lappPower(defA, DT, 2500)</code></pre>
<pre><code>## [1] 0.75</code></pre>
<p>The power estimate is quite close to the initial <em>for</em> loop replication and the analytic solution. However, in this case, it did not appear to provide any time savings, taking about 5 seconds as well.</p>
</div>
<div id="mclapply" class="section level3">
<h3><em>mclapply</em></h3>
<p>The final approach here is the <code>mclapply</code> function - or multi-core lapply. The syntax is almost identical to <code>lapply</code>, but the speed is not. It seems like it took about 2 or 3 seconds to do 2500 replications.</p>
<pre class="r"><code>library(parallel)
mclPower <- function(def, dx, reps) {
plist <- mclapply(1:reps, function(x) genAndEst(def, dx), mc.cores = 4)
mean(unlist(plist) < 0.05)
}
mclPower(defA, DT, 2500)</code></pre>
<pre><code>## [1] 0.75</code></pre>
</div>
<div id="benchmarking-the-processing-times" class="section level3">
<h3>Benchmarking the processing times</h3>
<p>You’ve had to take my word about the relative processing times. Here, I use package <code>microbenchmark</code> to compare the three approaches (leaving out the analytic solution, because it is far, far superior in this case). This bench-marking process actually does 100 replications of each approach. And each replication involves 2500 <em>p-value estimates</em>. So, the benchmark takes quite a while on my laptop:</p>
<pre class="r"><code>library(microbenchmark)
m1500 <- microbenchmark(for_loop = forPower(defA, DT, 1500),
lapply = lappPower(defA, DT, 1500),
mclapply = mclPower(defA, DT, 1500),
times = 100L
)</code></pre>
<p>The results of the benchmark are plotted here, with each of the 100 benchmark calls shown for each method, as well as the average in red. My guesstimates of the processing times were not so far off, and it looks like the parallel processing on my laptop reduces the processing times by about <span class="math inline">\(50\%\)</span>. In my work more generally, I have found this to be typical, and when the computation requirements are more burdensome, this reduction can really be a substantial time saver.</p>
<p><img src="https://www.rdatagen.net/post/2018-12-10-parallel-processing-to-add-a-little-zip-to-power-simulations_files/figure-html/unnamed-chunk-11-1.png" width="288" /></p>
</div>
Horses for courses, or to each model its own (causal effect)
https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/
Wed, 28 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/
<p>In my previous <a href="https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/">post</a>, I described a (relatively) simple way to simulate observational data in order to compare different methods to estimate the causal effect of some exposure or treatment on an outcome. The underlying data generating process (DGP) included a possibly unmeasured confounder and an instrumental variable. (If you haven’t already, you should probably take a quick <a href="https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/">look</a>.)</p>
<p>A key point in considering causal effect estimation is that the average causal effect depends on the individuals included in the average. If we are talking about the causal effect for the population - that is, comparing the average outcome if <em>everyone</em> in the population received treatment against the average outcome if <em>no one</em> in the population received treatment - then we are interested in the average causal effect (ACE).</p>
<p>However, if we have an instrument, and we are talking about <em>only the compliers</em> (those who don’t get the treatment when <em>not</em> encouraged but do get it when they <em>are</em> encouraged) - then we will be measuring the complier average causal effect (CACE). The CACE is a comparison of the average outcome when <em>all compliers</em> receive the treatment with the average outcome when <em>none of the compliers</em> receive the treatment.</p>
<p>And the third causal effect I will consider here is the average causal effect for the treated (ACT). This population is defined by those who actually received the treatment (regardless of instrument or complier status). Just like the other causal effects, the ACT is a comparison of the average outcome when all those who were actually treated did get treatment (this is actually what we observe) with the average outcome if all those who were actually treated didn’t get the treatment (the counterfactual of the treated).</p>
<p>As we will see in short order, three different estimation methods using (almost) the same data set provide estimates for each of these three different causal estimands.</p>
<div id="the-data-generating-process" class="section level3">
<h3>The data generating process</h3>
<p>For the purposes of this illustration, I am generating data with heterogeneous causal effects that depend on an measured or unmeasured underlying health status <span class="math inline">\(U\)</span>. (I’m skipping over the details of the DGP that I laid out in <a href="https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/">part I</a>.) Higher values of <span class="math inline">\(U\)</span> indicate a sicker patient. Those patients are more likely to have stronger effects, and are more likely to seek treatment (independent of the instrument).</p>
<p>Here is a set of plots that show the causal effects by health status <span class="math inline">\(U\)</span> and various distributions of the causal effects:</p>
<p><img src="https://www.rdatagen.net/post/2018-11-28-different-models-estimate-different-causal-effects-part-ii_files/figure-html/unnamed-chunk-2-1.png" width="1056" /></p>
</div>
<div id="instrumental-variable" class="section level3">
<h3>Instrumental variable</h3>
<p>First up is IV estimation. The two-stage least squares regression method has been implemented in the R package <code>ivpack</code>. In case you didn’t check out the IV reference last time, here is an excellent <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201653/">tutorial</a> that describes IV methods in great, accessible detail. The model specification requires the intervention or exposure variable (in this case <span class="math inline">\(T\)</span>) and the instrument (<span class="math inline">\(A\)</span>).</p>
<pre class="r"><code>library(ivpack)
ivmodel <- ivreg(formula = Y ~ T | A, data = DT)
broom::tidy(ivmodel)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.30 0.0902 14.4 6.11e-43
## 2 T 1.52 0.219 6.92 8.08e-12</code></pre>
<p>The causal effect that IV methods is often called the local area treatment effect (LATE), which is just another way to talk about the CACE. Essentially, IV is estimating the causal effect for people whose behavior is modified (or would be modified) by the instrument. If we calculate the average CACE using the (unobservable) potential outcomes data for the compliers, the estimate is quite close to the IV estimate of 1.52:</p>
<pre class="r"><code>DT[fS == "Complier", mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 1.53</code></pre>
</div>
<div id="propensity-score-matching" class="section level3">
<h3>Propensity score matching</h3>
<p>If we were somehow able to measure <span class="math inline">\(U\)</span>, the underlying health status, we would be in a position to estimate the average causal effect for the treated, what I have been calling ACT, using propensity score matching. The idea here is to create a comparison group from the untreated sample that looks similar to the treated in every way except for treatment. This control is designed to be the counterfactual for the treated.</p>
<p>One way to do this is by matching on the propensity score - the probability of treatment. (See this <a href="https://www.tandfonline.com/doi/abs/10.1080/00273171.2011.568786">article</a> on propensity score methods for a really nice overview on the topic.)</p>
<p>To estimate the probability of treatment, we fit a “treatment” model, in this case a logistic generalized linear model since the treatment is binary. From this model, we can generate a predicted value for each individual. We can use software, in this case the R package <code>Matching</code>, to find individuals in the untreated group who share the exact or very similar propensity for treatment. Actually in this case, I will “match with replacement” so that while each treated individual will be included once, some controls might be matched with more than one treated (and those that are included repeatedly will be counted multiple times in the data).</p>
<p>It turns out that when we do this, the two groups will be balanced on everything that matters. In this case, the “everything”" that matters is only health status <span class="math inline">\(U\)</span>. (We actually could have matched directly on <span class="math inline">\(U\)</span> here, but I wanted to show propensity score matching, which is useful when there are many confounders that matter, and matching on them separately would be extremely difficult or impossible.)</p>
<p>Once we have the two groups, all we need to do is take the difference of the means of the two groups and that will give us an estimate for ACT. We could use bootstrapping methods to estimate the standard error. Below, we will use Monte Carlo simulation, so that will give us sense of the variability.</p>
<pre class="r"><code>library(Matching)
# Treatment model and ps estimation
glm.fit <- glm(T ~ U, family=binomial, data=DT)
DT$ps = predict(glm.fit,type="response")
setkey(DT, T, id)
TR = DT$T
X = DT$ps
# Matching with replacement
matches <- Match(Y = NULL, Tr = TR, X = X, ties = FALSE, replace = TRUE)
# Select matches from original dataset
dt.match <- DT[c(matches$index.treated, matches$index.control)]
# ACT estimate
dt.match[T == 1, mean(Y)] - dt.match[T == 0, mean(Y)]</code></pre>
<pre><code>## [1] 1.79</code></pre>
<p>Once again, the matching estimate is quite close to the “true” value of the ACT calculated using the potential outcomes:</p>
<pre class="r"><code>DT[T == 1, mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 1.77</code></pre>
</div>
<div id="inverse-probability-weighting" class="section level3">
<h3>Inverse probability weighting</h3>
<p>This last method also uses the propensity score, but as a weight, rather than for the purposes of matching. Each individual weight is the inverse probability of receiving the treatment they actually received. (I wrote a series of posts on IPW; you can look <a href="https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/">here</a> if you want to see a bit more.)</p>
<p>To implement IPW in this simple case, we just calculate the weight based on the propensity score, and use that weight in a simple linear regression model:</p>
<pre class="r"><code>DT[, ipw := 1 / ((ps * T) + ( (1 - ps) * (1 - T) ))]
lm.ipw <- lm(Y ~ T, weights = DT$ipw, data = DT)
broom::tidy(lm.ipw)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.21 0.0787 15.3 1.06e-47
## 2 T 1.02 0.110 9.28 1.04e-19</code></pre>
<p>The IPW estimate is quite close to the estimate of the average causal effect (ACE). That is, the IPW is the marginal average:</p>
<pre class="r"><code>DT[, mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 1.1</code></pre>
</div>
<div id="randomized-clinical-trial" class="section level3">
<h3>Randomized clinical trial</h3>
<p>If we can make the assumption that <span class="math inline">\(A\)</span> is not the instrument but is the actual randomization <em>and</em> that everyone is a complier (i.e. everyone follows the randomized protocol), then the estimate we get from comparing treated with controls will also be quite close to the ACE of 1.1. So, the randomized trial in its ideal execution provides an estimate of the average causal effect for the entire sample.</p>
<pre class="r"><code>randtrial <- lm(Y.r ~ A, data = DT)
broom::tidy(randtrial)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.22 0.0765 15.9 5.35e-51
## 2 A 1.09 0.108 10.1 8.26e-23</code></pre>
</div>
<div id="intention-to-treat-from-rct" class="section level3">
<h3>Intention-to-treat from RCT</h3>
<p>Typically, however, in a randomized trial, there isn’t perfect compliance, so randomization is more like strong encouragement. Studies are typically analyzed using an intent-to-treat approach, doing the analysis <em>as if</em> protocol was followed correctly. This method is considered conservative (in the sense that the estimated effect is closer to 0 than true ACE is), because many of those assumed to have been treated were not actually treated, and <em>vice versa</em>. In this case, the estimated ITT quantity is quite a bit smaller than the estimate from a perfectly executed RCT (which is the ACE):</p>
<pre class="r"><code>itt.fit <- lm(Y ~ A, data = DT)
broom::tidy(itt.fit)</code></pre>
<pre><code>## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.50 0.0821 18.3 1.34e-64
## 2 A 0.659 0.116 5.68 1.76e- 8</code></pre>
</div>
<div id="per-protocol-analysis-from-rct" class="section level3">
<h3>Per protocol analysis from RCT</h3>
<p>Yet another approach to analyzing the data is to consider only those cases that followed protocol. So, for those randomized to treatment, we would look only at those who actually were treated. And for those randomized to control, we would only look at those who did not get treatment. It is unclear what this is actually measuring since the two groups are not comparable: the treated group includes both compliers and always-takers, whereas the control group includes both compliers and never-takers. If always-takers have larger causal effects on average and never-takers have smaller causal effects on average, the per protocol estimate will be larger than the average causal effect (ACE), and will not represent any other obvious quantity.</p>
<p>And with this data set, this is certainly the case:</p>
<pre class="r"><code>DT[A == 1 & T == 1, mean(Y)] - DT[A == 0 & T == 0, mean(Y)] </code></pre>
<pre><code>## [1] 2.22</code></pre>
</div>
<div id="monte-carlo-simulation" class="section level3">
<h3>Monte Carlo simulation</h3>
<p>I leave you with a figure that shows the point estimates and 95% confidence intervals for each of these methods. Based on 1000 replications of the data set, this series of plots underscores the relationship of the methods to the various causal estimands.</p>
<p><img src="https://www.rdatagen.net/post/2018-11-28-different-models-estimate-different-causal-effects-part-ii_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
</div>
Generating data to explore the myriad causal effects that can be estimated in observational data analysis
https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/
Tue, 20 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/
<p>I’ve been inspired by two recent talks describing the challenges of using instrumental variable (IV) methods. IV methods are used to estimate the causal effects of an exposure or intervention when there is unmeasured confounding. This estimated causal effect is very specific: the complier average causal effect (CACE). But, the CACE is just one of several possible causal estimands that we might be interested in. For example, there’s the average causal effect (ACE) that represents a population average (not just based the subset of compliers). Or there’s the average causal effect for the exposed or treated (ACT) that allows for the fact that the exposed could be different from the unexposed.</p>
<p>I thought it would be illuminating to analyze a single data set using different causal inference methods, including IV as well as propensity score matching and inverse probability weighting. Each of these methods targets different causal estimands, which may or may not be equivalent depending on the subgroup-level causal effects and underlying population distribution of those subgroups.</p>
<p>This is the first of a two-part post. In this first part, I am focusing entirely on the data generation process (DGP). In the follow-up, I will get to the model estimation.</p>
<div id="underlying-assumptions-of-the-dgp" class="section level3">
<h3>Underlying assumptions of the DGP</h3>
<p>Since the motivation here is instrumental variable analysis, it seems natural that the data generation process include a possible instrument. (Once again, I am going to refer to elsewhere in case you want more details on the theory and estimation of IV models. Here is an excellent in-depth tutorial by <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201653/"><em>Baiocchi et al</em></a> that provides great background. I’ve even touched on the topic of CACE in an earlier series of <a href="https://www.rdatagen.net/post/cace-explored/">posts</a>. Certainly, there is no lack of discussion on this topic, as a quick search around the internet will make readily obvious.)</p>
<p>The figure below is a variation on the directed acyclic graph (DAG) that is often very useful in laying out causal assumptions of a DGP. This particular figure is a type of SWIG: single world intervention graph. SWIGs, <a href="https://pdfs.semanticscholar.org/07bb/cb458109d2663acc0d098e8913892389a2a7.pdf">developed by Robins and Richardson</a>, fuse the worlds of potential outcomes and DAGs.</p>
<p><img src="https://www.rdatagen.net/img/post-ivdgp/IV_SWIT.png" /></p>
<p>Important things to note here:</p>
<ol style="list-style-type: decimal">
<li><p>There is an instrumental variable <span class="math inline">\(A\)</span> that has a direct causal relationship only to the exposure of interest, <span class="math inline">\(T\)</span>. If the exposure is a particular medical intervention, think of the instrument as some kind of encouragement to get that treatment. Some people get the encouragement, others don’t - though on average folks who are encouraged are no different from folks who are not (at least not in ways that relate to the outcome.)</p></li>
<li><p>There is a confounder <span class="math inline">\(U\)</span>, possibly unmeasured, that is related both to potential outcomes and the exposure, but not to the encouragement (the instrument)! In the example below, we conceive of <span class="math inline">\(U\)</span> as an underlying health status.</p></li>
<li><p>Exposure variable <span class="math inline">\(T\)</span> (that, in this case, is binary, just to keep things simpler) indicates whether a person gets the treatment or not.</p></li>
<li><p>Each individual will have two <em>potential treatments</em> <span class="math inline">\(T^0\)</span> and <span class="math inline">\(T^1\)</span>, where <span class="math inline">\(T^0\)</span> is the treatment when there is no encouragement (i.e. A = 0), and <span class="math inline">\(T^1\)</span> is the treatment when <span class="math inline">\(A = 1\)</span>. For any individual, we actually only observe one of these treatments (depending on the actual value of <span class="math inline">\(A\)</span>. The population of interest consists of <strong>always-takers</strong>, <strong>compliers</strong>, and <strong>never-takers</strong>. <em>Never-takers</em> always reject the treatment regardless of whether or not they get encouragement - that is, <span class="math inline">\(T^0 = T^1 = 0\)</span>. <em>Compliers</em> only seek out the treatment when they are encouraged, otherwise they don’t: <span class="math inline">\(T^0 = 0\)</span> and <span class="math inline">\(T^1 = 1\)</span>. And <em>always-takers</em> always (of course) seek out the treatment: <span class="math inline">\(T^0 = T^1 = 1\)</span>. (In order for the model to be identifiable, we need to make a not-so-crazy assumption that there are no so-called <em>deniers</em>, where <span class="math inline">\(T^0 = 1\)</span> and <span class="math inline">\(T^1 = 0\)</span>.) An individual may have a different complier status depending on the instrument and exposure (<em>i.e.</em>, one person might be a never-taker in one scenario but a complier in another). In this simulation, larger values of the confounder <span class="math inline">\(U\)</span> will increase <span class="math inline">\(P(T^a = 1)\)</span> for both <span class="math inline">\(a \in (0,1)\)</span>.</p></li>
<li><p>Each individual will have two <em>potential outcomes</em>, only one of which is observed. <span class="math inline">\(Y_i^0\)</span> is the outcome for person <span class="math inline">\(i\)</span> when they are unexposed or do not receive the treatment. <span class="math inline">\(Y_i^1\)</span> is the outcome for that same person when they are exposed or do receive the treatment. In this case, the confounder <span class="math inline">\(U\)</span> can affect the potential outcomes. (This diagram is technically a SWIT, which is template, since I have generically referred to the potential treatment <span class="math inline">\(T^a\)</span> and potential outcome <span class="math inline">\(Y^t\)</span>.)</p></li>
<li><p>Not shown in this diagram are the observed <span class="math inline">\(T_i\)</span> and <span class="math inline">\(Y_i\)</span>; we assume that <span class="math inline">\(T_i = (T_i^a | A = a)\)</span> and <span class="math inline">\(Y_i = (Y_i^t | T = t)\)</span></p></li>
<li><p>Also not shown on the graph is the causal estimand of an exposure for individual <span class="math inline">\(i\)</span>, which can be defined as <span class="math inline">\(CE_i \equiv Y^1_i - Y^0_i\)</span>. We can calculate the average causal effect, <span class="math inline">\(E[CE]\)</span>, for the sample as a whole as well as for subgroups.</p></li>
</ol>
</div>
<div id="dgp-for-potential-outcomes" class="section level3">
<h3>DGP for potential outcomes</h3>
<p>The workhorse of this data generating process is a logistic sigmoid function that represents the mean potential outcome <span class="math inline">\(Y^t\)</span> at each value of <span class="math inline">\(u\)</span>. This allows us to easily generate homogeneous or heterogeneous causal effects. The function has four parameters, <span class="math inline">\(M\)</span>, <span class="math inline">\(\gamma\)</span>, <span class="math inline">\(\delta\)</span>, and <span class="math inline">\(\alpha\)</span>:</p>
<p><span class="math display">\[
Y^t = f(u) = M/[1 + exp(-\gamma(u - \delta))] + \alpha,
\]</span>
where <span class="math inline">\(M\)</span> is the maximum of the function (assuming the minimum is <span class="math inline">\(0\)</span>), <span class="math inline">\(\gamma\)</span> is the steepness of the curve, <span class="math inline">\(\delta\)</span> is the inflection point of the curve, and <span class="math inline">\(\alpha\)</span> is a vertical shift of the entire curve. This function is easily implemented in R:</p>
<pre class="r"><code>fYt <- function(x, max, grad, inflect = 0, offset = 0) {
( max / (1 + exp( -grad * (x - inflect) ) ) ) + offset
}</code></pre>
<p>Here is a single curve based on an arbitrary set of parameters:</p>
<pre class="r"><code>ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
stat_function(fun = fYt, size = 2,
args = list(max = 1.5, grad = 5, inflect = 0.2)) +
xlim(-1.5, 1.5)</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-11-20-generating-data-to-explore-the-myriad-causal-effects_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>The figures below show the mean of the potential outcomes <span class="math inline">\(Y^0\)</span> and <span class="math inline">\(Y^1\)</span> under two different scenarios. On the left, the causal effect at each level of <span class="math inline">\(u\)</span> is constant, and on the right, the causal effect changes over the different values of <span class="math inline">\(u\)</span>, increasing rapidly when <span class="math inline">\(0 < u < 0.2\)</span>.</p>
<p><img src="https://www.rdatagen.net/post/2018-11-20-generating-data-to-explore-the-myriad-causal-effects_files/figure-html/unnamed-chunk-5-1.png" width="1056" /></p>
</div>
<div id="homogeneous-causal-effect" class="section level3">
<h3>Homogeneous causal effect</h3>
<p>Here’s a closer look at the the different causal effects under the first scenario of homogeneous causal effects across values of <span class="math inline">\(u\)</span> by generating some data. The data definitions are provided in three steps. In the first step, the confounder <span class="math inline">\(U\)</span> is generated. Think of this as health status, which can take on values ranging from <span class="math inline">\(-0.5\)</span> to <span class="math inline">\(0.5\)</span>, where lower scores indicate worse health.</p>
<p>Next up are the definitions of the potential outcomes of treatment and outcome, both of which are dependent on the unmeasured confounder.</p>
<p>(Though technically not a definition step, the instrument assignment (variable <span class="math inline">\(A\)</span>) is generated later using <code>trtAssign</code>.)</p>
<p>In the final steps, we generate the observed treatment <span class="math inline">\(T\)</span> (a function of both <span class="math inline">\(A\)</span> and complier status <span class="math inline">\(S\)</span>), and observed outcome <span class="math inline">\(Y\)</span> (which is determined by <span class="math inline">\(T\)</span>). A complier status is determined based on the potential outcomes of treatment.</p>
<pre class="r"><code>library(simstudy)
### Potential treatments U and outcomes Y
def <- defData(varname = "U", formula = "-0.5;0.5",
dist = "uniform")
def <- defData(def, varname = "T0",
formula = "-2 + 4 * U",
dist = "binary", link = "logit")
def <- defData(def, varname = "T1x",
formula = "4 * U ",
dist = "binary", link = "logit")
# This prevents any deniers:
def <- defData(def, varname = "T1",
formula = "(T0 == 0) * T1x + (T0 == 1) * 1",
dist = "nonrandom")
def <- defData(def, varname = "Y0",
formula = "fYt(U, 5, 15, 0.02)",
variance = 0.25)
def <- defData(def, varname = "Y1",
formula = "fYt(U, 5.0, 15, 0.02, 1)",
variance = 0.25)
### Observed treatments
defA <- defDataAdd(varname = "T",
formula = "(A == 0) * T0 + (A == 1) * T1")
defA <- defDataAdd(defA, varname = "Y",
formula = "(T == 0) * Y0 + (T == 1) * Y1",
dist = "nonrandom")
defA <- defDataAdd(defA, varname = "Y.r",
formula = "(A == 0) * Y0 + (A == 1) * Y1",
dist = "nonrandom")
### Complier status
defC <- defCondition(condition = "T0 == 0 & T1 == 0", formula = 1,
dist = "nonrandom")
defC <- defCondition(defC, condition = "T0 == 0 & T1 == 1", formula = 2,
dist = "nonrandom")
defC <- defCondition(defC, condition = "T0 == 1 & T1 == 1", formula = 3,
dist = "nonrandom")</code></pre>
<p>Once all the definitions are set, it is quite simple to generate the data:</p>
<pre class="r"><code>set.seed(383726)
# Step 1 - generate U and potential outcomes for T and Y
dx <- genData(500, def)
# Step 2 - randomly assign instrument
dx <- trtAssign(dx, nTrt = 2, grpName = "A" )
# Step 3 - generate observed T and Y
dx <- addColumns(defA, dx)
# Step 4 - determine complier status
dx <- addCondition(defC, dx, "S")
dx <- genFactor(dx, "S", labels = c("Never", "Complier", "Always"))</code></pre>
</div>
<div id="looking-at-the-data" class="section level3">
<h3>Looking at the data</h3>
<p>Here are a few records from the generated dataset:</p>
<pre class="r"><code>dx</code></pre>
<pre><code>## id S A U T0 T1x T1 Y0 Y1 T Y Y.r fS
## 1: 1 1 1 0.282 0 0 0 5.5636 6.01 0 5.5636 6.01 Never
## 2: 2 3 0 0.405 1 1 1 4.5534 5.83 1 5.8301 4.55 Always
## 3: 3 3 1 0.487 1 1 1 6.0098 5.82 1 5.8196 5.82 Always
## 4: 4 2 1 0.498 0 1 1 5.2695 6.43 1 6.4276 6.43 Complier
## 5: 5 1 1 -0.486 0 0 0 0.0088 1.02 0 0.0088 1.02 Never
## ---
## 496: 496 3 0 -0.180 1 0 1 0.8384 1.10 1 1.0966 0.84 Always
## 497: 497 3 1 0.154 1 1 1 4.9118 5.46 1 5.4585 5.46 Always
## 498: 498 1 1 0.333 0 0 0 5.4800 5.46 0 5.4800 5.46 Never
## 499: 499 1 1 0.049 0 0 0 3.4075 5.15 0 3.4075 5.15 Never
## 500: 500 1 1 -0.159 0 0 0 0.4278 0.96 0 0.4278 0.96 Never</code></pre>
<p>The various average causal effects, starting with the (marginal) average causal effect and ending with the average causal effect for those treated are all close to <span class="math inline">\(1\)</span>:</p>
<pre class="r"><code>ACE <- dx[, mean(Y1 - Y0)]
AACE <- dx[fS == "Always", mean(Y1 - Y0)]
CACE <- dx[fS == "Complier", mean(Y1 - Y0)]
NACE <- dx[fS == "Never", mean(Y1 - Y0)]
ACT <- dx[T == 1, mean(Y1 - Y0)]</code></pre>
<pre><code>## ceType ce
## 1: ACE 0.97
## 2: AACE 0.96
## 3: CACE 1.00
## 4: NACE 0.96
## 5: ACT 1.05</code></pre>
<p>Here is a visual summary of the generated data. The upper left shows the underlying data generating functions for the potential outcomes and the upper right plot shows the various average causal effects: average causal effect for the population (ACE), average causal effect for always-takers (AACE), complier average causal effect (CACE), average causal effect for never-takers (NACE), and the average causal effect for the treated (ACT).</p>
<p>The true individual-specific causal effects color-coded based on complier status (that we could never observe in the real world, but we can here in simulation world) are on the bottom left, and the true individual causal effects for those who received treatment are on the bottom right. These figures are only remarkable in that all average causal effects and individual causal effects are close to <span class="math inline">\(1\)</span>, reflecting the homogeneous causal effect data generating process.</p>
<p><img src="https://www.rdatagen.net/post/2018-11-20-generating-data-to-explore-the-myriad-causal-effects_files/figure-html/unnamed-chunk-11-1.png" width="1056" /></p>
</div>
<div id="heterogenous-causal-effect-1" class="section level3">
<h3>Heterogenous causal effect #1</h3>
<p>Here is a set of figures for a heterogeneous data generating process (which can be seen on the upper left). Now, the average causal effects are quite different from each other. In particular <span class="math inline">\(ACE < CACE < ACT\)</span>. Obviously, none of these quantities is wrong, they are just estimating the average effect for different groups of people that are characterized by different levels of health status <span class="math inline">\(U\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2018-11-20-generating-data-to-explore-the-myriad-causal-effects_files/figure-html/unnamed-chunk-12-1.png" width="1056" /></p>
</div>
<div id="heterogenous-causal-effect-2" class="section level3">
<h3>Heterogenous causal effect #2</h3>
<p>Finally, here is one more scenario, also with heterogeneous causal effects. In this case <span class="math inline">\(ACE \approx CACE\)</span>, but the other effects are quite different, actually with different signs.</p>
<p><img src="https://www.rdatagen.net/post/2018-11-20-generating-data-to-explore-the-myriad-causal-effects_files/figure-html/unnamed-chunk-13-1.png" width="1056" /></p>
</div>
<div id="next-up-estimating-the-causal-effects" class="section level3">
<h3>Next up: estimating the causal effects</h3>
<p>In the second part of this post, I will use this DGP and estimate these effects using various modeling techniques. It will hopefully become apparent that different modeling approaches provide estimates of different causal estimands.</p>
</div>
Causal mediation estimation measures the unobservable
https://www.rdatagen.net/post/causal-mediation/
Tue, 06 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/causal-mediation/
<p>I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit.</p>
<p>Quite often we are interested in understanding the relationship between an exposure or intervention on an outcome. Does exposure <span class="math inline">\(A\)</span> (could be randomized or not) have an effect on outcome <span class="math inline">\(Y\)</span>?</p>
<p><img src="https://www.rdatagen.net/img/post-mediation/Model_1.png" /></p>
<p>But sometimes we are interested in understanding <em>more</em> than whether or not <span class="math inline">\(A\)</span> causes or influences <span class="math inline">\(B\)</span>; we might want to have some insight into the mechanisms <em>underlying</em> that influence. And this is where mediation analysis can be useful. (If you want to delve deeply into the topic, I recommend you check out this <a href="https://global.oup.com/academic/product/explanation-in-causal-inference-9780199325870?cc=us&lang=en&">book</a> by Tyler VanderWeele, or this nice <a href="https://www.mailman.columbia.edu/research/population-health-methods/causal-mediation">website</a> developed at Columbia University.)</p>
<p>In the example here, I am using the simplest possible scenario of an exposure <span class="math inline">\(A\)</span>, a mediator <span class="math inline">\(M\)</span>, and an outcome <span class="math inline">\(Y\)</span>, without any confounding:</p>
<p><img src="https://www.rdatagen.net/img/post-mediation/Model_2.png" /></p>
<p>A key challenge of understanding and conducting a mediation analysis is how we should <em>quantify</em> this concept of mediation. Sure, <span class="math inline">\(A\)</span> has an effect on <span class="math inline">\(M\)</span>, which in turn has an effect on <span class="math inline">\(Y\)</span>, and <span class="math inline">\(A\)</span> also may have an effect on <span class="math inline">\(Y\)</span> through other pathways. But how can we make sense of all of this? One approach, which is a relatively recent development, is to use the <em>potential outcome</em> framework of causal inference to define the various estimands (or quantities) that arise in a mediation analysis. (I draw on a <a href="https://www.jstor.org/stable/41058997?seq=1#metadata_info_tab_contents">paper</a> by Imai, Keele and Yamamoto for the terminology, as there is not complete agreement on what to call various quantities. The estimation methods and software used here are also described in the paper.)</p>
<div id="defining-the-potential-outcomes" class="section level3">
<h3>Defining the potential outcomes</h3>
<p>In an earlier <a href="https://www.rdatagen.net/post/be-careful/">post</a>, I described the concept of potential outcomes. I extend that a bit here to define the quantities we are interested in. In this case, we have two effects of the possible exposure: <span class="math inline">\(M\)</span> and <span class="math inline">\(Y\)</span>. Under this framework, each individual has a potential outcome for each level of <span class="math inline">\(A\)</span> (I am assuming <span class="math inline">\(A\)</span> is binary). So, for the mediator, <span class="math inline">\(M_{i0}\)</span> and <span class="math inline">\(M_{i1}\)</span> are the values of <span class="math inline">\(M\)</span> we would observe for individual <span class="math inline">\(i\)</span> without and with exposure, respectively. That is pretty straightforward. (From here on out, I will remove the subscript <span class="math inline">\(i\)</span>, because it gets a little unwieldy.)</p>
<p>The potential outcomes under <span class="math inline">\(Y\)</span> are less intuitive, as there are four of them. First, there is <span class="math inline">\(Y_{0M_0}\)</span>, which is the potential outcome of <span class="math inline">\(Y\)</span> <em>without</em> exposure for <span class="math inline">\(A\)</span> and whatever the potential outcome for <span class="math inline">\(M\)</span> is <em>without</em> exposure for <span class="math inline">\(A\)</span>. This is what we observe when <span class="math inline">\(A=0\)</span> for an individual. <span class="math inline">\(Y_{1M_1}\)</span> is the potential outcome of <span class="math inline">\(Y\)</span> <em>with</em> exposure for <span class="math inline">\(A\)</span> and whatever the potential outcome for <span class="math inline">\(M\)</span> is <em>with</em> exposure for <span class="math inline">\(A\)</span>. This is what we observe when <span class="math inline">\(A=1\)</span> for an individual. That’s all fine.</p>
<p>But we also have <span class="math inline">\(Y_{0M_1}\)</span>, which can never be observed unless we can intervene on the mediator <span class="math inline">\(M\)</span> somehow. This is the potential outcome of <span class="math inline">\(Y\)</span> <em>without</em> exposure for <span class="math inline">\(A\)</span> and whatever the mediator would have been had the individual been exposed. This potential outcome is controversial, because it is defined across two different universes of exposure to <span class="math inline">\(A\)</span>. Finally, there is <span class="math inline">\(Y_{1M_0}\)</span>. It is analogously defined across two universes, but in reverse.</p>
</div>
<div id="defining-the-causal-mediation-effects-and-direct-effects" class="section level3">
<h3>Defining the causal mediation effects and direct effects</h3>
<p>The estimands or quantities that we are interested are defined in terms of the potential outcomes. The <strong><em>causal mediation effects</em></strong> for an individual are</p>
<p><span class="math display">\[
\begin{aligned}
CME_0 &= Y_{0M_1} - Y_{0M_0} \\
CME_1 &= Y_{1M_1} - Y_{1M_0},
\end{aligned}
\]</span></p>
<p>and the <strong><em>causal direct effects</em></strong> are</p>
<p><span class="math display">\[
\begin{aligned}
CDE_0 &= Y_{1M_0} - Y_{0M_0} \\
CDE_1 &= Y_{1M_1} - Y_{0M_1}.
\end{aligned}
\]</span></p>
<p>A few important points. (1) Since we are in the world of potential outcomes, we do not observe these quantities for everyone. In fact, we don’t observe these quantities for anyone, since some of the measures are across two universes. (2) The two causal mediation effects under do not need to be the same. The same goes for the two causal direct effects. (3) Under a set of pretty strong assumptions related to unmeasured confounding, independence, and consistency (see <a href="https://www.jstor.org/stable/41058997?seq=1#metadata_info_tab_contents"><em>Imai et al</em></a> for the details), the average causal mediation effects and average causal direct effects can be estimated using <em>observed</em> data only. Before I simulate some data to demonstrate all of this, here is the definition for the <strong><em>total causal effect</em></strong> (and its decomposition into mediation and direct effects):</p>
<p><span class="math display">\[
\begin{aligned}
TCE &= Y_{1M_1} - Y_{0M_0} \\
&= CME_1 + CDE_0 \\
&= CME_0 + CDE_1
\end{aligned}
\]</span></p>
</div>
<div id="generating-the-data" class="section level3">
<h3>Generating the data</h3>
<p>I’m using the <code>simstudy</code> package to generate the data. I’ll start by generating the binary potential outcomes for the mediator, <span class="math inline">\(M_0\)</span> and <span class="math inline">\(M_1\)</span>, which are correlated in this example. <span class="math inline">\(P(M_1=1) > P(M_0=1)\)</span>, implying that exposure to <span class="math inline">\(A\)</span> does indeed have an effect on <span class="math inline">\(M\)</span>. Note that it is possible that for an individual <span class="math inline">\(M_0 = 1\)</span> and <span class="math inline">\(M_1 = 0\)</span>, so that exposure to <span class="math inline">\(A\)</span> has an effect contrary to what we see in the population generally. (We don’t need to make this assumption in the data generation process; we could force <span class="math inline">\(M_1\)</span> to be 1 if <span class="math inline">\(M_0\)</span> is 1.)</p>
<pre class="r"><code>set.seed(3872672)
dd <- genCorGen(n=5000, nvars = 2, params1 = c(.2, .6),
dist = "binary", rho = .3, corstr = "cs",
wide = TRUE, cnames = c("M0", "M1"))</code></pre>
<p>Observe treatment:</p>
<pre class="r"><code>dd <- trtObserve(dd, 0.6, grpName = "A")</code></pre>
<p>Initial data set:</p>
<pre class="r"><code>dd</code></pre>
<pre><code>## id A M0 M1
## 1: 1 0 0 1
## 2: 2 1 0 0
## 3: 3 1 0 1
## 4: 4 0 1 1
## 5: 5 1 0 0
## ---
## 4996: 4996 1 0 1
## 4997: 4997 0 0 0
## 4998: 4998 1 1 1
## 4999: 4999 1 1 0
## 5000: 5000 0 0 0</code></pre>
<p><span class="math inline">\(Y_{0M_0}\)</span> is a function of <span class="math inline">\(M_0\)</span> and some noise <span class="math inline">\(e_0\)</span>, and <span class="math inline">\(Y_{0M_1}\)</span> is a function of <span class="math inline">\(M_1\)</span> and the same noise (this is not a requirement). However, if <span class="math inline">\(M_0 = M_1\)</span> (i.e. the mediator is not affected by exposure status), then I am setting <span class="math inline">\(Y_{0M_1} = Y_{0M_0}\)</span>. In this case, <span class="math inline">\(CME_0\)</span> for an individual is <span class="math inline">\(2(M_1 - M_0)\)</span>, so <span class="math inline">\(CME_0 \in \{-2, 0, 2\}\)</span>, and the population average <span class="math inline">\(CME_0\)</span> will depend on the mixture of potential outcomes <span class="math inline">\(M_0\)</span> and <span class="math inline">\(M_1\)</span>.</p>
<pre class="r"><code>def <- defDataAdd(varname = "e0", formula = 0,
variance = 1, dist = "normal")
def <- defDataAdd(def, varname = "Y0M0", formula = "2 + M0*2 + e0",
dist = "nonrandom")
def <- defDataAdd(def, varname = "Y0M1", formula = "2 + M1*2 + e0",
variance = 1, dist = "nonrandom")</code></pre>
<p>The same logic holds for <span class="math inline">\(Y_{1M_0}\)</span> and <span class="math inline">\(Y_{1M_1}\)</span>, though at the individual level <span class="math inline">\(CME_1 \in \{-5, 0, 5\}\)</span>:</p>
<pre class="r"><code>def <- defDataAdd(def, varname = "e1", formula = 0,
variance = 1, dist = "normal")
def <- defDataAdd(def, varname = "Y1M0", formula = "8 + M0*5 + e1",
dist = "nonrandom")
def <- defDataAdd(def, varname = "Y1M1", formula = "8 + M1*5 + e1",
dist = "nonrandom")</code></pre>
<p>The <em>observed</em> mediator (<span class="math inline">\(M\)</span>) and outcome (<span class="math inline">\(Y\)</span>) are determined by the observed exposure (<span class="math inline">\(A\)</span>).</p>
<pre class="r"><code>def <- defDataAdd(def, varname = "M",
formula = "(A==0) * M0 + (A==1) * M1", dist = "nonrandom")
def <- defDataAdd(def, varname = "Y",
formula = "(A==0) * Y0M0 + (A==1) * Y1M1", dist = "nonrandom")</code></pre>
<p>Here is the entire data definitions table:</p>
<table class="table table-condensed">
<thead>
<tr>
<th style="text-align:right;">
varname
</th>
<th style="text-align:right;">
formula
</th>
<th style="text-align:right;">
variance
</th>
<th style="text-align:right;">
dist
</th>
<th style="text-align:right;">
link
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">normal </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y0M0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">2 + M0*2 + e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y0M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">2 + M1*2 + e0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">normal </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y1M0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">8 + M0*5 + e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y1M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">8 + M1*5 + e1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">M </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">(A==0) * M0 + (A==1) * M1 </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
<span style="font-size: 16px">Y </span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">(A==0) * Y0M0 + (A==1) * Y1M1</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">0</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">nonrandom</span>
</td>
<td style="text-align:right;">
<span style="font-size: 16px">identity</span>
</td>
</tr>
</tbody>
</table>
<p>Based on the parameters used to generate the data, we can calculate the expected causal mediation effects:</p>
<p><span class="math display">\[
\begin{aligned}
E[CME_0] &= E[2 + 2M_1 + e_0] - E[2+2M_0+e_0] \\
&= E[2(M_1 - M_0)] \\
&= 2(E[M_1] - E[M_0]) \\
&= 2(0.6 - 0.2) \\
&= 0.8
\end{aligned}
\]</span></p>
<p><span class="math display">\[
\begin{aligned}
E[CME_1] &= E[8 + 5M_1 + e_1] - E[8+5M_0+e_1] \\
&= E[5(M_1 - M_0)] \\
&= 5(E[M_1] - E[M_0]) \\
&= 5(0.6 - 0.2) \\
&= 2.0
\end{aligned}
\]</span></p>
<p>Likewise, the expected values of the causal direct effects can be calculated:</p>
<p><span class="math display">\[
\begin{aligned}
E[CDE_0] &= E[8 + 5M_0 + e_1] - E[2+2M_0+e_0] \\
&= E[6 + 5M_0 - 2M_0)] \\
&= 6 + 3M_0 \\
&= 6 + 3*0.2 \\
&= 6.6
\end{aligned}
\]</span></p>
<p><span class="math display">\[
\begin{aligned}
E[CDE_1] &= E[8 + 5M_1 + e_1] - E[2+2M_1+e_0] \\
&= E[6 + 5M_1 - 2M_1)] \\
&= 6 + 3M_1 \\
&= 6 + 3*0.6 \\
&= 7.8
\end{aligned}
\]</span></p>
<p>Finally, the expected total causal effect is:</p>
<p><span class="math display">\[
\begin{aligned}
ATCE &= E[CDE_0] + E[CME_1] = 6.6 + 2.0 \\
&= E[CDE_1] + E[CME_0] = 7.8 + 0.8 \\
&= 8.6
\end{aligned}
\]</span>
And now, the complete data set can be generated.</p>
<pre class="r"><code>dd <- addColumns(def, dd)
dd <- delColumns(dd, c("e0", "e1")) # these are not needed
dd</code></pre>
<pre><code>## id A M0 M1 Y0M0 Y0M1 Y1M0 Y1M1 M Y
## 1: 1 0 0 1 0.933 2.93 7.58 12.58 0 0.933
## 2: 2 1 0 0 2.314 2.31 6.84 6.84 0 6.841
## 3: 3 1 0 1 3.876 5.88 9.05 14.05 1 14.053
## 4: 4 0 1 1 5.614 5.61 12.04 12.04 1 5.614
## 5: 5 1 0 0 1.469 1.47 8.81 8.81 0 8.809
## ---
## 4996: 4996 1 0 1 2.093 4.09 8.82 13.82 1 13.818
## 4997: 4997 0 0 0 1.734 1.73 7.28 7.28 0 1.734
## 4998: 4998 1 1 1 3.256 3.26 12.49 12.49 1 12.489
## 4999: 4999 1 1 0 5.149 3.15 12.57 7.57 0 7.572
## 5000: 5000 0 0 0 1.959 1.96 5.23 5.23 0 1.959</code></pre>
</div>
<div id="looking-at-the-observed-potential-outcomes" class="section level3">
<h3>Looking at the “observed” potential outcomes</h3>
<p>The advantage of simulating data is that we can see what the average causal effects are based on the potential outcomes. Here are the average potential outcomes in the generated data set:</p>
<pre class="r"><code>dd[,.( Y0M0 = mean(Y0M0), Y0M1 = mean(Y0M1),
Y1M0 = mean(Y1M0), Y1M1 = mean(Y1M1))]</code></pre>
<pre><code>## Y0M0 Y0M1 Y1M0 Y1M1
## 1: 2.39 3.2 8.99 11</code></pre>
<p>The four average causal effects based on the data are quite close to the expected values:</p>
<pre class="r"><code>dd[, .(ACME0 = mean(Y0M1 - Y0M0), ACME1= mean(Y1M1 - Y1M0),
ACDE0 = mean(Y1M0 - Y0M0), ACDE1= mean(Y1M1 - Y0M1))]</code></pre>
<pre><code>## ACME0 ACME1 ACDE0 ACDE1
## 1: 0.81 2.03 6.6 7.81</code></pre>
<p>And the here is the average total causal effect from the data set:</p>
<pre class="r"><code>dd[, mean(Y1M1 - Y0M0)]</code></pre>
<pre><code>## [1] 8.62</code></pre>
<p>All of these quantities can be visualized in this figure. The lengths of the solid vertical lines are the mediated effects. The lengths of the dotted vertical lines are the direct effects. And the sums of these vertical lines (by color) each represent the total effect:</p>
<p><img src="https://www.rdatagen.net/post/2018-11-07-causal-mediation_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
</div>
<div id="estimated-causal-mediation-effect-from-observed-data" class="section level3">
<h3>Estimated causal mediation effect from observed data</h3>
<p>Clearly, the real interest is in estimating the causal effects from data that we can actually observe. And that, of course, is where things start to get challenging. I will not go into the important details here (again, <a href="https://www.jstor.org/stable/41058997?seq=1#metadata_info_tab_contents"><em>Imai et al</em></a> provide these), but here are formulas that have been derived to estimate the effects (simplified since there are no confounders in this example) and the calculations using the observed data:</p>
<p><span class="math display">\[\small
\hat{CME_0} =\sum_{m\in0,1}E[Y|A=0, M=m][P(M=m|A=1)-P(M=m|A=0)]
\]</span></p>
<pre class="r"><code># Estimate CME0
dd[M == 0 & A == 0, mean(Y)] *
(dd[A == 1, mean(M == 0)] - dd[A == 0, mean(M == 0)]) +
dd[M == 1 & A == 0, mean(Y)] *
(dd[A == 1, mean(M == 1)] - dd[A == 0, mean(M == 1)])</code></pre>
<pre><code>## [1] 0.805</code></pre>
<p><span class="math display">\[\small
\hat{CME_1} =\sum_{m\in0,1}E[Y|A=1, M=m][P(M=m|A=1)-P(M=m|A=0)]
\]</span></p>
<pre class="r"><code># Estimate CME1
dd[M == 0 & A == 1, mean(Y)] *
(dd[A == 1, mean(M == 0)] - dd[A == 0, mean(M == 0)]) +
dd[M == 1 & A == 1, mean(Y)] *
(dd[A == 1, mean(M == 1)] - dd[A == 0, mean(M == 1)])</code></pre>
<pre><code>## [1] 2</code></pre>
<p><span class="math display">\[\small
\hat{CDE_0} =\sum_{m\in0,1}(E[Y|A=1, M=m] - E[Y|A=0, M=m])P(M=m|A=0)
\]</span></p>
<pre class="r"><code># Estimate CDE0
(dd[M == 0 & A == 1, mean(Y)] - dd[M == 0 & A == 0, mean(Y)]) *
dd[A == 0, mean(M == 0)] +
(dd[M == 1 & A == 1, mean(Y)] - dd[M == 1 & A == 0, mean(Y)]) *
dd[A == 0, mean(M == 1)]</code></pre>
<pre><code>## [1] 6.56</code></pre>
<p><span class="math display">\[\small
\hat{CDE_1} =\sum_{m\in0,1}(E[Y|A=1, M=m] - E[Y|A=0, M=m])P(M=m|A=1)
\]</span></p>
<pre class="r"><code># Estimate CDE1
(dd[M == 0 & A == 1, mean(Y)] - dd[M == 0 & A == 0, mean(Y)]) *
dd[A == 1, mean(M == 0)] +
(dd[M == 1 & A == 1, mean(Y)] - dd[M == 1 & A == 0, mean(Y)]) *
dd[A == 1, mean(M == 1)]</code></pre>
<pre><code>## [1] 7.76</code></pre>
</div>
<div id="estimation-with-mediation-package" class="section level3">
<h3>Estimation with mediation package</h3>
<p>Fortunately, there is software available to provide these estimates (and more importantly measures of uncertainty). In <code>R</code>, one such package is <code>mediation</code>, which is available on <a href="https://cran.r-project.org/web/packages/mediation/index.html">CRAN</a>. This package implements the formulas derived in the <em>Imai et al</em> paper.</p>
<p>Not surprisingly, the model estimates are in line with expected values, true underlying effects, and the previous estimates conducted by hand:</p>
<pre class="r"><code>library(mediation)
med.fit <- glm(M ~ A, data = dd, family = binomial("logit"))
out.fit <- lm(Y ~ M*A, data = dd)
med.out <- mediate(med.fit, out.fit, treat = "A", mediator = "M",
robustSE = TRUE, sims = 1000)
summary(med.out)</code></pre>
<pre><code>##
## Causal Mediation Analysis
##
## Quasi-Bayesian Confidence Intervals
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME (control) 0.8039 0.7346 0.88 <2e-16 ***
## ACME (treated) 2.0033 1.8459 2.16 <2e-16 ***
## ADE (control) 6.5569 6.4669 6.65 <2e-16 ***
## ADE (treated) 7.7563 7.6555 7.86 <2e-16 ***
## Total Effect 8.5602 8.4317 8.69 <2e-16 ***
## Prop. Mediated (control) 0.0940 0.0862 0.10 <2e-16 ***
## Prop. Mediated (treated) 0.2341 0.2179 0.25 <2e-16 ***
## ACME (average) 1.4036 1.2917 1.52 <2e-16 ***
## ADE (average) 7.1566 7.0776 7.24 <2e-16 ***
## Prop. Mediated (average) 0.1640 0.1524 0.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 5000
##
##
## Simulations: 1000</code></pre>
</div>
Cross-over study design with a major constraint
https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/
Tue, 23 Oct 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/
<p>Every new study presents its own challenges. (I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention <span class="math inline">\(A\)</span> better than doing nothing and (2) is <span class="math inline">\(A^+\)</span>, the slightly augmented version of <span class="math inline">\(A\)</span>, better than just <span class="math inline">\(A\)</span>?</p>
<p>It was clear in this context (and it is certainly not usually the case) that exposure to <span class="math inline">\(A\)</span> on one day would have <em>no</em> effect on the outcome under <span class="math inline">\(A^+\)</span> the next day (or <em>vice versa</em>). That is, spillover risks were minimal. Given this, the study was an ideal candidate for a cross-over design, where each study participant would receive both versions of the intervention and the control. This design can be much more efficient than a traditional RCT, because we can control for variability across patients.</p>
<p>While a cross-over study is interesting and challenging in its own right, the researchers had a pretty serious constraint: they did not feel they could assign intervention <span class="math inline">\(A^+\)</span> until <span class="math inline">\(A\)</span> had been applied, which would be necessary in a proper cross-over design. So, we had to come up with something a little different.</p>
<p>This post takes a look at how to generate data for and analyze data from a more standard cross-over trial, and then presents the solution we came up with for the problem at hand.</p>
<div id="cross-over-design-with-three-exposures" class="section level3">
<h3>Cross-over design with three exposures</h3>
<p>If we are free to assign any intervention on any day, one possible randomization scheme involving three interventions could look like this:
<img src="https://www.rdatagen.net/img/post-crossover/3way.png" /></p>
<p>Key features of this scheme are: (1) all individuals are exposed to each intervention over three days, (2) on any given day, each intervention is applied to one group of participants (just in case the specific day has an impact on the outcome), and (3) not every permutation is included (for example, <span class="math inline">\(A\)</span> does not immediately proceed <span class="math inline">\(Control\)</span> in any sequence), because the relative ordering of interventions in this case is assumed not to matter. (We might need to expand to six groups to rectify this.)</p>
</div>
<div id="data-simulation" class="section level3">
<h3>Data simulation</h3>
<p>In this simulation, we will assume (1) that the outcome is slightly elevated on days two and three, (2) <span class="math inline">\(A\)</span> is an improvement over <span class="math inline">\(Control\)</span>, (3) <span class="math inline">\(A^+\)</span> is an improvement over <span class="math inline">\(A\)</span>, (4) there is strong correlation of outcomes within each individual, and (5) group membership has no bearing on the outcome.</p>
<p>First, I define the data, starting with the different sources of variation. I have specified a fairly high intra-class coefficient (ICC), because it is reasonable to assume that there will be quite a bit of variation across individuals:</p>
<pre class="r"><code>vTotal = 1
vAcross <- iccRE(ICC = 0.5, varTotal = vTotal, "normal")
vWithin <- vTotal - vAcross
### Definitions
b <- defData(varname = "b", formula = 0, variance = vAcross,
dist = "normal")
d <- defCondition(condition = "rxlab == 'C'",
formula = "0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
variance = vWithin, dist = "normal")
d <- defCondition(d, "rxlab == 'A'",
formula = "0.4 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
variance = vWithin, dist = "normal")
d <- defCondition(d, "rxlab == 'A+'",
formula = "1.0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
variance = vWithin, dist = "normal")</code></pre>
<p>Next, I generate the data, assigning three groups, each of which is tied to one of the three treatment sequences.</p>
<pre class="r"><code>set.seed(39217)
db <- genData(240, b)
dd <- trtAssign(db, 3, grpName = "grp")
dd <- addPeriods(dd, 3)
dd[grp == 1, rxlab := c("C", "A", "A+")]
dd[grp == 2, rxlab := c("A+", "C", "A")]
dd[grp == 3, rxlab := c("A", "A+", "C")]
dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
dd[, day := factor(period + 1)]
dd <- addCondition(d, dd, newvar = "Y")
dd</code></pre>
<pre><code>## timeID Y id period grp b rxlab day
## 1: 1 0.9015848 1 0 2 0.2664571 A+ 1
## 2: 2 1.2125919 1 1 2 0.2664571 C 2
## 3: 3 0.7578572 1 2 2 0.2664571 A 3
## 4: 4 2.0157066 2 0 3 1.1638244 A 1
## 5: 5 2.4948799 2 1 3 1.1638244 A+ 2
## ---
## 716: 716 1.9617832 239 1 1 0.3340201 A 2
## 717: 717 1.9231570 239 2 1 0.3340201 A+ 3
## 718: 718 1.0280355 240 0 3 1.4084395 A 1
## 719: 719 2.5021319 240 1 3 1.4084395 A+ 2
## 720: 720 0.4610550 240 2 3 1.4084395 C 3</code></pre>
<p>Here is a plot of the treatment averages each day for each of the three groups:</p>
<pre class="r"><code>dm <- dd[, .(Y = mean(Y)), keyby = .(grp, period, rxlab)]
ngrps <- nrow(dm[, .N, keyby = grp])
nperiods <- nrow(dm[, .N, keyby = period])
ggplot(data = dm, aes(y=Y, x = period + 1)) +
geom_jitter(data = dd, aes(y=Y, x = period + 1),
width = .05, height = 0, color="grey70", size = 1 ) +
geom_line(color = "grey50") +
geom_point(aes(color = rxlab), size = 2.5) +
scale_color_manual(values = c("#4477AA", "#DDCC77", "#CC6677")) +
scale_x_continuous(name = "day", limits = c(0.9, nperiods + .1),
breaks=c(1:nperiods)) +
facet_grid(~ factor(grp, labels = paste("Group", 1:ngrps))) +
theme(panel.grid = element_blank(),
legend.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-10-23-when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
<div id="estimating-the-effects" class="section level3">
<h3>Estimating the effects</h3>
<p>To estimate the treatment effects, I will use this mixed effects linear regression model:</p>
<p><span class="math display">\[Y_{it} = \alpha_0 + \gamma_{t} D_{it} + \beta_1 A_{it} + \beta_2 P_{it} + b_i + e_i\]</span></p>
<p>where <span class="math inline">\(Y_{it}\)</span> is the outcome for individual <span class="math inline">\(i\)</span> on day <span class="math inline">\(t\)</span>, <span class="math inline">\(t \in (1,2,3)\)</span>. <span class="math inline">\(A_{it}\)</span> is an indicator for treatment <span class="math inline">\(A\)</span> in time <span class="math inline">\(t\)</span>; likewise <span class="math inline">\(P_{it}\)</span> is an indicator for <span class="math inline">\(A^+\)</span>. <span class="math inline">\(D_{it}\)</span> is an indicator that the outcome was recorded on day <span class="math inline">\(t\)</span>. <span class="math inline">\(b_i\)</span> is an individual (latent) random effect, <span class="math inline">\(b_i \sim N(0, \sigma_b^2)\)</span>. <span class="math inline">\(e_i\)</span> is the (also latent) noise term, <span class="math inline">\(e_i \sim N(0, \sigma_e^2)\)</span>.</p>
<p>The parameter <span class="math inline">\(\alpha_0\)</span> is the mean outcome on day 1 under <span class="math inline">\(Control\)</span>. The <span class="math inline">\(\gamma\)</span>’s are the day-specific effects for days 2 and 3, with <span class="math inline">\(\gamma_1\)</span> fixed at 0. <span class="math inline">\(\beta_1\)</span> is the effect of <span class="math inline">\(A\)</span> (relative to <span class="math inline">\(Control\)</span>) and <span class="math inline">\(\beta_2\)</span> is the effect of <span class="math inline">\(A^+\)</span>. In this case, the researchers were primarily interested in <span class="math inline">\(\beta_1\)</span> and <span class="math inline">\(\beta_2 - \beta_1\)</span>, which is the incremental change from <span class="math inline">\(A\)</span> to <span class="math inline">\(A^+\)</span>.</p>
<pre class="r"><code>library(lme4)
lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
rndTidy(lmerfit)</code></pre>
<pre><code>## term estimate std.error statistic group
## 1: (Intercept) -0.14 0.08 -1.81 fixed
## 2: day2 0.63 0.06 9.82 fixed
## 3: day3 0.38 0.06 5.97 fixed
## 4: rxlabA 0.57 0.06 8.92 fixed
## 5: rxlabA+ 0.98 0.06 15.35 fixed
## 6: sd_(Intercept).id 0.74 NA NA id
## 7: sd_Observation.Residual 0.70 NA NA Residual</code></pre>
<p>As to why we would want to bother with this complex design if we could just randomize individuals to one of three treatment groups, this little example using a more standard parallel design might provide a hint:</p>
<pre class="r"><code>def2 <- defDataAdd(varname = "Y",
formula = "0 + (frx == 'A') * 0.4 + (frx == 'A+') * 1",
variance = 1, dist = "normal")
dd <- genData(240)
dd <- trtAssign(dd, nTrt = 3, grpName = "rx")
dd <- genFactor(dd, "rx", labels = c("C","A","A+"), replace = TRUE)
dd <- addColumns(def2, dd)
lmfit <- lm(Y~frx, data = dd)
rndTidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1: (Intercept) -0.12 0.10 -1.15 0.25
## 2: frxA 0.64 0.15 4.38 0.00
## 3: frxA+ 1.01 0.15 6.86 0.00</code></pre>
<p>If we compare the standard error for the effect of <span class="math inline">\(A^+\)</span> in the two studies, the cross-over design is much more efficient (i.e. standard error is considerably smaller: 0.06 vs. 0.15). This really isn’t so surprising since we have collected a lot more data and modeled variation across individuals in the cross-over study.</p>
</div>
<div id="constrained-cross-over-design" class="section level3">
<h3>Constrained cross-over design</h3>
<p>Unfortunately, the project was not at liberty to implement the three-way/three-day design just simulated. We came up with this approach that would provide some cross-over, but with an added day of treatment and measurement:</p>
<p><img src="https://www.rdatagen.net/img/post-crossover/4constrained.png" /></p>
<p>The data generation is slightly modified, though the original definitions can still be used:</p>
<pre class="r"><code>db <- genData(240, b)
dd <- trtAssign(db, 2, grpName = "grp")
dd <- addPeriods(dd, 4)
dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
dd[grp == 1, rxlab := c("C", "A", "A+", "A")]
dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
dd[, day := factor(period + 1)]
dd <- addCondition(d, dd, "Y")</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-10-23-when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>The model estimates indicate slightly higher standard errors than in the pure cross-over design:</p>
<pre class="r"><code>lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
rndTidy(lmerfit)</code></pre>
<pre><code>## term estimate std.error statistic group
## 1: (Intercept) 0.15 0.06 2.36 fixed
## 2: day2 0.48 0.08 6.02 fixed
## 3: day3 0.16 0.12 1.32 fixed
## 4: day4 -0.12 0.12 -1.02 fixed
## 5: rxlabA 0.46 0.10 4.70 fixed
## 6: rxlabA+ 1.14 0.12 9.76 fixed
## 7: sd_(Intercept).id 0.69 NA NA id
## 8: sd_Observation.Residual 0.68 NA NA Residual</code></pre>
<p>Here are the key parameters of interest (refit using package <code>lmerTest</code> to get the contrasts). The confidence intervals include the true values (<span class="math inline">\(\beta_1 = 0.4\)</span> and <span class="math inline">\(\beta_2 - \beta_1 = 0.6\)</span>):</p>
<pre class="r"><code>library(lmerTest)
lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
L <- matrix(c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -1, 1),
nrow = 2, ncol = 6, byrow = TRUE)
con <- data.table(contest(lmerfit, L, confint = TRUE, joint = FALSE))
round(con[, .(Estimate, `Std. Error`, lower, upper)], 3)</code></pre>
<pre><code>## Estimate Std. Error lower upper
## 1: 0.462 0.098 0.269 0.655
## 2: 0.673 0.062 0.551 0.795</code></pre>
</div>
<div id="exploring-bias" class="section level3">
<h3>Exploring bias</h3>
<p>A single data set does not tell us if the proposed approach is indeed unbiased. Here, I generate 1000 data sets and fit the mixed effects model. In addition, I fit a model that ignores the day factor to see if it will induce bias (of course it will).</p>
<pre class="r"><code>iter <- 1000
ests <- vector("list", iter)
xests <- vector("list", iter)
for (i in 1:iter) {
db <- genData(240, b)
dd <- trtAssign(db, 2, grpName = "grp")
dd <- addPeriods(dd, 4)
dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
dd[grp == 1, rxlab := c("C", "A", "A+", "A")]
dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
dd[, day := factor(period + 1)]
dd <- addCondition(d, dd, "Y")
lmerfit <- lmer(Y ~ day + rxlab + (1|id), data = dd)
xlmerfit <- lmer(Y ~ rxlab + (1|id), data = dd)
ests[[i]] <- data.table(estA = fixef(lmerfit)[5],
estAP = fixef(lmerfit)[6] - fixef(lmerfit)[5])
xests[[i]] <- data.table(estA = fixef(xlmerfit)[2],
estAP = fixef(xlmerfit)[3] - fixef(xlmerfit)[2])
}
ests <- rbindlist(ests)
xests <- rbindlist(xests)</code></pre>
<p>The results for the correct model estimation indicate that there is no bias (and that the standard error estimates from the model fit above were correct):</p>
<pre class="r"><code>ests[, .(A.est = round(mean(estA), 3),
A.se = round(sd(estA), 3),
AP.est = round(mean(estAP), 3),
AP.se = round(sd(estAP), 3))]</code></pre>
<pre><code>## A.est A.se AP.est AP.se
## 1: 0.407 0.106 0.602 0.06</code></pre>
<p>In contrast, the estimates that ignore the day or period effect are in fact biased (as predicted):</p>
<pre class="r"><code>xests[, .(A.est = round(mean(estA), 3),
A.se = round(sd(estA), 3),
AP.est = round(mean(estAP), 3),
AP.se = round(sd(estAP), 3))]</code></pre>
<pre><code>## A.est A.se AP.est AP.se
## 1: 0.489 0.053 0.474 0.057</code></pre>
</div>
In regression, we assume noise is independent of all measured predictors. What happens if it isn't?
https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/
Tue, 09 Oct 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/
<p>A number of key assumptions underlie the linear regression model - among them linearity and normally distributed noise (error) terms with constant variance In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model.</p>
<p>In this simple model:</p>
<p><span class="math display">\[Y_i = \beta_0 + \beta_1X_i + e_i,\]</span></p>
<p><span class="math inline">\(Y_i\)</span> has both a structural and stochastic (random) component. The structural component is the linear relationship of <span class="math inline">\(Y\)</span> with <span class="math inline">\(X\)</span>. The random element is often called the <span class="math inline">\(error\)</span> term, but I prefer to think of it as <span class="math inline">\(noise\)</span>. <span class="math inline">\(e_i\)</span> is not measuring something that has gone awry, but rather it is variation emanating from some unknown, unmeasurable source or sources for each individual <span class="math inline">\(i\)</span>. It represents everything we haven’t been able to measure.</p>
<p>Our goal is to estimate <span class="math inline">\(\beta_1\)</span>, which characterizes the structural linear relationship of <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>. When we estimate the model, we get a quantity <span class="math inline">\(\hat{\beta_1}\)</span>, and we hope that on average we do pretty well (i.e. if we were to estimate <span class="math inline">\(\beta_1\)</span> repeatedly, <span class="math inline">\(E[\hat{\beta_1}] = \beta_1\)</span>). In order for us to make sure that is the case, we need to assume that <span class="math inline">\(e_i\)</span> and <span class="math inline">\(X_i\)</span> are independent. In other words, the sources that comprise <span class="math inline">\(e_i\)</span> must not be related in way to whatever it is that <span class="math inline">\(X_i\)</span> is measuring.</p>
<div id="correlation-without-causation" class="section level3">
<h3>Correlation without causation</h3>
<p>First, I’ll generate <span class="math inline">\(X's\)</span> and <span class="math inline">\(e's\)</span> that are correlated with a using a data generation process that makes no assumptions about the underlying causal process. The provides a picture of how <span class="math inline">\(\hat{\beta_1}\)</span> might diverge from the true <span class="math inline">\(\beta_1\)</span>.</p>
<pre class="r"><code>library(simstudy)
set.seed(3222)
dT <- genCorData(500, mu = c(0, 0), sigma = c(sqrt(1.25), 1),
rho = 0.446, corstr = "cs", cnames = c("X","eCor"))</code></pre>
<p>Outcome <span class="math inline">\(Y\)</span> is based on <span class="math inline">\(X\)</span> and <span class="math inline">\(e_{cor}\)</span>. For comparison’s sake, I generate a parallel outcome that is also based on <span class="math inline">\(X\)</span> but the noise variable <span class="math inline">\(e_{ind}\)</span> is independent of <span class="math inline">\(X\)</span>:</p>
<pre class="r"><code>def <- defDataAdd(varname = "Ycor", formula = "X + eCor",
dist = "nonrandom")
def <- defDataAdd(def, varname = "eInd", formula = 0, variance = 1,
dist = "normal" )
def <- defDataAdd(def, varname = "Yind", formula = "X + eInd",
dist = "nonrandom")
dT <- addColumns(def, dT)
dT</code></pre>
<pre><code>## id X eCor Ycor eInd Yind
## 1: 1 -1.1955846 -0.1102777 -1.3058624 1.1369435 -0.05864113
## 2: 2 -0.4056655 -0.6709221 -1.0765875 -0.8441431 -1.24980856
## 3: 3 -0.5893938 1.2146488 0.6252550 -0.2666314 -0.85602516
## 4: 4 0.9090881 0.3108645 1.2199526 0.3397857 1.24887377
## 5: 5 -2.6139989 -1.7382986 -4.3522975 -0.1793858 -2.79338470
## ---
## 496: 496 3.1615624 0.6160661 3.7776285 0.4658992 3.62746167
## 497: 497 0.6416140 0.1031316 0.7447456 -0.1440062 0.49760784
## 498: 498 0.1340967 -0.4029388 -0.2688421 0.6165793 0.75067604
## 499: 499 -1.2381040 0.8197002 -0.4184038 0.6717294 -0.56637463
## 500: 500 -0.7159373 -0.0905287 -0.8064660 0.9148175 0.19888019</code></pre>
<p>The observed <span class="math inline">\(X\)</span> and <span class="math inline">\(e_{cor}\)</span> are correlated, but <span class="math inline">\(X\)</span> and <span class="math inline">\(e_{ind}\)</span> are not:</p>
<pre class="r"><code>dT[, cor(cbind(X, eCor))]</code></pre>
<pre><code>## X eCor
## X 1.0000000 0.4785528
## eCor 0.4785528 1.0000000</code></pre>
<pre class="r"><code>dT[, cor(cbind(X, eInd))]</code></pre>
<pre><code>## X eInd
## X 1.00000000 -0.02346812
## eInd -0.02346812 1.00000000</code></pre>
<p>On the left below is a plot of outcome <span class="math inline">\(Y_{ind}\)</span> as a function of <span class="math inline">\(X\)</span>. The red line is the true structural component defining the relationship between these two variables. The points are scattered around that line without any clear pattern, which is indicative of independent noise.</p>
<p>The plot on the right shows <span class="math inline">\(Y_{cor}\)</span> as a function of <span class="math inline">\(X\)</span>. Since the stochastic component of <span class="math inline">\(Y_{cor}\)</span> is the correlated noise, the lower <span class="math inline">\(X\)</span> values are more likely to fall below the true line, and the larger <span class="math inline">\(X\)</span> values above. The red line does not appear to be a very good fit in this case; this is the bias induced by correlated noise.</p>
<p><img src="https://www.rdatagen.net/post/2018-10-10-linear-regression-models-assume-noise-is-independent_files/figure-html/unnamed-chunk-4-1.png" width="921.6" /></p>
<p>The model fits corroborate the visual inspection. <span class="math inline">\(\hat{\beta_1}\)</span> based on uncorrelated noise is close to 1, the true value:</p>
<pre class="r"><code>fit2 <- lm(Yind ~ X, data = dT)
rndTidy(fit2)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1: (Intercept) 0.06 0.05 1.37 0.17
## 2: X 0.98 0.04 25.75 0.00</code></pre>
<p><span class="math inline">\(\hat{\beta_1}\)</span> based on correlated noise is 1.42, larger than the true value:</p>
<pre class="r"><code>fit1 <- lm(Ycor ~ X, data = dT)
rndTidy(fit1)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1: (Intercept) -0.01 0.04 -0.25 0.8
## 2: X 1.42 0.03 41.28 0.0</code></pre>
<p>A plot of the fitted (blue) line based on the biased estimate clearly shows the problem of regression estimation in this context:</p>
<p><img src="https://www.rdatagen.net/post/2018-10-10-linear-regression-models-assume-noise-is-independent_files/figure-html/unnamed-chunk-7-1.png" width="460.8" /></p>
</div>
<div id="thinking-about-underlying-causality-and-noise" class="section level3">
<h3>Thinking about underlying causality and noise</h3>
<p>Here is a pure thought exercise to consider this bias induced by the correlation. Fundamentally, the implications depend on the purpose of the model. If we are using the model for description or prediction, we may not care about the bias. For example, if we are <em>describing</em> how <span class="math inline">\(Y\)</span> changes as <span class="math inline">\(X\)</span> changes in some population, the underlying data generation process may not be of interest. Likewise, if our goal is predicting <span class="math inline">\(Y\)</span> based on an observed <span class="math inline">\(X\)</span>, the biased estimate of <span class="math inline">\(\beta_1\)</span> may be adequate.</p>
<p>However, if we are interested in understanding how <em>intervening</em> or <em>changing</em> the level of <span class="math inline">\(X\)</span> at the individual level effects the outcome <span class="math inline">\(Y\)</span> for that individual, then an unbiased estimate of <span class="math inline">\(\beta_1\)</span> becomes more important, and noise that is correlated with the predictor of interest could be problematic.</p>
<p>However, in a causal context, all noise may not be created equally. Consider these two different causal models:</p>
<p><img src="https://www.rdatagen.net/img/post-correrrors/confounding_mediation.png" /></p>
<p>We can generate identically distributed data based on these two mechanisms:</p>
<pre class="r"><code># Confounding
defc <- defData(varname = "U", formula=0, variance=1, dist="normal")
defc <- defData(defc, "X", "0.5*U", 1, "normal")
defc <- defData(defc, "Y", "X + U", dist = "nonrandom")
dcon <- genData(1000, defc)</code></pre>
<pre class="r"><code># Mediation
defm <- defData(varname="X", formula=0, variance =1.25, dist="normal")
defm <- defData(defm, "U", ".4*X", .8, "normal")
defm <- defData(defm, "Y", "X + U", dist = "nonrandom")
dmed <- genData(1000, defm)</code></pre>
<p>The observed covariance between <span class="math inline">\(X\)</span> and <span class="math inline">\(U\)</span> (the noise) is similar for the two processes …</p>
<pre class="r"><code>dcon[, var(cbind(X,U))]</code></pre>
<pre><code>## X U
## X 1.2516199 0.5807696
## U 0.5807696 1.0805321</code></pre>
<pre class="r"><code>dmed[, var(cbind(X,U))]</code></pre>
<pre><code>## X U
## X 1.2365285 0.5401577
## U 0.5401577 1.0695366</code></pre>
<p>… as is the model fit for each:</p>
<p><img src="https://www.rdatagen.net/post/2018-10-10-linear-regression-models-assume-noise-is-independent_files/figure-html/unnamed-chunk-11-1.png" width="921.6" /></p>
<p>And here is a pair of histograms of estimated values of <span class="math inline">\(\beta_1\)</span> for each data generating process, based on 1000 replications of samples of 100 individuals. Again, pretty similar:</p>
<p><img src="https://www.rdatagen.net/post/2018-10-10-linear-regression-models-assume-noise-is-independent_files/figure-html/unnamed-chunk-12-1.png" width="864" /></p>
<p>Despite the apparent identical nature of the two data generating processes, I would argue that biased estimation is only a problem in the context of confounding noise. If we intervene on <span class="math inline">\(X\)</span> without changing <span class="math inline">\(U\)</span>, which could occur in the context of unmeasured confounding, the causal effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> would be overestimated by the regression model. However, if we intervene on <span class="math inline">\(X\)</span> in the context of a process that involves mediation, it would be appropriate to consider all the post-intervention effects of changing <span class="math inline">\(X\)</span>, so the “biased” estimate may in fact be the appropriate one.</p>
<p>The key here, of course, is that we cannot verify this unobserved process. By definition, the noise is unobservable and stochastic. But, if we are developing models that involve causal relations of unmeasured quantities, we have to be explicit about the causal nature underlying these unmeasured quantities. That way, we know if we should be concerned about hidden correlation or not.</p>
</div>
simstudy update: improved correlated binary outcomes
https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/
Tue, 25 Sep 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/
<p>An updated version of the <code>simstudy</code> package (0.1.10) is now available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last <a href="https://www.rdatagen.net/post/binary-beta-beta-binomial/">post</a>, I described a beta-binomial data generating process that uses the recently added beta distribution. In addition to that update, I’ve added functionality to <code>genCorGen</code> and <code>addCorGen</code>, functions which generate correlated data from non-Gaussian or normally distributed data such as Poisson, Gamma, and binary data. Most significantly, there is a newly implemented algorithm based on the work of <a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1991.10475828">Emrich & Piedmonte</a>, which I mentioned the last time around.</p>
<div id="limitation-of-copula-algorithm" class="section level3">
<h3>Limitation of copula algorithm</h3>
<p>The existing copula algorithm is limited when generating correlated binary data. (I did acknowledge this when I first <a href="https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/">introduced</a> the new functions.) The generated marginal means are what we would expect. though the observed correlation on the binary scale is biased downwards towards zero. Using the copula algorithm, the specified correlation really pertains to the underlying normal data that is used in the data generation process. Information is lost when moving between the continuous and dichotomous distributions:</p>
<pre class="r"><code>library(simstudy)
set.seed(736258)
d1 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
dist = "binary", rho = 0.3, corstr = "cs", wide = TRUE,
method = "copula")
d1</code></pre>
<pre><code>## id V1 V2 V3 V4
## 1: 1 0 0 0 0
## 2: 2 0 1 1 1
## 3: 3 0 1 0 1
## 4: 4 0 0 1 0
## 5: 5 0 1 0 1
## ---
## 996: 996 0 0 0 0
## 997: 997 0 1 0 0
## 998: 998 0 1 1 1
## 999: 999 0 0 0 0
## 1000: 1000 0 0 0 0</code></pre>
<pre class="r"><code>d1[, .(V1 = mean(V1), V2 = mean(V2),
V3 = mean(V3), V4 = mean(V4))]</code></pre>
<pre><code>## V1 V2 V3 V4
## 1: 0.184 0.486 0.595 0.704</code></pre>
<pre class="r"><code>d1[, round(cor(cbind(V1, V2, V3, V4)), 2)]</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.18 0.17 0.17
## V2 0.18 1.00 0.19 0.23
## V3 0.17 0.19 1.00 0.15
## V4 0.17 0.23 0.15 1.00</code></pre>
</div>
<div id="the-ep-option-offers-an-improvement" class="section level3">
<h3>The <em>ep</em> option offers an improvement</h3>
<p>Data generated using the Emrich & Piedmonte algorithm, done by specifying the “<em>ep</em>” method, does much better; the observed correlation is much closer to what we specified. (Note that the E&P algorithm may restrict the range of possible correlations; if you specify a correlation outside of the range, an error message is issued.)</p>
<pre class="r"><code>set.seed(736258)
d2 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
dist = "binary", rho = 0.3, corstr = "cs", wide = TRUE,
method = "ep")
d2[, .(V1 = mean(V1), V2 = mean(V2),
V3 = mean(V3), V4 = mean(V4))]</code></pre>
<pre><code>## V1 V2 V3 V4
## 1: 0.199 0.504 0.611 0.706</code></pre>
<pre class="r"><code>d2[, round(cor(cbind(V1, V2, V3, V4)), 2)]</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.33 0.33 0.29
## V2 0.33 1.00 0.32 0.31
## V3 0.33 0.32 1.00 0.28
## V4 0.29 0.31 0.28 1.00</code></pre>
<p>If we generate the data using the “long” form, we can fit a <em>GEE</em> marginal model to recover the parameters used in the data generation process:</p>
<pre class="r"><code>library(geepack)
set.seed(736258)
d3 <- genCorGen(n = 1000, nvars = 4, params1 = c(0.2, 0.5, 0.6, 0.7),
dist = "binary", rho = 0.3, corstr = "cs", wide = FALSE,
method = "ep")
geefit3 <- geeglm(X ~ factor(period), id = id, data = d3,
family = binomial, corstr = "exchangeable")
summary(geefit3)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = X ~ factor(period), family = binomial, data = d3,
## id = id, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.39256 0.07921 309.1 <2e-16 ***
## factor(period)1 1.40856 0.08352 284.4 <2e-16 ***
## factor(period)2 1.84407 0.08415 480.3 <2e-16 ***
## factor(period)3 2.26859 0.08864 655.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 1 0.01708
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.3114 0.01855
## Number of clusters: 1000 Maximum cluster size: 4</code></pre>
<p>And the point estimates for each variable on the probability scale:</p>
<pre class="r"><code>round(1/(1+exp(1.3926 - c(0, 1.4086, 1.8441, 2.2686))), 2)</code></pre>
<pre><code>## [1] 0.20 0.50 0.61 0.71</code></pre>
</div>
<div id="longitudinal-repeated-measures" class="section level3">
<h3>Longitudinal (repeated) measures</h3>
<p>One researcher wanted to generate individual-level longitudinal data that might be analyzed using a GEE model. This is not so different from what I just did, but incorporates a specific time trend to define the probabilities. In this case, the steps are to (1) generate longitudinal data using the <code>addPeriods</code> function, (2) define the longitudinal probabilities, and (3) generate correlated binary outcomes with an AR-1 correlation structure.</p>
<pre class="r"><code>set.seed(393821)
probform <- "-2 + 0.3 * period"
def1 <- defDataAdd(varname = "p", formula = probform,
dist = "nonrandom", link = "logit")
dx <- genData(1000)
dx <- addPeriods(dx, nPeriods = 4)
dx <- addColumns(def1, dx)
dg <- addCorGen(dx, nvars = 4,
corMatrix = NULL, rho = .4, corstr = "ar1",
dist = "binary", param1 = "p",
method = "ep", formSpec = probform,
periodvar = "period")</code></pre>
<p>The correlation matrix from the observed data is reasonably close to having an AR-1 structure, where <span class="math inline">\(\rho = 0.4\)</span>, <span class="math inline">\(\rho^2 = 0.16\)</span>, <span class="math inline">\(\rho^3 = 0.064\)</span>.</p>
<pre class="r"><code>cor(dcast(dg, id ~ period, value.var = "X")[,-1])</code></pre>
<pre><code>## 0 1 2 3
## 0 1.00000 0.4309 0.1762 0.04057
## 1 0.43091 1.0000 0.3953 0.14089
## 2 0.17618 0.3953 1.0000 0.36900
## 3 0.04057 0.1409 0.3690 1.00000</code></pre>
<p>And again, the model recovers the time trend parameter defined in variable <code>probform</code> as well as the correlation parameter:</p>
<pre class="r"><code>geefit <- geeglm(X ~ period, id = id, data = dg, corstr = "ar1",
family = binomial)
summary(geefit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = X ~ period, family = binomial, data = dg, id = id,
## corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.9598 0.0891 484.0 <2e-16 ***
## period 0.3218 0.0383 70.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 1 0.0621
##
## Correlation: Structure = ar1 Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.397 0.0354
## Number of clusters: 1000 Maximum cluster size: 4</code></pre>
</div>
<div id="model-mis-specification" class="section level3">
<h3>Model mis-specification</h3>
<p>And just for fun, here is an example of how simulation might be used to investigate the performance of a model. Let’s say we are interested in the implications of mis-specifying the correlation structure. In this case, we can fit two GEE models (one correctly specified and one mis-specified) and assess the sampling properties of the estimates from each:</p>
<pre class="r"><code>library(broom)