ouR data generation
https://www.rdatagen.net/
Keith Goldfeld (keith.goldfeld@nyumc.org)

The design effect of a cluster randomized trial with baseline measurements
https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/
Tue, 23 Nov 2021
<p>Is it possible to reduce the sample size requirements of a stepped wedge cluster randomized trial simply by collecting baseline information? In a trial with randomization at the individual level, it <em>is</em> generally the case that if we are able to measure an outcome for subjects at two time periods, first at baseline and then at follow-up, we can reduce the overall sample size. But does this extend to (a) cluster randomized trials generally, and to (b) stepped wedge designs more specifically?</p>
<p>The answer to (a) is a definite “yes,” as described in a 2012 paper by <em>Teerenstra et al</em> (more details on that below). As for (b), two colleagues on the <a href="https://impactcollaboratory.org/design-and-statistics-core/" target="_blank">Design and Statistics Core</a> of the <a href="https://impactcollaboratory.org/" target="_blank">NIA IMPACT Collaboratory</a>, Monica Taljaard and Fan Li, and I have just started thinking about this. Ultimately, we hope to have an analytic solution that provides more formal guidance for stepped wedge designs; but to get things started, we thought we could explore a bit using simulation.</p>
<div id="quick-overview" class="section level2">
<h2>Quick overview</h2>
<p>Generally speaking, why might baseline measurements have any impact at all? The curse of any clinical trial is variability - the more noise (variability) there is in the outcome, the more difficult it is to identify the signal (effect). For example, if we are interested in measuring the impact of an intervention on the quality of life (QOL) across a diverse range of patients, the measurement (which typically ranges from 0 to 1) might vary considerably from person to person, regardless of the intervention. If the intervention has a real but moderate effect of, say, 0.1 points, it could easily get lost if the standard deviation is considerably larger, say 0.25.</p>
<p>It turns out that if we collect baseline QOL scores and can “control” for those measurements in some way (by conducting a repeated measures analysis, using ANCOVA, or assessing the difference itself as an outcome), we might be able to reduce the variability across study subjects sufficiently to give us a better chance at picking up the signal. Previously, I’ve written about <a href="https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/" target="_blank">baseline covariate adjustment</a> in the context of clinical trials where randomization is at the individual subject level; now we will turn to the case where randomization is at the cluster or site level.</p>
<p>This post focuses on work already done to derive <em>design effects</em> for parallel cluster randomized trials (CRTs) that collect baseline measurements; we will get to stepped wedge designs in future posts. I described the <a href="https://www.rdatagen.net/post/what-exactly-is-the-design-effect/" target="_blank">design effect</a> pretty generally in an earlier post, but the <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/sim.5352" target="_blank">paper</a> by <em>Teerenstra et al</em>, titled “A simple sample size formula for analysis of covariance in cluster randomized trials” provides a great foundation to understand how baseline measurements can impact sample sizes in clustered designs.</p>
<p>Here’s a brief outline of what follows. After showing an example based on a simple two-arm randomized controlled trial with 350 subjects that has 80% power to detect a standardized effect size of 0.3, I describe and simulate a series of designs with 30 subjects per cluster that require progressively fewer clusters while providing the same 80% power under the same effect size and total variance assumptions: a simple CRT that needs 64 sites, a cross-sectional pre-post design that needs 52, a repeated measures design that needs 38, and a repeated measures design that models follow-up outcomes only (i.e., uses an ANCOVA model) that requires only 32.</p>
</div>
<div id="simple-rct" class="section level2">
<h2>Simple RCT</h2>
<p>We start with a simple RCT (without any clustering) that randomizes individuals to treatment or control.</p>
<p><span class="math display">\[
Y_{i} = \alpha + \delta Z_{i} + s_{i}
\]</span> where <span class="math inline">\(Y_{i}\)</span> is a continuous outcome measure for individual <span class="math inline">\(i\)</span>, and <span class="math inline">\(Z_{i}\)</span> is the treatment status of individual <span class="math inline">\(i\)</span>. <span class="math inline">\(\delta\)</span> is the treatment effect. <span class="math inline">\(s_{i} \sim N(0, \sigma_s^2)\)</span> are the individual random effects or noise.</p>
<p>Now that we are about to start coding, here are the necessary packages:</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG")
set.seed(19287)
library(simstudy)
library(ggplot2)
library(lmerTest)
library(parallel)
library(data.table)
library(pwr)
library(gtsummary)
library(paletteer)
library(magrittr)</code></pre>
<p>In the examples that follow, overall variance <span class="math inline">\(\sigma^2 = 64\)</span>. In this first example, <span class="math inline">\(\sigma_s^2 = 64\)</span>, since individual variation is the only source of variation. The overall effect size <span class="math inline">\(\delta\)</span>, which is the difference in average scores across treatment groups, is assumed to be 2.4, corresponding to a standardized effect size of <span class="math inline">\(2.4/8 = 0.3\)</span>. We will need to generate 350 individual subjects (175 in each arm) to achieve power of 80%.</p>
<pre class="r"><code>pwr.t.test(d = 0.3, power = 0.80)</code></pre>
<pre><code>##
## Two-sample t test power calculation
##
## n = 175
## d = 0.3
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group</code></pre>
<div id="data-generation-process" class="section level4">
<h4>Data generation process</h4>
<p>Here is the data definition and generation process:</p>
<pre class="r"><code>simple_rct <- function(N) {
# data definition for outcome
defS <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
defS <- defData(defS, varname = "y", formula = "2.4*rx", variance = 64, dist = "normal")
dd <- genData(N, defS)
dd[]
}
dd <- simple_rct(350)</code></pre>
<p>Here is a visualization of the outcome measures by treatment arm.</p>
<p><img src="https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/index.en_files/figure-html/unnamed-chunk-4-1.png" width="480" /></p>
</div>
<div id="estimating-effect-size" class="section level4">
<h4>Estimating effect size</h4>
<p>A simple linear regression model estimates the effect size:</p>
<pre class="r"><code>fit1 <- lm(y ~ rx, data = dd)
tbl_regression(fit1) %>%
modify_footnote(ci ~ NA, abbreviation = TRUE)</code></pre>
<div id="gpdyzpkjsn" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<table class="gt_table">
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1"><strong>Characteristic</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>Beta</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>95% CI</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>p-value</strong></th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr><td class="gt_row gt_left">rx</td>
<td class="gt_row gt_center">3.2</td>
<td class="gt_row gt_center">1.6, 4.9</td>
<td class="gt_row gt_center"><0.001</td></tr>
</tbody>
</table>
</div>
</div>
<div id="confirming-power" class="section level4">
<h4>Confirming power</h4>
<p>We can confirm the power by repeatedly generating data sets and fitting models, recording the p-values for each replication.</p>
<pre class="r"><code>replicate <- function() {
dd <- simple_rct(350)
fit1 <- lm(y ~ rx, data = dd)
coef(summary(fit1))["rx", "Pr(>|t|)"]
}
p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)</code></pre>
<p>Here is the estimated power based on 1000 replications:</p>
<pre class="r"><code>mean(unlist(p_values) < 0.05)</code></pre>
<pre><code>## [1] 0.79</code></pre>
</div>
</div>
<div id="parallel-cluster-randomized-trial" class="section level2">
<h2>Parallel cluster randomized trial</h2>
<p>If we need to randomize at the site level (i.e., conduct a CRT), we can describe the data generation process as</p>
<p><span class="math display">\[Y_{ij} = \alpha + \delta Z_{j} + c_j + s_i\]</span></p>
<p>where <span class="math inline">\(Y_{ij}\)</span> is a continuous outcome for subject <span class="math inline">\(i\)</span> in site <span class="math inline">\(j\)</span>. <span class="math inline">\(Z_{j}\)</span> is the treatment indicator for site <span class="math inline">\(j\)</span>. Again, <span class="math inline">\(\delta\)</span> is the treatment effect. <span class="math inline">\(c_j \sim N(0,\sigma_c^2)\)</span> is a site level effect, and <span class="math inline">\(s_i \sim N(0, \sigma_s^2)\)</span> is the subject level effect. The correlation of any two subjects in a cluster is <span class="math inline">\(\rho\)</span> (the ICC):</p>
<p><span class="math display">\[\rho = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_s^2}\]</span></p>
<p>If we have a pre-specified number (<span class="math inline">\(n\)</span>) of subjects at each site, we can estimate the sample size required for the CRT by applying a <em>design effect</em> of <span class="math inline">\(1+(n-1)\rho\)</span> to the sample size of an RCT that has the same overall variance. So, if <span class="math inline">\(\sigma_c^2 + \sigma_s^2 = 64\)</span>, we can inflate the sample size we used in the initial example. With <span class="math inline">\(\sigma_c^2 = 9.6\)</span> and <span class="math inline">\(\sigma_s^2 = 54.4\)</span>, <span class="math inline">\(\rho = 9.6/64 = 0.15\)</span>. We anticipate having 30 subjects at each site, so the design effect is</p>
<p><span class="math display">\[1 + (30 - 1) \times 0.15 = 5.35\]</span></p>
<p>This means we will need <span class="math inline">\(5.35 \times 350 \approx 1873\)</span> total subjects based on the same effect size and power assumptions. Since we anticipate 30 subjects per site, we need about <span class="math inline">\(1873 / 30 \approx 62.4\)</span> sites; we will round up to the nearest even number and use 64 sites.</p>
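As a quick sanity check (this is just the arithmetic above, not new derivation), the design effect calculation can be reproduced in a few lines of R:

```r
n <- 30                    # anticipated subjects per site
rho <- 9.6 / 64            # ICC implied by sigma_c^2 = 9.6 and total variance 64
deff <- 1 + (n - 1) * rho  # standard design effect: 5.35

N_rct <- 350               # sample size of the individually randomized trial
N_crt <- deff * N_rct      # total subjects needed under clustering: 1872.5
N_crt / n                  # ~62.4 sites, which we round up to 64
```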
<div id="data-generation-process-1" class="section level4">
<h4>Data generation process</h4>
<pre class="r"><code>simple_crt <- function(nsites, n) {
defC <- defData(varname = "rx", formula = "1;1", dist = "trtAssign")
defC <- defData(defC, varname = "c", formula = "0", variance = 9.6, dist = "normal")
defS <- defDataAdd(varname="y", formula="c + 2.4*rx", variance = 54.4, dist="normal")
# site/cluster level data
dc <- genData(nsites, defC, id = "site")
# individual level data
dd <- genCluster(dc, "site", n, "id")
dd <- addColumns(defS, dd)
dd[]
}
dd <- simple_crt(20, 50)</code></pre>
<p>Once again, the sites randomized to the treatment arm are colored red:</p>
<p><img src="https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/index.en_files/figure-html/unnamed-chunk-9-1.png" width="768" /></p>
</div>
<div id="estimating-effect-size-1" class="section level4">
<h4>Estimating effect size</h4>
<p>A mixed effects model is used to estimate the effect size. I’m using a larger data set to recover the parameters used in the data generation process:</p>
<pre class="r"><code>dd <- simple_crt(200,100)
fit2 <- lmer(y ~ rx + (1|site), data = dd)
tbl_regression(fit2, tidy_fun = broom.mixed::tidy) %>%
modify_footnote(ci ~ NA, abbreviation = TRUE)</code></pre>
<div id="izjetofxgq" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<table class="gt_table">
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1"><strong>Characteristic</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>Beta</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>95% CI</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>p-value</strong></th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr><td class="gt_row gt_left">rx</td>
<td class="gt_row gt_center">1.2</td>
<td class="gt_row gt_center">0.21, 2.1</td>
<td class="gt_row gt_center">0.018</td></tr>
<tr><td class="gt_row gt_left">site.sd__(Intercept)</td>
<td class="gt_row gt_center">3.4</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">Residual.sd__Observation</td>
<td class="gt_row gt_center">7.4</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
</tbody>
</table>
</div>
</div>
<div id="confirming-power-1" class="section level4">
<h4>Confirming power</h4>
<p>Now, I will confirm power using 64 sites with 30 subjects per site, for a total of 1920 subjects (compared with only 350 in the RCT):</p>
<pre class="r"><code>replicate <- function() {
dd <- simple_crt(64, 30)
fit2 <- lmer(y ~ rx + (1|site), data = dd)
coef(summary(fit2))["rx", "Pr(>|t|)"]
}
p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)
mean(unlist(p_values) < 0.05)</code></pre>
<pre><code>## [1] 0.8</code></pre>
</div>
</div>
<div id="crt-with-baseline-measurement" class="section level2">
<h2>CRT with baseline measurement</h2>
<p>We paid quite a hefty price moving from an RCT to a CRT in terms of the number of subjects we need to collect data on. If these data are coming from administrative systems, that added burden might not be an issue, but if we need to consent all the subjects and survey them individually, this could be quite burdensome.</p>
<p>We may be able to decrease the required number of clusters (i.e. reduce the design effect) if we can collect baseline measurements of the outcome. The baseline and follow-up measurements can be collected from the same subjects or different subjects, <em>though the impact on the design effect depends on what approach is taken</em>.</p>
<p><span class="math display">\[
Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1}k Z_{j} + c_j + cp_{jk} + s_{ij} + sp_{ijk}
\]</span></p>
<p>where <span class="math inline">\(Y_{ijk}\)</span> is a continuous outcome measure for individual <span class="math inline">\(i\)</span> in site <span class="math inline">\(j\)</span> and measurement <span class="math inline">\(k \in \{0,1\}\)</span>. <span class="math inline">\(k=0\)</span> for baseline measurement, and <span class="math inline">\(k=1\)</span> for the follow-up. <span class="math inline">\(Z_{j}\)</span> is the treatment status of cluster <span class="math inline">\(j\)</span>, <span class="math inline">\(Z_{j} \in \{0,1\}.\)</span> <span class="math inline">\(\alpha_0\)</span> is the mean outcome at baseline for subjects in the control clusters, <span class="math inline">\(\alpha_1\)</span> is the change from baseline to follow-up in the control arm, <span class="math inline">\(\delta_{0}\)</span> is the difference at baseline between control and treatment arms (we would expect this to be <span class="math inline">\(0\)</span> in a randomized trial), and <span class="math inline">\(\delta_{1}\)</span> is the difference in the change from baseline to follow-up between the two arms. (In a randomized trial, since <span class="math inline">\(\delta_0\)</span> should be close to <span class="math inline">\(0\)</span>, <span class="math inline">\(\delta_1\)</span> is the treatment effect.)</p>
<p>The model has cluster-specific and subject-specific random effects, and each can have time-invariant and time-varying components. <span class="math inline">\(c_j \sim N(0,\sigma_c^2)\)</span> are time-invariant site-specific effects, and <span class="math inline">\(cp_{jk} \sim N(0, \sigma_{cp}^2)\)</span> are the site-specific period (time-varying) effects. At the subject level, there are time-invariant effects <span class="math inline">\(s_{ij} \sim N(0, \sigma_s^2)\)</span> and time-varying effects <span class="math inline">\(sp_{ijk} \sim N(0, \sigma_{sp}^2)\)</span>.</p>
<p>Here is the generic code that will facilitate data generation in this model:</p>
<pre class="r"><code>crt_base <- function(effect, nsites, n, s_c, s_cp, s_s, s_sp) {
defC <- defData(varname = "c", formula = 0, variance = "..s_c")
defC <- defData(defC, varname = "rx", formula = "1;1", dist = "trtAssign")
defCP <- defDataAdd(varname = "c.p", formula = 0, variance = "..s_cp")
defS <- defDataAdd(varname = "s", formula = 0, variance = "..s_s")
defSP <- defDataAdd(varname = "y",
formula = "..effect * rx * period + c + c.p + s",
variance ="..s_sp")
dc <- genData(nsites, defC, id = "site")
dcp <- addPeriods(dc, 2, "site")
dcp <- addColumns(defCP, dcp)
dcp <- dcp[, .(site, period, c.p, timeID)]
ds <- genCluster(dc, "site", n, "id")
ds <- addColumns(defS, ds)
dsp <- addPeriods(ds, 2)
setnames(dsp, "timeID", "obsID")
setkey(dsp, site, period)
setkey(dcp, site, period)
dd <- merge(dsp, dcp)
dd <- addColumns(defSP, dd)
setkey(dd, site, id, period)
dd[]
}</code></pre>
</div>
<div id="design-effect" class="section level2">
<h2>Design effect</h2>
<p>In their paper, <em>Teerenstra et al</em> develop a design effect that takes into account the baseline measurement. Here are a few key quantities that are needed for the calculation:</p>
<p>The <strong>correlation of two subject measurements in the same cluster and same time period</strong> is the ICC or <span class="math inline">\(\rho\)</span>, and is:</p>
<p><span class="math display">\[\rho = \frac{\sigma_c^2 + \sigma_{cp}^2}{\sigma_c^2 + \sigma_{cp}^2 + \sigma_s^2 + \sigma_{sp}^2}
\]</span></p>
<p>In order to estimate the design effect, we need two more correlations. The <strong>correlation between baseline and follow-up <em>random effects</em> at the <em>cluster level</em></strong> is</p>
<p><span class="math display">\[\rho_c = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_{cp}^2}\]</span></p>
<p>and the <strong>correlation between baseline and follow-up random effects at the <em>subject level</em></strong> is</p>
<p><span class="math display">\[\rho_s = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{sp}^2}\]</span></p>
<p>A value <span class="math inline">\(r\)</span> is used to estimate the design effect, and is defined as</p>
<p><span class="math display">\[ r = \frac{n\rho\rho_c + (1-\rho)\rho_s}{1 + (n-1)\rho}\]</span></p>
<p>If we are able to collect baseline measurements and our focus is on estimating <span class="math inline">\(\delta_1\)</span> from the model, the design effect is slightly modified from before:</p>
<p><span class="math display">\[ (1 + (n-1)\rho)(2(1-r)) \]</span></p>
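<p>To keep these pieces straight, the full calculation can be wrapped in a small function. (<code>design_effect</code> is my own illustrative helper, not a <code>simstudy</code> function; the variance assumptions plugged in here are the ones used in the simulations below.)</p>
<pre class="r"><code># compute the Teerenstra et al design effect from the four variance
# components and the cluster size n (illustrative helper, not from simstudy)
design_effect <- function(s_c, s_cp, s_s, s_sp, n) {
  rho <- (s_c + s_cp) / (s_c + s_cp + s_s + s_sp)  # within-period ICC
  rho_c <- s_c / (s_c + s_cp)                      # cluster-level correlation
  rho_s <- s_s / (s_s + s_sp)                      # subject-level correlation
  r <- (n * rho * rho_c + (1 - rho) * rho_s) / (1 + (n - 1) * rho)
  (1 + (n - 1) * rho) * 2 * (1 - r)
}

design_effect(s_c = 6.8, s_cp = 2.8, s_s = 0, s_sp = 54.4, n = 30)  # 4.325
design_effect(s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4, n = 30) # 3.1375</code></pre>
<p>These two values match the design effects of 4.3 and 3.1 that appear (rounded) in the sections that follow.</p>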
</div>
<div id="cross-sectional-cohorts" class="section level2">
<h2>Cross-sectional cohorts</h2>
<p>We may not be able to collect two measurements for each subject at a site. But if we can measure two different cohorts - one at baseline, before the intervention is implemented, and a second in the follow-up period (with or without the intervention, depending on the cluster's randomization assignment) - we might still be able to reduce the number of clusters.</p>
<p>In this case, <span class="math inline">\(\sigma_s^2 = 0\)</span> and <span class="math inline">\(\rho_s = 0\)</span>, so the general model reduces to</p>
<p><span class="math display">\[
Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1} k Z_{j} + c_j + cp_{jk} + sp_{ijk}
\]</span></p>
<div id="data-generation-process-2" class="section level4">
<h4>Data generation process</h4>
<p>The parameters for this simulation are <span class="math inline">\(\delta_1 = 2.4\)</span>, <span class="math inline">\(\sigma_c^2 = 6.8\)</span>, <span class="math inline">\(\sigma_{cp}^2 = 2.8\)</span>, and <span class="math inline">\(\sigma_{sp}^2 = 54.4\)</span>. The total variance <span class="math inline">\(\sigma_c^2 + \sigma_{cp}^2 + \sigma_{sp}^2 = 6.8 + 2.8 + 54.4 = 64\)</span>, as used previously.</p>
<pre class="r"><code>dd <- crt_base(effect = 2.4, nsites = 20, n = 30, s_c=6.8, s_cp=2.8, s_s=0, s_sp=54.4)</code></pre>
<p>Here is a visualization of the outcome measures by site and by period, with the sites in the <em>treatment</em> arm colored in red (only in the follow-up period).</p>
<p><img src="https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/index.en_files/figure-html/unnamed-chunk-14-1.png" width="864" /></p>
</div>
<div id="estimating-effect-size-2" class="section level4">
<h4>Estimating effect size</h4>
<p>To estimate the effect size, we fit a mixed effects model with cluster-specific random effects only (both time-invariant and time-varying).</p>
<pre class="r"><code>dd <- crt_base(effect = 2.4, nsites = 200, n = 100, s_c = 6.8, s_cp = 2.8, s_s = 0, s_sp = 54.4)

fit3 <- lmer(y ~ period * rx + (1 | timeID:site) + (1 | site), data = dd)

tbl_regression(fit3, tidy_fun = broom.mixed::tidy) %>%
  modify_footnote(ci ~ NA, abbreviation = TRUE)</code></pre>
<div id="dqyroxmggv" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<table class="gt_table">
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1"><strong>Characteristic</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>Beta</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>95% CI</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>p-value</strong></th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr><td class="gt_row gt_left">period</td>
<td class="gt_row gt_center">-0.03</td>
<td class="gt_row gt_center">-0.52, 0.46</td>
<td class="gt_row gt_center">>0.9</td></tr>
<tr><td class="gt_row gt_left">rx</td>
<td class="gt_row gt_center">0.17</td>
<td class="gt_row gt_center">-0.78, 1.1</td>
<td class="gt_row gt_center">0.7</td></tr>
<tr><td class="gt_row gt_left">period * rx</td>
<td class="gt_row gt_center">2.7</td>
<td class="gt_row gt_center">2.0, 3.4</td>
<td class="gt_row gt_center"><0.001</td></tr>
<tr><td class="gt_row gt_left">timeID:site.sd__(Intercept)</td>
<td class="gt_row gt_center">1.6</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">site.sd__(Intercept)</td>
<td class="gt_row gt_center">2.9</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">Residual.sd__Observation</td>
<td class="gt_row gt_center">7.4</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
</tbody>
</table>
</div>
</div>
<div id="confirming-power-2" class="section level4">
<h4>Confirming power</h4>
<p>Based on the variance assumptions, we can update our design effect:</p>
<pre class="r"><code>s_c <- 6.8
s_cp <- 2.8
s_s <- 0
s_sp <- 54.4
rho <- (s_c + s_cp)/(s_c + s_cp + s_s + s_sp)
rho_c <- s_c/(s_c + s_cp)
rho_s <- s_s/(s_s + s_sp)
n <- 30
r <- (n * rho * rho_c + (1-rho) * rho_s) / (1 + (n-1) * rho)</code></pre>
<p>The design effect for the CRT without any baseline measurement was 5.35. With the two-cohort design, the design effect is reduced slightly:</p>
<pre class="r"><code>(des_effect <- (1 + (n - 1) * rho) * 2 * (1 - r))</code></pre>
<pre><code>## [1] 4.3</code></pre>
<pre class="r"><code>des_effect * 350 / n</code></pre>
<pre><code>## [1] 50</code></pre>
<p>The desired number of sites is over 50, so rounding up to the next even number gives us 52:</p>
<pre class="r"><code>replicate <- function() {
  dd <- crt_base(2.4, 52, 30, s_c = 6.8, s_cp = 2.8, s_s = 0, s_sp = 54.4)
  fit3 <- lmer(y ~ period * rx + (1 | timeID:site) + (1 | site), data = dd)
  coef(summary(fit3))["period:rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)
mean(unlist(p_values) < 0.05)</code></pre>
<pre><code>## [1] 0.8</code></pre>
</div>
</div>
<div id="repeated-measurements" class="section level2">
<h2>Repeated measurements</h2>
<p>We can reduce the number of clusters further if, instead of measuring one cohort prior to the intervention and another after the intervention, we measure a single cohort twice - once at baseline and once at follow-up. Now we use the full model that decomposes the subject-level variance into a time-invariant effect (<span class="math inline">\(s_{ij}\)</span>) and a time-varying effect (<span class="math inline">\(sp_{ijk}\)</span>):</p>
<p><span class="math display">\[
Y_{ijk} = \alpha_0 + \alpha_1 k + \delta_{0} Z_j + \delta_{1} k Z_{j} + c_j + cp_{jk} + s_{ij} + sp_{ijk}
\]</span></p>
<div id="data-generation-process-3" class="section level4">
<h4>Data generation process</h4>
<p>These are the parameters: <span class="math inline">\(\delta_1 = 2.4\)</span>, <span class="math inline">\(\sigma_c^2 = 6.8\)</span>, <span class="math inline">\(\sigma_{cp}^2 = 2.8\)</span>, <span class="math inline">\(\sigma_s^2 = 38\)</span>, and <span class="math inline">\(\sigma_{sp}^2 = 16.4\)</span>.</p>
<pre class="r"><code>dd <- crt_base(effect=2.4, nsites=20, n=30, s_c=6.8, s_cp=2.8, s_s=38, s_sp=16.4)</code></pre>
<p>Here is what the data look like; each line represents an individual subject at the two time points, baseline and follow-up.</p>
<p><img src="https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/index.en_files/figure-html/unnamed-chunk-20-1.png" width="864" /></p>
</div>
<div id="estimating-effect-size-3" class="section level4">
<h4>Estimating effect size</h4>
<p>The mixed effects model now includes cluster-specific effects (both time-invariant and time-varying) as well as subject-level effects. Again, the total variance (<span class="math inline">\(\sigma_c^2 + \sigma_{cp}^2 + \sigma_s^2 + \sigma_{sp}^2\)</span>) is 64.</p>
<pre class="r"><code>dd <- crt_base(effect = 2.4, nsites = 200, n = 100,
  s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

fit4 <- lmer(y ~ period * rx + (1 | id:site) + (1 | timeID:site) + (1 | site), data = dd)

tbl_regression(fit4, tidy_fun = broom.mixed::tidy) %>%
  modify_footnote(ci ~ NA, abbreviation = TRUE)</code></pre>
<div id="syvravrbas" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<table class="gt_table">
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1"><strong>Characteristic</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>Beta</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>95% CI</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>p-value</strong></th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr><td class="gt_row gt_left">period</td>
<td class="gt_row gt_center">-0.21</td>
<td class="gt_row gt_center">-0.73, 0.31</td>
<td class="gt_row gt_center">0.4</td></tr>
<tr><td class="gt_row gt_left">rx</td>
<td class="gt_row gt_center">-0.19</td>
<td class="gt_row gt_center">-1.1, 0.73</td>
<td class="gt_row gt_center">0.7</td></tr>
<tr><td class="gt_row gt_left">period * rx</td>
<td class="gt_row gt_center">2.4</td>
<td class="gt_row gt_center">1.7, 3.2</td>
<td class="gt_row gt_center"><0.001</td></tr>
<tr><td class="gt_row gt_left">id:site.sd__(Intercept)</td>
<td class="gt_row gt_center">6.2</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">timeID:site.sd__(Intercept)</td>
<td class="gt_row gt_center">1.8</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">site.sd__(Intercept)</td>
<td class="gt_row gt_center">2.7</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">Residual.sd__Observation</td>
<td class="gt_row gt_center">4.1</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
</tbody>
</table>
</div>
</div>
<div id="confirming-power-3" class="section level4">
<h4>Confirming power</h4>
<p>Based on the variance assumptions, we can update our design effect a second time:</p>
<pre class="r"><code>s_c <- 6.8
s_cp <- 2.8
s_s <- 38
s_sp <- 16.4
rho <- (s_c + s_cp)/(s_c + s_cp + s_s + s_sp)
rho_c <- s_c/(s_c + s_cp)
rho_s <- s_s/(s_s + s_sp)
n <- 30
r <- (n * rho * rho_c + (1-rho) * rho_s) / (1 + (n-1) * rho)</code></pre>
<p>And again, the design effect (and sample size requirement) is reduced:</p>
<pre class="r"><code>(des_effect <- (1 + (n - 1) * rho) * 2 * (1 - r))</code></pre>
<pre><code>## [1] 3.1</code></pre>
<pre class="r"><code>des_effect * 350 / n</code></pre>
<pre><code>## [1] 37</code></pre>
<p>The desired number of sites is about 37, so rounding up to the next even number gives us 38:</p>
<pre class="r"><code>replicate <- function() {
  dd <- crt_base(2.4, 38, 30, s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)
  fit4 <- lmer(y ~ period * rx + (1 | id:site) + (1 | timeID:site) + (1 | site), data = dd)
  coef(summary(fit4))["period:rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)
mean(unlist(p_values) < 0.05)</code></pre>
<pre><code>## [1] 0.79</code></pre>
</div>
</div>
<div id="repeated-measurements---ancova" class="section level2">
<h2>Repeated measurements - ANCOVA</h2>
<p>We may be able to reduce the number of clusters even further by changing the model so that we compare the follow-up outcomes of the two treatment arms directly (as opposed to comparing the changes from baseline, as we just did). This model is</p>
<p><span class="math display">\[
Y_{ij1} = \alpha_0 + \gamma Y_{ij0} + \delta Z_j + c_j + s_{ij}
\]</span></p>
<p>where we have adjusted for the baseline measurement <span class="math inline">\(Y_{ij0}\)</span>. Even though the estimation model has changed, I am using the exact same data generation process as before, with the same effect size and variance assumptions:</p>
<pre class="r"><code>dd <- crt_base(effect = 2.4, nsites = 200, n = 100,
  s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

# reshape to wide format: one row per subject, with baseline (0) and follow-up (1) outcomes
dobs <- dd[, .(site, rx, id, period, timeID, y)]
dobs <- dcast(dobs, site + rx + id ~ period, value.var = "y")

fit5 <- lmer(`1` ~ `0` + rx + (1 | site), data = dobs)

tbl_regression(fit5, tidy_fun = broom.mixed::tidy) %>%
  modify_footnote(ci ~ NA, abbreviation = TRUE)</code></pre>
<div id="ymuatefzfz" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<table class="gt_table">
<thead class="gt_col_headings">
<tr>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1"><strong>Characteristic</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>Beta</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>95% CI</strong></th>
<th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1"><strong>p-value</strong></th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr><td class="gt_row gt_left">0</td>
<td class="gt_row gt_center">0.70</td>
<td class="gt_row gt_center">0.69, 0.71</td>
<td class="gt_row gt_center"><0.001</td></tr>
<tr><td class="gt_row gt_left">rx</td>
<td class="gt_row gt_center">2.5</td>
<td class="gt_row gt_center">1.8, 3.1</td>
<td class="gt_row gt_center"><0.001</td></tr>
<tr><td class="gt_row gt_left">site.sd__(Intercept)</td>
<td class="gt_row gt_center">2.2</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
<tr><td class="gt_row gt_left">Residual.sd__Observation</td>
<td class="gt_row gt_center">5.3</td>
<td class="gt_row gt_center"></td>
<td class="gt_row gt_center"></td></tr>
</tbody>
</table>
</div>
<div id="design-effect-1" class="section level4">
<h4>Design effect</h4>
<p><em>Teerenstra et al</em> derived an alternative design effect that is specific to the ANCOVA model:</p>
<p><span class="math display">\[
(1 + (n-1)\rho) (1-r^2)
\]</span></p>
<p>where <span class="math inline">\(r\)</span> is the same as before. Since <span class="math inline">\((1-r^2) < 2(1-r)\)</span> for <span class="math inline">\(0 \le r < 1\)</span>, this design effect is always smaller than the one implied by the earlier model.</p>
<pre class="r"><code>(des_effect <- (1 + (n - 1) * rho) * (1 - r^2))</code></pre>
<pre><code>## [1] 2.7</code></pre>
<pre class="r"><code>des_effect * 350 / n</code></pre>
<pre><code>## [1] 31</code></pre>
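<p>As an informal check of the inequality above, we can evaluate both expressions over a grid of values of <span class="math inline">\(r\)</span> (a base R sketch, independent of the trial parameters used elsewhere in this post):</p>
<pre class="r"><code>r <- seq(0, 0.99, by = 0.01)

# (1 - r^2) never exceeds 2*(1 - r) on [0, 1), so the ANCOVA design
# effect cannot be larger than the earlier one
all((1 - r^2) < 2 * (1 - r))</code></pre>
<pre><code>## [1] TRUE</code></pre>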
</div>
<div id="confirming-power-4" class="section level4">
<h4>Confirming power</h4>
<pre class="r"><code>replicate <- function() {
  dd <- crt_base(2.4, 32, 30, s_c = 6.8, s_cp = 2.8, s_s = 38, s_sp = 16.4)

  dobs <- dd[, .(site, rx, id, period, timeID, y)]
  dobs <- dcast(dobs, site + rx + id ~ period, value.var = "y")

  fit5 <- lmer(`1` ~ `0` + rx + (1 | site), data = dobs)
  coef(summary(fit5))["rx", "Pr(>|t|)"]
}

p_values <- mclapply(1:1000, function(x) replicate(), mc.cores = 4)
mean(unlist(p_values) < 0.05)</code></pre>
<pre><code>## [1] 0.78</code></pre>
</div>
</div>
<div id="next-steps" class="section level2">
<h2>Next steps</h2>
<p>These simulations confirmed the design effects derived by <em>Teerenstra et al</em>. In the next post, we will turn to baseline measurements in the context of a stepped wedge design, to see if these results translate to a more complex setting. The design effects themselves have not yet been derived. In the meantime, to get yourself psyched up for what is coming, you can read more generally about stepped wedge designs <a href="https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/">here</a>, <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a>, <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">here</a>, <a href="https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/">here</a>, <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">here</a>, and <a href="https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/">here</a>.</p>
<p><small><font color="darkkhaki">
Reference: Teerenstra, Steven, Sandra Eldridge, Maud Graff, Esther de Hoop, and George F. Borm. “A simple sample size formula for analysis of covariance in cluster randomized trials.” <em>Statistics in Medicine</em> 31, no. 20 (2012): 2169-2178.
</font></small></p>
<p><small><font color="darkkhaki">
Support: This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small></p>
</div>
simstudy update: adding flexibility to data generation
https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/
Tue, 09 Nov 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/
<p>A new version of <code>simstudy</code> (0.3.0) is now available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html" target="_blank">CRAN</a> and on the <a href="https://github.com/kgoldfeld/simstudy/releases" target="_blank">package website</a>. Along with some less exciting bug fixes, we have added capabilities to a few existing features: double-dot variable reference, treatment assignment, and categorical data definition. These simple additions should make the data generation process a little smoother and more flexible.</p>
<div id="using-non-scalar-double-dot-variable-reference" class="section level2">
<h2>Using non-scalar double-dot variable reference</h2>
<p>Double-dot notation was <a href="https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/">introduced</a> in the last version of <code>simstudy</code> to allow data definitions to be more dynamic. Previously, the double-dot variable could only be a scalar value, and with the current version, double-dot notation is now also <em>array-friendly</em>.</p>
<p>Before the examples, here are the necessary packages for this post:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)</code></pre>
<div id="example-1" class="section level4">
<h4>Example 1</h4>
<p>In the first example, we want to create a mixture distribution from a vector of values (which we can also do using a <em>categorical</em> distribution, more on that in a little bit). We can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4, specified as</p>
<pre class="r"><code>sizes <- c(2, 4)</code></pre>
<p>The data definition references each element of the vector:</p>
<pre class="r"><code>defblk <- defData(varname = "blksize",
  formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")</code></pre>
<pre class="r"><code>genData(1000, defblk)</code></pre>
<pre><code>## id blksize
## 1: 1 2
## 2: 2 4
## 3: 3 2
## 4: 4 4
## 5: 5 4
## ---
## 996: 996 4
## 997: 997 4
## 998: 998 4
## 999: 999 4
## 1000: 1000 4</code></pre>
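<p>As an informal check (a sketch assuming <code>simstudy</code> and <code>data.table</code> are loaded as above), the two block sizes should each occur in roughly half of a large sample:</p>
<pre class="r"><code>sizes <- c(2, 4)
defblk <- defData(varname = "blksize",
  formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

# with 10,000 draws, each proportion should land near 0.50
dd <- genData(10000, defblk)
dd[, prop.table(table(blksize))]</code></pre>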
</div>
<div id="example-2" class="section level4">
<h4>Example 2</h4>
<p>In this second example, there is a vector variable <em>tau</em> of positive real numbers that sum to 1, and we want to calculate the weighted average of three numbers using <em>tau</em> as the weights. We could use the following code to estimate a weighted average <em>theta</em>:</p>
<pre class="r"><code>tau <- rgamma(3, 5, 2)
tau <- tau / sum(tau)
tau</code></pre>
<pre><code>## [1] 0.362 0.400 0.238</code></pre>
<pre class="r"><code>d <- defData(varname = "a", formula = 3, variance = 4)
d <- defData(d, varname = "b", formula = 8, variance = 2)
d <- defData(d, varname = "c", formula = 11, variance = 6)
d <- defData(d, varname = "theta", formula = "..tau[1]*a + ..tau[2]*b + ..tau[3]*c",
  dist = "nonrandom")
set.seed(19483)
genData(4, d)</code></pre>
<pre><code>## id a b c theta
## 1: 1 1.87 8.16 13.72 7.21
## 2: 2 3.45 7.45 6.08 5.68
## 3: 3 7.41 6.27 10.21 7.62
## 4: 4 2.34 9.52 10.01 7.04</code></pre>
<p>However, we can simplify the calculation of <em>theta</em> a bit by using matrix multiplication:</p>
<pre class="r"><code>d <- updateDef(d, changevar = "theta", newformula = "t(..tau) %*% c(a, b, c)")
set.seed(19483)
genData(4, d)</code></pre>
<pre><code>## id a b c theta
## 1: 1 1.87 8.16 13.72 7.21
## 2: 2 3.45 7.45 6.08 5.68
## 3: 3 7.41 6.27 10.21 7.62
## 4: 4 2.34 9.52 10.01 7.04</code></pre>
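<p>To see why the two formulations agree, here is a small base R illustration using the <em>tau</em> weights printed above and the values from the first simulated row (the numbers are just for illustration):</p>
<pre class="r"><code>tau <- c(0.362, 0.400, 0.238)
abc <- c(1.87, 8.16, 13.72)

# the matrix product and the element-wise weighted sum are identical,
# both equal to about 7.21, matching theta in the first row above
c(matrix_form = as.numeric(t(tau) %*% abc), weighted_sum = sum(tau * abc))</code></pre>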
</div>
<div id="example-3" class="section level4">
<h4>Example 3</h4>
<p>The arrays can also have <strong>multiple dimensions</strong>, as in an <span class="math inline">\(m \times n\)</span> matrix. If we want to specify the mean outcomes for a <span class="math inline">\(2 \times 2\)</span> factorial study design with two interventions <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>, we can use a simple matrix and draw the means directly from the matrix, which in this example is stored in the variable <em>effect</em>:</p>
<pre class="r"><code>effect <- matrix(c(0, 8, 10, 12), nrow = 2)
effect</code></pre>
<pre><code>## [,1] [,2]
## [1,] 0 10
## [2,] 8 12</code></pre>
<p>Using double dot notation, it is possible to reference the matrix cell values directly, depending on the values of <em>a</em> and <em>b</em>:</p>
<pre class="r"><code>d1 <- defData(varname = "a", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "b", formula = ".5;.5",
  variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "outcome", formula = "..effect[a, b]",
  variance = 9, dist = "normal")</code></pre>
<pre class="r"><code>dx <- genData(1000, d1)
dx</code></pre>
<pre><code>## id a b outcome
## 1: 1 1 2 12.07
## 2: 2 2 2 9.70
## 3: 3 2 2 10.76
## 4: 4 2 2 11.04
## 5: 5 2 1 5.51
## ---
## 996: 996 1 1 -2.80
## 997: 997 2 1 5.15
## 998: 998 2 2 19.47
## 999: 999 2 1 10.53
## 1000: 1000 2 1 3.89</code></pre>
<p>The plot shows individual values as well as mean values by intervention arm:</p>
<p><img src="https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/index.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
</div>
<div id="assigned-treatment-using-trtassign-distribution-in-defdata" class="section level2">
<h2>Assigned treatment using <em>trtAssign</em> distribution in <code>defData</code></h2>
<p>The function <a href="https://kgoldfeld.github.io/simstudy/articles/treat_and_exposure.html" target="_blank">trtAssign</a> currently provides functionality to randomize group assignments using stratification and non-standard ratios (e.g. 2:2:1 for a three arm trial). Starting with version 0.3.0, it is also possible to generate these treatment assignments directly in the <code>defData</code> and <code>genData</code> process without a separate call to <code>trtAssign</code>. We’ve done this by adding <em>trtAssign</em> as a possible distribution.</p>
<p>In this example, randomization is stratified by <em>gender</em> and <em>age</em> (specified in the <code>variance</code> argument), with a 1:1 allocation of treatment to control (specified in <code>formula</code>). The outcome <em>y</em> is affected by both of these factors as well as by the treatment assignment variable <em>rx</em>.</p>
<pre class="r"><code>def <- defData(varname = "male", dist = "binary",
  formula = .5, id = "cid")
def <- defData(def, varname = "over65", dist = "binary",
  formula = "-1.7 + .8*male", link = "logit")
def <- defData(def, varname = "rx", dist = "trtAssign",
  formula = "1;1", variance = "male;over65")
def <- defData(def, varname = "y", dist = "normal",
  formula = "20 + 5*male + 10*over65 + 10*rx", variance = 40)
dtstudy <- genData(330, def)
dtstudy</code></pre>
<pre><code>## cid male over65 rx y
## 1: 1 1 0 0 20.4
## 2: 2 1 0 0 23.9
## 3: 3 0 1 0 23.7
## 4: 4 1 0 1 25.9
## 5: 5 0 1 0 35.4
## ---
## 326: 326 1 1 1 46.3
## 327: 327 1 0 1 33.2
## 328: 328 1 0 1 35.5
## 329: 329 1 1 0 42.2
## 330: 330 0 0 0 12.5</code></pre>
<p>Here are the counts and average outcomes for each <em>gender</em>, <em>age</em>, and <em>treatment</em> combination:</p>
<pre class="r"><code>dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)]</code></pre>
<pre><code>## male over65 rx n avg
## 1: 0 0 0 72 20.3
## 2: 0 0 1 72 30.3
## 3: 0 1 0 12 28.7
## 4: 0 1 1 12 38.5
## 5: 1 0 0 55 25.0
## 6: 1 0 1 56 34.3
## 7: 1 1 0 26 36.6
## 8: 1 1 1 25 45.7</code></pre>
</div>
<div id="categogorical-data" class="section level2">
<h2>Categorical data</h2>
<p>Finally, in previous versions, the <em>categorical</em> distribution generated a set of integer categories:</p>
<pre class="r"><code>def <- defData(varname = "grp", formula = ".4;.3;.2;.1", dist="categorical")
genData(1000, def)</code></pre>
<pre><code>## id grp
## 1: 1 1
## 2: 2 4
## 3: 3 4
## 4: 4 1
## 5: 5 2
## ---
## 996: 996 1
## 997: 997 4
## 998: 998 2
## 999: 999 4
## 1000: 1000 1</code></pre>
<p>Now, it is possible to generate specific values or string categories by using the <code>variance</code> argument:</p>
<pre class="r"><code>def <- defData(varname = "grp", formula = ".4;.3;.2;.1",
  variance = "a;b;c;d", dist = "categorical")</code></pre>
dd <- genData(1000, def)
dd</code></pre>
<pre><code>## id grp
## 1: 1 d
## 2: 2 b
## 3: 3 a
## 4: 4 c
## 5: 5 b
## ---
## 996: 996 d
## 997: 997 a
## 998: 998 b
## 999: 999 a
## 1000: 1000 d</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/index.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>To replicate Example 1 above, here is an alternative way to generate block sizes of 2 and 4 using the <em>categorical</em> distribution and the new functionality.</p>
<pre class="r"><code>defblk <- defData(varname = "blksize",
  formula = ".5;.5", variance = "2;4", dist = "categorical")</code></pre>
genData(1000, defblk)</code></pre>
<pre><code>## id blksize
## 1: 1 2
## 2: 2 4
## 3: 3 2
## 4: 4 4
## 5: 5 4
## ---
## 996: 996 2
## 997: 997 2
## 998: 998 2
## 999: 999 2
## 1000: 1000 2</code></pre>
</div>
<div id="possible-next-steps" class="section level2">
<h2>Possible next steps</h2>
<p>As we expand the functionality of <code>simstudy</code>, we realize that the current structure of the data definition table cannot always easily accommodate our new ideas. As a result, we have ended up shoehorning some solutions in non-intuitive ways. We are in the process of reconsidering that structure so that we won’t have (as many of) these awkward specifications in the future (though we will be making everything backwards compatible, so no worries there).</p>
</div>
Sample size requirements for a Bayesian factorial study design
https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/
Tue, 26 Oct 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/
<p>How do you determine sample size when the goal of a study is not to conduct a null hypothesis test but to provide an estimate of multiple effect sizes? I needed to get a handle on this for a recent grant submission, which I’ve been writing about over the past month, <a href="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/" target="_blank">here</a> and <a href="https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/" target="_blank">here</a>. (I provide a little more context for all of this in those earlier posts.) The statistical inference in the study will be based on the estimated posterior distributions from a Bayesian model, so it seems like we’d like those distributions to be as informative as possible. We need to set the sample size large enough to reduce the dispersion of those distributions to a helpful level.</p>
<p>Once I determined that I wanted to target the variance of the posterior distributions, it was just a matter of figuring out what that target should be and then simulating data to see what sample sizes could give us that target. I used the expected standard deviation (<span class="math inline">\(\sigma\)</span>) as the criterion for sample size selection.</p>
<div id="setting-the-target" class="section level3">
<h3>Setting the target</h3>
<p>To determine the target level of precision, I assessed the width of the posterior distribution under different standard deviations. In particular, I considered a posterior distribution with a mean OR = 1.25 <span class="math inline">\((log(OR) = 0.22)\)</span> and asked what standard deviation would ensure that <span class="math inline">\(P(log(OR) > 0) \ge 0.95\)</span>. The target OR is somewhat arbitrary, but seemed like a meaningful effect size based on discussions with my collaborators.</p>
<p>I did a quick search for the standard deviation that would yield a 95% threshold at or very close to 0. That is, 95% of the distribution should lie to the right of 0. Assuming that the target posterior distribution will be approximately <em>normal</em> with a mean of 0.22, I used the <code>qnorm</code> function to find the 95% thresholds for a range of standard deviations between 0.10 and 0.15.</p>
<pre class="r"><code>sd <- seq(.15, .10, by = -0.005)
cbind(sd, threshold = round(qnorm(.05, .22 , sd = sd), 3))</code></pre>
<pre><code>## sd threshold
## [1,] 0.150 -0.027
## [2,] 0.145 -0.019
## [3,] 0.140 -0.010
## [4,] 0.135 -0.002
## [5,] 0.130 0.006
## [6,] 0.125 0.014
## [7,] 0.120 0.023
## [8,] 0.115 0.031
## [9,] 0.110 0.039
## [10,] 0.105 0.047
## [11,] 0.100 0.056</code></pre>
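<p>Rather than scanning a grid, we can also solve directly for the standard deviation that places the threshold exactly at 0: since <code>qnorm(.05, .22, sd)</code> equals <span class="math inline">\(0.22 - \Phi^{-1}(0.95)\sigma\)</span>, setting this to 0 gives (printed here with the three significant digits used throughout this post):</p>
<pre class="r"><code>(sd_target <- 0.22 / qnorm(.95))</code></pre>
<pre><code>## [1] 0.134</code></pre>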
<p>It looks like the target standard deviation should be close to 0.135, which is also apparent from the plot of the 95% intervals centered at 0.22:</p>
<p><img src="https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/index.en_files/figure-html/unnamed-chunk-3-1.png" width="432" /></p>
</div>
<div id="using-simulation-to-establish-sample-size" class="section level3">
<h3>Using simulation to establish sample size</h3>
<p>The final step was to repeatedly simulate data sets under different sample size assumptions, fit models, and estimate the posterior distribution standard deviations associated with each data set (and sample size). I evaluated sample sizes ranging from 400 to 650 individuals, increasing in increments of 50. For each sample size, I generated 250 data sets, for a total of 1,500 data sets and model estimates. Given that each model estimation is quite resource intensive, I generated all the data and estimated the models using a high performance computing environment that provided me with 90 nodes and 4 processors on each node, so that the Bayesian MCMC processes could all run in parallel - a parallelization of parallel processes. In total, this took about 2 hours to run.</p>
<p>(I am including the code in the <a href="#addendum">addendum</a> below. The structure is similar to what I have <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">described</a> in the past on how one might do these types of explorations with simulated data and Bayesian modelling.)</p>
<p>Below is the output for a single data set to provide an example of the data being generated by the simulations. We have estimated seven log-odds ratios (see <a href="" target="_blank">here</a> for an explanation of why there are seven), and the simulation returns a summary of the posterior distribution for each: selected quantiles and the standard deviation.</p>
<pre><code>## n var p0.025 p0.25 p0.5 p0.75 p0.975 sd
## 1: 650 lOR[1] 0.113 0.269 0.354 0.441 0.603 0.126
## 2: 650 lOR[2] 0.410 0.569 0.652 0.735 0.892 0.123
## 3: 650 lOR[3] 0.427 0.585 0.667 0.752 0.906 0.123
## 4: 650 lOR[4] 0.367 0.526 0.608 0.691 0.851 0.122
## 5: 650 lOR[5] 0.436 0.592 0.675 0.757 0.913 0.122
## 6: 650 lOR[6] 0.703 0.861 0.945 1.028 1.180 0.122
## 7: 650 lOR[7] 0.722 0.886 0.969 1.054 1.214 0.125</code></pre>
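<p>A quick consistency check on this output: if the posterior is approximately normal, its standard deviation should be close to the width of the 95% interval divided by <span class="math inline">\(2 \times 1.96\)</span>. For <code>lOR[1]</code>:</p>
<pre class="r"><code>(0.603 - 0.113) / (2 * qnorm(.975))</code></pre>
<pre><code>## [1] 0.125</code></pre>
<p>which is quite close to the reported standard deviation of 0.126.</p>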
<p>The plot below shows the estimated standard deviations for a single log-odds ratio (in this case <span class="math inline">\(\lambda_4\)</span>), with a point for each of the 1,500 simulated data sets. At 550 subjects, the mean standard deviation (represented by the curve) is starting to get close to 0.135, but there is still quite a bit of uncertainty. To be safe, we might want to set the upper limit for the study at 600 patients, where we can be quite confident that the standard deviation will be low enough to meet our criterion (at 600 patients, almost 90% of the standard deviations from the simulations were below 0.135; at 650 patients, that proportion was over 98%).</p>
<p><img src="https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/index.en_files/figure-html/unnamed-chunk-5-1.png" width="480" /></p>
<p><a name="addendum"></a></p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>This code generates repeated data sets under different sample size assumptions and draws samples from the posterior distribution for each of those data sets. The simulations are set up to run on a high performance computing (HPC) environment, so multiple data sets can be generated and analyzed simultaneously. If you do not have access to an HPC, you can run locally using <code>lapply</code> or <code>mclapply</code> rather than <code>Slurm_lapply</code>, but unless you have an extremely powerful desktop or laptop, expect these kinds of simulations to take days rather than hours.</p>
<p>One particularly interesting feature of the data generation process used in these simulations is that the effect size parameters are not treated as fixed, but are themselves drawn from a distribution of parameters. Given that we are never certain about what the parameters should be in the data generation process, this adds an appropriate level of uncertainty that gets reflected in our target estimates. Being slightly conservative in our sample size selection helps account for this additional uncertainty; of course, how much uncertainty there is will depend on the situation.</p>
<pre class="r"><code>library(cmdstanr)
library(simstudy)
library(data.table)
library(posterior)
library(slurmR)
library(glue)
s_define <- function() {

  f <- "..t_0 + ..t_a*a + ..t_b*b + ..t_c*c +
    ..t_ab*a*b + ..t_ac*a*c + ..t_bc*b*c + ..t_abc*a*b*c"

  defY <- defDataAdd(varname = "y", formula = f, dist = "binary", link = "logit")

  return(list(defY = defY))
}

s_generate <- function(list_of_defs, argsvec) {

  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())

  # introducing uncertainty into the data generation process

  t_0 <- mu_int

  t_a <- rnorm(1, mu_a, .10)
  t_b <- rnorm(1, mu_b, .10)
  t_c <- rnorm(1, mu_c, .10)

  t_ab <- rnorm(1, mu_ab, .10)
  t_ac <- rnorm(1, mu_ac, .10)
  t_bc <- rnorm(1, mu_bc, .10)

  t_abc <- mu_abc

  dd <- genData(8 * n)
  dd <- addMultiFac(dd, nFactors = 3, colNames = c("a", "b", "c"))
  dd <- addColumns(defY, dd)

  return(dd)
}

s_model <- function(generated_data, mod) {

  dt_to_list <- function(dx) {

    N <- nrow(dx)
    x_abc <- model.matrix(~a*b*c, data = dx)
    y <- dx[, y]

    list(N = N, x_abc = x_abc, y = y)
  }

  fit <- mod$sample(
    data = dt_to_list(generated_data),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500,
    adapt_delta = 0.98,
    max_treedepth = 20,
    show_messages = FALSE
  )

  posterior <- data.frame(as_draws_rvars(fit$draws(variables = "lOR")))

  pcts <- c(.025, 0.25, .50, 0.75, .975)

  sumstats <- data.table(t(quantile(posterior$lOR, pcts)))
  setnames(sumstats, glue("p{pcts}"))
  sumstats$sd <- sd(posterior$lOR)
  sumstats$var <- glue("lOR[{1:7}]")

  return(sumstats) # model_results is a data.table
}

s_replicate <- function(argsvec, mod) {

  set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")

  list_of_defs <- s_define()
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data, mod)

  #--- summary statistics ---#

  summary_stats <- data.table(t(argsvec), model_results)

  return(summary_stats) # summary_stats is a data.table
}

#--- Set arguments ---#

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

n <- c(400, 450, 500, 550, 600, 650)

mu_int <- -1.4
mu_m <- 0.5
mu_x <- -0.3
mu_abc <- 0.3

scenarios <- scenario_list(n = n,
  mu_int = mu_int, mu_a = mu_m, mu_b = mu_m, mu_c = mu_m,
  mu_ab = mu_x, mu_ac = mu_x, mu_bc = mu_x, mu_abc = mu_abc)

scenarios <- rep(scenarios, each = 250)

#--- run on HPC ---#

set_cmdstan_path(path = "/gpfs/.../cmdstan/2.25.0")
smodel <- cmdstan_model("/gpfs/.../model_ind.stan")

job <- Slurm_lapply(
  X = scenarios,
  FUN = s_replicate,
  mod = smodel,
  njobs = min(90L, length(scenarios)),
  mc.cores = 4L,
  job_name = "i_ss",
  tmp_path = "/gpfs/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "12:00:00", partition = "cpu_short", `mem-per-cpu` = "4G"),
  export = c("s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

res <- Slurm_collect(job)
save(res, file = "/gpfs/.../post_ss.rda")</code></pre>
</div>
A Bayesian analysis of a factorial design focusing on effect size estimates
https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/
Tue, 12 Oct 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/
<p>Factorial study designs present a number of analytic challenges, not least of which is how best to understand whether simultaneously applying multiple interventions is beneficial. <a href="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/" target="_blank">Last time</a> I presented a possible approach that focuses on estimating the variance of effect size estimates using a Bayesian model. The scenario I used there focused on a hypothetical study evaluating two interventions with four different levels each. This time around, I am considering a proposed study to reduce emergency department (ED) use for patients living with dementia that I am actually involved with. This study would have three different interventions, but only two levels for each (i.e., yes or no), for a total of 8 arms. In this case, the model I proposed previously does not seem like it would work well: the posterior distributions from the variance-based model turn out to be bi-modal, making the findings quite difficult to interpret. So, I decided to turn the focus away from variance and emphasize the effect size estimates for each arm compared to control.</p>
<div id="model-specification" class="section level3">
<h3>Model specification</h3>
<p>As I mentioned, this is a case with three interventions (<span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, and <span class="math inline">\(c\)</span>), each of which has two levels; the full factorial design will have 8 arms:</p>
<span class="math display">\[\begin{aligned}
(1) \ a&=0, \ b=0, \ c=0 \\
(2) \ a&=1, \ b=0, \ c=0 \\
(3) \ a&=0, \ b=1, \ c=0 \\
(4) \ a&=0, \ b=0, \ c=1 \\
(5) \ a&=1, \ b=1, \ c=0 \\
(6) \ a&=1, \ b=0, \ c=1 \\
(7) \ a&=0, \ b=1, \ c=1 \\
(8) \ a&=1, \ b=1, \ c=1 \\
\end{aligned}\]</span>
<p>Although the proposed study is a cluster randomized trial, where each participating site will be assigned to one of the eight arms, I am simplifying things here a bit by assuming each individual patient <span class="math inline">\(i\)</span> will be randomized to each of <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, and <span class="math inline">\(c\)</span>, with <span class="math inline">\(a_i \in \{0,1\}\)</span>, <span class="math inline">\(b_i \in \{0,1\}\)</span>, and <span class="math inline">\(c_i \in \{0,1\}\)</span>.</p>
<p>Here is a model for outcome <span class="math inline">\(y_i\)</span>, a binary measure <span class="math inline">\((y_i \in {0,1})\)</span>, where the log-odds of the outcome for each patient is a function of the random assignment:</p>
<p><span class="math display">\[
y_{i} \sim \text{binomial}\left(p_{i}\right)
\]</span></p>
<p><span class="math display">\[
\text{log}\left( \frac{p_{i}}{1-p_{i}}\right) = \tau_0 + \tau_a a_i + \tau_b b_i + \tau_c c_i + \tau_{ab} a_i b_i + \tau_{ac} a_i c_i + \tau_{bc} b_i c_i + \tau_{abc}a_i b_i c_i
\]</span></p>
<p>This is just a standard logistic model specification, where the parameters can be interpreted as log-odds ratios. For example, <span class="math inline">\(\lambda_b = \tau_b\)</span> is the log odds ratio comparing patients randomized to receive only <span class="math inline">\(b\)</span> (group 3 from above) with the control arm where patients receive none of the interventions (group 1), and <span class="math inline">\(\lambda_{ac} = \tau_a + \tau_c + \tau_{ac}\)</span> is the log odds ratio comparing patients randomized to only <span class="math inline">\(a\)</span> and <span class="math inline">\(c\)</span> but not <span class="math inline">\(b\)</span> (group 6) compared with the control patients (group 1). This is the full set of log odds ratios for this design:</p>
<span class="math display">\[\begin{aligned}
\lambda_a &= \tau_a \\
\lambda_b &= \tau_b \\
\lambda_c &= \tau_c \\
\lambda_{ab} &= \tau_a + \tau_b + \tau_{ab} \\
\lambda_{ac} &= \tau_a + \tau_c + \tau_{ac} \\
\lambda_{bc} &= \tau_b + \tau_c + \tau_{bc} \\
\lambda_{abc} &= \tau_a + \tau_b + \tau_c + \tau_{ab} + \tau_{ac} + \tau_{bc} + \tau_{abc} \\
\end{aligned}\]</span>
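<p>Since the mapping from the <span class="math inline">\(\tau\text{'s}\)</span> to the <span class="math inline">\(\lambda\text{'s}\)</span> is linear, all seven log odds ratios can be computed at once with a contrast matrix - applied, for example, draw-by-draw to the posterior samples. This is just an illustrative base R sketch, with hypothetical <span class="math inline">\(\tau\)</span> values:</p>
<pre class="r"><code># rows: lambda_a, ..., lambda_abc
# columns: tau_a, tau_b, tau_c, tau_ab, tau_ac, tau_bc, tau_abc
C <- rbind(
  a   = c(1, 0, 0, 0, 0, 0, 0),
  b   = c(0, 1, 0, 0, 0, 0, 0),
  c   = c(0, 0, 1, 0, 0, 0, 0),
  ab  = c(1, 1, 0, 1, 0, 0, 0),
  ac  = c(1, 0, 1, 0, 1, 0, 0),
  bc  = c(0, 1, 1, 0, 0, 1, 0),
  abc = c(1, 1, 1, 1, 1, 1, 1)
)

tau <- c(0.5, 0.7, 0.0, -0.3, 0.0, 0.0, 0.0) # hypothetical values
drop(C %*% tau)</code></pre>
<pre><code>##   a   b   c  ab  ac  bc abc 
## 0.5 0.7 0.0 0.9 0.5 0.7 0.9</code></pre>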
<p>The focus of the analysis is to estimate posterior probability distributions for the <span class="math inline">\(\lambda\text{'s}\)</span>, and possibly to compare across the <span class="math inline">\(\lambda\text{'s}\)</span> (also using posterior distributions) to assess whether combining multiple interventions seems beneficial.</p>
</div>
<div id="prior-distribution-assumptions" class="section level3">
<h3>Prior distribution assumptions</h3>
<p>Here are the prior distribution assumptions for the parameters in the Bayesian model:</p>
<span class="math display">\[\begin{aligned}
\tau_0 &\sim N(\mu=0, \sigma = 1) \\
\tau_a, \tau_b, \tau_c &\sim N(\mu = \delta_m, \sigma = \sigma_m) \\
\tau_{ab}, \tau_{ac}, \tau_{bc} &\sim N(\mu = \delta_x, \sigma = \sigma_x) \\
\tau_{abc} &\sim N(\mu = 0, \sigma = 1) \\
\delta_m &\sim N(\mu = 0, \sigma = 1) \\
\sigma_m &\sim t_\text{student}(\text{df}=3, \mu=0, \sigma = 2.5), \ \sigma_m \ge 0 \\
\delta_x &\sim N(0, 1) \\
\sigma_x &\sim t_\text{student}(\text{df}=3, \mu = 0, \sigma = 2.5), \ \sigma_x \ge 0 \\
\end{aligned}\]</span>
<p>While the focus of this model estimation is different from the approach I discussed <a href="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/" target="_blank">last time</a>, the prior distributions here share a key element with the earlier model. The priors for the main effects <span class="math inline">\(\tau_a, \ \tau_b, \text{ and } \tau_c\)</span> share a common mean <span class="math inline">\(\delta_m\)</span> and standard deviation <span class="math inline">\(\sigma_m\)</span>. Likewise, the prior distributions for the pair-wise interaction effects share a common mean <span class="math inline">\(\delta_x\)</span> and standard deviation <span class="math inline">\(\sigma_x\)</span>. These four <em>hyperparameters</em> are estimated from the data. The prior distributions for the mean intervention effects <span class="math inline">\(\delta_m\)</span> and <span class="math inline">\(\delta_x\)</span> are specified with an eye towards conservatism or skepticism, with a large portion of the distribution centered around 0. The priors for the variance parameters are more diffuse (using a <span class="math inline">\(t\)</span>-distribution with 3 degrees of freedom, a compromise between a <em>Cauchy</em> distribution with very broad tails and a <em>normal</em> distribution with more constrained tails).</p>
<p>Statistical inference will be based on an examination of the posterior distributions for the log odds ratios comparing each of the treatment combinations with the control arm where none of the interventions is implemented. We can also compare across different combinations to assess if one particular combination seems to be stronger than another. Since we are not using a null-hypothesis testing framework and the effect estimates are pooled across the interventions, adjustments for multiple testing are not necessary. (In the future, I can show results of the experiments where I explored the operating characteristics of these models. Because of the pooling and shrinkage that is built into the model, there are no inflated type 1 errors, analogous to the situation where I <a href="https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/" target="_blank">evaluated</a> Bayesian methods for subgroup analysis.)</p>
</div>
<div id="data-definition-and-generation" class="section level3">
<h3>Data definition and generation</h3>
<p>Here are the libraries needed for the simulation, model estimation, and presentation of results:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(glue)
library(ggplot2)
library(cowplot)
library(ggdist)
library(paletteer)</code></pre>
<p>In this simulation, the log odds for the outcome in the control group has been set at -1.4, corresponding to odds = exp(-1.4) &asymp; 0.25, and probability of outcome = 1/(1 + exp(1.4)) &asymp; 20%. Here are the log-odds ratios that I assumed for each of the different arms with at least one treatment assignment:</p>
<span class="math display">\[\begin{aligned}
\lambda_a &= 0.5 \\
\lambda_b &= 0.7 \\
\lambda_c &= 0.0 \\
\lambda_{ab} &= 0.5 + 0.7 - 0.3 = 0.9 \\
\lambda_{ac} &= 0.5 + 0.0 + 0.0 = 0.5 \\
\lambda_{bc} &= 0.7 + 0.0 + 0.0 = 0.7 \\
\lambda_{abc} &= 0.5 + 0.7 + 0.0 - 0.3 + 0.0 + 0.0 + 0.0 = 0.9 \\
\end{aligned}\]</span>
<pre class="r"><code>f <- "..t_0 + ..t_a*a + ..t_b*b + ..t_c*c +
      ..t_ab*a*b + ..t_ac*a*c + ..t_bc*b*c + ..t_abc*a*b*c"

defY <- defDataAdd(varname = "y", formula = f, dist = "binary", link = "logit")

t_0 <- -1.4

t_a <- 0.5
t_b <- 0.7
t_ab <- -0.3

t_c <- t_ac <- t_bc <- t_abc <- 0.0</code></pre>
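<p>As a sanity check on these assumptions, we can convert the log odds for each arm back to the probability scale with <code>plogis</code>. (This check is my addition and is not needed for the simulation itself.)</p>

```r
# Implied event probabilities for all eight arms, using the assumed parameters
t_0 <- -1.4
t_a <- 0.5; t_b <- 0.7; t_ab <- -0.3
t_c <- t_ac <- t_bc <- t_abc <- 0.0

arms <- expand.grid(a = 0:1, b = 0:1, c = 0:1)
arms$logodds <- with(arms,
  t_0 + t_a*a + t_b*b + t_c*c + t_ab*a*b + t_ac*a*c + t_bc*b*c + t_abc*a*b*c)
arms$p <- plogis(arms$logodds)   # inverse logit

round(arms$p[arms$a == 0 & arms$b == 0 & arms$c == 0], 2)  # control arm: 0.2
```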
<p>4000 patients will be randomized to the eight arms, 500 in each:</p>
<pre class="r"><code>set.seed(37159)
dd <- genData(8*500)
dd <- addMultiFac(dd, nFactors = 3, colNames = c("a", "b", "c"))
dd <- addColumns(defY, dd)
dd</code></pre>
<pre><code>## id a b c y
## 1: 1 1 0 0 0
## 2: 2 1 0 0 0
## 3: 3 1 0 0 1
## 4: 4 1 0 0 0
## 5: 5 0 0 1 0
## ---
## 3996: 3996 1 1 1 0
## 3997: 3997 1 1 0 0
## 3998: 3998 1 0 1 0
## 3999: 3999 0 0 1 0
## 4000: 4000 0 1 0 1</code></pre>
<p>Here are the observed proportions by treatment arm. The fact that the two panels (<span class="math inline">\(c = 0\)</span> and <span class="math inline">\(c = 1\)</span>) are pretty similar is an indication that intervention <span class="math inline">\(c\)</span> has no impact. And the fact that the lines are not parallel in each panel is an indication that there is some interaction (in this case negative).</p>
<p><img src="https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/index.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="model-fitting" class="section level3">
<h3>Model fitting</h3>
<p>The Bayesian sampling is using four chains of length 2,500 (following 1,000 warm-up iterations for each), so the posterior distribution will be estimated with 10,000 total samples. The code for the <code>Stan</code> model can be found in the <a href="#addendum">addendum</a>.</p>
<pre class="r"><code>dt_to_list <- function(dx) {
  N <- nrow(dx)
  x_abc <- model.matrix(~ a * b * c, data = dx)
  y <- dx[, y]
  
  list(N = N, x_abc = x_abc, y = y)
}

mod <- cmdstan_model("code/model_ind.stan")

fit <- mod$sample(
  data = dt_to_list(dd),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 2500,
  adapt_delta = 0.98,
  max_treedepth = 20,
  show_messages = FALSE,
  seed = 29817
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 113.5 seconds.
## Chain 3 finished in 124.9 seconds.
## Chain 4 finished in 129.5 seconds.
## Chain 2 finished in 130.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 124.5 seconds.
## Total execution time: 130.5 seconds.</code></pre>
</div>
<div id="presenting-the-results" class="section level3">
<h3>Presenting the results</h3>
<p>Here is the code for the first plot, which shows the distribution of effect sizes (on the log-odds scale) for each of the intervention arms. I’ve extracted the samples using the <code>posterior</code> package function <code>as_draws_rvars</code> that I recently described <a href="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/" target="_blank">here</a>.</p>
<pre class="r"><code>posterior <- data.frame(as_draws_rvars(fit$draws(variables = "lOR")))

pcts <- c(.025, 0.25, .50, 0.75, .975)
sumstats <- data.table(t(quantile(posterior$lOR, pcts)))
setnames(sumstats, glue("p{pcts}"))
sumstats$var <- glue("lOR[{1:7}]")

p <- ggplot(data = sumstats, aes(y = var, yend = var)) +
  geom_vline(xintercept = 0, color = "grey85") +
  geom_segment(aes(x = p0.025, xend = p0.975)) +
  geom_segment(aes(x = p0.25, xend = p0.75),
    size = 1.25, color = palettes_d$wesanderson$Moonrise2[2]) +
  geom_point(aes(x = p0.5), size = 2.5) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(margin = margin(t = 0, r = -12, b = 0, l = 0)),
        plot.title = element_text(size = 10, face = "bold")
  ) +
  ylab("treatment assignments (three interventions)") +
  xlab("log odds ratio") +
  xlim(-.5, 1.5) +
  ggtitle("Posterior distribution of log OR by treatment assignment")

pimage <- axis_canvas(p, axis = 'y') +
  draw_image("r_icons/r111.png", y = 6.5, scale = 0.35) +
  draw_image("r_icons/r011.png", y = 5.5, scale = 0.35) +
  draw_image("r_icons/r101.png", y = 4.5, scale = 0.35) +
  draw_image("r_icons/r110.png", y = 3.5, scale = 0.35) +
  draw_image("r_icons/r001.png", y = 2.5, scale = 0.35) +
  draw_image("r_icons/r010.png", y = 1.5, scale = 0.35) +
  draw_image("r_icons/r100.png", y = 0.5, scale = 0.35)</code></pre>
<p>Looking at the figure, it is apparent that <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> likely had an effect, while <span class="math inline">\(c\)</span> probably did not. It also appears that the combination of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> might be an improvement, both with and without <span class="math inline">\(c\)</span>:</p>
<pre class="r"><code>ggdraw(insert_yaxis_grob(p, pimage, position = "left", width = grid::unit(.17, "null")))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/index.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>In the next and last plot, my goal is to compare the log-odds ratios of the different arms. I am showing the posterior distributions for the differences between the estimated log-odds ratios. In this particular data set, <span class="math inline">\(a\)</span> does not look any different from <span class="math inline">\(b\)</span>, but the combination of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> does indeed look superior to either alone, regardless of whether <span class="math inline">\(c\)</span> is involved:</p>
<pre class="r"><code>data <- with(posterior, data.frame(
  x = c(
    "(1) b vs a",
    "(2) ab vs a",
    "(3) ab vs b",
    "(4) abc vs ab",
    "(5) abc vs ac",
    "(6) abc vs bc"
  ),
  diff = c(
    lOR[2] - lOR[1],
    lOR[4] - lOR[1],
    lOR[4] - lOR[2],
    lOR[7] - lOR[4],
    lOR[7] - lOR[5],
    lOR[7] - lOR[6]
  )
))

ggplot(data = data, aes(dist = diff, x = x)) +
  geom_hline(yintercept = 0, color = "grey80", size = .3) +
  stat_dist_eye(fill = palettes_d$wesanderson$Moonrise2[1], position = "dodge") +
  theme(panel.grid = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank()) +
  ylab("difference")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/index.en_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
<p>Ultimately, how we present the data and draw our conclusions will depend on what we specify up front regarding the parameters and comparisons of interest. The great thing about a Bayesian model is that we have estimated everything in a single model, so there are no real concerns with multiple comparisons. However, reviewers still like to see results for analyses that were pre-specified. And if a decision is to be made based on those results, those decision rules should be pre-specified. But, my preference would be to show the findings and let readers decide if the results are compelling and/or determine if a more focused trial is needed.</p>
<p>In the next (and most likely, for now at least, final) post on this topic, I plan on <a href="https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/" target="_blank">describing</a> how I approached sample size estimation for this proposed study.</p>
<p><a name="addendum"></a></p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<pre class="stan"><code>data {
  int<lower=0> N;                          // number of patients
  matrix<lower=0, upper=1>[N, 8] x_abc;    // design matrix
  int<lower=0, upper=1> y[N];              // outcome for individual i
}

parameters {
  vector[8] z;
  real delta_m;
  real<lower=0> sigma_m;
  real delta_x;
  real<lower=0> sigma_x;
}

transformed parameters {
  vector[8] tau;
  
  tau[1] = z[1];
  
  for (i in 2:4) {
    tau[i] = sigma_m * z[i] + delta_m;
  }
  
  for (i in 5:7) {
    tau[i] = sigma_x * z[i] + delta_x;
  }
  
  tau[8] = z[8];
}

model {
  sigma_m ~ student_t(3, 0, 2.5);
  sigma_x ~ student_t(3, 0, 2.5);
  delta_m ~ normal(0, 1);
  delta_x ~ normal(0, 1);
  z ~ std_normal();
  
  y ~ bernoulli_logit(x_abc * tau);
}

generated quantities {
  real lOR[7];
  
  lOR[1] = tau[2];                     // a=1, b=0, c=0
  lOR[2] = tau[3];                     // a=0, b=1, c=0
  lOR[3] = tau[4];                     // a=0, b=0, c=1
  lOR[4] = tau[2] + tau[3] + tau[5];   // a=1, b=1, c=0
  lOR[5] = tau[2] + tau[4] + tau[6];   // a=1, b=0, c=1
  lOR[6] = tau[3] + tau[4] + tau[7];   // a=0, b=1, c=1
  lOR[7] = tau[2] + tau[3] + tau[4] + tau[5] + tau[6] + tau[7] + tau[8]; // a=1, b=1, c=1
}</code></pre>
</div>
Analyzing a factorial design by focusing on the variance of effect sizes
https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/
Tue, 28 Sep 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/
<script src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/header-attrs/header-attrs.js"></script>
<p>Way back in 2018, long before the pandemic, I <a href="https://www.rdatagen.net/post/testing-many-interventions-in-a-single-experiment/" target="_blank">described</a> a soon-to-be implemented <code>simstudy</code> function <code>genMultiFac</code> that facilitates the generation of multi-factorial study data. I <a href="https://www.rdatagen.net/post/so-how-efficient-are-multifactorial-experiments-part/" target="_blank">followed up</a> that post with a description of how we can use these types of efficient designs to answer multiple questions in the context of a single study.</p>
<p>Fast forward three years, and I am thinking about these designs again for a new grant application that proposes to study simultaneously three interventions aimed at reducing emergency department (ED) use for people living with dementia. The primary interest is to evaluate each intervention on its own terms, but also to assess whether any combinations seem to be particularly effective. While this will be a fairly large cluster randomized trial with about 80 EDs being randomized to one of the 8 possible combinations, I was concerned about our ability to estimate the interaction effects of multiple interventions with sufficient precision to draw useful conclusions, particularly if the combined effects of two or three interventions are less than additive. (That is, two interventions may be better than one, but not twice as good.)</p>
<p>I am thinking that a null hypothesis testing framework might not be so useful here, given that the various estimates could be highly uncertain, not to mention the multiple statistical tests that we would need to conduct (and presumably adjust for). Rather, a Bayesian approach that pools estimates across interventions and provides posterior probability distributions may offer more insight into how the interventions interact, and could be a better way to go.</p>
<p>With this in mind, I went to the literature, and I found these papers by <a href="https://journals.sagepub.com/doi/full/10.1177/0193841X18818903" target="_blank"><em>Kassler et al</em></a> and <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-33/issue-1/Analysis-of-variancewhy-it-is-more-important-than-ever/10.1214/009053604000001048.full" target="_blank"><em>Gelman</em></a>. They both describe a way of thinking about interaction that emphasizes the estimates of variance across effect estimands. I went ahead and tested the idea with simulated data, which I’m showing here. Ultimately, I decided that this approach will not work so well for our study, and I came up with a pretty simple solution that I will share next time.</p>
<div id="identifying-interaction-through-variance" class="section level3">
<h3>Identifying interaction through variance</h3>
<p>The scenarios described by both papers involve studies that may be evaluating many possible interventions or exposures, each of which may have two or more levels. If we are dealing with a normally distributed (continuous) outcome measure, we can model that outcome as</p>
<p><span class="math display">\[
y_{i} \sim N\left(\mu = \tau_0 + \tau^1_{j_{1_i}} + \dots + \tau^k_{j_{k_i}} + \tau^{12}_{j_{12_i}} + \dots + \tau^{k-1, k}_{j_{k-1,k_i}} + \tau^{123}_{123_i} + \dots + \tau^{k-2, k-1, k}_{k-2, k-1, k_i} + \dots, \ \sigma = \sigma_0\right),
\]</span></p>
<p>where there are <span class="math inline">\(K\)</span> interventions, and intervention <span class="math inline">\(k\)</span> has <span class="math inline">\(j_k\)</span> levels. So, if intervention <span class="math inline">\(3\)</span> has 4 levels, <span class="math inline">\(j_3 \in \{1,2,3,4\}.\)</span> <span class="math inline">\(\tau_0\)</span> is effectively the grand mean. <span class="math inline">\(\tau^k_1, \tau^k_2, \dots, \tau^k_{j_k},\)</span> are the mean contributions for the <span class="math inline">\(k\)</span>th intervention, and we constrain <span class="math inline">\(\sum_{m=1}^{j_k} \tau^k_m = 0.\)</span> Again, for intervention <span class="math inline">\(3\)</span>, we would have <span class="math inline">\(\tau^3_1 \dots, \tau^3_4,\)</span> with <span class="math inline">\(\sum_{m=1}^{4} \tau^3_m = 0.\)</span></p>
<p>The adjustments made for the two-way interactions are represented by the <span class="math inline">\(\tau^{12}\)</span>’s through the <span class="math inline">\(\tau^{k-1,k}\)</span>’s. If intervention 5 has <span class="math inline">\(2\)</span> levels then for the interaction between interventions 3 and 5 we have <span class="math inline">\(\tau^{35}_{11}, \tau^{35}_{12}, \tau^{35}_{21}, \dots, \tau^{35}_{42}\)</span> and <span class="math inline">\(\sum_{m=1}^4 \sum_{n=1}^2 \tau^{35}_{m,n} = 0.\)</span></p>
<p>This pattern continues for higher orders of interaction (i.e. 3-way, 4-way, etc.).</p>
<p>In the Bayesian model, each set of <span class="math inline">\(\tau_k\)</span>’s shares a common prior distribution with mean 0 and standard deviation <span class="math inline">\(\sigma_k\)</span>:</p>
<p><span class="math display">\[
\tau^k_1, \dots, \tau^k_{j_k} \sim N(\mu = 0, \sigma = \sigma_k),
\]</span>
where <span class="math inline">\(\sigma_k\)</span> is a hyperparameter that will be estimated from the data. The same is true for the interaction terms for interventions <span class="math inline">\(k\)</span> and <span class="math inline">\(l\)</span>:</p>
<p><span class="math display">\[
\tau^{kl}_{11}, \dots, \tau^{kl}_{j_k, j_l} \sim N(\mu = 0, \sigma = \sigma_{kl}), \ \ \text{where } k < l
\]</span></p>
<p>To assess whether there is interaction between the interventions (i.e. the effects are not merely additive), we are actually interested in the variance parameters of the interaction <span class="math inline">\(\tau\text{'s}\)</span>. If, for example, there is no interaction between different levels of interventions 3 and 5, then <span class="math inline">\(\sigma_{35}\)</span> should be close to <span class="math inline">\(0\)</span>, implying that <span class="math inline">\(\tau^{35}_{11} \approx \tau^{35}_{12} \approx \dots \approx \tau^{35}_{42} \approx 0\)</span>. On the other hand, if there is some interaction effect, then <span class="math inline">\(\sigma_{35} > 0,\)</span> implying that at least one <span class="math inline">\(\tau^{35} \neq 0.\)</span></p>
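<p>The connection between interaction and the variance of the interaction terms can be made concrete with a small numeric example (my illustration, not from the papers): under pure additivity, the interaction residuals of a table of cell means are exactly zero, so their variance is zero as well.</p>

```r
# Under pure additivity, cell means are row effect + column effect, so the
# interaction residuals (and their standard deviation) are exactly zero.
row_eff <- c(-8, -1, 3, 6)
col_eff <- c(-3, -1, 0, 4)
additive <- outer(row_eff, col_eff, `+`)

# perturb one cell to induce interaction
cells <- additive
cells[2, 2] <- cells[2, 2] + 4

# interaction residuals: what remains after removing grand, row, and column means
inter <- function(m) {
  sweep(sweep(m - mean(m), 1, rowMeans(m) - mean(m)), 2, colMeans(m) - mean(m))
}

sd(inter(additive))   # 0
sd(inter(cells)) > 0  # TRUE
```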
<p>One advantage of the proposed Bayesian model is that we can use partial pooling to get more precise estimates of the variance terms. By this, I mean that we can use information from each <span class="math inline">\(\sigma^{kl}\)</span> to inform the others. So, in the case of 2-way interaction, the prior probability assumption would suggest that the variance terms were drawn from a common distribution:</p>
<p><span class="math display">\[
\sigma^{12}, \sigma^{13}, \dots, \sigma^{k-1,k} \sim N(\mu = 0, \sigma = \sigma_{\text{2-way}})
\]</span></p>
<p>We can impose more structure (and hopefully precision) by doing the same for the main effects:</p>
<p><span class="math display">\[
\sigma^{1}, \sigma^{2}, \dots, \sigma^{k} \sim N(\mu = 0, \sigma = \sigma_{\text{main}})
\]</span></p>
<p>Of course, for each higher order interaction (above 2-way), we could impose the same structure:</p>
<p><span class="math display">\[
\sigma^{123}, \dots, \sigma^{12k}, \dots, \sigma^{k-2, k-1, k} \sim N(\mu = 0, \sigma = \sigma_{\text{3-way}})
\]</span></p>
<p>And so on. Though at some point, we might want to assume that there is no higher order interaction and exclude it from the model; in most cases, we could stop at 2- or 3-way interaction and probably not sacrifice too much.</p>
</div>
<div id="example-from-simulation" class="section level3">
<h3>Example from simulation</h3>
<p>When I set out to explore this model, I started relatively simple, using only two interventions with four levels each. In this case, the factorial study would have 16 total arms <span class="math inline">\((4 \times 4)\)</span>. (Since I am using only 2 interventions, I am changing the notation slightly, using interventions <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> rather than <span class="math inline">\(1\)</span> and <span class="math inline">\(2\)</span>.) Individual <span class="math inline">\(i\)</span> is randomized to one level of <span class="math inline">\(a\)</span> and one level of <span class="math inline">\(b\)</span>, and <span class="math inline">\(a_i \in \{1,2,3,4\}\)</span> and <span class="math inline">\(b_i\in \{1,2,3,4\}\)</span>, and <span class="math inline">\(ab_i \in \{11, 12, 13, 14, 21, 22, \dots, 44\}.\)</span> Using the same general model from above, here is the specific model for continuous <span class="math inline">\(y\)</span>:</p>
<p><span class="math display">\[
y_{i} \sim N\left(\mu = \tau_0 + \tau^a_{a_i} + \tau^b_{b_i} + \tau^{ab}_{ab_i}, \ \sigma = \sigma_0\right)
\]</span></p>
<p><br></p>
<p>Take note that we only have a single set of 2-way interactions since there are only two groups of interventions. Because of this, there is no need for a <span class="math inline">\(\sigma_{\text{2-way}}\)</span> hyperparameter; however, there is a hyperparameter <span class="math inline">\(\sigma_{\text{main}}\)</span> to pool across the main effects of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. Here are the prior distribution assumptions:</p>
<span class="math display">\[\begin{aligned}
\tau_0 &\sim N(0, 5) \\
\tau^a_1, \tau^a_2, \tau^a_3, \tau^a_4 &\sim N(0, \sigma_a) \\
\tau^b_1, \tau^b_2, \tau^b_3, \tau^b_4 &\sim N(0, \sigma_b) \\
\tau^{ab}_{11}, \tau^{ab}_{12}, \dots \tau^{ab}_{44} &\sim N(0, \sigma_{ab}) \\
\sigma_a, \sigma_b &\sim N(0, \sigma_\text{main}) \\
\sigma_{ab} &\sim N(0, 5) \\
\sigma_\text{main} &\sim N(0, 5) \\
\sigma &\sim N(0,5)
\end{aligned}\]</span>
<p>In order to ensure identifiability, we have the following constraints:</p>
<span class="math display">\[\begin{aligned}
\tau^a_1 + \tau^a_2 + \tau^a_3 + \tau^a_4 &= 0 \\
\tau^b_1 + \tau^b_2 + \tau^b_3 + \tau^b_4 &= 0 \\
\tau^{ab}_{11} + \tau^{ab}_{12} + \dots + \tau^{ab}_{43} + \tau^{ab}_{44} &= 0
\end{aligned}\]</span>
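<p>These constraints can be enforced the same way the <code>Stan</code> code in the addendum does with <code>append_row</code>: estimate only the free parameters and set the last one to the negative sum. Here is a base R sketch of that device:</p>

```r
# Sum-to-zero construction: 3 of the 4 main effect parameters are free;
# the fourth is determined by the constraint.
constrain <- function(z_raw) c(z_raw, -sum(z_raw))

tau_a <- constrain(c(-8, -1, 3))   # returns c(-8, -1, 3, 6)
sum(tau_a)                         # 0 by construction

# same device for the 4 x 4 interaction: 15 free parameters, one determined
tau_ab <- constrain(rnorm(15))
sum(tau_ab)                        # 0 (up to floating point)
```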
</div>
<div id="required-libraries" class="section level3">
<h3>Required libraries</h3>
<pre class="r"><code>library(simstudy)
library(data.table)
library(cmdstanr)
library(caret)
library(posterior)
library(bayesplot)
library(ggdist)
library(glue)</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The parameters <span class="math inline">\(\tau_0, \tau_a, \tau_b, \text{ and } \tau_{ab}\)</span> are set so that there is greater variation in treatment <span class="math inline">\(a\)</span> compared to treatment <span class="math inline">\(b\)</span>. In both cases, the sum of the parameters is set to <span class="math inline">\(0\)</span>.</p>
<pre class="r"><code>t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(-3, -1, 0, 4)</code></pre>
<p>The interaction is set in this case so that there is an added effect when both <span class="math inline">\(a=2 \ \& \ b=2\)</span> and <span class="math inline">\(a=3 \ \& \ b=2\)</span>. Again, the parameters are set so that the <em>sum-to-zero</em> constraint is maintained.</p>
<pre class="r"><code>x <- c(4, 3)
nox <- - sum(x) / (16 - length(x))
t_ab <- matrix(c(nox, nox, nox, nox,
nox, 4, nox, nox,
nox, 3, nox, nox,
nox, nox, nox, nox), nrow = 4, byrow = TRUE)
t_ab</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] -0.5 -0.5 -0.5 -0.5
## [2,] -0.5 4.0 -0.5 -0.5
## [3,] -0.5 3.0 -0.5 -0.5
## [4,] -0.5 -0.5 -0.5 -0.5</code></pre>
<pre class="r"><code>sum(t_ab)</code></pre>
<pre><code>## [1] 0</code></pre>
<p>The data definition for the outcome <span class="math inline">\(y\)</span> is established using the <code>simstudy</code> package:</p>
<pre class="r"><code>d1 <- defDataAdd(varname = "y", formula = "mu", variance = 16, dist = "normal")</code></pre>
<p>Now we are ready to generate the data:</p>
<pre class="r"><code>set.seed(110)
dd <- genMultiFac(nFactors = 2, levels = 4, each = 30, colNames = c("a", "b"))
dd[, mu := t_0 + t_a[a] + t_b[b] + t_ab[a, b], keyby = id]
dd <- addColumns(d1, dd)</code></pre>
<p><br></p>
<div id="plot-of-bary-by-arm" class="section level4">
<h4>Plot of <span class="math inline">\(\bar{y}\)</span> by arm</h4>
<p>The plot shows the average outcomes by arm. The interaction when <span class="math inline">\(a=2 \ \&amp; \ b=2\)</span> and <span class="math inline">\(a=3 \ \&amp; \ b=2\)</span> is apparent in the two locations where the smooth pattern of increases is interrupted.</p>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
</div>
</div>
<div id="sampling-from-the-posterior" class="section level3">
<h3>Sampling from the posterior</h3>
<p>The function shown next simply generates the data needed by <code>Stan</code>. (The <code>Stan</code> implementation is shown below in the <a href="#addendum">addendum</a>.) Take note that we convert the <span class="math inline">\(\tau_{ab}\)</span> design matrix of 0’s and 1’s to a single vector with values ranging from 1 to 16.</p>
<pre class="r"><code>dt_to_list <- function(dx) {
  
  dx[, a_f := factor(a)]
  dx[, b_f := factor(b)]
  
  dv <- dummyVars(~ b_f:a_f, data = dx, n = c(4, 4))
  dp <- predict(dv, dx)
  
  N <- nrow(dx)        # number of observations
  I <- 2               # number of interventions
  X2 <- 1              # number of 2-way interactions
  main <- as.matrix(dx[, .(a, b)])
  ab <- as.vector(dp %*% c(1:16))
  x <- as.matrix(ab, nrow = N, ncol = X2)
  y <- dx[, y]
  
  list(N = N, I = I, X2 = X2, main = main, x = x, y = y)
}</code></pre>
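<p>For reference, the 1-to-16 cell index produced by this matrix multiplication can also be computed directly from the two factor levels, with <span class="math inline">\(b\)</span> varying fastest within <span class="math inline">\(a\)</span>; this ordering appears consistent with the later reading of <em>t_x[1,6]</em> as <span class="math inline">\(\tau_{22}\)</span> and <em>t_x[1,10]</em> as <span class="math inline">\(\tau_{32}\)</span>. (The helper below is my shorthand, not part of the original code.)</p>

```r
# direct cell index for a 4 x 4 factorial, assuming b varies fastest within a
cell_index <- function(a, b, n_b = 4) (a - 1) * n_b + b

cell_index(2, 2)   # 6,  matching t_x[1,6]  (tau_22)
cell_index(3, 2)   # 10, matching t_x[1,10] (tau_32)
cell_index(4, 4)   # 16
```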
<p>I am using <code>cmdstanr</code> to interact with <code>Stan</code>:</p>
<pre class="r"><code>mod <- cmdstan_model("code/model_2_factors.stan", force_recompile = TRUE)

fit <- mod$sample(
  data = dt_to_list(dd),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500,
  adapt_delta = 0.99,
  step_size = .05,
  max_treedepth = 20,
  seed = 1721
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 17.5 seconds.
## Chain 1 finished in 19.8 seconds.
## Chain 3 finished in 22.1 seconds.
## Chain 4 finished in 25.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 21.2 seconds.
## Total execution time: 25.5 seconds.</code></pre>
</div>
<div id="diagnostic-checks" class="section level3">
<h3>Diagnostic checks</h3>
<p>Here is just one set of trace plots for <span class="math inline">\(\tau^a_1, \dots, \tau^a_4\)</span> that indicates the sampling went quite well - the variables not shown were equally well-behaved.</p>
<pre class="r"><code>posterior <- as_draws_array(fit$draws())
mcmc_trace(posterior, pars = glue("t[{1},{1:4}]"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-9-1.png" width="576" /></p>
</div>
<div id="variance-estimates" class="section level3">
<h3>Variance estimates</h3>
<p>Since we are focused on the possibility of 2-way interaction, the primary parameter of interest is <span class="math inline">\(\sigma_{ab},\)</span> the variation of the interaction effects. (In the <code>Stan</code> model specification this variance parameter is <em>sigma_x</em>, as in interaction.) The plot shows the 95% credible intervals for each of the main effect variance parameters as well as the interaction variance parameter.</p>
<p>The fact that the two main effect variance parameters (<span class="math inline">\(\sigma_a\)</span> and <span class="math inline">\(\sigma_b\)</span>) are greater than zero supports the data generation process which assumed different outcomes for different levels of interventions <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>, respectively.</p>
<p>And the credible interval for <span class="math inline">\(\sigma_{ab}\)</span> (<em>sigma_x</em>), likewise is shifted away from zero, suggesting there might be some interaction between <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> at certain levels of each.</p>
<pre class="r"><code>mcmc_intervals(posterior, pars = c(glue("sigma_m[{1:2}]"), "sigma_x[1]"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
<p>We can home in a bit more on the specific estimates of the <span class="math inline">\(\tau_{ab}\)</span>’s to see where those interactions might be occurring. It appears that <em>t_x[1,6]</em> (representing <span class="math inline">\(\tau_{22}\)</span>) is an important interaction term - which is consistent with the data generation process. However, <span class="math inline">\(\tau_{32}\)</span>, represented by <em>t_x[1,10]</em>, is not obviously important. Perhaps we need more data.</p>
<pre class="r"><code>mcmc_intervals(posterior, pars = glue("t_x[1,{1:16}]"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p>Below is a visual representation of how well the model fits the data by showing the interval of predicted cell means for each <span class="math inline">\(a/b\)</span> pair. The observed means (shown as white dots) sit on top of the predictions (shown by the colored lines), suggesting the model is appropriate.</p>
<pre class="r"><code>r <- as_draws_rvars(fit$draws(variables = c("t_0", "t", "t_x")))

dnew <- data.frame(
  genMultiFac(nFactors = 2, levels = 4, each = 1, colNames = c("b", "a")))

dnew$yhat <- with(r,
  rep(t_0, 16) + rep(t[1, ], each = 4) + rep(t[2, ], times = 4) + t(t_x))

ggplot(data = dnew, aes(x = b, dist = yhat)) +
  geom_vline(aes(xintercept = b), color = "white", size = .25) +
  stat_dist_lineribbon() +
  geom_point(data = dsum, aes(y = yhat), color = "white", size = 2) +
  facet_grid(. ~ a, labeller = labeller(a = label_both)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  scale_fill_brewer()</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
</div>
<div id="only-one-treatment-effect-and-no-interaction" class="section level3">
<h3>Only one treatment effect and no interaction</h3>
<p>Perhaps the rationale for focusing on the variance can be best appreciated by looking at a contrasting scenario where there is only a single main effect (for intervention <span class="math inline">\(a\)</span>) and no interaction. Here we would expect the estimates for the intervention <span class="math inline">\(b\)</span> main effects variance as well as the variance of the interaction terms to be close to zero.</p>
<pre class="r"><code>t_0 <- 0
t_a <- c(-8, -1, 3, 6)
t_b <- c(0, 0, 0, 0)
t_ab <- matrix(0, nrow = 4, ncol = 4)</code></pre>
<p>The plot of the observed means is consistent with the data generation process:</p>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 10.0 seconds.
## Chain 1 finished in 10.6 seconds.
## Chain 4 finished in 10.8 seconds.
## Chain 3 finished in 22.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 13.5 seconds.
## Total execution time: 22.6 seconds.</code></pre>
<p>And yes, the posterior distribution for <span class="math inline">\(\sigma_{ab}\)</span> (<em>sigma_x</em>) is now very close to zero …</p>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-15-1.png" width="576" /></p>
<p>and the effect parameters are all centered around zero:</p>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-16-1.png" width="576" /></p>
<p>Once again, the predicted values are quite close to the observed means - indicating the model is a good fit:</p>
<p><img src="https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/index.en_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
</div>
<div id="next-steps" class="section level3">
<h3>Next steps</h3>
<p>In the motivating application, there are actually <em>three</em> interventions, but each one has only two levels (yes or no). In this case, the level mean and across-level variance parameters were poorly estimated, probably because there are so few levels. This forced me to take a more traditional approach, where I estimate the means of each randomization arm. I’ll share that <a href="https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/">next time</a>.</p>
<p><small><font color="darkkhaki">
<p>References:</p>
<p>Gelman, Andrew. “Analysis of variance—why it is more important than ever.” <em>The Annals of Statistics</em> 33, no. 1 (2005): 1-53.</p>
<p>Kassler, Daniel, Ira Nichols-Barrer, and Mariel Finucane. “Beyond ‘treatment versus control’: How Bayesian analysis makes factorial experiments feasible in education research.” <em>Evaluation Review</em> 44, no. 4 (2020): 238-261.</p>
</font></small></p>
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The model is implemented in Stan using a <em>non-centered</em> parameterization, so that the parameters <span class="math inline">\(\tau\)</span> are a function of a set of <span class="math inline">\(z\)</span> parameters that have standard normal priors. This does not dramatically change the estimates, but it eliminates <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">divergent</a> transitions, improving sampling behavior.</p>
<pre class="stan"><code>data {
  int<lower=1> N;   // # of observations
  int<lower=1> I;   // # of interventions
  int<lower=1> X2;  // # of 2-way interactions
  int main[N, I];   // interventions
  int x[N, X2];     // interactions
  vector[N] y;      // outcome
}

parameters {
  real t_0;
  vector[3] z_raw[I];
  vector[15] z_x_raw[X2];
  real<lower=0> sigma;
  real<lower=0> sigma_m[I];
  real<lower=0> sigma_x[X2];
  real<lower=0> sigma_main;
}

transformed parameters {
  // constrain parameters to sum to 0
  vector[4] z[I];
  vector[16] z_x[X2];
  vector[4] t[I];
  vector[16] t_x[X2];
  vector[N] yhat;

  for (i in 1:I) {
    z[i] = append_row(z_raw[i], -sum(z_raw[i]));
  }
  for (i in 1:X2) {
    z_x[i] = append_row(z_x_raw[i], -sum(z_x_raw[i]));
  }

  for (i in 1:I)
    for (j in 1:4)
      t[i, j] = sigma_m[i] * z[i, j];

  for (i in 1:X2)
    for (j in 1:16)
      t_x[i, j] = sigma_x[i] * z_x[i, j];

  // yhat
  for (n in 1:N) {
    real ytemp;
    ytemp = t_0;
    for (i in 1:I) ytemp = ytemp + t[i, main[n, i]];  // 2 sets of main effects
    for (i in 1:X2) ytemp = ytemp + t_x[i, x[n, i]];  // 1 set of interaction effects
    yhat[n] = ytemp;
  }
}

model {
  sigma ~ normal(0, 5);
  sigma_m ~ normal(0, sigma_main);
  sigma_x ~ normal(0, 5);
  sigma_main ~ normal(0, 5);
  t_0 ~ normal(0, 5);

  for (i in 1:I) z_raw[i] ~ std_normal();
  for (i in 1:X2) z_x_raw[i] ~ std_normal();

  y ~ normal(yhat, sigma);
}
</code></pre>
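<p>As a quick illustration, here is a minimal R sketch (not from the post) of the two tricks the <code>transformed parameters</code> block relies on, shown for a single intervention with 4 levels:</p>

```r
# Minimal R sketch of the Stan constructions (illustrative only - the
# actual sampling happens in Stan).

# sum-to-zero constraint: only 3 of the 4 level effects are free;
# the 4th is the negative sum, so the vector always sums to exactly 0
z_raw <- rnorm(3)
z <- c(z_raw, -sum(z_raw))

# non-centered parameterization: t = sigma_m * z with z standard normal
# is distributionally equivalent to t ~ N(0, sigma_m), but the sampler
# explores the standard-normal scale, which improves sampling behavior
sigma_m <- 0.5
t <- sigma_m * z

sum(t)  # also 0 (up to floating point), so the constraint carries through
```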
</div>
Drawing the wrong conclusion about subgroups: a comparison of Bayes and frequentist methods
https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/
Tue, 14 Sep 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/
<script src="https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/index.en_files/header-attrs/header-attrs.js"></script>
<p>In the previous <a href="https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/" target="_blank">post</a>, I simulated data from a hypothetical RCT that had heterogeneous treatment effects across subgroups defined by three covariates. I presented two Bayesian models, a strongly <em>pooled</em> model and an <em>unpooled</em> version, that could be used to estimate all the subgroup effects in a single model. I compared the estimates to a set of linear regression models that were estimated for each subgroup separately.</p>
<p>My goal in doing these comparisons is to see how often we might draw the wrong conclusion about subgroup effects when we conduct these types of analyses. In a typical frequentist framework, the probability of making a mistake is usually considerably greater than the 5% error rate that we allow ourselves, because conducting multiple tests gives us more chances to make a mistake. By using Bayesian hierarchical models that share information across subgroups and more reasonably measure uncertainty, I wanted to see if we can reduce the chances of drawing the wrong conclusions.</p>
<div id="simulation-framework" class="section level3">
<h3>Simulation framework</h3>
<p>The simulations used here are based on the same general process I used to generate a single data set the <a href="https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/" target="_blank">last time around</a>. The key difference is that I now want to understand the operating characteristics of the models, and this requires many data sets (and their model fits). Much of the modeling is similar to last time, so I’m primarily showing new code.</p>
<p>This is a pretty computationally intensive exercise. While the models don’t take too long to fit, especially with only 150 observations per data set, fitting 2500 sets of models can take some time. As I do for all the simulations that require repeated Bayesian estimation, I executed all of this on a high-performance computer. I used a framework similar to what I’ve described for conducting <a href="https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/" target="_blank">power analyses</a> and <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">exploring the operating characteristics</a> of Bayesian models.</p>
<div id="definitions" class="section level4">
<h4>Definitions</h4>
<p>The definitions of the data generation process are the same as in the previous post, except I’ve made the generation of <code>theta</code> more flexible. Last time, I fixed the coefficients (<span class="math inline">\(\tau\)</span>’s) at specific values. Here, the <span class="math inline">\(\tau\)</span>’s can vary from iteration to iteration. Even though I am generating data with no treatment effect, I am taking a Bayesian point of view on this - so that the treatment effect parameters will have a distribution that is centered around 0 with very low variance.</p>
<pre class="r"><code>library(cmdstanr)
library(simstudy)
library(posterior)
library(data.table)
library(slurmR)

setgrp <- function(a, b, c) {
  if (a==0 & b==0 & c==0) return(1)
  if (a==1 & b==0 & c==0) return(2)
  if (a==0 & b==1 & c==0) return(3)
  if (a==0 & b==0 & c==1) return(4)
  if (a==1 & b==1 & c==0) return(5)
  if (a==1 & b==0 & c==1) return(6)
  if (a==0 & b==1 & c==1) return(7)
  if (a==1 & b==1 & c==1) return(8)
}

s_define <- function() {
  d <- defData(varname = "a", formula = 0.6, dist="binary")
  d <- defData(d, varname = "b", formula = 0.4, dist="binary")
  d <- defData(d, varname = "c", formula = 0.3, dist="binary")
  d <- defData(d, varname = "theta",
    formula = "..tau[1] + ..tau[2]*a + ..tau[3]*b + ..tau[4]*c +
      ..tau[5]*a*b + ..tau[6]*a*c + ..tau[7]*b*c + ..tau[8]*a*b*c",
    dist = "nonrandom"
  )

  drx <- defDataAdd(
    varname = "y", formula = "0 + theta*rx",
    variance = 16,
    dist = "normal"
  )

  return(list(d = d, drx = drx))
}</code></pre>
</div>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>We are generating the eight values of <code>tau</code> for each iteration from a <span class="math inline">\(N(\mu = 0, \sigma = 0.5)\)</span> distribution before generating <code>theta</code> and the outcome <code>y</code>:</p>
<pre class="r"><code>s_generate <- function(n, list_of_defs) {
  list2env(list_of_defs, envir = environment())

  tau <- rnorm(8, 0, .5)

  dd <- genData(n, d)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(drx, dd)
  dd[, grp := setgrp(a, b, c), keyby = id]

  dd[]
}</code></pre>
<p>Looking at a single data set, we can see that <code>theta</code> is close to, but not exactly, 0. This is in contrast to a simulation under a frequentist framework, where the parameters are presumed known and the null effect would be fixed at exactly 0.</p>
<pre class="r"><code>set.seed(298372)
defs <- s_define()
s_generate(10, defs)</code></pre>
<pre><code>## id a b c theta rx y grp
## 1: 1 0 0 1 0.34 0 -3.45 4
## 2: 2 1 0 1 0.78 1 3.20 6
## 3: 3 0 0 0 -0.28 1 7.29 1
## 4: 4 0 1 0 -0.37 1 2.76 3
## 5: 5 0 0 0 -0.28 0 -0.48 1
## 6: 6 1 0 0 -0.25 0 1.09 2
## 7: 7 0 0 0 -0.28 0 -1.45 1
## 8: 8 0 0 1 0.34 1 -5.78 4
## 9: 9 1 0 0 -0.25 1 2.97 2
## 10: 10 1 0 0 -0.25 0 -1.25 2</code></pre>
</div>
<div id="model-fitting" class="section level4">
<h4>Model fitting</h4>
<p>The models here are precisely as I defined them in the last post. The code is a bit involved, so I’m not including it - let me know if you’d like to see it. For each data set, I fit a set of subgroup-specific linear regression models (as well as an overall model that ignored the subgroups), in addition to the two Bayesian models described in the previous post. Each replication defines the data, generates a new data set, and estimates the three different models before returning the results.</p>
<pre class="r"><code>s_model <- function(dd, mod_pool, mod_nopool) {
  ...
}

s_replicate <- function(x, n, mod_pool, mod_nopool) {
  set_cmdstan_path(path = "/.../cmdstan/2.25.0")

  defs <- s_define()
  generated_data <- s_generate(n, defs)
  estimates <- s_model(generated_data, mod_pool, mod_nopool)

  estimates[]
}</code></pre>
<p>The computation is split up so that 50 multi-core computing nodes each run 50 replications. There are actually two layers of parallelization here, since each node has multiple processors, allowing the Bayesian models on each node to be estimated with parallel chains:</p>
<pre class="r"><code>set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")

model_pool <- cmdstan_model("/.../subs_pool_hpc.stan")
model_nopool <- cmdstan_model("/.../subs_nopool_hpc.stan")

job <- Slurm_lapply(
  X = 1:2500,
  FUN = s_replicate,
  n = 150,
  mod_pool = model_pool,
  mod_nopool = model_nopool,
  njobs = 50,
  mc.cores = 4L,
  job_name = "i_subs",
  tmp_path = "/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "12:00:00", partition = "cpu_short", `mem-per-cpu` = "5G"),
  export = c("s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

job
res <- Slurm_collect(job)

save(res, file = "/.../sub_0.rda")</code></pre>
</div>
</div>
<div id="results" class="section level3">
<h3>Results</h3>
<p>The figure shows the results from 80 models. Each column is a different subgroup (and the last is the overall treatment effect estimate). The intervals are the 95% credible intervals from the Bayesian models, and the 95% confidence interval from the linear regression model. The intervals are color coded based on whether the interval includes 0 (grey) or not (red). The red intervals are cases where we might incorrectly conclude that there is indeed some sort of effect. There are many more red lines for the linear regression estimates compared to either of the Bayesian models:</p>
<p><img src="https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/index.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>For the full set of 2500 replications, about 5% of the intervals from the <em>pooled</em> Bayesian model did not include 0, lower than the <em>unpooled</em> model, and far below the approach using individual subgroup regression models:</p>
<pre><code>## pooled unpooled lm
## 1: 0.051 0.11 0.37</code></pre>
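<p>As a rough sketch of how these proportions can be computed from the collected intervals (the bounds below are made-up stand-ins, not the actual simulation output), an interval “excludes 0” whenever both of its bounds fall on the same side of 0:</p>

```r
# proportion of 95% intervals that (incorrectly) exclude 0
excl_prop <- function(lower, upper) mean(lower > 0 | upper < 0)

# hypothetical interval bounds standing in for the 2500 replications
set.seed(7)
lower <- rnorm(2500, -2, 1)
upper <- lower + 4

excl_prop(lower, upper)
```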
<p>I started off the last post by motivating this set of simulations with an experience I recently had with journal reviewers who were skeptical of an analysis of a subgroup effect size. I’m not sure that the journal reviewers would buy the approach suggested here, but it seems that pooling estimates across subgroups provides a viable way to guard against making overly strong statements about effect sizes when they are not really justified.</p>
</div>
Subgroup analysis using a Bayesian hierarchical model
https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/
Tue, 31 Aug 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/
<script src="https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/index.en_files/header-attrs/header-attrs.js"></script>
<p>I’m part of a team that recently submitted the results of a randomized clinical trial for publication in a journal. The overall findings of the study were inconclusive, and we certainly didn’t try to hide that fact in our paper. Of course, the story was a bit more complicated, as the RCT was conducted during various phases of the COVID-19 pandemic; the context in which the therapeutic treatment was provided changed over time. In particular, other new treatments became standard of care along the way, resulting in apparent heterogeneous treatment effects for the therapy we were studying. It appears as if the treatment we were studying might have been effective only in one period when alternative treatments were not available. While we planned to evaluate the treatment effect over time, it was not our primary planned analysis, and the journal objected to the inclusion of these secondary analyses.</p>
<p>Which got me thinking, of course, about subgroup analyses. In the context of a null hypothesis significance testing framework, it is well known that conducting numerous <em>post hoc</em> analyses carries the risk of dramatically inflating the probability of a Type 1 error - concluding there is some sort of effect when in fact there is none. So, if there is no overall effect, and you decide to look at a subgroup of the sample (say patients over 50), you may find that the treatment has an effect in that group. But, if you failed to adjust for multiple tests, then that conclusion may not be warranted. And if that subgroup analysis was not pre-specified or planned ahead of time, the conclusion may be even more dubious.</p>
<p>If we use a Bayesian approach, we might be able to <a href="https://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/" target="_blank">avoid this problem</a>, and there might be no need to adjust for multiple tests. I have started to explore this a bit using simulated data under different data generation processes and prior distribution assumptions. It might all be a bit too much for a single post, so I am planning on spreading it out a bit.</p>
<div id="the-data" class="section level3">
<h3>The data</h3>
<p>To get this going, here are the libraries used in this post:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(cmdstanr)
library(posterior)</code></pre>
<p>In this simulated data set of 150 individuals, there are three binary covariates <span class="math inline">\(A, B, C \in \{0,1\}\)</span> and a treatment indicator <span class="math inline">\(rx \in \{0,1\}\)</span>. When we randomize the individuals to arms, we should have pretty good balance across treatment arms, so a comparison of the two treatment arms without adjusting for the covariates should provide a good estimate of the <em>overall</em> treatment effect. However, we might still be interested in looking at specific subgroups defined by <span class="math inline">\(A\)</span>, <span class="math inline">\(B\)</span>, and <span class="math inline">\(C\)</span>, say patients for whom <span class="math inline">\(A=0\)</span> or those where <span class="math inline">\(C=1\)</span>. (We could also look at subgroups defined by combinations of these covariates.)</p>
<p>In the data generation process, the treatment effect will be a parameter <span class="math inline">\(\theta\)</span> that will be determined by the levels of the three covariates. In this case, for patients with <span class="math inline">\(A=B=C=0\)</span>, there will be no treatment effect. However, for patients with only <span class="math inline">\(A=1\)</span> (i.e., <span class="math inline">\(B=0\)</span> and <span class="math inline">\(C=0\)</span>), there will be a small treatment effect of <span class="math inline">\(2\)</span>; for patients with only <span class="math inline">\(C=1\)</span>, a slightly larger effect of <span class="math inline">\(4\)</span>; and for patients with <span class="math inline">\(A=1 \ \&amp; \ C=1\)</span>, a treatment effect of <span class="math inline">\(5\)</span>. For patients with <span class="math inline">\(B=1\)</span> alone, there is no treatment effect.</p>
<pre class="r"><code>d <- defData(varname = "a", formula = 0.6, dist="binary")
d <- defData(d, varname = "b", formula = 0.3, dist="binary")
d <- defData(d, varname = "c", formula = 0.4, dist="binary")
d <- defData(d, varname = "theta", formula = "0 + 2*a + 4*c - 1*a*c", dist = "nonrandom")
drx <- defDataAdd(varname = "y", formula = "0 + theta*rx", variance = 16, dist = "normal")</code></pre>
<p>In the data generation process, I am assigning eight group identifiers based on the covariates that will be relevant for the Bayes model (described further below).</p>
<pre class="r"><code>setgrp <- function(a, b, c) {
  if (a==0 & b==0 & c==0) return(1)
  if (a==1 & b==0 & c==0) return(2)
  if (a==0 & b==1 & c==0) return(3)
  if (a==0 & b==0 & c==1) return(4)
  if (a==1 & b==1 & c==0) return(5)
  if (a==1 & b==0 & c==1) return(6)
  if (a==0 & b==1 & c==1) return(7)
  if (a==1 & b==1 & c==1) return(8)
}</code></pre>
<p>To generate the data:</p>
<pre class="r"><code>set.seed(3871598)
dd <- genData(150, d)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(drx, dd)
dd[, grp:= setgrp(a, b, c), keyby = id]
dd</code></pre>
<pre><code>## id a b c theta rx y grp
## 1: 1 1 0 1 5 0 0.28 6
## 2: 2 1 1 0 2 0 3.14 5
## 3: 3 0 0 0 0 0 0.73 1
## 4: 4 1 1 0 2 1 0.78 5
## 5: 5 1 1 1 5 0 -5.94 8
## ---
## 146: 146 1 1 0 2 1 4.68 5
## 147: 147 0 0 1 4 0 3.10 4
## 148: 148 1 0 0 2 0 5.88 2
## 149: 149 1 1 1 5 1 4.22 8
## 150: 150 0 1 1 4 1 4.76 7</code></pre>
<p>Here is a plot of the average outcome <span class="math inline">\(Y\)</span> for each of the subgroups, with and without treatment. The treatment effect for a particular subgroup is the difference of the <span class="math inline">\(Y\)</span> values at the ends of each segment. It appears that there is a treatment effect for both subgroups <span class="math inline">\(B=0\)</span> and <span class="math inline">\(B=1\)</span>, even though <span class="math inline">\(B\)</span> was not supposed to have any impact on the effect size. This is because patients in both <span class="math inline">\(B\)</span> subgroups also have characteristics <span class="math inline">\(A\)</span> and <span class="math inline">\(C\)</span>, which <em>do</em> influence the effect size. Indeed, if you compare the subgroups <span class="math inline">\(B=0\)</span> and <span class="math inline">\(B=1\)</span>, the effect sizes appear to be about the same, which is consistent with the fact that <span class="math inline">\(B\)</span> has no impact on the effect size. This is definitely not the case when comparing <span class="math inline">\(C=0\)</span> and <span class="math inline">\(C=1\)</span>. I point this out because, when I report the estimated effect sizes from the models, I will be reporting the subgroup-specific effects shown here, rather than parameter estimates of <span class="math inline">\(\theta\)</span>.</p>
<p><img src="https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/index.en_files/figure-html/unnamed-chunk-5-1.png" width="768" /></p>
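<p>To see numerically why the two <span class="math inline">\(B\)</span> subgroups show similar apparent effects, note that the marginal effect within a subgroup is just the average of <span class="math inline">\(\theta\)</span> over the mix of <span class="math inline">\(A\)</span> and <span class="math inline">\(C\)</span> in that subgroup. Here is a small standalone check (not part of the original analysis), using the same effect formula and covariate probabilities as above, with a large sample so the averages settle down near their expected values:</p>

```r
library(data.table)

set.seed(1)
n <- 1e5

# regenerate the covariates with the probabilities used in the post
dx <- data.table(
  a = rbinom(n, 1, 0.6),
  b = rbinom(n, 1, 0.3),
  c = rbinom(n, 1, 0.4)
)

# same deterministic effect formula as the defData statement above
dx[, theta := 2*a + 4*c - 1*a*c]

dx[, .(avg_effect = mean(theta)), keyby = b]  # roughly equal across B
dx[, .(avg_effect = mean(theta)), keyby = c]  # clearly different across C
```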
</div>
<div id="subgroup-analysis-using-simple-linear-regression" class="section level3">
<h3>Subgroup analysis using simple linear regression</h3>
<p>Before jumping into the Bayes models, I am fitting seven simple linear regression models to estimate seven treatment effects, one for each of the six subgroups defined by the covariates <span class="math inline">\(A\)</span>, <span class="math inline">\(B\)</span>, and <span class="math inline">\(C\)</span>, plus an overall estimate.</p>
<pre class="r"><code>df <- data.frame(dd)

est_lm <- function(dx) {
  fit <- lm(y ~ rx, data = dx)
  c(coef(fit)["rx"], confint(fit)[2,])
}

est_cis <- function(sub_grp) {
  mean_pred <- lapply(split(df[,c(sub_grp, "y", "rx")], df[, c(sub_grp)]), est_lm)
  do.call(rbind, mean_pred)
}

ci_subgroups <- do.call(rbind, lapply(c("a","b","c"), est_cis))
ci_overall <- est_lm(dd)

cis <- data.table(
  subgroup = c("a = 0", "a = 1", "b = 0", "b = 1", "c = 0", "c = 1", "overall"),
  model = 3,
  rbind(ci_subgroups, ci_overall)
)

setnames(cis, c("rx","2.5 %", "97.5 %"), c("p.50","p.025", "p.975"))</code></pre>
<p>Inspecting the point estimates (denoted as <em>p.50</em>) of the treatment effect for each subgroup (and the overall group), we see that they match pretty closely with the effect sizes depicted in the figure of the means by subgroup above. I’ll compare these estimates to the Bayes estimates in a bit.</p>
<pre class="r"><code>cis</code></pre>
<pre><code>## subgroup model p.50 p.025 p.975
## 1: a = 0 3 3.3 1.30 5.3
## 2: a = 1 3 3.0 1.31 4.6
## 3: b = 0 3 3.2 1.57 4.9
## 4: b = 1 3 2.6 0.61 4.6
## 5: c = 0 3 2.5 0.90 4.1
## 6: c = 1 3 4.4 2.19 6.5
## 7: overall 3 3.1 1.79 4.4</code></pre>
</div>
<div id="two-possible-bayesian-models" class="section level3">
<h3>Two possible Bayesian models</h3>
<p>I am including two Bayesian models here, one that I am calling a <em>pooled</em> model and the other an <em>unpooled</em> model (though the second is not absolutely unpooled, just relatively unpooled). In both cases, the outcome model is described as</p>
<p><span class="math display">\[
y_{ij} \sim N\left(\alpha_j +\theta_{j}x_i, \ \sigma_0 \right)
\]</span></p>
<p>where <span class="math inline">\(y_{ij}\)</span> is the outcome measure for individual <span class="math inline">\(i\)</span> who has covariate/subgroup pattern <span class="math inline">\(j\)</span>. (These subgroup patterns were defined above in <code>R</code> code. For example group 1 is all cases where <span class="math inline">\(a=b=c=0\)</span> and group 5 is <span class="math inline">\(a=b=1, \ c=0\)</span>.) <span class="math inline">\(x_i\)</span> is a treatment indicator, <span class="math inline">\(x \in \{0,1\}\)</span>. <span class="math inline">\(\alpha_j\)</span> is the intercept for covariate pattern <span class="math inline">\(j\)</span> (representing the mean outcome for all patients with pattern <span class="math inline">\(j\)</span> randomized to control). <span class="math inline">\(\theta_j\)</span> represents the treatment effect for patients with pattern <span class="math inline">\(j\)</span>. <span class="math inline">\(\sigma_0\)</span> is the within treatment arm/covariate pattern standard deviation, and is assumed to be constant across arms and patterns.</p>
<p>The treatment effect parameter <span class="math inline">\(\theta_j\)</span> can be further parameterized as function of a set of <span class="math inline">\(\tau\text{'s}.\)</span> (This parameterization was inspired by this <a href="https://journals.sagepub.com/doi/full/10.1177/1740774510396933" target="_blank">paper</a> by <em>Jones et al</em>.) The treatment effect is a deterministic function of the covariates <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, and <span class="math inline">\(c\)</span> as well as their interactions:</p>
<span class="math display">\[\begin{aligned}
\theta_1 &= \tau_0 \\
\theta_2 &= \tau_0 + \tau_a \\
\theta_3 &= \tau_0 + \tau_b \\
\theta_4 &= \tau_0 + \tau_c \\
\theta_5 &= \tau_0 + \tau_a + \tau_b + \tau_{ab} \\
\theta_6 &= \tau_0 + \tau_a + \tau_c + \tau_{ac} \\
\theta_7 &= \tau_0 + \tau_b + \tau_c + \tau_{bc} \\
\theta_8 &= \tau_0 + \tau_a + \tau_b + \tau_c + \tau_{ab} + \tau_{ac} + \tau_{bc} + \tau_{abc}
\end{aligned}\]</span>
<p>So far, the parameterization for the <em>pooled</em> and <em>unpooled</em> models are the same. Now we see how they diverge:</p>
<div id="pooled-model" class="section level4">
<h4>Pooled model</h4>
<p>The idea behind the <em>pooled</em> model is that the <em>main effects</em> of <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, <span class="math inline">\(c\)</span> (<span class="math inline">\(\tau_a\)</span>, <span class="math inline">\(\tau_b\)</span>, and <span class="math inline">\(\tau_c\)</span>, respectively) are drawn from the same distribution centered around <span class="math inline">\(\delta_m\)</span> with a standard deviation <span class="math inline">\(\sigma_m\)</span>, both of which will be estimated from the data. The estimated effect of one covariate will, to some extent, inform the estimated effect of the others. Of course, as the number of observations increases, the strength of pooling will be reduced. The three 2-level interaction effects (<span class="math inline">\(\tau_{ab}\)</span>, <span class="math inline">\(\tau_{ac}\)</span> and <span class="math inline">\(\tau_{bc}\)</span>) are independent of the main effects, but they also share a common distribution to be estimated from the data. (In this case we have only a single three-way interaction term <span class="math inline">\(\tau_{abc}\)</span>, but if we had 4 covariates rather than 3, we would have 4 three-way interaction terms, which could all share the same prior distribution. At some point, it might be reasonable to exclude higher order interactions, such as four- or five-way interactions.)</p>
<span class="math display">\[\begin{aligned}
\tau_a, \tau_b, \tau_c &\sim N(\mu = \delta_m, \sigma = \sigma_m) \\
\tau_{ab}, \tau_{ac}, \tau_{bc} &\sim N(\mu = \delta_x, \sigma = \sigma_x) \\
\end{aligned}\]</span>
<p>With the exception of <span class="math inline">\(\alpha_j\)</span> and <span class="math inline">\(\sigma_0\)</span>, the prior distributions for the model parameters are quite conservative/pessimistic, centered pretty closely around 0. (It would certainly be wise to explore how these prior assumptions impact the findings, but since this is just an illustrative example, I won’t dwell too much on these particular assumptions).</p>
<span class="math display">\[\begin{aligned}
\alpha_j &\sim N(\mu = 0, \sigma = 10), \ \ \ j \in \{1,\dots,8\} \\
\tau_0 &\sim N(\mu=0, \sigma = 2) \\
\tau_{abc} &\sim N(\mu = 0, \sigma = 2) \\
\delta_m &\sim N(\mu = 0, \sigma = 2) \\
\delta_x &\sim N(\mu = 0, \sigma = 2) \\
\sigma_0 &\sim N(\mu = 0, \sigma = 10), \ \ \ \sigma_0 \ge 0 \\
\sigma_m &\sim N(\mu = 0, \sigma = 1), \ \ \ \ \sigma_m \ge 0 \\
\sigma_x &\sim N(\mu = 0, \sigma = 1), \ \ \ \ \sigma_x \ge 0 \\
\end{aligned}\]</span>
</div>
<div id="unpooled-model" class="section level4">
<h4>Unpooled model</h4>
<p>In the unpooled model, the <span class="math inline">\(\tau\)</span>’s (and <span class="math inline">\(\alpha\)</span>’s) are not jointly parameterized with a common mean, and the prior distributions are more diffuse. The only variance estimation is for <span class="math inline">\(\sigma_0\)</span>:</p>
<span class="math display">\[\begin{aligned}
\alpha_j &\sim N(\mu=0, \sigma = 10), \ \ \ j \in \{1,\dots,8\} \\
\tau_0 &\sim N(\mu=0, \sigma = 10) \\
\tau_q &\sim N(\mu=0, \sigma = 10), \ \ \ q \in \{a, b, c\} \\
\tau_{qq} &\sim N(\mu=0, \sigma = 10), \ \ \ qq \in \{ab, ac, bc\} \\
\tau_{abc} &\sim N(\mu = 0, \sigma = 10) \\
\sigma_0 &\sim N(\mu = 0, \sigma = 10), \ \ \ \sigma_0 \ge 0 \\
\end{aligned}\]</span>
</div>
</div>
<div id="model-estimation" class="section level3">
<h3>Model estimation</h3>
<p>I’m using <code>cmdstanr</code> to estimate the models in <code>Stan</code>. (The Stan code is available if anyone is interested, or you can try to write it yourself.) For each model, I am sampling in 4 chains of length 2500 following 500 warm-up steps. I’ll skip the required diagnostics here (e.g. trace plots) for brevity, but I did check everything, and things looked OK.</p>
<pre class="r"><code>model_pool <- cmdstan_model("code/pooled_subgroup.stan")
model_unpool <- cmdstan_model("code/unpooled_subgroup.stan")</code></pre>
<pre class="r"><code>fit_pool <- model_pool$sample(
  data = list(N = dd[,.N], rx = dd[,rx], sub_grp = dd[,grp], y=dd[,y]),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500,
  adapt_delta = 0.99,
  max_treedepth = 20,
  seed = 898171
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 1.4 seconds.
## Chain 2 finished in 1.4 seconds.
## Chain 3 finished in 1.5 seconds.
## Chain 4 finished in 1.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 1.5 seconds.
## Total execution time: 1.8 seconds.</code></pre>
<pre class="r"><code>fit_unpool <- model_unpool$sample(
  data = list(N = dd[,.N], rx = dd[,rx], sub_grp = dd[,grp], y=dd[,y], prior_sigma=10),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500,
  adapt_delta = 0.99,
  max_treedepth = 20,
  seed = 18717
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 3 finished in 1.4 seconds.
## Chain 2 finished in 1.5 seconds.
## Chain 4 finished in 1.7 seconds.
## Chain 1 finished in 2.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 1.7 seconds.
## Total execution time: 2.2 seconds.</code></pre>
</div>
<div id="extracting-posterior-probabilities" class="section level3">
<h3>Extracting posterior probabilities</h3>
<p>In this case, I am not directly interested in the effect parameters <span class="math inline">\(\theta_j\)</span>, but rather in the estimated treatment effects for the six subgroups defined by <span class="math inline">\(a=0\)</span>, <span class="math inline">\(a=1\)</span>, <span class="math inline">\(b=0\)</span>, <span class="math inline">\(b=1\)</span>, <span class="math inline">\(c=0\)</span>, and <span class="math inline">\(c=1\)</span>. (These groups are not distinct from one another, as each individual has measures for each of <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, and <span class="math inline">\(c\)</span>.) I will step through the process of how I get these estimates, and then will plot a summary of the estimates.</p>
<p>First, I extract the key parameter estimates into an <code>rvars</code> data structure (I discussed this data structure recently in a couple of posts - <a href="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/" target="_blank">here</a> and <a href="https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/" target="_blank">here</a>). Although the object <code>r</code> below looks like a list of 3 items with just a handful of values, there is actually an entire data set supporting each value that contains 10,000 samples from the posterior distribution. What we are seeing are the mean and standard deviation of those distributions.</p>
<pre class="r"><code>r <- as_draws_rvars(fit_pool$draws(variables = c("alpha","theta","sigma")))
r</code></pre>
<pre><code>## # A draws_rvars: 2500 iterations, 4 chains, and 3 variables
## $alpha: rvar<2500,4>[8] mean ± sd:
## [1] -2.42 ± 0.89 0.49 ± 0.79 -1.61 ± 1.46 -0.88 ± 1.09 0.93 ± 1.30
## [6] 1.06 ± 0.89 2.64 ± 1.59 -0.18 ± 1.19
##
## $theta: rvar<2500,4>[8] mean ± sd:
## [1] 2.1 ± 1.03 2.8 ± 0.89 2.7 ± 1.10 3.6 ± 1.11 2.6 ± 1.31 4.2 ± 1.17
## [7] 4.0 ± 1.31 3.6 ± 1.72
##
## $sigma: rvar<2500,4>[1] mean ± sd:
## [1] 3.8 ± 0.23</code></pre>
<p>A cool feature of the <code>rvars</code> data structure (which is part of the package <code>posterior</code>) is that they can be stored in a data.frame, and easily manipulated. Here I am matching the <span class="math inline">\(\theta_j\)</span> to each individual depending on their covariate pattern <span class="math inline">\(j\)</span>. The plan is to generate simulated data for each individual based on the estimated means and standard deviations.</p>
<pre class="r"><code>df <- as.data.frame(dd)
df$theta_hat <- r$theta[dd$grp]
df$alpha_hat <- r$alpha[dd$grp]
df$mu_hat <- with(df, alpha_hat + rx * theta_hat)</code></pre>
<p>Here are the first ten rows (out of the 150 individual records):</p>
<pre class="r"><code>head(df, 10)</code></pre>
<pre><code>## id a b c theta rx y grp theta_hat alpha_hat mu_hat
## 1 1 1 0 1 5 0 0.28 6 4.2 ± 1.17 1.06 ± 0.89 1.06 ± 0.89
## 2 2 1 1 0 2 0 3.14 5 2.6 ± 1.31 0.93 ± 1.30 0.93 ± 1.30
## 3 3 0 0 0 0 0 0.73 1 2.1 ± 1.03 -2.42 ± 0.89 -2.42 ± 0.89
## 4 4 1 1 0 2 1 0.78 5 2.6 ± 1.31 0.93 ± 1.30 3.52 ± 0.96
## 5 5 1 1 1 5 0 -5.94 8 3.6 ± 1.72 -0.18 ± 1.19 -0.18 ± 1.19
## 6 6 1 1 1 5 0 -1.45 8 3.6 ± 1.72 -0.18 ± 1.19 -0.18 ± 1.19
## 7 7 1 1 0 2 0 5.47 5 2.6 ± 1.31 0.93 ± 1.30 0.93 ± 1.30
## 8 8 1 1 0 2 1 -2.33 5 2.6 ± 1.31 0.93 ± 1.30 3.52 ± 0.96
## 9 9 0 0 1 4 1 0.84 4 3.6 ± 1.11 -0.88 ± 1.09 2.69 ± 1.06
## 10 10 1 0 0 2 1 7.05 2 2.8 ± 0.89 0.49 ± 0.79 3.26 ± 0.78</code></pre>
<p>We can add a column of predicted “values” to the data frame.</p>
<pre class="r"><code>df$pred <- rvar_rng(rnorm, nrow(df), df$mu_hat, r$sigma)
head(df[,c("id", "grp", "mu_hat", "pred")], 10)</code></pre>
<pre><code>## id grp mu_hat pred
## 1 1 6 1.06 ± 0.89 1.01 ± 3.9
## 2 2 5 0.93 ± 1.30 0.95 ± 4.0
## 3 3 1 -2.42 ± 0.89 -2.41 ± 3.9
## 4 4 5 3.52 ± 0.96 3.53 ± 3.9
## 5 5 8 -0.18 ± 1.19 -0.19 ± 4.0
## 6 6 8 -0.18 ± 1.19 -0.18 ± 4.0
## 7 7 5 0.93 ± 1.30 0.90 ± 4.1
## 8 8 5 3.52 ± 0.96 3.49 ± 3.9
## 9 9 4 2.69 ± 1.06 2.67 ± 3.9
## 10 10 2 3.26 ± 0.78 3.30 ± 3.9</code></pre>
<p>But note that we don’t just have a single value for each of the 150 individuals; we have 10,000 samples for each of the 150 individuals (for a total of 1.5 million predicted values). Here is a little bit of evidence that this is the case: you can see that this is an <code>rvar</code> of dimension <span class="math inline">\(2500 \text{ samples} \times 4 \text{ chains}\)</span>, or <span class="math inline">\(10,000\)</span> predicted values:</p>
<pre class="r"><code>df[9, "pred"]</code></pre>
<pre><code>## rvar<2500,4>[1] mean ± sd:
## [1] 2.7 ± 3.9</code></pre>
<p>Finally, we are ready to get estimates of the within-subgroup effect sizes. I’ve written a little function to help out here. For each covariate <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, and <span class="math inline">\(c\)</span>, the function splits the data set into four subsets. So, for covariate <span class="math inline">\(a\)</span> we have <span class="math inline">\([a=0 \ \& \ rx=0]\)</span>, <span class="math inline">\([a=0 \ \& \ rx=1]\)</span>, <span class="math inline">\([a=1 \ \& \ rx=0]\)</span>, and <span class="math inline">\([a=1 \ \& \ rx=1]\)</span>. For each of those subsets, we get a distribution of mean predicted values by averaging across the distribution of individual predicted values. So, the variable <code>effects</code> contains the distribution of effects for the six subgroups (<span class="math inline">\(a=0\)</span>, <span class="math inline">\(a=1\)</span>, <span class="math inline">\(b=0\)</span>, <span class="math inline">\(b=1\)</span>, <span class="math inline">\(c=0\)</span>, and <span class="math inline">\(c=1\)</span>):</p>
<pre class="r"><code>est_effects <- function(sub_grp) {
  mean_pred <- lapply(
    split(df[, c(sub_grp, "rx", "pred")], df[, c(sub_grp, "rx")]),
    function(x) rvar_mean(x$pred)
  )
  c(mean_pred[["0.1"]] - mean_pred[["0.0"]], mean_pred[["1.1"]] - mean_pred[["1.0"]])
}
effects <- do.call(c, lapply(c("a","b","c"), est_effects))
effects</code></pre>
<pre><code>## rvar<2500,4>[6] mean ± sd:
## [1] 2.5 ± 1.3 3.2 ± 1.1 2.8 ± 1.0 3.1 ± 1.5 2.7 ± 1.1 3.5 ± 1.3</code></pre>
<p>We can also get the distribution of the overall (marginal) treatment effect by subsetting by <span class="math inline">\(rx\)</span> only. The last step is to create a summary table for the <em>pooled</em> model. Remember, the <code>effects</code> table is really a table of distributions, and we can extract summary statistics from those distributions for reporting or plotting. Here, we are extracting the <span class="math inline">\(2.5\%\)</span>, <span class="math inline">\(50\%\)</span>, and <span class="math inline">\(97.5\%\)</span> quantiles to show the median and a <span class="math inline">\(95\%\)</span> interval.</p>
<pre class="r"><code>mean_pred <- lapply(split(df[,c("rx","pred")], df[, "rx"]), function(x) rvar_mean(x$pred))
overall <- mean_pred[["1"]] - mean_pred[["0"]]
effects <- c(effects, overall)
sumstats_pooled <- data.table(
  subgroup = c("a = 0", "a = 1", "b = 0", "b = 1", "c = 0", "c = 1", "overall"),
  model = 1,
  p.025 = quantile(effects, 0.025),
  p.50 = quantile(effects, 0.50),
  p.975 = quantile(effects, 0.975)
)</code></pre>
</div>
<div id="comparing-model-estimates" class="section level3">
<h3>Comparing model estimates</h3>
<p>Now to take a look at the distribution of effect sizes based on the different models. (I didn’t show it, but I also created a table called <code>sumstats_unpooled</code> using the same process I just walked you through.) Below is a plot of the effect estimates for each of the subgroups as well as the overall (marginal) effect estimates. The <code>lm</code> plot shows the point estimate with a <span class="math inline">\(95\%\)</span> confidence interval. The other two plots show the medians of the posterior distributions for the subgroup effects along with <span class="math inline">\(95\%\)</span> intervals.</p>
<p>Two things stand out in the plot, and both will be very important when I write next time about “Type 1” errors: the relative lengths of the intervals and the apparent shrinkage of some of the estimates. In all cases, the interval for the standard linear regression model is shorter than the intervals for the two Bayesian models, reflecting less uncertainty. The <em>pooled</em> model also appears to have slightly less uncertainty than the <em>unpooled</em> model.</p>
<p>The second point is that the point estimates for the linear regression model and the median estimates for the <em>unpooled</em> model are quite close, while the <em>pooled</em> medians appear to be pulled away. The direction of the shrinkage is not coherent, because there is a mixture of main effects and interaction effects (the <span class="math inline">\(\tau\)</span>’s) that are shifting things around. It appears that the effects of the subgroups <span class="math inline">\(a=0,\)</span> <span class="math inline">\(b=0,\)</span> and <span class="math inline">\(c=0\)</span> are being pulled towards each other, and the same appears to be true for the group defined by <span class="math inline">\(a=1,\)</span> <span class="math inline">\(b=1,\)</span> and <span class="math inline">\(c=1.\)</span> This seems right as we know that the underlying parameters <span class="math inline">\(\tau_a\)</span>, <span class="math inline">\(\tau_b\)</span>, and <span class="math inline">\(\tau_c\)</span> are shrinking towards each other.</p>
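<p>As a rough heuristic for why this happens (the exact algebra for this hierarchical model is more involved, so treat this as a sketch based on the simple normal-normal case): with partial pooling, the posterior mean of each group-level parameter is approximately a precision-weighted average of its unpooled estimate and the common mean,</p>
<p><span class="math display">\[\hat{\tau}_j^{pooled} \approx \lambda_j \hat{\tau}_j + (1 - \lambda_j)\bar{\tau}, \qquad \lambda_j = \frac{\sigma_\tau^2}{\sigma_\tau^2 + se_j^2},\]</span></p>
<p>so the noisier a subgroup’s estimate (the smaller <span class="math inline">\(\lambda_j\)</span>), the more strongly it is pulled towards the common mean.</p>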
<p><img src="https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/index.en_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
<p>If we were using the <em>pooled</em> model to draw conclusions, I would say that the subgroups defined by <span class="math inline">\(c\)</span> appear to have heterogeneous treatment effects, though I would want more data to confirm this, as the intervals are still quite wide. If we use the results from the linear regression model, we might want to proceed with caution, because the intervals are likely too narrow: we have not adjusted for multiple testing. We will see this next time when I look at a case where there are no underlying treatment effects in the data generation process.</p>
<p><small><font color="darkkhaki">
<p>Reference:</p>
<p>Jones, Hayley E., David I. Ohlssen, Beat Neuenschwander, Amy Racine, and Michael Branson. “Bayesian models for subgroup analysis in clinical trials.” <em>Clinical Trials</em> 8, no. 2 (2011): 129-143.</p>
</font></small></p>
</div>
Posterior probability checking with rvars: a quick follow-up
https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/
Tue, 17 Aug 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/
<p>This is a relatively brief addendum to last week’s <a href="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/">post</a>, where I described how the <code>rvar</code> datatype implemented in the <code>R</code> package <code>posterior</code> makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and fit a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), the model, equally unsurprisingly, wasn’t a great fit.</p>
<p>Immediately after putting the post up, I decided to make sure that the correct model with the quadratic term would not result in an extreme p-value (i.e., the p-value would fall between 0.02 and 0.98). And, again not surprisingly, the model was a good fit. I’m sharing all this here because I got some advice on how to work with the <code>rvar</code> data a little more efficiently, and wanted to make sure those who are interested could see that. And while I was at it, I decided to investigate the distribution of Bayesian p-values under the condition that the model and the data generating process are the same (i.e., the model is correct).</p>
<p>Just as a reminder, here is the data generation process:</p>
<p><span class="math display">\[y \sim N(\mu = 2 + 6*x - 0.3x^2, \ \sigma^2 = 4)\]</span></p>
<p>Here are the necessary libraries:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)</code></pre>
<p>And here is the data generation:</p>
<pre class="r"><code>b_quad <- -0.3
ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)
set.seed(72612)
dd <- genData(100, ddef)</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/index.en_files/figure-html/plot2-1.png" width="576" /></p>
<p>The <code>Stan</code> model is slightly modified to include the additional term; <span class="math inline">\(\gamma\)</span> is the quadratic parameter:</p>
<pre class="stan"><code>data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  vector[N] x2;
  for (i in 1:N) {
    x2[i] = x[i] * x[i];
  }
}
parameters {
  real alpha;
  real beta;
  real gamma;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x + gamma * x2, sigma);
}</code></pre>
<pre class="r"><code>mod <- cmdstan_model("code/quadratic_regression.stan")</code></pre>
<pre class="r"><code>fit <- mod$sample(
  data = list(N = nrow(dd), x = dd$x, y = dd$y),
  seed = 72651,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 3 finished in 0.5 seconds.
## Chain 1 finished in 0.5 seconds.
## Chain 2 finished in 0.5 seconds.
## Chain 4 finished in 0.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.5 seconds.
## Total execution time: 0.6 seconds.</code></pre>
<p>As before, I am plotting the observed (actual data) along with the 80% intervals of predicted values at each level of <span class="math inline">\(x\)</span>. The observed data appear to be randomly scattered within the intervals with no apparent pattern:</p>
<pre class="r"><code>post_rvars <- as_draws_rvars(fit$draws())
mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2))
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)
df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90))))
df.80[, extreme := !(y >= V1 & y <= V2)]</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/index.en_files/figure-html/plotintervals2-1.png" width="576" /></p>
<p>The code to estimate the p-value is slightly modified from last time. The important difference is that the lists of <code>rvars</code> (<em>bin_prop_y</em> and <em>bin_prop_pred</em>) are converted directly into vectors of <code>rvars</code> using the <code>do.call</code> function:</p>
<pre class="r"><code>df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)
bin_prop_y <- lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu))
rv_y <- do.call(c, bin_prop_y)
T_y <- rvar_var(rv_y)
bin_prop_pred <- lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x$mu))
rv_pred <- do.call(c, bin_prop_pred)
T_pred <- rvar_var(rv_pred)
mean(T_pred > T_y)</code></pre>
<pre><code>## [1] 0.583</code></pre>
<p>In this one case, the p-value is 0.58, suggesting the model is a good fit. But, could this have been a fluke? Looking below at the density plot of p-values based on 10,000 simulated data sets suggests not; indeed, <span class="math inline">\(P(0.02 < \text{p-value} < 0.98) = 99.8\%.\)</span> (If you are interested in the code that estimated the density of p-values, I can post it as well.)</p>
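<p>For those curious, here is a sketch of what that simulation could look like. To be clear, this is my reconstruction rather than the code that produced the plot: <code>est_pvalue</code> is a hypothetical wrapper around the data generation, model fitting, and checking steps shown above, and in practice you would want to parallelize this (and possibly reduce the number of posterior samples per fit), since 10,000 replications means 10,000 calls to <code>Stan</code>.</p>
<pre class="r"><code># one replication: generate data, fit the quadratic model, compute the p-value
est_pvalue <- function(mod, ddef) {

  dd <- genData(100, ddef)

  fit <- mod$sample(
    data = list(N = nrow(dd), x = dd$x, y = dd$y),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 500,
    iter_sampling = 2500
  )

  post_rvars <- as_draws_rvars(fit$draws())
  mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2))
  pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)

  df <- data.frame(x = dd$x, y = dd$y, mu, pred)
  df$grp <- cut(df$x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)

  # T is the variance of the binned proportions, as above
  T_y <- rvar_var(do.call(c, lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu))))
  T_pred <- rvar_var(do.call(c, lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x$mu))))

  mean(T_pred > T_y)
}

pvals <- replicate(10000, est_pvalue(mod, ddef))
mean(pvals > 0.02 & pvals < 0.98)</code></pre>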
<p><img src="https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/index.en_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
Fitting your model is only the beginning: Bayesian posterior probability checks with rvars
https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/
Mon, 09 Aug 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/
<p>Say we’ve collected data and estimated parameters of a model that give structure to the data. An important question to ask is whether the model is a reasonable approximation of the true underlying data generating process. If we did a good job, we should be able to turn around and generate data from the model itself that looks similar to the data we started with. And if we didn’t do such a great job, the newly generated data will diverge from the original.</p>
<p>If we used a Bayesian approach to estimation, all the information we have about the parameters from our estimated model is contained in the data that have been sampled by the MCMC process. For example, if we are estimating a simple normal regression model with an intercept parameter <span class="math inline">\(\alpha\)</span>, a slope parameter <span class="math inline">\(\beta\)</span>, and a standard deviation parameter <span class="math inline">\(\sigma\)</span>, and we collected 10,000 samples from a posterior distribution, then we will have a multivariate table of possible values of <span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span> and <span class="math inline">\(\sigma\)</span>. To answer our question regarding model adequacy, we only need to extract the information contained in all that data.</p>
<p>I’ve been casting about for ways to do this extraction efficiently in <code>R</code>, so I posted an inquiry on the <a href="https://discourse.mc-stan.org/" target="_blank">Stan Forums</a> to get advice. I got a suggestion to look into the random variable datatype (<code>rvar</code>) recently implemented in the package <code>posterior</code>. Not at all familiar with this, I started off by reading through the <a href="https://mc-stan.org/posterior/articles/rvar.html" target="_blank">vignette</a>, and then looked at this Kerman & Gelman <a href="https://link.springer.com/content/pdf/10.1007/s11222-007-9020-4.pdf" target="_blank">paper</a>.</p>
<p>To get a better handle on the ideas and tools, I decided to simulate some data, fit some models, and investigate what posterior probability checks might look like using <code>rvars</code>. I’m sharing some of the code with you here to give a bit of the flavor of what can be done. A little advance warning: I am providing more output of the data than usual, because I think it is easier to grasp what is going on if you can see the data in the various stages of transformation.</p>
<p>Before I get started, here is the requisite list of the packages needed to run the code:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
library(abind)</code></pre>
<div id="simple-linear-model" class="section level3">
<h3>Simple linear model</h3>
<p>I am first generating data from a simple linear regression model where the outcome <span class="math inline">\(y\)</span> is a function of <span class="math inline">\(x\)</span>, and <span class="math inline">\(\alpha = 2\)</span>, <span class="math inline">\(\beta=6\)</span>, and <span class="math inline">\(\sigma = 2\)</span>:</p>
<p><span class="math display">\[y \sim N(\mu = 2 + 6*x, \ \sigma^2 = 4)\]</span></p>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>To get things going, I define the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> and generate the data using <code>simstudy</code>, and then take a look at the data:</p>
<pre class="r"><code>b_quad <- 0
ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)
set.seed(2762)
dd <- genData(100, ddef)</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/plot,%201-1.png" width="576" /></p>
</div>
<div id="model-fitting" class="section level4">
<h4>Model fitting</h4>
<p>I am using <code>cmdstan</code> and <code>cmdstanr</code> to estimate the model. Here is the <code>Stan</code> code:</p>
<pre class="stan"><code>data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}</code></pre>
<p>Next, I compile the Stan code, and sample from the posterior. The sampling will be done in four parallel chains of 2,500 (after the warm-up of 500 samples), which will give me a total sample of 10,000 for each parameter. All of the samples are stored in the <code>cmdstan</code> object <em>fit</em>.</p>
<pre class="r"><code>mod <- cmdstan_model("code/linear_regression.stan")</code></pre>
<pre class="r"><code>fit <- mod$sample(
  data = list(N = nrow(dd), x = dd$x, y = dd$y),
  seed = 93736,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 500,
  iter_sampling = 2500
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.2 seconds.</code></pre>
</div>
<div id="extracting-results" class="section level4">
<h4>Extracting results</h4>
<p>Typically, I would extract the data using the <code>draws</code> method of the <code>cmdstanr</code> object. By default, the <code>draws</code> method returns an array, which is essentially (though not exactly) a multi-dimensional matrix. In this case there are multiple matrices, one for each parameter. The display of each parameter shows the first five rows of the four chains.</p>
<pre class="r"><code>(post_array <- fit$draws())</code></pre>
<pre><code>## # A draws_array: 2500 iterations, 4 chains, and 4 variables
## , , variable = lp__
##
## chain
## iteration 1 2 3 4
## 1 -128 -129 -127 -128
## 2 -128 -128 -127 -128
## 3 -128 -129 -130 -128
## 4 -128 -128 -127 -129
## 5 -128 -127 -127 -128
##
## , , variable = alpha
##
## chain
## iteration 1 2 3 4
## 1 2.1 1.5 2.0 2.4
## 2 1.6 1.7 1.8 2.3
## 3 1.7 2.2 1.7 2.4
## 4 1.5 2.5 1.7 1.3
## 5 2.0 2.1 1.6 2.2
##
## , , variable = beta
##
## chain
## iteration 1 2 3 4
## 1 6.1 6.2 6.0 6.0
## 2 6.0 6.1 6.1 6.0
## 3 6.1 5.9 6.1 5.9
## 4 6.1 6.0 6.1 6.1
## 5 6.0 6.0 6.1 6.0
##
## , , variable = sigma
##
## chain
## iteration 1 2 3 4
## 1 2.0 2.4 2.0 2.1
## 2 2.0 2.2 2.0 2.4
## 3 2.3 2.3 1.8 2.3
## 4 2.2 2.2 2.1 2.1
## 5 2.3 2.3 2.3 2.5
##
## # ... with 2495 more iterations</code></pre>
<p>The package <code>bayesplot</code> uses this array to generate a range of different plots, including the important diagnostic trace plot:</p>
<pre class="r"><code>mcmc_trace(post_array, pars = c("alpha", "beta", "sigma"), facet_args = list(nrow = 3))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/diagfit1-1.png" width="576" /></p>
</div>
<div id="random-variable-datatype" class="section level4">
<h4>Random variable datatype</h4>
<p>Instead of extracting the array data, it is possible to convert the array into a <em>random variable datatype</em>, or <code>rvar</code>. It is probably easiest to explain what this is by looking more closely at it.</p>
<pre class="r"><code>(post_rvars <- as_draws_rvars(fit$draws()))</code></pre>
<pre><code>## # A draws_rvars: 2500 iterations, 4 chains, and 4 variables
## $lp__: rvar<2500,4>[1] mean ± sd:
## [1] -128 ± 1.2
##
## $alpha: rvar<2500,4>[1] mean ± sd:
## [1] 1.9 ± 0.41
##
## $beta: rvar<2500,4>[1] mean ± sd:
## [1] 6 ± 0.072
##
## $sigma: rvar<2500,4>[1] mean ± sd:
## [1] 2.2 ± 0.16</code></pre>
<p>You can see that the <em>post_rvars</em> object is essentially a list of 4 items: <em>lp</em> (log probability), <em>alpha</em>, <em>beta</em>, and <em>sigma</em>. But what exactly are those items, “1.9 ± 0.41” for <em>alpha</em>, “6 ± 0.072” for <em>beta</em>, and “2.2 ± 0.16” for <em>sigma</em>? Well, the <code>rvar</code> is really a shorthand way of representing the detailed data that is in the underlying array, and the text displayed is merely the mean and standard deviation of the underlying data (<span class="math inline">\(\mu ± \sigma\)</span>). We can peek under the hood a bit by using the function <code>draws_of</code>, and confirm the mean and standard deviation of the samples:</p>
<pre class="r"><code>beta_samples <- draws_of(post_rvars$beta)
data.table(beta_samples)</code></pre>
<pre><code>## V1
## 1: 6.07
## 2: 6.04
## 3: 6.13
## 4: 6.14
## 5: 5.98
## ---
## 9996: 6.00
## 9997: 6.07
## 9998: 6.03
## 9999: 6.13
## 10000: 6.14</code></pre>
<pre class="r"><code>c(mean(beta_samples), sd(beta_samples))</code></pre>
<pre><code>## [1] 6.0350 0.0715</code></pre>
</div>
<div id="why-all-the-fuss" class="section level4">
<h4>Why all the fuss?</h4>
<p>The whole point of the <code>rvar</code> datatype is that it makes it much easier to do things like estimate the distributions of functions of the parameters and to generate predicted values of new observations, both things we need for posterior probability checking. Of course, there are other ways to do all of this (as is always the case in <code>R</code>), but <code>rvars</code> seem to eliminate a lot of the manipulation that might be necessary if we chose to work directly with the data arrays.</p>
<p>In this next step, I am generating the distribution of means <span class="math inline">\(\mu\)</span> for each of the 100 individuals in the data set:</p>
<p><span class="math display">\[\mu_i = \alpha + \beta x_i\]</span>
I want to do this, because ultimately, I want to generate predicted values for each individual, which come from <span class="math inline">\(N(\mu_i, \sigma)\)</span>. And I am not going to generate just a single predicted value for each individual, but rather 10,000 predicted values for each individual. So, now we will have the distribution of predicted values for each individual, which is quite powerful. And importantly, the distributions of these predicted values will incorporate the uncertainty of each <span class="math inline">\(\mu_i\)</span> and <span class="math inline">\(\sigma\)</span>. With the <code>rvar</code> datatype, all of this can be accomplished with just a few commands - no manipulation necessary.</p>
<p>All <code>rvar</code> equations need to be specified using <code>rvar</code> objects. We need the product of <span class="math inline">\(x_i\)</span> and <span class="math inline">\(\beta\)</span> to get <span class="math inline">\(\mu_i\)</span>, but <span class="math inline">\(x_i\)</span> is observed data, not a random variable. No problem - we can convert the vector <span class="math inline">\(x\)</span> into a special kind of constant <code>rvar</code> that does not have a standard deviation. Once this is done, we can generate the <span class="math inline">\(\mu_i\)</span>’s:</p>
<pre class="r"><code>x_rvar <- as_rvar(dd$x)
x_rvar</code></pre>
<pre><code>## rvar<1>[100] mean ± sd:
## [1] 6.554 ± NA 0.281 ± NA 2.115 ± NA 9.889 ± NA 8.715 ± NA
## [6] 3.448 ± NA 5.358 ± NA 1.069 ± NA 7.908 ± NA 0.445 ± NA
## [11] 4.574 ± NA 1.581 ± NA 8.915 ± NA 6.063 ± NA 6.278 ± NA
## [16] 7.146 ± NA 8.344 ± NA 6.385 ± NA 9.429 ± NA 1.391 ± NA
## [21] 3.542 ± NA 6.473 ± NA 4.689 ± NA 2.016 ± NA 3.818 ± NA
## [26] 1.566 ± NA 3.315 ± NA 0.118 ± NA 3.317 ± NA 4.523 ± NA
## [31] 4.711 ± NA 1.488 ± NA 8.488 ± NA 3.614 ± NA 3.521 ± NA
## [36] 0.653 ± NA 3.522 ± NA 5.541 ± NA 3.377 ± NA 0.142 ± NA
## [41] 0.626 ± NA 1.211 ± NA 5.616 ± NA 0.210 ± NA 5.320 ± NA
## [46] 3.459 ± NA 2.315 ± NA 6.498 ± NA 9.055 ± NA 6.483 ± NA
## [51] 1.087 ± NA 2.593 ± NA 8.007 ± NA 1.388 ± NA 7.268 ± NA
## [56] 1.101 ± NA 6.091 ± NA 7.920 ± NA 4.646 ± NA 7.842 ± NA
## [61] 3.113 ± NA 7.560 ± NA 6.683 ± NA 5.670 ± NA 8.468 ± NA
## [66] 9.152 ± NA 0.390 ± NA 4.365 ± NA 8.228 ± NA 6.732 ± NA
## [71] 4.516 ± NA 1.166 ± NA 6.558 ± NA 5.129 ± NA 9.666 ± NA
## [76] 4.314 ± NA 5.069 ± NA 0.872 ± NA 7.728 ± NA 0.780 ± NA
## [81] 0.053 ± NA 1.594 ± NA 5.457 ± NA 9.755 ± NA 0.147 ± NA
## [86] 8.765 ± NA 1.165 ± NA 9.738 ± NA 0.195 ± NA 9.965 ± NA
## [91] 9.175 ± NA 8.583 ± NA 9.464 ± NA 3.360 ± NA 3.768 ± NA
## [96] 9.105 ± NA 1.337 ± NA 7.878 ± NA 4.354 ± NA 2.428 ± NA</code></pre>
<pre class="r"><code>mu <- post_rvars$alpha + post_rvars$beta * x_rvar
mu</code></pre>
<pre><code>## rvar<2500,4>[100] mean ± sd:
## [1] 41.5 ± 0.25 3.6 ± 0.39 14.7 ± 0.29 61.6 ± 0.43 54.5 ± 0.36
## [6] 22.7 ± 0.24 34.2 ± 0.22 8.4 ± 0.34 49.6 ± 0.31 4.6 ± 0.38
## [11] 29.5 ± 0.22 11.5 ± 0.32 55.7 ± 0.37 38.5 ± 0.24 39.8 ± 0.24
## [16] 45.0 ± 0.28 52.3 ± 0.34 40.4 ± 0.25 58.8 ± 0.40 10.3 ± 0.33
## [21] 23.3 ± 0.24 41.0 ± 0.25 30.2 ± 0.22 14.1 ± 0.30 25.0 ± 0.23
## [26] 11.4 ± 0.32 21.9 ± 0.24 2.6 ± 0.40 21.9 ± 0.24 29.2 ± 0.22
## [31] 30.3 ± 0.22 10.9 ± 0.32 53.1 ± 0.34 23.7 ± 0.23 23.2 ± 0.24
## [36] 5.9 ± 0.37 23.2 ± 0.24 35.4 ± 0.23 22.3 ± 0.24 2.8 ± 0.40
## [41] 5.7 ± 0.37 9.2 ± 0.34 35.8 ± 0.23 3.2 ± 0.39 34.0 ± 0.22
## [46] 22.8 ± 0.24 15.9 ± 0.28 41.1 ± 0.25 56.6 ± 0.38 41.0 ± 0.25
## [51] 8.5 ± 0.34 17.6 ± 0.27 50.2 ± 0.32 10.3 ± 0.33 45.8 ± 0.28
## [56] 8.6 ± 0.34 38.7 ± 0.24 49.7 ± 0.31 30.0 ± 0.22 49.2 ± 0.31
## [61] 20.7 ± 0.25 47.5 ± 0.30 42.2 ± 0.26 36.1 ± 0.23 53.0 ± 0.34
## [66] 57.1 ± 0.38 4.3 ± 0.38 28.3 ± 0.22 51.6 ± 0.33 42.5 ± 0.26
## [71] 29.2 ± 0.22 8.9 ± 0.34 41.5 ± 0.25 32.9 ± 0.22 60.2 ± 0.41
## [76] 28.0 ± 0.22 32.5 ± 0.22 7.2 ± 0.36 48.6 ± 0.30 6.6 ± 0.36
## [81] 2.2 ± 0.40 11.5 ± 0.32 34.8 ± 0.22 60.8 ± 0.42 2.8 ± 0.40
## [86] 54.8 ± 0.36 8.9 ± 0.34 60.7 ± 0.42 3.1 ± 0.40 62.1 ± 0.43
## [91] 57.3 ± 0.38 53.7 ± 0.35 59.0 ± 0.40 22.2 ± 0.24 24.7 ± 0.23
## [96] 56.9 ± 0.38 10.0 ± 0.33 49.5 ± 0.31 28.2 ± 0.22 16.6 ± 0.28</code></pre>
<p>We can see that <em>mu</em> is an <code>rvar</code> vector of 100 objects, one for each individual <span class="math inline">\(i\)</span>. But, as we saw before, each of those objects actually holds 10,000 data points - the distribution of <span class="math inline">\(\mu_i\)</span> for each individual. Again, let’s peek under the hood: here is the distribution of <span class="math inline">\(\mu\)</span> for individual <span class="math inline">\(i=6\)</span>:</p>
<pre class="r"><code>data.table(draws_of(mu[6]))</code></pre>
<pre><code>## V1
## 1: 23.0
## 2: 22.4
## 3: 22.8
## 4: 22.7
## 5: 22.6
## ---
## 9996: 22.9
## 9997: 22.4
## 9998: 22.1
## 9999: 22.8
## 10000: 22.8</code></pre>
<p>Now we are ready to generate the distribution of predicted values for each individual - again using a single command, <code>rvar_rng</code>, specifying that we want to generate data for each individual using the distribution of the <code>rvar</code> <em>mu</em> and the <code>rvar</code> <em>sigma</em>. We get 10,000 predicted values (our estimated distribution) for each of the 100 individuals:</p>
<pre class="r"><code>pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)
str(pred)</code></pre>
<pre><code>## rvar<2500,4>[100] 41.4 ± 2.2 3.6 ± 2.2 14.7 ± 2.2 61.6 ± 2.3 ...</code></pre>
<p>Here, I randomly sample from the sample of 10,000 predicted values and plot this one instance of predicted values (in orange) along with the original data (in blue):</p>
<pre class="r"><code>newdd <- data.table(x = dd$x, y = draws_of(pred)[sample(10000, 1),])
head(newdd)</code></pre>
<pre><code>## x y
## 1: 6.554 44.17
## 2: 0.281 2.57
## 3: 2.115 16.13
## 4: 9.889 60.41
## 5: 8.715 56.27
## 6: 3.448 24.04</code></pre>
<pre class="r"><code>ggplot(data = dd, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 1) +
  geom_point(color = "orange", size = 1, data = newdd) +
  theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/plotpred1-1.png" width="576" /></p>
<p>But we can actually visualize the <em>distribution</em> of predicted values for each individual and plot those distributions in relation to the actual data. If we want to look at the 80% interval for each individual (we could look at the 95% interval just as easily), we can estimate the interval bounds simply by applying the <code>quantile</code> function to the <code>rvar</code> <em>pred</em>:</p>
<pre class="r"><code>interval80 <- t(quantile(pred, c(0.10, 0.90)))
head(interval80)</code></pre>
<pre><code>## [,1] [,2]
## [1,] 38.58 44.21
## [2,] 0.79 6.44
## [3,] 11.79 17.52
## [4,] 58.75 64.45
## [5,] 51.55 57.44
## [6,] 19.89 25.55</code></pre>
<p>If the model is a good fit, we would expect the actual data to be scattered across those distributions without any obvious pattern, as is the case here. Not so surprising given the simulated data generation process:</p>
<pre class="r"><code>df.80 <- data.table(x = dd$x, y=dd$y, interval80)
df.80[, extreme := !(y >= V1 & y <= V2)]
ggplot(data = df.80, aes(x = x, y = y)) +
  geom_segment(aes(y = V1, yend = V2, x = x, xend = x), color = "grey30", size = .1) +
  geom_point(aes(color = extreme), size = 1) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c("black", "red"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/plotintervals1-1.png" width="576" /></p>
</div>
<div id="bayesian-p-value" class="section level4">
<h4>Bayesian p-value</h4>
<p>I find the visual presentation pretty compelling, but if we want to quantify the model fit, one option is to estimate a <em>Bayesian p-value</em>, described in this <a href="https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1751-5823.2003.tb00203.x?casa_token=WKxfT4KbAc8AAAAA:L7bqFCk4bcCo0BQhb19ZX_unLctcydfFtrc-oflAyrxjQzPyYEf0WctDXlpDfJdc7wc3YZXsBvAFWw" target="_blank">Gelman paper</a> as</p>
<p><span class="math display">\[\text{p-value}(y) = P(T(y^{rep}) > T(y) \ |\ y)\]</span>
averaged over the parameters <span class="math inline">\(\theta\)</span> (and is a function of the observed data <span class="math inline">\(y\)</span>). <span class="math inline">\(y^{rep}\)</span> is the replicated or predicted data from the model (what we have saved in the variable <em>pred</em>). <span class="math inline">\(T()\)</span> is any function of the data that is reasonable in this context. The idea is that the p-value will not be extremely high or low (e.g., not less than 0.05 or greater than 0.95) if the model is a good approximation of the actual data generating process. Since my main goal here is to illustrate the usefulness of the <code>rvar</code> datatype, and not necessarily to come up with the ideal test statistic <span class="math inline">\(T\)</span>, I’ve created a pretty crude idea for <span class="math inline">\(T\)</span> in the context of linear regression.</p>
<p>The first step is to split the data defined by different values of predictors <span class="math inline">\(x\)</span> into different bins (in this case I’ll use five) and calculate the proportion of observed <span class="math inline">\(y_i\)</span>’s that fall below the predicted mean <span class="math inline">\(\mu_i\)</span>:</p>
<p><span class="math display">\[p_{b} = \frac{1}{n_b}\sum_{i=1}^{n_b} I(y_i < \mu_i), \ b \in \{1,\dots,B\} \]</span></p>
<p>We also do the same to estimate <span class="math inline">\(p_b^{rep}\)</span> for each bin, using the replicated/predicted values of <span class="math inline">\(y\)</span>. We expect little variability in the <span class="math inline">\(p_b^{rep}\)</span>’s (i.e., <span class="math inline">\(p_1^{rep} \approx \dots \approx p_5^{rep}\)</span>): by definition, predictions are randomly scattered around the means in each bin, with half above and half below. If the model is a good fit of the observed data, we would expect the <span class="math inline">\(p_b\)</span>’s based on the observed data to all be close to 0.5 as well. However, if the model is a poor fit, there will likely be variability across bins in the proportions based on the observed <span class="math inline">\(y\)</span>’s, so that <span class="math inline">\(P(\text{var}(p^{rep}) > \text{var}(p))\)</span> should be quite close to 0.</p>
<p>[As I write this, I’m noticing that this binned test statistic might bear some of the same motivations that underlie the <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1965.10480811" target="_blank">Goldfeld-Quandt test for heteroscedasticity</a>. OK, not quite, but perhaps it is very, very tangentially related? In any case, the more famous test was developed in part by my father; today would have been his 81st birthday, so I am very happy to make that (very subtle) connection.]</p>
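<p>Before turning to the <code>rvar</code> implementation, here is a toy base-<code>R</code> sketch of the binned test statistic. The matrices <em>mu</em> and <em>pred</em> below are hypothetical stand-ins for posterior draws; all values are made up for illustration and are not taken from the fitted model:</p>
<pre class="r"><code># hypothetical (draws x observations) matrices standing in for posterior samples
set.seed(135)
n_draws <- 1000
n_obs <- 100
x <- runif(n_obs, 0, 10)
y <- 2 + 6 * x + rnorm(n_obs, 0, 2)
mu <- matrix(2 + 6 * x, n_draws, n_obs, byrow = TRUE) +
  matrix(rnorm(n_draws * n_obs, 0, 0.1), n_draws)           # draws of each mu_i
pred <- mu + matrix(rnorm(n_draws * n_obs, 0, 2), n_draws)  # replicated data

bins <- cut(x, breaks = seq(0, 10, by = 2), include.lowest = TRUE, labels = FALSE)

# per-draw proportion falling below mu within each bin
bin_props <- function(ind)
  sapply(1:5, function(b) rowMeans(ind[, bins == b, drop = FALSE]))

T_y <- apply(bin_props(sweep(mu, 2, y, ">")), 1, var)  # uses I(y < mu)
T_pred <- apply(bin_props(pred < mu), 1, var)
mean(T_pred > T_y)  # Bayesian p-value</code></pre>
<p>Since the toy data are generated from the same model used for the predictions, the p-value should not be systematically pushed toward 0 or 1.</p>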
</div>
<div id="estimating-the-p-value-from-the-data" class="section level4">
<h4>Estimating the p-value from the data</h4>
<p>One cool feature of <code>rvars</code> is that they can be included in <code>data.frames</code> (though not in <code>data.tables</code>). This allows us to do some convenient summarization without a lot of manipulation.</p>
<pre class="r"><code>df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)
head(df)</code></pre>
<pre><code>## x y mu pred grp
## 1 6.554 43.13 41.47 ± 0.253 41.42 ± 2.21 4
## 2 0.281 3.32 3.61 ± 0.390 3.59 ± 2.22 1
## 3 2.115 16.72 14.67 ± 0.291 14.66 ± 2.23 2
## 4 9.889 59.35 61.59 ± 0.425 61.60 ± 2.25 5
## 5 8.715 53.21 54.51 ± 0.356 54.51 ± 2.27 5
## 6 3.448 24.41 22.72 ± 0.239 22.72 ± 2.24 2</code></pre>
<p>In this case, I want to calculate the proportion of values where the observed <span class="math inline">\(y\)</span> is less than <span class="math inline">\(\mu\)</span> in each bin; I can use <code>lapply</code> on the data frame <em>df</em> to calculate each of those proportions. However, I am actually calculating the proportion 10,000 times within each bin, once for each sample, so I have a distribution of proportions within each bin.</p>
<pre class="r"><code>bin_prop_y <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], I(y < mu))))
bin_prop_y</code></pre>
<pre><code>## [[1]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.57 ± 0.022
##
## [[2]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.48 ± 0.041
##
## [[3]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.53 ± 0.027
##
## [[4]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.4 ± 0.079
##
## [[5]]
## rvar<2500,4>[1] mean ± sd:
## [1] 0.63 ± 0.027</code></pre>
<p>A brief word about the function <code>rvar_mean</code> that I’ve used here (there is a more detailed description on the <code>posterior</code> <a href="https://mc-stan.org/posterior/articles/rvar.html" target="_blank">website</a>). If we have samples of multiple variables, we can apply a function across the variables within a sample (as opposed to across samples within a single variable) by using <code>rvar_func</code>. Within each bin, there are roughly 20 variables (one for each individual), and by using the function <code>rvar_mean</code>, I am averaging across individuals within each sample to get a distribution of proportions within each bin.</p>
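<p>To see what <code>rvar_mean</code> is doing under the hood, the same within-sample averaging can be mimicked with a plain draws matrix (the numbers here are arbitrary, purely for illustration):</p>
<pre class="r"><code># 2 posterior samples (rows) of 3 variables (columns) - arbitrary values
draws <- matrix(c(1, 2, 3,
                  4, 5, 6), nrow = 2, byrow = TRUE)

rowMeans(draws)  # within-sample means across variables: 2 5
colMeans(draws)  # the usual per-variable posterior means: 2.5 3.5 4.5</code></pre>
<p><code>rvar_mean</code> performs the <code>rowMeans</code>-style calculation, but returns the result as a distribution (an <code>rvar</code>) rather than collapsing across samples.</p>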
<p>In the next steps, I need to do a little bit of manipulation to make things work. I was hoping to avoid this, but I haven’t been able to figure out any other way to get the data in the right format to estimate the probability. I am basically taking the data underlying the random variable (the 10,000 values for each bin), creating a single array, and then creating a new <code>rvar</code>.</p>
<pre class="r"><code>array_y <- abind(lapply(bin_prop_y, function(x) as_draws_array(draws_of(x))))
head(array_y)</code></pre>
<pre><code>## , , ...1
##
## 1
## 1 0.577
## 2 0.538
## 3 0.577
## 4 0.538
## 5 0.577
## 6 0.577
##
## , , ...1
##
## 1
## 1 0.500
## 2 0.444
## 3 0.500
## 4 0.500
## 5 0.444
## 6 0.500
##
## , , ...1
##
## 1
## 1 0.529
## 2 0.529
## 3 0.529
## 4 0.529
## 5 0.529
## 6 0.529
##
## , , ...1
##
## 1
## 1 0.579
## 2 0.368
## 3 0.579
## 4 0.579
## 5 0.368
## 6 0.474
##
## , , ...1
##
## 1
## 1 0.65
## 2 0.60
## 3 0.65
## 4 0.65
## 5 0.60
## 6 0.65</code></pre>
<pre class="r"><code>(rv_y <- rvar(array_y))</code></pre>
<pre><code>## rvar<10000>[1,5] mean ± sd:
## ...1 ...1 ...1 ...1 ...1
## 1 0.57 ± 0.022 0.48 ± 0.041 0.53 ± 0.027 0.40 ± 0.079 0.63 ± 0.027</code></pre>
<p>Here, I am repeating the steps on the predicted values (<span class="math inline">\(y^{rep}\)</span>). Even with the inelegant coding, it is still only three lines:</p>
<pre class="r"><code>bin_prop_pred <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (pred < mu))))
array_pred <- abind(lapply(bin_prop_pred, function(x) as_draws_array(draws_of(x))))
rv_pred <- rvar(array_pred)</code></pre>
<p>Finally, we are ready to calculate the p-value using the distribution of test statistics <span class="math inline">\(T\)</span>. Note that <code>rvar_var</code> calculates the variance of the proportions across the bins within a single sample, giving us a distribution of variances of the proportions based on the observed and predicted values. The overall p-value is the average over that distribution.</p>
<pre class="r"><code>(T_y <- rvar_var(rv_y))</code></pre>
<pre><code>## rvar<10000>[1] mean ± sd:
## [1] 0.0092 ± 0.0033</code></pre>
<pre class="r"><code>(T_pred <- rvar_var(rv_pred))</code></pre>
<pre><code>## rvar<10000>[1] mean ± sd:
## [1] 0.013 ± 0.0088</code></pre>
<pre class="r"><code># p-value
mean(T_pred > T_y)</code></pre>
<pre><code>## [1] 0.585</code></pre>
<p>As expected, since the data generation process and the model are roughly equivalent, the p-value is neither extremely large nor small, indicating a good fit.</p>
</div>
</div>
<div id="straying-from-the-simple-model-assumptions" class="section level3">
<h3>Straying from the simple model assumptions</h3>
<p>If we tweak the data generation process slightly by including a quadratic term, things change a bit:</p>
<p><span class="math display">\[y \sim N(\mu = 2 + 6*x - 0.3x^2, \ \sigma^2 = 4)\]</span></p>
<p>Below, I give you the code and output without any commentary, except to say that both the visual display and the p-value strongly suggest that the simple linear regression model is <em>not</em> a good fit for these data generated with an added quadratic term.</p>
<pre class="r"><code>b_quad <- -0.3
dd <- genData(100, ddef)</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/plot2-1.png" width="576" /></p>
<pre class="r"><code>fit <- mod$sample(
data = list(N = nrow(dd), x = dd$x, y = dd$y),
seed = 72651,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 500,
iter_sampling = 2500
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.1 seconds.</code></pre>
<pre class="r"><code>post_rvars <- as_draws_rvars(fit$draws())
x_rvar <- as_rvar(dd$x)
mu <- post_rvars$alpha + post_rvars$beta * x_rvar
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)
df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90))))
df.80[, extreme := !(y >= V1 & y <= V2)]</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/index.en_files/figure-html/plotintervals2-1.png" width="576" /></p>
<pre class="r"><code>df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)
bin_prop_y <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (y < mu))))
array_y <- abind(lapply(bin_prop_y, function(x) as_draws_array(draws_of(x))))
rv_y <- rvar(array_y)
bin_prop_pred <- lapply(1:5, function(x) rvar_mean(with(df[df$grp == x,], (pred < mu))))
array_pred <- abind(lapply(bin_prop_pred, function(x) as_draws_array(draws_of(x))))
rv_pred <- rvar(array_pred)
(T_y <- rvar_var(rv_y))</code></pre>
<pre><code>## rvar<10000>[1] mean ± sd:
## [1] 0.11 ± 0.012</code></pre>
<pre class="r"><code>(T_pred <- rvar_var(rv_pred))</code></pre>
<pre><code>## rvar<10000>[1] mean ± sd:
## [1] 0.013 ± 0.0091</code></pre>
<pre class="r"><code>mean(T_pred > T_y)</code></pre>
<pre><code>## [1] 0</code></pre>
<p>I followed up this post with a quick update <a href="https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/">here</a>.</p>
<p>
<p><small><font color="darkkhaki"></p>
<p>References:</p>
<p>Gelman, Andrew. “A Bayesian formulation of exploratory data analysis and goodness‐of‐fit testing.” <em>International Statistical Review</em> 71, no. 2 (2003): 369-382.</p>
<p>Goldfeld, Stephen M., and Richard E. Quandt. “Some tests for homoscedasticity.” <em>Journal of the American statistical Association</em> 60, no. 310 (1965): 539-547.</p>
<p>Kerman, Jouni, and Andrew Gelman. “Manipulating and summarizing posterior simulations using random variable objects.” <em>Statistics and Computing</em> 17, no. 3 (2007): 235-244.</p>
</font></small>
</p>
</div>
Estimating a risk difference (and confidence intervals) using logistic regression
https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/
Tue, 15 Jun 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/
<script src="https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/index.en_files/header-attrs/header-attrs.js"></script>
<p>The <em>odds ratio</em> (OR) – the effect size parameter estimated in logistic regression – is notoriously difficult to interpret. It is a ratio of two quantities (odds, under different conditions) that are themselves ratios of probabilities. I think it is pretty clear that a very large or small OR implies a strong treatment effect, but translating that effect into a clinical context can be challenging, particularly since ORs cannot be mapped to unique probabilities.</p>
<p>One alternative measure of effect is the <em>risk difference</em>, which is certainly much more intuitive. Although a difference is very easy to calculate when measured non-parametrically (you just calculate the proportion for each arm and take the difference), things get a little less obvious when there are covariates that need adjusting. (There is a method developed by <a href="https://amstat.tandfonline.com/doi/full/10.1080/01621459.2016.1192546?casa_token=EspaMRhG3OIAAAAA%3AHCGnpIqZnUoAroQuWUCwKv5ANjH5mapba9vCUMrY-pkEmOMVmUuKZjDL-pZu2gC_9eKirj8j7CBk" target="_blank">Richardson, Robins, &amp; Wang</a> that allows analysts to model the risk difference, but I won’t get into that here.)</p>
<p>Currently, I’m working on an <a href="https://impactcollaboratory.org/" target="_blank">NIA IMPACT Collaboratory</a> study evaluating an intervention designed to increase COVID-19 vaccination rates for staff and long-term residents in nursing facilities. A collaborator suggested we report the difference in vaccination rates rather than the odds ratio, arguing in favor of the more intuitive measure. From my perspective, the only possible downside in using a risk difference instead of an OR is that risk difference estimates are <em>marginal</em>, whereas odds ratios are <em>conditional</em>. (I’ve written about this distinction <a href="https://www.rdatagen.net/post/marginal-v-conditional/" target="_blank">before</a>.) The marginal risk difference estimate is a function of the distribution of patient characteristics in the study that influence the outcome, so the reported estimate might not be generalizable to other populations. The odds ratio, on the other hand, is not dependent on the covariates. The ultimate consensus on our research team is that the benefits of improved communication outweigh the potential loss of generalizability.</p>
<p>My goal here is to demonstrate the relative simplicity of estimating the marginal risk difference described in these papers by <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1475-6773.2008.00900.x" target="_blank">Kleinman & Norton</a> and <a href="https://www.sciencedirect.com/science/article/pii/S0895435608003168" target="_blank">Peter Austin</a>. I won’t be using real data from the study that motivated this, but will generate simulated data so that I can illustrate the contrast between marginal and conditional estimates.</p>
<div id="quickly-defining-the-parameters-of-interest" class="section level3">
<h3>Quickly defining the parameters of interest</h3>
<p>In the study that motivated this, we had two study arms - an intervention arm which involved extensive outreach and vaccination promotion and the other a control arm where nothing special was done. So, there are two probabilities that we are interested in: <span class="math inline">\(p_1 \equiv P(\text{vaccinated} | \text{intervention})\)</span> and <span class="math inline">\(p_0 \equiv P(\text{vaccinated} | \text{control}).\)</span></p>
<p>The risk difference comparing the two groups is simply</p>
<p><span class="math display">\[\text{RD} = p_1 - p_0,\]</span>
the odds <span class="math inline">\(w_a\)</span> for each treatment group is</p>
<p><span class="math display">\[w_a = \frac{p_a}{1-p_a}, \ \ a \in \{0,1\},\]</span>
and the odds ratio comparing the intervention arm to the control arm is</p>
<p><span class="math display">\[\text{OR} = \frac{w_1}{w_0}.\]</span></p>
<p>The logistic regression model models the log odds as a linear function of the intervention status and any other covariates that are being adjusted. In the examples below, there is one continuous covariate <span class="math inline">\(x\)</span> that ranges from -0.5 to 0.5:</p>
<p><span class="math display">\[\text{log}(w_A) = \alpha + \beta A + \gamma X.\]</span>
<span class="math inline">\(\beta\)</span> represents the log(OR) conditional on a particular value of <span class="math inline">\(X\)</span>:</p>
<p><span class="math display">\[\text{log}(w_1) = \alpha + \beta + \gamma X \\
\text{log}(w_0) = \alpha + \gamma X,
\]</span>
and</p>
<p><span class="math display">\[\text{log(OR)} = \text{log}\left(\frac{w_1}{w_0}\right) =\text{log}(w_1) - \text{log}(w_0) = \beta\]</span></p>
<p>More importantly, we can move between odds and probability relatively easily:</p>
<span class="math display">\[\begin{aligned}
\frac{p_a}{1-p_a} &= w_a \\
p_a &= w_a(1- p_a) \\
p_a + w_ap_a &= w_a \\
p_a &= \frac{w_a}{1 + w_a} \\
p_a &= \frac{1}{1 + w_a^{-1}}
\end{aligned}\]</span>
</div>
<div id="estimating-the-marginal-probability-using-model-estimates" class="section level3">
<h3>Estimating the marginal probability using model estimates</h3>
<p>After fitting the model, we have estimates <span class="math inline">\(\hat{\alpha}\)</span>, <span class="math inline">\(\hat{\beta}\)</span>, and <span class="math inline">\(\hat{\gamma}\)</span>. We can generate a pair of odds for each individual <span class="math inline">\(i\)</span> (<span class="math inline">\(w_{i1}\)</span> and <span class="math inline">\(w_{i0}\)</span>) using their observed <span class="math inline">\(x_i\)</span> and the estimated parameters. All we need to do is set <span class="math inline">\(a=1\)</span> and <span class="math inline">\(a=0\)</span> to generate a predicted <span class="math inline">\(\hat{w}_{i1}\)</span> and <span class="math inline">\(\hat{w}_{i0}\)</span>, respectively, for each individual. Note we do not pay attention to the actual treatment arm that the individual was randomized to:</p>
<p><span class="math display">\[ \text{log}(\hat{w}_{i1}) = \hat{\alpha} + \hat{\beta} + \hat{\gamma}x_i, \]</span></p>
<p>or</p>
<p><span class="math display">\[ \hat{w}_{i1} = \text{exp}(\hat{\alpha} + \hat{\beta} + \hat{\gamma}x_i). \]</span></p>
<p>Likewise,</p>
<p><span class="math display">\[ \hat{w}_{i0} = \text{exp}(\hat{\alpha} + \hat{\gamma}x_i). \]</span>
We get <span class="math inline">\(\hat{p}_{ia}\)</span> for <span class="math inline">\(a \in \{0,1\}\)</span> as</p>
<p><span class="math display">\[ \hat{p}_{ia} = \frac{1}{1 + \hat{w}_{ia}^{-1}}\]</span></p>
<p>Finally, the marginal risk difference <span class="math inline">\(\widehat{\text{RD}}\)</span> can be estimated as</p>
<p><span class="math display">\[ \widehat{\text{RD}} = \frac{1}{n}\sum_{i=1}^n \hat{p}_{i1} - \frac{1}{n}\sum_{i=1}^n \hat{p}_{i0} \]</span></p>
<p>from all <span class="math inline">\(n\)</span> study participants regardless of actual treatment assignment.</p>
<p>Fortunately, in <code>R</code> we don’t need to do any of these calculations as predictions on the probability scale can be extracted from the model fit. Standard errors of this risk difference can be estimated using bootstrap methods.</p>
</div>
<div id="simulated-data-set" class="section level3">
<h3>Simulated data set</h3>
<p>Before getting into the simulations, here are the packages needed to run the code shown here:</p>
<pre class="r"><code>set.seed(287362)
library(simstudy)
library(data.table)
library(ggplot2)
library(ggthemes)
library(parallel)</code></pre>
<p>I am generating a binary outcome <span class="math inline">\(y\)</span> that is a function of a continuous covariate <span class="math inline">\(x\)</span> that ranges from -0.5 to 0.5. I use the <em>beta</em> distribution to generate <span class="math inline">\(x1\)</span>, which is then transformed into <span class="math inline">\(x\)</span>. The advantage of this distribution is the flexibility we have in defining its shape. The OR used to generate the outcome is 2.5:</p>
<pre class="r"><code>def <- defDataAdd(varname = "x1", formula = "..mu_x", variance = 8, dist = "beta")
def <- defDataAdd(def, varname = "x", formula = "x1 - 0.5", dist = "nonrandom")
def <- defDataAdd(def, varname = "y",
formula = "-2 + log(2.5) * rx + 1.5 * x",
dist = "binary", link="logit")</code></pre>
<p>In the first scenario of 500 observations, the distribution of <span class="math inline">\(x\)</span> will be right-skewed. This is established by setting the mean of <span class="math inline">\(x1\)</span> close to 0:</p>
<pre class="r"><code>mu_x = 0.2
dd_2 <- genData(500)
dd_2 <- trtAssign(dd_2, grpName = "rx")
dd_2 <- addColumns(def, dd_2)</code></pre>
<pre class="r"><code>ggplot(data = dd_2, aes(x = x)) +
geom_histogram(fill="#9ec785", binwidth = 0.05, boundary = 0) +
scale_x_continuous(limits = c(-.55, .55), breaks = seq(-.5, .5, by = .25)) +
theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/index.en_files/figure-html/unnamed-chunk-5-1.png" width="576" /></p>
<p>The first step in estimating the risk difference is to fit a logistic regression model:</p>
<pre class="r"><code>glmfit <- glm(y ~ rx + x, data = dd_2, family = "binomial")</code></pre>
<p>Next, we need to predict the probability for each individual based on the model fit under each treatment condition. This will give us <span class="math inline">\(\hat{p}_{i1}\)</span> and <span class="math inline">\(\hat{p}_{i0}\)</span>:</p>
<pre class="r"><code>newdata <- dd_2[, .(rx=1, x)]
p1 <- mean(predict(glmfit, newdata, type = "response"))
newdata <- dd_2[, .(rx=0, x)]
p0 <- mean(predict(glmfit, newdata, type = "response"))
c(p1, p0)</code></pre>
<pre><code>## [1] 0.152 0.068</code></pre>
<p>A simple calculation gives us the point estimate for the risk difference (and note that the estimated OR is close to 2.5, the value used to generate the data):</p>
<pre class="r"><code>risk_diff <- p1 - p0
odds_ratio <- exp(coef(glmfit)["rx"])
c(rd = risk_diff, or = odds_ratio)</code></pre>
<pre><code>## rd or.rx
## 0.084 2.456</code></pre>
<p>We can use a bootstrap method to estimate a 95% confidence interval for the risk difference. This involves sampling ids from each treatment group <em>with</em> replacement, fitting a new logistic regression model, predicting probabilities, and calculating the risk difference. This is repeated 999 times to get a distribution of risk differences, from which we extract an estimated confidence interval:</p>
<pre class="r"><code>bootdif <- function(dd) {
db <- dd[, .(id = sample(id, replace = TRUE)), keyby = rx]
db <- merge(db[, id, rx], dd, by = c("id", "rx"))
glmfit <- glm(y ~ rx + x, data = db, family = "binomial")
newdata <- db[, .(rx=1, x)]
p1 <- mean(predict(glmfit, newdata, type = "response"))
newdata <- db[, .(rx=0, x)]
p0 <- mean(predict(glmfit, newdata, type = "response"))
return(p1 - p0)
}
bootest <- unlist(mclapply(1:999, function(x) bootdif(dd_2), mc.cores = 4))
quantile(bootest, c(0.025, 0.975))</code></pre>
<pre><code>## 2.5% 97.5%
## 0.031 0.137</code></pre>
</div>
<div id="change-in-distribution-changes-risk-difference" class="section level3">
<h3>Change in distribution changes risk difference</h3>
<p>To illustrate how a shift in the distribution of <span class="math inline">\(x\)</span> can influence the marginal risk difference without changing the odds ratio, I just need to specify the mean of <span class="math inline">\(x1\)</span> to be closer to 1. This creates a left-skewed distribution that will increase the risk difference:</p>
<pre class="r"><code>mu_x = 0.8</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/index.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p>The risk difference appears to increase, but the OR seems to be pretty close to the true value of 2.5:</p>
<pre><code>## rd or.rx
## 0.18 2.59</code></pre>
<p>And for completeness, here is the estimated confidence interval:</p>
<pre><code>## 2.5% 97.5%
## 0.10 0.25</code></pre>
</div>
<div id="a-more-robust-comparison" class="section level3">
<h3>A more robust comparison</h3>
<p>It is hardly fair to evaluate this property using only two data sets. It is certainly possible that the estimated risk differences are inconsistent just by chance. I have written some functions (provided below in the <a href="#addendum">addendum</a>) that facilitate the replication of numerous data sets created under different distribution assumptions to generate a distribution of estimated risk differences (as well as a distribution of estimated ORs). I have generated 5000 data sets of 500 observations each under four different values of <code>mu_x</code> used in the data generation process defined above: {0.2, 0.4, 0.6, 0.8}.</p>
<p>It is pretty apparent that the risk difference increases as the distribution of <span class="math inline">\(x\)</span> shifts from right-skewed to left-skewed:</p>
<p><img src="https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/index.en_files/figure-html/unnamed-chunk-16-1.png" width="576" /></p>
<p>And it is equally apparent that shifting the distribution has no impact on the OR, which is consistent across different levels of <span class="math inline">\(x\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/index.en_files/figure-html/unnamed-chunk-17-1.png" width="576" /></p>
<p>
<p><small><font color="darkkhaki"></p>
<p>References:</p>
<p>Austin, Peter C. “Absolute risk reductions, relative risks, relative risk reductions, and numbers needed to treat can be obtained from a logistic regression model.” <em>Journal of clinical epidemiology</em> 63, no. 1 (2010): 2-6.</p>
<p>Kleinman, Lawrence C., and Edward C. Norton. “What’s the risk? A simple approach for estimating adjusted risk measures from nonlinear models including logistic regression.” <em>Health services research</em> 44, no. 1 (2009): 288-302.</p>
<p>Richardson, Thomas S., James M. Robins, and Linbo Wang. “On modeling and estimation for the relative risk and risk difference.” <em>Journal of the American Statistical Association</em> 112, no. 519 (2017): 1121-1130.</p>
<p>Support:</p>
<p>This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/" target="_blank">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.</p>
</font></small>
</p>
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum-replication-code" class="section level3">
<h3>Addendum: replication code</h3>
<pre class="r"><code>s_define <- function() {
def <- defDataAdd(varname = "x1", formula = "..mu_x", variance = 8, dist = "beta")
def <- defDataAdd(def, varname = "x", formula = "x1 - 0.5", dist = "nonrandom")
def <- defDataAdd(def, varname = "y",
formula = "-2 + log(2.5) * rx + 1.5 * x",
dist = "binary", link="logit")
return(list(def = def)) # list_of_defs is a list of simstudy data definitions
}
s_generate <- function(list_of_defs, argsvec) {
list2env(list_of_defs, envir = environment())
list2env(as.list(argsvec), envir = environment())
dx <- genData(n)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(def, dx)
return(dx) # generated data is a data.table
}
s_model <- function(dx) {
glmfit <- glm(y ~ rx + x, data = dx, family = "binomial")
newdata <- dx[, .(rx=1, x)]
p1 <- mean(predict(glmfit, newdata, type = "response"))
newdata <- dx[, .(rx=0, x)]
p0 <- mean(predict(glmfit, newdata, type = "response"))
risk_diff <- p1 - p0
odds_ratio <- exp(coef(glmfit)["rx"])
model_results <- data.table(risk_diff, odds_ratio)
return(model_results) # model_results is a data.table
}
s_single_rep <- function(list_of_defs, argsvec) {
generated_data <- s_generate(list_of_defs, argsvec)
model_results <- s_model(generated_data)
return(model_results)
}
s_replicate <- function(argsvec, nsim) {
list_of_defs <- s_define()
model_results <- rbindlist(
parallel::mclapply(
X = 1 : nsim,
FUN = function(x) s_single_rep(list_of_defs, argsvec),
mc.cores = 4)
)
model_results <- cbind(t(argsvec), model_results)
return(model_results) # model_results is a data.table
}
### Scenarios
scenario_list <- function(...) {
argmat <- expand.grid(...)
return(asplit(argmat, MARGIN = 1))
}
#----
n <- 500
mu_x <- c(0.2, 0.4, 0.6, 0.8)
scenarios <- scenario_list(n = n, mu_x = mu_x)
summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 5000)))
ggplot(data = summary_stats, aes(x = risk_diff, group = mu_x)) +
geom_density(aes(fill = factor(mu_x)), alpha = .7) +
scale_fill_canva(palette = "Simple but bold", name = "mu_x") +
theme(panel.grid = element_blank()) +
xlab("estimated risk difference")
ggplot(data = summary_stats, aes(x = odds_ratio, group = mu_x)) +
geom_density(aes(fill = factor(mu_x)), alpha = .7) +
scale_fill_canva(palette = "Simple but bold", name = "mu_x") +
theme(panel.grid = element_blank()) +
xlab("estimated odds ratio")</code></pre>
</div>
Sample size determination in the context of Bayesian analysis
https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/
Tue, 01 Jun 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/
<script src="https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/index.en_files/header-attrs/header-attrs.js"></script>
<p>Given my recent involvement with the design of a somewhat complex <a href="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/" target="_blank">trial</a> centered around a Bayesian data analysis, I am appreciating more and more that Bayesian approaches are a very real option for clinical trial design. A key element of any study design is sample size. While some would argue that sample size considerations are not critical to the Bayesian design (since Bayesian inference is agnostic to any pre-specified sample size and is not really affected by how frequently you look at the data along the way), it might be a bit of a challenge to submit a grant without telling the potential funders how many subjects you plan on recruiting (since that could have a rather big effect on the level of resources - financial and time - required.)</p>
<p><a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">Earlier</a>, I touched a bit on these issues while discussing the frequentist properties of Bayesian models, but I didn’t really get directly into sample size considerations. I’ve been doing some more exploring and simulating, so I am sharing some of that here.</p>
<div id="bayesian-inference" class="section level3">
<h3>Bayesian inference</h3>
<p>In the Bayesian framework, all statistical inference is based on the estimated posterior probability distribution for the parameter(s) of interest (say <span class="math inline">\(\theta\)</span>) once we have observed the data: <span class="math inline">\(P(\theta | \text{data})\)</span>. In addition to extracting the mean or median of the distribution as a point estimate, we can get a measure of uncertainty by extracting quantiles from the distribution (a 95% interval comes to mind, though there is no reason to be limited by that convention).</p>
<p>Alternatively, we can make a probability statement about the parameter being above or below a threshold of effectiveness. For example, if we are estimating a log-odds ratio for an intervention that prevents a bad outcome, we might be interested in <span class="math inline">\(P(log(OR) < 0).\)</span> We may even pre-specify that the trial will be considered a success if <span class="math inline">\(P(log(OR) < 0) > 0.95.\)</span></p>
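<p>As a quick sketch of how these quantities are computed, here I use simulated normal draws as a stand-in for a real posterior sample of the log-odds ratio (in practice, the draws would come from a fitted model):</p>
<pre class="r"><code># stand-in for a posterior sample of the log-odds ratio;
# in practice these draws would come from the fitted model
draws <- rnorm(16000, mean = -0.6, sd = 0.5)

median(draws)                            # point estimate
quantile(draws, probs = c(0.025, 0.975)) # 95% interval

p_effective <- mean(draws < 0)           # estimate of P(log(OR) < 0)
p_effective > 0.95                       # pre-specified success criterion</code></pre>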
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(cmdstanr)
library(posterior)
library(bayesplot)</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>To investigate, I will use a simple binary outcome <span class="math inline">\(Y\)</span> that is changed by exposure or intervention <span class="math inline">\(A\)</span>. In this first case, I will randomly select a log-odds ratio from <span class="math inline">\(N(\mu = -1, \sigma = 0.5).\)</span></p>
<pre class="r"><code>defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * A",
  dist = "binary", link = "logit")

set.seed(21)
lor <- rnorm(1, -1, 0.5)

dT <- genData(200)
dT <- trtAssign(dT, grpName = "A")
dT <- addColumns(defB, dT)</code></pre>
</div>
<div id="model-fitting" class="section level3">
<h3>Model fitting</h3>
<p>I am primarily interested in recovering the log-odds ratio used to generate the data with a simple Bayesian model, written here in <code>Stan</code>. The parameter of interest in the <code>Stan</code> model is <span class="math inline">\(\beta\)</span>, the log-odds ratio. The prior distribution is <span class="math inline">\(t_{student}(df=3, \mu=0, \sigma=5).\)</span></p>
<pre class="stan"><code>data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
  vector[N] x;
  real mu;
  real s;
}

parameters {
  real alpha;
  real beta;
}

model {
  beta ~ student_t(3, mu, s);
  y ~ bernoulli_logit(alpha + beta * x);
}</code></pre>
<p>To estimate the posterior distribution, I am using the <code>R</code> package <code>cmdstanr</code>:</p>
<pre class="r"><code>mod <- cmdstan_model("code/bayes_logistic.stan")

fit <- mod$sample(
  data = list(N = nrow(dT), y = dT$Y, x = dT$A, mu = 0, s = 5),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 4000,
  step_size = 0.1,
  show_messages = FALSE
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.4 seconds.</code></pre>
<p>(If you’re impressed at how fast that model ran, it is because it is on my new MacBook Pro with the new Apple M1 chip - 4 or 5 times faster than my previous MacBook Pro with an Intel chip. It took me a while to get <code>R</code>, <code>RStudio</code>, and particularly, <code>cmdstan</code> up and running, but once I did, it has been totally worth it.)</p>
<p>First thing to check, of course, is whether the sampling from the posterior distribution was well-behaved. Here is a trace plot for the parameter <span class="math inline">\(\beta\)</span>:</p>
<pre class="r"><code>draws_array <- as_draws_array(fit$draws())
mcmc_trace(draws_array, pars = "beta")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/index.en_files/figure-html/unnamed-chunk-6-1.png" width="384" /></p>
<p>Here are the summary statistics of the posterior distribution. Based on these data, the median log-odds ratio is <span class="math inline">\(-0.61\)</span> and <span class="math inline">\(P(lor < 0) = 89\%\)</span>:</p>
<pre class="r"><code>res <- data.table(fit$summary(variables = "beta"))[,
  .(median, sd, q95, len = q95 - q5)]

betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
res$p0 <- mean(betas$beta.V1 < 0)

res</code></pre>
<pre><code>## median sd q95 len p0
## 1: -0.6050845 0.511862 0.2103548 1.673138 0.88875</code></pre>
<p>A plot of the posterior distribution is the best way to fully assess the state of knowledge about the parameter after observing this data set. The density plot includes a vertical dashed line at the median, and the dark shading indicates the lowest <span class="math inline">\(95\%\)</span> of the density. The fact that the cutoff point <span class="math inline">\(0\)</span> lies within the bottom <span class="math inline">\(95\%\)</span> makes it clear that the threshold was not met.</p>
<pre class="r"><code>d <- density(draws_array[,,"beta"], n = 1024)
plot_points <- as.data.table(d[c("x", "y")])
median_xy <- plot_points[findInterval(res$median, plot_points$x)]

ggplot(data = plot_points, aes(x = x, y = y)) +
  geom_area(aes(fill = (x < res$q95))) +
  geom_segment(x = median_xy$x, xend = median_xy$x, y = 0, yend = median_xy$y,
    size = 0.2, color = "white", lty = 3) +
  scale_fill_manual(values = c("#adc3f2", "#5886e5")) +
  theme(panel.grid = element_blank(),
    legend.position = "none")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/index.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
</div>
<div id="bayesian-power" class="section level3">
<h3>Bayesian power</h3>
<p>If we want to assess what kind of sample sizes we might want to target in a study based on this relatively simple design (binary outcome, two-armed trial), we can conduct a Bayesian power analysis that has a somewhat different flavor from the more typical frequentist power analyses I do with simulation. There are a few resources I’ve found very useful here: this book by <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/0470092602" target="_blank">Spiegelhalter et al</a> and these two papers, one by <a href="https://projecteuclid.org/journals/statistical-science/volume-17/issue-2/A-simulation-based-approach-to-Bayesian-sample-size-determination-for/10.1214/ss/1030550861.full" target="_blank">Wang & Gelfand</a> and another by <a href="https://www.mdpi.com/1660-4601/18/2/595" target="_blank">De Santis & Gubbiotti</a>.</p>
<p>When I conduct a power analysis within a frequentist framework, I usually assume a set of <em>fixed/known</em> effect sizes, and the hypothesis tests are centered around the frequentist p-value at a specified level of <span class="math inline">\(\alpha\)</span>. The Bayesian power analysis differs with respect to these two key elements: a distribution of effect sizes replaces the single fixed effect size to accommodate uncertainty, and the posterior distribution probability threshold (or another criterion, such as the variance of the posterior distribution or the length of the 95% credible interval) replaces the frequentist hypothesis test.</p>
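<p>Putting those two elements together, a single replication of the simulation can be sketched in a few lines. This is only a condensed sketch (it reuses the data definition <code>defB</code> and the compiled model <code>mod</code> from earlier in this post, and omits the parallelization actually used):</p>
<pre class="r"><code># condensed sketch of one replication of the Bayesian power simulation,
# reusing `defB` and the compiled Stan model `mod` from above
single_rep <- function(nobs) {

  lor <- rnorm(1, -1, 0.5)  # effect size drawn from the data generation prior

  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "A")
  dT <- addColumns(defB, dT)

  fit <- mod$sample(
    data = list(N = nrow(dT), y = dT$Y, x = dT$A, mu = 0, s = 5),
    refresh = 0, chains = 4L, parallel_chains = 4L
  )

  beta <- as.matrix(as_draws_array(fit$draws())[,,"beta"])
  mean(beta < 0) >= 0.95    # did this replication meet the success threshold?
}

# the Bayesian "power" at a given sample size is the proportion of successes:
# mean(replicate(1000, single_rep(200)))</code></pre>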
<p>We have a prior distribution of effect sizes. De Santis and Gubbiotti suggest it is not necessary (and perhaps less desirable) to use the same prior used in the model fitting. That means you could use a skeptical (conservative) prior centered around 0 in the analysis, but use a prior for data generation that is consistent with a clinically meaningful effect size. In the example above, the <em>analysis prior</em> was</p>
<p><span class="math display">\[ \beta \sim t_{student}(df = 3, \mu = 0, \sigma = 5) \]</span></p>
<p>and the <em>data generation prior</em> was</p>
<p><span class="math display">\[ \beta \sim N(\mu = -1, \sigma = 0.5).\]</span></p>
<p>To conduct the Bayesian power analysis, I replicated the simulation and model fitting shown above 1000 times for each of seven different sample sizes ranging from 100 to 400. (Even though my laptop is quite speedy, I used the NYU Langone Health high performance cluster Big Purple to do this, because I wanted to save a few hours.) I’m not showing the parallelized code in this post, but take a look <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">here</a> for an example similar to this. (I’m happy to share with anyone if you’d like to have the code. Updated 7/1/2021: code has been added in the Addendum below.)</p>
<p>The plots below show a sample of 20 posterior distributions taken from the 1000 generated for each of three sample sizes. As in the frequentist context, an increase in sample size appears to reduce the variance of the posterior distribution estimated in a Bayesian model. We can see visually that as the sample size increases, the distribution collapses towards the mean or median, which has a direct impact on how confident we are in drawing conclusions from the data; in this case, it is apparent that as sample size increases, the proportion of posterior distributions that meet the 95% threshold increases.</p>
<p><img src="img/p95.png" /></p>
<p>Here is a curve that summarizes the probability of a posterior distribution meeting the 95% threshold at each sample size level. At a size of 400, 80% of the posterior distributions (which are based on data generated from varying effect sizes drawn from the <em>data generation prior</em>, and estimated using the <em>analysis prior</em>) would lead us to conclude that the trial is a success.</p>
<p><img src="img/power_curve.png" /></p>
<small><font color="darkkhaki">
<p>References:</p>
<p>Wang, Fei, and Alan E. Gelfand. “A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models.” <em>Statistical Science</em> 17, no. 2 (2002): 193-208.</p>
<p>Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. <em>Bayesian approaches to clinical trials and health-care evaluation</em>. Vol. 13. John Wiley & Sons, 2004.</p>
<p>De Santis, Fulvio, and Stefania Gubbiotti. “Sample Size Requirements for Calibrated Approximate Credible Intervals for Proportions in Clinical Trials.” <em>International Journal of Environmental Research and Public Health</em> 18, no. 2 (2021): 595.</p>
</font></small>
<p><br>
<br></p>
</div>
<div id="addendum" class="section level2">
<h2>Addendum</h2>
<p>Here is the full R code for the Bayesian power analysis using simulation. I am including the <code>slurmR</code> code that I used to execute on the HPC:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(bayesplot)
library(posterior)
library(cmdstanr)
library(slurmR)
library(collapse)

s_define <- function() {

  defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * rx",
    dist = "binary", link = "logit")

  return(list(defB = defB)) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {

  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())

  #--- add data generation code ---#

  lor <- rnorm(1, mu.lor, sigma.lor)

  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "rx")
  dT <- addColumns(defB, dT)

  return(dT[])
}

s_model <- function(generated_data, mod, argsvec) {

  list2env(as.list(argsvec), envir = environment())

  dt_to_list <- function(dx) {

    N <- nrow(dx)  ## number of observations
    y <- dx$Y      ## individual outcome
    x <- dx$rx     ## treatment arm for individual
    s <- t_sigma
    mu <- 0        # can be mu.lor

    list(N = N, y = y, x = x, s = s, mu = mu)
  }

  fit <- mod$sample(
    data = dt_to_list(generated_data),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 1000,
    iter_sampling = 4000,
    step_size = 0.1,
    show_messages = FALSE
  )

  res <- data.table(fit$summary(variables = "beta"))[, .(median, sd, q95, len = q95 - q5)]

  draws_array <- as_draws_array(fit$draws())
  betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
  res$p0 <- mean(betas$beta.V1 < 0)

  return(res) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec, mod) {

  set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")

  list_of_defs <- s_define()
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data, mod, argsvec)

  return(model_results)
}

s_replicate <- function(argsvec, nsim, mod) {

  list_of_defs <- s_define()

  model_results <- lapply(
    X = 1 : nsim,
    FUN = function(x) s_single_rep(list_of_defs, argsvec, mod)
  )

  #--- add summary statistics code ---#

  model_sums <- unlist2d(lapply(model_results, function(x) x),
    idcols = "replicate", DT = TRUE)

  summary_stats <- model_sums[,
    .(p_95 = mean(p0 >= 0.95),
      p_len = mean(len <= 2),
      p_sd = mean(sd <= 0.5))
  ]

  model_ests <- data.table(t(argsvec), summary_stats)

  return(model_ests)
}

###

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

mu.lor <- c(0, -0.5, -1.0, -1.5)
sigma.lor <- c(0.25)
nobs <- c(100, 150, 200, 250, 300, 350, 400)
t_sigma <- c(1, 5, 10)

scenarios <- scenario_list(mu.lor = mu.lor, sigma.lor = sigma.lor,
  nobs = nobs, t_sigma = t_sigma)

set_cmdstan_path(path = ".../cmdstan/2.25.0")
mod <- cmdstan_model("present.stan")

job <- Slurm_lapply(
  X = scenarios,
  FUN = s_replicate,
  mod = mod,
  nsim = 1200,
  njobs = min(length(scenarios), 90L),
  mc.cores = 4L,
  job_name = "i_bp",
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_single_rep", "s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

summary_stats <- Slurm_collect(job)
final_tab <- rbindlist(summary_stats)

save(final_tab, file = ".../bp.rda")</code></pre>
</div>
Generating random lists of names with errors to explore fuzzy word matching
https://www.rdatagen.net/post/2021-04-13-generating-random-lists-of-names-with-errors-to-explore-fuzzy-word-matching/
Tue, 13 Apr 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-04-13-generating-random-lists-of-names-with-errors-to-explore-fuzzy-word-matching/
<script src="https://www.rdatagen.net/post/2021-04-13-generating-random-lists-of-names-with-errors-to-explore-fuzzy-word-matching/index.en_files/header-attrs/header-attrs.js"></script>
<p>Health data systems are not always perfect, a point that was made quite obvious when a study I am involved with required matching a list of nursing home residents taken from one system with a set of results from PCR tests for COVID-19 drawn from another. Name spellings for the same person from the second list were not always consistent across different PCR tests, nor were they always consistent with the cohort we were interested in studying, defined by the first list. My research associate, Yifan Xu, and I were asked to see what we could do to help out.</p>
<p>This is my first foray into fuzzy word-matching. We came up with a simple solution to match the names on the two lists based on the <code>R</code> function <code>adist</code> that should allow the research team to finalize a matched list with minimal manual effort.</p>
<p>In order to test our proposed approach, we developed a way to generate random lists of names with errors. This post presents both the code for random list generation with errors as well as the simple matching algorithm.</p>
<div id="distance-between-strings" class="section level3">
<h3>Distance between strings</h3>
<p>Fuzzy word matching can be approached using the concept of <em>string distance</em>. Quite simply, this can be measured by counting the number of transformations required to move from the original to the target word. A transformation is one of three moves: (1) a substitution, (2) an insertion, or (3) a deletion. The figure below illustrates the 5 “moves” that are required to get from <strong>CAT</strong> to <strong>KITTEN</strong>: two substitutions and three insertions.</p>
<p><img src="img/cat_to_kitten.png" style="width:80.0%" /></p>
<p>The <code>adist</code> function can calculate this string distance, and if you set <code>counts = TRUE</code>, the function will provide the number of substitutions, insertions, and deletions. Here are the results for our example:</p>
<pre class="r"><code>adist("CAT", "KITTEN", counts = TRUE)</code></pre>
<pre><code>## [,1]
## [1,] 5
## attr(,"counts")
## , , ins
##
## [,1]
## [1,] 3
##
## , , del
##
## [,1]
## [1,] 0
##
## , , sub
##
## [,1]
## [1,] 2
##
## attr(,"trafos")
## [,1]
## [1,] "SSIMII"</code></pre>
<p>Assessing whether a distance is meaningful or no longer fuzzy certainly depends on the nature of the problem and the length of the strings. The distance from <strong>CAT</strong> to <strong>DOG</strong> is 3 (with 3 substitutions), and so is the distance from <strong>DERE, STEPHEN</strong> to <strong>DEERE, STEVEN</strong> (1 insertion, 1 deletion, and 1 substitution); we might be willing to declare the individual’s name a match while declining to pair the two different animals.</p>
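<p>Both of those distances are easy to verify with <code>adist</code>:</p>
<pre class="r"><code>adist("CAT", "DOG")                      # 3: three substitutions
adist("DERE, STEPHEN", "DEERE, STEVEN")  # 3: one insertion, one deletion, one substitution</code></pre>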
</div>
<div id="simulating-lists-of-names-with-errors" class="section level3">
<h3>Simulating lists of names with errors</h3>
<p>To test out our fuzzy matching process, we need to be able to create a master list of names from which we can create two sub-lists: (1) the cohort list of nursing home residents with correct name spellings, and (2) the list of PCR records that includes multiple records (test results) per individual, with possible inconsistent name spellings across the different tests for a specific person.</p>
<div id="generating-names" class="section level4">
<h4>Generating names</h4>
<p>The master list can easily be generated using the <code>randomNames</code> function in the R package <a href="https://centerforassessment.github.io/randomNames/" target="_blank">randomNames</a>. A call to this function provides samples of names from a large-scale database. (It provides gender and ethnic variation if you need it.)</p>
<pre class="r"><code>library(data.table)
library(randomNames)
set.seed(6251)
randomNames(4)</code></pre>
<pre><code>## [1] "Hale, James" "el-Qazi, Najeema" "Sourn, Raj" "Jensen, Tia"</code></pre>
</div>
<div id="generating-errors" class="section level4">
<h4>Generating errors</h4>
<p>To facilitate the generation of spelling errors, I’ve created a function that takes a string, a specified number of substitutions, a number of insertions (if negative then these are deletions), and an indicator that flips the order of the names (typically “Last Name, First Name”):</p>
<pre class="r"><code>mis_string <- function(name, subs = 1, ins = 0, flip = FALSE) {

  names <- trimws(unlist(strsplit(name, split = ",")))

  if (subs) {
    for (i in 1 : subs) {
      change <- sample(1:2, 1)
      ii <- sample(nchar(names[change]), 1)
      l <- substr(names[change], ii, ii)
      s <- sample(letters[letters != l], 1)
      names[change] <- sub(l, s, names[change])
    }
  }

  if (ins > 0) {
    for (i in 1 : ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii - 1) <- sample(letters, 1)
    }
  }

  if (ins < 0) {
    for (i in 1 : -ins) {
      change <- sample(c(1, 2), 1)
      ii <- sample(nchar(names[change]), 1)
      stringi::stri_sub(names[change], ii, ii) <- ""
    }
  }

  paste(names[flip + 1], names[2 - flip], sep = ", ")
}</code></pre>
<p>Here are two applications of <code>mis_string</code> on the name “Vazquez Topete, Deyanira”:</p>
<pre class="r"><code>mis_string("Vazquez Topete, Deyanira", subs = 2, ins = 2)</code></pre>
<pre><code>## [1] "Vazhquiez Topete, Dmyanika"</code></pre>
<pre class="r"><code>mis_string("Vazquez Topete, Deyanira", subs = 1, ins = -2, flip = TRUE)</code></pre>
<pre><code>## [1] "Deynira, uazquez Topet"</code></pre>
</div>
<div id="master-list-definitions" class="section level4">
<h4>Master list definitions</h4>
<p>To generate the master list we define (using <code>simstudy</code>) a set of key parameters: an indicator <code>pcr</code> identifying whether the person has at least one test (70% will have a test), an indicator <code>resident</code> identifying whether the person is part of our resident cohort (20% of those on the master list will be residents), and a variable for the number of tests an individual has (conditional on having at least 1 test). Some people on the master list will have neither a test nor membership in the resident cohort; these patients are removed from the master list.</p>
<pre class="r"><code>library(simstudy)
def_n <- defDataAdd(varname = "pcr", formula = 0.7, dist="binary")
def_n <- defDataAdd(def_n, varname = "resident", formula = 0.2, dist="binary")
def_n <- defDataAdd(def_n, varname = "n_tests", formula = 3, dist="noZeroPoisson")</code></pre>
</div>
<div id="pcr-list-error-definitions" class="section level4">
<h4>PCR list error definitions</h4>
<p>Each person with a PCR test will have one or more records in the PCR list. The following set of definitions indicates the number of substitutions and insertions (both specified as categorical integer variables) as well as whether the first and last names should be flipped.</p>
<pre class="r"><code>def_p <- defDataAdd(varname = "subs", formula = ".8;.15;.04;.01", dist = "categorical")
def_p <- defDataAdd(def_p, varname = "ins",
  formula = "0.02;0.05;0.86;0.05;0.02", dist = "categorical")
def_p <- defDataAdd(def_p, varname = "flip", formula = 0.10, dist = "binary")</code></pre>
</div>
<div id="generating-the-data" class="section level4">
<h4>Generating the data</h4>
<p>In this simulation I am generating 50 possible names:</p>
<pre class="r"><code>set.seed(3695)
n <- 50
d_master <- data.table(id = 1:n, name = randomNames(n))
d_master <- addColumns(def_n, d_master)
d_master <- d_master[(pcr | resident)]
head(d_master)</code></pre>
<pre><code>## id name pcr resident n_tests
## 1: 1 Maas, Synneva 1 0 3
## 2: 2 Rock, Alyssa 1 0 3
## 3: 3 Lee, August 1 0 1
## 4: 4 Keefe, Dylan 1 0 1
## 5: 5 Yang, An 1 0 1
## 6: 6 Andrew, Crysta 1 0 3</code></pre>
<p>In this case, there are 7 individuals in the resident cohort and 40 individuals with at least one PCR test; 5 residents were tested:</p>
<pre class="r"><code>d_master[, .(
  num_res = sum(resident),
  num_pcr = sum(pcr),
  num_both = sum( (resident & pcr) )
)]</code></pre>
<pre><code>## num_res num_pcr num_both
## 1: 7 40 5</code></pre>
<p>The PCR list will have 110 total tests for the 40 people with tests.</p>
<pre class="r"><code>d_pcr <- genCluster(d_master[pcr == 1], "id", "n_tests", "pcr_id")
d_pcr <- addColumns(def_p, d_pcr)
d_pcr[, subs := subs - 1]
d_pcr[, ins := ins - 3]
d_pcr[, obs_name := mapply(mis_string, name, subs, ins, flip)]
d_pcr[, .(pcr_id, id, name, obs_name, subs, ins, flip)]</code></pre>
<pre><code>## pcr_id id name obs_name subs ins flip
## 1: 1 1 Maas, Synneva Sycnfvq, Maas 3 0 1
## 2: 2 1 Maas, Synneva Synneva, Maas 0 0 1
## 3: 3 1 Maas, Synneva Maas, Sznneva 1 0 0
## 4: 4 2 Rock, Alyssa Rock, Alyssa 0 0 0
## 5: 5 2 Rock, Alyssa Ropk, Alyssa 1 0 0
## ---
## 106: 106 48 Wall, Sebastian Wall, Sebtastian 0 1 0
## 107: 107 48 Wall, Sebastian Wall, kebastian 1 0 0
## 108: 108 48 Wall, Sebastian wall, Sebastian 1 0 0
## 109: 109 49 Tafoya, April Tafoya, April 0 0 0
## 110: 110 49 Tafoya, April Tafoya, April 0 0 0</code></pre>
<p>We end up with two lists - one with the residents only, and one with the PCR tests. This mimics the actual data we might get from our flawed health data systems.</p>
<pre class="r"><code>d_res <- d_master[resident == 1, .(id, name)]
d_pcr <- d_pcr[, .(pcr_id, id, name = obs_name, pcr)]</code></pre>
</div>
<div id="the-truth" class="section level4">
<h4>The truth</h4>
<p>Before proceeding to the matching, here are the PCR test records for the residents. This is the correct match that we hope to recover.</p>
<pre class="r"><code>d_pcr[ id %in% d_res$id]</code></pre>
<pre><code>## pcr_id id name pcr
## 1: 57 26 Diaper, nody 1
## 2: 58 26 Draper, Cody 1
## 3: 59 26 Cody, Drapeyr 1
## 4: 60 27 al-Naqvi, Qamraaa 1
## 5: 61 27 al-Naqvi, Qamraaa 1
## 6: 64 29 el-Hallal, Zahraaa 1
## 7: 65 29 Zahraaa, el-Hallal 1
## 8: 86 42 Allen, Jalyn 1
## 9: 87 42 llen, Jalyn 1
## 10: 88 42 Allen, Jalyn 1
## 11: 102 47 Sanandres, Bzandon 1
## 12: 103 47 Sananores, Brandon 1
## 13: 104 47 Sanandues, Brandon 1
## 14: 105 47 Sanandres, Brandon 1</code></pre>
</div>
</div>
<div id="fuzzy-matching-of-simulated-data" class="section level3">
<h3>Fuzzy matching of simulated data</h3>
<p>The fuzzy matching is quite simple (and I’ve simplified even more by ignoring the possibility that the data have been flipped). The first step is to merge each PCR row with each resident name, which in this case will result in <span class="math inline">\(7 \times 110 = 770\)</span> rows. The idea is that we will be comparing each of the names from the PCR tests with each of the resident names. In the merged data table <code>dd</code>, <code>x</code> is the resident name, and <code>name</code> is the PCR test list name.</p>
<pre class="r"><code>dd <- data.table(merge(d_res$name, d_pcr))
dd</code></pre>
<pre><code>## x pcr_id id name pcr
## 1: Korenek, Tara 1 1 Sycnfvq, Maas 1
## 2: Draper, Cody 1 1 Sycnfvq, Maas 1
## 3: al-Naqvi, Qamraaa 1 1 Sycnfvq, Maas 1
## 4: el-Hallal, Zahraaa 1 1 Sycnfvq, Maas 1
## 5: Slee Ackerson, Jeremiah 1 1 Sycnfvq, Maas 1
## ---
## 766: al-Naqvi, Qamraaa 110 49 Tafoya, April 1
## 767: el-Hallal, Zahraaa 110 49 Tafoya, April 1
## 768: Slee Ackerson, Jeremiah 110 49 Tafoya, April 1
## 769: Allen, Jalyn 110 49 Tafoya, April 1
## 770: Sanandres, Brandon 110 49 Tafoya, April 1</code></pre>
<p>Next, we calculate the string distance for each pair of strings in <code>dd</code>:</p>
<pre class="r"><code>dd[, pid := .I]
dd[, dist := adist(x, name), keyby = pid]
dd</code></pre>
<pre><code>## x pcr_id id name pcr pid dist
## 1: Korenek, Tara 1 1 Sycnfvq, Maas 1 1 10
## 2: Draper, Cody 1 1 Sycnfvq, Maas 1 2 11
## 3: al-Naqvi, Qamraaa 1 1 Sycnfvq, Maas 1 3 12
## 4: el-Hallal, Zahraaa 1 1 Sycnfvq, Maas 1 4 14
## 5: Slee Ackerson, Jeremiah 1 1 Sycnfvq, Maas 1 5 18
## ---
## 766: al-Naqvi, Qamraaa 110 49 Tafoya, April 1 766 13
## 767: el-Hallal, Zahraaa 110 49 Tafoya, April 1 767 14
## 768: Slee Ackerson, Jeremiah 110 49 Tafoya, April 1 768 19
## 769: Allen, Jalyn 110 49 Tafoya, April 1 769 11
## 770: Sanandres, Brandon 110 49 Tafoya, April 1 770 15</code></pre>
<p>We can refine the matched list of <span class="math inline">\(770\)</span> pairs to include only those that differ by 3 or fewer transformations, and can even create a score based on the distances, where a score of 100 represents a perfect match. This refined list can then be reviewed manually to make a final determination in case there are any false matches.</p>
<pre class="r"><code>dd <- dd[dist <= 3,]
dd[, score := 100 - 5*dist]
dd[, .(staff_name = x, pcr_name = name, pcr_id, pcr, pid, score)]</code></pre>
<pre><code>## staff_name pcr_name pcr_id pcr pid score
## 1: Draper, Cody Diaper, nody 57 1 394 90
## 2: Draper, Cody Draper, Cody 58 1 401 100
## 3: al-Naqvi, Qamraaa al-Naqvi, Qamraaa 60 1 416 100
## 4: al-Naqvi, Qamraaa al-Naqvi, Qamraaa 61 1 423 100
## 5: el-Hallal, Zahraaa el-Hallal, Zahraaa 64 1 445 100
## 6: Allen, Jalyn Allen, Jalyn 86 1 601 100
## 7: Allen, Jalyn llen, Jalyn 87 1 608 95
## 8: Allen, Jalyn Allen, Jalyn 88 1 615 100
## 9: Sanandres, Brandon Sanandres, Bzandon 102 1 714 95
## 10: Sanandres, Brandon Sananores, Brandon 103 1 721 95
## 11: Sanandres, Brandon Sanandues, Brandon 104 1 728 95
## 12: Sanandres, Brandon Sanandres, Brandon 105 1 735 100</code></pre>
<p>We did pretty well, identifying 12 of the 14 resident records in the PCR data. The two we missed were the result of flipped names.</p>
<p>In practice, a relatively close distance is not necessarily a good match. For example <strong>SMITH, MARY</strong> and <strong>SMITH, JANE</strong> are only separated by three letter substitutions, but they are most likely not the same person. We could minimize this problem if we have additional fields to match on, such as date of birth. This would even allow us to <em>increase</em> the string distance we are willing to accept for a possible match without increasing the amount of manual inspection required.</p>
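<p>As a purely hypothetical sketch of that idea - the simulated lists above do not actually carry a date of birth - the refinement step might combine a looser name-distance cutoff with a requirement that the dates of birth agree:</p>
<pre class="r"><code># hypothetical refinement: accept a larger name distance, but only when the
# date-of-birth fields agree (dob_res and dob_pcr are assumed here, not
# generated in the simulation above)
dd <- dd[dist <= 6 & dob_res == dob_pcr]
dd[, score := 100 - 5*dist]</code></pre>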
</div>
The case of three MAR mechanisms: when is multiple imputation mandatory?
https://www.rdatagen.net/post/2021-03-30-some-cases-where-imputing-missing-data-matters/
Tue, 30 Mar 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-03-30-some-cases-where-imputing-missing-data-matters/
<script src="https://www.rdatagen.net/post/2021-03-30-some-cases-where-imputing-missing-data-matters/index.en_files/header-attrs/header-attrs.js"></script>
<p>I thought I’d written about this before, but I searched through my posts and I couldn’t find what I was looking for. If I am repeating myself, my apologies. I <a href="https://www.rdatagen.net/post/musings-on-missing-data/" target="_blank">explored</a> missing data two years ago, using directed acyclic graphs (DAGs) to help understand the various missing data mechanisms (MAR, MCAR, and MNAR). The DAGs provide insight into when it is appropriate to use observed data to get unbiased estimates of population quantities even though some of the observations are missing information.</p>
<p>In that original post, I mentioned I might have more to say at some point in the future. Well, two years later I am again thinking about missing data, this time in the context of an ongoing randomized controlled trial. The research team has been discussing various ways to address potential biases that missing information might be introducing into the analysis. The group has decided that we need to use imputation to fill in the missing data, but I wanted to be clear why this added step is called for. After all, it is quite well known that imputation may not be necessary even when data are missing (see this <a href="https://statisticalhorizons.com/ml-is-better-than-mi" target="_blank">post</a>, for example).</p>
<p>I’ve created three scenarios with data missing at random (MAR), where the probability of missingness is a function of observed data. In the first scenario, the treatment effect can surprisingly be estimated simply by comparing the means, no adjustment or imputation needed. In the second case, comparing the means directly is not appropriate, but adjustment for the predictor of missingness is sufficient; again, no imputation is needed. And in the third case, neither a simple comparison nor a modeling adjustment does the trick; imputation is mandatory.</p>
<div id="a-little-background-for-context" class="section level3">
<h3>A little background for context</h3>
<p>The actual RCT is considerably more complicated than I am describing here, but this is the general idea. Individuals are randomized to one of two study arms <span class="math inline">\(A\)</span>, where <span class="math inline">\(A_i=1\)</span> if patient <span class="math inline">\(i\)</span> is in the treatment arm, and <span class="math inline">\(A_i = 0\)</span> if the patient is in the control arm. We measure the outcome <span class="math inline">\(Y\)</span> at two time points, so we have <span class="math inline">\(Y_1\)</span> and <span class="math inline">\(Y_2\)</span>; our primary interest, however, is <span class="math inline">\(Y_2\)</span>. We measure a key covariate <span class="math inline">\(X\)</span> that influences both <span class="math inline">\(Y_1\)</span> and <span class="math inline">\(Y_2\)</span>. This is the true underlying DAG:</p>
<p><img src="img/study_DAG.png" style="width:40.0%" /></p>
<p>The challenge is that, for some patients, the second measurement <span class="math inline">\(Y_2\)</span> is missing, and we believe that <span class="math inline">\(Y_1\)</span> is a good predictor of the missingness pattern. But before getting into this (which is Case #3), I’ll start with a simpler scenario.</p>
</div>
<div id="case-1" class="section level3">
<h3>Case #1</h3>
<p>In the first scenario, there is only a single outcome measurement <span class="math inline">\(Y\)</span>, and we have measured <span class="math inline">\(X\)</span>. The simplified DAG looks like this:</p>
<p><img src="img/study_simple_DAG.png" style="width:30.0%" /></p>
<p>Unfortunately, we’ve only been able to collect the outcome measurement <span class="math inline">\(Y\)</span> for a subset of the sample, so that the observed <span class="math inline">\(Y^*\)</span> includes missing values for some subjects. The missing data mechanism is MAR, because the level of the observed baseline covariate <span class="math inline">\(X\)</span> determines the probability of observing <span class="math inline">\(Y\)</span>. The indicator <span class="math inline">\(R_y = 1\)</span> when we do observe <span class="math inline">\(Y\)</span> and <span class="math inline">\(R_y = 0\)</span> when we do not.</p>
<p><img src="img/MAR_1_DAG.png" style="width:35.0%" /></p>
<p>I’ll go ahead and simulate data based on this first DAG. In case you’d like to replicate, here are the libraries necessary for the simulations:</p>
<pre class="r"><code>library(simstudy)
library(ggplot2)
library(broom)
library(data.table)
library(mice)</code></pre>
<p>The data definitions establish the relationship between <span class="math inline">\(A\)</span>, <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> (the treatment effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(Y\)</span> is 2.5) as well as create a missingness mechanism for <span class="math inline">\(Y\)</span> that is a function of <span class="math inline">\(X\)</span>; subjects with higher values of <span class="math inline">\(X\)</span> are more likely to have missing outcome measurements.</p>
<pre class="r"><code>def1 <- defData(varname = "x", formula=0.5, dist = "binary")
def2 <- defDataAdd(varname = "y", formula = "5 + 5*x + 2.5*a", variance = 2)
defm <- defMiss(varname = "y", formula = "-3.5 + 2.3*x", logit.link = TRUE)</code></pre>
<p>To generate the observed data with missing values, we first generate a complete data set (based on the data definitions), and then we generate a missing data matrix, which finally gives us the observed data set that includes <span class="math inline">\(\text{NA}\)</span>’s for about 13% of the <span class="math inline">\(Y\)</span>’s.</p>
<pre class="r"><code>set.seed(17236)
dd <- genData(500, def1)
dd <- trtAssign(dd, grpName = "a")
dd <- addColumns(def2, dd)
ddmiss <- genMiss(dd, defm, id = "id")
ddobs <- genObs(dd, ddmiss, id = "id")
ddobs</code></pre>
<pre><code>##       id x a    y
##   1:   1 0 0  6.1
##   2:   2 1 0  9.2
##   3:   3 1 1 11.6
##   4:   4 0 0  4.5
##   5:   5 1 1   NA
##  ---
## 496: 496 0 0  5.8
## 497: 497 0 1  7.3
## 498: 498 0 1  6.9
## 499: 499 1 1 11.1
## 500: 500 1 0 10.0</code></pre>
<p>Using the full data set <code>dd</code> (without any missing data), we can get a point estimate of the treatment effect <span class="math inline">\(\delta\)</span> merely by calculating</p>
<p><span class="math display">\[\hat{\delta} = \bar{Y}_{a=1} - \bar{Y}_{a=0}\]</span></p>
<pre class="r"><code>dd[, .(avg = mean(y)), keyby = a][ , avg - shift(avg)][2]</code></pre>
<pre><code>## [1] 2.5</code></pre>
<p>There is no reason to believe that the observed data means are the same as the complete data set means. That is, it is not likely that <span class="math inline">\(\bar{Y^*}_{a=1}\)</span> = <span class="math inline">\(\bar{Y}_{a=1}\)</span> or <span class="math inline">\(\bar{Y^*}_{a=0}\)</span> = <span class="math inline">\(\bar{Y}_{a=0}\)</span>. Observations with higher values of <span class="math inline">\(X\)</span> (and thus higher values of <span class="math inline">\(Y\)</span>) are more likely to have missing <span class="math inline">\(Y\)</span>’s, so the average observed values in both treatment groups should be lower. This seems to be the case here:</p>
<pre class="r"><code>dd[, .(avg = mean(y)), keyby = a]</code></pre>
<pre><code>##    a  avg
## 1: 0  7.5
## 2: 1 10.0</code></pre>
<pre class="r"><code>ddobs[, .(avg = mean(y, na.rm = TRUE)), keyby = a]</code></pre>
<pre><code>##    a avg
## 1: 0 7.2
## 2: 1 9.7</code></pre>
<p>In the real world, we can only estimate the treatment effect <span class="math inline">\(\delta^*\)</span> with the data that we have:
<span class="math display">\[\hat{\delta}^* = \bar{Y}_{a=1}^* - \bar{Y}_{a=0}^*\]</span></p>
<p>It looks like, in this case at least, the biases in the estimates of the means are in the same direction, so that the estimate of the treatment effect based on the <em>difference</em> of means in the observed data is unbiased:</p>
<pre class="r"><code>ddobs[!is.na(y), .(avg = mean(y)), keyby = a][ , avg - shift(avg)][2] </code></pre>
<pre><code>## [1] 2.5</code></pre>
<p>If this is the case more generally for data sets generated using this mechanism, we may not need to worry at all about the missing data mechanism; even though we know it is MAR, we might be able to treat it as MCAR and just use the observed measurements, without any adjustment or imputation.</p>
<p>Simulating 2500 data sets using the steps outlined above provides insight into the nature of the bias. (I’ve provided generic code for generating repeated data sets in the <a href="#addendum">addendum</a>.) The estimates based on the complete data sets are shown on the <span class="math inline">\(x\)</span> axis, and the observed data estimates are on the <span class="math inline">\(y\)</span> axis. The dotted lines show the average of the estimates for the complete and observed data sets, respectively.</p>
<p>For both treatment arms, the average estimate from the complete data sets is centered around the true value (used in the data generation process). As expected (since higher values of Y are more likely to be missing), the average estimate for each arm is biased downwards when we do not take the missingness into consideration.</p>
<p><img src="img/MAR_1_y.png" style="width:70.0%" /></p>
<p>However, the bias is removed when we consider the treatment effect, which is our primary interest. In this (perhaps overly) simplistic scenario, there is no price to pay when ignoring the missing data. Both estimates are centered around 2.5, the true value.</p>
<p><img src="img/MAR_1_diff.png" style="width:40.0%" /></p>
</div>
<div id="case-2" class="section level3">
<h3>Case #2</h3>
<p>The second example differs from the first only in one respect: the size of the intervention effect depends on the baseline covariate <span class="math inline">\(X\)</span> (the line drawn from <span class="math inline">\(X\)</span> to the arrow connecting <span class="math inline">\(A\)</span> and <span class="math inline">\(Y\)</span> represents this effect modification).</p>
<p><img src="img/MAR_2_DAG.png" style="width:35.0%" /></p>
<p>In the example, <span class="math inline">\(\delta_0 = 1\)</span> for the sub-population with <span class="math inline">\(X = 0\)</span>, and <span class="math inline">\(\delta_1 = 4\)</span> for the sub-population with <span class="math inline">\(X = 1\)</span>. If the population were evenly distributed between <span class="math inline">\(X=0\)</span> and <span class="math inline">\(X=1\)</span>, then we would observe an overall effect <span class="math inline">\(\delta = 2.5\)</span>.</p>
<pre class="r"><code>d1 <- defData(varname = "x", formula=0.5, dist = "binary")
d2 <- defDataAdd(varname = "y", formula = "6 + 1*a + 2*x + 3*a*x", variance = 2)
dm <- defMiss(varname = "y", formula = "-3.5 + 2.3*x", logit.link = TRUE)</code></pre>
<p>But this time around, if we go ahead and naïvely estimate <span class="math inline">\(\delta^* = \bar{Y}_{a=1}^* - \bar{Y}_{a=0}^*\)</span>, the estimate will be biased.</p>
<p><img src="img/MAR_2_diff.png" style="width:40.0%" /></p>
<p>The reason for this bias is that the mix of <span class="math inline">\(X\)</span> in the observed sample is different from that in the complete sample (and population); since <span class="math inline">\(X\)</span> influences the effect size, this change impacts the overall unadjusted estimate. In the complete data set <span class="math inline">\(P(X=1) = 0.50\)</span>, but in an observed data set with missing values <span class="math inline">\(P^*(X=1) = 0.44\)</span> (this can be confirmed using the assumptions of the data generation process, but I’ll leave that as an exercise if you’d like). The population average treatment effect is <span class="math inline">\(P(X=0) \times 1 + P(X=1) \times 4 = 2.5\)</span>, while in the data set with missing data it is <span class="math inline">\(P^*(X=0) \times 1 + P^*(X=1) \times 4 = 0.56 \times 1 + 0.44 \times 4 = 2.3\)</span>.</p>
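<p>If you don’t feel like doing the exercise by hand, the arithmetic can be checked directly with a few lines of base R, working only from the logistic coefficients in the missingness definition:</p>
<pre class="r"><code># probability of observing y at each level of x, from the missingness
# model logit P(missing | x) = -3.5 + 2.3 * x
p_obs <- 1 - plogis(-3.5 + 2.3 * c(0, 1))

# distribution of x among the observed, starting from P(x = 1) = 0.5
p_star_x1 <- (0.5 * p_obs[2]) / (0.5 * p_obs[1] + 0.5 * p_obs[2])
round(p_star_x1, 2)  # 0.44

# implied unadjusted treatment effect in the observed data
eff_star <- (1 - p_star_x1) * 1 + p_star_x1 * 4
round(eff_star, 1)   # 2.3</code></pre>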
<p>We can still estimate the treatment effect if we adjust for <span class="math inline">\(X\)</span> in a regression model, or just take the difference in means within each level of <span class="math inline">\(X\)</span>. These estimates are unbiased:</p>
<p><img src="img/MAR_2_adj.png" style="width:70.0%" /></p>
<p>If we want to recover the population average treatment effect, we can reweight the group-level treatment effects by the distribution of <span class="math inline">\(X\)</span> in the complete sample (since <span class="math inline">\(X\)</span> is fully observed). No imputation is needed.</p>
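<p>Here is a rough sketch of that reweighting. To keep the snippet self-contained, the Case #2-style data are generated with base R rather than <code>simstudy</code>, and the sample size and seed are arbitrary:</p>
<pre class="r"><code>set.seed(2716)

n <- 10000
x <- rbinom(n, 1, 0.5)    # fully observed covariate
a <- rbinom(n, 1, 0.5)    # treatment assignment
y <- 6 + 1 * a + 2 * x + 3 * a * x + rnorm(n, 0, sqrt(2))

# MAR missingness: higher x means y is more likely to be missing
r <- rbinom(n, 1, 1 - plogis(-3.5 + 2.3 * x))   # r = 1 when y is observed
dobs <- data.frame(x, a, y)[r == 1, ]

# adjust for x with an a-by-x interaction, using the observed rows only
fit <- lm(y ~ a * x, data = dobs)
delta_x <- coef(fit)["a"] + coef(fit)["a:x"] * c(0, 1)  # effects at x = 0 and x = 1

# reweight by the full-sample distribution of x to recover the ATE of 2.5
ate <- sum(delta_x * c(mean(x == 0), mean(x == 1)))</code></pre>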
</div>
<div id="case-3" class="section level3">
<h3>Case #3</h3>
<p>Now we are back to the original motivating scenario. The missing data mechanism is depicted in the next DAG. Those with higher scores in the first period are more likely to have missing values in the second time period, perhaps because they have improved sufficiently and no longer feel like participating in the study.</p>
<p><img src="img/MAR_3_DAG.png" style="width:40.0%" /></p>
<p>The DAG is implemented with these definitions:</p>
<pre class="r"><code>def1 <- defData(varname = "x", formula=0.5, dist = "binary")
def2 <- defDataAdd(varname = "y1", formula = "5 + a*2.5 + 5*x", variance = 2)
def2 <- defDataAdd(def2, "y2", formula = "1 + y1 + 5*x", variance = 2)
defm <- defMiss(varname = "y2", formula = "-4.5 + 0.3*y1", logit.link = TRUE)</code></pre>
<p>In this case, simply comparing the means in the data sets with missing data provides a biased estimate - we can see this on the left; the argument is similar to the one I made in the previous scenario. If we opt to control for <span class="math inline">\(Y_1\)</span>, we introduce all sorts of biases, as <span class="math inline">\(Y_1\)</span> is a mediator between <span class="math inline">\(A\)</span> and <span class="math inline">\(Y_2\)</span>, as well as a collider. (I’ve written about the dangers of controlling for post-intervention variables <a href="https://www.rdatagen.net/post/be-careful/" target="_blank">here</a> and the need to be careful with colliders <a href="https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/" target="_blank">here</a>.)</p>
<p><img src="img/MAR_3_trt.png" style="width:70.0%" /></p>
<p>Since neither a simple comparison of means nor an adjusted model will suffice here, our only option is to use multiple imputation, which in <code>R</code> can be implemented with the package <a href="https://amices.org/mice/" target="_blank">mice</a>. Below, I am showing code that generates 20 imputed data sets, fits models for each of them, and pools the results to provide a single estimate and measure of uncertainty.</p>
<pre class="r"><code>library(mice)
imp <- mice(ddobs[,-"id"], m=20, maxit=5, print=FALSE)
fit <- with(imp, lm(y2 ~ a))
results <- summary(pool(fit))</code></pre>
<p>Multiple imputation has been applied to the same 2500 data sets with missing data that are represented in the biased estimate plots. The plot below shows a pretty strong correlation with the estimates from the full data model, and both sets of estimates are centered at the true population effect of 2.5.</p>
<p><img src="img/MAR_3_imp.png" style="width:40.0%" /></p>
<p>The takeaway from all this is that while multiple imputation is not always necessary, if you think there are potentially unmeasured confounders or post-intervention measures that are conceivably in the mix, a multiple imputation approach might be wiser than trying to adjust your way out of the problem.</p>
<p>I plan on implementing a Bayesian model that treats the missing data as parameters. If I can get that working, I will share it here, of course.</p>
<p><br /></p>
<p><a name="addendum"></a></p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>In case you’d like to play around with other scenarios, I’m including the code that will allow you to repeatedly sample data sets. Just provide your own data definitions.</p>
<pre class="r"><code>library(parallel)

s_generate <- function(n) {
  
  dd <- genData(n, d1)
  dd <- trtAssign(dd, grpName = "a")
  dd <- addColumns(d2, dd)
  
  dmiss <- genMiss(dd, dm, id = "id")
  dobs <- genObs(dd, dmiss, id = "id")
  
  return(list(dd, dobs))
}

s_replicate <- function(n) {
  
  dsets <- s_generate(n)
  
  diff.complete <- dsets[[1]][, .(avg = mean(y2)), keyby = a][, avg - shift(avg)][2]
  diff.obs <- dsets[[2]][!is.na(y2), .(avg = mean(y2)), keyby = a][, avg - shift(avg)][2]
  
  est.complete <- coef(lm(y2 ~ a, data = dsets[[1]]))["a"]
  est.obs <- coef(lm(y2 ~ y1 + a, data = dsets[[2]]))["a"]
  
  imp <- mice(dsets[[2]][, -"id"], m = 20, maxit = 5, print = FALSE)
  fit <- with(imp, lm(y2 ~ a))
  pooled.ests <- summary(pool(fit))
  est.impute <- pooled.ests$estimate[2]
  
  return(data.table(diff.complete, est.complete, diff.obs, est.obs, est.impute))
}

results <- rbindlist(mclapply(1:2500, function(x) s_replicate(300), mc.cores = 4))</code></pre>
</div>
Framework for power analysis using simulation
https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/
Tue, 16 Mar 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/
<script src="https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/index.en_files/header-attrs/header-attrs.js"></script>
<p>The <a href="https://kgoldfeld.github.io/simstudy/index.html" target="_blank">simstudy</a> package started as a collection of functions I developed as I found myself repeating many of the same types of simulations for different projects. It was a way of organizing my work that I decided to share with others in case they wanted a routine way to generate data as well. <code>simstudy</code> has expanded a bit from that, but replicability is still a key motivation.</p>
<p>What I have here is another attempt to document and organize a process that I find myself doing quite often - repeated data generation and model fitting. Whether I am conducting a power analysis using simulation or exploring operating characteristics of different models, I take a pretty similar approach. I refer to this structure when I am starting a new project, so I thought it would be nice to have it easily accessible online - and that way others might be able to refer to it as well.</p>
<div id="the-framework" class="section level3">
<h3>The framework</h3>
<p>I will provide a simple application below, but first I’ll show the general structure. The basic idea is that we want to generate data under a variety of assumptions - for example, a power analysis will assume different sample sizes, effects, and/or levels of variation - and for <em>each set of assumptions</em>, we want to generate a large number of replications to mimic repeated sampling from a population. The key elements of the process include (1) <em>defining</em> the data, (2) <em>generating</em> a data set, (3) <em>fitting a model</em> to the data, and (4) <em>providing summary statistics</em>.</p>
<p>If you have familiarity with <code>simstudy</code>, I’d say the code is pretty self-explanatory. In the function <code>s_generate</code>, there is a call to base R function <code>list2env</code>, which makes all elements of a list available as variables in the function’s environment. The replication process is managed by the <code>mclapply</code> function from the <code>parallel</code> package. (Alternative approaches include using function <code>lapply</code> in base R or using a <em>for</em> loop.)</p>
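<p>To see what <code>list2env</code> is doing in isolation, here is a tiny illustration (the names <code>nsites</code> and <code>delta</code> are just placeholders):</p>
<pre class="r"><code># list2env promotes the elements of a list (or named vector) to
# variables in the function's local environment
s_demo <- function(argsvec) {
  list2env(as.list(argsvec), envir = environment())
  nsites * delta  # nsites and delta now exist as local variables
}

s_demo(c(nsites = 40, delta = 0.5))  # returns 20</code></pre>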
<pre class="r"><code>s_define <- function() {
  
  #--- add data definition code ---#
  
  return(list_of_defs) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- add data generation code ---#
  
  return(generated_data) # generated_data is a data.table
}

s_model <- function(generated_data) {
  
  #--- add model code ---#
  
  return(model_results) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()
  
  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim,
      FUN = function(x) s_single_rep(list_of_defs, argsvec),
      mc.cores = 4)
  )
  
  #--- add summary statistics code ---#
  
  return(summary_stats) # summary_stats is a data.table
}</code></pre>
<div id="specifying-scenarios" class="section level4">
<h4>Specifying scenarios</h4>
<p>The possible values of each data generating parameter are specified as a vector. The function <code>scenario_list</code> creates all possible combinations of the values of the various parameters, so that there will be <span class="math inline">\(n_1 \times n_2 \times n_3 \times ...\)</span> scenarios, where <span class="math inline">\(n_i\)</span> is the number of possible values for parameter <span class="math inline">\(i\)</span>. Examples of parameters include sample size, effect size, and variance: really, any value that can be used in the data generation process.</p>
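<p>As a quick illustration of the scenario count, three effect sizes crossed with two variance levels would yield <span class="math inline">\(3 \times 2 = 6\)</span> scenarios; <code>expand.grid</code> and <code>asplit</code> (both base R) do the work:</p>
<pre class="r"><code># all combinations of two hypothetical parameters
argmat <- expand.grid(delta = c(0.50, 0.75, 1.00), svar = c(0.25, 0.50))
nrow(argmat)  # 6

# one vector of arguments per scenario
scenarios <- asplit(argmat, MARGIN = 1)
length(scenarios)  # 6</code></pre>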
<p>The process of data generation and model fitting is executed for each of the <span class="math inline">\(n_1 \times n_2 \times n_3 \times ...\)</span> scenarios. This can be done locally with the function <code>lapply</code>, or in a high performance computing environment with something like <code>Slurm_lapply</code> in the <code>slurmR</code> package. (I won’t provide an example of that here - let me know if you’d like to see that.)</p>
<pre class="r"><code>#---- specify varying power-related parameters ---#

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

param_1 <- c(...)
param_2 <- c(...)
param_3 <- c(...)
.
.
.

scenarios <- scenario_list(param_1 = param_1, param_2 = param_2, param_3 = param_3, ...)

#--- run locally ---#

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 1000)))</code></pre>
</div>
</div>
<div id="example-power-analysis-of-a-crt" class="section level2">
<h2>Example: power analysis of a CRT</h2>
<p>To carry out a power analysis of a cluster randomized trial, I’ll fill in the skeletal framework. In this case I am interested in understanding how estimates of power vary based on changes in effect size, between cluster/site variation, and the number of patients per site. The data definitions use <a href="https://kgoldfeld.github.io/simstudy/articles/double_dot_extension.html" target="_blank">double dot</a> notation to allow the definitions to change dynamically as we switch from one scenario to the next. We estimate a mixed effect model for each data set and keep track of the proportion of p-value estimates less than 0.05 for each scenario.</p>
<pre class="r"><code>s_define <- function() {
  
  #--- data definition code ---#
  
  def1 <- defData(varname = "site_eff", 
    formula = 0, variance = "..svar", dist = "normal", id = "site")
  def1 <- defData(def1, "npat", formula = "..npat", dist = "poisson")
  
  def2 <- defDataAdd(varname = "Y", formula = "5 + site_eff + ..delta * rx", 
    variance = 3, dist = "normal")
  
  return(list(def1 = def1, def2 = def2))
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- data generation code ---#
  
  ds <- genData(40, def1)
  ds <- trtAssign(ds, grpName = "rx")
  dd <- genCluster(ds, "site", "npat", "id")
  dd <- addColumns(def2, dd)
  
  return(dd)
}

s_model <- function(generated_data) {
  
  #--- model code ---#
  
  require(lme4)
  require(lmerTest)
  
  lmefit <- lmer(Y ~ rx + (1|site), data = generated_data)
  
  est <- summary(lmefit)$coef[2, "Estimate"]
  pval <- summary(lmefit)$coef[2, "Pr(>|t|)"]
  
  return(data.table(est, pval)) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec) {
  
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim) {
  
  list_of_defs <- s_define()
  
  model_results <- rbindlist(
    parallel::mclapply(
      X = 1 : nsim,
      FUN = function(x) s_single_rep(list_of_defs, argsvec),
      mc.cores = 4)
  )
  
  #--- summary statistics ---#
  
  power <- model_results[, mean(pval <= 0.05)]
  summary_stats <- data.table(t(argsvec), power)
  
  return(summary_stats) # summary_stats is a data.table
}</code></pre>
<pre class="r"><code>scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

delta <- c(0.50, 0.75, 1.00)
svar <- c(0.25, 0.50)
npat <- c(8, 16)

scenarios <- scenario_list(delta = delta, svar = svar, npat = npat)

#--- run locally ---#

summary_stats <- rbindlist(lapply(scenarios, function(a) s_replicate(a, nsim = 250)))</code></pre>
<p>The overall results (in this case, the power estimate) can be reported for each scenario.</p>
<pre class="r"><code>summary_stats</code></pre>
<pre><code>##     delta svar npat power
##  1:  0.50 0.25    8 0.480
##  2:  0.75 0.25    8 0.844
##  3:  1.00 0.25    8 0.960
##  4:  0.50 0.50    8 0.368
##  5:  0.75 0.50    8 0.684
##  6:  1.00 0.50    8 0.904
##  7:  0.50 0.25   16 0.660
##  8:  0.75 0.25   16 0.940
##  9:  1.00 0.25   16 1.000
## 10:  0.50 0.50   16 0.464
## 11:  0.75 0.50   16 0.792
## 12:  1.00 0.50   16 0.956</code></pre>
<p>We can also plot the results easily to get a clearer picture. Higher between-site variation clearly reduces power, as do smaller effect sizes and smaller numbers of patients per site. None of this is surprising, but it is always nice to see things working out as expected:</p>
<p><img src="https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/index.en_files/figure-html/powerplot-1.png" width="768" /></p>
</div>
Randomization tests make fewer assumptions and seem pretty intuitive
https://www.rdatagen.net/post/2021-03-02-randomization-tests/
Tue, 02 Mar 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-03-02-randomization-tests/
<script src="https://www.rdatagen.net/post/2021-03-02-randomization-tests/index.en_files/header-attrs/header-attrs.js"></script>
<p>I’m preparing a lecture on simulation for a statistical modeling class, and I plan on describing a couple of cases where simulation is intrinsic to the analytic method rather than as a tool for exploration and planning. MCMC methods used for Bayesian estimation, bootstrapping, and randomization tests all come to mind.</p>
<p>Randomization tests are particularly interesting as an approach to conducting hypothesis tests, because they allow us to avoid making unrealistic assumptions. I’ve written about this <a href="https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/" target="_blank">before</a> under the rubric of a permutation test. The example I use here is a little different; truth be told, the real reason I’m sharing is that I came up with a nice little animation to illustrate a simple randomization process. So, even if I decide not to include it in the lecture, at least you’ve seen it.</p>
<div id="the-goal-of-a-randomization-test" class="section level3">
<h3>The goal of a randomization test</h3>
<p>In the context of an RCT with treatment and control arms where we want to compare average responses (measured by some outcome <span class="math inline">\(Y\)</span>), the standard hypothesis test is framed around a null hypothesis <span class="math inline">\(H_0: \mu_\text{t} = \mu_\text{c}\)</span>. In this framework, <span class="math inline">\(\mu_\text{t}\)</span> and <span class="math inline">\(\mu_\text{c}\)</span> are the average responses in the <em>population</em> under treatment and control, respectively. For this to be valid, we need to assume that the study sample is representative of the population of interest, having been randomly selected from it; for most RCTs, this is fairly untenable. Participants of studies are typically <em>not randomly drawn</em> from the population, but are more likely to have shown up in a particular setting, been identified for recruitment, and ultimately decided for themselves about participating.</p>
<p>The randomization test makes no assumption about whether the sample is representative. Rather, it asks a question that is limited to the sample at hand, regardless of how it was collected or created. The null hypothesis in the randomization test is that the average response <em>in the sample</em> is unrelated to treatment assignment. If we reject the null hypothesis and conclude treatment assignment mattered in this sample, we can decide to extend this conclusion to the population based on our understanding of the intervention and how the sample relates to this broader population. In the standard hypothesis testing framework, the leap from the sample to the population is baked in; in the context of a randomization test, the generalization is not so obvious.</p>
<p>It is also important to underscore that while the standard hypothesis test makes assumptions about normality (or at least about the validity of the Central Limit Theorem), the randomization test makes no assumptions about the underlying distribution of the outcome data.</p>
</div>
<div id="simulated-data" class="section level3">
<h3>Simulated data</h3>
<p>To make things a little more interesting, and to test the importance of the normality assumption, I’m using a data generation process that mixes two populations: Group 1, where the outcome <span class="math inline">\(Y_1 \sim N(\mu=0, \sigma^2=1)\)</span>, and Group 2, with larger mean and variance: <span class="math inline">\(Y_2 \sim N(5,4)\)</span>. The treatment effect also differs across the groups. The population (or, actually, the sample) will be comprised of 70% from Group 1 and 30% from Group 2.</p>
<p>For the purposes of creating the animation, I am generating 1000 observations in total, randomizing 500 to each arm:</p>
<pre class="r"><code>library(simstudy)

d1 <- defDataAdd(varname = "Y_1", formula = "0 + 2 * rx", 
  variance = 1, dist = "normal")
d1 <- defDataAdd(d1, varname = "Y_2", formula = "5 + 1 * rx", 
  variance = 4, dist = "normal")
d1 <- defDataAdd(d1, varname = "Y", 
  formula = "Y_1 | 0.7 + Y_2 | 0.3", dist = "mixture")

set.seed(11778)

dd <- genData(1000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)</code></pre>
<p>The histogram of these data makes it quite clear that the data are <em>not</em> normally distributed:</p>
<p><img src="https://www.rdatagen.net/post/2021-03-02-randomization-tests/index.en_files/figure-html/distY-1.png" width="480" /></p>
</div>
<div id="randomization" class="section level3">
<h3>Randomization</h3>
<p>In the simple case of a two-arm trial, the randomization test is quite simple: we repeatedly assign randomly generated alternate treatment arm labels to each of the observations, and calculate the test statistic following each iteration. In this case, the test statistic is the difference in means <span class="math inline">\(\Delta_\text{obs} = \bar{Y}_t - \bar{Y}_c\)</span>.</p>
<p>In a more complex situation, where the data have an underlying structure, such as clustering or block randomization, we have to make sure that the re-randomization does not violate that structure. For example, in the case of a cluster randomized trial where all individuals within the cluster are in the same intervention arm, the null hypothesis is that cluster-level treatment has no impact, so we would re-randomize the cluster as a whole, not the individuals.</p>
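<p>A minimal sketch of what that cluster-level re-randomization might look like (the six-cluster data set here is purely hypothetical):</p>
<pre class="r"><code>set.seed(237)

# 6 clusters of 10 individuals each
dc <- data.frame(cluster = rep(1:6, each = 10), y = rnorm(60))

# permute the treatment labels across clusters, not individuals
rx_c <- sample(rep(0:1, each = 3))
dc$rx_s <- rx_c[dc$cluster]

# check: every member of a cluster shares the re-randomized label
all(tapply(dc$rx_s, dc$cluster, function(z) length(unique(z))) == 1)  # TRUE</code></pre>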
<p>Here is the animation that depicts a single iteration of the re-randomization process, starting with the original data, permuting the data, and calculating <span class="math inline">\(\Delta_i^*\)</span>, the test statistic for iteration <span class="math inline">\(i\)</span> of the procedure. In the data just generated, <span class="math inline">\(\Delta_\text{obs} = 1.9\)</span> and the re-randomized <span class="math inline">\(\Delta^* = 0.1\)</span>. (The code for the animation is in the <a href="#addendum">addendum</a>.)</p>
<pre class="r"><code>dd[, rx_s := sample(rx, replace = FALSE)]</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-03-02-randomization-tests/index.en_files/figure-html/anim1-1.gif" /><!-- --></p>
</div>
<div id="estimating-a-p-value" class="section level3">
<h3>Estimating a p-value</h3>
<p>The animation is kind of a cool way to depict a single iteration, but to estimate a distribution for <span class="math inline">\(\Delta^*\)</span> and ultimately the p-value, we need to do this repeatedly. Using 1000 observations, the p-value will be vanishingly small, so I’m creating a much smaller data set of 60 observations with an observed effect size of 1.8.</p>
<pre class="r"><code>dd <- genData(60)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)
Delta_obs <- dd[rx == 1, mean(Y)] - dd[rx == 0, mean(Y)]
Delta_obs</code></pre>
<pre><code>## [1] 1.809</code></pre>
<p>The iteration process consists of repeatedly calling a function that randomly assigns labels and returns the group differences based on these new labels. It is generally recommended to run between 500 and 1500 iterations (including the observed data set, which can be viewed as just another iteration under the null hypothesis); here I am using 1500.</p>
<pre class="r"><code>randomize <- function(dx) {
  
  rx_s <- sample(dx$rx, replace = FALSE)
  dn <- data.table(Y = dx$Y, rx = rx_s)
  
  Delta_star <- dn[rx == 1, mean(Y)] - dn[rx == 0, mean(Y)]
  Delta_star
}

Delta_stars <- sapply(1:1499, function(x) randomize(dd))</code></pre>
<p>The distribution of the <span class="math inline">\(\Delta^*\)</span>’s appears to be normally distributed even though the underlying data are not; the red line indicates the observed value, <span class="math inline">\(\Delta_\text{obs}\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2021-03-02-randomization-tests/index.en_files/figure-html/plotDelta-1.png" width="480" /></p>
<p>The p-value is estimated by comparing <span class="math inline">\(\Delta_\text{obs}\)</span> with a combined data set that includes the <span class="math inline">\(\Delta^*\)</span>’s and <span class="math inline">\(\Delta_\text{obs}\)</span>. Using <span class="math inline">\(\alpha = 0.05\)</span>, we would reject the null hypothesis and conclude that, within this sample, treatment had an effect.</p>
<pre class="r"><code>1 - mean(abs(Delta_obs) >= abs(c(Delta_obs, Delta_stars)))</code></pre>
<pre><code>## [1] 0.004667</code></pre>
</div>
<div id="operating-characteristics-of-the-randomization-test" class="section level3">
<h3>Operating characteristics of the randomization test</h3>
<p>I was very interested to see what the Type I error rate would be for this example, so I repeatedly generated data sets under the assumption of no treatment effect. For each data set, I estimated both a traditional and a randomization test p-value. I used a large number of iterations - 48,000 to be exact - to make sure my Type I error estimate converged as close as possible to the truth.</p>
<p>This would have taken days, probably close to a week, on my laptop; to get around this I used a high performance computer which I have <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">described</a> in the context of Bayesian modeling. Rather than days, it took about 2 hours. If you’d like the code for this, I’m happy to share. Using datasets with 20 observations, the standard Type I error rate was 4.5% and the error rate using the randomization test was exactly 5.0%. With 40 observations, the error rates were 4.8% and 4.9%, respectively. So it appears that, in these scenarios at least, the randomization test does a slightly better job of realizing the targeted 5% Type I error rate.</p>
<p><br /></p>
<p><a name="addendum"></a></p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The animation is created using the <code>gganimate</code> package. This is completely new to me, so I am still exploring; if you want to learn more, I’d recommend checking out the <a href="https://gganimate.com/" target="_blank">website</a>. The key element is to define a sequence of plots that represent states; <code>gganimate</code> magically creates the necessary transitions, and you can control observation times and smoothness of the transitions. The output is a <em>gif</em> file.</p>
<pre class="r"><code>library(ggplot2)
library(gganimate)
dif_in_means_orig <- round(dd[rx == 1, mean(Y)] - dd[rx == 0, mean(Y)], 1)
dif_in_means_perm <- round(dd[rx_s == 1, mean(Y)] - dd[rx_s == 0, mean(Y)], 1)
dd1 <- dd[, .(iter = 1, id=id, rx = rx, rcolor = rx, Y=Y, perm = FALSE)]
dd2 <- dd[, .(iter = 2, id=id, rx = 0.5, rcolor = 3, Y=Y, perm = FALSE)]
dd3 <- dd[, .(iter = 3, id=id, rx = 0.5, rcolor = rx_s, Y=Y, perm = TRUE)]
dd4 <- dd[, .(iter = 4, id=id, rx = rx_s, rcolor = rx_s, Y=Y, perm = TRUE)]
ddx <- rbind(dd1, dd2, dd3, dd4)
ddx[, iter := factor(iter,
labels = c(
paste0("Original data with effect size = ", dif_in_means_orig, " ..."),
"permutation ...",
"permutation ....",
paste0("... after permutation, the mean difference = ", dif_in_means_perm)))]
a <- ggplot(data = ddx, aes(x = rx, y = Y, group = id)) +
geom_point(position = position_jitter(seed = 42),
aes(color = factor(rcolor), shape = perm)) +
geom_vline(xintercept = 0.5, color = "white") +
scale_color_manual(values = c("#bbb66c", "#6c71bb", "grey80")) +
scale_shape_manual(values = c(19, 4)) +
scale_x_continuous(limits = c(-.5, 1.5), breaks = c(0, 1),
labels = c("control", "treatment")) +
theme(legend.position = "none",
panel.grid = element_blank(),
axis.title.x = element_blank(),
axis.text = element_text(size = 12),
axis.title.y = element_text(size = 14)) +
transition_states(iter, state_length = 2, transition_length = 1) +
labs(title = "{closest_state}", y="Outcome")
animate(a, duration = 15, fps = 10, height = 450, width = 350)</code></pre>
</div>
Visualizing the treatment effect with an ordinal outcome
https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/
Tue, 16 Feb 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/
<script src="https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/index.en_files/header-attrs/header-attrs.js"></script>
<p>If it’s true that many readers of a journal article focus on the abstract, figures, and tables while skimming the rest, it is particularly important to tell your story with a well-conceived graphic or two. Along with a group of collaborators, I am trying to figure out the best way to represent an ordered categorical outcome from an RCT. In this case, there are a lot of categories, so the images can get confusing. I’m sharing a few of the possibilities that I’ve tried so far, including the code.</p>
<p>The motivation for this work is a data set we don’t have yet. The ongoing <a href="https://bit.ly/3b7aZCr" target="_blank">CONTAIN</a> trial is a large multi-site trial evaluating the effectiveness of using COVID-19 convalescent plasma (CP) to treat newly hospitalized patients with COVID-19. The primary outcome is a World Health Organization (WHO) score that ranges from 0 to 10, where 0 is healthy and virus-free and 10 is death:</p>
<p><img src="img/WHO-11.png" /></p>
<p>The primary goal of the study is to show whether patients receiving CP are more likely to have better outcomes (i.e. score lower on the WHO 11-point scale) 14 days after randomization compared to those who receive a saline solution placebo.</p>
<p>The analysis will use a cumulative proportional odds model, which I have written about previously (see <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/" target="_blank">here</a>, for example). We plan on doing a Bayesian version of the model, but I won’t get into that either (I wrote about the Bayes model in the context of a related project <a href="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/" target="_blank">here</a>). Rather, I will show four options for presenting the data in a way that highlights the treatment effect (or lack thereof). Often, when I’m blogging, I’ll skip over the <code>ggplot</code> code, but that is the point here so you’ll see everything.</p>
<div id="the-data" class="section level3">
<h3>The data</h3>
<p>Since I don’t have access to the actual data, simulated data will have to suffice. The data generation process I’m using is quite simple:</p>
<ol style="list-style-type: decimal">
<li>generate 1000 individuals</li>
<li>randomize to treatment and control using a 1:1 ratio</li>
<li>specify baseline probabilities for the distribution of outcomes in the control arm</li>
<li>outcome probabilities in the treatment arm are based on an assumed log cumulative odds ratio of -0.6 (which translates to a cumulative odds ratio of 0.55).</li>
</ol>
<p>The odds we are talking about are</p>
<p><span class="math display">\[
Odds = \frac{P(WHO \ge y)}{P(WHO < y)}, \ \ y \in \{2,\dots,11\},
\]</span>
so it should be clear that lower odds are preferred.</p>
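<p>To make the effect size concrete, here is a short sketch (my own illustration, not part of the data generation below) of how a cumulative odds ratio of 0.55 shifts the baseline category probabilities onto a treatment-arm distribution, working directly on the cumulative odds scale defined above:</p>
<pre class="r"><code># control-arm category probabilities (the same baseprobs used below)
baseprobs <- c(0.06, 0.06, 0.10, 0.10, 0.10, 0.13, 0.13, 0.10, 0.10, 0.06, 0.06)

cum_p <- cumsum(baseprobs)[-11]       # P(WHO < y) for y = 2,...,11
odds_ctl <- (1 - cum_p) / cum_p       # P(WHO >= y) / P(WHO < y)
odds_trt <- odds_ctl * exp(-0.6)      # apply cumulative OR of exp(-0.6) = 0.55
cum_p_trt <- 1 / (1 + odds_trt)       # convert back to P(WHO < y)
probs_trt <- diff(c(0, cum_p_trt, 1)) # treatment-arm category probabilities

round(probs_trt, 3)  # mass shifts toward the lower (better) categories</code></pre>
<p>This mirrors (up to sign conventions on the logit scale) the adjustment that <code>genOrdCat</code> applies internally when given <code>adjVar</code>.</p>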
<p>Here is the data generation code:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
def <- defDataAdd(varname = "z", formula = "rx * -0.6", dist = "nonrandom")
set.seed(39821)
dd <- genData(1000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(def, dd)
dd <- genOrdCat(
dd,
adjVar = "z",
baseprobs = c(0.06, 0.06, 0.10, 0.10, 0.10, 0.13, 0.13, 0.10, 0.10, 0.06, 0.06),
catVar = "WHO"
)</code></pre>
</div>
<div id="summarizing-the-data" class="section level3">
<h3>Summarizing the data</h3>
<p>In the plots that follow, I’ll be using summary data: proportions and cumulative proportions of patients that fall into each category:</p>
<pre class="r"><code>dsum <- dd[, .(N = sum(.N)), keyby = .(rx, WHO)]
dsum[, rx := factor(rx, labels = c("control", "treatment"))]
dsum</code></pre>
<pre><code>## rx WHO N
## 1: control 1 27
## 2: control 2 28
## 3: control 3 48
## 4: control 4 54
## 5: control 5 52
## 6: control 6 62
## 7: control 7 63
## 8: control 8 63
## 9: control 9 40
## 10: control 10 42
## 11: control 11 21
## 12: treatment 1 50
## 13: treatment 2 46
## 14: treatment 3 78
## 15: treatment 4 63
## 16: treatment 5 64
## 17: treatment 6 62
## 18: treatment 7 53
## 19: treatment 8 27
## 20: treatment 9 33
## 21: treatment 10 13
## 22: treatment 11 11
## rx WHO N</code></pre>
<p>I’ll also need the total number of patients in each arm so that I can provide informative labels:</p>
<pre class="r"><code>dnames <- dsum[, sum(N), keyby = rx]
dnames[, legend_label := paste0(rx, " (n = ", V1, ")")]
dnames[, axis_label := paste0(rx, "\n(n = ", V1, ")")]
dnames</code></pre>
<pre><code>## rx V1 legend_label axis_label
## 1: control 500 control (n = 500) control\n(n = 500)
## 2: treatment 500 treatment (n = 500) treatment\n(n = 500)</code></pre>
</div>
<div id="proportions-by-arm" class="section level3">
<h3>Proportions by arm</h3>
<p>The first plot is quite straightforward, showing the proportion of each arm that falls in each category. This plot allows us to see right away that the treatment arm has more patients in the lower categories. While not particularly elegant, the plot makes it quite easy to gauge the relative proportions:</p>
<pre class="r"><code>dsum[, prop := N/sum(N), keyby = rx]
dsum[, legend_label := factor(rx, labels = dnames[, legend_label])]
ggplot(data = dsum, aes(x = WHO, y = prop)) +
geom_line(aes(group = legend_label, color = legend_label), size = 1) +
geom_point(aes(color = legend_label), size = 2) +
ylim(0, 0.2) +
theme(panel.grid = element_blank(),
legend.title = element_blank(),
legend.position = c(.8, .88),
legend.background = element_rect(fill = "grey92"),
legend.key = element_rect(fill = "grey92")) +
scale_color_manual(values = c("#798E87", "#C27D38"),
guide = guide_legend(reverse = TRUE)) +
scale_x_discrete(labels = c("virus-free", 1:9, "died")) +
ylab("proportion") +
xlab("WHO score")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/index.en_files/figure-html/plot1-1.png" width="480" /></p>
</div>
<div id="cumulative-proportion-by-arm" class="section level3">
<h3>Cumulative proportion by arm</h3>
<p>A slightly nicer version of the proportional line plot might be the same idea but with cumulative probabilities or proportions. We again can easily see that the treatment is having the desired effect, as the cumulative proportion is higher at the low end of the scale. I actually like this one, but it may be harder for folks to interpret, particularly if they haven’t worked with ordinal data extensively. Perhaps the first plot and this one in tandem would work well to give a complete picture.</p>
<pre class="r"><code>dsum[, cumprop := cumsum(prop), keyby = rx]
ggplot(data = dsum, aes(x = WHO, y = cumprop)) +
geom_line(aes(group = legend_label, color = legend_label), size = 1) +
geom_point(aes(color = legend_label), size = 2) +
scale_x_discrete(labels = c("virus-free", 1:9, "died")) +
ylim(0, 1) +
theme(panel.grid = element_blank(),
legend.title = element_blank(),
legend.position = c(.75, .2),
legend.background = element_rect(fill = "grey92"),
legend.key = element_rect(fill = "grey92")) +
scale_color_manual(values = c("#798E87", "#C27D38"),
guide = guide_legend(reverse = TRUE)) +
ylab("cumulative proportion") +
xlab("WHO score")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/index.en_files/figure-html/plot2-1.png" width="480" /></p>
</div>
<div id="distribution-of-outcome-using-stacked-bars" class="section level3">
<h3>Distribution of outcome using stacked bars</h3>
<p>The next one was inspired by a recent <a href="https://jamanetwork.com/journals/jama/fullarticle/2772922" target="_blank">paper</a> describing the results of an RCT assessing the effect of Hydroxychloroquine on COVID-19 patients. The plot is packed with information, but is still simple enough to understand. With a large number of categories, the stacked bars might not make it completely obvious that treatment appears effective. By using more colors, we might be able to overcome this. But I’ve reserved that for the final plot.</p>
<pre class="r"><code>cc <- scales::seq_gradient_pal("#267efa", "white")(seq(0,1,length.out=11))
dsum[, linept := (rx == "treatment") * (2 - 0.3) + (rx == "control") * (1 + 0.3)]
dsum[, axis_label := factor(rx, labels = dnames[, axis_label])]
ggplot(data = dsum, aes( fill = WHO, y = N, x = axis_label)) +
geom_bar(position=position_fill(reverse=TRUE),
stat = "identity", width = 0.6) +
geom_line(aes(x = linept, y = N),
position = position_fill(reverse = TRUE),
lty = 3, size = .2, color = "black") +
geom_text(aes(label = scales::percent(prop, accuracy = 1L), y = prop),
position = position_stack(vjust = 0.5, reverse= TRUE),
size = 2.5, color = "grey20") +
theme(axis.title.x = element_blank(),
panel.grid = element_blank(),
legend.title = element_text(size = 8, face = "bold")) +
scale_fill_manual(
values = cc,
name = "WHO score",
labels = c("virus-free", 1:9, "died"),
guide = guide_legend(reverse = TRUE)) +
ylab("proportion")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/index.en_files/figure-html/plot3-1.png" width="480" /></p>
</div>
<div id="distribution-of-outcome-using-divergent-bars" class="section level3">
<h3>Distribution of outcome using divergent bars</h3>
<p>In this last version, the stacked bars are rotated and shifted so that they diverge from the middle of the WHO scale. This emphasizes that the treatment arm does appear to have a higher proportion of patients who are doing relatively well. This divergent plot is a bit trickier to pull off manually using <code>ggplot</code>, and requires some manipulation of the factors to make it work out the way we want.</p>
<pre class="r"><code>cc_low <- scales::seq_gradient_pal("#faa226", "white")(seq(0.2, 0.8, length.out=4))
cc_high <- scales::seq_gradient_pal("white", "#267efa")(seq(0.2, 0.8, length.out=7))
cc <- c(cc_low, cc_high)
dsum[, w_alt := factor(WHO, levels = 1:11, labels=letters[1:11])]
dsum_low <- dsum[as.numeric(w_alt) %in% c(1:4)]
dsum_high <- dsum[as.numeric(w_alt) %in% c(5:11)]
dsum_high[, w_alt := factor(w_alt, levels = letters[11:5])]
ggplot() +
geom_bar(
data = dsum_low,
aes(x = axis_label, y = -prop, fill = w_alt),
width = .6, stat="identity") +
geom_bar(
data = dsum_high,
aes(x = axis_label, y = prop, fill = w_alt),
width = .6, stat="identity") +
scale_fill_manual(
values = cc,
name = "WHO score",
labels = c("virus-free", 1:9, "died")) +
ylab("proportion") +
theme(panel.grid = element_blank(),
axis.title.y = element_blank(),
legend.title = element_text(size = 8, face = "bold")) +
scale_y_continuous(limits = c(-.50,.75),
breaks = c(-.50, -.25, 0, 0.25, .5, .75),
labels = c("50%", "25%","0%", "25%", "50%","75%")) +
geom_hline(yintercept = 0, color = "grey96") +
coord_flip() </code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/index.en_files/figure-html/plot4-1.png" width="768" /></p>
<p>In the end, it is difficult to say which approach will be preferable - I think it will depend on the actual data and the differences between the groups. Of course, not all journals permit color, so that would certainly influence the design. But this is a good start: it gives us plenty to play around with once the data do finally arrive.</p>
</div>
How useful is it to show uncertainty in a plot comparing proportions?
https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/
Tue, 02 Feb 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/
<script src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/header-attrs/header-attrs.js"></script>
<p>I recently created a simple plot for a paper describing a pilot study of an intervention targeting depression. This small study was largely conducted to assess the feasibility and acceptability of implementing an existing intervention in a new population. The primary outcome measure that was collected was the proportion of patients in each study arm who remained depressed following the intervention. The plot of the study results that we included in the paper looked something like this:</p>
<p><img src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/figure-html/origplot-1.png" width="672" /></p>
<p>The motivation for showing the data in this form was simply to provide a general sense of the outcome patterns we observed, even though I would argue (and I <a href="https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/" target="_blank">have</a> argued) that one shouldn’t try to use a small pilot to draw strong conclusions about a treatment effect, or maybe any conclusions at all. The data are simply too noisy. However, it does seem useful to show data that <em>suggest</em> an intervention <em>might</em> move things in the right direction (or at least not in the wrong direction). I would have been fine showing this plot along with a description of the feasibility outcomes and plans for a future bigger trial that is designed to actually measure the treatment effect and allow us to draw stronger conclusions.</p>
<p>Of course, some journals have different priorities and might want to make stronger statements about the research they publish. Perhaps with this in mind, a reviewer suggested that we include 95% confidence intervals around the point estimates to give a more complete picture. In that case the figure would have looked something like this:</p>
<p><img src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/figure-html/ciplot-1.png" width="672" /></p>
<p>When I shared this plot with my collaborators, it generated a bit of confusion. They had done a test comparing two proportions at period 2 and found a “significant” difference between the two arms. The p-value was 0.04, and the 95% confidence interval for the difference in proportions was [0.03, 0.57], which excludes 0:</p>
<pre class="r"><code>prop.test(x = c(21, 12), n=c(30, 30), correct = TRUE)</code></pre>
<pre><code>##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(21, 12) out of c(30, 30)
## X-squared = 4, df = 1, p-value = 0.04
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.027 0.573
## sample estimates:
## prop 1 prop 2
## 0.7 0.4</code></pre>
<p>Does it make sense that the 95% CIs for the individual proportions overlap while at the same time there does appear to be a real difference between the two groups (at least in this crude way, without making adjustments for multiple testing or considering the possibility that there might be differences in the two groups)? Well - there’s actually no real reason to think that this is a paradox. The two different types of confidence intervals are measuring very different quantities - one set is looking at individual proportions, and the other is looking at the <em>difference</em> in proportions.</p>
<p>I thought a simple way to show this non-paradox would be to generate <em>all</em> the possible confidence intervals and p-values for a case where we have 30 patients per arm, and create a plot to show how overlapping individual confidence intervals for the proportions relate to p-values based on a comparison of those proportions.</p>
<p>I’ve created a data set that is a grid of events, where I am only interested in cases where the number of “events” (e.g. individuals with depression) in the intervention arm is less than the number in the control arm.</p>
<pre class="r"><code>N <- 30
e0 <- c(5:(N-5))
e1 <- c(5:(N-5))
dd <- data.table(expand.grid(e0 = e0, e1 = e1))
dd <- dd[e1 < e0]
dd[, id := 1:.N]
dd</code></pre>
<pre><code>## e0 e1 id
## 1: 6 5 1
## 2: 7 5 2
## 3: 8 5 3
## 4: 9 5 4
## 5: 10 5 5
## ---
## 206: 24 22 206
## 207: 25 22 207
## 208: 24 23 208
## 209: 25 23 209
## 210: 25 24 210</code></pre>
<p>For each pair of possible outcomes, I am estimating the confidence interval for each proportion. If the upper limit of the intervention arm’s 95% CI is greater than the lower limit of the control arm’s 95% CI, the two intervals overlap. (Look back at the confidence interval plot to make sure this makes sense.)</p>
<pre class="r"><code>ci <- function(x, n) {
data.table(t(prop.test(x = x, n = n, correct = T)$conf.int))
}
de0 <- dd[, ci(e0, N), keyby = id]
de0 <- de0[, .(L_e0 = V1, U_e0 = V2)]
de1 <- dd[, ci(e1, N), keyby = id]
de1 <- de1[, .(L_e1 = V1, U_e1 = V2)]
dd <- cbind(dd, de0, de1)
dd[, overlap := factor(U_e1 >= L_e0, labels = c("no overlap", "overlap"))]</code></pre>
<p>In the next and last step, I am getting the p-value for a comparison of the proportions in each pair. Any p-value less than the cutoff of 5% is considered <em>significant</em>.</p>
<pre class="r"><code>cidif <- function(x, n) {
prop.test(x = x, n = n, correct = T)$p.value
}
dd[, pval := cidif(x = c(e1, e0), n = c(N, N)), keyby = id]
dd[, sig := factor(pval < 0.05, labels = c("not significant","significant"))]</code></pre>
<p>The plot shows each pair color coded as to whether there is overlap and the difference is statistically significant.</p>
<pre class="r"><code>library(paletteer)
ggplot(data=dd, aes(x = e0, y = e1)) +
geom_point(aes(color = interaction(overlap, sig)), size = 3, shape =15) +
theme(panel.grid = element_blank(),
legend.title = element_blank()) +
scale_color_paletteer_d(
palette = "wesanderson::Moonrise2",
breaks=c("overlap.not significant", "overlap.significant", "no overlap.significant")
) +
scale_x_continuous(limits=c(5, 25), name = "number of events in control arm") +
scale_y_continuous(limits=c(5, 25), name = "number of events in treatment arm")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/figure-html/gridplot-1.png" width="576" /></p>
<p>The blue points in the center represent outcomes that are relatively close; there is overlap in the individual 95% CIs and the results are <em>not</em> significant. The rust points in the lower right-hand corner represent outcomes where differences are quite large; there is no overlap and the results <em>are</em> significant. (It will always be the case that if there is no overlap in the individual 95% CIs, the differences will be significant, at least before making adjustments for multiplicity, etc.) The region of gold points is where the ambiguity lies, outcomes where there <em>is</em> overlap between the individual 95% CIs but the differences are indeed statistically significant.</p>
<p>The following plot focuses on a single row from the grid plot above. Fixing the number of events in the treatment arm to 10, the transition from (a) substantial overlap and non-significance to (b) less overlap and significance to (c) complete separation and significance is made explicit.</p>
<pre class="r"><code>d10 <- dd[e1==10]
d10 <- melt(
data = d10,
id.vars = c("e0","e1", "sig", "overlap"),
measure.vars = list(c("L_e0", "L_e1"), c("U_e0", "U_e1")),
value.name = c("L","U")
)
ggplot(data = d10, aes(x = factor(e0), ymin = L, ymax = U)) +
geom_errorbar(aes(group = "variable",
color=interaction(overlap, sig)
),
width = .4,
size = 1,
position = position_dodge2()) +
theme(panel.grid = element_blank(),
legend.title = element_blank(),
plot.title = element_text(size = 10, face = "bold")) +
scale_color_paletteer_d(
palette = "wesanderson::Moonrise2",
breaks=c("overlap.not significant", "overlap.significant", "no overlap.significant")
) +
scale_y_continuous(limits = c(0, 1), name = "proportion with events") +
xlab("number of events in control arm") +
ggtitle("Comparison of 95% CIs and significance - treatment arm fixed at 10 events")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/figure-html/overlap-1.png" width="768" /></p>
<p>Where does this leave us? I think including the 95% CIs for the individual proportions is not really all that helpful, because there is that area of ambiguity. (Not to mention the fact that I think we should be de-emphasizing the p-values while reporting the results of a pilot.)</p>
<p>In this case, I am fine with the original plot, but it is possible to provide an alternative measure of uncertainty by including error bars defined by the sample standard deviation. While doing this is typically more interesting in the context of continuous outcomes, it does give a sense of the sampling variability, which in the case of proportions is largely driven by the sample size. If you do decide to go this route, make sure to label the plot clearly to indicate what the error bars represent (so readers don’t think they are something they are not, such as 95% CIs).</p>
<p><img src="https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/index.en_files/figure-html/sdplot-1.png" width="672" /></p>
Finding answers faster for COVID-19: an application of Bayesian predictive probabilities
https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/
Tue, 19 Jan 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/
<script src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/header-attrs/header-attrs.js"></script>
<p>As we evaluate therapies for COVID-19 to help improve outcomes during the pandemic, researchers need to be able to make recommendations as quickly as possible. There really is no time to lose. The Data & Safety Monitoring Board (DSMB) of <a href="https://bit.ly/3qhY2f5" target="_blank">COMPILE</a>, a prospective individual patient data meta-analysis, recognizes this. They are regularly monitoring the data to determine if there is a sufficiently strong signal to indicate effectiveness of convalescent plasma (CP) for hospitalized patients not on ventilation.</p>
<p>How much data is enough to draw a conclusion? We know that at some point in the next few months, many if not all of the studies included in the meta-analysis will reach their target enrollment and will stop recruiting new patients; at that point, the meta-analysis data set will be complete. Before that endpoint, an interim DSMB analysis might indicate there is a high probability that CP is effective even though it does not yet meet the pre-established threshold of 95%. If we know the specific number of patients that will ultimately be included in the final data set, we can predict the probability that the findings will put us over that threshold, possibly enabling a recommendation. If this probability is not too low, the DSMB may decide it is worth waiting for the complete results before drawing any conclusions.</p>
<p>Predicting the probability of success (or futility) is done using the most recent information collected from the study, which includes observed data, the parameter estimates, and the uncertainty surrounding these estimates (which is reflected in the posterior probability distribution).</p>
<p>This post provides an example using a simulated data set to show how this prediction can be made.</p>
<div id="determining-success" class="section level2">
<h2>Determining success</h2>
<p>In this example, the outcome is the WHO 11-point ordinal scale for clinical status at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. As in COMPILE, I’ll use a Bayesian proportional odds model to assess the effectiveness of CP. The measure of effectiveness is an odds ratio (OR) that compares the cumulative odds of having a worse outcome for the treated group compared to the cumulative odds for the control group:</p>
<p><span class="math display">\[
\text{Cumulative odds for level } k \text{ in treatment arm } j =\frac{P(Y_{ij} \ge k)}{P(Y_{ij} \lt k)}, \ k \in \{1,\dots, 10\}
\]</span></p>
<p>The goal is to reduce the odds of having a bad outcome, so a successful therapy is one where <span class="math inline">\(OR \lt 1\)</span>. In a Bayesian context, we estimate the posterior probability distribution of the <span class="math inline">\(OR\)</span> (based on prior assumptions before we have collected any data). We will recommend the therapy in the case that most of the probability density lies to the left of 1; in particular we will claim success only when <span class="math inline">\(P(OR \lt 1) > 0.95\)</span>. For example, in the figure the posterior distribution on top would lead us to consider the therapy successful since 95% of the density falls below 1, whereas the distribution on the bottom would not:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
</div>
<div id="data-set" class="section level2">
<h2>Data set</h2>
<p>The data set here is considerably simpler than the COMPILE data that has motivated all of this. Rather than structuring this example as a multi-study data set, I am assuming a simple two-arm design without any sort of clustering. I am including two binary covariates related to sex and age. The treatment in this case reduces the odds of worse outcomes (or increases the odds of better outcomes). For a more detailed discussion of generating ordinal outcomes, see this earlier <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/" target="_blank">post</a> (but note that I have flipped the direction of the cumulative probability in the odds formula).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
def1 <- defDataAdd(varname="male", formula="0.7", dist = "binary")
def1 <- defDataAdd(def1, varname="over69", formula="0.6", dist = "binary")
def1 <- defDataAdd(def1,
varname="z", formula="0.2*male + 0.3*over69 - 0.3*rx", dist = "nonrandom")
baseprobs <- c(0.10, 0.15, 0.08, 0.07, 0.08, 0.08, 0.11, 0.10, 0.09, 0.08, 0.06)
RNGkind("L'Ecuyer-CMRG")
set.seed(9121173)
dd <- genData(450)
dd <- trtAssign(dd, nTrt = 2, grpName = "rx")
dd <- addColumns(def1, dd)
dd <- genOrdCat(dd, adjVar = "z", baseprobs = baseprobs, catVar = "y")</code></pre>
<p>Here is a plot of the cumulative proportions by treatment arm for the first 450 patients in the (simulated) trial. The treatment arm has more patients with lower WHO-11 scores, so for the most part lies above the control arm line. (This may be a little counter-intuitive, so it may be worthwhile to think about it for a moment.)</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="estimate-a-bayes-ordinal-cumulative-model" class="section level2">
<h2>Estimate a Bayes ordinal cumulative model</h2>
<p>With the data from 450 patients in hand, the first step is to estimate a Bayesian proportional odds model, which I am doing in <code>Stan</code>. I use the package <code>cmdstanr</code> to interface between <code>R</code> and <code>Stan</code>.</p>
<p>Here is the model:</p>
<p><span class="math display">\[
\text{logit}\left(P(y_{i} \ge k) \right) = \tau_k + \beta_1 I(\text{male}) + \beta_2 I(\text{over69}) + \delta T_i, \ \ \ k \in \{ 1,\dots,10 \}
\]</span></p>
<p>where <span class="math inline">\(T_i\)</span> is the treatment indicator for patient <span class="math inline">\(i\)</span>, and <span class="math inline">\(T_i = 1\)</span> when patient <span class="math inline">\(i\)</span> receives CP. <span class="math inline">\(\delta\)</span> represents the log odds ratio, so <span class="math inline">\(OR = e^{\delta}\)</span>. I’ve included the Stan code for the model in the the first <a href="#addendumA">addendum</a>.</p>
<pre class="r"><code>library(cmdstanr)
dt_to_list <- function(dx) {
N <- nrow(dx) ## number of observations
L <- dx[, length(unique(y))] ## number of levels of outcome
y <- as.numeric(dx$y) ## individual outcome
rx <- dx$rx ## treatment arm for individual
x <- model.matrix(y ~ factor(male) + factor(over69), data = dx)[, -1]
D <- ncol(x)
list(N=N, L=L, y=y, rx=rx, x=x, D=D)
}
mod <- cmdstan_model("pprob.stan")
fit <- mod$sample(
data = dt_to_list(dd),
seed = 271263,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 4 finished in 56.9 seconds.
## Chain 1 finished in 57.8 seconds.
## Chain 3 finished in 58.0 seconds.
## Chain 2 finished in 60.7 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 58.3 seconds.
## Total execution time: 61.0 seconds.</code></pre>
</div>
<div id="extract-posterior-distribution" class="section level2">
<h2>Extract posterior distribution</h2>
<p>Once the model is fit, our primary interest is whether we can make a recommendation about the therapy. A quick check of whether <span class="math inline">\(P(OR < 1) > 0.95\)</span> confirms that we are not there yet.</p>
<pre class="r"><code>library(posterior)
draws_df <- as_draws_df(fit$draws())
draws_dt <- data.table(draws_df[-grep("^yhat", colnames(draws_df))])
mean(draws_dt[, OR < 1])</code></pre>
<pre><code>## [1] 0.89</code></pre>
<p>A plot that shows the bottom 95% portion of the density in blue makes it clear that the threshold has not been met:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="the-elements-of-the-predictive-probability" class="section level2">
<h2>The elements of the predictive probability</h2>
<p>We have collected complete data from 450 patients out of an expected 500, though we are not yet ready to declare success. An interesting question to ask at this point is “<em>given what we have observed up until this point, what is the probability that we will declare success after 50 additional patients are included in the analysis?</em>” If the probability is sufficiently high, we may decide to delay releasing the inconclusive results pending the updated data set. (On the other hand, if the probability is quite low, there may be no point in delaying.)</p>
<p>The prediction incorporates three potential sources of uncertainty. First, there is the uncertainty regarding the parameters, which is described by the posterior distribution. Second, even if we knew the parameters with certainty, the outcome remains stochastic (i.e. not pre-determined conditional on the parameters). Finally, we don’t necessarily know the characteristics of the remaining patients (though we may have some or all of that information if recruitment has finished but follow-up is not yet complete).</p>
<p>In the algorithm that follows, the steps correspond to these three elements of uncertainty:</p>
<ol style="list-style-type: decimal">
<li>Generate 50 new patients by bootstrap sampling (with replacement) from the observed patients. The distribution of covariates of the new 50 patients will be based on the original 450 patients.</li>
<li>Make a single draw from the posterior distribution of our estimates to generate a set of parameters.</li>
<li>Using the combination of new patients and parameters, generate ordinal outcomes for each of the 50 patients.</li>
<li>Combine this new data set with the 450 existing patients to create a single analytic file.</li>
<li>Fit a new Bayesian model with the 500-patient data set.</li>
<li>Record <span class="math inline">\(P(OR < 1)\)</span> based on this new posterior distribution. If <span class="math inline">\(P(OR < 1)\)</span> is <span class="math inline">\(\gt 95\%\)</span>, we consider the result to be a “success”, otherwise it is “not a success”.</li>
</ol>
<p>We repeat this cycle, say 1000 times. <em>The proportion of cycles that are counted as a success represents the predictive probability of success</em>.</p>
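<p>Conceptually, the whole procedure boils down to a few lines of <code>R</code>. Here is a minimal sketch, where <code>one_iteration</code> is a hypothetical function that would carry out steps 1 through 6 and return <code>TRUE</code> when the re-fit model meets the success criterion (its body below is just a placeholder so the sketch runs):</p>
<pre class="r"><code>one_iteration <- function() {
  ## steps 1-6: bootstrap 50 new patients, draw parameters from the posterior,
  ## generate outcomes, re-fit the model, and check whether P(OR < 1) > 0.95
  runif(1) < 0.6   ## placeholder for the actual success indicator
}
set.seed(123)
successes <- replicate(1000, one_iteration())
(pred_prob <- mean(successes))   ## estimated predictive probability of success</code></pre>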
<div id="step-1-new-patients" class="section level4">
<h4>Step 1: new patients</h4>
<pre class="r"><code>library(glue)
dd_new <- dd[, .(id = sample(id, 25, replace = TRUE)), keyby = rx]
dd_new <- merge(dd[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id:= (nrow(dd) + 1):(nrow(dd) +.N)]</code></pre>
</div>
<div id="step-2-draw-set-of-parameters" class="section level4">
<h4>Step 2: draw set of parameters</h4>
<pre class="r"><code>draw <- as.data.frame(draws_dt[sample(.N, 1)])</code></pre>
<p>The coefficients <span class="math inline">\(\hat{\beta}_1\)</span> (male), <span class="math inline">\(\hat{\beta}_2\)</span> (over69), and <span class="math inline">\(\hat{\delta}\)</span> (treatment effect) are extracted from the draw from the posterior:</p>
<pre class="r"><code>D <- dt_to_list(dd)$D
beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))
coefs</code></pre>
<pre><code>## [,1]
## [1,] 0.22
## [2,] 0.60
## [3,] -0.19</code></pre>
<p>Using the draws of the <span class="math inline">\(\tau_k\)</span>’s, I’ve calculated the corresponding probabilities that can be used to generate the ordinal outcome for the new observations:</p>
<pre class="r"><code>tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)
cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop)
baseline</code></pre>
<pre><code>## [1] 0.117 0.136 0.123 0.076 0.089 0.101 0.114 0.102 0.054 0.040 0.048</code></pre>
</div>
<div id="step-3-generate-outcome-using-coefficients-and-baseline-probabilities" class="section level4">
<h4>Step 3: generate outcome using coefficients and baseline probabilities</h4>
<pre class="r"><code>zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs
setkey(dd_new, id)
dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")</code></pre>
</div>
<div id="step-4-combine-new-with-existing" class="section level4">
<h4>Step 4: combine new with existing</h4>
<pre class="r"><code>dx <- rbind(dd, dd_new)</code></pre>
</div>
<div id="step-5-fit-model" class="section level4">
<h4>Step 5: fit model</h4>
<pre class="r"><code>fit_pp <- mod$sample(
data = dt_to_list(dx),
seed = 737163,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)</code></pre>
<pre><code>## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 79.4 seconds.
## Chain 4 finished in 79.7 seconds.
## Chain 3 finished in 80.1 seconds.
## Chain 1 finished in 80.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 79.9 seconds.
## Total execution time: 80.6 seconds.</code></pre>
</div>
<div id="step-6-assess-success-for-single-iteration" class="section level4">
<h4>Step 6: assess success for single iteration</h4>
<pre class="r"><code>draws_pp <- data.table(as_draws_df(fit_pp$draws()))
draws_pp[, mean(OR < 1)]</code></pre>
<pre><code>## [1] 0.79</code></pre>
</div>
</div>
<div id="estimating-the-predictive-probability" class="section level2">
<h2>Estimating the predictive probability</h2>
<p>The next step is to pull all these elements together in a single function that we can call repeatedly to estimate the predictive probability of success. This probability is estimated by calculating the proportion of iterations that result in a success.</p>
<p>Computing resources required for this estimation might be quite substantial. If we iterate 1000 times, we need to fit 1000 Bayesian models, which could be prohibitive - a high performance computing cluster (HPC) may be necessary. (I touched on this <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">earlier</a> when I described exploring the frequentist properties of Bayesian models.) I have provided the code below in the second <a href="#addendumB">addendum</a> in case any readers are interested in trying to implement it on an HPC.</p>
<p>I’ll conclude with a figure that shows how predictive probabilities can vary depending on the observed sample size and <span class="math inline">\(P(OR < 1)\)</span> for the interim data set. Based on the data generating process I’ve used here, if we observe a <span class="math inline">\(P(OR < 1) = 90\%\)</span> at an interim look after 250 patients, it is considerably more probable that we will end up over 95% than if we observe that same probability at an interim look after 450 patients. This makes sense, of course, since the estimate at 450 patients will have less uncertainty, and adding 50 patients is unlikely to change the results dramatically. The converse is true after 250 patients.</p>
<p><img src="https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/index.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>Ultimately, the interpretation of the predictive probability will depend on the urgency of making a recommendation, the costs of waiting, the costs of deciding too soon, and other factors specific to the trial and those making the decisions.</p>
<p><a name="addendumA"></a></p>
<p> </p>
</div>
<div id="addendum-a-stan-code" class="section level2">
<h2>Addendum A: stan code</h2>
<pre class="stan"><code>data {
int<lower=0> N; // number of observations
int<lower=2> L; // number of WHO categories
int<lower=1,upper=L> y[N]; // vector of categorical outcomes
int<lower=0,upper=1> rx[N]; // treatment or control
int<lower=1> D; // number of covariates
row_vector[D] x[N]; // matrix of covariates N x D matrix
}
parameters {
vector[D] beta; // covariate estimates
real delta; // treatment effect (log odds ratio)
ordered[L-1] tau; // cut-points for cumulative odds model ([L-1] vector)
}
transformed parameters{
vector[N] yhat;
for (i in 1:N){
yhat[i] = x[i] * beta + rx[i] * delta;
}
}
model {
// priors
beta ~ student_t(3, 0, 10);
delta ~ student_t(3, 0, 2);
tau ~ student_t(3, 0, 8);
// outcome model
for (i in 1:N)
y[i] ~ ordered_logistic(yhat[i], tau);
}
generated quantities {
real OR = exp(delta);
}</code></pre>
<p><a name="addendumB"></a></p>
<p> </p>
</div>
<div id="addendum-b-hpc-code" class="section level2">
<h2>Addendum B: HPC code</h2>
<pre class="r"><code>library(slurmR)
est_from_draw <- function(n_draw, Draws, dd_obs, D, s_model) {
set_cmdstan_path(path = "/.../cmdstan/2.25.0")
dd_new <- dd_obs[, .(id = sample(id, 125, replace = TRUE)), keyby = rx]
dd_new <- merge(dd_obs[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id:= (nrow(dd_obs) + 1):(nrow(dd_obs) +.N)]
draw <- as.data.frame(Draws[sample(.N, 1)])
beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))
tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)
cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop)
zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs
setkey(dd_new, id)
dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")
dx <- rbind(dd_obs, dd_new)
fit_pp <- s_model$sample(
data = dt_to_list(dx),
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 2000,
iter_sampling = 2500,
step_size = 0.1
)
draws_pp <- data.table(as_draws_df(fit_pp$draws()))
return(data.table(n_draw, prop_success = draws_pp[, mean(OR < 1)]))
}
job <- Slurm_lapply(
X = 1L:1080L,
FUN = est_from_draw,
Draws = draws_dt,
dd_obs = dd,
D = D,
s_model = mod,
njobs = 90L,
mc.cores = 4L,
job_name = "i_pp",
tmp_path = "/.../scratch",
plan = "wait",
sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
export = c("dt_to_list"),
overwrite = TRUE
)
job
res <- Slurm_collect(job)
rbindlist(res)[, mean(prop_success >= 0.95)]</code></pre>
</div>
Coming soon: effortlessly generate ordinal data without assuming proportional odds
https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/
Tue, 05 Jan 2021 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I’m starting off 2021 with my 99th post ever to introduce a new feature that will be incorporated into <code>simstudy</code> soon to make it a bit easier to generate ordinal data without requiring an assumption of proportional odds. I should wait until this feature has been incorporated into the development version, but I want to put it out there in case anyone has any further suggestions. In any case, having this out in plain view will motivate me to get back to work on the package.</p>
<p>In the <a href="https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/" target="_blank">past</a>, I have generated ordinal data without the proportional odds assumption, but it was a bit cumbersome. I’ve wanted to simplify the approach by incorporating the functionality directly into <code>genOrdCat</code>. While this last step is not complete, it is very close. For now, there is a temporary function <code>genOrdCatNP</code>.</p>
<div id="starting-with-the-proportional-odds-assumption" class="section level2">
<h2>Starting with the proportional odds assumption</h2>
<p>First, I am generating a data set using <code>genOrdCat</code> and an assumption of proportionality. This will be a large data set to eliminate any sampling issues when summarizing the odds ratios. (Please take a look <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/" target="_blank">here</a> for a little more detail on how the odds and cumulative odds are defined.) The data will include a treatment or exposure indicator and an ordinal categorical outcome with four possible responses. In this case, the log odds-ratio comparing the odds for the treatment group with the control group will be -1.3, which translates to an odds ratio (OR) of 0.27.</p>
<pre class="r"><code>d1 <- defDataAdd(varname = "z", formula = "rx * 1.3", dist = "nonrandom")
set.seed(382763)
dd <- genData(100000)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(d1, dd)
baseprobs <- c(.4, .3, .2, .1)
dd_p <- genOrdCat(dtName = dd, adjVar = "z", baseprobs = baseprobs)</code></pre>
<p>Here is a calculation of the cumulative proportions and odds, as well as the odds ratios, which are all close to 0.27 (a log odds ratio of -1.3):</p>
<pre class="r"><code>(prop <- prop.table(dd_p[, table(rx, cat)], 1))</code></pre>
<pre><code>## cat
## rx 1 2 3 4
## 0 0.40 0.30 0.20 0.10
## 1 0.15 0.24 0.32 0.29</code></pre>
<pre class="r"><code>(cumprop <- data.table(apply(prop, 1, cumsum)))</code></pre>
<pre><code>## 0 1
## 1: 0.4 0.15
## 2: 0.7 0.39
## 3: 0.9 0.71
## 4: 1.0 1.00</code></pre>
<pre class="r"><code>(cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))])</code></pre>
<pre><code>## odds0 odds1
## 1: 0.67 0.18
## 2: 2.32 0.63
## 3: 8.84 2.45
## 4: Inf Inf</code></pre>
<pre class="r"><code>cumodds[1:3, odds1/odds0]</code></pre>
<pre><code>## [1] 0.27 0.27 0.28</code></pre>
<p>To visualize cumulative proportional odds, here’s a plot using the really nice <code>likert</code> package:</p>
<pre class="r"><code>library(likert)
item <- data.frame(dd_p[, "cat"])
names(item) <- "response"
bin.grp <- factor(dd_p[, rx])
likert.bin <- likert(item, grouping = bin.grp)
plot(likert.bin) + ggtitle("Proportional odds")</code></pre>
<p><img src="https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/index.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="its-now-simple-to-relax-the-proportionality-assumption" class="section level2">
<h2>It’s now simple to relax the proportionality assumption</h2>
<p>Two arguments are needed to relax the assumption: <code>npVar</code> and <code>npAdj</code>. The user specifies the first argument to indicate what variable is driving the lack of proportionality. In this case, it will be the treatment itself, since that is the only covariate in the data set. The second argument is a vector that represents the violation of the assumption at each level of the ordinal measure. The length of this vector currently matches the number of ordinal categories, though the last value will always be 0.</p>
<p>In this example, the cumulative log-odds ratio for the first category will be 0.3 points more than the proportional assumption of -1.3, so should be quite close to -1.0 (or 0.37 on the OR scale). The cumulative log-odds ratio (and OR) for the second category will be the same as in the case of proportionality. And the cumulative log-odds ratio for the third category will be 1.0 less than -1.3, or around -2.3 (which translates to 0.10 on the OR scale).</p>
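<p>Translating these cumulative log odds ratios to the OR scale is a one-line check:</p>
<pre class="r"><code>round(exp(-1.3 + c(0.3, 0.0, -1.0)), 2)  ## 0.37 0.27 0.10</code></pre>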
<pre class="r"><code>npAdj <- c(-0.3, 0.0, 1.0, 0)
dd_np <- genOrdCatNP(dtName = dd, adjVar = "z", baseprobs = baseprobs,
npVar = "rx", npAdj = npAdj)
cumprop <- data.table(apply(prop.table(dd_np[, table(rx, cat)], 1), 1, cumsum))
cumodds <- cumprop[, .(odds0 = `0`/(1 - `0`), odds1=`1`/(1 - `1`))]
cumodds[1:3, odds1/odds0]</code></pre>
<pre><code>## [1] 0.37 0.27 0.10</code></pre>
<p>And here is what non-proportional cumulative odds looks like:</p>
<p><img src="https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/index.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
Constrained randomization to evaluate the vaccine rollout in nursing homes
https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/
Tue, 22 Dec 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>On an incredibly heartening note, two COVID-19 vaccines have been approved for use in the US and other countries around the world. More are possibly on the way. The big challenge, at least here in the United States, is to convince people that these vaccines are safe and effective; we need people to get vaccinated as soon as they are able in order to slow the spread of this disease. I for one will not hesitate for a moment to get a shot when I have the opportunity, though I don’t think biostatisticians are too high on the priority list.</p>
<p>Those who <em>are</em> at the top of the priority list are staff and residents of nursing homes. Unfortunately, within these communities, there are pockets of ambivalence or skepticism about vaccines in general, and the COVID vaccine in particular. In the past, influenza vaccine rates for residents at some facilities could be as low as 50%. I am part of an effort organized by researchers affiliated with the <a href="https://impactcollaboratory.org/" target="_blank">IMPACT Collaboratory</a> to figure out a way to increase these numbers.</p>
<p>This effort involves a cluster randomized trial (CRT) to evaluate the effectiveness of a wide-ranging outreach program designed to encourage nursing home residents to get their shots. The focus is on facilities that have relatively high proportions of African-American and Latinx residents, because these facilities have been among the least successful in the past in convincing residents to get vaccinated. The outcome measure of the trial, which will be measured at the individual level, will be the probability of vaccination within 5 weeks of being available at the nursing home.</p>
<p>The nursing homes in the study are members of one of four national nursing home networks or corporations. In this CRT, the randomization will be stratified by these networks and by the proportion of African-American and Latinx residents. We are defining the race/ethnicity strata using cutoffs of proportions: &lt;25%, 25% to 40%, and &gt;40%. We want our randomization to provide balance with respect to racial/ethnic composition in the intervention and control arms within each individual stratum. However, we are concerned that the strata with fewer nursing homes have a high risk of imbalance just by chance. Constrained randomization is one possible way to mitigate this risk, which is the focus here in this post.</p>
<div id="constrained-randomization" class="section level3">
<h3>Constrained randomization</h3>
<p>The basic idea is pretty simple. We generate a large number of possible randomization lists based on the requirements of the study design. For each randomization, we evaluate whether the balancing criteria have been met; by selecting only the subset of randomizations that pass this test, we create a sample of eligible randomizations. With this list of possible randomizations in hand, we randomly select one, which becomes the actual randomization. Because we have limited the final selection only to possible randomizations that have been vetted for balance (or whatever criteria we require), we are guaranteed to satisfy the pre-specified criteria.</p>
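<p>Stripped of the study-specific details, the procedure can be sketched in just a few lines; the covariate, sample sizes, and 0.1 balance threshold here are all made up for illustration:</p>
<pre class="r"><code>set.seed(1)
x <- rnorm(20)                        ## a site-level covariate to balance
candidates <- list()
for (i in 1:2000) {
  rx <- sample(rep(0:1, each = 10))   ## a 1:1 allocation of 20 sites
  if (abs(mean(x[rx == 1]) - mean(x[rx == 0])) < 0.1) {
    candidates[[length(candidates) + 1]] <- rx
  }
}
## randomly select one of the eligible allocations
final_rx <- candidates[[sample(length(candidates), 1)]]</code></pre>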
</div>
<div id="simulated-data" class="section level3">
<h3>Simulated data</h3>
<p>I am using a single simulated data set to illustrate the constrained randomization process. Using the <code>simstudy</code> package, creating this data set is a two-step process of <em>defining</em> the data and then <em>generating</em> the data.</p>
<div id="defining-the-data" class="section level4">
<h4>Defining the data</h4>
<p>There will be a total of 200 nursing homes in 3 (rather than 4) networks. Just as in the real data, racial/ethnic composition will differ by network (because they are based in different parts of the country). And the networks are different sizes. The proportions of African-American/Latinx residents are generated using the <code>beta</code> distribution, which ranges from 0 to 1. In <code>simstudy</code>, the beta distribution is parameterized with a mean (specified in the <code>formula</code> argument) and dispersion (specified in the <code>variance</code> argument). See <a href="https://kgoldfeld.github.io/simstudy/articles/simstudy.html" target="_blank">this</a> for more details on the <em>beta</em> distribution.</p>
<pre class="r"><code>library(simstudy)
library(data.table)
def <- defData(varname = "network", formula = "0.3;0.5;0.2",
dist = "categorical", id = "site")
defC <- defCondition(condition = "network == 1",
formula = "0.25", variance = 10, dist = "beta")
defC <- defCondition(defC, condition = "network == 2",
formula = "0.3", variance = 7.5, dist = "beta")
defC <- defCondition(defC, condition = "network == 3",
formula = "0.35", variance = 5, dist = "beta")</code></pre>
</div>
<div id="generating-the-data" class="section level4">
<h4>Generating the data</h4>
<pre class="r"><code>set.seed(2323761)
dd <- genData(200, def, id = "site")
dd <- addCondition(defC, dd, newvar = "prop")
dd[, stratum := cut(prop, breaks = c(0, .25, .4, 1),
include.lowest = TRUE, labels = c(1, 2, 3))]
dd</code></pre>
<pre><code>## site prop network stratum
## 1: 1 0.340 2 2
## 2: 2 0.181 2 1
## 3: 3 0.163 2 1
## 4: 4 0.099 3 1
## 5: 5 0.178 2 1
## ---
## 196: 196 0.500 2 3
## 197: 197 0.080 3 1
## 198: 198 0.479 3 3
## 199: 199 0.071 2 1
## 200: 200 0.428 2 3</code></pre>
</div>
</div>
<div id="randomization" class="section level3">
<h3>Randomization</h3>
<p>Now that we have a data set in hand, we can go ahead and randomize. I am using the <code>simstudy</code> function <code>trtAssign</code>, which allows us to specify the strata as well as the ratio of controls to intervention facilities. In this case, we have a limit on the number of sites at which we can implement the intervention. In this simulation, I assume that we’ll randomize 150 sites to control, and 50 to the intervention, a 3:1 ratio.</p>
<pre class="r"><code>dr <- trtAssign(dd, nTrt = 2, balanced = TRUE, strata = c("network", "stratum"),
ratio = c(3, 1), grpName = "rx")</code></pre>
<p>We want to inspect the average proportion of African-American/Latinx residents within each stratum (without adjusting for nursing home size, which is ignored here). First, we create a data table that includes the difference in average proportions between the facilities randomized to the intervention and those randomized to control:</p>
<pre class="r"><code>dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
dc <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
dc[, dif := abs(`1` - `0`)]</code></pre>
<p>Looking at both the table and the figure, Stratum 3 (>40%) in Network 1 jumps out as having the largest discrepancy, about 15 percentage points:</p>
<pre><code>## network stratum 0 1 dif
## 1: 1 1 0.16 0.192 0.03062
## 2: 1 2 0.31 0.305 0.00641
## 3: 1 3 0.45 0.600 0.14568
## 4: 2 1 0.15 0.143 0.00394
## 5: 2 2 0.32 0.321 0.00310
## 6: 2 3 0.49 0.547 0.05738
## 7: 3 1 0.13 0.085 0.04948
## 8: 3 2 0.33 0.331 0.00029
## 9: 3 3 0.56 0.607 0.04284</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/index.en_files/figure-html/unnamed-chunk-7-1.png" width="576" /></p>
</div>
<div id="constraining-the-randomization" class="section level3">
<h3>Constraining the randomization</h3>
<p>We want to do better and ensure that the maximum difference within a stratum falls below some specified threshold, say a 3 percentage point difference. All we need to do is repeatedly randomize and then check balance. I’ve written a function <code>randomize</code> that will be called repeatedly. Here I generate 1000 randomization lists, but in some cases I might need to generate many more, particularly if it is difficult to achieve targeted balance in any particular randomization.</p>
<pre class="r"><code>randomize <- function(dd) {
dr <- trtAssign(dd, nTrt = 2, strata = c("network", "stratum"), balanced = TRUE,
ratio = c(3, 1), grpName = "rx")
dx <- dr[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]
dx <- dcast(dx, network + stratum ~ rx, value.var = "mu_prop")
dx[, dif := abs(`1` - `0`)]
list(is_candidate = all(dx$dif < 0.03), randomization = dr[,.(site, rx)],
balance = dx)
}
rand_list <- lapply(1:1000, function(x) randomize(dd))</code></pre>
<p>Here is one randomization that fails to meet the criteria as 5 of the 9 strata exceed the 3 percentage point threshold:</p>
<pre><code>## [[1]]
## [[1]]$is_candidate
## [1] FALSE
##
## [[1]]$randomization
## site rx
## 1: 1 0
## 2: 2 1
## 3: 3 0
## 4: 4 1
## 5: 5 0
## ---
## 196: 196 0
## 197: 197 1
## 198: 198 0
## 199: 199 0
## 200: 200 0
##
## [[1]]$balance
## network stratum 0 1 dif
## 1: 1 1 0.16 0.207 0.0503
## 2: 1 2 0.30 0.334 0.0330
## 3: 1 3 0.45 0.600 0.1457
## 4: 2 1 0.15 0.138 0.0107
## 5: 2 2 0.32 0.330 0.0078
## 6: 2 3 0.50 0.514 0.0142
## 7: 3 1 0.13 0.085 0.0493
## 8: 3 2 0.34 0.311 0.0239
## 9: 3 3 0.55 0.647 0.0950</code></pre>
<p>Here is another that passes, as all differences are below the 3 percentage point threshold:</p>
<pre><code>## [[1]]
## [[1]]$is_candidate
## [1] TRUE
##
## [[1]]$randomization
## site rx
## 1: 1 1
## 2: 2 0
## 3: 3 1
## 4: 4 0
## 5: 5 1
## ---
## 196: 196 1
## 197: 197 0
## 198: 198 1
## 199: 199 1
## 200: 200 0
##
## [[1]]$balance
## network stratum 0 1 dif
## 1: 1 1 0.16 0.18 0.0168
## 2: 1 2 0.31 0.31 0.0039
## 3: 1 3 0.49 0.49 0.0041
## 4: 2 1 0.15 0.14 0.0144
## 5: 2 2 0.32 0.33 0.0064
## 6: 2 3 0.50 0.52 0.0196
## 7: 3 1 0.12 0.12 0.0095
## 8: 3 2 0.34 0.31 0.0239
## 9: 3 3 0.57 0.58 0.0134</code></pre>
<p>All that remains is to identify all the randomization sets that met the criteria (in this case there are only 6, suggesting we should probably generate at least 100,000 randomizations to ensure we have enough to pick from).</p>
<pre class="r"><code>candidate_indices <- sapply(rand_list, function(x) x[["is_candidate"]])
candidates <- rand_list[candidate_indices]
(n_candidates <- length(candidates))</code></pre>
<pre><code>## [1] 6</code></pre>
<pre class="r"><code>selected <- sample(x = n_candidates, size = 1)
ds <- candidates[[selected]][["randomization"]]
ds <- merge(dd, ds, by = "site")
dx <- ds[, .(mu_prop = mean(prop)), keyby = c("network", "stratum", "rx")]</code></pre>
<p>And looking at the plot confirms that we have a randomization scheme that is balanced based on our target:</p>
<p><img src="https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/index.en_files/figure-html/unnamed-chunk-12-1.png" width="576" /></p>
<p>Of course, the selection criteria could be based on any combination of factors. We may have multiple means that we want to balance, or we might want the two arms to be similar with respect to the standard deviation of a measure. These additional criteria may require more randomization schemes to be generated just because balance is that much more difficult to achieve, but all that really costs is computing time, not programming effort.</p>
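<p>For example, the check inside the <code>randomize</code> function could be extended so that both the means and the standard deviations must be close across arms. Here is a small sketch of such a criterion; the tolerances (and the simulated data) are arbitrary choices for illustration:</p>
<pre class="r"><code>balanced_enough <- function(prop, rx, mean_tol = 0.03, sd_tol = 0.05) {
  dif_mean <- abs(mean(prop[rx == 1]) - mean(prop[rx == 0]))
  dif_sd <- abs(sd(prop[rx == 1]) - sd(prop[rx == 0]))
  dif_mean < mean_tol & dif_sd < sd_tol
}
set.seed(7)
prop <- rbeta(200, 2, 5)             ## simulated site-level proportions
rx <- sample(rep(0:1, c(150, 50)))   ## a 3:1 allocation
balanced_enough(prop, rx)            ## TRUE only if both criteria are met</code></pre>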
<p><small><font color="darkkhaki">
Support: This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small></p>
</div>
A Bayesian implementation of a latent threshold model
https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/
Tue, 08 Dec 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In the <a href="https://www.rdatagen.net/post/a-latent-threshold-model/" target="_blank">previous post</a>, I described a latent threshold model that might be helpful if we want to dichotomize a continuous predictor but we don’t know the appropriate cut-off point. This was motivated by a need to identify a threshold of antibody levels present in convalescent plasma that is currently being tested as a therapy for hospitalized patients with COVID in a number of RCTs, including those that are participating in the ongoing <a href="https://bit.ly/3lTTc4Q" target="_blank">COMPILE meta-analysis</a>.</p>
<p>Barring any specific scientific rationale, we could pick an arbitrary threshold and continue with our analysis. Unfortunately, our estimates would not reflect the uncertainty around the selection of that threshold point; an approach that incorporates this uncertainty would be more appropriate. Last time, I described a relatively simple scenario with a single continuous predictor, a latent threshold, and a continuous outcome; the estimates were generated using the <code>R</code> package <code>chngpt</code>. Because I want to be able to build more flexible models in the future that could accommodate multiple continuous predictors (and latent thresholds), I decided to implement a Bayesian version of the model.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>Before laying out the model (described in much more detail in the <a href="https://bit.ly/3fYbd0M" target="_blank">Stan User’s Guide</a>), I should highlight two key features. First, we assume that the distribution of the outcome differs on either side of the threshold. In this example, we expect that the outcome data for antibody levels below the threshold are distributed as <span class="math inline">\(N(\alpha, \sigma)\)</span>, and that data above the threshold are <span class="math inline">\(N(\beta, \sigma)\)</span>. Second, since we do not know the threshold value, the likelihood is specified as a mixture across the range of all possible thresholds; the posterior distribution of the parameters <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> reflect the uncertainty where the threshold lies.</p>
<p>The observed data include the continuous outcome <span class="math inline">\(\textbf{y}\)</span> and a continuous antibody measure <span class="math inline">\(\textbf{x}\)</span>. There are <span class="math inline">\(M\)</span> possible pre-specified thresholds that are reflected in the vector <span class="math inline">\(\mathbf{c}\)</span>. Each candidate threshold is treated as a discrete quantity and a probability <span class="math inline">\(\lambda_m\)</span> is attached to each. Here is the model for the outcome conditional on the distribution parameters as well as the probability of the thresholds:</p>
<p><span class="math display">\[p(\textbf{y}|\alpha, \beta, \sigma, \mathbf{\lambda}) = \sum_{m=1}^M \lambda_m \left(\prod_{i: \; x_i < c[m]} \text{normal}(y_i | \alpha, \sigma) \prod_{i: \; x_i \ge c[m]} \text{normal}(y_i | \beta, \sigma)\right)\]</span></p>
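<p>To make the mixture concrete, here is a small standalone sketch of this likelihood (written in Python purely for numerical illustration; the function name and toy data are invented and are not code from the post). It evaluates the weighted sum over candidate thresholds entirely on the log scale to avoid underflow:</p>
<pre class="python"><code>import math

def log_mixture_likelihood(y, x, c, alpha, beta, sigma):
    """log p(y | alpha, beta, sigma), mixed over candidate thresholds c
    with equal prior weight lambda_m = 1/M on each candidate."""
    def lnorm(yi, mu):
        # log density of N(mu, sigma) evaluated at yi
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (yi - mu) ** 2 / (2 * sigma ** 2)

    M = len(c)
    lp = []
    for cm in c:
        s = -math.log(M)  # log(lambda_m)
        for xi, yi in zip(x, y):
            s += lnorm(yi, alpha if xi < cm else beta)
        lp.append(s)
    # log sum_m exp(lp_m), computed stably by factoring out the max
    mx = max(lp)
    return mx + math.log(sum(math.exp(v - mx) for v in lp))

x = [-1.0, -0.5, 0.5, 1.0]  # toy antibody levels
y = [0.0, 0.0, 3.0, 3.0]    # toy outcomes: low below 0, high above
c = [-0.75, 0.0, 0.75]      # candidate thresholds
ll_true = log_mixture_likelihood(y, x, c, alpha=0.0, beta=3.0, sigma=1.0)
ll_swap = log_mixture_likelihood(y, x, c, alpha=3.0, beta=0.0, sigma=1.0)</code></pre>
<p>As expected, the log-likelihood is higher when <code>alpha</code> and <code>beta</code> line up with the data (<code>ll_true &gt; ll_swap</code>).</p>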
</div>
<div id="implmentation-in-stan" class="section level3">
<h3>Implementation in Stan</h3>
<p>Here is a translation of the model into <code>Stan</code>. The data for the model include the antibody level <span class="math inline">\(x\)</span>, the outcome <span class="math inline">\(y\)</span>, and the candidate thresholds included in the vector <span class="math inline">\(\mathbf{c}\)</span> which has length <span class="math inline">\(M\)</span>. In this example, the candidate vector is based on the <em>range</em> of observed antibody levels.</p>
<pre class="stan"><code>data {
  int<lower=1> N;   // number of observations
  real x[N];        // antibody measures
  real y[N];        // outcomes
  int<lower=1> M;   // number of candidate thresholds
  real c[M];        // candidate thresholds
}</code></pre>
<p>At the outset, equal probability will be assigned to each of the <span class="math inline">\(M\)</span> candidate thresholds, which is <span class="math inline">\(1/M\)</span>. Since Stan operates on log-probabilities, this is translated to <span class="math inline">\(\log(1/M) = -\log(M)\)</span>:</p>
<pre class="stan"><code>transformed data {
  real lambda;
  lambda = -log(M);
}</code></pre>
<p>The three parameters that define the two distributions (above and below the threshold) are <span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span>:</p>
<pre class="stan"><code>parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}</code></pre>
<p>This next block is really the implementation of the threshold model. <span class="math inline">\(\mathbf{lp}\)</span> is a vector of log probabilities, where each element represents the log of each summand in the model specified above.</p>
<pre class="stan"><code>transformed parameters {
  vector[M] lp;
  lp = rep_vector(lambda, M);
  for (m in 1:M)
    for (n in 1:N)
      lp[m] = lp[m] + normal_lpdf(y[n] | x[n] < c[m] ? alpha : beta, sigma);
}</code></pre>
<p>The expression <code>x[n] < c[m] ? alpha : beta</code> is Stan’s ternary (if-then-else) operator, so the statement is equivalent to the following pseudocode (<strong>this is not Stan code!</strong>):</p>
<pre><code>if x[n] < c[m] then
  y ~ N(alpha, sigma)
else if x[n] >= c[m] then
  y ~ N(beta, sigma)</code></pre>
<p>And finally, here is the specification of the priors and the full likelihood, which is the log of the (weighted) sum of the likelihoods across the candidate thresholds. The function <code>log_sum_exp</code> carries out this summation on the log scale across the <span class="math inline">\(M\)</span> candidate thresholds specified in the model above.</p>
<pre class="stan"><code>model {
  alpha ~ student_t(3, 0, 2.5);
  beta ~ student_t(3, 0, 2.5);
  sigma ~ exponential(1);
  target += log_sum_exp(lp);
}</code></pre>
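<p>As an aside, the log-sum-exp trick is easy to demonstrate outside of Stan. This minimal Python sketch (illustrative only, not from the post) shows why the summation has to happen on the log scale: the largest term is factored out before exponentiating, so nothing underflows:</p>
<pre class="python"><code>import math

def log_sum_exp(lp):
    """log(sum(exp(lp))), computed stably by factoring out the maximum."""
    m = max(lp)
    return m + math.log(sum(math.exp(v - m) for v in lp))

lp = [-1000.0, -1001.0, -1002.0]
naive = sum(math.exp(v) for v in lp)  # underflows to 0.0 in double precision
stable = log_sum_exp(lp)              # approximately -999.59</code></pre>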
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generated to explore this model are based on the same data definitions I used in the <a href="https://www.rdatagen.net/post/a-latent-threshold-model/" target="_blank">last post</a> to explore the MLE model.</p>
<pre class="r"><code>library(simstudy)
set.seed(87654)
d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
              dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status",
              variance = 1, dist = "normal")
dd <- genData(500, d1)</code></pre>
<p>The threshold is quite apparent here. In the right hand plot, the latent classes are revealed.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/p3.png" style="width:90.0%" /></p>
</div>
<div id="model-fitting" class="section level3">
<h3>Model fitting</h3>
<p>We use the <code>rstan</code> package to access Stan, passing along the observed antibody data, outcome data, as well as the candidate thresholds:</p>
<pre class="r"><code>library(rstan)
rt <- stanc("/.../threshold.stan")
sm <- stan_model(stanc_ret = rt, verbose = FALSE)
N <- nrow(dd)
y <- dd[, y]
x <- dd[, antibody]
c <- seq(round(min(x), 1), round(max(x), 1), by = .1)
M <- length(c)
studydata3 <- list(N = N, x = x, y = y, M = M, c = c)
fit3 <- sampling(sm, data = studydata3, iter = 3000, warmup = 500,
                 cores = 4L, chains = 4, control = list(adapt_delta = 0.8))</code></pre>
<p>The first order of business is to make sure that the MCMC algorithm sampled the parameter space in a well-behaved manner. Everything looks good here:</p>
<pre class="r"><code>library(bayesplot)
posterior <- as.array(fit3)
lp <- log_posterior(fit3)
np <- nuts_params(fit3)
color_scheme_set("mix-brightblue-gray")
mcmc_trace(posterior, pars = c("alpha", "beta", "sigma"),
           facet_args = list(nrow = 3), np = np) +
  xlab("Post-warmup iteration")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/trace3.png" style="width:80.0%" /></p>
<p>The posterior distributions of the three parameters of interest (<span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span>) are quite close to the values used in the data generation process:</p>
<pre class="r"><code>mcmc_intervals(posterior, pars = c("alpha","beta", "sigma"))</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/estimates3.png" style="width:70.0%" /></p>
</div>
<div id="the-posterior-probability-of-the-threshold" class="section level3">
<h3>The posterior probability of the threshold</h3>
<p>Even though the distributions of <span class="math inline">\(\alpha\)</span>, <span class="math inline">\(\beta\)</span>, and <span class="math inline">\(\sigma\)</span> are marginal with respect to the candidate thresholds, we may still be interested in the posterior distribution of the thresholds. An approach to estimating this is described in the <a href="https://mc-stan.org/docs/2_25/stan-users-guide/change-point-section.html#posterior-distribution-of-the-discrete-change-point" target="_blank">User’s Guide</a>. I provide a little more detail and code for generating the plot in the <a href="#addendum">addendum</a>.</p>
<p>The plot shows the log-probability for each of the candidate thresholds considered, with a red dashed line drawn at <span class="math inline">\(-0.7\)</span>, the true threshold used in the data generation process. In this case, the probability (and log-probability) peaks at this point. In fact, there is a pretty steep drop-off on either side, indicating that we can have a lot of confidence that the threshold is indeed <span class="math inline">\(-0.7\)</span>.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/threshold3.png" style="width:80.0%" /></p>
</div>
<div id="when-there-is-a-single-distribution" class="section level3">
<h3>When there is a single distribution</h3>
<p>If we update the data definitions to generate a single distribution (<em>i.e.</em>, the outcome is independent of the antibody measure), the threshold model struggles to identify a threshold, and the parameter estimates have more uncertainty.</p>
<pre class="r"><code>d1 <- updateDef(d1, changevar = "y", newformula = "0")
dd <- genData(500, d1)</code></pre>
<p>Here is a plot of the data based on the updated assumption:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/p0.png" style="width:90.0%" /></p>
<p>And here are the posterior probabilities for the parameters - now with much wider credible intervals:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/estimates0.png" style="width:70.0%" /></p>
<p>Here is the posterior distribution of thresholds, intentionally plotted to highlight the lack of distinction across the candidate thresholds:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesthreshold/threshold0.png" style="width:80.0%" /></p>
<p id="addendum">
</p>
</div>
<div id="addendum---posterior-probabilties-of-the-threshold" class="section level3">
<h3>Addendum - posterior probabilities of the threshold</h3>
<p>Here’s a little more background on how the posterior probabilities for the threshold were calculated. As a reminder, <span class="math inline">\(\textbf{c}\)</span> is a vector of candidate thresholds of length <span class="math inline">\(M\)</span>. We define a quantity <span class="math inline">\(q(c_m | data)\)</span> as</p>
<p><span class="math display">\[
q(c_m | data) = \frac{1}{R}\sum_{r=1}^R \text{exp}(lp_{rc_m})
\]</span>
where <span class="math inline">\(lp_{rc_m}\)</span> is the value of <span class="math inline">\(lp\)</span> from the <em>r</em>’th draw for threshold candidate <span class="math inline">\(c_m\)</span>. We are actually interested in <span class="math inline">\(p(c_m|data)\)</span>, which is related to <span class="math inline">\(q\)</span>:</p>
<p><span class="math display">\[
p(c_m | data) = \frac{q(c_m | data)}{\sum_{m'=1}^M q(c_{m'}|data)}
\]</span></p>
<p>The <code>R</code> code is a little bit involved, because the log-probabilities are so small that exponentiating them to recover the probabilities runs into floating point limitations. In the examples I have been using here, the log probabilities ranged from <span class="math inline">\(-4400\)</span> to <span class="math inline">\(-700\)</span>. On my device the smallest value I can meaningfully exponentiate is <span class="math inline">\(-745\)</span>; anything smaller results in a value of 0, rendering it impossible to estimate <span class="math inline">\(q\)</span>.</p>
<p>To get around this problem, I used the <code>mpfr</code> function in the <code>Rmpfr</code> package. Here is a simple example to show how to exponentiate a highly negative variable <span class="math inline">\(b\)</span>. A helper variable <span class="math inline">\(a\)</span> is specified to set the precision, which can then be used to derive the desired result, which is <span class="math inline">\(\text{exp}(b)\)</span>.</p>
<p>Everything is fine if <span class="math inline">\(b \ge -745\)</span>:</p>
<pre class="r"><code>library(Rmpfr)
b <- -745
exp(b)</code></pre>
<pre><code>## [1] 4.94e-324</code></pre>
<p>For <span class="math inline">\(b<-745\)</span>, we have floating point issues:</p>
<pre class="r"><code>b <- -746
exp(b)</code></pre>
<pre><code>## [1] 0</code></pre>
<p>So, we turn to <code>mpfr</code> to get the desired result. First, specify <span class="math inline">\(a\)</span> with the proper precision:</p>
<pre class="r"><code>(a <- mpfr(exp(-100), precBits=64))</code></pre>
<pre><code>## 1 'mpfr' number of precision 64 bits
## [1] 3.72007597602083612001e-44</code></pre>
<p>And now we can calculate <span class="math inline">\(\text{exp}(b)\)</span>:</p>
<pre class="r"><code>a^(-b/100)</code></pre>
<pre><code>## 1 'mpfr' number of precision 64 bits
## [1] 1.03828480951583225515e-324</code></pre>
<p>The code to calculate <span class="math inline">\(\text{log}(p(c_m | data))\)</span> extracts the draws of <span class="math inline">\(lp\)</span> from the sample, exponentiates, and sums to get the desired result.</p>
<pre class="r"><code>library(glue)
library(data.table)
library(ggplot2)

a <- mpfr(exp(-100), precBits = 64)
qc <- NULL
for (m in 1:M) {
  lp.i <- glue("lp[{m}]")
  le <- rstan::extract(fit3, pars = lp.i)[[1]]
  q <- a^(-le/100)
  qc[m] <- sum(q)
}
qcs <- mpfr2array(qc, dim = M)
lps <- log(qcs/sum(qcs))
dps <- data.table(c, y = as.numeric(lps))
ggplot(data = dps, aes(x = c, y = y)) +
  geom_vline(xintercept = -0.7, color = "red", lty = 3) +
  geom_line(color = "grey60") +
  geom_point(size = 1) +
  theme(panel.grid = element_blank()) +
  ylab("log(probability)") +
  xlab("threshold from low to not low") +
  scale_y_continuous(limits = c(-800, 0))</code></pre>
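<p>It is worth noting that arbitrary-precision arithmetic is one way around the underflow, but not the only one: since we only need the <em>normalized</em> probabilities, the same log-sum-exp trick used in the <code>Stan</code> model can be applied directly to the draws of <span class="math inline">\(lp\)</span>. Here is a sketch of that alternative in Python (the function name and the tiny set of draws are invented for illustration; this is not the <code>Rmpfr</code> code above):</p>
<pre class="python"><code>import math

def log_sum_exp(v):
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def threshold_log_probs(lp_draws):
    """lp_draws[m] holds the MCMC draws of lp for candidate threshold m.
    Returns log p(c_m | data) without exponentiating the raw log values."""
    # log q_m = log( (1/R) * sum_r exp(lp_rm) ) = log_sum_exp(draws) - log(R)
    log_q = [log_sum_exp(draws) - math.log(len(draws)) for draws in lp_draws]
    # log p_m = log q_m - log( sum_m' q_m' )
    z = log_sum_exp(log_q)
    return [lq - z for lq in log_q]

# three candidate thresholds, two draws each (made-up values)
lp_draws = [[-700.0, -702.0], [-710.0, -711.0], [-760.0, -765.0]]
log_p = threshold_log_probs(lp_draws)</code></pre>
<p>The resulting probabilities sum to one and, in this toy example, put nearly all of the mass on the first candidate threshold.</p>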
</div>
A latent threshold model to dichotomize a continuous predictor
https://www.rdatagen.net/post/a-latent-threshold-model/
Tue, 24 Nov 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-latent-threshold-model/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>This is the context. In the convalescent plasma pooled individual patient level meta-analysis we are conducting as part of the <a href="https://bit.ly/3nBxPXd" target="_blank">COMPILE</a> study, there is great interest in understanding the impact of antibody levels on outcomes. (I’ve described various aspects of the analysis in previous posts, most recently <a href="https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/" target="_blank">here</a>). In other words, not all convalescent plasma is equal.</p>
<p>If we had a clear measure of antibodies, we could model the relationship of these levels with the outcome of interest, such as health status as captured by the WHO 11-point scale or mortality, and call it a day. Unfortunately, at the moment, there is no single measure across the RCTs included in the meta-analysis (though that may change). Until now, the RCTs have used a range of measurement “platforms” (or technologies), which may measure different components of the convalescent plasma using different scales. Given these inconsistencies, it is challenging to build a straightforward model that simply estimates the relationship between antibody levels and clinical outcomes.</p>
<p>The study team is coalescing around the idea of comparing the outcomes of patients who received <em>low</em> levels of antibodies with patients who received <em>not low</em> levels (as well as with patients who received no antibodies). One thought (well, really my thought) is to use a model that can jointly estimate the latent threshold and, given that threshold, estimate a treatment effect. Importantly, this model would need to accommodate multiple antibody measures and their respective thresholds.</p>
<p>To tackle this problem, I have turned to a class of models called change point or threshold models. My ultimate goal is to fit a Bayesian model that can estimate threshold and effect-size parameters for any number of RCTs using any number of antibody measures. At this point we are a few steps removed from that, so in this post I’ll start with a simple case of a single RCT and a single antibody measure, and use a maximum likelihood estimation method implemented in the <code>R</code> package <a href="https://cran.r-project.org/web/packages/chngpt/index.html" target="_blank">chngpt</a> to estimate parameters from a simulated data set. In a subsequent post, I’ll implement a Bayesian version of this simple model, and perhaps in a third post, I’ll get to the larger model that incorporates more complexity.</p>
<div id="visualizing-simple-scenarios" class="section level3">
<h3>Visualizing simple scenarios</h3>
<p>Change point models appear to be most commonly used in the context of time series data where the focus is on understanding if a trend or average has shifted at a certain point in a sequence of measurements over time. In the case of COMPILE, the target would be a threshold for a continuous antibody measure across multiple patients; we are interested in measuring the average outcome for patients on either side of the threshold.</p>
<p>The following plots show three scenarios. On the left, there is no threshold; the distribution of continuous outcomes is the same across all values of the antibody measure. In the middle, there is a threshold at <span class="math inline">\(-0.7\)</span>; patients with antibody levels below <span class="math inline">\(-0.7\)</span> have a lower average outcome than patients with antibodies above <span class="math inline">\(-0.7\)</span>. On the right, the threshold is shifted to <span class="math inline">\(0.5\)</span>.</p>
<p>The key here is that the outcome is solely a function of the latent categorical status - not the actual value of the antibody level. This may be a little simplistic, because we might expect the antibody level itself to be related to the outcome based on some sort of linear or non-linear relationship rather than the dichotomous relationship we are positing here. However, if we set our sights on detecting a difference in average clinical outcomes for patients categorized as having been exposed to <em>low</em> and <em>not low</em> antibody levels rather than on understanding the full nature of their relationship, this simplification may be reasonable.</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-1-1.png" width="864" /></p>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I think if you see the data generation process, the model and assumptions might make more sense. We start with an antibody level that, for simplicity’s sake, has a standard normal distribution. In this simulation, the latent group status (i.e. <em>low</em> vs. <em>not low</em>) is not determined completely by the threshold (though it certainly could be); here, the probability that latent status is <em>not low</em> is about <span class="math inline">\(5\%\)</span> for patients with antibody levels that fall below <span class="math inline">\(-0.7\)</span>, but is about <span class="math inline">\(95\%\)</span> for patients whose levels exceed the threshold.</p>
<pre class="r"><code>library(simstudy)
set.seed(87654)
d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
              dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status",
              variance = 1, dist = "normal")
dd <- genData(500, d1)
dd</code></pre>
<pre><code>## id antibody latent_status y
## 1: 1 -1.7790 0 0.5184
## 2: 2 0.2423 1 3.2174
## 3: 3 -0.4412 1 1.8948
## 4: 4 -1.2505 0 0.9816
## 5: 5 -0.0552 1 2.9251
## ---
## 496: 496 -0.4634 1 2.7298
## 497: 497 0.6862 0 -0.0507
## 498: 498 -1.0899 0 0.9680
## 499: 499 2.3395 1 1.9540
## 500: 500 -0.4874 1 3.5238</code></pre>
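<p>The <span class="math inline">\(5\%\)</span> and <span class="math inline">\(95\%\)</span> figures follow directly from the logit specification for <code>latent_status</code>. Here is a quick arithmetic check (Python is used here only for the calculation; <code>plogis</code> mimics the R function of the same name):</p>
<pre class="python"><code>import math

def plogis(x):
    """Inverse logit, analogous to R's plogis()."""
    return 1.0 / (1.0 + math.exp(-x))

p_below = plogis(-3 + 6 * 0)  # antibody below -0.7: about 0.047, i.e. ~5%
p_above = plogis(-3 + 6 * 1)  # antibody above -0.7: about 0.953, i.e. ~95%</code></pre>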
</div>
<div id="simple-model-estimation" class="section level3">
<h3>Simple model estimation</h3>
<p>The <code>chngptm</code> function in the <code>chngpt</code> package provides an estimate of the threshold as well as the treatment effect of the antibody level lying above this latent threshold. The parameters in this simple case are recovered quite well. The fairly narrow <span class="math inline">\(95\%\)</span> confidence interval for the effect (2.2, 2.8) just misses the true value. The very narrow <span class="math inline">\(95\%\)</span> CI for the threshold (-0.73, -0.69) does include the true value.</p>
<pre class="r"><code>library(chngpt)
fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
               data = dd, type = "step", family = "gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) 0.296 0.130 0.0547 0.563 2.26e-02
## antibody>chngpt 2.520 0.139 2.2416 2.787 1.99e-73
##
## Threshold:
## est Std. Error (lower upper)
## -0.70261 0.00924 -0.72712 -0.69092</code></pre>
</div>
<div id="alternative-scenarios" class="section level3">
<h3>Alternative scenarios</h3>
<p>When there is more ambiguity in the relationship between the antibody threshold and the classification into the two latent classes of <em>low</em> and <em>not low</em>, there is more uncertainty in both the effect and threshold estimates. Furthermore, the effect size estimate is attenuated, since the prediction of the latent class is less successful.</p>
<p>In the next simulation, the true threshold remains at <span class="math inline">\(-0.7\)</span>, but the probability that a patient below the threshold actually does <em>not</em> have <em>low</em> levels of antibodies increases to about <span class="math inline">\(21\%\)</span>, while the probability that a patient above the threshold does not have <em>low</em> levels decreases to about <span class="math inline">\(79\%\)</span>. There is more uncertainty regarding the threshold, as the <span class="math inline">\(95\%\)</span> CI is (-0.81, -0.58). And the estimated effect of <span class="math inline">\(1.4 \; (1.2, 1.9)\)</span> is attenuated, with more uncertainty. Given the added uncertainty in the data generation process, these estimates are what we would expect.</p>
<pre class="r"><code>d1 <- updateDef(d1, changevar = "latent_status",
                newformula = "-1.3 + 2.6 * (antibody > -0.7)")
dd <- genData(500, d1)
fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
               data = dd, type = "step", family = "gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) 0.881 0.159 0.50 1.12 3.05e-08
## antibody>chngpt 1.439 0.173 1.17 1.85 1.09e-16
##
## Threshold:
## est Std. Error (lower upper)
## -0.6298 0.0579 -0.8083 -0.5814</code></pre>
<p>The effect size has an impact on the estimation of a threshold. At the extreme case where there is no effect, the concept of a threshold is not meaningful; we would expect there to be great uncertainty with the estimate for the threshold. As the true effect size grows, we would expect the precision of the threshold estimate to increase as well (subject to the latent class membership probabilities just described). The subsequent plot shows the point estimates and <span class="math inline">\(95\%\)</span> CIs for thresholds at different effect sizes. The true threshold is <span class="math inline">\(0.5\)</span> and effect sizes range from 0 to 2:</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>This last figure shows that the uncertainty around the effect size estimate is higher at lower levels of true effectiveness. This higher level of uncertainty in the estimated effect is driven by the higher level of uncertainty in the estimate of the threshold at lower effect sizes (as just pointed out above).</p>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="with-a-fundamentally-different-data-generating-process" class="section level3">
<h3>With a fundamentally different data generating process</h3>
<p>What happens when the underlying data process is quite different from the one we have been imagining? Is the threshold model useful? I would say “maybe not” in the case of a single antibody measurement. I alluded to this a bit earlier in the post, justifying the idea by arguing it might make more sense with multiple types of antibody measurements. We will hopefully find that out if I get to that point. Here, I briefly investigate the estimates we get from a threshold model when the outcome is linearly related to the antibody measurement, and there is in fact no threshold, as in this data set:</p>
<pre class="r"><code>d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "y", formula = "antibody", variance = 1, dist = "normal")
dd <- genData(500, d1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-11-24-a-latent-threshold-model.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>The estimated threshold is near the center of the antibody data (which in this case is close to <span class="math inline">\(0\)</span>), with a fairly narrow <span class="math inline">\(95\%\)</span> confidence interval. The effect size is essentially a comparison of the means for patients with measurements below <span class="math inline">\(0\)</span> compared to patients above <span class="math inline">\(0\)</span>. If this were the actual data generation process, it might be preferable to model the relationship directly using simple linear regression without estimating a threshold.</p>
<pre class="r"><code>fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody,
               data = dd, type = "step", family = "gaussian")
summary(fit)</code></pre>
<pre><code>## Change point model threshold.type: step
##
## Coefficients:
## est Std. Error* (lower upper) p.value*
## (Intercept) -0.972 0.162 -1.24 -0.607 2.19e-09
## antibody>chngpt 1.739 0.109 1.58 2.006 1.15e-57
##
## Threshold:
## est Std. Error (lower upper)
## -0.0713 0.2296 -0.3832 0.5170</code></pre>
</div>
Exploring the properties of a Bayesian model using high performance computing
https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/
Tue, 10 Nov 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>An obvious downside to estimating Bayesian models is that it can take a considerable amount of time merely to fit a model. And if you need to estimate the same model repeatedly, that considerable amount becomes a prohibitive amount. In this post, which is part of a series (last one <a href="https://bit.ly/31kCCDV" target="_blank">here</a>) where I’ve been describing various aspects of the Bayesian analyses we plan to conduct for the <a href="https://bit.ly/31hDwB0" target="_blank">COMPILE</a> meta-analysis of convalescent plasma RCTs, I’ll present a somewhat elaborate model to illustrate how we have addressed these computing challenges to explore the properties of these models.</p>
<p>While the concept of statistical power may not be part of the Bayesian analytic framework, there are many statisticians who would like to assess this property regardless of the modeling approach. These assessments require us to generate multiple data sets and estimate a model for each. In this case, we’ve found that each run of the MCMC algorithm required to sample from the posterior distribution of the Bayesian model can take anywhere from 7 to 15 minutes on our laptops or desktops. If we want to analyze 1,000 data sets using these methods, my laptop would need to run continuously for at least a week. And if we want to explore models under different assumptions about the data generating process or prior distributions - well, that would be impossible.</p>
<p>Fortunately, we have access to the <a href="https://bit.ly/37wJdPs" target="_blank">High Performance Computing Core at NYU Langone Health</a> (HPC), which enables us to analyze 1,000 data sets in about 90 minutes. Still pretty intensive, but clearly a huge improvement. This post describes how we adapted our simulation and modeling process to take advantage of the power and speed of the HPC.</p>
<div id="compile-study-design" class="section level3">
<h3>COMPILE Study design</h3>
<p>There are numerous randomized control trials being conducted around the world to evaluate the efficacy of antibodies in convalescent blood plasma in improving outcomes for patients who have been hospitalized with COVID. Because each trial lacks adequate sample size to allow us to draw any definitive conclusions, we have undertaken a project to pool individual level data from these various studies into a single analysis. (I described the general approach and some conceptual issues <a href="https://bit.ly/3o8o8Rr" target="_blank">here</a> and <a href="https://bit.ly/2HcXCpr" target="_blank">here</a>). The outcome is an 11-point categorical score developed by the <a href="https://bit.ly/3m7h4CI" target="_blank">WHO</a>, with 0 indicating no COVID infection and 10 indicating death. As the intensity of support increases, the scores increase. For the primary outcome, the score will be based on patient status 14 days after randomization.</p>
<p>This study is complicated by the fact that each RCT is using one of three different control conditions: (1) usual care (unblinded), (2) non-convalescent plasma, and (3) saline solution. The model conceptualizes the three control conditions as three treatments to be compared against the reference condition of convalescent plasma. The overall treatment effect will represent a quantity centered around the three control effects.</p>
<p>Because we have access to individual-level data, we will be able to adjust for important baseline characteristics that might be associated with the prognosis at day 14; these are baseline WHO-score, age, sex, and symptom duration prior to randomization. The primary analysis will adjust for these factors, and secondary analyses will go further to investigate if any of these factors modify the treatment effect. In the example in this post, I will present the model based on a secondary analysis that considers only a single baseline factor, symptom duration; we are allowing for the possibility that the treatment may be more or less effective depending on the symptom duration. (For example, patients who have been sicker longer may not respond to the treatment, while those who are treated earlier may.)</p>
</div>
<div id="model" class="section level3">
<h3>Model</h3>
<p>Here is the model that I am using (as I mentioned, the planned COMPILE analysis will be adjusting for additional baseline characteristics):</p>
<p><span class="math display">\[
\text{logit} \left(P\left(Y_{kis} \ge y\right)\right) = \tau_{yk} + \beta_s + I_{ki} \left( \gamma_{kcs} + \delta_{kc} \right), \; \; y \in \{1, \dots L-1\} \text{ with } L \text{ response levels}
\]</span>
And here are the assumptions for the <strong>prior distributions</strong>:</p>
<p><span class="math display">\[
\begin{aligned}
\tau_{yk} &\sim t_{\text{student}} \left( \text{df=} 3, \mu=0, \sigma = 5 \right), \; \; \text{monotone within } \text{site } k \\
\beta_s &\sim \text{Normal} \left( \mu=0, \sigma = 5 \right), \qquad \; \;s \in \{1,\dots, S \} \text{ for symptom duration strata}, \beta_1 = 0 \\
\gamma_{kcs} &\sim \text{Normal}\left( \gamma_{cs}, 1 \right), \qquad \qquad \;\;\;\;\;\;c \in \{0, 1, 2\} \text{ for control conditions }, \gamma_{kc1} = 0 \text{ for all } k \\
\gamma_{cs} &\sim \text{Normal}\left( \Gamma_s, 0.25 \right), \qquad \qquad \; \; \gamma_{c1} = 0 \text{ for all } c \\
\Gamma_s &\sim t_{\text{student}} \left( 3, 0, 1 \right), \qquad \qquad \qquad \Gamma_{1} = 0 \\
\delta_{kc} &\sim \text{Normal}\left( \delta_c, \eta \right)\\
\delta_c &\sim \text{Normal}\left( -\Delta, 0.5 \right) \\
\eta &\sim t_{\text{student}}\left(3, 0, 0.25 \right) \\
-\Delta &\sim t_{\text{student}} \left( 3, 0, 2.5 \right)
\end{aligned}
\]</span>
There are <span class="math inline">\(K\)</span> RCTs. The outcome for the <span class="math inline">\(i\)</span>th patient from the <span class="math inline">\(k\)</span>th trial on the <span class="math inline">\(L\)</span>-point scale at day 14 is <span class="math inline">\(Y_{ki}=y\)</span>, <span class="math inline">\(y=0,\dots,L-1\)</span> (although the COMPILE study will have <span class="math inline">\(L = 11\)</span> levels, I will be using <span class="math inline">\(L=5\)</span> to speed up estimation times a bit). <span class="math inline">\(I_{ki}\)</span> indicates the treatment assignment for subject <span class="math inline">\(i\)</span> in the <span class="math inline">\(k\)</span>th RCT, <span class="math inline">\(I_{ki} = 0\)</span> if patient <span class="math inline">\(i\)</span> received CP and <span class="math inline">\(I_{ki} = 1\)</span> if the patient was in <em>any</em> control arm. There are three control conditions <span class="math inline">\(C\)</span>: standard of care, <span class="math inline">\(C=0\)</span>; non-convalescent plasma, <span class="math inline">\(C=1\)</span>; saline/LR with coloring, <span class="math inline">\(C=2\)</span>; each RCT <span class="math inline">\(k\)</span> is attached to a specific control condition. There are also <span class="math inline">\(S=3\)</span> symptom duration strata: short duration, <span class="math inline">\(s=0\)</span>; moderate duration, <span class="math inline">\(s=1\)</span>; and long duration, <span class="math inline">\(s=2\)</span>. (COMPILE will use five symptom duration strata - again I am simplifying.)</p>
<p><span class="math inline">\(\tau_{yk}\)</span> corresponds to the <span class="math inline">\(k\)</span>th RCT’s intercept associated with level <span class="math inline">\(y\)</span>; the <span class="math inline">\(\tau\)</span>’s represent the cumulative log odds for patients in symptom duration group <span class="math inline">\(s=1\)</span> who are receiving CP treatment. Within a particular RCT, all <span class="math inline">\(\tau_{yk}\)</span>’s satisfy the monotonicity requirement for the intercepts of the proportional odds model. <span class="math inline">\(\beta_s\)</span>, <span class="math inline">\(s \in \{2, 3\}\)</span>, is the main effect of symptom duration (<span class="math inline">\(\beta_1 = 0\)</span>, where <span class="math inline">\(s=1\)</span> is the reference category). <span class="math inline">\(\gamma_{kcs}\)</span> is the moderating effect of stratum <span class="math inline">\(s\)</span> in RCT <span class="math inline">\(k\)</span>; <span class="math inline">\(\gamma_{kc1} = 0\)</span>, since <span class="math inline">\(s=1\)</span> is the reference category. <span class="math inline">\(\delta_{kc}\)</span> is the RCT-specific control effect, where RCT <span class="math inline">\(k\)</span> is using control condition <span class="math inline">\(c\)</span>.</p>
<p>Each <span class="math inline">\(\gamma_{kcs}\)</span> is normally distributed around a control type/symptom duration mean <span class="math inline">\(\gamma_{cs}\)</span>. And each <span class="math inline">\(\gamma_{cs}\)</span> is centered around a pooled mean <span class="math inline">\(\Gamma_s\)</span>. The <span class="math inline">\(\delta_{kc}\)</span>’s are assumed to be normally distributed around a control-type specific effect <span class="math inline">\(\delta_c\)</span>, with standard deviation <span class="math inline">\(\eta\)</span> that will be estimated; the <span class="math inline">\(\delta_c\)</span>’s are normally distributed around <span class="math inline">\(-\Delta\)</span>. (We take <span class="math inline">\(-\Delta\)</span> as the mean of the distribution to which <span class="math inline">\(\delta_c\)</span> belongs so that <span class="math inline">\(\exp(\Delta)\)</span> will correspond to the cumulative odds ratio for CP relative to control, rather than for control relative to CP.) (For an earlier take on these types of models, see <a href="https://bit.ly/34ila4Q" target="_blank">here</a>.)</p>
</div>
<div id="go-or-no-go" class="section level3">
<h3>Go or No-go</h3>
<p>The focus of a Bayesian analysis is the estimated posterior probability distribution of a parameter or parameters of interest, for example the log-odds ratio from a cumulative proportional odds model. At the end of an analysis, we have credible intervals, means, medians, quantiles - all concepts associated with a probability distribution.</p>
<p>A “Go/No-go” decision process like a hypothesis test is not necessarily baked into the Bayesian method. At some point, however, even if we are using a Bayesian model to inform our thinking, we might want to or have to make a decision. In this case, we might want to recommend (or not) the use of CP for patients hospitalized with COVID-19. Rather than use a hypothesis test to reject or fail to reject a null hypothesis of no effect, we can use the posterior probability to create a decision rule. In fact, this is what we have done.</p>
<p>In the proposed design of COMPILE, the CP therapy will be deemed a success if both of these criteria are met:</p>
<p><span class="math display">\[ P \left( \exp\left(-\Delta\right) < 1 \right) = P \left( OR < 1 \right) > 95\%\]</span></p>
<p><span class="math display">\[P \left( OR < 0.80 \right) > 50\%\]</span>
The first statement ensures that the posterior probability that CP provides <em>some</em> benefit is very high. If we want to be conservative, we can obviously increase the percentage threshold above <span class="math inline">\(95\%\)</span>. The second statement says that there is a decent probability that the treatment effect is clinically meaningful. Again, we can modify the target OR and/or the percentage threshold based on our desired outcome.</p>
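<p>Operationally, both criteria reduce to simple summaries of the posterior draws of <span class="math inline">\(\Delta\)</span>. Here is a minimal sketch of the decision rule, using simulated normal draws as a stand-in for actual MCMC output (the assumed mean of 0.7 mirrors the effect size used in the simulations below):</p>

```r
# stand-in posterior draws of Delta; in practice these come from the MCMC fit
set.seed(123)
Delta_draws <- rnorm(14000, mean = 0.7, sd = 0.15)
OR_draws <- exp(-Delta_draws)

p.eff <- mean(OR_draws < 1)       # estimate of P(OR < 1)
p.clinic <- mean(OR_draws < 0.80) # estimate of P(OR < 0.80)

go <- (p.eff > 0.95) & (p.clinic > 0.50)  # "Go" only if both criteria are met
```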
</div>
<div id="goals-of-the-simulation" class="section level3">
<h3>Goals of the simulation</h3>
<p>Since there are no Type I or Type II errors in the Bayesian framework, the concept of power (the probability of rejecting the null hypothesis when it is indeed false) does not logically flow from a Bayesian analysis. However, if we substitute our decision rules for a hypothesis test, we can estimate the probability (call it Bayesian power, though I imagine some Bayesians would object) that we will make a “Go” decision given a specified treatment effect. (To be truly Bayesian, we should impose some uncertainty on what that specific treatment effect is, and calculate a probability distribution of Bayesian power. But I am keeping things simpler here.)</p>
<p>Hopefully, I have provided sufficient motivation for the need to simulate data and fit multiple Bayesian models. So, let’s do that now.</p>
</div>
<div id="the-simulation" class="section level3">
<h3>The simulation</h3>
<p>I am creating four functions that will form the backbone of this simulation process: <code>s_define</code>, <code>s_generate</code>, <code>s_estimate</code>, and <code>s_extract</code>. Repeated calls to each of these functions will provide us with the data that we need to get an estimate of Bayesian power under our (static) data generating assumptions.</p>
<div id="data-definitions" class="section level4">
<h4>Data definitions</h4>
<p>The first definition table, <code>defC1</code>, sets up the RCTs. Each RCT has a specific symptom duration interaction effect <span class="math inline">\(a\)</span> and control treatment effect <span class="math inline">\(b\)</span>. To introduce a little variability in sample size, 1/3 of the studies will be larger (150 patients), and 2/3 will be smaller (75 patients).</p>
<p>The remaining tables, <code>defC2</code>, <code>defS</code>, and <code>defC3</code>, define patient-level data. <code>defC2</code> adds the control group indicator (0 = CP, 1 = standard care, 2 = non-convalescent plasma, 3 = saline) and the symptom duration stratum. <code>defS</code> defines the interaction effect conditional on the stratum. <code>defC3</code> defines the ordinal categorical outcome.</p>
<pre class="r"><code>s_define <- function() {
  
  defC1 <- defDataAdd(varname = "a", formula = 0, variance = .005, dist = "normal")
  defC1 <- defDataAdd(defC1, varname = "b", formula = 0, variance = .01, dist = "normal")
  defC1 <- defDataAdd(defC1, varname = "size", formula = "75+75*large", dist = "nonrandom")
  
  defC2 <- defDataAdd(varname = "C_rv", formula = "C * control", dist = "nonrandom")
  defC2 <- defDataAdd(defC2, varname = "ss", formula = "1/3;1/3;1/3", 
    dist = "categorical")
  
  defS <- defCondition(
    condition = "ss==1",
    formula = 0,
    dist = "nonrandom")
  
  defS <- defCondition(defS,
    condition = "ss==2",
    formula = "(0.09 + a) * (C_rv==1) + (0.10 + a) * (C_rv==2) + (0.11 + a) * (C_rv==3)",
    dist = "nonrandom")
  
  defS <- defCondition(defS,
    condition = "ss==3",
    formula = "(0.19 + a) * (C_rv==1) + (0.20 + a) * (C_rv==2) + (0.21 + a) * (C_rv==3)",
    dist = "nonrandom")
  
  defC3 <- defDataAdd(
    varname = "z",
    formula = "0.1*(ss-1)+z_ss+(0.6+b)*(C_rv==1)+(0.7+b)*(C_rv==2)+(0.8+b)*(C_rv==3)",
    dist = "nonrandom")
  
  list(defC1 = defC1, defC2 = defC2, defS = defS, defC3 = defC3)
}</code></pre>
</div>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>The data generation process draws on the definition tables to create an instance of an RCT data base. This process includes a function <code>genBaseProbs</code> that I described <a href="https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/" target="_blank">previously</a>.</p>
<pre class="r"><code>s_generate <- function(deflist, nsites) {
  
  genBaseProbs <- function(n, base, similarity, digits = 2) {
    n_levels <- length(base)
    x <- gtools::rdirichlet(n, similarity * base)
    x <- round(floor(x * 1e8) / 1e8, digits)
    xpart <- x[, 1:(n_levels - 1)]
    partsum <- apply(xpart, 1, sum)
    x[, n_levels] <- 1 - partsum
    return(x)
  }
  
  basestudy <- genBaseProbs(
    n = nsites,
    base = c(.10, .35, .25, .20, .10),
    similarity = 100)
  
  dstudy <- genData(nsites, id = "study")
  dstudy <- trtAssign(dstudy, nTrt = 3, grpName = "C")
  dstudy <- trtAssign(dstudy, nTrt = 2, strata = "C", grpName = "large", ratio = c(2, 1))
  dstudy <- addColumns(deflist[['defC1']], dstudy)
  
  dind <- genCluster(dstudy, "study", numIndsVar = "size", "id")
  dind <- trtAssign(dind, strata = "study", grpName = "control")
  dind <- addColumns(deflist[['defC2']], dind)
  dind <- addCondition(deflist[["defS"]], dind, newvar = "z_ss")
  dind <- addColumns(deflist[['defC3']], dind)
  
  dl <- lapply(1:nsites, function(i) {
    b <- basestudy[i, ]
    dx <- dind[study == i]
    genOrdCat(dx, adjVar = "z", b, catVar = "ordY")
  })
  
  rbindlist(dl)[]
}</code></pre>
</div>
<div id="model-estimation" class="section level4">
<h4>Model estimation</h4>
<p>The estimation involves creating a data set for <code>Stan</code> and sampling from the Bayesian model. The <code>Stan</code> model is included in the addendum.</p>
<pre class="r"><code>s_estimate <- function(dd, s_model) {
  
  N <- nrow(dd)                          ## number of observations
  L <- dd[, length(unique(ordY))]        ## number of levels of outcome
  K <- dd[, length(unique(study))]       ## number of studies
  
  y <- as.numeric(dd$ordY)               ## individual outcome
  kk <- dd$study                         ## study for individual
  ctrl <- dd$control                     ## treatment arm for individual
  cc <- dd[, .N, keyby = .(study, C)]$C  ## specific control arm for study
  ss <- dd$ss
  x <- model.matrix(ordY ~ factor(ss), data = dd)[, -1]
  
  studydata <- list(N = N, L = L, K = K, y = y, kk = kk, ctrl = ctrl, cc = cc, ss = ss, x = x)
  
  fit <- sampling(s_model, data = studydata, iter = 4000, warmup = 500,
    cores = 4L, chains = 4, control = list(adapt_delta = 0.8))
  
  fit
}</code></pre>
</div>
<div id="estimate-extraction" class="section level4">
<h4>Estimate extraction</h4>
<p>The last step is the extraction of summary data from the posterior probability distributions. I am collecting quantiles of the key parameters, including <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(OR = \exp(-\Delta)\)</span>. For the Bayesian power analysis, I am estimating the probability of falling below the two thresholds for each data set. And finally, I want to get a sense of the quality of each estimation process by recovering the number of divergent transitions that resulted from the MCMC algorithm (more on that <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">here</a>).</p>
<pre class="r"><code>s_extract <- function(iternum, mcmc_res) {
  
  posterior <- as.array(mcmc_res)
  
  x <- summary(
    mcmc_res,
    pars = c("Delta", "delta", "Gamma", "beta", "alpha", "OR"),
    probs = c(0.025, 0.5, 0.975)
  )
  
  dpars <- data.table(iternum = iternum, par = rownames(x$summary), x$summary)
  
  p.eff <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 1)
  p.clinic <- mean(rstan::extract(mcmc_res, pars = "OR")[[1]] < 0.8)
  dp <- data.table(iternum = iternum, p.eff = p.eff, p.clinic = p.clinic)
  
  sparams <- get_sampler_params(mcmc_res, inc_warmup = FALSE)
  n_divergent <- sum(sapply(sparams, function(x) sum(x[, 'divergent__'])))
  ddiv <- data.table(iternum, n_divergent)
  
  list(ddiv = ddiv, dpars = dpars, dp = dp)
}</code></pre>
</div>
<div id="replication" class="section level4">
<h4>Replication</h4>
<p>Now we want to put all these pieces together and repeatedly execute those four functions and save the results from each. I’ve <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/" target="_blank">described</a> using <code>lapply</code> to calculate power in a much more traditional setting. We’re going to take the same approach here, except on steroids, replacing <code>lapply</code> not with <code>mclapply</code>, the parallel version, but with <code>Slurm_lapply</code>, which is a function in the <code>slurmR</code> package.</p>
<p><a href="https://slurm.schedmd.com/documentation.html" target="_blank">Slurm</a> (Simple Linux Utility for Resource Management) is an HPC cluster job scheduler. <a href="https://uscbiostats.github.io/slurmR/" target="_blank">slurmR</a> is a wrapper that mimics many of the R <code>parallel</code> package functions, but in a Slurm environment. The strategy here is to define a meta-function (<code>iteration</code>) that itself calls the four functions already described, and then to call that meta-function repeatedly. <code>Slurm_lapply</code> does exactly that, and rather than allocating the iterations to different <em>cores</em> on a computer as <code>mclapply</code> does, it allocates the iterations to different <em>nodes</em> on the HPC, using what is technically called a <em>job array</em>. Each node is essentially its own computer. In addition, each node has multiple cores, so we can run the different MCMC chains in parallel within a node; we have parallel processes within a parallel process. I have access to 100 nodes at any one time, though I find I don’t get much performance improvement if I go over 90, so that is what I do here. Within each node, I am using 4 cores. I am running 1,980 iterations, so that is 22 iterations per node. As I mentioned earlier, all of this runs in about an hour and a half.</p>
<p>The following code includes the “meta-function” <code>iteration</code>, the compilation of the <code>Stan</code> model (which only needs to be done once, thankfully), the <code>Slurm_lapply</code> call, and the <strong>Slurm</strong> batch code that I need to execute to get the whole process started on the HPC, which is called Big Purple here at NYU. (All of the R code goes into a single <code>.R</code> file, the batch code is in a <code>.slurm</code> file, and the Stan code is in its own <code>.stan</code> file.)</p>
<pre class="r"><code>iteration <- function(iternum, s_model, nsites) {
  
  s_defs <- s_define()
  s_dd <- s_generate(s_defs, nsites = nsites)
  s_est <- s_estimate(s_dd, s_model)
  s_res <- s_extract(iternum, s_est)
  
  return(s_res)
}</code></pre>
<pre class="r"><code>library(simstudy)
library(rstan)
library(data.table)
library(slurmR)

rt <- stanc("/.../r/freq_bayes.stan")
sm <- stan_model(stanc_ret = rt, verbose = FALSE)

job <- Slurm_lapply(
  X = 1:1980,
  iteration,
  s_model = sm,
  nsites = 9,
  njobs = 90,
  mc.cores = 4,
  tmp_path = "/.../scratch",
  overwrite = TRUE,
  job_name = "i_fb",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_define", "s_generate", "s_estimate", "s_extract"),
  plan = "wait")

job
res <- Slurm_collect(job)

diverge <- rbindlist(lapply(res, function(l) l[["ddiv"]]))
ests <- rbindlist(lapply(res, function(l) l[["dpars"]]))
probs <- rbindlist(lapply(res, function(l) l[["dp"]]))

save(diverge, ests, probs, file = "/.../data/freq_bayes.rda")</code></pre>
<pre><code>#!/bin/bash
#SBATCH --job-name=fb_parent
#SBATCH --mail-type=END,FAIL # send email if the job ends or fails
#SBATCH --mail-user=keith.goldfeld@nyulangone.org
#SBATCH --partition=cpu_short
#SBATCH --time=3:00:00 # Time limit hrs:min:sec
#SBATCH --output=fb.out # Standard output and error log
module load r/3.6.3
cd /.../r
Rscript --vanilla fb.R</code></pre>
</div>
<div id="results" class="section level4">
<h4>Results</h4>
<p>Each of the three extracted data tables is combined across simulations, and the results are saved to an <code>.rda</code> file, which can be loaded locally in R and summarized. In this case, we are particularly interested in the Bayesian power estimate, which is the proportion of data sets that would result in a “go” decision (a recommendation to strongly consider using the intervention).</p>
<p>However, before we consider that, we should first get a rough idea about how many replications had <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">divergence issues</a>, which we extracted into the <code>diverge</code> data table. For each replication, we used four chains of 3,500 draws each (after the 500 warm-up samples), accounting for a total of 14,000 draws. Here is the proportion of replications with at least one divergent transition:</p>
<pre class="r"><code>load("DataBayesCOMPILE/freq_bayes.rda")
diverge[, mean(n_divergent > 0)]</code></pre>
<pre><code>## [1] 0.102</code></pre>
<p>While 10% of replications with at least 1 divergent transition might seem a little high, we can take some comfort from the fact that almost all replications had fewer than 35 divergent transitions (0.25% of the draws):</p>
<pre class="r"><code>diverge[, mean(n_divergent < 35)]</code></pre>
<pre><code>## [1] 0.985</code></pre>
<p>To get a general sense of how well our model is working, we can plot the distribution of posterior medians. In particular, this will allow us to assess how well the model is recovering the values used in the data generating process. In this case, I am excluding the 29 replications with 35 or more divergent transitions:</p>
<p><img src="https://www.rdatagen.net/post/2020-11-10-a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models.en_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Finally, we are ready to report the estimated Bayesian power (again, using the replications with limited number of divergent chains) and show the distribution of probabilities.</p>
<pre class="r"><code>probs_d <- merge(probs, diverge, by = "iternum")[n_divergent < 35]
probs_d[, mean(p.eff > 0.95 & p.clinic > 0.50)]</code></pre>
<pre><code>## [1] 0.726</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-11-10-a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>So, given an actual effect of <span class="math inline">\(OR=\exp(-0.70) = 0.50\)</span>, we would reach a “Go” decision with 73% probability. However, a single estimate of power based on one effect size is a bit incomplete; it would be preferable to assess power under numerous scenarios of effect sizes, and perhaps prior distribution assumptions, to get a more complete picture. And if you have access to an HPC, this may actually be something you can do in a realistic period of time.</p>
</div>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The <code>stan</code> implementation of the model described at the outset actually differs from that model in two key ways. First, there is a parameter <span class="math inline">\(\alpha\)</span> in the outcome model, which represents an overall intercept across all studies. Ideally, we wouldn’t need to include this parameter, since we want to fix it at zero, but the model behaves very poorly without it. We do include it, but with a highly restrictive prior that constrains it to be very close to zero. The second difference is that the model uses a non-central parameterization with standard normal priors - this alleviates issues related to divergent transitions, which I described in a <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">previous post</a>.</p>
<pre class="stan"><code>data {
  int<lower=0> N;                // number of observations
  int<lower=2> L;                // number of WHO categories
  int<lower=1> K;                // number of studies
  int<lower=1,upper=L> y[N];     // vector of categorical outcomes
  int<lower=1,upper=K> kk[N];    // site for individual
  int<lower=0,upper=1> ctrl[N];  // treatment or control
  int<lower=1,upper=3> cc[K];    // specific control for site
  int<lower=1,upper=3> ss[N];    // strata
  row_vector[2] x[N];            // strata indicators N x 2 matrix
}

parameters {
  real Delta;              // overall control effect
  vector[2] Gamma;         // overall strata effect
  real alpha;              // overall intercept for treatment
  ordered[L-1] tau[K];     // cut-points for cumulative odds model (K X [L-1] matrix)
  real<lower=0> eta_0;     // sd of delta_k (around delta)
  
  // non-central parameterization
  vector[K] z_ran_rx;      // site-specific effect
  vector[2] z_phi[K];      // K X 2 matrix
  vector[3] z_delta;
  vector[2] z_beta;
  vector[2] z_gamma[3];    // 3 X 2 matrix
}

transformed parameters {
  vector[3] delta;         // control-specific effect
  vector[K] delta_k;       // site-specific treatment effect
  vector[2] gamma[3];      // control-specific duration strata effect (3 X 2 matrix)
  vector[2] beta;          // covariate estimates of ss
  vector[2] gamma_k[K];    // site-specific duration strata effect (K X 2 matrix)
  vector[N] yhat;
  
  delta = 0.5 * z_delta + Delta; // was 0.1
  beta = 5 * z_beta;
  
  for (c in 1:3)
    gamma[c] = 0.25 * z_gamma[c] + Gamma;
  
  for (k in 1:K){
    delta_k[k] = eta_0 * z_ran_rx[k] + delta[cc[k]];
  }
  
  for (k in 1:K)
    gamma_k[k] = 1 * z_phi[k] + gamma[cc[k]];
  
  for (i in 1:N)
    yhat[i] = alpha + x[i] * beta + ctrl[i] * (delta_k[kk[i]] + x[i]*gamma_k[kk[i]]);
}

model {
  
  // priors
  z_ran_rx ~ std_normal();
  z_delta ~ std_normal();
  z_beta ~ std_normal();
  alpha ~ normal(0, 0.25);
  eta_0 ~ student_t(3, 0, 0.25);
  Delta ~ student_t(3, 0, 2.5);
  Gamma ~ student_t(3, 0, 1);
  
  for (c in 1:3)
    z_gamma[c] ~ std_normal();
  
  for (k in 1:K)
    z_phi[k] ~ std_normal();
  
  for (k in 1:K)
    tau[k] ~ student_t(3, 0, 5);
  
  // outcome model
  for (i in 1:N)
    y[i] ~ ordered_logistic(yhat[i], tau[kk[i]]);
}

generated quantities {
  real OR;
  OR = exp(-Delta);
}</code></pre>
</div>
A refined brute force method to inform simulation of ordinal response data
https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/
Tue, 27 Oct 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/
<p>Francisco, a researcher from Spain, reached out to me with a challenge. He is interested in exploring various models that estimate correlation across multiple responses to survey questions. This is the context:</p>
<ul>
<li>He doesn’t have access to actual data, so to explore analytic methods he needs to simulate responses.</li>
<li>It would be ideal if the simulated data reflect the properties of real-world responses, some of which can be gleaned from the literature.</li>
<li>The studies he’s found report only means and standard deviations of the ordinal data, along with the correlation matrices, <em>but not probability distributions of the responses</em>.</li>
<li>He’s considering <code>simstudy</code> for his simulations, but the function <code>genOrdCat</code> requires a set of probabilities for each response measure; it doesn’t seem like <code>simstudy</code> will be helpful here.</li>
</ul>
<p>Ultimately, we needed to figure out whether we could use the empirical means and standard deviations to derive probabilities that would yield those same means and standard deviations when the data are simulated. I thought about this for a bit, and came up with a bit of a work-around; the approach seems to work decently and doesn’t require any outrageous assumptions.</p>
<p>I might have kept this between the two of us, but in the process of looking more closely at my solution, I generated a plot that was so beautiful and interesting that I needed to post it. And since I am posting the image, I thought I might as well go ahead and describe the solution in case anyone else might find it useful. But first, the plot:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<div id="a-little-more-detail" class="section level3">
<h3>A little more detail</h3>
<p>In the simplest scenario, we want to simulate responses from a single survey question with responses ranging from 1 to 7, where 1 might signify <em>totally disagree</em> and 7 would mean <em>totally agree</em>, with gradations in between. Responses collected from a population will be distributed across the seven categories, and the proportion of responses that fall within each category represents the probability of a response.</p>
<p>To inform the simulation, we have a journal article that reports only a mean and standard deviation from responses to that same question collected in an earlier study. The idea is to find the probabilities for the possible responses that correspond to those observed means and standard deviations. That is, how do we go from the mean and standard deviation to a set of probabilities?</p>
<p>The reverse - going from known probabilities to a mean response and standard deviation - is much easier: we just calculate the weighted mean and weighted standard deviation, where the weights are the probabilities.</p>
<p>For example, say the probability distribution of the seven categorical responses is 21%, 20%, 18%, 15%, 13%, 9%, and 4% responding 1, 2, … , and 7, respectively, and represented by this histogram:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Under this distribution the weighted mean and standard deviation are 3.2 and 1.8, respectively:</p>
<pre class="r"><code>weighted.mean(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))</code></pre>
<pre><code>## [1] 3.2</code></pre>
<pre class="r"><code>weighted.sd(x = 1:7, w = c(.21, .20, .18, .15, .13, .09, .04))</code></pre>
<pre><code>## [1] 1.8</code></pre>
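<p>One small caveat: <code>weighted.mean</code> is part of base <code>R</code>, but <code>weighted.sd</code> is not. The calculations here assume a small helper function along these lines - a sketch that takes the square root of the probability-weighted variance:</p>

```r
# weighted.sd is not in base R; one plausible definition:
weighted.sd <- function(x, w) {
  mu <- weighted.mean(x, w)            # probability-weighted mean
  sqrt(sum(w * (x - mu)^2) / sum(w))   # normalize in case weights don't sum to exactly 1
}
```

<p>With the probabilities above, this returns roughly 1.76, which displays as 1.8 under the rounded print settings apparently used here.</p>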
</div>
<div id="the-brute-force-approach" class="section level3">
<h3>The brute force approach</h3>
<p>My first thought about how to use <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> was simple, if a bit crude. Generate a slew of probability distributions (like a million or so) and calculate the weighted mean and standard deviation for each one. I would then look for the probabilities that yielded values close to my target (i.e., those that had been reported in the literature).</p>
<p>There are a couple of drawbacks to this approach. First, it is not particularly systematic; since we are generating the probabilities randomly, even with large numbers we are not guaranteed to generate combinations that reflect our targets. Second, there is no reason to think that the randomly generated distributions will look like the true distribution. And third, there is no reason to think that, even if we do find a match, the distribution is unique.</p>
<p>I actually went ahead and implemented this approach and found two distributions that also yield <span class="math inline">\(\mu\)</span> = 3.2 and <span class="math inline">\(\sigma\)</span> = 1.8 (truth be told, I did this part first and then found the distribution above using the method I will describe in a second):</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Here are the target <span class="math inline">\(\mu\)</span>’s and <span class="math inline">\(\sigma\)</span>’s for the distributions on the left and right:</p>
<pre class="r"><code>p_left <- c(0.218, 0.174, 0.170, 0.206, 0.134, 0.022, 0.077)
c(weighted.mean(1:7, p_left), weighted.sd(1:7, p_left))</code></pre>
<pre><code>## [1] 3.2 1.8</code></pre>
<pre class="r"><code>p_right <- c(0.185, 0.185, 0.247, 0.217, 0.011, 0.062, 0.092)
c(weighted.mean(1:7, p_right), weighted.sd(1:7, p_right))</code></pre>
<pre><code>## [1] 3.2 1.8</code></pre>
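<p>For completeness, a sketch of what this kind of random search might look like (hypothetical - not my exact implementation). Normalizing vectors of exponential draws is equivalent to sampling from a Dirichlet(1, …, 1), i.e., uniformly from the simplex of probability vectors; we then keep any draw whose weighted mean and standard deviation round to the targets:</p>

```r
# square root of the probability-weighted variance (helper assumed from above)
weighted.sd <- function(x, w) {
  mu <- weighted.mean(x, w)
  sqrt(sum(w * (x - mu)^2) / sum(w))
}

set.seed(271)

# 100,000 random probability vectors: normalized exponential draws are
# a Dirichlet(1, ..., 1) sample, uniform over the simplex
e <- matrix(rexp(100000 * 7), ncol = 7)
P <- e / rowSums(e)

mus <- apply(P, 1, function(p) weighted.mean(1:7, p))
sds <- apply(P, 1, function(p) weighted.sd(1:7, p))

# keep candidates whose (mu, sigma) round to the targets
hits <- P[round(mus, 1) == 3.2 & round(sds, 1) == 1.8, , drop = FALSE]
```

<p>Each row of <code>hits</code> is a candidate distribution; the variety among the matches illustrates the non-uniqueness problem noted above.</p>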
</div>
<div id="drawing-on-the-beta-distribution" class="section level3">
<h3>Drawing on the <em>beta</em> distribution</h3>
<p>Thinking about probabilities always draws me to the <em>beta</em> family of distributions, continuous distributions that range from 0 to 1. These distributions are parameterized with two shape values, often referred to as <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. Here are a few probability density functions (pdf’s) for <span class="math inline">\((a,b)\)</span> pairs of (1, 1.6) in yellow, (2, 4) in red, and (2, 2) in blue:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>I had an idea that generating different pdf’s based on different values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> might provide a more systematic way of generating probabilities. If we carve the pdf into <span class="math inline">\(K\)</span> sections (where <span class="math inline">\(K\)</span> is the number of responses, in our case 7), then the area under the pdf in the <span class="math inline">\(k\)</span>th slice could provide the probability for the <span class="math inline">\(k\)</span>th response. Since each pdf is unique (determined by specific values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>), this would ensure different (i.e. unique) sets of probabilities to search through.</p>
<p>Using the example from above where <span class="math inline">\(a\)</span> = 2 and <span class="math inline">\(b\)</span> = 4, here is how the slices look based on the seven categories:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>The cumulative probability at each slice <span class="math inline">\(x \in \{1, ..., 7\}\)</span> is <span class="math inline">\(P(X < x/7)\)</span>, and can be calculated in <code>R</code> with the function <code>pbeta</code>:</p>
<pre class="r"><code>z <- pbeta((1:7)/7, 2, 4)
z</code></pre>
<pre><code>## [1] 0.15 0.44 0.71 0.89 0.97 1.00 1.00</code></pre>
<p>The probability for each category is <span class="math inline">\(P(X = x) = P(X < x/7) - P(X < (x-1)/7)\)</span>, and is calculated easily:</p>
<pre class="r"><code>p <- z - c(0, z[-7])
p</code></pre>
<pre><code>## [1] 0.1518 0.2904 0.2684 0.1786 0.0851 0.0239 0.0018</code></pre>
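<p>Equivalently, the successive differences can be computed with <code>diff</code> after prepending a zero, which generalizes cleanly to any number of categories:</p>

```r
z <- pbeta((1:7)/7, 2, 4)  # cumulative probabilities at the slice boundaries
p <- diff(c(0, z))         # successive differences give the category probabilities
round(p, 4)
```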
<p>This is the transformed probability distribution from the continuous <em>beta</em> scale to the discrete categorical scale:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
<p>And finally here are <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> associated with these values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>:</p>
<pre class="r"><code>c(weighted.mean(1:7, p), weighted.sd(1:7, p))</code></pre>
<pre><code>## [1] 2.8 1.3</code></pre>
</div>
<div id="brute-force-refined" class="section level3">
<h3>Brute force, refined</h3>
<p>If we create a grid of <span class="math inline">\((a, b)\)</span> values, there will be an associated, and unique, set of probabilities for each pair, derived from slicing the pdf into <span class="math inline">\(K\)</span> sections. And for each of these sets of probabilities, we can calculate the mean and standard deviation. We then find the <span class="math inline">\((\mu, \sigma)\)</span> pair that is closest to our target. While this idea is not that much better than the brute force approach suggested above, at least it is systematic. If we do it in two steps, first by searching for the general region and then zooming in to find a specific set of probabilities, we can really speed things up and use less memory.</p>
<p>Is limiting the search to <em>beta</em>-based distributions justifiable? It might depend on the nature of responses in a particular case, but it does seem reasonable; most importantly, it assures fairly well-behaved distributions that could plausibly reflect a wide range of response patterns. Barring any additional information about the distributions, then, I would have no qualms using this approach. (If it turns out that this is a common enough problem, I would even consider implementing the algorithm as a <code>simstudy</code> function.)</p>
<p>Now, it is time to reveal the secret of the plot (if you haven’t figured it out already). Each point is just the <span class="math inline">\((\mu, \sigma)\)</span> pair generated by a specific <span class="math inline">\((a, b)\)</span> pair. Here is the code, implementing the described algorithm:</p>
<pre class="r"><code>get_abp <- function(a, b, size) {
  x <- 1:size
  z <- pbeta((1:size)/size, a, b)
  p <- z - c(0, z[-size])

  sigma <- weighted.sd(x, p)
  mu <- weighted.mean(x, p)

  data.table(a, b, mu, sigma, t(p))
}

get_p <- function(a, b, n) {
  ab <- asplit(expand.grid(a = a, b = b), 1)
  rbindlist(lapply(ab, function(x) get_abp(x[1], x[2], n)))
}

a <- seq(.1, 25, .1)
b <- seq(.1, 25, .1)

ab_res <- get_p(a, b, 7)</code></pre>
<p>We can fill in the plot with more points by increasing the range of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> that we search, but creating such a huge table of look-up values is time consuming and starts to eat up memory. In any case, there is no need, because we will refine the search by zooming in on the area closest to our target.</p>
<p>Here is the plot again, based on <span class="math inline">\(a\)</span>’s and <span class="math inline">\(b\)</span>’s ranging from 0.1 to 25, with the superimposed target pair <span class="math inline">\((\mu = 3.2, \sigma = 1.8)\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>To zoom in, we first find the point in the grid that is closest to our target (based on Euclidean distance). We then define a finer grid around that point and re-search for the closest point. We do have to be careful that we do not search for invalid values of <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> (i.e., <span class="math inline">\(a \le 0\)</span> or <span class="math inline">\(b \le 0\)</span>). Once we find our point, we have the associated probabilities:</p>
<pre class="r"><code>t_mu = 3.2
t_s = 1.8

ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
close_point <- ab_res[distance == min(distance), .(a, b, distance)]

a_zoom <- with(close_point, seq(a - .25, a + .25, length = 75))
b_zoom <- with(close_point, seq(b - .25, b + .25, length = 75))

a_zoom <- a_zoom[a_zoom > 0]
b_zoom <- b_zoom[b_zoom > 0]

res_zoom <- get_p(a_zoom, b_zoom, 7)</code></pre>
<p>Here is the new search region:</p>
<p><img src="https://www.rdatagen.net/post/2020-10-27-can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>And the selection of the point:</p>
<pre class="r"><code>res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
res_zoom[distance == min(distance)]</code></pre>
<pre><code>## a b mu sigma V1 V2 V3 V4 V5 V6 V7 distance
## 1: 0.97 1.6 3.2 1.8 0.22 0.2 0.17 0.15 0.12 0.09 0.046 0.0021</code></pre>
</div>
<div id="applying-the-beta-search-to-a-bigger-problem" class="section level3">
<h3>Applying the <em>beta</em>-search to a bigger problem</h3>
<p>To conclude, I’ll finish with Francisco’s more ambitious goal of simulating correlated responses to multiple questions. In this case, we will assume four questions, all with responses ranging from 1 to 7. The target (<span class="math inline">\(\mu, \sigma\)</span>) pairs taken from the (hypothetical) journal article are:</p>
<pre class="r"><code>targets <- list(c(2.4, 0.8), c(4.1, 1.2), c(3.4, 1.5), c(5.8, 0.8))</code></pre>
<p>The correlation matrix taken from this same article is:</p>
<pre class="r"><code>corMat <- matrix(c(
  1.00, 0.09, 0.11, 0.05,
  0.09, 1.00, 0.35, 0.16,
  0.11, 0.35, 1.00, 0.13,
  0.05, 0.16, 0.13, 1.00), nrow = 4, ncol = 4)</code></pre>
<p>The <code>get_target_prob</code> function implements the search algorithm described above:</p>
<pre class="r"><code>get_target_prob <- function(t_mu, t_s, ab_res) {
  ab_res[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]
  close_point <- ab_res[distance == min(distance), .(a, b, distance)]

  a_zoom <- with(close_point, seq(a - .25, a + .25, length = 75))
  b_zoom <- with(close_point, seq(b - .25, b + .25, length = 75))

  a_zoom <- a_zoom[a_zoom > 0]
  b_zoom <- b_zoom[b_zoom > 0]

  res_zoom <- get_p(a_zoom, b_zoom, 7)
  res_zoom[, distance := sqrt((mu - t_mu)^2 + (sigma - t_s)^2)]

  baseprobs <- as.vector(res_zoom[distance == min(distance), paste0("V", 1:7)], "double")
  baseprobs
}</code></pre>
<p>Calling the function conducts the search and provides probabilities for each question:</p>
<pre class="r"><code>probs <- lapply(targets, function(x) get_target_prob(x[1], x[2], ab_res))
(probs <- do.call(rbind, probs))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.1e-01 4.8e-01 0.3334 0.077 0.0059 8.8e-05 3.7e-08
## [2,] 7.9e-03 8.5e-02 0.2232 0.306 0.2537 1.1e-01 1.3e-02
## [3,] 1.0e-01 2.1e-01 0.2345 0.209 0.1502 7.9e-02 1.7e-02
## [4,] 2.5e-08 5.4e-05 0.0036 0.051 0.2638 5.0e-01 1.8e-01</code></pre>
<p>At least in theory, Francisco can now conduct his simulation study. In this case, I am generating a huge sample size to minimize sampling variation with the hope that we can recover the means, standard deviations and correlations, which, of course, we do:</p>
<pre class="r"><code>d_ind <- genData(100000)

dx <- genOrdCat(d_ind, adjVar = NULL, baseprobs = probs, catVar = "y",
  corMatrix = corMat, asFactor = FALSE)</code></pre>
<pre class="r"><code>apply(as.matrix(dx[, -1]), 2, mean)</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## 2.4 4.1 3.4 5.8</code></pre>
<pre class="r"><code>apply(as.matrix(dx[, -1]), 2, sd)</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## 0.8 1.2 1.5 0.8</code></pre>
<pre class="r"><code>cor(as.matrix(dx[, -1]))</code></pre>
<pre><code>## grp1 grp2 grp3 grp4
## grp1 1.000 0.08 0.099 0.043
## grp2 0.080 1.00 0.331 0.142
## grp3 0.099 0.33 1.000 0.110
## grp4 0.043 0.14 0.110 1.000</code></pre>
<p>In the end, Francisco seemed to be satisfied with the solution - at least satisfied enough to go to the trouble to have a bottle of wine sent to me in New York City, which was definitely above and beyond. While my wife and I will certainly enjoy the wine - and look forward to being able to travel again so maybe we can enjoy a glass in person - seeing that image emerge from a <em>beta</em>-distribution was really all I needed. Salud.</p>
</div>
simstudy just got a little more dynamic: version 0.2.1
https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/
Tue, 13 Oct 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p><code>simstudy</code> version 0.2.1 has just been submitted to <a href="https://cran.rstudio.com/web/packages/simstudy/" target="_blank">CRAN</a>. Along with this release, the big news is that I’ve been joined by Jacob Wujciak-Jens as a co-author of the package. He initially reached out to me from Germany with some suggestions for improvements, we had a little back and forth, and now here we are. He has substantially reworked the underbelly of <code>simstudy</code>, making the package much easier to maintain, and positioning it for much easier extension. And he implemented an entire system of formalized tests using <a href="https://testthat.r-lib.org/" target="_blank">testthat</a> and <a href="https://cran.r-project.org/web/packages/hedgehog/vignettes/hedgehog.html" target="_blank">hedgehog</a>; that was always my intention, but I never had the wherewithal to pull it off, and Jacob has done that. But, most importantly, it is much more fun to collaborate on this project than to toil away on my own.</p>
<p>You readers, though, are probably more interested in the changes that, as a user, you will notice. There are a number of bug fixes (hopefully you never encountered those, but I know some of you have, because you have pointed them out to me) and improved documentation, including some new vignettes. There is even a nice new <a href="https://kgoldfeld.github.io/simstudy/index.html" target="_blank">website</a> that is created with the help of <a href="https://pkgdown.r-lib.org/" target="_blank">pkgdown</a>.</p>
<p>The most exciting extension of this new version is the ability to modify data definitions on the fly using externally defined variables. Often, we’d like to explore data generation and modeling under different scenarios. For example, we might want to understand the operating characteristics of a model given different variance or other parametric assumptions. There was already some functionality built into <code>simstudy</code> to facilitate this type of dynamic exploration, with <code>updateDef</code> and <code>updateDefAdd</code>, that allows users to edit lines of existing data definition tables. Now, there is an additional and, I think, more powerful mechanism - called <em>double-dot</em> reference - to access variables that do not already exist in a defined data set or data definition.</p>
<div id="double-dot-external-variable-reference" class="section level3">
<h3>Double-dot external variable reference</h3>
<p>It may be useful to think of an external reference variable as a type of hyperparameter of the data generation process. The reference is made directly in the formula itself, using a double-dot (“..”) notation before the variable name.</p>
<p>Here is a simple example:</p>
<pre class="r"><code>library(simstudy)

def <- defData(varname = "x", formula = 0,
  variance = 5, dist = "normal")
def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x",
  variance = "..sigma2", dist = "normal")

def</code></pre>
<pre><code>## varname formula variance dist link
## 1: x 0 5 normal identity
## 2: y ..B0 + ..B1 * x ..sigma2 normal identity</code></pre>
<p><code>B0</code>, <code>B1</code>, and <code>sigma2</code> are not part of the simstudy data definition, but will be set external to that process, either in the global environment or within the context of a function.</p>
<pre class="r"><code>B0 <- 4
B1 <- 2
sigma2 <- 9

set.seed(716251)
dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))
coef(fit)</code></pre>
<pre><code>## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4 0.28 14 2.6e-25
## x 2 0.13 15 5.9e-28</code></pre>
<pre class="r"><code>fit$sigma</code></pre>
<pre><code>## [1] 2.8</code></pre>
<p>It is easy to create a new data set on the fly with different slope and variance assumptions without having to go to the trouble of updating the data definitions.</p>
<pre class="r"><code>B1 <- 3
sigma2 <- 16
dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))
coef(fit)</code></pre>
<pre><code>## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4 0.43 10 4.6e-17
## x 3.1 0.22 14 8.6e-26</code></pre>
<pre class="r"><code>fit$sigma</code></pre>
<pre><code>## [1] 4.2</code></pre>
</div>
<div id="using-with-apply-functions" class="section level3">
<h3>Using with <em>apply</em> functions</h3>
<p>Double-dot references can be flexibly applied using <code>lapply</code> (or the parallel version <code>mclapply</code>) to create a range of data sets under different assumptions:</p>
<pre class="r"><code>library(ggplot2)

gen_data <- function(sigma2, d) {
  dd <- genData(400, d)
  dd[, sigma2 := sigma2]
  dd
}

sigma2s <- c(1:9)^2

dd_m <- lapply(sigma2s, function(s) gen_data(s, def))
dd_m <- rbindlist(dd_m)

ggplot(data = dd_m, aes(x = x, y = y)) +
  geom_point(size = .5, color = "#DDAA33") +
  facet_wrap(sigma2 ~ .) +
  theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-10-13-simstudy-just-got-a-little-more-dynamic-version-0-2-0.en_files/figure-html/unnamed-chunk-5-1.png" width="768" /></p>
</div>
<div id="using-with-vectors" class="section level3">
<h3>Using with vectors</h3>
<p>Double-dot referencing is also vector friendly. For example, if we want to create a mixture distribution from a vector of values (which we can also do using a <em>categorical</em> distribution), we can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4:</p>
<pre class="r"><code>defblk <- defData(varname = "blksize",
  formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

defblk</code></pre>
<pre><code>## varname formula variance dist link
## 1: blksize ..sizes[1] | .5 + ..sizes[2] | .5 0 mixture identity</code></pre>
<pre class="r"><code>sizes <- c(2, 4)
genData(1000, defblk)</code></pre>
<pre><code>## id blksize
## 1: 1 2
## 2: 2 4
## 3: 3 2
## 4: 4 4
## 5: 5 4
## ---
## 996: 996 4
## 997: 997 2
## 998: 998 4
## 999: 999 4
## 1000: 1000 4</code></pre>
<p>There are a few other changes to the package that are described <a href="https://kgoldfeld.github.io/simstudy/news/index.html" target="_blank">here</a> (but look for version <code>0.2.0</code> - we found a pretty major bug right away and fixed it, hence <code>0.2.1</code>). Moving forward, we have some more things in the works, of course. And if you have suggestions of your own, you know where to find us.</p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>Here’s a more detailed example to show how double-dot references simplify things considerably in a case where I originally used the <code>updateDef</code> function. In a post where I described <a href="https://www.rdatagen.net/post/regression-to-the-mean" target="_blank">regression to the mean</a>, there is an <a href="https://www.rdatagen.net/post/regression-to-the-mean/#addendum" target="_blank">addendum</a> that I adapt here using double-dot references. I’m not going into the motivation for the code here - check out the <a href="https://www.rdatagen.net/post/regression-to-the-mean" target="_blank">post</a> if you’d like to see more.</p>
<p>Here’s the original code (both examples require the <code>parallel</code> package):</p>
<pre class="r"><code>d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "x2", formula = "0 + 2*U", variance = 1)
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
  dist = "nonrandom")

rtomean <- function(n, d) {
  dd <- genData(n, d)
  data.table(x1 = dd[x1 >= h1, mean(x1)], x2 = dd[x1 >= h1, mean(x2)])
}

repl <- function(xvar, nrep, ucoef, d) {
  d <- updateDef(d, "x1", newvariance = xvar)
  d <- updateDef(d, "x2", newvariance = xvar)

  dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d)))
  mudif <- unlist(lapply(dif, mean))
  data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
}

dres <- list()
i <- 0

for (ucoef in c(0, 1, 2, 3)) {
  i <- i + 1
  uform <- genFormula(c(0, ucoef), "U")

  d <- updateDef(d, "x1", newformula = uform)
  d <- updateDef(d, "x2", newformula = uform)

  dr <- mclapply(seq(1, 4, by = 1), function(x) repl(x, 1000, ucoef, d))
  dres[[i]] <- rbindlist(dr)
}

dres <- rbindlist(dres)</code></pre>
<p>And here is the updated code:</p>
<pre class="r"><code>d <- defData(varname = "U", formula = "-1;1", dist = "uniform")
d <- defData(d, varname = "x1", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "x2", formula = "0 + ..ucoef*U", variance = "..xvar")
d <- defData(d, varname = "h1", formula = "x1 > quantile(x1, .80) ",
  dist = "nonrandom")

rtomean <- function(n, d, ucoef, xvar) {
  dd <- genData(n, d)
  data.table(x1 = dd[x1 >= h1, mean(x1)], x2 = dd[x1 >= h1, mean(x2)])
}

repl <- function(nrep, d, ucoef, xvar) {
  dif <- rbindlist(mclapply(1:nrep, function(x) rtomean(200, d, ucoef, xvar)))
  mudif <- unlist(lapply(dif, mean))
  data.table(ucoef, xvar, x1 = mudif[1], x2 = mudif[2])
}

ucoef <- c(0:3)
xvar <- c(1:4)
params <- asplit(expand.grid(ucoef = ucoef, xvar = xvar), 1)

dres <- rbindlist(mclapply(params, function(x) repl(1000, d, x["ucoef"], x["xvar"])))</code></pre>
<p>The code is much cleaner and the data generating process doesn’t really lose any clarity. Importantly, this change has allowed me to take advantage of the <code>apply</code> approach (rather than using a loop). I’d conclude that double-dot references have the potential to simplify things quite a bit.</p>
</div>
Permuted block randomization using simstudy
https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/
Tue, 29 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Along with preparing power analyses and statistical analysis plans (SAPs), generating study randomization lists is something a practicing biostatistician is occasionally asked to do. While not a particularly interesting activity, it offers the opportunity to tackle a small programming challenge. The title is a little misleading because you should probably skip all this and just use the <code>blockrand</code> package if you want to generate randomization schemes; don’t try to reinvent the wheel. But, I can’t resist. Since I was recently asked to generate such a list, I’ve been wondering how hard it would be to accomplish this using <code>simstudy</code>. There are already built-in functions for simulating stratified randomization schemes, so maybe it could be a good solution. The key element that is missing from simstudy, of course, is the permuted block setup.</p>
<div id="why-permuted-block-randomization" class="section level3">
<h3>Why permuted block randomization?</h3>
<p>What is <em>permuted block</em> randomization and why even bother? The <em>block</em> part indicates that patients will be randomized in sub-groups. If we have blocks of size two, patients will be paired sequentially, with the two patients in each pair randomized to different arms. This helps with balance over time and protects against recruitment falling short of expectations. If we were to randomize 100 patients to treatment and control (50 in each arm), it is possible that treatment assignments would cluster close to the start of the trial, just by chance. If the time period is related to the outcome in some way, this would be undesirable. Furthermore, if trial recruitment lagged and had to stop early, there would be an actual lack of balance across the arms.</p>
<p>The argument for block randomization seems strong enough. But if coordinators in the field know that we are using this approach, there is a risk of influencing patient recruitment. If it is known that patients are randomized in blocks of size four, and the first two patients are randomized to drug therapy, the coordinator will know that the next two patients will be randomized to control. This could influence who the coordinator recruits into the study, particularly if they believe drug therapy is superior. They may actively or unconsciously recruit healthier patients when it is known that they are going to get the control. (This, of course, is much less of an issue when recruiters/coordinators are completely blinded to group assignment.) By changing the block sizes in an unpredictable manner, by <em>permuting</em> the sizes, this problem is solved. Hence, <em>permuted block randomization</em>.</p>
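<p>The mechanics can be sketched in a few lines of base R (a simplified illustration only, not the <code>simstudy</code> implementation developed below): draw an unpredictable block size, then fill each block with a shuffled, perfectly balanced set of assignments.</p>

```r
# Minimal sketch of permuted block randomization (illustration only):
# block sizes are sampled at random, and each block contains an equal
# number of assignments to each arm, in random order.
permuted_blocks <- function(n, arms = c("D", "P"), blk_sizes = c(2, 4)) {
  assignments <- character(0)
  while (length(assignments) < n) {
    size <- blk_sizes[sample.int(length(blk_sizes), 1)]  # unpredictable size
    block <- sample(rep(arms, size / length(arms)))      # balanced within block
    assignments <- c(assignments, block)
  }
  assignments[1:n]
}

set.seed(1234)
table(permuted_blocks(120))
```

<p>Because every complete block is internally balanced, the overall allocation can never drift far from 50/50, no matter when recruitment stops.</p>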
</div>
<div id="simstudy-code" class="section level3">
<h3>simstudy code</h3>
<p>I want to walk through the code that will generate permuted block randomization. In this scenario we are conducting a trial to compare a <em>drug</em> therapy with <em>placebo</em> in at least 120 patients. We would like to randomize within blocks of size two or four, and the order of the blocks will themselves be randomized. We assume that each block size will have equal probability of being selected, though balance across different block sizes is not guaranteed. The preliminary code shown here implements these specifications:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
set.seed(1234)
n <- 120
levels <- c("D", "P") # Drug/Placebo
blk_sizes <- c(2, 4)
n_arms <- length(levels)
p_blk <- 1/length(blk_sizes)</code></pre>
<p>The first step is to generate a sequence of blocks with varying block sizes. We take advantage of the <code>mixture</code> distribution option in <code>simstudy</code> to generate blocks. This distribution is specified using a string with the format “<span class="math inline">\(x_1|p_1 + \dots + x_k|p_k\)</span>”. In this case, there are <span class="math inline">\(k=2\)</span> block sizes; <span class="math inline">\(x_1 = 2\)</span>, <span class="math inline">\(p_1 = 0.5\)</span>, <span class="math inline">\(x_2 = 4\)</span>, and <span class="math inline">\(p_2 = 0.5\)</span>. We construct the mixture formula using the predefined block sizes, and use that formula to define the data that we need to generate:</p>
<pre class="r"><code>v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))
mixformula <- paste(v, collapse = "+")

def <- defData(varname = "blk_size", formula = mixformula,
  dist = "mixture", id = "blk_id")

def</code></pre>
<pre><code>## varname formula variance dist link
## 1: blk_size 2|0.5+4|0.5 0 mixture identity</code></pre>
<p>Now, we need to generate enough blocks to support the target number of patients to be randomized; that is, the sum of the block sizes should be at least as large as the target. If all block sizes were the minimum block size (in this case <span class="math inline">\(2\)</span>), we would need at least <span class="math inline">\(n/2\)</span> blocks. Clearly, we will need fewer, but we will start with <span class="math inline">\(n/2\)</span> here and remove the excess:</p>
<pre class="r"><code>maxblocks <- ceiling(n / min(blk_sizes))
dd <- genData(maxblocks, def)
#--- removing the excess
nblocks <- dd[, threshold := (cumsum(blk_size) >= n) * .I][threshold > 0]
dd <- dd[1:nblocks[1, threshold]]
tail(dd)</code></pre>
<pre><code>## blk_id blk_size threshold
## 1: 36 4 0
## 2: 37 2 0
## 3: 38 2 0
## 4: 39 4 0
## 5: 40 4 0
## 6: 41 4 41</code></pre>
<p>In the final step, we use cluster data generation to create the individual patients, defining each block as a cluster. Treatment assignment is stratified by each block:</p>
<pre class="r"><code>di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size", level1ID = "id")
dtrt <- trtAssign(di, nTrt = n_arms, strata = "blk_id", grpName = "arm")
dtrt <- dtrt[, .(id, blk_id, blk_size, arm = factor(arm, labels = levels))]</code></pre>
<p>Here are examples of the block randomization results for four blocks:</p>
<pre class="r"><code>dtrt[blk_id == 5]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 15 5 4 P
## 2: 16 5 4 D
## 3: 17 5 4 P
## 4: 18 5 4 D</code></pre>
<pre class="r"><code>dtrt[blk_id == 8]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 25 8 2 D
## 2: 26 8 2 P</code></pre>
<pre class="r"><code>dtrt[blk_id == 19]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 59 19 2 P
## 2: 60 19 2 D</code></pre>
<pre class="r"><code>dtrt[blk_id == 26]</code></pre>
<pre><code>## id blk_id blk_size arm
## 1: 73 26 4 D
## 2: 74 26 4 P
## 3: 75 26 4 P
## 4: 76 26 4 D</code></pre>
</div>
<div id="a-real-world-application" class="section level3">
<h3>A real-world application</h3>
<p>I’ve created a function <code>blkRandom</code> based on this code so that I can illustrate this functionality in a more realistic setting. In a current multi-site study that I’m working on (already did the power analysis, finalizing the SAP), we need to provide a randomization list so that subject recruitment can begin. Randomization will be stratified by each of the sites (1 through 7), by sex (M and F), and by location of recruitment (A or B); in total, there will be <span class="math inline">\(7\times 2 \times 2 = 28\)</span> strata. For each of the 28 strata we want to randomize 50 potential subjects using permuted block randomization; for particular strata, this is certainly too large a number, but it doesn’t hurt to overestimate as long as the clinical trial software system can handle it.</p>
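<p>The <code>blkRandom</code> function itself isn’t shown in the post; piecing together the steps from the previous section, a reconstruction (my assembly, not necessarily the author’s exact code) might look like this:</p>

```r
library(simstudy)
library(data.table)

# Reconstruction of blkRandom from the steps above: define a mixture
# distribution over block sizes, generate and trim the blocks so they
# cover at least n subjects, then assign arms stratified by block.
blkRandom <- function(n, levels, blk_sizes) {
  p_blk <- 1 / length(blk_sizes)
  v <- sapply(blk_sizes, function(x) paste(x, p_blk, sep = "|"))

  def <- defData(varname = "blk_size", formula = paste(v, collapse = "+"),
    dist = "mixture", id = "blk_id")

  maxblocks <- ceiling(n / min(blk_sizes))
  dd <- genData(maxblocks, def)

  nblocks <- dd[, threshold := (cumsum(blk_size) >= n) * .I][threshold > 0]
  dd <- dd[1:nblocks[1, threshold]]

  di <- genCluster(dd, cLevelVar = "blk_id", numIndsVar = "blk_size",
    level1ID = "id")
  dtrt <- trtAssign(di, nTrt = length(levels), strata = "blk_id",
    grpName = "arm")
  dtrt[, arm := factor(arm, labels = levels)]
  dtrt[]
}
```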
<p>Here is how the function would work for a single stratum (just showing the first and last blocks):</p>
<pre class="r"><code>blkRandom(n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))[c(1:4, 47:50)]</code></pre>
<pre><code>## blk_id blk_size threshold id arm
## 1: 1 4 0 1 A
## 2: 1 4 0 2 B
## 3: 1 4 0 3 A
## 4: 1 4 0 4 B
## 5: 13 4 13 47 B
## 6: 13 4 13 48 A
## 7: 13 4 13 49 A
## 8: 13 4 13 50 B</code></pre>
<p>Here is a wrapper function for <code>blkRandom</code> that incorporates a specific stratum <span class="math inline">\(s\)</span>. This will enable us to do permuted block randomization within different subgroups of the population, such as males and females, or sites:</p>
<pre class="r"><code>sBlkRandom <- function(s, n, levels, blk_sizes) {
  dB <- blkRandom(n, levels, blk_sizes)
  dB[, .(id = paste0(id, s), stratum = s, arm)]
}

sBlkRandom(s = "M1A", n = 30, levels = c("A", "B"), blk_sizes = c(2, 4))[1:5]</code></pre>
<pre><code>## id stratum arm
## 1: 1M1A M1A B
## 2: 2M1A M1A A
## 3: 3M1A M1A A
## 4: 4M1A M1A B
## 5: 5M1A M1A B</code></pre>
<p>All the pieces are now in place.</p>
<p>We need to create a list of strata, each of which requires its own permuted block randomization list:</p>
<pre class="r"><code>library(tidyr)

#--- specify all strata variables

site <- c(1:7)
sex <- c("M", "F")
location <- c("A", "B")

#--- create strata

strata <- expand.grid(sex = sex, site = site, location = location)
strata <- unite(strata, "stratum", sep = "")$stratum

head(strata)</code></pre>
<pre><code>## [1] "M1A" "F1A" "M2A" "F2A" "M3A" "F3A"</code></pre>
<p>With the list of strata in hand, we create the randomization lists using <code>lapply</code> to repeatedly call <code>sBlkRandom</code>:</p>
<pre class="r"><code>rbindlist(lapply(
  strata,
  function(s) sBlkRandom(s, n = 50, levels = c("A", "B"), blk_sizes = c(2, 4))
))</code></pre>
<pre><code>## id stratum arm
## 1: 1M1A M1A B
## 2: 2M1A M1A A
## 3: 3M1A M1A A
## 4: 4M1A M1A B
## 5: 5M1A M1A B
## ---
## 1420: 46F7B F7B B
## 1421: 47F7B F7B B
## 1422: 48F7B F7B A
## 1423: 49F7B F7B B
## 1424: 50F7B F7B A</code></pre>
</div>
<div id="going-with-the-tried-and-true" class="section level3">
<h3>Going with the tried and true</h3>
<p>This has been fun and I think successful, but as I mentioned, you might want to stick with the established <a href="https://cran.r-project.org/web/packages/blockrand/index.html" target="_blank">blockrand</a> package that is designed around this very specific goal. I have written a simple wrapper function that rectifies one minor inconvenience (block sizes must be specified as <span class="math inline">\(\text{blk\_size}/2\)</span>, since <code>blockrand</code> scales them by the number of treatment levels) and allows us to use <code>lapply</code> to make repeated calls across the strata:</p>
<pre class="r"><code>library(blockrand)

rand_stratum <- function(stratum, n, levels, blk_sizes) {
  blk_sizes <- blk_sizes / 2

  dB <- data.table(blockrand(
    n = n,
    num.levels = length(levels),
    levels = levels,
    id.prefix = stratum,
    block.prefix = stratum,
    stratum = stratum,
    block.sizes = blk_sizes)
  )

  dB[, .(id, stratum, treatment)]
}

rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))</code></pre>
<pre><code>## id stratum treatment
## 1: M1A01 M1A B
## 2: M1A02 M1A A
## 3: M1A03 M1A B
## 4: M1A04 M1A A
## 5: M1A05 M1A B
## ---
## 1420: F7B46 F7B A
## 1421: F7B47 F7B A
## 1422: F7B48 F7B B
## 1423: F7B49 F7B A
## 1424: F7B50 F7B B</code></pre>
<p>In case the elegance and simplicity (not to mention all the other features that it provides, but I didn’t show you) are not compelling enough, the speed comparison isn’t even close: <code>blockrand</code> is 7 times faster than my <code>simstudy</code> solution (albeit on the millisecond scale - so you might not actually notice it).</p>
<pre class="r"><code>library(microbenchmark)

microbenchmark(
  rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A", "B"), c(2, 4)))),
  rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))
)</code></pre>
<pre><code>## Unit: milliseconds
## expr
## rbindlist(lapply(strata, function(s) sBlkRandom(s, 50, c("A", "B"), c(2, 4))))
## rbindlist(lapply(strata, function(s) rand_stratum(s, 50, c("A", "B"), c(2, 4))))
## min lq mean median uq max neval cld
## 276 290 331 304 344 698 100 b
## 38 40 48 44 49 120 100 a</code></pre>
</div>
Generating probabilities for ordinal categorical data
https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/
Tue, 15 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently <a href="https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/" target="_blank">here</a>. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.</p>
<div id="motivation" class="section level3">
<h3>Motivation</h3>
<p>The outcome for the analysis that we will be conducting is the WHO 11-point ordinal scale for clinical improvement at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. We plan to use a Bayesian proportional odds model to assess the effectiveness of the therapy. Since this is a meta-analysis, we will be including these data from a collection of studies being conducted around the world.</p>
<p>Typically, in a proportional odds model one has to make an <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/" target="_blank">assumption about proportionality</a>. In this case, while we are willing to make that assumption <em>within</em> specific studies, we are not willing to make that assumption <em>across</em> the various studies. This means we need to generate a separate set of intercepts for each study that we simulate.</p>
<p>In the proportional odds model, we are modeling the log-cumulative odds at a particular level. The simplest model with a single exposure/treatment covariate for a specific study or cluster <span class="math inline">\(k\)</span> is</p>
<p><span class="math display">\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{xk} + \beta A,\]</span>
where <span class="math inline">\(x\)</span> ranges from 1 to 10, all the levels of the WHO score excluding the lowest level <span class="math inline">\(x=0\)</span>. <span class="math inline">\(A\)</span> is the treatment indicator, with <span class="math inline">\(A=1\)</span> for patients who receive the treatment. <span class="math inline">\(\alpha_{xk}\)</span> is the intercept for each study/cluster <span class="math inline">\(k\)</span>. <span class="math inline">\(\beta\)</span> is interpreted as the log-odds ratio comparing the odds of the treated with the non-treated within each study. The proportionality assumption kicks in here when we note that <span class="math inline">\(\beta\)</span> is constant for all levels of <span class="math inline">\(x\)</span>. In addition, in this particular model, we are assuming that the log-odds ratio is constant across studies (not something we will assume in a more complete model). We make no assumptions about how the study intercepts relate to each other.</p>
<p>To make clear what it would mean to make a stronger assumption about the odds <em>across</em> studies consider this model:</p>
<p><span class="math display">\[log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{x} + b_k + \beta A,\]</span></p>
<p>where the intercepts for each study are related, since they are defined as <span class="math inline">\(\alpha_{x} + b_k\)</span>, and share <span class="math inline">\(\alpha_x\)</span> in common. If we compare the log-odds of the treated in one study <span class="math inline">\(k\)</span> with the log-odds of treated in another study <span class="math inline">\(j\)</span> (so <span class="math inline">\(A=1\)</span> in both cases), the log-odds ratio is <span class="math inline">\(b_j - b_k\)</span>. The ratio is independent of <span class="math inline">\(x\)</span>, which implies a strong proportional odds assumption across studies. In contrast, the same comparison across studies based on the first model is <span class="math inline">\(\alpha_{xj} - \alpha_{xk}\)</span>, which is <em>not</em> necessarily constant across different levels of <span class="math inline">\(x\)</span>.</p>
<p>This is a long way of explaining why we need to generate different sets of intercepts for each study. In short, we would like to make the more relaxed assumption that odds are not proportional across studies or clusters.</p>
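<p>To make the distinction concrete, here is a toy calculation with invented intercepts (the numbers below are purely illustrative, not estimated from any data):</p>
<pre class="r"><code>#--- toy cumulative log-odds intercepts for two studies (made-up values)
alpha_j <- c(-2.0, -0.5, 1.0)   # study j, levels x = 1, 2, 3
alpha_k <- c(-1.0, 0.2, 2.5)    # study k

#--- first model: study-specific intercepts, so the cross-study comparison
#--- changes with x - odds are *not* proportional across studies
alpha_j - alpha_k                  # -1.0 -0.7 -1.5 (varies with x)

#--- second model: intercepts are alpha_x + b_k, so the comparison reduces
#--- to b_j - b_k at every level x - a strong proportionality assumption
alpha_x <- c(-2.0, -0.5, 1.0)
b_j <- 0.3
b_k <- -0.2
(alpha_x + b_j) - (alpha_x + b_k)  # 0.5 0.5 0.5 (constant across x)</code></pre>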
</div>
<div id="the-dirichlet-distribution" class="section level3">
<h3>The Dirichlet distribution</h3>
<p>In order to generate ordinal categorical data I use the <a href="https://kgoldfeld.github.io/simstudy/reference/genOrdCat.html" target="_blank">genOrdCat</a> function in the <code>simstudy</code> package. This function requires a set of baseline probabilities that sum to one; these probabilities map onto level-specific intercepts. There will be a distinct set of baseline probabilities for each study and I will create a data set for each study. The challenge is to be able to generate unique baseline probabilities as if I were sampling from a population of studies.</p>
<p>If I want to generate a single probability (i.e. a number between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>), a good solution is to draw a value from a <em>beta</em> distribution, which has two shape parameters <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>.</p>
<p>Here is a single draw from <span class="math inline">\(beta(3, 3)\)</span>:</p>
<pre class="r"><code>set.seed(872837)
rbeta(1, shape1 = 3, shape2 = 3)</code></pre>
<pre><code>## [1] 0.568</code></pre>
<p>The mean of the <em>beta</em> distribution is <span class="math inline">\(\alpha/(\alpha + \beta)\)</span> and the variance is <span class="math inline">\(\alpha\beta/(\alpha+\beta)^2(\alpha + \beta + 1)\)</span>. We can reduce the variance and maintain the same mean by increasing <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> by a constant factor (see <a href="#addendum">addendum</a> for a pretty picture):</p>
<pre class="r"><code>library(data.table)
d1 <- data.table(s = 1, value = rbeta(1000, shape1 = 1, shape2 = 2))
d2 <- data.table(s = 2, value = rbeta(1000, shape1 = 5, shape2 = 10))
d3 <- data.table(s = 3, value = rbeta(1000, shape1 = 100, shape2 = 200))
dd <- rbind(d1, d2, d3)
dd[, .(mean(value), sd(value)), keyby = s]</code></pre>
<pre><code>##    s    V1     V2
## 1: 1 0.338 0.2307
## 2: 2 0.336 0.1195
## 3: 3 0.333 0.0283</code></pre>
<p>The <em>Dirichlet</em> distribution is a multivariate version of the <em>beta</em> distribution where <span class="math inline">\(K\)</span> values between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span> are generated, with the caveat that they sum to <span class="math inline">\(1\)</span>. Instead of <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>, the Dirichlet is parameterized by a vector of length <span class="math inline">\(K\)</span></p>
<p><span class="math display">\[\boldsymbol{\alpha} = \left(\alpha_1,\dots, \alpha_K\right)^T,\]</span></p>
<p>where there are <span class="math inline">\(K\)</span> levels of the ordinal outcome. A draw from this distribution returns a vector <span class="math inline">\(\boldsymbol{p} = ( p_1, \dots, p_K)^T\)</span> where <span class="math inline">\(\sum_{i=1}^K p_i = 1\)</span> and</p>
<p><span class="math display">\[E(p_k)=\frac{\alpha_k}{\sum_{i=1}^K \alpha_i}.\]</span>
A draw from a Dirichlet distribution with <span class="math inline">\(K=2\)</span> is actually equivalent to a draw from a <em>beta</em> distribution where <span class="math inline">\(\boldsymbol{\alpha} = (\alpha, \beta)^T\)</span>. Before, I generated data from a <span class="math inline">\(beta(1, 2)\)</span>, and now here is a draw from <span class="math inline">\(Dirichlet\left(\boldsymbol\alpha = (1,2)\right)\)</span> using <code>rdirichlet</code> from the <code>gtools</code> package:</p>
<pre class="r"><code>library(gtools)
dir <- rdirichlet(1000, alpha = c(1,2))
head(dir)</code></pre>
<pre><code>##        [,1]  [,2]
## [1,] 0.3606 0.639
## [2,] 0.4675 0.533
## [3,] 0.2640 0.736
## [4,] 0.0711 0.929
## [5,] 0.5643 0.436
## [6,] 0.0188 0.981</code></pre>
<p>The first column has the same distribution as the <span class="math inline">\(beta\)</span> distribution from before; the mean and standard deviation are close to the values estimated above:</p>
<pre class="r"><code>c(mean(dir[,1]), sd(dir[,1]))</code></pre>
<pre><code>## [1] 0.332 0.236</code></pre>
<p>To ramp things up a bit, say we have <span class="math inline">\(K = 5\)</span> with target mean values for each level of <span class="math inline">\(\boldsymbol{p} = \left(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9} \right)\)</span>; one way to specify this is:</p>
<pre class="r"><code>dir_1 <- rdirichlet(1000, alpha = c(1, 2, 3, 2, 1))
head(dir_1)</code></pre>
<pre><code>##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] 0.1710 0.6637 0.0676 0.0633 0.0343
## [2,] 0.1130 0.1150 0.2803 0.4229 0.0689
## [3,] 0.1434 0.0678 0.3316 0.1721 0.2851
## [4,] 0.0250 0.1707 0.3841 0.2490 0.1712
## [5,] 0.0633 0.3465 0.4056 0.0853 0.0993
## [6,] 0.1291 0.1510 0.3993 0.2612 0.0593</code></pre>
<p>Here are the observed means for each <span class="math inline">\(p_k\)</span>, pretty close to the target:</p>
<pre class="r"><code>apply(dir_1, 2, mean)</code></pre>
<pre><code>## [1] 0.111 0.221 0.328 0.229 0.112</code></pre>
<p>Of course, we could generate data with a similar target <span class="math inline">\(\boldsymbol{p}\)</span> by multiplying <span class="math inline">\(\boldsymbol\alpha\)</span> by a constant <span class="math inline">\(c\)</span>. In this case, we use <span class="math inline">\(c=10\)</span> and see that the average values for each <span class="math inline">\(p_k\)</span> are also close to the target:</p>
<pre class="r"><code>dir_2 <- rdirichlet(1000, alpha = c(10, 20, 30, 20, 10))
apply(dir_2, 2, mean)</code></pre>
<pre><code>## [1] 0.113 0.222 0.334 0.220 0.111</code></pre>
<p>There is a key difference between specifying <span class="math inline">\(\boldsymbol{\alpha}\)</span> and <span class="math inline">\(c\boldsymbol{\alpha}\)</span>. Just as in the <em>beta</em> distribution, as <span class="math inline">\(c\)</span> grows larger, the variation within each <span class="math inline">\(p_k\)</span> decreases. This will be useful when generating the study-specific probabilities if we want to explore different levels of variation.</p>
<p>Here are the standard deviations from the two data sets just generated:</p>
<pre class="r"><code>apply(dir_1, 2, sd)</code></pre>
<pre><code>## [1] 0.102 0.131 0.144 0.134 0.098</code></pre>
<pre class="r"><code>apply(dir_2, 2, sd)</code></pre>
<pre><code>## [1] 0.0333 0.0425 0.0508 0.0421 0.0333</code></pre>
</div>
<div id="generating-the-baseline-probabilities" class="section level3">
<h3>Generating the baseline probabilities</h3>
<p>A simple function that includes two key arguments - the base probabilities (which are really <span class="math inline">\(\boldsymbol{\alpha}\)</span>) and a <em>similarity</em> index (which is really just the constant <span class="math inline">\(c\)</span>) - implements these ideas to generate study-specific probabilities for each outcome level. As the <em>similarity</em> index increases, the variation across studies or sites decreases. The function includes an additional adjustment to ensure that the row totals sum exactly to <span class="math inline">\(1\)</span> and not to some value infinitesimally greater than <span class="math inline">\(1\)</span> as a result of rounding. Such a rounding error could cause problems for the function <code>genOrdCat</code>.</p>
<pre class="r"><code>genBaseProbs <- function(n, base, similarity, digits = 8) {
  
  n_levels <- length(base)
  x <- rdirichlet(n, similarity * base)
  
  #--- ensure that each vector of probabilities sums exactly to 1
  
  x <- round(floor(x * 1e8) / 1e8, digits)  # round the generated probabilities
  xpart <- x[, 1:(n_levels - 1)]            # delete the base prob of the final level
  partsum <- apply(xpart, 1, sum)           # add the values of levels 1 to K-1
  x[, n_levels] <- 1 - partsum              # base prob of level K = 1 - sum(1:(K-1))
  
  return(x)
}</code></pre>
<p>In this first example, I am generating 11 values (representing base probabilities) for each of 9 studies using a relatively low similarity index, showing you the first six studies:</p>
<pre class="r"><code>basestudy <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 15
)

round(head(basestudy), 3)</code></pre>
<pre><code>##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
## [1,] 0.094 0.022 0.121 0.100 0.061 0.102 0.053 0.309 0.059 0.078 0.000
## [2,] 0.025 0.079 0.043 0.197 0.083 0.044 0.099 0.148 0.025 0.150 0.107
## [3,] 0.007 0.042 0.084 0.066 0.049 0.145 0.191 0.323 0.078 0.012 0.003
## [4,] 0.061 0.021 0.063 0.104 0.092 0.292 0.112 0.110 0.113 0.026 0.008
## [5,] 0.067 0.023 0.021 0.042 0.063 0.473 0.108 0.127 0.016 0.013 0.046
## [6,] 0.001 0.018 0.054 0.225 0.150 0.301 0.043 0.081 0.100 0.008 0.020</code></pre>
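<p>As a quick sanity check (an addition here, not part of the original workflow), we can confirm that the adjustment in <code>genBaseProbs</code> did its job and each row of <code>basestudy</code> sums to exactly 1:</p>
<pre class="r"><code>#--- every row total should be exactly 1, with no floating-point shortfall
all(rowSums(basestudy) == 1)</code></pre>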
<p>A great way to see the variability is a cumulative probability plot for each individual study. With a relatively low <em>similarity</em> index, you can generate quite a bit of variability across the studies. In order to create the plot, I need to first calculate the cumulative probabilities:</p>
<pre class="r"><code>library(ggplot2)
library(viridis)

cumprobs <- data.table(t(apply(basestudy, 1, cumsum)))
n_levels <- ncol(cumprobs)
cumprobs[, id := .I]

dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]

ggplot(data = dm, aes(x = level, y = value)) +
  geom_line(aes(group = id, color = id)) +
  scale_color_viridis(option = "D") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>Here is a plot of data generated using a <em>similarity</em> index of 150. Variation is reduced pretty dramatically:</p>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
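<p>The code for this second plot is not shown in the post; presumably it reuses the plotting code from above with a new set of base probabilities. A sketch along those lines:</p>
<pre class="r"><code>#--- regenerate base probabilities with a much higher similarity index
basestudy150 <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 150
)

#--- same cumulative probability plot as before
cumprobs <- data.table(t(apply(basestudy150, 1, cumsum)))
cumprobs[, id := .I]

dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]

ggplot(data = dm, aes(x = level, y = value)) +
  geom_line(aes(group = id, color = id)) +
  scale_color_viridis(option = "D") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none")</code></pre>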
</div>
<div id="using-base-probabilities-to-generate-ordinal-data" class="section level3">
<h3>Using base probabilities to generate ordinal data</h3>
<p>Now that we have these base probabilities, the last step is to use them to generate ordinal outcomes. I am generating the simplest of data sets: 9 “studies,” each with 500 subjects, without any covariates or even treatment assignment. Since <code>genOrdCat</code> requires an adjustment variable, I am adjusting everyone by 0. (This is something I need to fix - there should be no such requirement.)</p>
<pre class="r"><code>library(simstudy)
d_study <- genData(9, id = "study")
d_ind <- genCluster(d_study, "study", numIndsVar = 500, "id")
d_ind[, z := 0]
d_ind</code></pre>
<pre><code>##       study   id z
##    1:     1    1 0
##    2:     1    2 0
##    3:     1    3 0
##    4:     1    4 0
##    5:     1    5 0
##   ---             
## 4496:     9 4496 0
## 4497:     9 4497 0
## 4498:     9 4498 0
## 4499:     9 4499 0
## 4500:     9 4500 0</code></pre>
<p>To generate the ordinal categorical outcome, we have to treat each study separately since they have unique baseline probabilities. This can be accomplished using <code>lapply</code> in the following way:</p>
<pre class="r"><code>basestudy <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 50
)

list_ind <- lapply(
  X = 1:9,
  function(i) {
    b <- basestudy[i, ]
    d_x <- d_ind[study == i]
    genOrdCat(d_x, adjVar = "z", b, catVar = "ordY")
  }
)</code></pre>
<p>The output <code>list_ind</code> is a list of data tables, one for each study. For example, here is the 5th data table in the list:</p>
<pre class="r"><code>list_ind[[5]]</code></pre>
<pre><code>##      study   id z ordY
##   1:     5 2001 0    7
##   2:     5 2002 0    9
##   3:     5 2003 0    5
##   4:     5 2004 0    9
##   5:     5 2005 0    9
##  ---                  
## 496:     5 2496 0    9
## 497:     5 2497 0    4
## 498:     5 2498 0    7
## 499:     5 2499 0    5
## 500:     5 2500 0   11</code></pre>
<p>And here is a table of proportions for each study that we can compare with the base probabilities:</p>
<pre class="r"><code>t(sapply(list_ind, function(x) x[, prop.table(table(ordY))]))</code></pre>
<pre><code>##          1     2     3     4     5     6     7     8     9    10    11
## [1,] 0.106 0.048 0.086 0.158 0.058 0.162 0.092 0.156 0.084 0.028 0.022
## [2,] 0.080 0.024 0.092 0.134 0.040 0.314 0.058 0.110 0.028 0.110 0.010
## [3,] 0.078 0.050 0.028 0.054 0.148 0.172 0.162 0.134 0.058 0.082 0.034
## [4,] 0.010 0.056 0.116 0.160 0.054 0.184 0.102 0.084 0.156 0.056 0.022
## [5,] 0.010 0.026 0.036 0.152 0.150 0.234 0.136 0.084 0.120 0.026 0.026
## [6,] 0.040 0.078 0.100 0.092 0.170 0.168 0.196 0.050 0.038 0.034 0.034
## [7,] 0.006 0.064 0.058 0.064 0.120 0.318 0.114 0.068 0.082 0.046 0.060
## [8,] 0.022 0.070 0.038 0.160 0.182 0.190 0.074 0.068 0.070 0.036 0.090
## [9,] 0.054 0.046 0.052 0.128 0.100 0.290 0.102 0.092 0.080 0.030 0.026</code></pre>
<p>Of course, the best way to compare is to plot the data for each study. Here is another cumulative probability plot, this time including the observed (generated) probabilities in black over the baseline probabilities used in the data generation in red:</p>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-18-1.png" width="672" /></p>
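<p>The code for this comparison plot is also not shown; one possible sketch (not necessarily the author's exact code) overlays the observed cumulative proportions on the generating base probabilities, with one panel per study:</p>
<pre class="r"><code>#--- observed cumulative proportions from the generated data (black)
obs <- rbindlist(lapply(1:9, function(i) {
  p <- as.numeric(list_ind[[i]][, prop.table(table(ordY))])
  data.table(study = i, level = 1:11, cumprob = cumsum(p))
}))

#--- cumulative base probabilities used in the data generation (red)
gen <- rbindlist(lapply(1:9, function(i)
  data.table(study = i, level = 1:11, cumprob = cumsum(basestudy[i, ]))))

ggplot() +
  geom_line(data = gen, aes(x = level, y = cumprob), color = "red") +
  geom_line(data = obs, aes(x = level, y = cumprob), color = "black") +
  facet_wrap(~ study) +
  scale_x_continuous(breaks = 1:11, labels = 0:10, name = "level") +
  theme(panel.grid = element_blank())</code></pre>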
<p>Sometime soon, I plan to incorporate something like the function <code>genBaseProbs</code> into <code>simstudy</code> to make it easier to incorporate non-proportionality assumptions into simulation studies that use ordinal categorical outcomes.</p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>The variance of the <em>beta</em> distribution (and similarly the <em>Dirichlet</em> distribution) decreases as <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span> both increase proportionally (keeping the mean constant). I’ve plotted the variance of the <em>beta</em> distribution for <span class="math inline">\(\alpha = 1\)</span> and different levels of <span class="math inline">\(\beta\)</span> and <span class="math inline">\(C\)</span>. It is clear that at any level of <span class="math inline">\(\beta\)</span> (I’ve drawn a line at <span class="math inline">\(\beta = 1\)</span>), the variance decreases as <span class="math inline">\(C\)</span> increases. It is also clear that, holding <span class="math inline">\(\alpha\)</span> constant, the relationship of <span class="math inline">\(\beta\)</span> to variance is not strictly monotonic:</p>
<pre class="r"><code>var_beta <- function(params) {
  a <- params[1]
  b <- params[2]
  (a * b) / ((a + b)^2 * (a + b + 1))
}

loop_b <- function(C, b) {
  V <- sapply(C, function(x) var_beta(x * c(1, b)))
  data.table(b, V, C)
}

b <- seq(.1, 25, .1)
C <- c(0.01, 0.1, 0.25, 0.5, 1, 2, 4, 10, 100)

d_var <- rbindlist(lapply(b, function(x) loop_b(C, x)))

ggplot(data = d_var, aes(x = b, y = V, group = C)) +
  geom_vline(xintercept = 1, size = .5, color = "grey80") +
  geom_line(aes(color = factor(C))) +
  scale_y_continuous(name = expression("Var beta"~(alpha==1~","~beta))) +
  scale_x_continuous(name = expression(beta)) +
  scale_color_viridis(discrete = TRUE, option = "B", name = "C") +
  theme(panel.grid = element_blank(),
        legend.title.align = 0.15)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-09-15-generating-probabilities-for-ordinal-categorical-data.en_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
</div>
Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis
https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/
Tue, 01 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>The federal government recently granted emergency approval for the use of antibody rich blood plasma when treating hospitalized COVID-19 patients. This announcement is <a href="https://www.statnews.com/2020/08/24/trump-opened-floodgates-convalescent-plasma-too-soon/" target="_blank">unfortunate</a>, because we really don’t know if this promising treatment works. The best way to determine this, of course, is to conduct an experiment, though this approval makes this more challenging to do; with the general availability of convalescent plasma (CP), there may be resistance from patients and providers against participating in a randomized trial. The emergency approval sends the incorrect message that the treatment is definitively effective. Why would a patient take the risk of receiving a placebo when they have almost guaranteed access to the therapy?</p>
<p>This doesn’t obviate the fact that we still need to figure out if CP is effective. Last month, I described an <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="_blank">approach</a> to pooling data across multiple different, but similar, studies as way to leverage information that is being generated around the country and world from ongoing trials. I was particularly cryptic about the nature of the research, because the paper describing the details of the proposed research had not yet been <a href="https://jamanetwork.com/journals/jama/fullarticle/2768851" target="_blank">published</a>. Now, the project has a name (<strong>COMPILE</strong>), a <a href="https://med.nyu.edu/departments-institutes/population-health/divisions-sections-centers/biostatistics/research/continuous-monitoring-pooled-international-trials-convalescent-plasma-covid19-hospitalized-patients" target="_blank">website</a>, and most importantly, participating studies committed to sharing data.</p>
<p>In preparation for the analyses, we have been developing a statistical plan, which is based on a pooled Bayesian model similar to what I <a href="https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/" target="_blank">described</a> earlier. The Bayesian approach offers much needed flexibility in this context when we must make a principled decision as quickly as possible. Indeed, now that the <a href="https://www.fda.gov/news-events/press-announcements/fda-issues-emergency-use-authorization-convalescent-plasma-potential-promising-covid-19-treatment" target="_blank">emergency approval</a> has been granted, there is even more urgency. The study’s Data Safety and Monitoring Board will be evaluating the data frequently, which a Bayesian approach accommodates quite well. (I imagine I will have much to write about over the next few months as we try to better understand the implications and challenges of taking this path.)</p>
<p>In this post, I am describing a nitty-gritty issue related to Markov chain Monte Carlo (MCMC) estimation: stability. It may not sound super-exciting, but stable estimation is key to drawing correct inferences from the estimated posterior distribution. As a model becomes more complex, the MCMC estimation in <code>stan</code> can be plagued by degenerate sampling caused by divergent transitions. MCMC works well when the algorithm takes the sampler across the full posterior distribution without getting stuck, but all bets are off when the process breaks down.</p>
<p>Using a slightly simplified version of the data and model we are proposing for COMPILE, I want to show how to see if things have gotten stuck at some point, and then present a possible solution to getting things unstuck. (I highly encourage you to look <a href="http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html" target="_blank">here</a> and <a href="https://cran.r-project.org/web/packages/bayesplot/vignettes/visual-mcmc-diagnostics.html" target="_blank">here</a> for many more details if this is indeed exciting to you.)</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>The studies included in this meta-analysis will be similar in that they are all providing 1- or 2-units of plasma to the patients randomized to therapy. However, there will be differences with respect to the control arm: studies will use a saline solution, non-convalescent plasma, or usual care (in the last case, the study is not blinded). We need to account for the possibility that the treatment effect will vary slightly depending on the type of control. In this case, I am assuming a binary outcome (though in the actual study we are using an ordinal outcome and a proportional odds model). Here is the logistic model:
<span class="math display">\[\text{logodds}(P(y_{ik}=1)) = \alpha_0 + b_k + \delta_k x_{ik}\]</span>
where the observed data are <span class="math inline">\(y_{ik}\)</span>, the indicator for the outcome (say death) of patient <span class="math inline">\(i\)</span> in study <span class="math inline">\(k\)</span>, and <span class="math inline">\(x_{ik}\)</span>, an indicator set to <span class="math inline">\(1\)</span> if the patient is in the control arm, 0 otherwise. The parameters are <span class="math inline">\(\alpha_0\)</span>, the global logodds of death (note here the intercept represents the treatment condition, not the control condition as is usually the case); <span class="math inline">\(b_k\)</span>, the study-specific logodds of death given treatment; and <span class="math inline">\(\delta_k\)</span>, the “control” effect (a logodds ratio) specific to study <span class="math inline">\(k\)</span>.</p>
<p>The really interesting aspect of this model is <span class="math inline">\(\delta_k\)</span>. This effect is a function of three components - the study site, the control-type, and the overall/global treatment effect. We are assuming that there is a tendency for studies to vary around a control-type effect average. So, with three control types <span class="math inline">\(c\in \{1,2,3\}\)</span>:
<span class="math display">\[\delta_k \sim N(\delta_c, \eta_0),\]</span></p>
<p>determined by the control-type of study <span class="math inline">\(k\)</span>. Furthermore, we assume that the control-type effects, the <span class="math inline">\(\delta_c\)</span>’s, vary around a global effect <span class="math inline">\(\Delta\)</span>:
<span class="math display">\[\delta_c \sim N(\Delta, \eta).\]</span>
We assume that <span class="math inline">\(\eta\)</span> is quite small; that is, the control effects will likely be very similar. We are not actually interested in <span class="math inline">\(\eta\)</span> so we do not attempt to estimate it. However <span class="math inline">\(\eta_0\)</span>, the variability across studies, is important, so we <em>will</em> estimate that.</p>
</div>
<div id="generating-data" class="section level3">
<h3>Generating data</h3>
<p>To start, here are the <code>R</code> packages I am using to generate the data.</p>
<pre class="r"><code>library(simstudy)
library(data.table)</code></pre>
<p>Next, I define the study-level parameters: the study-specific intercept <span class="math inline">\(b_k\)</span> and the study-specific “control” effect <span class="math inline">\(\delta_k\)</span>, which is a function of the control type. Note that I do not specify a study-level “control” effect; there will just be natural variation across studies. The individual-level outcome <span class="math inline">\(y_{ik}\)</span> is defined as a function of the study parameters. The overall treatment effect is <span class="math inline">\(\Delta = 0.5\)</span> on the logodds ratio scale.</p>
<pre class="r"><code>def_s <- defDataAdd(varname = "b_k", formula = 0, variance = 0.025)
def_s <- defDataAdd(
  def_s, varname = "delta_k",
  formula = "(c_type==1) * 0.4 + (c_type==2) * 0.5 + (c_type==3) * 0.6",
  dist = "nonrandom"
)

def_i <- defDataAdd(
  varname = "y", formula = "-1 + b_k + rx * delta_k",
  dist = "binary", link = "logit")</code></pre>
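<p>As a side calculation (an addition here, not part of the data generation), the log-odds parameters translate into outcome probabilities as follows, setting <span class="math inline">\(b_k = 0\)</span>:</p>
<pre class="r"><code>#--- implied probability of the outcome when b_k = 0
plogis(-1)         # treatment arm: ~0.27
plogis(-1 + 0.4)   # control arm, control type 1: ~0.35
plogis(-1 + 0.5)   # control arm, control type 2: ~0.38
plogis(-1 + 0.6)   # control arm, control type 3: ~0.40</code></pre>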
<p>I am generating 7 studies under each control type, for a total of 21 studies. Each study has 50 patients, 25 in each arm, for a total of 1050 patients.</p>
<pre class="r"><code>dc <- genData(3, id = "c_type")
ds <- genCluster(dc, "c_type", numIndsVar = 7, level1ID = "site")
ds <- addColumns(def_s, ds)
di <- genCluster(ds, "site", 50, "id")
di <- trtAssign(di, 2, strata = "site", grp = "rx")
di <- addColumns(def_i, di)</code></pre>
</div>
<div id="estimation-using-stan" class="section level3">
<h3>Estimation using stan</h3>
<p>I am using <code>rstan</code> directly to sample from the posterior distribution, as opposed to using a more user-friendly package like <code>brms</code> or <code>rstanarm</code>. I’ve actually been warned against taking this approach by folks at Stan, because it can be more time-consuming and could lead to problems of the sort that I am showing you how to fix. However, I find building the model using stan code very satisfying and illuminating; this process has really given me a much better appreciation of Bayesian modeling. And besides, this model is odd enough that trying to shoehorn it into a standard <code>brms</code> model might be more trouble than it is worth.</p>
<pre class="r"><code>library(rstan)
library(ggplot2)
library(bayesplot)</code></pre>
<p>The stan code, which can reside in its own file, contains a number of <em>blocks</em> that define the model. The <code>data</code> block specifies the data that will be provided to the model; this can include summary data as well as raw data. The <code>parameters</code> block is where you specify the parameters of the model that need to be estimated. The <code>transformed parameters</code> block includes another set of parameters that will be estimated, but are a function of parameters defined in the previous block. And in this case, the last block is the <code>model</code> block, where prior distributions are specified along with the likelihood (outcome) model. Rather than walk you through the details here, I will let you study it a bit and see how this relates to the model I specified above.</p>
<pre class="stan"><code>data {
  int<lower=0> N;                // number of observations
  int<lower=0> C;                // number of control types
  int<lower=1> K;                // number of studies
  int y[N];                      // vector of categorical outcomes
  int<lower=1,upper=K> kk[N];    // site for individual
  int<lower=0,upper=1> ctrl[N];  // treatment or control
  int<lower=1,upper=C> cc[K];    // specific control for site
}

parameters {
  real alpha;               // overall intercept for treatment
  vector[K] beta_k;         // site-specific intercept
  real<lower=0> sigma_b;    // sd of site intercepts
  vector[K] delta_k;        // site-specific treatment effect
  real<lower=0> eta_0;      // sd of delta_k (around delta_c)
  vector[C] delta_c;        // control-specific effect
  real Delta;               // overall control effect
}

transformed parameters {
  vector[N] yhat;
  for (i in 1:N)
    yhat[i] = alpha + beta_k[kk[i]] + (ctrl[i] * (delta_k[kk[i]]));
}

model {
  // priors
  alpha ~ student_t(3, 0, 2.5);
  beta_k ~ normal(0, sigma_b);
  sigma_b ~ cauchy(0, 1);
  eta_0 ~ cauchy(0, 1);
  
  for (k in 1:K)
    delta_k[k] ~ normal(delta_c[cc[k]], eta_0);
  
  delta_c ~ normal(Delta, 0.5);
  Delta ~ normal(0, 10);
  
  // likelihood/outcome model
  y ~ bernoulli_logit(yhat);
}</code></pre>
<p>We need to compile the stan code so that it can be called from the <code>R</code> script:</p>
<pre class="r"><code>rt_c <- stanc("binary_outcome.stan")
sm_c <- stan_model(stanc_ret = rt_c, verbose=FALSE)</code></pre>
<p>And here is the <code>R</code> code that prepares the data for the stan program and then samples from the posterior distribution. In this case, I will be using 4 Monte Carlo chains of 2500 draws each (after allowing for 500 warm-up draws), so we will have an actual sample size of 10,000.</p>
<pre class="r"><code>N <- nrow(di)
C <- di[, length(unique(c_type))]
K <- di[, length(unique(site))]
y <- as.numeric(di$y)
kk <- di$site
ctrl <- di$rx
cc <- di[, .N, keyby = .(site, c_type)]$c_type

sampdat <- list(N = N, C = C, K = K, y = y, kk = kk, ctrl = ctrl, cc = cc)

fit_c <- sampling(
  sm_c, data = sampdat, iter = 3000, warmup = 500,
  show_messages = FALSE, cores = 4, refresh = 0,
  control = list(adapt_delta = 0.8)
)</code></pre>
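<p>Before turning to plots, it can be helpful to simply count the divergent transitions. This quick check is an addition of mine; it uses the same <code>bayesplot</code> sampler diagnostics extracted below:</p>
<pre class="r"><code>#--- total number of divergent transitions across all chains
np <- nuts_params(fit_c)
sum(subset(np, Parameter == "divergent__")$Value)</code></pre>
<p>A count of zero is what we want to see; anything more warrants the closer inspection that follows.</p>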
</div>
<div id="inspecting-the-posterior-distribution" class="section level3">
<h3>Inspecting the posterior distribution</h3>
<p>Assuming that everything has run smoothly, the first thing to do is to make sure that the MCMC algorithm adequately explored the full posterior distribution in the sampling process. We are typically interested in understanding the properties of the distribution, like the mean or median, or 95% credible intervals. To have confidence that these properties reflect the true posterior probabilities, we need to be sure that the sample they are drawn from is a truly representative one.</p>
<p>A quick and effective way to assess the “representativeness” of the sample is to take a look at the trace plot for a particular parameter, which literally tracks the path of the MCMC algorithm through the posterior distribution. Below, I’ve included two trace plots, one for <span class="math inline">\(\Delta\)</span>, the overall effect, and the other for <span class="math inline">\(\eta_0\)</span>, the variability of a study’s control effect <span class="math inline">\(\delta_k\)</span> around <span class="math inline">\(\delta_c\)</span>. On the left, the plots appear as they should, with lines jumping up and down. However, these particular plots include red indicators where the algorithm got stuck, where there were <em>divergent transitions</em>. We really don’t want to see any of these indicators, because that is a sign that our sample is not representative of the posterior distribution.</p>
<pre class="r"><code>posterior_c <- as.array(fit_c)
lp_c <- log_posterior(fit_c)
np_c <- nuts_params(fit_c)

color_scheme_set("mix-brightblue-gray")

mcmc_trace(posterior_c, pars = "Delta", np = np_c) +
  xlab("Post-warmup iteration")

mcmc_trace(
  posterior_c, pars = "Delta", np = np_c, window = c(1500, 1700)
) +
  xlab("Post-warmup iteration")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/trace_c.png" /></p>
<p>On the right in the figure above, I’ve zoomed in on iterations 1500 to 1700 to see if we can detect any patterns. And sure enough, we can. In the <span class="math inline">\(\Delta\)</span> plot, straight lines appear in the middle, evidence that the sampling for some of the chains did indeed get stuck. Likewise, the plot for <span class="math inline">\(\eta_0\)</span> shows flat lines near <span class="math inline">\(0\)</span>.</p>
<p>There’s an additional plot that shows the same thing but in a slightly more dramatic and comprehensive way. This plot (shown below) has a line for each step connecting the parameter estimates of that step. The red lines represent divergent transitions. The important thing to note here is that in all cases with divergent transitions, <span class="math inline">\(\eta_0\)</span> found itself close to <span class="math inline">\(0\)</span>. In other words, the sampling was getting stuck at this point, and this is the likely culprit for the sampling issues.</p>
<pre class="r"><code>color_scheme_set("darkgray")

parcoord_c <- mcmc_parcoord(
  posterior_c, np = np_c,
  pars = c("eta_0", "sigma_b", "alpha",
           "delta_c[1]", "delta_c[2]", "delta_c[3]", "Delta")
)

parcoord_c +
  scale_x_discrete(expand = c(0.01, 0.01)) +
  theme(panel.background = element_rect(fill = "grey90")) +
  ylim(-3, 3) +
  ggtitle("Original model specification")</code></pre>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/parc_c.png" id="id" class="class" style="width:75.0%;height:75.0%" /></p>
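<p>Beyond the visual checks, it can help to quantify the problem directly. <code>nuts_params()</code> returns a long-format data frame of sampler diagnostics with a <code>divergent__</code> indicator for each iteration. The sketch below mocks a tiny version of that structure (the column layout matches <code>bayesplot</code>’s output; the values are invented) just to show the tally:</p>

```r
# Mock of the nuts_params() output structure: one row per (chain, iteration,
# diagnostic). The Value entries here are made up for illustration.
np_mock <- data.frame(
  Chain     = rep(1:2, each = 4),
  Iteration = rep(1:4, times = 2),
  Parameter = "divergent__",
  Value     = c(0, 1, 0, 0, 0, 0, 1, 1)
)

divergences <- subset(np_mock, Parameter == "divergent__")
total_div   <- sum(divergences$Value)                            # 3 in this mock
by_chain    <- tapply(divergences$Value, divergences$Chain, sum) # per-chain tally
```

<p>Applied to the real <code>np_c</code> object created above, the same two lines report how many transitions diverged overall and per chain.</p>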
</div>
<div id="a-remedy-for-divergent-transitions" class="section level3">
<h3>A remedy for divergent transitions</h3>
<p>In a moment, I will provide a brief illustration of the perils of divergent transitions, but before that I want to describe “non-centered parameterization,” an approach that can be used to mitigate divergence. The idea is that since sampling from a standard Gaussian or normal distribution is less likely to lead to problematic transitions, we should try to do this as much as possible. “In a non-centered parameterization we do not try to fit the group-level parameters directly, rather we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation.” (See <a href="https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html" target="_blank">here</a> for the source of this quote and much more.)</p>
<p>For example, in the original model specification, we parameterized <span class="math inline">\(\delta_k\)</span> and <span class="math inline">\(\delta_c\)</span> as</p>
<p><span class="math display">\[
\delta_k \sim N(\delta_c, \eta_0) \\
\delta_c \sim N(\Delta, 0.5)
\]</span></p>
<p><span class="math display">\[
\eta_0 \sim Cauchy(0, 1) \\
\Delta \sim N(0, 10),
\]</span></p>
<p>whereas using “non-centered” parameterization, we would incorporate two latent standard normal variables <span class="math inline">\(\theta_{rx}\)</span> and <span class="math inline">\(\theta_c\)</span> into the model. <span class="math inline">\(\delta_k\)</span> and <span class="math inline">\(\delta_c\)</span> have the same prior distributions as in the original model, but we are now sampling from standard normal prior distributions:
<span class="math display">\[
\delta_k = \delta_c + \eta_0 \theta_{rx} \\
\theta_{rx} \sim N(0, 1)
\]</span></p>
<p><span class="math display">\[
\delta_c = \Delta + 0.5\theta_c\\
\theta_c \sim N(0, 1)
\]</span></p>
<p><span class="math display">\[
\eta_0 \sim Cauchy(0, 1) \\
\Delta \sim N(0, 10).
\]</span></p>
<p>This transformation makes the path through the posterior distribution much smoother and, at least in this case, eliminates the divergent transitions.</p>
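<p>To see concretely why the re-parameterization leaves the model unchanged, here is a small base <code>R</code> sketch (with arbitrary values for <span class="math inline">\(\delta_c\)</span> and <span class="math inline">\(\eta_0\)</span>) showing that scaling and shifting a standard normal draw recovers draws from <span class="math inline">\(N(\delta_c, \eta_0)\)</span>:</p>

```r
# Draws from N(delta_c, eta_0) two ways: directly ("centered") and via a
# latent standard normal that is scaled and shifted ("non-centered").
set.seed(123)

eta_0   <- 0.3
delta_c <- 1.5

centered     <- rnorm(1e5, mean = delta_c, sd = eta_0)
theta_rx     <- rnorm(1e5, mean = 0, sd = 1)   # latent N(0, 1) variable
non_centered <- delta_c + eta_0 * theta_rx     # scale + shift

round(c(mean(centered), mean(non_centered)), 2)  # both close to 1.5
round(c(sd(centered), sd(non_centered)), 2)      # both close to 0.3
```

<p>The sampler only ever has to explore the well-behaved standard normal geometry; the group-level parameters are recovered deterministically.</p>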
<p>Here is the stan code using non-centered parameterization (again, feel free to linger and study):</p>
<pre class="stan"><code>data {
  int<lower=0> N;                // number of observations
  int<lower=0> C;                // number of control types
  int<lower=1> K;                // number of studies
  int y[N];                      // vector of categorical outcomes
  int<lower=1,upper=K> kk[N];    // site for individual
  int<lower=0,upper=1> ctrl[N];  // treatment or control
  int<lower=1,upper=C> cc[K];    // specific control for site
}

parameters {
  real alpha;                    // overall intercept for treatment
  real<lower=0> sigma_b;
  real<lower=0> eta_0;           // sd of delta_k (around delta_c)
  real Delta;                    // overall control effect

  // non-centered parameterization
  vector[K] z_ran_rx;            // latent site-level treatment effects
  vector[K] z_ran_int;           // latent site-level intercepts
  vector[C] z_ran_c;             // latent control-type effects
}

transformed parameters{
  vector[N] yhat;
  vector[K] beta_k;
  vector[K] delta_k;             // site-specific treatment effect
  vector[C] delta_c;

  beta_k = sigma_b * z_ran_int + alpha;

  for (i in 1:C)
    delta_c[i] = 0.5 * z_ran_c[i] + Delta;

  for (i in 1:K)
    delta_k[i] = eta_0 * z_ran_rx[i] + delta_c[cc[i]];

  for (i in 1:N)
    yhat[i] = beta_k[kk[i]] + ctrl[i] * delta_k[kk[i]];
}

model {
  // priors
  alpha ~ student_t(3, 0, 2.5);
  z_ran_c ~ std_normal();
  z_ran_int ~ std_normal();
  z_ran_rx ~ std_normal();
  sigma_b ~ cauchy(0, 1);
  eta_0 ~ cauchy(0, 1);
  Delta ~ normal(0, 10);

  // outcome model
  y ~ bernoulli_logit(yhat);
}</code></pre>
<p>Looking at the trace plots from the non-centered model makes it clear that divergent transitions are no longer a problem. There are no red indicators, and the patterns of straight lines have been eliminated:</p>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/trace_nc.png" /></p>
</div>
<div id="proceed-with-caution-if-you-ignore-the-divergence-warnings" class="section level3">
<h3>Proceed with caution if you ignore the divergence warnings</h3>
<p>If we don’t heed the warnings, how bad can things be? Well, it will probably depend on the situation, but after exploring with multiple data sets, I have convinced myself that it is probably a good idea to reduce the number of divergent transitions to as close to 0 as possible.</p>
<p>I conducted an experiment by generating 100 data sets and fitting a model using both the original and non-centered parameterizations. I collected the posterior distribution for each data set and estimation method, and plotted the density curves. (In case you are interested, I used a parallel process running on a high-performance computing core to do this; running on my laptop, this would have taken about 5 hours, but on the HPC it ran in under 15 minutes.) The purpose of this was to explore the shapes of the densities across the different data sets. I know it is a little odd to use this frequentist notion of repeatedly sampling datasets to evaluate the performance of these two approaches, but I find it to be illuminating. (If you’re interested in the code for that, let me know.)</p>
<p>Below on the right, the plot of the posteriors from the non-centered parameterization shows variability in location, but is otherwise remarkably consistent in shape and scale. On the left, posterior densities show much more variation; some are quite peaked and others are even bi-modal. (While I am not showing this here, the densities from samples with more divergent transitions tend to diverge the most from the well-behaved densities on the right.)</p>
<p>Although the mean or median estimate from a divergent sample may not differ much from its non-divergent counterpart, the broader description of the distribution can be quite far off the mark, making it likely that inferences will be misleading as well.</p>
<p><img src="https://www.rdatagen.net/img/post-bayesdiag/post_plot.png" /></p>
</div>
Generating data from a truncated distribution
https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/
Tue, 18 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>A researcher reached out to me the other day to see if the <code>simstudy</code> package provides a quick and easy way to generate data from a truncated distribution. Other than the <code>noZeroPoisson</code> distribution option (which is a <em>very</em> specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy. I’ve actually had it in the back of my mind to add something like this to <code>simstudy</code>, but have hesitated because it might mean changing (or at least adding to) the <code>defData</code> table structure.</p>
<p>However, it may be time to go for it. The process and coding are actually relatively straightforward, so there is no real reason not to. I have developed a simple prototype for several probability distributions (though the concept can easily be applied to any distribution where the cumulative distribution function, or CDF, is readily accessible), and I am sharing it here in case you need to do this before it is available in the package, or if you just want to implement it yourself.</p>
<div id="what-is-a-truncated-distribution" class="section level3">
<h3>What is a truncated distribution?</h3>
<p>A truncated probability distribution is one derived from limiting the domain of an existing distribution. A picture is worth a thousand words. On the left, we have a histogram for 10,000 observations drawn from a full (non-truncated) <strong>Gaussian</strong> or <strong>normal</strong> distribution with mean 0 and standard deviation 3. In the middle, the histogram represents data drawn from the positive portion of the same distribution (i.e. truncated on the left at 0). And on the far right, the truncation is defined by the boundaries <span class="math inline">\((-3, 3.5)\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-2-1.png" width="768" /></p>
</div>
<div id="leveraging-the-uniform-distribution-and-a-cdf" class="section level3">
<h3>Leveraging the uniform distribution and a CDF</h3>
<p>A while back, I <a href="https://www.rdatagen.net/post/correlated-data-copula/" target="blank">described</a> a <em>copula</em> approach to generating correlated data from different distributions (ultimately implemented in functions <code>genCorGen</code> and <code>addCorGen</code>). I wrote about combining a draw from a uniform distribution with the CDF of any target distribution to facilitate random number generation from the target distribution. This approach works well for truncated distributions as well, where the truncated distribution is the target.</p>
<p>Again - visuals help to explain how this works. To start, here are several CDFs of normal distributions with different means and variances:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>The CDF of a distribution (usually written as <span class="math inline">\(F(x)\)</span>) effectively defines that distribution: <span class="math inline">\(F(x) = P(X \le x)\)</span>. Since probabilities by definition range from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>, we know that <span class="math inline">\(F(x)\)</span> also ranges from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>. It is also the case that <span class="math inline">\(F(x)\)</span> is monotonically increasing (or at least non-decreasing) from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>.</p>
<p>Let’s say we want to generate a draw from <span class="math inline">\(N(\mu = 0, \sigma = 3)\)</span> using the CDF. We can first generate a draw from <span class="math inline">\(u = Uniform(0,1)\)</span>. We then treat <span class="math inline">\(u\)</span> as a value of the CDF, and map it back to <span class="math inline">\(x\)</span> to get our draw from the target distribution. So, <span class="math inline">\(x = F^{-1}(u)\)</span>. In <code>R</code>, the inverse CDF for the normal distribution is provided by the <code>qnorm</code> function, where the first argument is a probability value between <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>. This would be the <code>R</code> code to generate a single draw from <span class="math inline">\(N(0, 3)\)</span> using a random draw from <span class="math inline">\(Uniform(0, 1)\)</span>:</p>
<pre class="r"><code>(u <- runif(1))</code></pre>
<pre><code>## [1] 0.9</code></pre>
<pre class="r"><code>qnorm(u, mean = 0, sd = 3)</code></pre>
<pre><code>## [1] 3.9</code></pre>
<p>This is how <span class="math inline">\(u = 0.9\)</span> relates to the draw of <span class="math inline">\(x=3.9\)</span>:</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>To generate a random sample of 10,000 draws from <span class="math inline">\(N(0, 3)\)</span>, this process is replicated 10,000 times:</p>
<pre class="r"><code>library(ggplot2)

u <- runif(10000)
x <- qnorm(u, mean = 0, sd = 3)

ggplot(data = data.frame(x), aes(x = x)) +
  geom_histogram(fill = "#CCC591", alpha = 1, binwidth = .2, boundary = 0) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="extending-the-inverse-process-to-generate-truncation" class="section level3">
<h3>Extending the inverse process to generate truncation</h3>
<p>Let’s say we are only interested in generating data from the middle portion of the <span class="math inline">\(N(0,3)\)</span> distribution, between <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>. The trick is to use the corresponding CDF values, <span class="math inline">\(F(a)\)</span> and <span class="math inline">\(F(b)\)</span>, as the basis of the randomization.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>To generate data within the constraints <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>, all we would need to do is generate a value from the uniform distribution with minimum equal to <span class="math inline">\(F(a)\)</span> and maximum <span class="math inline">\(F(b)\)</span>. We then conduct the mapping as we did before when drawing from the full distribution. By constraining <span class="math inline">\(u\)</span> to be between <span class="math inline">\(F(a)\)</span> and <span class="math inline">\(F(b)\)</span>, we force the values of the target distribution to lie between <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>.</p>
<p>Now, we are ready to create a simple function <code>rnormt</code> that implements this idea. The <code>pnorm</code> function provides the CDF at a particular value:</p>
<pre class="r"><code>rnormt <- function(n, range, mu, s = 1) {

  # range is a vector of two values

  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)

  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
}</code></pre>
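<p>As a quick sanity check, every draw from the truncated generator should respect the bounds (the function definition is repeated so the chunk runs on its own):</p>

```r
# rnormt as defined above, repeated here so this chunk is self-contained
rnormt <- function(n, range, mu, s = 1) {
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
}

set.seed(3)
x <- rnormt(10000, c(-3, 3.5), mu = 0, s = 3)

range(x)  # all draws lie inside (-3, 3.5)
mean(x)   # pulled slightly above 0, since more mass is cut below -3 than above 3.5
```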
<p>Here, I am generating the data plotted above, showing the code this time around.</p>
<pre class="r"><code>library(data.table)
library(simstudy)
library(paletteer)

defC <- defCondition(condition = "tt == 1",
  formula = "rnormt(10000, c(-Inf, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 2",
  formula = "rnormt(10000, c(0, Inf), mu = 0, s = 3)")
defC <- defCondition(defC, "tt == 3",
  formula = "rnormt(10000, c(-3, 3.5), mu = 0, s = 3)")

dd <- genData(30000)
dd <- trtAssign(dd, nTrt = 3, grpName = "tt")
dd <- addCondition(defC, dd, "x")

dd[, tt := factor(tt,
  labels = c("No truncation", "Left truncation at 0", "Left and right truncation"))]

ggplot(data = dd, aes(x = x, group = tt)) +
  geom_histogram(aes(fill = tt), alpha = 1, binwidth = .2, boundary = 0) +
  facet_grid(~ tt) +
  theme(panel.grid = element_blank(),
        axis.title = element_blank(),
        legend.position = "none") +
  scale_fill_paletteer_d("wesanderson::Moonrise2")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-9-1.png" width="768" /></p>
</div>
<div id="going-beyond-the-normal-distribution" class="section level3">
<h3>Going beyond the normal distribution</h3>
<p>With this simple approach, it is possible to generate a truncated distribution using any distribution available in <code>R</code>. Here is another example that allows us to generate truncated data from a <strong>gamma</strong> distribution:</p>
<pre class="r"><code>rgammat <- function(n, range, shape, scale = 1) {

  F.a <- pgamma(min(range), shape = shape, scale = scale)
  F.b <- pgamma(max(range), shape = shape, scale = scale)

  u <- runif(n, min = F.a, max = F.b)
  qgamma(u, shape = shape, scale = scale)
}</code></pre>
<p>To conclude, here is a plot of gamma-based distributions using <code>rgammat</code>. And I’ve added similar plots for <strong>beta</strong> and <strong>Poisson</strong> distributions - I’ll leave it to you to write the functions. But, if you don’t want to do that, <code>simstudy</code> will be updated at some point soon to help you out.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-18-generating-data-from-a-truncated-distribution.en_files/figure-html/unnamed-chunk-11-1.png" width="768" /></p>
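<p>In case it is helpful, here is one possible sketch of those two functions (my own versions, not code from <code>simstudy</code>). The beta case mirrors <code>rnormt</code> exactly; the Poisson case needs one adjustment because the distribution is discrete: starting the uniform draw at <span class="math inline">\(P(X \le a - 1)\)</span> keeps the lower bound <span class="math inline">\(a\)</span> itself in the support:</p>

```r
# Truncated beta: identical pattern to rnormt, swapping in pbeta/qbeta
rbetat <- function(n, range, shape1, shape2) {
  F.a <- pbeta(min(range), shape1 = shape1, shape2 = shape2)
  F.b <- pbeta(max(range), shape1 = shape1, shape2 = shape2)
  u <- runif(n, min = F.a, max = F.b)
  qbeta(u, shape1 = shape1, shape2 = shape2)
}

# Truncated Poisson: start at P(X <= a - 1) so that a itself can be drawn
rpoist <- function(n, range, lambda) {
  F.a <- ppois(min(range) - 1, lambda = lambda)
  F.b <- ppois(max(range), lambda = lambda)
  u <- runif(n, min = F.a, max = F.b)
  qpois(u, lambda = lambda)
}

x <- rpoist(10000, c(1, 5), lambda = 3)
range(x)  # draws are confined to the integers 1 through 5
```

<p>The same pattern extends to any distribution with <code>p*</code> and <code>q*</code> functions in <code>R</code>.</p>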
</div>
A hurdle model for COVID-19 infections in nursing homes
https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/
Tue, 04 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Late last <a href="https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/" target="blank">year</a>, I added a <em>mixture</em> distribution to the <code>simstudy</code> package, largely motivated by the desire to accommodate <em>zero-inflated</em> Poisson or negative binomial distributions. (I really thought I had added this two years ago - but time is moving so slowly these days.) These distributions are useful for modeling count data when we anticipate observing more zeros than would arise from a non-inflated (i.e. “regular”) Poisson or negative binomial distribution.</p>
<p>There is a related, though subtly different model - the <em>hurdle</em> model - that I want to talk about here, because it has come up in the context of planning a new study to address COVID-19 infection rates in nursing homes using much more aggressive testing strategies.</p>
<p>A hurdle model is a modified count model that also assumes a relatively high frequency of zeros, but it is set up as a two-stage data generation process rather than as a mixture distribution. In the first stage, the development of any (i.e. at least 1) new infections during the observation period is described by a binomial probability distribution. If there is at least one infection, the process has transitioned over the “hurdle” into the second stage; here, the number of new infections is described by a non-zero count distribution, such as a truncated Poisson or truncated negative binomial distribution.</p>
<div id="model-specification" class="section level3">
<h3>Model specification</h3>
<p>If <span class="math inline">\(N_i\)</span> is the number of new infections in nursing home <span class="math inline">\(i\)</span> over the observation period and we have two intervention arms <span class="math inline">\(T_i \in \{0,1\}\)</span>, the intervention effect at each stage of the process can be modeled simply as:</p>
<p><span class="math display">\[
\text{logodds}\left[ P(N_i > 0) \right] = \beta_0 + \beta_1 T_i + \boldsymbol{X_i \beta_2}\]</span></p>
<p><span class="math display">\[\text{log}\left[ E(N_i \; | \; N_i \ge 1) \right] = \alpha_0 + \alpha_1 T_i + \boldsymbol{X_i \alpha_2} + \text{log}(D_i)\]</span></p>
<p>The intervention effect for the binomial stage is <span class="math inline">\(\beta_1\)</span> (on the logodds scale) and the intervention effect for the hurdle (count) stage is <span class="math inline">\(\alpha_1\)</span> (on the log scale). <span class="math inline">\(\boldsymbol{X_i}\)</span> are any covariates that are used for stratified randomization.</p>
<p><span class="math inline">\(D_i\)</span> is the number of resident-days observed during the follow-up period, and <span class="math inline">\(\text{log}(D_i)\)</span> is the “offset”; we are effectively modeling a rate of infections <span class="math inline">\(\text{log}(N_i/D_i)\)</span>. This will take into account the fact that residents will be observed for different lengths of time - some moving into the nursing home after the study has started, and others leaving or dying before the study is complete.</p>
</div>
<div id="simulating-a-hurdle-model" class="section level3">
<h3>Simulating a hurdle model</h3>
<p>Simulating data from this model is relatively straightforward, complicated only by the need to generate varying observation periods. Essentially, we must generate two outcomes - a binary outcome and a non-zero count outcome (in this case it will be from a non-zero Poisson distribution), and the observed outcome is 0 if the binary outcome is actually 0, and the value of the count outcome if the binary outcome is 1.</p>
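<p>Before turning to <code>simstudy</code>, the bare mechanics of the two-stage draw can be sketched in a few lines of base <code>R</code> (a toy version with arbitrary parameters, stripped of the covariates and offset used below):</p>

```r
set.seed(101)
n <- 1000

# stage 1: does the nursing home clear the "hurdle" (any infection at all)?
cleared <- rbinom(n, size = 1, prob = 0.95)

# stage 2: a zero-truncated Poisson count, via the inverse-CDF trick
u <- runif(n, min = ppois(0, lambda = 20), max = 1)
count <- qpois(u, lambda = 20)   # always >= 1

# observed outcome: 0 if the hurdle is not cleared, the count otherwise
y <- cleared * count

mean(y == 0)  # roughly 0.05, the probability of failing to clear the hurdle
```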
<p>To get things going, here are the packages I will use. The <code>pscl</code> package provides the function <code>hurdle</code> to estimate the model parameters from our simulated data, and the <code>stargazer</code> package outputs the models in a nice, readable format.</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(pscl)
library(stargazer)</code></pre>
<div id="data-generation" class="section level4">
<h4>Data generation</h4>
<p>In this simulation, the average observation time is 80 days (out of 90 maximum), and on average, each nursing home will have 100 residents. In the control arm, 95% of the nursing homes will have at least one infection, and 80% of the intervention arm will have at least one. The corresponding odds ratio is <span class="math inline">\((0.80/0.20)/(0.95/0.05) = 0.21\)</span>.</p>
<p>The infection rate per 1000 resident-days for the control arm will be <span class="math inline">\(\sim (20/8000)*1000 = 2.5\)</span>; for the intervention arm, the rate will be <span class="math inline">\(\sim (20/8000) *0.8 * 1000 = 2.0\)</span>.</p>
<p>Here is the data definition table <code>defHurdle</code> created by the function <code>defDataAdd</code> that encodes these assumptions:</p>
<pre><code>##    varname                             formula variance          dist     link
## 1:    nRes                                 100        0       poisson identity
## 2:   aDays                                  80        0       poisson identity
## 3:   nDays                     pmin(90, aDays)        0     nonrandom identity
## 4:   pDays                        nRes * nDays        0     nonrandom identity
## 5:    xBin                    0.95 - 0.15 * rx        0        binary identity
## 6:    xCnt log(20/8000)+log(0.8)*rx+log(pDays)        0 noZeroPoisson      log
## 7:       y                         xBin * xCnt        0     nonrandom identity</code></pre>
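<p>The printed table above implies a definition block along these lines (my reconstruction from the table, not necessarily the author's original code):</p>

```r
library(simstudy)

# Rebuild the defHurdle definition table row by row, matching the
# printed output above. This is an inferred reconstruction.
defHurdle <- defDataAdd(varname = "nRes", formula = 100, dist = "poisson")
defHurdle <- defDataAdd(defHurdle, varname = "aDays", formula = 80,
                        dist = "poisson")
defHurdle <- defDataAdd(defHurdle, varname = "nDays",
                        formula = "pmin(90, aDays)", dist = "nonrandom")
defHurdle <- defDataAdd(defHurdle, varname = "pDays",
                        formula = "nRes * nDays", dist = "nonrandom")
defHurdle <- defDataAdd(defHurdle, varname = "xBin",
                        formula = "0.95 - 0.15 * rx", dist = "binary")
defHurdle <- defDataAdd(defHurdle, varname = "xCnt",
                        formula = "log(20/8000) + log(0.8) * rx + log(pDays)",
                        dist = "noZeroPoisson", link = "log")
defHurdle <- defDataAdd(defHurdle, varname = "y", formula = "xBin * xCnt",
                        dist = "nonrandom")
```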
<p>The data generation is only at the nursing home level. In this example, we are assuming 500 nursing homes:</p>
<pre class="r"><code>set.seed(29211)
dx <- genData(500)
dx <- trtAssign(dx, grpName = "rx")
dx <- addColumns(defHurdle, dx)
dx</code></pre>
<pre><code>##       id rx nRes aDays nDays pDays xBin xCnt  y
##   1:   1  1  113    86    86  9718    1   16 16
##   2:   2  0   89    66    66  5874    1   16 16
##   3:   3  1   83    82    82  6806    1   13 13
##   4:   4  0   91    95    90  8190    1   27 27
##   5:   5  1   97    70    70  6790    0   17  0
##  ---                                           
## 496: 496  1  116    85    85  9860    0   17  0
## 497: 497  1   89    94    90  8010    1   14 14
## 498: 498  0  112    92    90 10080    1   20 20
## 499: 499  1   97    71    71  6887    1   21 21
## 500: 500  0   92    68    68  6256    1   13 13</code></pre>
</div>
<div id="data-visualization" class="section level4">
<h4>Data visualization</h4>
<p>A plot of the data shows the effect at each stage of the hurdle process:</p>
<pre class="r"><code>dx[, rate1000 := (y/pDays)*1000]
dx[, rx := factor(rx, labels = c("No intervention", "Intervention"))]

dm <- dx[rate1000 != 0, .(mu = mean(rate1000)), keyby = rx]

ggplot(data = dx, aes(x = rate1000)) +
  geom_vline(aes(xintercept = mu), data = dm, lty = 3, color = "grey50") +
  geom_histogram(binwidth = .1,
                 aes(y = (..count..)/sum(..count..), fill = rx)) +
  facet_grid(rx ~ .) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  scale_y_continuous(labels = scales::percent,
                     name = "proportion",
                     expand = c(0, 0),
                     breaks = c(.05, .10),
                     limits = c(0, .11)) +
  scale_x_continuous(name = "# infections per 1000 resident-days") +
  scale_fill_manual(values = c("#EDCB64", "#B62A3D")) </code></pre>
<p><img src="https://www.rdatagen.net/post/2020-08-04-a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations.en_files/figure-html/unnamed-chunk-5-1.png" width="480" /></p>
</div>
<div id="parameter-estimation" class="section level4">
<h4>Parameter estimation</h4>
<p>I fit two models here. The first includes a possible intervention effect, and the second assumes no intervention effect. The purpose of fitting the second model is to provide a basis for comparison.</p>
<pre class="r"><code>hfit1 <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx)
hfit1.0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)</code></pre>
<p>The hurdle model returns two sets of estimates. The first component of the model shown here is the binomial model. The estimated intervention effect (odds ratio) is <span class="math inline">\(exp(-1.570) = 0.21\)</span>, as expected. Note that the log-likelihood reported here is for the composite hurdle model (both stages).</p>
<pre class="r"><code>stargazer(hfit1, hfit1.0, type = "text", zero.component = TRUE,
          notes = " ", notes.append = FALSE, notes.label = "",
          dep.var.labels.include = FALSE, dep.var.caption = "",
          omit.stat = "n", object.names = TRUE, model.numbers = FALSE)</code></pre>
<pre><code>## 
## ====================================
##                   hfit1      hfit1.0
## ------------------------------------
## rxIntervention  -1.570***           
##                 (0.325)             
## 
## Constant         2.900***   1.900***
##                 (0.285)    (0.133)  
## 
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
## </code></pre>
<p>The second component is the count model. The estimated intervention effect is <span class="math inline">\(exp(-0.279) = 0.76\)</span>, which is close to the true value of <span class="math inline">\(0.80\)</span>. (The reported log-likelihoods are the same as in the binomial model.)</p>
<pre class="r"><code>stargazer(hfit1, hfit1.0, type = "text", zero.component = FALSE,
          notes = " ", notes.append = FALSE, notes.label = "",
          dep.var.labels.include = FALSE, dep.var.caption = "",
          omit.stat = "n", object.names = TRUE, model.numbers = FALSE)</code></pre>
<pre><code>## 
## ====================================
##                   hfit1      hfit1.0
## ------------------------------------
## rxIntervention  -0.279***           
##                 (0.023)             
## 
## Constant        -5.980***  -6.090***
##                 (0.014)    (0.011)  
## 
## ------------------------------------
## Log Likelihood -1,424.000 -1,511.000
## ====================================
## </code></pre>
<p>In this particular case, the intervention alters both the binomial probability and the count distribution, but that will not necessarily always be the case. A log-likelihood ratio test (LRT) is a global test that compares the model that explicitly excludes an intervention effect (<code>hfit1.0</code>) with the model that includes an intervention effect. If the likelihoods under each are close enough, then the model that excludes the intervention effect is considered sufficient, and there is no reason to conclude that the intervention is effective. We can use the p-value based on the LRT as a measure of whether or not the intervention is generally effective, either because it changes the binomial probability, the count distribution, or both.</p>
<p>In this case, the p-value is quite low:</p>
<pre class="r"><code>lrt1 <- -2*(logLik(hfit1.0) - logLik(hfit1))
1 - pchisq(lrt1, 2)</code></pre>
<pre><code>## 'log Lik.' 0 (df=2)</code></pre>
</div>
</div>
<div id="alternative-scenarios" class="section level3">
<h3>Alternative scenarios</h3>
<p>Here are three additional scenarios that provide examples of ways the intervention can affect the outcome. In Scenario 2, the intervention no longer has an effect on the probability of having at least one infection, but still has an effect on the count. In Scenario 3, the intervention <em>only</em> affects the probability of having at least one infection, and not the count distribution. And in Scenario 4, the intervention has no effect at all at either stage.</p>
<pre class="r"><code>defHurdle.V2 <- updateDef(defHurdle, "xBin", "0.95")
defHurdle.V3 <- updateDef(defHurdle, "xCnt", "log(20/8000) + log(pDays)")
defHurdle.V4 <- updateDef(defHurdle.V3, "xBin", "0.95")</code></pre>
<p>The plots bear out the underlying parameters. We can see the probability of a zero is the same across treatment arms in Scenario 2, just as the distributions of the count variable in Scenario 3 appear equivalent. In Scenario 4, it is hard to distinguish between the two distributions across interventions.</p>
<p><img src="https://www.rdatagen.net/post/2020-08-04-a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations.en_files/figure-html/unnamed-chunk-14-1.png" width="960" /></p>
<p>Here are the model fits - the results are consistent with the plots:</p>
<pre><code>## 
## ================================================================================
##                   hfit2      hfit2.0    hfit3      hfit3.0    hfit4      hfit4.0
## --------------------------------------------------------------------------------
## rxIntervention   -0.099                -0.871***              0.058             
##                  (0.446)               (0.287)               (0.342)            
## 
## Constant          3.180***   3.130***   2.440***   1.940***   2.500***   2.530***
##                  (0.323)    (0.223)    (0.233)    (0.135)    (0.239)    (0.171)  
## 
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
## </code></pre>
<pre><code>## 
## ================================================================================
##                   hfit2      hfit2.0    hfit3      hfit3.0    hfit4      hfit4.0
## --------------------------------------------------------------------------------
## rxIntervention   -0.210***              0.010                -0.029             
##                  (0.022)               (0.022)               (0.021)            
## 
## Constant         -6.010***  -6.110***  -6.000***  -6.000***  -5.980***  -6.000***
##                  (0.015)    (0.011)    (0.015)    (0.011)    (0.015)    (0.010)  
## 
## --------------------------------------------------------------------------------
## Log Likelihood -1,443.000 -1,489.000 -1,458.000 -1,463.000 -1,463.000 -1,464.000
## ================================================================================
## </code></pre>
<p>And finally, the p-values from the LRTs of the models under each of the three scenarios are consistent with the underlying data generating processes. It is only in the last scenario where there is no reason to believe that the intervention has some sort of effect.</p>
<pre class="r"><code>round(c(lrt2 = 1 - pchisq(lrt2, 2),
        lrt3 = 1 - pchisq(lrt3, 2),
        lrt4 = 1 - pchisq(lrt4, 2)), 4)</code></pre>
<pre><code>## lrt2 lrt3 lrt4
## 0.0000 0.0067 0.3839</code></pre>
<p>
<small><font color="darkkhaki">
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/" target="blank">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
<div id="addendum---estimating-power" class="section level3">
<h3>Addendum - estimating power</h3>
<p>If you’ve visited my blog before, you might have <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/" target="blank">picked up</a> on the fact that I like to use simulation to estimate sample size or power when planning a randomized trial. This allows me to be sure everyone understands the assumptions.</p>
<p>To estimate power, I generate multiple data sets under a specific set of assumptions and estimate intervention effects for each data set. The power of the study under this set of assumptions is the proportion of times we would conclude that the intervention is effective. In the context of a hurdle model, I use the p-value from the LRT as the arbiter of effectiveness; the proportion of p-values less than 0.05 is the power.</p>
<pre class="r"><code>gData <- function(n, def) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addColumns(def, dx)
  dx[]
}

estModel <- function(dx) {
  
  hfit <- hurdle(y ~ rx | rx, offset = log(pDays), data = dx)
  hfit0 <- hurdle(y ~ 1 | 1, offset = log(pDays), data = dx)
  lrt <- -2*(logLik(hfit0) - logLik(hfit))
  
  data.table(p.zero = coef(summary(hfit))$zero["rx", "Pr(>|z|)"],
             p.count = coef(summary(hfit))$count["rx", "Pr(>|z|)"],
             X2 = 1 - pchisq(lrt, 2))
}

iter <- function(n, defHurdle, i) {
  
  dx <- gData(n, defHurdle)
  hfit <- estModel(dx)
  
  return(data.table(i = i, hfit))
}

diter <- rbindlist(lapply(1:1000, function(i) iter(50, defHurdle, i)))</code></pre>
<p>Here are the results from the individual replications under the Scenario 1 effect assumptions with 50 nursing homes:</p>
<pre class="r"><code>diter</code></pre>
<pre><code>##          i p.zero  p.count       X2
##    1:    1 0.9975 4.06e-04 0.000437
##    2:    2 0.0449 1.05e-03 0.000216
##    3:    3 0.0713 5.92e-03 0.002246
##    4:    4 0.0449 5.85e-04 0.000128
##    5:    5 0.1891 3.20e-02 0.034025
##   ---                              
##  996:  996 0.3198 7.04e-03 0.014600
##  997:  997 0.1891 1.13e-02 0.013579
##  998:  998 0.3198 8.16e-04 0.001973
##  999:  999 1.0000 4.45e-06 0.000023
## 1000: 1000 0.5590 2.34e-03 0.007866</code></pre>
<p>And here is the estimate of power - in this case, there is about 90% power to conclude that there is an effect of some type, given the assumptions under Scenario 1:</p>
<pre class="r"><code>diter[, mean(X2 <= 0.05)]</code></pre>
<pre><code>## [1] 0.898</code></pre>
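<p>One caveat worth keeping in mind: the power estimate is itself based on a finite number of replications, so it carries its own Monte Carlo error. A quick back-of-the-envelope sketch (plugging in the estimate of 0.898 and the 1,000 replications from above) suggests the estimate is good to about a percentage point:</p>
<pre class="r"><code># Monte Carlo standard error of a simulation-based power estimate:
# the estimate is a proportion, so se = sqrt(p * (1 - p) / n.reps)

p.hat <- 0.898   # estimated power from the 1,000 replications above
n.reps <- 1000

mc.se <- sqrt(p.hat * (1 - p.hat) / n.reps)
round(mc.se, 3)                            # 0.01
round(p.hat + c(-1.96, 1.96) * mc.se, 3)   # approximate 95% CI: 0.879 to 0.917</code></pre>
<p>If that margin of error seems too wide for a final sample size decision, the fix is simply to increase the number of replications.</p>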
<p>In conclusion, here is a power plot for a range of effect size, sample size, and control arm assumptions. In all of these cases, I assumed that the binomial probability under the control condition would be 70%. (If anyone wants to see the code for generating the data and the plot, I can post it on GitHub; it is really just an extension of what is shown here.)</p>
<p><img src="https://www.rdatagen.net/img/post-hurdle/power.png" id="id" class="class" style="width:95.0%;height:95.0%" /></p>
</div>
A Bayesian model for a simulated meta-analysis
https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/
Tue, 21 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>This is essentially an addendum to the previous <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="blank">post</a> where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the <code>nlme</code> package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package <code>rstan</code>.</p>
<div id="create-the-data-set" class="section level3">
<h3>Create the data set</h3>
<p>We’ll use the exact same data generating process as <a href="https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/" target="blank">described</a> in some detail in the previous post.</p>
<pre class="r"><code>library(simstudy)
library(rstan)
library(data.table)</code></pre>
<pre class="r"><code>defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist = "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study", 
    formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
    dist = "poisson")

defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)

ds <- genData(12, defS)
dc <- genCluster(ds, "study", "n.study", "id")
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)

d.obs <- dc[, .(study, id, x, y)]</code></pre>
</div>
<div id="build-the-stan-model" class="section level3">
<h3>Build the Stan model</h3>
<p>There are multiple ways to estimate a <code>Stan</code> model in <code>R</code>, but I chose to write the Stan code directly rather than use the <code>brms</code> or <code>rstanarm</code> packages. In the Stan code, we need to define the data structure, specify the parameters, specify any transformed parameters (which are just functions of the parameters), and then build the model - which includes laying out the prior distributions as well as the likelihood.</p>
<p>In this case, the model is slightly different from what was presented in the context of a mixed effects model. This is the mixed effects model:</p>
<p><span class="math display">\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\
\\
\delta_k = \delta_0 + v_k \\
e_{ik} \sim N(0, \sigma_k^2), v_k \sim N(0,\tau^2)
\]</span>
In this Bayesian model, things are pretty much the same:
<span class="math display">\[ y_{ik} \sim N(\alpha_k + \delta_k x_{ik}, \sigma_k^2) \\
\\
\delta_k \sim N(\Delta, \tau^2)
\]</span></p>
<p>The key difference is that there are prior distributions on <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\tau\)</span>, introducing an additional level of uncertainty into the estimate. I would expect that the estimate of the overall treatment effect <span class="math inline">\(\Delta\)</span> will have a wider 95% CI (credible interval in this context) than the 95% CI (confidence interval) for <span class="math inline">\(\delta_0\)</span> in the mixed effects model. This added measure of uncertainty is a strength of the Bayesian approach.</p>
<pre class="stan"><code>data {
  int<lower=0> N;               // number of observations
  int<lower=1> K;               // number of studies
  real y[N];                    // vector of continuous outcomes
  int<lower=1,upper=K> kk[N];   // study for individual
  int<lower=0,upper=1> x[N];    // treatment arm for individual
}

parameters {
  vector[K] beta;               // study-specific intercept
  vector[K] delta;              // study effects
  real<lower=0> sigma[K];       // sd of outcome dist - study specific
  real Delta;                   // average treatment effect
  real<lower=0> tau;            // variation of treatment effects
}

transformed parameters {
  vector[N] yhat;
  for (i in 1:N)
    yhat[i] = beta[kk[i]] + x[i] * delta[kk[i]];
}

model {
  // priors
  sigma ~ normal(0, 2.5);
  beta ~ normal(0, 10);
  tau ~ normal(0, 2.5);
  Delta ~ normal(0, 10);
  delta ~ normal(Delta, tau);
  
  // outcome model
  for (i in 1:N)
    y[i] ~ normal(yhat[i], sigma[kk[i]]);
}</code></pre>
</div>
<div id="generate-the-posterior-distributions" class="section level3">
<h3>Generate the posterior distributions</h3>
<p>With the model in place, we transform the data into a <code>list</code> so that Stan can make sense of it:</p>
<pre class="r"><code>N <- nrow(d.obs) ## number of observations
K <- dc[, length(unique(study))] ## number of studies
y <- d.obs$y ## vector of continuous outcomes
kk <- d.obs$study ## study for individual
x <- d.obs$x ## treatment arm for individual
ddata <- list(N = N, K = K, y = y, kk = kk, x = x)</code></pre>
<p>And then we compile the Stan code:</p>
<pre class="r"><code>rt <- stanc("model.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)</code></pre>
<p>Finally, we can sample data from the posterior distribution:</p>
<pre class="r"><code>fit <- sampling(sm, data=ddata, seed = 3327, iter = 10000, warmup = 2500,
control=list(adapt_delta=0.9))</code></pre>
</div>
<div id="check-the-diagonstic-plots" class="section level3">
<h3>Check the diagnostic plots</h3>
<p>Before looking at any of the output, it is imperative to convince ourselves that the MCMC process was a stable one. The <em>trace</em> plot is the most basic way to assess this. Here, I am only showing these plots for <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\tau\)</span>, but the plots for the other parameters looked similar, which is to say everything looks good:</p>
<pre class="r"><code>pname <- c("Delta", "tau")
stan_trace(object = fit, pars = pname)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
</div>
<div id="look-at-the-results" class="section level3">
<h3>Look at the results</h3>
<p>It is possible to inspect the distribution of any or all parameters. In this case, I am particularly interested in the treatment effects at the study level, and overall. That is, the focus here is on <span class="math inline">\(\Delta\)</span>, <span class="math inline">\(\delta_k\)</span>, and <span class="math inline">\(\tau\)</span>.</p>
<pre class="r"><code>pname <- c("delta", "Delta","tau")
print(fit, pars=pname, probs = c(0.05, 0.5, 0.95))</code></pre>
<pre><code>## Inference for Stan model: model.
## 4 chains, each with iter=10000; warmup=2500; thin=1;
## post-warmup draws per chain=7500, total post-warmup draws=30000.
##
##            mean se_mean   sd    5%   50%  95% n_eff Rhat
## delta[1]   6.39    0.01 1.13  4.51  6.41 8.22 29562    1
## delta[2]  -0.78    0.01 1.62 -3.45 -0.78 1.85 28188    1
## delta[3]  -0.14    0.01 1.39 -2.37 -0.16 2.18 28909    1
## delta[4]   3.08    0.00 0.59  2.09  3.08 4.05 34277    1
## delta[5]  -0.16    0.01 1.01 -1.77 -0.18 1.52 27491    1
## delta[6]   3.87    0.00 0.86  2.47  3.87 5.27 35079    1
## delta[7]   4.04    0.01 1.11  2.21  4.03 5.87 32913    1
## delta[8]   5.23    0.01 1.29  3.12  5.23 7.36 33503    1
## delta[9]   1.79    0.01 1.25 -0.27  1.78 3.82 30709    1
## delta[10]  1.38    0.01 1.12 -0.46  1.38 3.21 30522    1
## delta[11]  4.47    0.01 1.25  2.43  4.47 6.54 34573    1
## delta[12]  0.79    0.01 1.45 -1.60  0.80 3.16 33422    1
## Delta      2.48    0.00 0.89  1.01  2.50 3.89 31970    1
## tau        2.72    0.00 0.71  1.72  2.64 4.01 24118    1
##
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 15:47:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).</code></pre>
<p>The forest plot is quite similar to the one based on the mixed effects model, though as predicted, the 95% CI is considerably wider:</p>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
<p>As a comparison, here is the plot from the mixed effects model estimated using the <code>nlme</code> package in the previous post. The bootstrapped estimates of uncertainty at the study level are quite close to the Bayesian measure of uncertainty; the difference really lies in the uncertainty around the global estimate.</p>
<p><img src="https://www.rdatagen.net/post/2020-07-21-a-bayesian-model-for-a-simulated-meta-analysis.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
</div>
Simulating multiple RCTs to simulate a meta-analysis
https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/
Tue, 07 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I am currently involved with an RCT that is struggling to recruit eligible patients (by no means an unusual problem), increasing the risk that findings might be inconclusive. A possible solution to this conundrum is to find similar, ongoing trials with the aim of pooling data in a single analysis, to conduct a <em>meta-analysis</em> of sorts.</p>
<p>In an ideal world, this theoretical collection of sites would have joined forces to develop a single study protocol, but often there is no structure or funding mechanism to make that happen. However, this group of studies may be similar enough - based on the target patient population, study inclusion and exclusion criteria, therapy protocols, comparison or control condition, randomization scheme, and outcome measurement - that it might be reasonable to estimate a single treatment effect and some measure of uncertainty.</p>
<p>This pooling approach would effectively be a prospective meta-analysis using <em>individual participant data</em>. The goal is to estimate a single treatment effect for this intervention or therapy that has been evaluated by different groups under varying research conditions, with possibly different treatment effects in each study.</p>
<p>To explore how all of this works, I generated some data and fit some models. As usual I thought the code would be more useful sitting on this blog rather than hidden away on some secure server.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>In this simulation, I am using a generic continuous outcome <span class="math inline">\(y_{ik}\)</span>, for individual <span class="math inline">\(i\)</span> who is participating in study <span class="math inline">\(k\)</span>. The individual outcome is a function of the study itself and whether that individual received the experimental therapy (<span class="math inline">\(x_{ik} = 1\)</span> for patients in the experimental arm):</p>
<p><span class="math display">\[ y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\
\\
\delta_k = \delta_0 + v_k
\]</span>
<span class="math inline">\(\alpha_k\)</span> is the intercept for study <span class="math inline">\(k\)</span>, or the average outcome for patients in study <span class="math inline">\(k\)</span> in the control arm. <span class="math inline">\(\delta_k\)</span> is the treatment effect in study <span class="math inline">\(k\)</span> and can be decomposed into a common treatment effect across all studies <span class="math inline">\(\delta_0\)</span> and a study-specific effect <span class="math inline">\(v_k\)</span>. <span class="math inline">\(v_k\)</span> is often assumed to be normally distributed, <span class="math inline">\(v_k \sim N(0, \tau^2)\)</span>. An individual effect, <span class="math inline">\(e_{ik}\)</span> is also assumed to be normally distributed, <span class="math inline">\(e_{ik} \sim N(0, \sigma_k^2)\)</span>. Note that the variance <span class="math inline">\(\sigma_k^2\)</span> of individual effects might differ across studies; that is, in some studies patients may be more similar to each other than in other studies.</p>
</div>
<div id="the-simulation-assumptions" class="section level3">
<h3>The simulation assumptions</h3>
<p>Before starting - here are the necessary libraries in case you want to follow along:</p>
<pre class="r"><code>library(simstudy)
library(parallel)
library(nlme)
library(data.table)</code></pre>
<p>In these simulations, there are 12 studies, each enrolling a different number of patients: a set of smaller studies, moderately sized studies, and larger studies. We are not really interested in the variability of the intercepts (<span class="math inline">\(\alpha_k\)</span>’s), but we generate them from a normal distribution <span class="math inline">\(N(3, 2)\)</span>. The overall treatment effect is set at <span class="math inline">\(3\)</span>, and the study-specific effects are distributed as <span class="math inline">\(N(0, 6)\)</span>. We use a gamma distribution to create the study-specific within-study variation <span class="math inline">\(\sigma^2_k\)</span>: the average within-study variance is <span class="math inline">\(16\)</span>, and it will range roughly between <span class="math inline">\(1\)</span> and <span class="math inline">\(64\)</span> (the variance of the variances is <span class="math inline">\(mean^2 \times dispersion = 16^2 \times 0.2 = 51.2\)</span>). The study-specific data are generated using these assumptions:</p>
<pre class="r"><code>defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist = "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study", 
    formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
    dist = "poisson")</code></pre>
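<p>If the mean/dispersion parameterization of the gamma distribution is unfamiliar, a quick simulation can confirm the implied mean and variance. This is just a sketch using base <code>R</code>’s <code>rgamma</code>; if I have the parameterization right, setting shape = 1/dispersion and scale = mean × dispersion matches the mean/dispersion specification used above:</p>
<pre class="r"><code># mean = 16 and dispersion = 0.2 imply Var = 16^2 * 0.2 = 51.2
set.seed(135)
s2.k <- rgamma(1e6, shape = 1 / 0.2, scale = 16 * 0.2)

round(mean(s2.k), 1)   # should be close to 16
round(var(s2.k), 1)    # should be close to 51.2</code></pre>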
<p>The individual outcomes are generated based on the model specified above:</p>
<pre class="r"><code>defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>First, we generate the study level data:</p>
<pre class="r"><code>RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)
ds <- genData(12, defS)
ds</code></pre>
<pre><code>##     study  a.k d.0     v.k  s2.k size.study n.study
##  1:     1 2.51   3  2.7437  5.25          2      30
##  2:     2 1.51   3 -4.8894 30.48          2      37
##  3:     3 1.62   3 -4.1762 15.06          1      22
##  4:     4 3.34   3  0.2494  3.26          2      44
##  5:     5 2.34   3 -2.9078  5.59          1      15
##  6:     6 1.70   3  1.3498  7.42          2      44
##  7:     7 4.17   3 -0.4135 14.58          2      45
##  8:     8 2.14   3  0.7826 25.78          2      44
##  9:     9 2.54   3 -1.1197 15.72          1      28
## 10:    10 3.10   3 -2.1275 10.00          1      24
## 11:    11 2.62   3 -0.0812 32.76          2      40
## 12:    12 1.17   3 -0.5745 30.94          2      49</code></pre>
<p>And then we generate individuals within each study, assign treatment, and add the outcome:</p>
<pre class="r"><code>dc <- genCluster(ds, "study", "n.study", "id")
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)</code></pre>
<p>The observed data set obviously does not include any underlying study data parameters. The figure based on this data set shows the individual-level outcomes by treatment arm for each of the 12 studies. The study-specific treatment effects and differences in within-study variation are readily apparent.</p>
<pre class="r"><code>d.obs <- dc[, .(study, id, x, y)]</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-8-1.png" width="528" /></p>
</div>
<div id="initial-estimates" class="section level3">
<h3>Initial estimates</h3>
<p>If each study went ahead and analyzed its own data set separately, the emerging picture would be a bit confusing. We would have 12 different estimates, some concluding that the treatment is effective, and others not able to draw that conclusion. A plot of the 12 model estimates along with the 95% confidence intervals highlights the muddled picture. For additional reference, I’ve added points that represent the true (and unknown) study effects in blue, including a blue line at the value of the overall treatment effect.</p>
<pre class="r"><code>lm.ind <- function(z, dx) {
  fit <- lm(y ~ x, data = dx)
  data.table(z, coef(fit)["x"], confint(fit, "x"))
}

res <- lapply(1:d.obs[, length(unique(study))], function(z) lm.ind(z, d.obs[study == z]))</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
</div>
<div id="the-meta-analysis" class="section level3">
<h3>The meta-analysis</h3>
<p>The meta-analysis is an attempt to pool the findings from all the studies to try to draw an overall conclusion. Traditionally, meta-analysis has been conducted using only the summary information from each study - effect size estimates, standard errors, and sample sizes. More recently, researchers have started to use individual-level data to estimate an overall effect. There are advantages to this added level of detail, particularly in enhancing the ability to model patient-level and study-level characteristics that might influence the effect size; these adjustments could help reduce the variance of the effect size estimates.</p>
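<p>To make the contrast with the traditional aggregate-data approach concrete, here is a minimal sketch of fixed-effect inverse-variance pooling based on summary statistics alone. The effect estimates and standard errors below are made up purely for illustration - they are not from the simulated studies in this post:</p>
<pre class="r"><code># inverse-variance (fixed-effect) pooling from summary statistics only;
# the estimates and standard errors here are hypothetical
est <- c(2.1, 3.4, 1.8, 2.9)   # study-level effect estimates
se  <- c(0.8, 1.1, 0.9, 0.7)   # their standard errors

w <- 1 / se^2                  # weight each study by its precision
pooled <- sum(w * est) / sum(w)
pooled.se <- sqrt(1 / sum(w))

round(c(pooled = pooled, se = pooled.se), 2)   # pooled: 2.51, se: 0.42</code></pre>
<p>A random-effects version would additionally estimate the between-study variance <span class="math inline">\(\tau^2\)</span> and add it to each study’s sampling variance before weighting - the individual-level models fit here do something analogous through the study-specific random treatment effects.</p>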
<p>There are packages in <code>R</code> specifically designed to conduct meta-analyses, but I am doing it “manually” through the use of the <code>nlme</code> package, which estimates a mixed-effects model that mimics the underlying data generating process. (In a subsequent post, I will do the same thing using a Bayesian model implemented with <code>rstan</code>.) I opted for <code>nlme</code> over the <code>lme4</code> package because the former can accommodate the possibility of different within-study variation.</p>
<p>The model fit here includes a study-specific (fixed) intercept, an overall treatment effect, and a study-specific treatment effect. And, as I just mentioned, the within-study variation is accommodated:</p>
<pre class="r"><code>lmefit <- lme(y ~ factor(study) + x - 1,
  random  = ~ x - 1 | study, 
  weights = varIdent(form = ~ 1 | study),
  data = d.obs, method = 'REML'
)</code></pre>
<p>The model estimate for the overall treatment effect is 2.5, just under but close to the true value of 3.0:</p>
<pre class="r"><code>round(coef(summary(lmefit))["x",], 3)</code></pre>
<pre><code>## Warning in pt(-abs(tVal), fDF): NaNs produced</code></pre>
<pre><code>## Value Std.Error DF t-value p-value
## 2.481 0.851 410.000 2.915 0.004</code></pre>
</div>
<div id="bootstrapping-uncertainty" class="section level3">
<h3>Bootstrapping uncertainty</h3>
<p>Every meta-analysis I’ve seen includes a forest plot that shows the individual study estimates along with the global estimate of primary interest. In my version of this plot, I wanted to show the estimated study-level effects from the model (<span class="math inline">\(\delta_0 + v_k\)</span>) along with 95% confidence intervals. The model fit does not provide a variance estimate for each study-level treatment effect, so I have estimated the standard error using bootstrap methods. I repeatedly sample from the observed data (sampling stratified by study and treatment arm) and estimate the same fixed effects model. For each iteration, I keep the estimated study-specific treatment effect as well as the estimated pooled effect:</p>
<pre class="r"><code>bootest <- function() {
  
  bootid <- d.obs[, .(id = sample(id, .N, replace = TRUE)), keyby = .(study, x)][, .(id)]
  dboot <- merge(bootid, d.obs, by = "id")
  
  bootfit <- tryCatch(
    lme(y ~ factor(study) + x - 1,
      random  = ~ x - 1 | study, 
      weights = varIdent(form = ~ 1 | study),
      data = dboot, method = 'REML'),
    error = function(e) "error",
    warning = function(w) "warning"   # tryCatch handler names are "error" and "warning"
  )
  
  if (inherits(bootfit, "lme")) {
    return(data.table(t(random.effects(bootfit) + fixed.effects(bootfit)["x"]), 
      pooled = fixed.effects(bootfit)["x"]))
  }
}

res <- mclapply(1:3000, function(x) bootest(), mc.cores = 4)
res <- rbindlist(res)</code></pre>
<p>The next plot shows the individual study estimates based on the pooled analysis along with the overall estimate in red, allowing us to bring a little clarity to what was an admittedly confusing picture. We might conclude from these findings that the intervention appears to be effective.</p>
<p><img src="https://www.rdatagen.net/post/2020-07-07-simulating-mutliple-studies-to-simulate-a-meta-analysis.en_files/figure-html/unnamed-chunk-14-1.png" width="576" /></p>
<p>As an aside, it is interesting to compare the two forest plot figures in the post, because it is apparent that the point estimates for the individual studies in the second plot are “pulled” closer to the overall average. This is the direct result of the mixed effects model that imposes a structure in the variation of effect sizes across the 12 studies. In contrast, the initial plot shows individual effect sizes that were independently estimated without any such constraint or structure. Pooling across groups or clusters generally has an attenuating effect on estimates.</p>
</div>
Consider a permutation test for a small pilot study
https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/
Tue, 23 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Recently I <a href="https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/">wrote</a> about the challenges of trying to learn too much from a small pilot study, even if it is a randomized controlled trial. There are limitations on how much you can learn about a treatment effect given the small sample size and relatively high variability of the estimate. However, the temptation for researchers is usually just too great; it is only natural to want to see if there is any kind of signal of an intervention effect, even though the pilot study is focused on questions of feasibility and acceptability.</p>
<p>Through my work with the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>, I have been involved with planning a research initiative to test the feasibility of studying a highly innovative strategy in nursing homes to reduce the risk of Covid-19 infections among both residents and staff. Given that the strategy is so novel, there are big questions about whether it can even be implemented, and how the outcome measures can be collected. So, it may be premature to figure out whether the approach will reduce infection. But still, it is hard not to try to gain a little insight into the potential effect of the intervention.</p>
<p>One of the lead investigators suggested a permutation test, because we know the sample is going to be small and we might not want to be forced to make parametric assumptions about the outcome. In the context of a pilot study, the permutation test might give a crude indication about the potential impact of an intervention. Would the full-blown follow-up study be conducted if there is no observed effect in the pilot? That is a bigger question. But, the suggestion of some sort of signal might provide additional motivation if feasibility was no longer a question; we would still need to be <a href="https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/">careful</a> about how we incorporate these findings into planning for the bigger study.</p>
<div id="permutation-test-explained-briefly" class="section level3">
<h3>Permutation test explained, briefly</h3>
<p>Typically, if we are comparing outcomes for two treatment arms, we calculate a statistic that quantifies the comparison. For example, this could be a difference in group means, a risk ratio, or a log-odds ratio. For whatever statistic we use, there would be an underlying sampling distribution of that statistic under the assumption that there is no difference between the two groups. Typically, the sampling distribution would be estimated analytically using additional assumptions about the underlying distributions of the observed data, such as normal or Poisson. We then use the sampling distribution to calculate a <em>p-value</em> for the observed value of the statistic.</p>
<p>The permutation approach is an alternative way to generate the sampling distribution of the statistic under an assumption of no group differences <em>without</em> making any assumptions about the distributions of the data. If group membership does not influence the outcome, it wouldn’t matter if we re-arranged all the treatment assignments in the data. We could do that and estimate the statistic. In fact, we could do that for all the possible arrangements, and that would give us a distribution of the statistic for that sample under the assumption of no group effect. (If the number of possible arrangements is excessive, we could just take a large sample of those possible arrangements, which is what I do below.) To get a <em>p-value</em>, we compare the observed statistic to this manufactured distribution.</p>
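<p>As a toy illustration of the mechanics - with made-up outcomes, and a sample small enough to enumerate every possible assignment exactly:</p>
<pre class="r"><code># exact permutation test for a difference in means, two arms of 4
y   <- c(11, 14, 9, 16, 18, 21, 17, 23)   # made-up outcomes
grp <- c(0, 0, 0, 0, 1, 1, 1, 1)          # original treatment assignment

obs <- mean(y[grp == 1]) - mean(y[grp == 0])   # 7.25

# all choose(8, 4) = 70 ways to label 4 of the 8 observations "treated"
perms <- combn(8, 4)
diffs <- apply(perms, 2, function(ix) mean(y[ix]) - mean(y[-ix]))

# two-sided p-value: proportion of assignments at least as extreme
mean(abs(diffs) >= abs(obs))   # 2/70, about 0.029</code></pre>
<p>In this toy data set, only the observed split and its mirror image are as extreme as the observed difference, so the exact two-sided <em>p-value</em> is 2/70.</p>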
<p>Now to the simulations.</p>
</div>
<div id="the-data-generation-process" class="section level3">
<h3>The data generation process</h3>
<p>In this proposed study, we are interested in measuring the rate of Covid-19 infection in 8 nursing homes. Given the nature of the spread of disease and the inter-relationship of infections among residents, the nursing home is the logical unit of analysis. So, we will only have 8 observations - hardly anything to hang your hat on. But, can the permutation test provide any useful information?</p>
<p>The data generation process starts with generating a pool of residents at each site, about 15 per home. The study will run for 4 months (or 120 days), and residents will come and go during that period. We are going to assume that the average residence time is 50 days at each home, but there will be some variability. Based on the number of patients and average length of stay, we can calculate the number of patient-days per site. The number of infected patients <span class="math inline">\(y\)</span> at a site is a function of the intervention and the time of exposure (patient-days). We will be comparing the average rates (<span class="math inline">\(y/patDays\)</span>) for the two groups.</p>
<p>In the first simulation, I am assuming no treatment effect, because I want to assess the Type 1 error (the probability of concluding there is an effect given we know there is no effect).</p>
<p>Here is a function to generate the data definitions and a second function to go through the simple data generation process:</p>
<pre class="r"><code>library(simstudy)
library(parallel)</code></pre>
<pre class="r"><code>defs <- function() {
  
  def <- defDataAdd(varname = "nRes", formula = 15, dist = "poisson")
  def <- defDataAdd(def, varname = "nDays", formula = 50, dist = "poisson")
  def <- defDataAdd(def, varname = "patDays", 
    formula = "nRes * pmin(120, nDays)", 
    dist = "nonrandom")
  def <- defDataAdd(def, varname = "y", 
    formula = "-4 - 0.0 * rx + log(patDays)", 
    variance = 1, 
    dist = "negBinomial", link = "log")
  def <- defDataAdd(def, varname = "rate", 
    formula = "y/patDays", 
    dist = "nonrandom")
  
  return(def[])
}</code></pre>
<pre class="r"><code>gData <- function(n, def) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addColumns(def, dx)
  
  dx[]
}</code></pre>
<p>And here we actually generate a single data set:</p>
<pre class="r"><code>RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(82456)
def <- defs()
dx <- gData(8, def)
dx</code></pre>
<pre><code>##    id rx nRes nDays patDays  y        rate
## 1:  1  1   16    40     640  0 0.000000000
## 2:  2  0    9    59     531  8 0.015065913
## 3:  3  1   20    57    1140 58 0.050877193
## 4:  4  1   14    51     714 14 0.019607843
## 5:  5  1    7    59     413  4 0.009685230
## 6:  6  0   14    38     532  4 0.007518797
## 7:  7  0   10    40     400 19 0.047500000
## 8:  8  0   11    56     616 22 0.035714286</code></pre>
<p>The observed difference in rates is quite close to 0:</p>
<pre class="r"><code>dx[, mean(rate), keyby = rx]</code></pre>
<pre><code>##    rx         V1
## 1:  0 0.02644975
## 2:  1 0.02004257</code></pre>
<pre class="r"><code>obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]
obs.diff</code></pre>
<pre><code>## [1] -0.006407182</code></pre>
</div>
<div id="the-permutation-test" class="section level3">
<h3>The permutation test</h3>
<p>With 8 sites, there are <span class="math inline">\(8!\)</span> possible permutations - that’s a lot of ways to scramble the treatment assignments.</p>
<pre class="r"><code>factorial(8)</code></pre>
<pre><code>## [1] 40320</code></pre>
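<p>As an aside, since the test statistic depends only on which 4 sites end up in each arm, most of those 40,320 orderings are duplicates of one another: there are really only <span class="math inline">\(\binom{8}{4} = 70\)</span> distinct splits of the 8 sites into two arms of 4, few enough to enumerate exactly rather than sample. Here is a sketch, plugging in the rates and treatment assignment from the data set generated above:</p>
<pre class="r"><code># rates and treatment assignment copied from dx above
rate <- c(0.000000000, 0.015065913, 0.050877193, 0.019607843,
          0.009685230, 0.007518797, 0.047500000, 0.035714286)
rx   <- c(1, 0, 1, 1, 1, 0, 0, 0)

obs.diff <- mean(rate[rx == 1]) - mean(rate[rx == 0])   # -0.0064, as before

splits <- combn(8, 4)   # each column: the 4 sites assigned to treatment
exact.diffs <- apply(splits, 2, function(ix) mean(rate[ix]) - mean(rate[-ix]))

# exact two-sided p-value over all 70 distinct splits
mean(abs(exact.diffs) > abs(obs.diff))</code></pre>
<p>Sampling permutations, as I do below, gets to essentially the same answer, and it scales to settings where full enumeration is no longer feasible.</p>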
<p>I decided to implement this myself in a pretty rudimentary way, though there are R packages out there that can certainly do this better. Since I am comparing averages, I am creating a vector that represents the contrast.</p>
<pre class="r"><code>rx <- dx$rx/sum(dx$rx)
rx[rx==0] <- -1/(length(dx$rx) - sum(dx$rx))
rx</code></pre>
<pre><code>## [1] 0.25 -0.25 0.25 0.25 0.25 -0.25 -0.25 -0.25</code></pre>
<p>I’m taking a random sample of 5000 permutations of the contrast vector, storing the results in a matrix:</p>
<pre class="r"><code>perm <- t(sapply(1:5000, function(x) sample(rx, 8, replace = FALSE)))
head(perm)</code></pre>
<pre><code>##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] -0.25 -0.25  0.25 -0.25  0.25  0.25 -0.25  0.25
## [2,] -0.25 -0.25  0.25 -0.25  0.25 -0.25  0.25  0.25
## [3,]  0.25 -0.25 -0.25  0.25 -0.25 -0.25  0.25  0.25
## [4,]  0.25 -0.25  0.25  0.25 -0.25  0.25 -0.25 -0.25
## [5,] -0.25 -0.25 -0.25  0.25  0.25  0.25  0.25 -0.25
## [6,]  0.25 -0.25  0.25 -0.25  0.25 -0.25  0.25 -0.25</code></pre>
<p>Using a simple operation of matrix multiplication, I’m calculating a rate difference for each of the sampled permutations:</p>
<pre class="r"><code>perm.diffs <- perm %*% dx$rate
head(perm.diffs)</code></pre>
<pre><code>## [,1]
## [1,] 0.005405437
## [2,] 0.025396039
## [3,] 0.004918749
## [4,] -0.007490399
## [5,] -0.004336380
## [6,] 0.007538896</code></pre>
<p>Here is an estimate of the 2-sided <em>p-value</em>:</p>
<pre class="r"><code>mean(abs(perm.diffs) > abs(obs.diff))</code></pre>
<pre><code>## [1] 0.7166</code></pre>
<p>And finally, here is a histogram of the permuted rate differences, with the observed rate difference overlaid as a red line. The observed value lies pretty much right in the middle of the distribution, which is what the <em>p-value</em> suggests:</p>
<pre class="r"><code>ggplot(data = data.frame(perm.diffs), aes(x = perm.diffs)) +
geom_histogram(binwidth = .005, color = "white") +
geom_vline(xintercept = obs.diff, color = "red") +
scale_y_continuous(expand = c(0,0), breaks = seq(2500, 10000, 2500)) +
theme(panel.grid = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-23-permutation-test-for-a-covid-19-pilot-nursing-home-study.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
<div id="operating-characteristics" class="section level3">
<h3>Operating characteristics</h3>
<p>As in a power analysis by simulation, we can estimate the Type 1 error rate by generating many data sets, and for each one calculate a <em>p-value</em> using the permutation test. The proportion of <em>p-values</em> less than 0.05 would represent the Type 1 error rate, which should be close to 0.05.</p>
<pre class="r"><code>iter <- function(n) {
dx <- gData(n, defs())
obs.diff <- dx[, mean(rate), keyby = rx][, diff(V1)]
rx <- dx$rx/sum(dx$rx)
rx[rx==0] <- -1/(n - sum(dx$rx))
perm <- t(sapply(1:20000, function(x) sample(rx, n, replace = FALSE)))
perm.diffs <- perm %*% dx$rate
mean(abs(perm.diffs) > abs(obs.diff))
}</code></pre>
<p>Here we use 5000 data sets to estimate the Type 1 error rate under the data generating process we’ve been using all along, and for each of those data sets we use 20,000 permutations to estimate the p-value.</p>
<pre class="r"><code>res <- unlist(mclapply(1:5000, function(x) iter(8), mc.cores = 4))
mean(res < .05)</code></pre>
<pre><code>## [1] 0.0542</code></pre>
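<p>Is 0.0542 close enough to 0.05? A rough check, assuming the 5000 replications are independent, is to compute the Monte Carlo standard error of the estimated proportion and confirm that the nominal 0.05 falls within the corresponding 95% confidence interval:</p>

```r
# Monte Carlo standard error of a proportion estimated from 5000 replications
p.hat <- 0.0542
n.rep <- 5000
se <- sqrt(p.hat * (1 - p.hat) / n.rep)

# The 95% CI easily covers the nominal 0.05
round(p.hat + c(-1.96, 1.96) * se, 4)
## [1] 0.0479 0.0605
```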
</div>
<div id="the-risks-of-using-a-model-with-assumptions" class="section level3">
<h3>The risks of using a model (with assumptions)</h3>
<p>If we go ahead and try to find a signal using a parametric model, there’s a chance we’ll be led astray. These are count data, so it would not be strange to consider a Poisson regression model to estimate the treatment effect (in this case, the effect would be a rate <em>ratio</em> rather than a rate <em>difference</em>). Given that the data are quite limited, we may not really be in a position to verify whether the Poisson distribution is appropriate; as a result, it might be hard to select the right model. (In reality, I know that this model <em>will</em> lead us astray, because we used a negative binomial distribution, which has more variance than the Poisson, to generate the count data.)</p>
<p>Just as before, we generate 5000 data sets. For each one we fit a generalized linear model with a Poisson distribution and a log link, and store the effect estimate along with the <em>p-value</em>.</p>
<pre class="r"><code>chkglm <- function(n) {
dx <- gData(n, defs())
glmfit <- glm( y ~ rx + offset(log(patDays)), family = poisson, data = dx)
data.table(t(coef(summary(glmfit))["rx",]))
}
glm.res <- rbindlist(mclapply(1:5000, function(x) chkglm(8)))</code></pre>
<p>The estimated Type 1 error is far greater than 0.05; there is a pretty good chance we would be over-enthusiastic about the potential success of our new nursing home strategy even if it were not actually effective.</p>
<pre class="r"><code>glm.res[, .(mean(`Pr(>|z|)` < 0.05))]</code></pre>
<pre><code>## V1
## 1: 0.6152</code></pre>
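<p>One way to catch this kind of misspecification - offered here only as a sketch, since with just 8 observations any diagnostic is itself quite unreliable - is to compare the Poisson model’s residual deviance to its residual degrees of freedom (a ratio well above 1 suggests overdispersion), or to fit a negative binomial model, which estimates the extra-Poisson dispersion directly:</p>

```r
library(MASS)

# A ratio of residual deviance to residual df well above 1 is a
# classic (if informal) signal of overdispersion in a Poisson fit
glmfit <- glm(y ~ rx + offset(log(patDays)), family = poisson, data = dx)
deviance(glmfit) / df.residual(glmfit)

# The negative binomial model estimates a dispersion parameter (theta);
# smaller theta implies more overdispersion relative to the Poisson.
# Note: glm.nb may struggle to converge with so few observations.
nbfit <- glm.nb(y ~ rx + offset(log(patDays)), data = dx)
nbfit$theta
```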
</div>
<div id="when-there-is-a-treatment-effect" class="section level3">
<h3>When there is a treatment effect</h3>
<p>In the case where there is actually a treatment effect, the observed effect size is more likely to fall closer to one of the distribution’s tails, depending on the direction of the effect. If the treatment reduces the number infections, we would expect the rate difference to be <span class="math inline">\(< 0\)</span>, as it is in this particular case:</p>
<pre class="r"><code>def <- updateDefAdd(def, changevar = "y",
newformula = "-4 - 1.2 * rx + log(patDays)" )</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-23-permutation-test-for-a-covid-19-pilot-nursing-home-study.en_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
<p>At the end of the day, if you feel like you must estimate the treatment effect in a pilot study before moving on to the larger trial, one option is to use a non-parametric approach like a permutation test, which makes fewer assumptions that could lead you astray.</p>
<p>In the end, though, we opted for a different model. If we do get the go ahead to conduct this study, we will fit a Bayesian model instead. We hope this will be flexible enough to accommodate a range of assumptions and give us a potentially more informative posterior probability of a treatment effect. If we actually get the opportunity to do this, I’ll consider describing that model here.</p>
<p><small><font color="darkkhaki">
Support: This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small></p>
</div>
When proportional odds is a poor assumption, collapsing categories is probably not going to save you
https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/
Tue, 09 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Continuing the discussion on cumulative odds models I started <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">last time</a>, I want to investigate a solution I always assumed would help mitigate a failure to meet the proportional odds assumption. I’ve believed that if there is a large number of categories and the relative cumulative odds between two groups don’t appear proportional across all categorical levels, then a reasonable approach is to reduce the number of categories; in other words, that fewer categories would bring us closer to proportional odds. I’m not sure what led me to this conclusion, but in this post I’ve created some simulations that seem to throw cold water on that idea.</p>
<div id="when-the-odds-are-proportional" class="section level3">
<h3>When the odds <em>are</em> proportional</h3>
<p>I think it is illustrative to go through a base case where the odds are actually proportional. This will allow me to introduce the data generation and visualization that I’m using to explore this issue. I am showing a lot of code here, because I think it is useful to see how it is possible to visualize cumulative odds data and the model estimates.</p>
<p>The first function <code>genDT</code> generates a data set with two treatment arms and an ordinal outcome. <code>genOrdCat</code> uses a base set of probabilities for the control arm, and the experimental arm probabilities are generated under an assumption of proportional cumulative odds (see the <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">previous post</a> for more details on what cumulative odds are and what the model is).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
genDT <- function(nobs, baseprobs, defA) {
dT <- genData(nobs)
dT <- trtAssign(dT, grpName = "rx")
dT <- addColumns(defA, dT)
dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
dT[]
}</code></pre>
<p>In this case, I’ve set the base probabilities for an ordinal outcome of 8 categories. The log of the cumulative odds ratio comparing the experimental arm to control is 1.0 (parameterized as -1.0 in the adjustment variable <code>z</code>). The proportional cumulative odds ratio should therefore be about exp(1) = 2.7.</p>
<pre class="r"><code>baseprobs <- c(0.05, 0.10, 0.15, 0.25, .20, 0.15, .05, .05)
defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")
set.seed(29672) # 19779
dT.prop <- genDT(200, baseprobs, defA)</code></pre>
<p>Calculation of the observed cumulative odds ratio at each response level doesn’t provide an entirely clear picture about proportionality, but the sample size is relatively small given the number of categories.</p>
<pre class="r"><code>codds <- function(cat, dx) {
dcodds <- dx[, .(codds = mean(as.numeric(r) <= cat)/mean(as.numeric(r) > cat)),
keyby = rx]
round(dcodds[rx == 1, codds] / dcodds[rx==0, codds], 2)
}
sapply(1:7, function(x) codds(x, dT.prop))</code></pre>
<pre><code>## [1] 1.48 3.81 3.12 1.83 2.05 3.59 2.02</code></pre>
<div id="a-visual-assessment" class="section level4">
<h4>A visual assessment</h4>
<p>An excellent way to assess proportionality is to do a visual comparison of the <em>observed</em> cumulative probabilities with the <em>estimated</em> cumulative probabilities from the cumulative odds model that makes the assumption of proportional odds.</p>
<p>I’ve written three functions that help facilitate this comparison. <code>getCumProbs</code> converts the parameter estimates of cumulative odds from the model to estimates of cumulative probabilities.</p>
<pre class="r"><code>getCumProbs <- function(coefs) {
cumprob0 <- data.table(
cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")])), 1),
r = factor(1 : nrow(coefs)),
rx = 0
)
cumprob1 <- data.table(
cumprob = c(1/(1 + exp(-coefs[which(rownames(coefs) != "rx")] +
coefs["rx", 1])), 1),
r = factor(1 : nrow(coefs)),
rx = 1
)
rbind(cumprob0, cumprob1)[]
}</code></pre>
<p>The function <code>bootCumProb</code> provides a single bootstrap from the data so that we can visualize the uncertainty of the estimated cumulative probabilities. In this procedure, a random sample is drawn (with replacement) from the data set, a <code>clm</code> model is fit, and the cumulative odds are converted to cumulative probabilities.</p>
<pre class="r"><code>library(ordinal)
bootCumProb <- function(bootid, dx) {
sampid <- dx[, .(srow = sample(.I, replace = TRUE)), keyby = rx][, srow]
dtBoot <- dx[sampid,]
bootFit <- clm(r ~ rx, data = dtBoot)
bcoefs <- coef(summary(bootFit))
bcumProbs <- getCumProbs(bcoefs)
bcumProbs[, bootid := bootid]
bcumProbs[]
}</code></pre>
<p>The third function <code>fitPlot</code> fits a <code>clm</code> model to the original data set, collects the bootstrapped estimates, calculates the observed cumulative probabilities, converts the estimated odds to estimated probabilities, and generates a plot of the observed data, the model fit, and the bootstrap estimates.</p>
<pre class="r"><code>library(ggplot2)
library(paletteer)
fitPlot <- function(dx) {
clmFit <- clm(r ~ rx, data = dx)
coefs <- coef(summary(clmFit))
bootProbs <- rbindlist(lapply(1:500, function(x) bootCumProb(x, dx)))
cumObsProbs <- dx[, .N, keyby = .(rx, r)]
cumObsProbs[, cumprob := cumsum(N)/sum(N) , keyby = rx]
cumModProbs <- getCumProbs(coefs)
ggplot(data = cumObsProbs, aes(x = r, y = cumprob , color = factor(rx))) +
geom_line(data = cumModProbs, alpha = 1, aes(group=rx)) +
geom_line(data = bootProbs, alpha = .01,
aes(group = interaction(rx, bootid))) +
geom_point() +
ylab("cumulative probability") +
xlab("ordinal category") +
theme(panel.grid = element_blank(),
legend.position = "none") +
scale_color_paletteer_d("jcolors::pal6")
}</code></pre>
<p>Here is a plot based on the original data set of 200 observations. The observed values are quite close to the modeled estimates, and well within the range of the bootstrap estimates.</p>
<pre class="r"><code>fitPlot(dT.prop)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-7-1.png" width="528" /></p>
</div>
<div id="collapsing-the-categories" class="section level4">
<h4>Collapsing the categories</h4>
<p>Continuing with the same data set, let’s see what happens when we collapse categories together. I’ve written a function <code>collapseCat</code> that takes a list of vectors of categories that are to be combined and returns a new, modified data set.</p>
<pre class="r"><code>collapseCat <- function(dold, collapse) {
dx <- copy(dold)
for (i in 1:length(collapse)) {
dx[r %in% collapse[[i]], r:= min(collapse[[i]])]
}
dx[, r := factor(r)]
dx[, r := factor(r, labels = c(1:length(levels(r))))]
dx[]
}</code></pre>
<p>Here is the distribution of the original data set:</p>
<pre class="r"><code>dT.prop[, table(rx, r)]</code></pre>
<pre><code>## r
## rx 1 2 3 4 5 6 7 8
## 0 7 4 14 31 19 15 8 2
## 1 10 22 19 19 16 11 2 1</code></pre>
<p>And if we combine categories 1, 2, and 3 together, as well as 7 and 8, here is the resulting distribution based on the remaining five categories. Here’s a quick check to see that the categories were properly combined:</p>
<pre class="r"><code>collapse <- list(c(1,2,3), c(7,8))
collapseCat(dT.prop, collapse)[, table(rx, r)]</code></pre>
<pre><code>## r
## rx 1 2 3 4 5
## 0 25 31 19 15 10
## 1 51 19 16 11 3</code></pre>
<p>If we create four modified data sets based on different combinations of groups, we can fit models and plot the cumulative probabilities for all four of them. In all cases the proportional odds assumption still seems pretty reasonable.</p>
<pre class="r"><code>collapse <- list(list(c(3, 4), c(6, 7)),
list(c(1,2,3), c(7,8)),
list(c(1,2,3), c(4, 5), c(7,8)),
list(c(1,2), c(3, 4, 5), c(6, 7, 8))
)
dC.prop <- lapply(collapse, function(x) collapseCat(dT.prop, x))
cplots <- lapply(dC.prop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
</div>
<div id="non-proportional-cumulative-odds" class="section level3">
<h3>Non-proportional cumulative odds</h3>
<p>That was all just a set-up to explore what happens in the case of non-proportional odds. To do that, there’s just one more function to add - we need to generate data that does not assume proportional cumulative odds. I use the <code>rdirichlet</code> function in the <code>gtools</code> package to generate values between 0 and 1 that sum to 1. The key here is that there is no pattern in the data, so the ratio of the cumulative odds will not be constant.</p>
<pre class="r"><code>genDTnon <- function(nobs, ncat) {
ps <- gtools::rdirichlet(2, rep(2, ncat))
p0 <- paste(ps[1, -ncat], collapse = ";")
p1 <- paste(ps[2, -ncat], collapse = ";")
defc <- defCondition(condition = "rx == 0", formula = p0,
dist = "categorical")
defc <- defCondition(defc, condition = "rx == 1", formula = p1,
dist = "categorical")
dx <- genData(nobs)
dx <- trtAssign(dx, grpName = "rx")
dx <- addCondition(defc, dx, "r")
dx[, r := factor(r)]
dx[]
}</code></pre>
<p>Again, we generate a data set with 200 observations and an ordinal categorical outcome with 8 levels. The plot of the observed and estimated cumulative probabilities suggests that the proportional odds assumption is not a good one here. Some of the observed probabilities are quite far from the fitted lines, particularly at the low end of the ordinal scale. It may not be a disaster to use a <code>clm</code> model here, but it is probably not a great idea.</p>
<pre class="r"><code>dT.nonprop <- genDTnon(200, 8)
fitPlot(dT.nonprop)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-13-1.png" width="528" /></p>
<p>The question remains - if we reduce the number of categories does the assumption of proportional odds come into focus? The four scenarios shown here do not suggest much improvement. The observed data still fall outside or at the edge of the bootstrap bands for some levels in each case.</p>
<pre class="r"><code>dC.nonprop <- lapply(collapse, function(x) collapseCat(dT.nonprop, x))
cplots <- lapply(dC.nonprop, function(x) fitPlot(x))
gridExtra::grid.arrange(grobs = cplots, nrow = 2)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-06-09-more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>What should we do in this case? That is a tough question. The proportional odds model for the original data set with eight categories is probably just as reasonable as estimating a model using any of the combined data sets; there is no reason to think that any one of the alternatives with fewer categories will be an improvement. And, as we learned <a href="https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/">last time</a>, we may actually lose power by collapsing some of the categories. So, it is probably best to analyze the data set using its original structure, and find the best model for that data set. Ultimately, that best model may need to relax the proportionality assumption; a post on that will need to be written another time.</p>
</div>
Considering the number of categories in an ordinal outcome
https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/
Tue, 26 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In two Covid-19-related trials I’m involved with, the primary or key secondary outcome is the status of a patient at 14 days based on a World Health Organization ordered rating scale. In this particular ordinal scale, there are 11 categories ranging from 0 (uninfected) to 10 (death). In between, a patient can be infected but well enough to remain at home, hospitalized with milder symptoms, or hospitalized with severe disease. If the patient is hospitalized with severe disease, there are different stages of oxygen support the patient can be receiving, such as high flow oxygen or mechanical ventilation.</p>
<p>It is common to analyze ordinal categorical outcomes like the WHO status measure using a cumulative proportional odds model. (I’ve described these models in a number of posts, starting <a href="https://www.rdatagen.net/post/ordinal-regression/">here</a> and <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a>.) We’ve been wrestling with the question of whether to use the full 11-point scale or to collapse categories to create a simpler outcome of four or five groups. One issue that comes up is whether this reduction would increase or decrease our ability to detect a treatment effect, assuming of course that there is a treatment effect. To explore the issue, I turned to simulation.</p>
<div id="a-very-quick-recap-of-the-model" class="section level3">
<h3>A very quick recap of the model</h3>
<p>In the cumulative proportional odds model, we are comparing a series of cumulative odds across two groups, and we make an assumption that the ratio of these cumulative odds for the two groups is consistent throughout, the proportional odds assumption.</p>
<p>The cumulative odds for the control group that the status is <span class="math inline">\(x\)</span> is</p>
<p><span class="math display">\[
\text{cOdds}_{c}(x) = \frac{P(Status \le x | rx = Control)}{P(Status \gt x |rx = Control)}
\]</span></p>
<p>And the cumulative odds <strong><em>ratio</em></strong> comparing <em>Control</em> to <em>Treated</em> is</p>
<p><span class="math display">\[
\text{COR}_{ct}(x) = \frac{\text{cOdds}_c(x)}{\text{cOdds}_t(x)}
\]</span></p>
<p>In the proportional odds model, with a measure that has <span class="math inline">\(K\)</span> levels we make the assumption that</p>
<p><span class="math display">\[
\text{COR}_{ct}(1) = \text{COR}_{ct}(2) = \dots =\text{COR}_{ct}(K)
\]</span></p>
<p>The model that we estimate is</p>
<p><span class="math display">\[
\text{logit}(P(Status \le x)) = \alpha_x - \beta * I(rx = Treat)
\]</span>
where <span class="math inline">\(\alpha_x\)</span> is the log cumulative odds for a particular level <span class="math inline">\(x\)</span>, and <span class="math inline">\(\beta = \text{log}(\text{COR}_{ct}(x))\)</span>, the (proportional) log cumulative odds ratio, which is assumed constant across all <span class="math inline">\(K\)</span> status levels.</p>
</div>
<div id="conceputalizing-the-categories" class="section level3">
<h3>Conceptualizing the categories</h3>
<p>I am comparing estimates of models for outcome scales that use a range of categories, from 2 to 16. (I expanded beyond 11 to get a better sense of the results when the gradations become quite fine.) The figure shows the 16-group structure collapsing into 2 groups. The first row depicts the distribution of the control group across 16 categories. The second row combines the 2 rightmost purple categories of the first row into a single category, resulting in 15 total categories. Moving downwards, a pair of adjacent categories is combined at each step, until only 2 categories remain at the bottom.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-26-the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome.en_files/figure-html/unnamed-chunk-3-1.png" width="768" /></p>
<p>And here are the actual probabilities for the bottom seven rows, from 8 categories down to 2:</p>
<pre class="r"><code>baseprobs[7:1]</code></pre>
<pre><code>## [[1]]
## [1] 0.075 0.075 0.075 0.075 0.175 0.175 0.175 0.175
##
## [[2]]
## [1] 0.075 0.075 0.150 0.175 0.175 0.175 0.175
##
## [[3]]
## [1] 0.075 0.075 0.150 0.350 0.175 0.175
##
## [[4]]
## [1] 0.150 0.150 0.350 0.175 0.175
##
## [[5]]
## [1] 0.15 0.15 0.35 0.35
##
## [[6]]
## [1] 0.30 0.35 0.35
##
## [[7]]
## [1] 0.3 0.7</code></pre>
</div>
<div id="generating-the-data" class="section level3">
<h3>Generating the data</h3>
<p>To simulate the data, I use the function <code>genOrdCat</code> in <code>simstudy</code> that uses the base probabilities and the log-odds ratio transforming variable, which in this case is <span class="math inline">\(z\)</span>. (I introduced this function a while <a href="https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/">back</a>.) In this case the log odds ratio <span class="math inline">\((-\beta)\)</span> is 1, which translates to a cumulative odds ratio of <span class="math inline">\(exp(1) = 2.72\)</span>.</p>
<pre class="r"><code>library(simstudy)
defA <- defDataAdd(varname = "z", formula = "-1.0 * rx", dist = "nonrandom")
genDT <- function(nobs, baseprobs, defA) {
dT <- genData(nobs)
dT <- trtAssign(dT, grpName = "rx")
dT <- addColumns(defA, dT)
dT <- genOrdCat(dT, adjVar = "z", baseprobs, catVar = "r")
dT[]
}</code></pre>
<p>A single data set of 5000 observations with 6 categories looks like this:</p>
<pre class="r"><code>set.seed(7891237)
(dx <- genDT(5000, baseprobs[[5]], defA ))</code></pre>
<pre><code>## id rx z r
## 1: 1 0 0 1
## 2: 2 1 -1 3
## 3: 3 0 0 5
## 4: 4 0 0 4
## 5: 5 1 -1 1
## ---
## 4996: 4996 0 0 3
## 4997: 4997 1 -1 4
## 4998: 4998 0 0 3
## 4999: 4999 1 -1 3
## 5000: 5000 1 -1 4</code></pre>
<p>Here are the distributions by treatment arm:</p>
<pre class="r"><code>prop.table(dx[, table(rx, r)], margin = 1)</code></pre>
<pre><code>## r
## rx 1 2 3 4 5 6
## 0 0.0644 0.0772 0.1524 0.3544 0.1772 0.1744
## 1 0.1792 0.1408 0.2204 0.2880 0.1012 0.0704</code></pre>
<p>Here are the cumulative odds and the odds ratio for a response being 2 or less:</p>
<pre class="r"><code>(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 2)/mean(as.numeric(r) > 2)), keyby = rx])</code></pre>
<pre><code>## rx codds
## 1: 0 0.165
## 2: 1 0.471</code></pre>
<pre class="r"><code>dcodds[rx == 1, codds] / dcodds[rx==0, codds]</code></pre>
<pre><code>## [1] 2.85</code></pre>
<p>And here are the cumulative odds and COR for a response being 4 or less.</p>
<pre class="r"><code>(dcodds <- dx[, .(codds = mean(as.numeric(r) <= 4)/mean(as.numeric(r) > 4)), keyby = rx])</code></pre>
<pre><code>## rx codds
## 1: 0 1.84
## 2: 1 4.83</code></pre>
<pre class="r"><code>dcodds[rx == 1, codds] / dcodds[rx==0, codds]</code></pre>
<pre><code>## [1] 2.62</code></pre>
<p>The CORs are both close to the true COR of 2.72.</p>
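<p>Rather than checking cut-points one at a time, the same calculation can be applied across all five cut-points of the 6-category outcome at once (a sketch, reusing the data set <code>dx</code> generated above); each estimate should hover near the true COR of 2.72:</p>

```r
# Observed cumulative odds ratio at each of the 5 cut-points
sapply(1:5, function(k) {
  dcodds <- dx[, .(codds = mean(as.numeric(r) <= k) / mean(as.numeric(r) > k)),
               keyby = rx]
  round(dcodds[rx == 1, codds] / dcodds[rx == 0, codds], 2)
})
```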
</div>
<div id="running-the-experiment" class="section level3">
<h3>Running the experiment</h3>
<p>I was particularly interested in understanding the impact of increasing the number of categories <span class="math inline">\(K\)</span> on the probability of observing a treatment effect (i.e. the power). This required generating many (in this case 10,000) data sets under each scenario defined by the number of categories ranging from 2 to 16, and then estimating a cumulative odds model for each data set. I used the <code>clm</code> function in the <code>ordinal</code> package.</p>
<p>Two functions implement this iteration. <code>analyzeDT</code> generates a data set and returns a model fit. <code>iterate</code> repeatedly calls <code>analyzeDT</code> and estimates power for a particular scenario by calculating the proportion of p-values that are less than 0.05:</p>
<pre class="r"><code>library(ordinal)
analyzeDT <- function(nobs, baseprobs, defA) {
dT <- genDT(nobs, baseprobs, defA)
clmFit <- clm(r ~ rx, data = dT)
coef(summary(clmFit))
}
iterate <- function(niter, nobs, baseprobs, defA) {
res <- lapply(1:niter, function(x) analyzeDT(nobs, baseprobs, defA))
mean( sapply(res, function(x) x["rx", "Pr(>|z|)"]) < 0.05)
}</code></pre>
<p><code>lapply</code> is used here to cycle through each scenario (for enhanced speed <code>mclapply</code> in the <code>parallel</code> package could be used):</p>
<pre class="r"><code>set.seed(1295)
power <- lapply(baseprobs, function(x) iterate(niter = 10000,
nobs = 100, x, defA))</code></pre>
</div>
<div id="effect-of-k-on-power" class="section level3">
<h3>Effect of K on power</h3>
<p>A plot of the estimates suggests a strong relationship between the number of categories and power:</p>
<p><img src="https://www.rdatagen.net/post/2020-05-26-the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome.en_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>In this particular case, it seems apparent there are benefits to increasing from 2 to 6 categories. However, there are only slight gains to be had by increasing the number of categories much beyond this; indeed, extending to the full 16 categories may not be worth the trouble, as the gains in power are minimal.</p>
<p>These minimal gains need to be weighed against the potential difficulty in acquiring the finely grained categorical outcomes. In cases where the defined categories are completely objective and are naturally collected as part of an operating environment - as in the WHO scale that might be gathered from an electronic health record - there is no real added burden to maximizing the number of categories. However, if the outcome scores are based on patient responses to a survey, the quality of data collection may suffer. Adding additional categories may confuse the patient and make the data collection process more burdensome, resulting in unreliable responses or even worse, missing data. In this case, the potential gains in power may be offset by poor data quality.</p>
</div>
To stratify or not? It might not actually matter...
https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/
Tue, 12 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Continuing with the theme of <em>exploring small issues that come up in trial design</em>, I recently used simulation to assess the impact of stratifying (or not) in the context of a multi-site Covid-19 trial with a binary outcome. The investigators are concerned that baseline health status will affect the probability of an outcome event, and are interested in randomizing by health status. The goal is to ensure balance across the two treatment arms with respect to this important variable. This randomization would be paired with an estimation model that adjusts for health status.</p>
<p>An alternative strategy is to ignore health status in the randomization, but to pre-specify an outcome model that explicitly adjusts for health status, just as in the stratification scenario. The question is, how do the operating characteristics (e.g. <em>power</em>, <em>variance</em>, and <em>bias</em>) of each approach compare? Are the (albeit minimal) logistics necessary for stratification worth the effort?</p>
<div id="simulation" class="section level3">
<h3>Simulation</h3>
<p>Simulations under a variety of scenarios suggest that stratification might not be necessary. (See this <a href="https://www.sciencedirect.com/science/article/pii/S0895435698001383">paper</a> for a much deeper, richer discussion of these issues.)</p>
<div id="define-the-data" class="section level4">
<h4>Define the data</h4>
<p>In these simulations, I assume that there are a small number of clusters. The proportion of high risk cases in each cluster varies (specified by <code>p</code>), as do the event rates (specified by <code>a</code>). The simulations vary the log odds of an outcome (<code>baseLO</code>), the effect size/log-odds ratio (<code>effLOR</code>), and the effect of poor health status (<code>xLOR</code>):</p>
<pre class="r"><code>library(simstudy)
library(parallel)
setDefs <- function(pX, precX, varRE, baseLO, effLOR, xLOR) {
defc <- defData(varname = "p", formula = pX, variance = precX,
dist = "beta", id = "site")
defc <- defData(defc, varname = "a", formula = 0, variance = varRE)
form <- genFormula(c(baseLO, effLOR, xLOR, 1), vars = c("rx", "x", "a"))
defi1 <- defDataAdd(varname = "x", formula = "p", dist = "binary")
defi2 <- defDataAdd(varname = "y", formula = form, dist = "binary", link = "logit")
return(list(defc = defc, defi1 = defi1, defi2 = defi2))
}</code></pre>
</div>
<div id="generate-the-data-and-estimates" class="section level4">
<h4>Generate the data and estimates</h4>
<p>Under each scenario, the data definitions are established by a call to <code>setDefs</code> and treatment is randomized, stratified by <em>site</em> only, or by <em>site</em> <strong>and</strong> <em>health status</em> <code>x</code>. (There is a slight bug in the <code>trtAssign</code> function that will generate an error if there is only a single observation in a site and particular strata - which explains my use of the <code>try</code> function to prevent the simulations from grinding to a halt. This should be fixed soon.)</p>
<p>For each generated data set under each scenario, we estimate a <em>generalized linear model</em>:</p>
<p><span class="math display">\[
logit(y_{ij}) = \beta_0 + \gamma_j + \beta_1r_i + \beta_2x_i \ ,
\]</span>
where <span class="math inline">\(y_{ij}\)</span> is the outcome for patient <span class="math inline">\(i\)</span> at site <span class="math inline">\(j\)</span>, <span class="math inline">\(r_i\)</span> is the treatment indicator, and <span class="math inline">\(x_i\)</span> is the health status indicator. <span class="math inline">\(\gamma_j\)</span> is a fixed site-level effect. The function returns the parameter estimate of the log-odds ratio (the treatment effect <span class="math inline">\(\beta_1\)</span>), as well as its standard error estimate and p-value.</p>
<pre class="r"><code>genEsts <- function(strata, nclust, clustsize, pX, precX,
                    varRE, baseLO, effLOR, xLOR) {
  defs <- setDefs(pX, precX, varRE, baseLO, effLOR, xLOR)
  dc <- genData(nclust, defs$defc)
  dx <- genCluster(dc, "site", clustsize, "id")
  dx <- addColumns(defs$defi1, dx)
  dx <- try(trtAssign(dx, strata = strata, grpName = "rx"), silent = TRUE)
  if (class(dx)[1] == "try-error") {
    return(NULL)
  }
  dx <- addColumns(defs$defi2, dx)
  glmfit <- glm(y ~ factor(site) + rx + x, data = dx, family = "binomial")
  estrx <- t(coef(summary(glmfit))["rx", ])
  return(data.table(estrx))
}</code></pre>
</div>
<div id="iterate-through-multiple-scenarios" class="section level4">
<h4>Iterate through multiple scenarios</h4>
<p>We will “iterate” through the different scenarios using the <code>mclapply</code> function from the <code>parallel</code> package. For each scenario, we generate 2500 data sets and sets of parameter estimates, and then calculate the average effect estimate, the observed and (average) estimated standard errors, and the estimated power (the proportion of p-values below 0.05):</p>
<pre class="r"><code>forFunction <- function(strata, nclust, clustsize, pX, precX,
                        varRE, baseLO, effLOR, xLOR) {
  res <- rbindlist(mclapply(1:2500, function(x)
    genEsts(strata, nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))
  data.table(strata = length(strata), nclust, clustsize, pX, precX,
             varRE, baseLO, effLOR, xLOR,
             est = res[, mean(Estimate)],
             se.obs = res[, sd(Estimate)],
             se.est = res[, mean(`Std. Error`)],
             pval = res[, mean(`Pr(>|z|)` < 0.05)]
  )
}</code></pre>
</div>
<div id="specify-the-scenarios" class="section level4">
<h4>Specify the scenarios</h4>
<p>We specify all the scenarios by creating a data table of parameters. Each row of this table represents a specific scenario, for which 2500 data sets will be generated and parameters estimated. For the simulations reported here, I varied the stratification strategy, the cluster size, the baseline event rate, and the effect size, for a total of 336 scenarios (<span class="math inline">\(2 \times 6 \times 4 \times 7\)</span>).</p>
<pre class="r"><code>strata <- list("site", c("site", "x"))
nclust <- 8
clustsize <- c(30, 40, 50, 60, 70, 80)
pX <- 0.35
precX <- 30
varRE <- 0.5
baseLO <- c(-1.5, -1.25, -1.0, -0.5)
effLOR <- seq(0.5, 0.8, by = 0.05)
xLOR <- c(0.75)

dparam <- data.table(expand.grid(strata, nclust, clustsize, pX, precX,
                                 varRE, baseLO, effLOR, xLOR))
setnames(dparam, c("strata", "nclust", "clustsize", "pX", "precX",
                   "varRE", "baseLO", "effLOR", "xLOR"))

dparam</code></pre>
<pre><code>## strata nclust clustsize pX precX varRE baseLO effLOR xLOR
## 1: site 8 30 0.35 30 0.5 -1.5 0.5 0.75
## 2: site,x 8 30 0.35 30 0.5 -1.5 0.5 0.75
## 3: site 8 40 0.35 30 0.5 -1.5 0.5 0.75
## 4: site,x 8 40 0.35 30 0.5 -1.5 0.5 0.75
## 5: site 8 50 0.35 30 0.5 -1.5 0.5 0.75
## ---
## 332: site,x 8 60 0.35 30 0.5 -0.5 0.8 0.75
## 333: site 8 70 0.35 30 0.5 -0.5 0.8 0.75
## 334: site,x 8 70 0.35 30 0.5 -0.5 0.8 0.75
## 335: site 8 80 0.35 30 0.5 -0.5 0.8 0.75
## 336: site,x 8 80 0.35 30 0.5 -0.5 0.8 0.75</code></pre>
</div>
<div id="run-the-simulation" class="section level4">
<h4>Run the simulation</h4>
<p>Everything is now set up. We go through each row of the scenario table <code>dparam</code> to generate the summaries for each scenario by repeated calls to <code>forFunction</code>, again using <code>mclapply</code>.</p>
<pre class="r"><code>resStrata <- mclapply(1:nrow(dparam), function(x) with(dparam[x, ],
  forFunction(strata[[1]], nclust, clustsize, pX, precX, varRE, baseLO, effLOR, xLOR)))

resStrata <- rbindlist(resStrata)
resStrata[, .(strata, baseLO, effLOR, xLOR, est, se.obs, se.est, pval)]</code></pre>
<pre><code>## strata baseLO effLOR est se.obs se.est pval
## 1: 1 -1.5 0.5 0.53 0.33 0.31 0.38
## 2: 2 -1.5 0.5 0.52 0.32 0.31 0.39
## 3: 1 -1.5 0.5 0.52 0.28 0.27 0.49
## 4: 2 -1.5 0.5 0.51 0.27 0.27 0.48
## 5: 1 -1.5 0.5 0.51 0.24 0.24 0.58
## ---
## 332: 2 -0.5 0.8 0.82 0.21 0.20 0.98
## 333: 1 -0.5 0.8 0.82 0.19 0.19 0.99
## 334: 2 -0.5 0.8 0.82 0.19 0.19 1.00
## 335: 1 -0.5 0.8 0.82 0.18 0.17 1.00
## 336: 2 -0.5 0.8 0.81 0.18 0.17 1.00</code></pre>
</div>
</div>
<div id="plotting-the-results" class="section level3">
<h3>Plotting the results</h3>
<p>The plots below compare the estimates from the two stratification strategies. Each point represents a specific scenario under stratification by site alone and under stratification by site and health status. If the two strategies differed, we would expect the points to diverge from the horizontal line. In all four plots, there appears to be little if any divergence, suggesting that, for these scenarios at least, there is little difference between the stratification strategies.</p>
<div id="power" class="section level4">
<h4>Power</h4>
<p>In this first scatter plot, the estimated power under each stratification strategy is plotted. Power is estimated by the proportion of p-values in the 2500 iterations that were less than 0.05. Regardless of whether observed power for a particular scenario is high or low, we generally observe the same power under both strategies. The points do not diverge far from the red line, which represents perfect equality.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-8-1.png" width="576" /></p>
</div>
<div id="standard-errors" class="section level4">
<h4>Standard errors</h4>
<p>There are two ways to look at the variability of the two strategies. First, we can look at the observed variability of the effect estimates across the 2500 iterations. And second, we can look at the average of the standard error estimates across the iterations. In general, the two randomization schemes appear quite similar with respect to both observed and estimated variation.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-9-1.png" width="576" /></p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-10-1.png" width="576" /></p>
</div>
<div id="treatment-effects" class="section level4">
<h4>Treatment effects</h4>
<p>In this last plot, the average estimated treatment effect is shown for each scenario. The two stratification strategies both appear to provide the same unbiased estimates of the treatment effect.</p>
<p><img src="https://www.rdatagen.net/post/2020-05-12-to-stratify-or-not-to-stratify.en_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p><small><font color="darkkhaki">
Reference: Kernan, Walter N., Catherine M. Viscoli, Robert W. Makuch, Lawrence M. Brass, and Ralph I. Horwitz. “Stratified randomization for clinical trials.” <em>Journal of Clinical Epidemiology</em> 52, no. 1 (1999): 19-26.
</font></small></p>
</div>
</div>
Simulation for power in designing cluster randomized trials
https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/
Tue, 28 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>As a biostatistician, I like to be involved in the design of a study as early as possible. I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis. (I guess this is the appropriate time to pull out the quote made by the famous statistician Ronald Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”)</p>
<p>In these times, when researchers are scrambling to improve care for patients with Covid-19, there often isn’t time for those early conversations, or they are happening with many different people. Recently, I’ve been asked to help figure out the sample size requirements for four or five studies involving promising therapies for Covid-19 patients at various stages of the disease. In most cases, randomization is at the patient level, so power/sample size calculations are much simpler. In other situations, cluster randomization at the unit or hospital floor level is being considered, so the sample size estimates are a little more involved.</p>
<p>There are analytic/formula solutions for sample size estimates in non-clustered randomization. And if the outcome is continuous, adjustments can be made using an estimate of the <em>design effect</em>, which I wrote about <a href="https://www.rdatagen.net/post/what-exactly-is-the-design-effect/">recently</a>. When the outcome is binary, or the number of clusters is small, or the cluster sizes themselves are small, I feel more comfortable using simulation methods. Indeed, the <code>simstudy</code> package grew out of my work to facilitate data generation for this very reason.</p>
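<p>As a rough illustration of that adjustment - using the standard design effect formula for equal cluster sizes, with values assumed here for illustration rather than taken from any of these studies - the sample size from an unclustered calculation is inflated by a factor of <span class="math inline">\(1 + (m - 1) \times ICC\)</span>, where <span class="math inline">\(m\)</span> is the cluster size:</p>
<pre class="r"><code># standard design effect for equal cluster sizes (illustrative values)
icc <- 0.025                 # assumed intra-cluster correlation
m <- 10                      # assumed cluster size
deff <- 1 + (m - 1) * icc    # 1.225
ceiling(500 * deff)          # inflating an unclustered total n of 500 gives 613</code></pre>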
<p>My intention here is to provide code to help others in case they want to conduct these relatively simple power analyses. One of the proposed studies expected to have a large number of small-sized clusters, so that is what I’ll simulate here.</p>
<div id="the-data-generation-process" class="section level3">
<h3>The data generation process</h3>
<p>To estimate power under a range of scenarios, I’ve written two functions to define the data generation process, one to generate the data, and a final one to generate a single data set and estimate the parameters of a mixed effects model.</p>
<div id="data-definitions" class="section level4">
<h4>data definitions</h4>
<p>The variance of the cluster-level random effect is based on a conversion of the intra-cluster correlation (<em>ICC</em>) to the logistic scale, which is done through a call to the function <code>iccRE</code>. The definition of the outcome is based on this random effect plus a log odds-ratio that is derived from the control proportion and the assumed percent change:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(lme4)

defRE <- function(icc, dist = "binary", varW = NULL) {
  setVar <- iccRE(ICC = icc, dist = dist, varWithin = varW)
  def <- defData(varname = "a", formula = 0, variance = setVar, id = "cluster")
  return(def)
}

defBinOut <- function(p1, pctdelta) {
  p2 <- (1 - pctdelta) * p1
  int <- round(log( p1 / (1 - p1) ), 4)
  effect <- round(log( (p2 / (1 - p2)) / (p1 / (1 - p1)) ), 4)
  formula <- genFormula( c(int, effect, 1), c("rx", "a") )
  def <- defDataAdd(varname = "y", formula = formula, dist = "binary",
                    link = "logit")
  return(def)
}</code></pre>
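<p>As an aside - this is my annotation of what <code>iccRE</code> is doing, based on the standard latent-variable formulation - for a binary outcome with a logit link, the within-cluster variance on the logistic scale is <span class="math inline">\(\pi^2/3\)</span>, so the random effect variance is <span class="math inline">\(\frac{ICC}{1-ICC} \times \pi^2/3\)</span>. This is easy to verify by hand:</p>
<pre class="r"><code>icc <- 0.025
(icc / (1 - icc)) * (pi^2 / 3)   # approximately 0.0844, matching defRE(icc = 0.025) below</code></pre>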
</div>
<div id="data-generation" class="section level4">
<h4>data generation</h4>
<p>The data generation follows from the data definitions. First, cluster-level data are generated (along with treatment assignment), and then the individual patient level data.</p>
<pre class="r"><code>genDataSet <- function(nclust, clustsize, re.def, out.def) {
  dClust <- genData(nclust, re.def)
  dClust <- trtAssign(dClust, grpName = "rx")
  dPat <- genCluster(dtClust = dClust, cLevelVar = "cluster",
                     numIndsVar = clustsize, level1ID = "id")
  dPat <- addColumns(out.def, dPat)
  return(dPat)
}</code></pre>
</div>
<div id="model-estimation" class="section level4">
<h4>model estimation</h4>
<p>The <em>p-values</em> used for the power calculation are estimated using <code>glmer</code> from the <code>lme4</code> package, which fits generalized linear mixed effects models. (If the outcome were continuous, we would use <code>lmer</code> instead.) Unfortunately, this estimation can be relatively resource-intensive, so repeated estimation over a wide range of scenarios can be rather time consuming.</p>
<p>One way to speed things up is to eliminate a step in the <code>glmer</code> algorithm that takes considerable time, but this has the side effect of excluding information about whether or not the model estimation has converged. Convergence can be a particular problem when variation across clusters is low, as when the <em>ICC</em> is low. The function below keeps track of whether an iteration has converged (but only if <code>fast</code> is set to FALSE). One might want to explore how frequently there is a failure to converge before turning on the <code>fast</code> flag.</p>
<p>This function returns the convergence status, the estimate of the random effects variance, and the effect parameter estimate, standard error, and p-value.</p>
<p><br></p>
<pre class="r"><code>genBinEsts <- function(nclust, clustsize, re.def, out.def, fast = FALSE) {
  dP <- genDataSet(nclust, clustsize, re.def, out.def)
  mod.re <- glmer(y ~ rx + (1 | cluster), data = dP, family = binomial,
                  control = glmerControl(optimizer = "bobyqa", calc.derivs = !(fast)))
  convStatus <- as.numeric(length(summary(mod.re)$optinfo$conv$lme4))
  res <- data.table(convStatus, re = VarCorr(mod.re)$cluster,
                    t(coef(summary(mod.re))["rx", ]))
  return(res)
}</code></pre>
</div>
</div>
<div id="single-data-set" class="section level3">
<h3>Single data set</h3>
<p>Here is an example setting the <em>ICC</em> at 0.025, the control proportion at 40%, and an effect size that translates to a 30% reduction (so that the treatment proportion will be 28%).</p>
<pre class="r"><code>(defa <- defRE(icc = 0.025))</code></pre>
<pre><code>## varname formula variance dist link
## 1: a 0 0.0844 normal identity</code></pre>
<pre class="r"><code>(defy <- defBinOut(0.40, 0.30))</code></pre>
<pre><code>## varname formula variance dist link
## 1: y -0.4055 + -0.539 * rx + 1 * a 0 binary logit</code></pre>
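<p>As a quick check on that definition - my arithmetic, simply mirroring the <code>defBinOut</code> logic - the intercept and treatment coefficient follow directly from the assumed proportions:</p>
<pre class="r"><code>p1 <- 0.40                                         # control proportion
p2 <- (1 - 0.30) * p1                              # 30% reduction: 0.28
round(log(p1 / (1 - p1)), 4)                       # intercept: -0.4055
round(log((p2 / (1 - p2)) / (p1 / (1 - p1))), 4)   # log odds ratio: -0.539</code></pre>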
<p>And when we generate a single data set and estimate the parameters by calling the last function:</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG")
set.seed(2711)
genBinEsts(40, 10, defa, defy)</code></pre>
<pre><code>## convStatus re.(Intercept) Estimate Std. Error z value Pr(>|z|)
## 1: 0 0.284 -0.625 0.279 -2.24 0.0249</code></pre>
</div>
<div id="estimating-power" class="section level3">
<h3>Estimating power</h3>
<p>Everything is set up now to estimate power with repeated calls to this group of functions. This process can be parallelized using the <code>mclapply</code> function in the <code>parallel</code> package, as I illustrated in an earlier <a href="https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/">post</a>. Here, I am showing a <em>for loop</em> implementation.</p>
<p>The variables <code>ICC</code>, <code>SS</code>, <code>nCLUST</code>, <code>ctlPROB</code>, and <code>pctDELTA</code> are vectors containing all the possible scenarios for which power will be estimated. In this case, power will be based on 1000 iterations under each set of assumptions.</p>
<p><br></p>
<pre class="r"><code>library(parallel)

nIters <- 1000
results <- NULL

for (icc in ICC) {
  for (ss in SS) {
    for (nclust in nCLUST) {
      for (p1 in ctlPROB) {
        for (pdelta in pctDELTA) {

          clustsize <- ss %/% nclust
          p2 <- p1 * (1 - pdelta)

          defa <- defRE(icc)
          defy <- defBinOut(p1, pdelta)

          res <- rbindlist(mclapply(1:nIters,
            function(x) genBinEsts(nclust, clustsize, defa, defy)))

          dres <- data.table(icc, ss, nclust, clustsize, p1, p2, pdelta,
            converged = res[, mean(convStatus == 0)],
            p.conv = res[convStatus == 0, mean(`Pr(>|z|)` < 0.05)],
            p.all = res[convStatus != 2, mean(`Pr(>|z|)` < 0.05)])

          print(dres)
          results <- rbind(results, dres)
        }
      }
    }
  }
}</code></pre>
<p>The first set of simulations evaluated power at a range of <em>ICC’s</em>, sample sizes, and effect sizes (in terms of percentage reduction). The number of clusters was fixed at 40, so the cluster size increased along with sample size. The probability of an event for a patient in the control group was also fixed at 10%.</p>
<pre class="r"><code>ICC <- seq(0.025, .10, 0.025)
SS <- seq(200, 1600, 200)
pctDELTA <- c(.2, .3, .4, .5, .6)
nCLUST <- 40
ctlPROB <- 0.10</code></pre>
<p>The plot shows that at total sample sizes below 800, we would only be able to detect effect sizes of 60% when the control proportion is 10%.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-10-1.png" width="1056" /></p>
<p>When the control proportion is at 40%, there is clearly a much higher probability that the study will detect an effect even at the smaller sample sizes. Under these scenarios, the <em>ICC</em> has a much greater impact on power than when the control proportion is much lower.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-11-1.png" width="1056" /></p>
</div>
<div id="other-properties-of-the-design" class="section level3">
<h3>Other properties of the design</h3>
<p>Of course, power is only one concern of many. For example, we may need to understand how a study’s design relates to bias and variance. In this case, I wondered how well the standard error estimates would compare to observed standard errors, particularly when the cluster sizes were on the lower end. Here are two plots comparing the two.</p>
<p>The lines represent the observed standard errors (the standard deviation of the measure of effect, or the parameter representing the log odds-ratio) at each sample size (assuming 40 clusters and an effect size of 30% reduction.) The points are the average estimate of the standard error with error bars that reflect <span class="math inline">\(\pm\)</span> 1 standard deviation.</p>
<p>In the first set of scenarios, the control probability is set at 10%. For all <em>ICCs</em> except perhaps 0.10, it appears that the standard error estimates are, on average, too large, though there is quite a bit of variability. The over-estimation declines as the between-cluster variance increases.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-12-1.png" width="1056" /></p>
<p>In the second set of scenarios, where the control probability is at 40%, there is less variation in the standard error estimates (as reflected in the shorter length error bars), and there appears to be a slight underestimate of variation, particularly with the larger <em>ICCs</em>.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-28-simulation-for-power-calculations-in-designing-cluster-randomized-trials.en_files/figure-html/unnamed-chunk-13-1.png" width="1056" /></p>
<p>I have only scratched the surface here in terms of the scenarios that can be investigated. In addition, there are other measures to consider. Clearly, it would be useful to know whether these observed biases in the standard error estimates disappear with larger cluster sizes, or how the number of clusters relates to this bias. And I didn’t even look at whether the effect size estimates are biased in different scenarios. The point is, while power is important, we must also understand the quality of our estimates.</p>
</div>
Yes, unbalanced randomization can improve power, in some situations
https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/
Tue, 14 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/
<p>Last time I provided some simulations that <a href="https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/">suggested</a> that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test.</p>
<div id="generating-two-groups-with-different-means-and-variance" class="section level3">
<h3>Generating two groups with different means and variance</h3>
<p>Using <code>simstudy</code> (the latest version 1.16 is now available on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a>), it is possible to generate different levels of variance by specifying a formula in the data definition. In this example, the treatment group variance is five times the control group variance:</p>
<pre class="r"><code>library(simstudy)
library(data.table)

def <- defDataAdd(varname = "y", formula = "1.1*rx",
                  variance = "1*(rx==0) + 5*(rx==1)", dist = "normal")</code></pre>
<p>I have written a simple function to generate the data that can be used later in the power experiments:</p>
<pre class="r"><code>genDataSet <- function(n, ratio, def) {
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
  dx <- addColumns(def, dx)
  return(dx[])
}</code></pre>
<p>And now we can generate and look at some data.</p>
<pre class="r"><code>RNGkind("L'Ecuyer-CMRG")
set.seed(383)

dx1 <- genDataSet(72, c(1, 2), def)
davg <- dx1[, .(avg = mean(y)), keyby = rx]

library(paletteer)
library(ggplot2)

ggplot(data = dx1, aes(x = factor(rx), y = y)) +
  geom_jitter(width = .15, height = 0, aes(color = factor(rx))) +
  geom_hline(data = davg, lty = 3, size = .75,
             aes(yintercept = avg, color = factor(rx))) +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "grey98"),
        legend.position = "none",
        axis.title.x = element_blank()) +
  scale_x_discrete(labels = c("control", "treatment")) +
  scale_color_paletteer_d("jcolors::pal5")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-04-14-unbalanced-randomization-can-improve-power-in-some-situations.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
</div>
<div id="power-analyses" class="section level3">
<h3>Power analyses</h3>
<p>The following function generates a data set, records the difference in means for the two groups, and estimates the p-value of a <em>t-test</em> that assumes different variances for the two groups.</p>
<pre class="r"><code>genPvalue <- function(n, ratio, def) {
  dx <- genDataSet(n, ratio, def)
  mean.dif <- dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
  p.value <- t.test(y ~ rx, data = dx)$p.value
  data.table(r = paste0(ratio[1], ":", ratio[2]), mean.dif, p.value)
}</code></pre>
<p>In this comparison, we are considering three different designs or randomization schemes. In the first, randomization will be 1 to 1, so that half of the sample of 72 (n = 36) is assigned to the treatment arm. In the second, randomization will be 1 to 2, so that 2/3 of the sample (n=48) is assigned to treatment. And in the last, randomization will be 1 to 3, where 3/4 of the patients (n = 54) will be randomized to treatment. For each scenario, we will estimate the mean difference between the groups, the standard deviation of differences, and the power. All of these estimates will be based on 5000 data sets each, and we are still assuming treatment variance is five times the control variance.</p>
<pre class="r"><code>library(parallel)

ratios <- list(c(1, 1), c(1, 2), c(1, 3))

results <- mclapply(ratios, function(r)
  rbindlist(mclapply(1:5000, function(x) genPvalue(72, r, def)))
)

results <- rbindlist(results)</code></pre>
<p>All three schemes provide an unbiased estimate of the effect size, though the unbalanced designs have slightly less variability:</p>
<pre class="r"><code>results[, .(avg.difference = mean(mean.dif),
            sd.difference = sd(mean.dif)), keyby = r]</code></pre>
<pre><code>## r avg.difference sd.difference
## 1: 1:1 1.1 0.41
## 2: 1:2 1.1 0.38
## 3: 1:3 1.1 0.39</code></pre>
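<p>These simulated standard deviations line up closely with the analytic standard error of a difference between two independent means (a check I am adding here, not part of the original simulation):</p>
<pre class="r"><code># SE of the difference in means: sqrt(s1^2/n1 + s2^2/n2), with variances 1 and 5
n <- 72
se_diff <- function(n1, s1sq = 1, s2sq = 5) sqrt(s1sq / n1 + s2sq / (n - n1))
round(c(se_diff(36), se_diff(24), se_diff(18)), 3)   # 1:1, 1:2, 1:3: 0.408 0.382 0.385</code></pre>
<p>The small gap between the analytic value for the 1:3 scheme (0.385) and the simulated 0.39 is just Monte Carlo error.</p>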
<p>The reduced variability translates into improved power for the unbalanced designs:</p>
<pre class="r"><code>results[, .(power = mean(p.value < 0.05)), keyby = r]</code></pre>
<pre><code>## r power
## 1: 1:1 0.75
## 2: 1:2 0.81
## 3: 1:3 0.81</code></pre>
</div>
<div id="benefits-of-imbalance-under-different-variance-assumptions" class="section level3">
<h3>Benefits of imbalance under different variance assumptions</h3>
<p>It seems reasonable to guess that if the discrepancy in variance between the two groups is reduced, there will be less advantage to over-allocating patients in the treatment arm. In fact, it may even be a disadvantage, as in the case of a binary outcome. Likewise, as the discrepancy increases, increased enthusiasm for unbalanced designs may be warranted.</p>
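<p>A useful rule of thumb here - Neyman allocation, which I am adding as context rather than something derived in this post - is that the standard error of the difference in means is minimized when the allocation ratio matches the ratio of the standard deviations:</p>
<pre class="r"><code># SE-minimizing treatment:control allocation ratio is sqrt(var_t / var_c)
var_t <- c(1, 3, 5, 9)       # treatment group variances (control variance is 1)
round(sqrt(var_t / 1), 2)    # 1.00 1.73 2.24 3.00</code></pre>
<p>So, with a treatment variance of 3, the optimal scheme is only about 1:1.7, consistent with the limited benefit of anything beyond 1:2 in the simulations that follow.</p>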
<p>Here is a plot (code available upon request) showing how the variation in the mean differences (shown as a standard deviation) relates to the randomization scheme and the underlying difference in the variances of the control and treatment groups. In all cases, the assumed variance for the control group was 1. The variance for the treatment group ranged from 1 to 9 in different sets of simulations. At each level of variance, four randomization schemes were evaluated: 1:1, 1:2, 1:3, and 1:4.</p>
<p>When the variances are equal, there is no apparent benefit to using anything other than a 1:1 randomization scheme. Even when the variance of the treatment group increases to 3, there is little benefit to a 1:2 arrangement. At higher levels of variance - in this case 5 - there appears to be more of a benefit to randomizing more people to treatment. However, at all the levels shown here, it does not look like anything above 1:2 is warranted.</p>
<p>So, before heading down the path of unbalanced randomization, make sure to take a look at your variance assumptions.</p>
<p><img src="https://www.rdatagen.net/post/2020-04-14-unbalanced-randomization-can-improve-power-in-some-situations.en_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
</div>
Can unbalanced randomization improve power?
https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/
Tue, 31 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/
<p>Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.</p>
<p>I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes.</p>
<p>Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to get possible treatments to as many people as possible, even in a randomized control trial. One question that came up in the design of this study was whether there would be efficiency gains from using a 1:2 randomization scheme. That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only <em>lose</em> efficiency if we use anything other than 1:1 randomization.</p>
<div id="brief-public-service-announcement-simstudy-update" class="section level3">
<h3>Brief public service announcement: simstudy update</h3>
<p>When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the <code>simstudy</code> package, which supports much of the simulations on this blog, could not readily handle anything other than 1:1 randomization. I had to quickly rectify that shortcoming. There is a new argument <code>ratio</code> in the <code>trtAssign</code> function where you can now specify any scheme for any number of treatment arms. This is available in version 1.16, which for the moment can be found only on github (kgoldfeld/simstudy).</p>
<p>Here is an example based on a 1:2:3 allocation. I’m not sure if that would ever be appropriate, but it shows the flexibility of the new argument. One counter-intuitive aspect of this implementation is that the <code>balanced</code> argument is set to <code>TRUE</code>, indicating that the allocation to the groups will be perfect, or as close as possible to the specified ratios. If <code>balanced</code> is <code>FALSE</code>, the ratios are used as relative probabilities instead.</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(16)

dx <- genData(600)
dx <- trtAssign(dx, nTrt = 3, balanced = TRUE,
                ratio = c(1, 2, 3), grpName = "rx")

dx[, table(rx)]</code></pre>
<pre><code>## rx
## 1 2 3
## 100 200 300</code></pre>
</div>
<div id="unbalanced-designs-with-a-binary-outcome" class="section level3">
<h3>Unbalanced designs with a binary outcome</h3>
<p>The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events must occur within 30 days for a patient to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed that 40% of the patients in the usual care arm would have a bad outcome, and that the drug treatment would reduce bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome).</p>
<p>If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters.</p>
<pre class="r"><code>estCoef <- function(n, formula, ratio) {
  def <- defDataAdd(varname = "y", formula = formula, dist = "binary")
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx", ratio = ratio)
  dx <- addColumns(def, dx)
  coef(summary(glm(y ~ rx, family = binomial, data = dx)))
}

estCoef(n = 244*2, formula = "0.4 - 0.3 * 0.4 * rx", ratio = c(1, 1))</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4328641 0.1310474 -3.303111 0.0009561867
## rx -0.4577476 0.1924533 -2.378487 0.0173838304</code></pre>
<p>The probabilities of a bad outcome for the usual care group and drug treatment group are</p>
<pre class="r"><code>c(usual = 1/(1 + exp(0.433)), drug = 1/(1+exp(0.433 + 0.458))) </code></pre>
<pre><code>## usual drug
## 0.3934102 0.2909035</code></pre>
</div>
<div id="assessing-power" class="section level3">
<h3>Assessing power</h3>
<p>In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05.</p>
<p>Here is the analytic solution for a 1:1 ratio.</p>
<pre class="r"><code>power.prop.test(p1 = .4, p2 = .7*.4, power = .80)</code></pre>
<pre><code>## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 243.4411
##              p1 = 0.4
##              p2 = 0.28
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group</code></pre>
<p>The sample size estimate based on 80% power suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated <span class="math inline">\(n\)</span> in a simulation for power (using 1000 datasets), we should be close to 80%:</p>
<pre class="r"><code>est.power <- function(n, ratio, p1, reduction) {
  
  formula <- paste(p1, "* (1 -", reduction, "* rx)")
  p.val <- estCoef(n, formula, ratio)["rx", 4]
  
  return(p.val)
}

pvals <- unlist(mclapply(1:1000, 
  function(x) est.power(488, c(1, 1), 0.4, 0.3)))

mean(pvals < 0.05)</code></pre>
<pre><code>## [1] 0.814</code></pre>
</div>
<div id="the-power-experiment" class="section level3">
<h3>The power experiment</h3>
<p>Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Here is the code that will allow us to explore these questions:</p>
<pre class="r"><code>res <- list()

for (p1 in c(.2, .3, .4, .5)) {
  for (r in c(.2, .25, .3)) {
    
    p2 <- (1 - r) * p1
    n <- ceiling(power.prop.test(p1 = p1, p2 = p2, power = .80)$n) * 2
    
    for (i in c(1:5)) {
      
      pvals <- mclapply(1:1000, function(x) est.power(n, c(1, i), p1, r))
      pvals <- unlist(pvals)
      
      dres <- data.table(n, control_p = p1, pct_reduction = r, 
        control = 1, rx = i, power = mean(pvals < .05))
      
      res <- append(res, list(dres))
    }
  }
}

res <- rbindlist(res)</code></pre>
<p>Repeating the power simulation across a range of assumptions indicates that, at least in the case of a binary outcome, an unbalanced design does not improve the power of the study, even though it does allow more patients to receive the drug treatment:</p>
<pre class="r"><code>ggplot(data = res, aes(x = rx, y = power)) +
  geom_line(color = "blue") +
  facet_grid(control_p ~ pct_reduction, labeller = label_both) +
  theme(panel.grid = element_blank()) +
  scale_x_continuous(name = "ratio of treatment to control", 
    breaks = c(1:5), labels = paste0(c(1:5), ":1")) +
  scale_y_continuous(limits = c(.5, .9), breaks = c(.6, .7, .8))</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-03-31-can-unbalanced-randomization-improve-power.en_files/figure-html/unnamed-chunk-7-1.png" width="624" /></p>
</div>
<div id="continuous-outcomes" class="section level3">
<h3>Continuous outcomes</h3>
<p>In the case of binary outcomes, reducing sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I’ll show some simulations that indicate this might be the case.</p>
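<p>As a quick preview of the intuition (a back-of-the-envelope sketch of my own, not the simulations to come): the variance of the difference in two sample means is <span class="math inline">\(\sigma_1^2/n_1 + \sigma_2^2/n_2\)</span>, and for a fixed total sample size it is minimized when the allocation ratio matches the ratio of the standard deviations.</p>
<pre class="r"><code># variance of the difference in means for two arms
var_diff <- function(n1, n2, s1 = 1, s2 = 3) s1^2 / n1 + s2^2 / n2

# with standard deviations of 1 and 3, a 1:3 allocation beats a balanced one
c(balanced = var_diff(200, 200), unbalanced = var_diff(100, 300))</code></pre>
<pre><code>##   balanced unbalanced 
##       0.05       0.04</code></pre>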
<p>Stay well.</p>
</div>
When you want more than a chi-squared test, consider a measure of association
https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/
Tue, 17 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In my last <a href="https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/">post</a>, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the <span class="math inline">\(\chi^2\)</span> statistic.</p>
<p>I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal. This can arise, for example, when we want to compare Likert-type responses across multiple groups. The measure of effect I focused on - the cumulative proportional odds - is quite useful, but is potentially limited for two reasons. First, the proportional odds assumption may not be reasonable, potentially leading to biased estimates. Second, both factors may be nominal (i.e. not ordinal), in which case the cumulative odds model is inappropriate.</p>
<p>An alternative, non-parametric measure of association that can be broadly applied to any contingency table is <em>Cramér’s V</em>, which is calculated as</p>
<p><span class="math display">\[
V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}
\]</span>
where <span class="math inline">\(\chi^2\)</span> is from the Pearson’s chi-squared test, <span class="math inline">\(N\)</span> is the total number of responses across all groups, <span class="math inline">\(r\)</span> is the number of rows in the contingency table, and <span class="math inline">\(c\)</span> is the number of columns. <span class="math inline">\(V\)</span> ranges from <span class="math inline">\(0\)</span> to <span class="math inline">\(1\)</span>, with <span class="math inline">\(0\)</span> indicating no association, and <span class="math inline">\(1\)</span> indicating the strongest possible association. (In the <a href="#addendum">addendum</a>, I provide a little detail as to why <span class="math inline">\(V\)</span> cannot exceed <span class="math inline">\(1\)</span>.)</p>
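<p>The formula translates directly into a few lines of code. Here is a minimal base R helper of my own (the rest of the post builds up the same calculation step by step rather than using this function):</p>
<pre class="r"><code>cramersV <- function(tab) {
  X2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  N <- sum(tab)
  unname(sqrt((X2 / N) / (min(dim(tab)) - 1)))
}

# a perfectly dependent 2x2 table yields the maximum value of 1
cramersV(matrix(c(10, 0, 0, 10), nrow = 2))</code></pre>
<pre><code>## [1] 1</code></pre>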
<div id="simulating-independence" class="section level3">
<h3>Simulating independence</h3>
<p>In this first example, the distribution of ratings is independent of the group membership. In the data generating process, the probability distribution for <code>rating</code> has no reference to <code>grp</code>, so we would expect similar distributions of the response across the groups:</p>
<pre class="r"><code>library(simstudy)

def <- defData(varname = "grp", 
  formula = "0.3; 0.5; 0.2", dist = "categorical")
def <- defData(def, varname = "rating", 
  formula = "0.2; 0.3; 0.4; 0.1", dist = "categorical")

set.seed(99)
dind <- genData(500, def)</code></pre>
<p>And in fact, the distributions across the 4 rating options do appear pretty similar for each of the 3 groups:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-2-1.png" width="576" /></p>
<p>In order to estimate <span class="math inline">\(V\)</span> from this sample, we use the <span class="math inline">\(\chi^2\)</span> formula (I explored the chi-squared test with simulations in a two-part post <a href="https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/">here</a> and <a href="https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/">here</a>):</p>
<p><span class="math display">\[
\sum_{i,j} {\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}
\]</span></p>
<pre class="r"><code>observed <- dind[, table(grp, rating)]
obs.dim <- dim(observed)

getmargins <- addmargins(observed, margin = seq_along(obs.dim), 
  FUN = sum, quiet = TRUE)

rowsums <- getmargins[1:obs.dim[1], "sum"]
colsums <- getmargins["sum", 1:obs.dim[2]]

expected <- rowsums %*% t(colsums) / sum(observed)

X2 <- sum(((observed - expected)^2) / expected)
X2</code></pre>
<pre><code>## [1] 3.45</code></pre>
<p>And to check our calculation, here’s a comparison with the estimate from the <code>chisq.test</code> function:</p>
<pre class="r"><code>chisq.test(observed)</code></pre>
<pre><code>##
## Pearson's Chi-squared test
##
## data: observed
## X-squared = 3.5, df = 6, p-value = 0.8</code></pre>
<p>With <span class="math inline">\(\chi^2\)</span> in hand, we can estimate <span class="math inline">\(V\)</span>, which we expect to be quite low:</p>
<pre class="r"><code>sqrt( (X2/sum(observed)) / (min(obs.dim) - 1) )</code></pre>
<pre><code>## [1] 0.05874</code></pre>
<p>Again, to verify the calculation, here is an alternative estimate using the <code>DescTools</code> package, with a 95% confidence interval:</p>
<pre class="r"><code>library(DescTools)
CramerV(observed, conf.level = 0.95)</code></pre>
<pre><code>## Cramer V   lwr.ci   upr.ci 
##  0.05874  0.00000  0.08426</code></pre>
<p> </p>
</div>
<div id="group-membership-matters" class="section level3">
<h3>Group membership matters</h3>
<p>In this second scenario, the distribution of <code>rating</code> is specified directly as a function of group membership. This is an extreme example, designed to elicit a very high value of <span class="math inline">\(V\)</span>:</p>
<pre class="r"><code>def <- defData(varname = "grp", 
  formula = "0.3; 0.5; 0.2", dist = "categorical")

defc <- defCondition(condition = "grp == 1", 
  formula = "0.75; 0.15; 0.05; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 2", 
  formula = "0.05; 0.75; 0.15; 0.05", dist = "categorical")
defc <- defCondition(defc, condition = "grp == 3", 
  formula = "0.05; 0.05; 0.15; 0.75", dist = "categorical")

# generate the data
dgrp <- genData(500, def)
dgrp <- addCondition(defc, dgrp, "rating")</code></pre>
<p>It is readily apparent that the structure of the data is highly dependent on the group:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-8-1.png" width="576" /></p>
<p>And, as expected, the estimated <span class="math inline">\(V\)</span> is quite high:</p>
<pre class="r"><code>observed <- dgrp[, table(grp, rating)]
CramerV(observed, conf.level = 0.95)</code></pre>
<pre><code>## Cramer V   lwr.ci   upr.ci 
##   0.7400   0.6744   0.7987</code></pre>
<p> </p>
</div>
<div id="interpretation-of-cramérs-v-using-proportional-odds" class="section level3">
<h3>Interpretation of Cramér’s V using proportional odds</h3>
<p>A key question is how we should interpret <span class="math inline">\(V\)</span>. Some folks suggest that <span class="math inline">\(V \le 0.10\)</span> is very weak and anything over <span class="math inline">\(0.25\)</span> could be considered quite strong. I decided to explore this a bit by seeing how various cumulative odds ratios relate to estimated values of <span class="math inline">\(V\)</span>.</p>
<p>To give a sense of what some log odds ratios (LORs) look like, I have plotted distributions generated from cumulative proportional odds models, using LORs ranging from 0 to 2. At 0.5, there is slight separation between the groups, and by the time we reach 1.0, the differences are considerably more apparent:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-10-1.png" width="288" /></p>
<p>My goal was to see how estimated values of <span class="math inline">\(V\)</span> change with the underlying LORs. I generated 100 data sets for each LOR ranging from 0 to 3 (increasing by increments of 0.05) and estimated <span class="math inline">\(V\)</span> for each data set (of which there were 6100). The plot below shows the mean <span class="math inline">\(V\)</span> estimate (in yellow) at each LOR, with the individual estimates represented by the grey points. I’ll let you draw your own conclusions, but (in this scenario at least), it does appear that 0.25 (the dotted horizontal line) signifies a pretty strong relationship, as LORs larger than 1.0 generally have estimates of <span class="math inline">\(V\)</span> that exceed this threshold.</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-11-1.png" width="480" /></p>
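<p>The code for that experiment is not shown here, but the basic idea can be sketched in base R (this is my own stripped-down version: it generates two groups under a cumulative proportional odds model and estimates <span class="math inline">\(V\)</span> at a handful of LORs, with far fewer replications):</p>
<pre class="r"><code>set.seed(1)

base_p <- c(0.4, 0.3, 0.2, 0.1)          # control-group response probabilities
base_cum <- qlogis(cumsum(base_p)[-4])   # baseline cumulative log odds

# generate ordinal responses for one group, shifted by a log odds ratio
genOrd <- function(n, lor) {
  cum_p <- c(plogis(base_cum - lor), 1)
  sample(1:4, n, replace = TRUE, prob = diff(c(0, cum_p)))
}

# estimate V from a single two-group data set generated with a given LOR
estV <- function(lor) {
  tab <- table(rep(c("ctl", "rx"), each = 500),
               c(genOrd(500, 0), genOrd(500, lor)))
  X2 <- suppressWarnings(chisq.test(tab)$statistic)
  unname(sqrt((X2 / sum(tab)) / (min(dim(tab)) - 1)))
}

round(sapply(c(0, 0.5, 1.0, 2.0), estV), 2)  # V increases with the LOR</code></pre>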
<p> </p>
</div>
<div id="p-values-and-cramérs-v" class="section level3">
<h3>p-values and Cramér’s V</h3>
<p>To end, I am just going to circle back to where I started at the beginning of the previous <a href="https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/">post</a>, thinking about p-values and effect sizes. Here, I’ve generated data sets with a relatively small between-group difference, using a modest LOR of 0.40 that translates to a measure of association <span class="math inline">\(V\)</span> just over 0.10. I varied the sample size from 200 to 1000. For each data set, I estimated <span class="math inline">\(V\)</span> and recorded whether or not the p-value from a chi-square test would have been deemed “significant” (i.e. p-value <span class="math inline">\(< 0.05\)</span>) or not. The key point here is that as the sample size increases and we rely solely on the chi-squared test, we are increasingly likely to attach importance to the findings even though the measure of association is quite small. However, if we actually consider a measure of association like Cramér’s <span class="math inline">\(V\)</span> (or some other measure that you might prefer) in drawing our conclusions, we are less likely to get over-excited about a result when perhaps we shouldn’t.</p>
<p>I should also comment that at smaller sample sizes, we will probably over-estimate the measure of association. Here, it would be important to consider some measure of uncertainty, like a 95% confidence interval, to accompany the point estimate. Otherwise, as in the case of larger sample sizes, we would run the risk of declaring success or finding a difference when it may not be warranted.</p>
<p><img src="https://www.rdatagen.net/post/2020-03-17-when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables.en_files/figure-html/unnamed-chunk-12-1.png" width="384" /></p>
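<p>The pattern in the plot is easy to reproduce with a stand-alone sketch (my own simplified version, not the code used for the figure): a fixed, weak association is tested at increasing sample sizes, and the rejection rate climbs even though the association itself never changes.</p>
<pre class="r"><code>set.seed(7)

p1 <- c(0.25, 0.25, 0.25, 0.25)  # group 1 response probabilities
p2 <- c(0.30, 0.25, 0.25, 0.20)  # group 2: a slightly shifted distribution

# proportion of chi-squared tests with p < 0.05 at total sample size n
rej_rate <- function(n, reps = 200) {
  mean(replicate(reps, {
    tab <- rbind(table(factor(sample(1:4, n/2, TRUE, p1), levels = 1:4)),
                 table(factor(sample(1:4, n/2, TRUE, p2), levels = 1:4)))
    suppressWarnings(chisq.test(tab)$p.value) < 0.05
  }))
}

sapply(c(n200 = 200, n1000 = 1000, n5000 = 5000), rej_rate)</code></pre>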
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum-why-is-cramérs-v-le-1" class="section level3">
<h3>Addendum: Why is <em>Cramér’s V</em> <span class="math inline">\(\le\)</span> 1?</h3>
<p>Cramér’s <span class="math inline">\(V = \sqrt{\frac{\chi^2/N}{min(r-1, c-1)}}\)</span>, which cannot be lower than 0. <span class="math inline">\(V=0\)</span> when <span class="math inline">\(\chi^2 = 0\)</span>, which happens only when the observed count equals the expected count in every cell. In other words, <span class="math inline">\(V=0\)</span> only when there is complete independence.</p>
<p>It is also the case that <span class="math inline">\(V\)</span> cannot exceed <span class="math inline">\(1\)</span>. I will provide some intuition for this using a relatively simple example and some algebra. Consider the following contingency table which represents complete separation of the three groups:</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-dep.png" height="225" /></p>
<p>I would argue that this initial <span class="math inline">\(3 \times 4\)</span> table is equivalent to the following <span class="math inline">\(3 \times 3\)</span> table that collapses responses <span class="math inline">\(1\)</span> and <span class="math inline">\(2\)</span> - no information about the dependence has been lost or distorted. In this case <span class="math inline">\(n_A = n_{A1} + n_{A2}\)</span>.</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-collapsed.png" height="200" /></p>
<p>In order to calculate <span class="math inline">\(\chi^2\)</span>, we need to derive the expected values based on this collapsed contingency table. If <span class="math inline">\(p_{ij}\)</span> is the probability for cell row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span>, and <span class="math inline">\(p_{i.}\)</span> and <span class="math inline">\(p_{.j}\)</span> are the row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span> totals, respectively, then independence implies that <span class="math inline">\(p_{ij} = p_{i.}p_{.j}\)</span>. In this example, under independence, the expected cell count for cell <span class="math inline">\(i,j\)</span> is <span class="math inline">\(\frac{n_i}{N} \frac{n_j}{N} N = \frac{n_in_j}{N}\)</span>:</p>
<p><img src="https://www.rdatagen.net/img/post-cramersv/contingency-collapsed-ind.png" height="200" /></p>
<p>If we consider the contribution of group <span class="math inline">\(A\)</span> to <span class="math inline">\(\chi^2\)</span>, we start with <span class="math inline">\(\sum_{\text{group } A} (O_j - E_j)^2/E_j\)</span> and end up with <span class="math inline">\(N - n_A\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
\chi^2_{\text{rowA}} &= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{\left ( \frac{n_An_B}{N} \right )^2}{\frac{n_An_B}{N}} + \frac{\left ( \frac{n_An_C}{N} \right )^2}{\frac{n_An_C}{N}} \\ \\
&= \frac{\left ( n_A - \frac{n_A^2}{N} \right )^2}{\frac{n_A^2}{N}} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&=N \left ( \frac{n_A^2 - \frac{2n_A^3}{N} +\frac{n_A^4}{N^2}} {n_A^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&=N \left ( 1 - \frac{2n_A}{N} +\frac{n_A^2}{N^2} \right ) + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&= N - 2n_A +\frac{n_A^2}{N} + \frac{n_An_B}{N}+ \frac{n_An_C}{N} \\ \\
&= N - 2n_A + \frac{n_A}{N} \left ( {n_A} + n_B + n_C \right ) \\ \\
&= N - 2n_A + \frac{n_A}{N} N \\ \\
&= N - n_A
\end{aligned}
\]</span></p>
<p>If we repeat this on rows 2 and 3 of the table, we will find that <span class="math inline">\(\chi^2_{\text{rowB}} = N - n_B\)</span>, and <span class="math inline">\(\chi^2_{\text{rowC}} = N - n_C\)</span>, so</p>
<p><span class="math display">\[
\begin{aligned}
\chi^2 &= \chi^2_\text{rowA} +\chi^2_\text{rowB}+\chi^2_\text{rowC} \\ \\
&=(N - n_A) + (N - n_B) + (N - n_C) \\ \\
&= 3N - (n_A + n_B + n_C) \\ \\
&= 3N - N \\ \\
\chi^2 &= 2N
\end{aligned}
\]</span></p>
<p>And</p>
<p><span class="math display">\[
\frac{\chi^2}{2 N} = 1
\]</span></p>
<p>So, under this scenario of extreme separation between groups,</p>
<p><span class="math display">\[
V = \sqrt{\frac{\chi^2}{\text{min}(r-1, c-1) \times N}} = 1
\]</span></p>
<p>where <span class="math inline">\(\text{min}(r - 1, c - 1) = \text{min}(2, 3) = 2\)</span>.</p>
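<p>The algebra can also be checked numerically. With a contingency table exhibiting complete separation (group <span class="math inline">\(A\)</span> split over two responses, as in the example above, and arbitrary group sizes), we recover <span class="math inline">\(\chi^2 = 2N\)</span> and <span class="math inline">\(V = 1\)</span>:</p>
<pre class="r"><code># complete separation: each group occupies its own column(s)
tab <- rbind(A = c(20, 30,  0,  0),
             B = c( 0,  0, 25,  0),
             C = c( 0,  0,  0, 40))

N <- sum(tab)
X2 <- unname(suppressWarnings(chisq.test(tab)$statistic))

c(chisq = X2, twoN = 2 * N, V = sqrt((X2 / N) / (min(dim(tab)) - 1)))</code></pre>
<pre><code>## chisq  twoN     V 
##   230   230     1</code></pre>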
</div>
Alternatives to reporting a p-value: the case of a contingency table
https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/
Tue, 03 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/
<p>I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the <a href="https://www.amstat.org/"><em>American Statistical Association</em></a> has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. In <a href="https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108">2016</a>, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.” These principles are worth noting here in case you don’t make it over to the original paper:</p>
<ul>
<li>p-values can indicate how incompatible the data are with a specified statistical model.</li>
<li>p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.</li>
<li>scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.</li>
<li>proper inference requires full reporting and transparency.</li>
<li>a p-value, or statistical significance, does not measure the size of an effect or the importance of a result.</li>
<li>by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.</li>
</ul>
<p>More recently, the ASA <a href="https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913">elaborated</a> on this, responding to those who thought the initial paper was too negative, a list of many things <em>not</em> to do. In this new paper, the ASA argues that “knowing what not to do with p-values is indeed necessary, but it does not suffice.” We also need to know what we <em>should</em> do. One of those things should be focusing on effect sizes (and some measure of uncertainty, such as a confidence or credible interval) in order to evaluate an intervention or exposure.</p>
<div id="applying-principled-thinking-to-a-small-problem" class="section level3">
<h3>Applying principled thinking to a small problem</h3>
<p>Recently, I was discussing the presentation of results for a pilot study. I was arguing that we should convey the findings in a way that highlighted the general trends without leading readers to make overly strong conclusions, which p-values might do. So, I was arguing that, rather than presenting p-values, we should display effect sizes and confidence intervals, and avoid drawing on the concept of “statistical significance.”</p>
<p>Generally, this is not a problem; we can estimate an effect size like a difference in means, a difference in proportions, a ratio of proportions, a ratio of odds, or even the log of a ratio of odds. In this case, the outcome was a Likert-type survey where the response was “none”, “a little”, and “a lot”, and there were three comparison groups, so we had a <span class="math inline">\(3\times3\)</span> contingency table with one ordinal (i.e. ordered) factor. In this case, it is not so clear what the effect size measurement should be.</p>
<p>One option is to calculate a <span class="math inline">\(\chi^2\)</span> statistic, report the associated p-value, and call it a day. However, since the <span class="math inline">\(\chi^2\)</span> is not a measure of effect and the p-value is not necessarily a good measure of evidence, I considered estimating a cumulative odds model that would provide a measure of the association between group and response. However, I was a little concerned, because the typical version of this model makes an assumption of proportional odds, which I wasn’t sure would be appropriate here. (I’ve written about these models before, <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a> and <a href="https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/">here</a>, if you want to take a look.) It is possible to fit a cumulative odds model without the proportionality assumption, but then the estimates are harder to interpret since the effect size varies by group and response.</p>
<p>Fortunately, there is a more general measure of association for contingency tables with at least one, but possibly two, nominal factors: <em>Cramer’s V</em>. This measure makes no assumptions about proportionality.</p>
<p>My plan is to simulate contingency table data, and in this post, I will explore the cumulative odds models. Next time, I’ll describe the <em>Cramer’s V</em> measure of association.</p>
</div>
<div id="non-proportional-cumulative-odds" class="section level3">
<h3>Non-proportional cumulative odds</h3>
<p>In the cumulative odds model (again, take a look <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">here</a> for a little more description of these models), we assume that all the log-odds ratios are proportional. This may actually not be an unreasonable assumption, but I wanted to start with a data set that is generated without explicitly assuming proportionality. In the following data definition, the distribution of survey responses (<em>none</em>, <em>a little</em>, and <em>a lot</em>) across the three groups (<em>1</em>, <em>2</em>, and <em>3</em>) are specified uniquely for each group:</p>
<pre class="r"><code>library(simstudy)

# define the data
def <- defData(varname = "grp", 
  formula = "0.3; 0.5; 0.2", dist = "categorical")

defc <- defCondition(condition = "fgrp == 1", 
  formula = "0.70; 0.20; 0.10", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2", 
  formula = "0.10; 0.60; 0.30", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3", 
  formula = "0.05; 0.25; 0.70", dist = "categorical")

# generate the data
set.seed(99)

dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- addCondition(defc, dx, "rating")
dx <- genFactor(dx, "rating", replace = TRUE, 
  labels = c("none", "a little", "a lot"))

dx[]</code></pre>
<pre><code>##       id fgrp  frating
##   1:   1    2 a little
##   2:   2    3 a little
##   3:   3    3    a lot
##   4:   4    2 a little
##   5:   5    2 a little
##  ---                  
## 176: 176    2    a lot
## 177: 177    1     none
## 178: 178    3    a lot
## 179: 179    2 a little
## 180: 180    2 a little</code></pre>
<p>A distribution plot based on these 180 observations indicates that the odds are not likely proportional; the “tell” is the large bulge for those in group <em>2</em> who respond <em>a little</em>.</p>
<pre class="r"><code>library(likert)

items <- dx[, .(frating)]
names(items) <- c(frating = "rating")

likert.data <- likert(items = items, grouping = dx$fgrp)

plot(likert.data, wrap = 100, low.color = "#DAECED", 
  high.color = "#CECD7B")</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/pdata-1.png" width="672" /></p>
<p>The <span class="math inline">\(\chi^2\)</span> test, not so surprisingly, indicates that it would be reasonable to conclude there are differences in responses across the three groups:</p>
<pre class="r"><code>chisq.test(table(dx[, .(fgrp, frating)]))</code></pre>
<pre><code>##
## Pearson's Chi-squared test
##
## data: table(dx[, .(fgrp, frating)])
## X-squared = 84, df = 4, p-value <2e-16</code></pre>
<p>But, since we are trying to provide a richer picture of the association that will be less susceptible to small sample sizes, here is the cumulative (proportional) odds model fit using the <code>clm</code> function in the <code>ordinal</code> package.</p>
<pre class="r"><code>library(ordinal)
clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)</code></pre>
<pre><code>## formula: frating ~ fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -162.95 333.91 5(0)  4.61e-08 2.8e+01
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## fgrp2    2.456      0.410    5.98  2.2e-09 ***
## fgrp3    3.024      0.483    6.26  3.9e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## none|a little     0.335      0.305    1.10
## a little|a lot    2.945      0.395    7.45</code></pre>
<p>A plot of the observed proportions (shown by the lines) with the modeled proportions (shown as points) indicates that the model that makes the proportionality assumption might not be doing a great job:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-2-1.png" width="288" /></p>
<p>If we fit a model that does not make the proportionality assumption and compare using either AIC statistic (lower is better) or a likelihood ratio test (small p-value indicates that the saturated/non-proportional model is better), it is clear that the non-proportional odds model for this dataset is a better fit.</p>
<pre class="r"><code>clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
summary(clmFit.sat)</code></pre>
<pre><code>## formula: frating ~ 1
## nominal: ~fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -149.54 311.08 7(0)  8.84e-11 4.7e+01
## 
## Threshold coefficients:
##                            Estimate Std. Error z value
## none|a little.(Intercept)     0.544      0.296    1.83
## a little|a lot.(Intercept)    1.634      0.387    4.23
## none|a little.fgrp2          -4.293      0.774   -5.54
## a little|a lot.fgrp2         -0.889      0.450   -1.98
## none|a little.fgrp3          -2.598      0.560   -4.64
## a little|a lot.fgrp3         -1.816      0.491   -3.70</code></pre>
<pre class="r"><code>anova(clmFit.prop, clmFit.sat)</code></pre>
<pre><code>## Likelihood ratio tests of cumulative link models:
## 
##             formula:       nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1       logit flexible  
## clmFit.sat  frating ~ 1    ~fgrp    logit flexible  
## 
##             no.par AIC logLik LR.stat df Pr(>Chisq)    
## clmFit.prop      4 334   -163                          
## clmFit.sat       6 311   -150    26.8  2    1.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<p>It is possible that the poor fit is just a rare occurrence. Below is a plot that shows the average result (<span class="math inline">\(\pm 1 \ sd\)</span>) of 1000 model fits for 1000 data sets generated under the same process. It appears those initial results were not an aberration - the proportional odds model yields biased estimates, particularly for groups <em>1</em> and <em>2</em>. (The code to do this simulation is shown in the addendum.)</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-4-1.png" width="288" /></p>
</div>
<div id="proportional-assumption-fulfilled" class="section level3">
<h3>Proportional assumption fulfilled</h3>
<p>Here the data generation process is modified so that the proportionality assumption is incorporated.</p>
<pre class="r"><code>def <- defData(varname = "grp", formula = ".3;.5;.2", 
  dist = "categorical")
def <- defData(def, varname = "z", formula = "1*I(grp==2) + 2*I(grp==3)", 
  dist = "nonrandom")

baseprobs <- c(0.7, 0.2, 0.1)

dx <- genData(180, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- genOrdCat(dx, adjVar = "z", baseprobs, catVar = "rating")
dx <- genFactor(dx, "rating", replace = TRUE, 
  labels = c("none", "a little", "a lot"))</code></pre>
<p>This is what proportional odds looks like - there are no obvious bulges, just a general shift rightward as we move from group <em>1</em> to <em>3</em>:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>When we fit the proportional model and compare it to the saturated model, we see no reason to reject the assumption of proportionality (based on either the AIC or LR statistics).</p>
<pre class="r"><code>clmFit.prop <- clm(frating ~ fgrp, data = dx)
summary(clmFit.prop)</code></pre>
<pre><code>## formula: frating ~ fgrp
## data:    dx
## 
##  link  threshold nobs logLik  AIC    niter max.grad cond.H 
##  logit flexible  180  -176.89 361.77 4(0)  3.04e-09 2.7e+01
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## fgrp2    1.329      0.359    3.70  0.00022 ***
## fgrp3    2.619      0.457    5.73    1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## none|a little     0.766      0.299    2.56
## a little|a lot    2.346      0.342    6.86</code></pre>
<pre class="r"><code>clmFit.sat <- clm(frating ~ 1, nominal = ~ fgrp, data = dx)
anova(clmFit.prop, clmFit.sat)</code></pre>
<pre><code>## Likelihood ratio tests of cumulative link models:
## 
##             formula:       nominal: link: threshold:
## clmFit.prop frating ~ fgrp ~1       logit flexible  
## clmFit.sat  frating ~ 1    ~fgrp    logit flexible  
## 
##             no.par AIC logLik LR.stat df Pr(>Chisq)
## clmFit.prop      4 362   -177                      
## clmFit.sat       6 365   -177    0.56  2       0.75</code></pre>
<p>And here is a plot summarizing a second set of 1000 iterations, this one using the proportional odds assumption. The estimates appear to be unbiased:</p>
<p><img src="https://www.rdatagen.net/post/2020-03-03-to-report-a-p-value-or-not-the-case-of-a-contingency-table.en_files/figure-html/unnamed-chunk-8-1.png" width="288" /></p>
<p>I suspect that in many instances, Likert-type responses will look more like the second case than the first case, so that the cumulative proportional odds model could very well be useful in characterizing the association between group and response. Even if the assumption is not reasonable, the bias might not be terrible, and the estimate might still be useful as a measure of association. However, we might prefer a measure that is free of any assumptions, such as <em>Cramer’s V</em>. I’ll talk about that next time.</p>
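<p>As a preview, <em>Cramér’s V</em> is computed directly from the chi-square statistic of the observed contingency table. Here is a minimal base <code>R</code> sketch (the function name and the table of counts are mine, just for illustration):</p>
<pre class="r"><code># Cramer's V = sqrt(X2 / (N * (min(rows, cols) - 1)))
cramersV <- function(tab) {
  X2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(X2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# a hypothetical 3 group x 3 rating table of counts
tab <- matrix(c(35, 10,  5,
                 5, 30, 15,
                 3, 12, 21), nrow = 3, byrow = TRUE)
cramersV(tab)</code></pre>
<p>The measure ranges from 0 (independence) to 1 (perfect association), with no proportionality assumption required.</p>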
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Ronald L. Wasserstein & Nicole A. Lazar (2016) The ASA Statement on p-Values: Context, Process, and Purpose, The American Statistician, 70:2, 129-133.</p>
Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019) Moving to a World Beyond “p < 0.05”, The American Statistician, 73:sup1, 1-19.
</font></small>
</p>
<p> </p>
</div>
<div id="addendum-code-for-replicated-analysis" class="section level2">
<h2>Addendum: code for replicated analysis</h2>
<pre class="r"><code>library(parallel)
RNGkind("L'Ecuyer-CMRG") # to set seed for parallel process
dat.nonprop <- function(iter, n) {
dx <- genData(n, def)
dx <- genFactor(dx, "grp", replace = TRUE)
dx <- addCondition(defc, dx, "rating")
dx <- genFactor(dx, "rating", replace = TRUE,
labels = c("none", "a little", "a lot")
)
clmFit <- clm(frating ~ fgrp, data = dx)
dprob.obs <- data.table(iter,
prop.table(dx[, table(fgrp, frating)], margin = 1))
setkey(dprob.obs, fgrp, frating)
setnames(dprob.obs, "N", "p.obs")
dprob.mod <- data.table(iter, fgrp = levels(dx$fgrp),
predict(clmFit, newdata = data.frame(fgrp = levels(dx$fgrp)))$fit)
dprob.mod <- melt(dprob.mod, id.vars = c("iter", "fgrp"),
variable.name = "frating", value.name = "N")
setkey(dprob.mod, fgrp, frating)
setnames(dprob.mod, "N", "p.fit")
dprob <- dprob.mod[dprob.obs]
dprob[, frating := factor(frating,
levels=c("none", "a little", "a lot"))]
dprob[]
}
def <- defData(varname = "grp", formula = ".3;.5;.2",
dist = "categorical")
defc <- defCondition(condition = "fgrp == 1",
formula = "0.7;0.2;0.1", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 2",
formula = "0.1;0.6;0.3", dist = "categorical")
defc <- defCondition(defc, condition = "fgrp == 3",
formula = "0.05;0.25;0.70", dist = "categorical")
res.nonp <- rbindlist(mclapply(1:1000,
function(iter) dat.nonprop(iter,180)))
sum.nonp <- res.nonp[, .(mfit = mean(p.fit), sfit = sd(p.fit),
mobs = mean(p.obs), sobs = sd(p.obs)),
keyby = .(fgrp, frating)]
sum.nonp[, `:=`(lsd = mfit - sfit, usd = mfit + sfit)]
ggplot(data = sum.nonp, aes(x = frating, y = mobs)) +
geom_line(aes(group = fgrp), color = "grey60") +
geom_errorbar(aes(ymin = lsd, ymax = usd, color = fgrp),
width = 0) +
geom_point(aes(y = mfit, color = fgrp)) +
theme(panel.grid = element_blank(),
legend.position = "none",
axis.title.x = element_blank()) +
facet_grid(fgrp ~ .) +
scale_y_continuous(limits = c(0, 0.85), name = "probability") +
scale_color_manual(values = c("#B62A3D", "#EDCB64", "#B5966D"))</code></pre>
</div>
Clustered randomized trials and the design effect
https://www.rdatagen.net/post/what-exactly-is-the-design-effect/
Tue, 18 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-exactly-is-the-design-effect/
<script src="https://www.rdatagen.net/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I am always saying that simulation can help illuminate interesting statistical concepts or ideas. Such an exploration might provide some insight into the concept of the <em>design effect</em>, which underlies cluster randomized trial designs. I’ve written about cluster-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect.</p>
<p>When individual outcomes in a group are correlated, we learn less about the group from adding a new individual than we might think. Take an extreme example where every individual in a group is perfectly correlated with all the others: we will learn nothing new about the group by adding someone new. In fact, we might as well just look at a single member, since she is identical to all the others. The design effect is a value that in a sense quantifies how much information we lose (or, surprisingly, possibly gain) by this interdependence.</p>
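<p>One way to make this concrete is the notion of an <em>effective sample size</em>: with clusters of size <span class="math inline">\(m\)</span> and intra-cluster correlation <span class="math inline">\(\rho\)</span>, <span class="math inline">\(N\)</span> observations carry the information of only <span class="math inline">\(N/(1 + (m-1)\rho)\)</span> independent observations. A quick illustration in base <code>R</code> (the function is just a sketch of that formula):</p>
<pre class="r"><code># effective number of independent observations under clustering
effN <- function(N, m, rho) N / (1 + (m - 1) * rho)

effN(N = 900, m = 18, rho = 0)    # no correlation: all 900 count
effN(N = 900, m = 18, rho = 0.05) # modest ICC: worth ~487 subjects
effN(N = 900, m = 18, rho = 1)    # perfect correlation: 1 per cluster, 50 total</code></pre>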
<p>Let’s just jump right into it.</p>
<div id="the-context" class="section level3">
<h3>The context</h3>
<p>Imagine a scenario where an underlying population of interest is structurally defined by a group of clusters. The classic case is students in schools or classrooms. I don’t really do any school-based education research (I learned from debating my teacher-wife that it is a dangerous area to tread), but this example seems so clear. (The ideas in this post were, in part, motivated by my involvement with the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>, which focuses at the opposite end of life, seeking to improve care and quality of life for people living with advanced dementia and their caregivers through research and pragmatic clinical trials.) We might be interested in measuring the effect of some intervention (it may or may not take place in school) on an educational attainment outcome of high school-aged kids in a city (I am assuming a continuous outcome here just because it is so much easier to visualize). It does not seem crazy to think that the outcomes of kids from the same school might be correlated, either because the school itself does such a good (or poor) job of teaching or because similar types of kids tend to go to the same school.</p>
</div>
<div id="the-unit-of-randomization" class="section level3">
<h3>The unit of randomization</h3>
<p>We have at least three ways to design our study. We could just recruit kids out and about in the city and randomize them each individually to intervention or control. In the second approach, we decide that it is easier to randomize the schools to intervention or control - and recruit kids from each of the schools. This means that <em>all</em> kids from one school will be in the same intervention arm. And for the third option, we can go halfway: we go to each school and recruit kids, randomizing half of the kids in each school to control, and the other half to the intervention. This last option assumes that we could ensure that the kids in the school exposed to the intervention would not influence their unexposed friends.</p>
<p>In all three cases the underlying assumptions are the same - there is a school effect on the outcome, an individual effect, and an intervention effect. But it turns out that the variability of the intervention effect depends entirely on how we randomize. And since variability of the outcome affects sample size, each approach has implications for sample size. (I’ll point you to a book by <a href="https://books.google.com/books/about/Design_and_Analysis_of_Cluster_Randomiza.html?id=QJZrQgAACAAJ&source=kp_cover">Donner & Klar</a>, which gives a comprehensive and comprehensible overview of cluster randomized trials.)</p>
</div>
<div id="simulation-of-each-design" class="section level3">
<h3>Simulation of each design</h3>
<p>Just to be clear about these different randomization designs, I’ll simulate 1500 students using each. I’ve set a seed in case you’d like to recreate the results shown here (and indicate the libraries I am using).</p>
<pre class="r"><code>library(simstudy)
library(data.table)
library(ggplot2)
library(clusterPower)
library(parallel)
library(lmerTest)
RNGkind("L'Ecuyer-CMRG") # enables seed for parallel process
set.seed(987)</code></pre>
<div id="randomization-by-student" class="section level4">
<h4>Randomization by student</h4>
<p>I’ve written a function for each of the three designs to generate the data, because later I am going to need to generate multiple iterations of each design. In the first case, randomization is applied to the full group of students:</p>
<pre class="r"><code>independentData <- function(N, d1) {
di <- genData(N)
di <- trtAssign(di, grpName = "rx")
di <- addColumns(d1, di)
di[]
}</code></pre>
<p>The outcome is a function of intervention status and a combined effect of the student’s school and the student herself. We cannot disentangle the variance components, because we do not know the identity of the school:</p>
<pre class="r"><code>defI1 <- defDataAdd(varname = "y", formula = "0.8 * rx",
variance = 10, dist = "normal")
dx <- independentData(N = 30 * 50, defI1)</code></pre>
<p>The observed effect size and variance should be close to the specified parameters of 0.8 and 10, respectively:</p>
<pre class="r"><code>dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.597</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.2</code></pre>
<p>Here is a plot of the individual observations that highlights the group differences and individual variation:</p>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-6-1.png" width="576" /></p>
</div>
<div id="randomization-by-site" class="section level4">
<h4>Randomization by site</h4>
<p>Next, the intervention status is assigned to each of the <span class="math inline">\(k\)</span> schools/clusters before generating <span class="math inline">\(m\)</span> students per cluster. In this case, the outcome (defined by <code>defI2</code>) is a function of the cluster effect, individual effect, and the intervention status. Note here, the variance components are disentangled, but together they sum to 10, suggesting that total variance should be the same as the first scenario:</p>
<pre class="r"><code>clusteredData <- function(k, m, d1, d2) {
dc <- genData(k, d1)
dc <- trtAssign(dc, grpName = "rx")
di <- genCluster(dc, "site", m, level1ID = "id")
di <- addColumns(d2, di)
di[]
}</code></pre>
<pre class="r"><code>defC <- defData(varname = "ceff", formula = 0,
variance = 0.5, id = "site", dist = "normal")
defI2 <- defDataAdd(varname = "y", formula = "ceff + 0.8 * rx",
variance = 9.5, dist = "normal")</code></pre>
<pre class="r"><code>dx <- clusteredData(k = 30, m = 50, defC, defI2)</code></pre>
<p>The effect size and variation across all observations should be quite similar to the previous design, though now the data has a structure that is determined by the clusters:</p>
<pre class="r"><code>dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.203</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.5</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
<div id="randomization-within-site" class="section level4">
<h4>Randomization within site</h4>
<p>In the last design, the treatment assignment is made <em>after</em> both the clusters and individuals have been generated. Randomization within each site is specified using the <code>strata</code> argument:</p>
<pre class="r"><code>withinData <- function(k, m, d1, d2) {
dc <- genData(k, d1)
di <- genCluster(dc, "site", m, "id")
di <- trtAssign(di, strata="site", grpName = "rx")
di <- addColumns(d2, di)
di[]
}</code></pre>
<pre class="r"><code>dx <- withinData(30, 50, defC, defI2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]</code></pre>
<pre><code>## [1] 0.813</code></pre>
<pre class="r"><code>dx[, var(y)]</code></pre>
<pre><code>## [1] 10.1</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-02-18-what-exactly-is-the-design-effect.en_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
</div>
</div>
<div id="the-design-effect" class="section level3">
<h3>The design effect</h3>
<p>There’s a really nice paper by <a href="https://link.springer.com/article/10.1186/1471-2288-9-39">Vierron & Giraudeau</a> that describes many of the issues I am only touching on here. In particular, they define the design effect and then relate this definition to formulas that are frequently used to simplify the estimation of the design effect.</p>
<p>Consider the statistics <span class="math inline">\(\sigma^2_{\Delta_{bc}}\)</span> and <span class="math inline">\(\sigma^2_{\Delta_{i}}\)</span>, which are the variance of the effect sizes under the cluster randomization and the individual randomization designs, respectively:</p>
<p><span class="math display">\[\sigma^2_{\Delta_{bc}} = Var(\bar{Y}_1^{bc} - \bar{Y}_0^{bc})\]</span></p>
<p>and</p>
<p><span class="math display">\[\sigma^2_{\Delta_{i}} =Var(\bar{Y}_1^{i} - \bar{Y}_0^{i})\]</span></p>
<p>These variances are never observed, since they are based on a very large (really, an infinite) number of repeated experiments. However, the theoretical variances can be derived (as they are in the paper), and can be simulated (as they will be here). The design effect <span class="math inline">\(\delta_{bc}\)</span> is defined as</p>
<p><span class="math display">\[\delta_{bc} = \frac{\sigma^2_{\Delta_{bc}}}{\sigma^2_{\Delta_{i}}}\]</span></p>
<p>This ratio represents the adjustment in sample size required to make the two designs equivalent in the sense that they provide the same amount of information. This will hopefully become clear with the simulations below.</p>
<p>I have decided to use <span class="math inline">\(k = 50\)</span> clusters to ensure a large enough sample size to estimate the proper variance. I need to know how many individuals per cluster are required for 80% power in the cluster randomized design, given the effect size and variance assumptions I’ve been using here. I’ll use the <code>clusterPower</code> package (which unfortunately defines the number of clusters in each arm as <span class="math inline">\(m\)</span>, so don’t let that confuse you). Based on this, we should have 18 students per school, for a total sample of 900 students:</p>
<pre class="r"><code>crtpwr.2mean(m = 50/2, d = 0.8, icc = 0.05, varw = 9.5)</code></pre>
<pre><code>## n
## 17.9</code></pre>
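<p>The <code>icc</code> argument here is implied by the variance components specified earlier: the ICC is the share of total variance attributable to the cluster effect. A quick base <code>R</code> check (the variable names are mine):</p>
<pre class="r"><code># ICC implied by the simulation's variance components
s2.between <- 0.5  # cluster-level variance (defC)
s2.within  <- 9.5  # individual-level variance (defI2)
s2.between / (s2.between + s2.within)</code></pre>
<pre><code>## [1] 0.05</code></pre>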
<p>Now, I am ready to generate effect sizes for each of 2000 iterations of the experiment assuming randomization by cluster. With this collection of effect sizes in hand, I will be able to estimate their variance:</p>
<pre class="r"><code>genDifFromClust <- function(k, m, d1, d2) {
dx <- clusteredData(k, m, d1, d2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
niters <- 2000
resC <- unlist(mclapply(1:niters,
  function(x) genDifFromClust(k = 50, m = 18, defC, defI2)))</code></pre>
<p>Here is an estimate of <span class="math inline">\(\sigma^2_{\Delta_{bc}}\)</span> based on the repeated experiments:</p>
<pre class="r"><code>(s2.D_bc <- var(resC))</code></pre>
<pre><code>## [1] 0.0818</code></pre>
<p>And here is the estimate of <span class="math inline">\(\sigma^2_{\Delta_{i}}\)</span> (the variance of the effect sizes based on individual-level randomization experiments):</p>
<pre class="r"><code>genDifFromInd <- function(N, d1) {
dx <- independentData(N, d1)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
resI <- unlist(mclapply(1:niters,
function(x) genDifFromInd(N = 50*18, defI1)))
(s2.D_i <- var(resI))</code></pre>
<pre><code>## [1] 0.0432</code></pre>
<p>So, now we can use these variance estimates to derive the estimate of the design effect <span class="math inline">\(\delta_{bc}\)</span>, which, based on the earlier definition, is:</p>
<pre class="r"><code>(d_bc <- s2.D_bc / s2.D_i)</code></pre>
<pre><code>## [1] 1.89</code></pre>
<p>The Vierron & Giraudeau paper derives a simple formula for the design effect assuming equal cluster sizes and an ICC <span class="math inline">\(\rho\)</span>. This (or some close variation, when cluster sizes are not equal) is quite commonly used:</p>
<p><span class="math display">\[\delta_{bc} = 1 + (m-1)*\rho\]</span></p>
<p>As the ICC increases, the design effect increases. Based on the parameters for <span class="math inline">\(m\)</span> and <span class="math inline">\(\rho\)</span> we have been using in these simulations (note that <span class="math inline">\(\rho = 0.5/(0.5+9.5) = 0.05\)</span>), the standard formula gives us this estimate of <span class="math inline">\(\delta_{bc.formula}\)</span> that is quite close to our experimental value:</p>
<pre class="r"><code>( d_bc_form <- 1 + (18-1) * (0.05) )</code></pre>
<pre><code>## [1] 1.85</code></pre>
<p> </p>
</div>
<div id="but-what-is-the-design-effect" class="section level3">
<h3>But what is the design effect?</h3>
<p>OK, finally, we can now see what the design effect actually represents. As before, we will generate repeated data sets; this time, we will estimate the treatment effect using an appropriate model. (In the case of the cluster randomization, this is a linear mixed effects model, and in the case of individual randomization, this is a linear regression model.) For each iteration, I am saving the p-value for the treatment effect parameter in the model. We expect close to 80% of the p-values to be lower than 0.05 (this is 80% power given a true treatment effect of 0.8).</p>
<p>First, here is the cluster randomized experiment and the estimate of power:</p>
<pre class="r"><code>genEstFromClust <- function(k, m, d1, d2) {
dx <- clusteredData(k, m, d1, d2)
summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}
resCest <- unlist(mclapply(1:niters,
function(x) genEstFromClust(k=50, m = 18, defC, defI2)))
mean(resCest < 0.05) # power</code></pre>
<pre><code>## [1] 0.778</code></pre>
<p>In just under 80% of the cases, we would have rejected the null.</p>
<p>And here is the estimated power under the individual randomization experiment, but with a twist. Since the design effect is 1.85, the cluster randomized experiment needs a sample size 1.85 times larger than an equivalent (individual-level) RCT to provide the same information, or to have equivalent power. So, in our simulations, we will use a reduced sample size for the individual RCT. Since we used 900 individuals in the CRT, we need only <span class="math inline">\(900/1.85 \approx 487\)</span> individuals in the RCT:</p>
<pre class="r"><code>( N.adj <- ceiling( 50 * 18 / d_bc_form ) )</code></pre>
<pre><code>## [1] 487</code></pre>
<pre class="r"><code>genEstFromInd <- function(N, d1) {
dx <- independentData(N, d1)
summary(lm(y ~ rx, data = dx))$coef["rx", 4]
}
resIest <- unlist(mclapply(1:niters,
function(x) genEstFromInd(N = N.adj, defI1)))</code></pre>
<p>The power for this second experiment is also quite close to 80%:</p>
<pre class="r"><code>mean(resIest < 0.05) # power</code></pre>
<pre><code>## [1] 0.794</code></pre>
</div>
<div id="within-cluster-randomization" class="section level3">
<h3>Within cluster randomization</h3>
<p>It is interesting to see what happens when we randomize within the cluster. I think there may be some confusion here, because I have seen folks incorrectly apply the standard formula for <span class="math inline">\(\delta_{bc}\)</span>, rather than this formula for <span class="math inline">\(\delta_{wc}\)</span> that is derived (again, under the assumption of equal cluster sizes) in the Vierron & Giraudeau paper as</p>
<p><span class="math display">\[ \delta_{wc} = 1- \rho\]</span></p>
<p>This implies that the sample size requirement actually declines as intra-cluster correlation increases! In this case, since <span class="math inline">\(\rho = 0.05\)</span>, the total sample size for the within-cluster randomization needs to be only 95% of the sample size for the individual RCT.</p>
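<p>Putting the two design effects together gives a concrete comparison of total sample sizes. Here is a small base <code>R</code> sketch using the values from this post (the variable names are mine):</p>
<pre class="r"><code>d_bc <- 1 + (18 - 1) * 0.05  # between-cluster design effect: 1.85
d_wc <- 1 - 0.05             # within-cluster design effect: 0.95

m_rct <- 18 / d_bc           # per-site equivalent for the individual RCT
m_wc <- round(m_rct * d_wc)  # 9 students per school

c(crt = 50 * 18, rct = ceiling(50 * m_rct), within = 50 * m_wc)
# crt: 900, rct: 487, within: 450</code></pre>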
<p>As before, let’s see if the simulated data confirms this design effect based on the definition</p>
<p><span class="math display">\[ \delta_{wc} = \frac{\sigma^2_{\Delta_{wc}}}{\sigma^2_{\Delta_{i}}}\]</span></p>
<pre class="r"><code>genDifFromWithin <- function(k, m, d1, d2) {
dx <- withinData(k, m, d1, d2)
dx[rx == 1, mean(y)] - dx[rx == 0, mean(y)]
}
resW <- unlist(mclapply(1:niters,
function(x) genDifFromWithin(k = 50, m = 18, defC, defI2)))
(s2.D_wc <- var(resW))</code></pre>
<pre><code>## [1] 0.0409</code></pre>
<p>The estimated design effect is quite close to the expected design effect of 0.95:</p>
<pre class="r"><code>(d_wc <- s2.D_wc / s2.D_i)</code></pre>
<pre><code>## [1] 0.947</code></pre>
<p>And to finish things off, we can estimate an adjusted cluster size based on the design effects: first reduce the cluster size <span class="math inline">\(m=18\)</span> for the cluster randomized trial by <span class="math inline">\(\delta_{bc.formula}\)</span> to derive the appropriate sample size for the RCT, and then adjust by <span class="math inline">\(\delta_{wc} = 0.95\)</span> to get the appropriate cluster size for the within-cluster randomization, which is about 9 students. This study will need only 450 students, fewer than the RCT:</p>
<pre class="r"><code>(m.adj <- round( (18 / d_bc_form) * 0.95, 0))</code></pre>
<pre><code>## [1] 9</code></pre>
<pre class="r"><code>genEstFromWithin <- function(k, m, d1, d2) {
dx <- withinData(k, m, d1, d2)
summary(lmerTest::lmer(y ~ rx + (1|site), data = dx))$coef["rx", 5]
}
resWest <- unlist(mclapply(1:niters,
function(x) genEstFromWithin(k = 50, m = ceiling(m.adj), defC, defI2)))
mean(resWest < 0.05)</code></pre>
<pre><code>## [1] 0.779</code></pre>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Donner, Allan, and Neil Klar. “Design and analysis of cluster randomization trials in health research.” New York (2010).</p>
<p>Vierron, Emilie, and Bruno Giraudeau. “Design effect in multicenter studies: gain or loss of power?.” BMC medical research methodology 9, no. 1 (2009): 39.</p>
<p>Support:</p>
This work was supported in part by the National Institute on Aging (NIA) of the National Institutes of Health under Award Number U54AG063546, which funds the NIA IMbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (<a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a>). The author, a member of the Design and Statistics Core, was the sole writer of this blog post and has no conflicts. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
</font></small>
</p>
</div>
Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes
https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/
Tue, 04 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/
<p>I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model. (If I have missed something obvious with respect to modeling options, please let me know.)</p>
<p>Two initiatives I am involved with, the HAS-QOL study and <a href="https://impactcollaboratory.org/">the IMPACT Collaboratory</a>, are focused on improving quality of care for people living with Alzheimer’s disease and other dementias. Both are examples where the stepped-wedge study design can be an important tool to evaluate interventions in a real-world context. In an earlier post, I <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">introduced</a> a particular variation of the stepped-wedge design which includes an open cohort. I provided simulations of the data generating process we are assuming for the analysis presented here. Elsewhere (<a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">here</a> and <a href="https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/">here</a>), I described Bayesian models that can be used to analyze data with more complicated variance patterns; all of those examples were based on continuous outcomes.</p>
<p>Here, I am extending and combining these ideas. This post walks through the data generation process and describes a Bayesian model that addresses the challenges posed by the open cohort stepped-wedge study design.</p>
<div id="the-model" class="section level3">
<h3>The model</h3>
<p>The process I use to simulate the data and then estimate the effects is based on a relatively straightforward logistic regression model with two random effects. To simplify things a bit, I intentionally make the assumption that there are no general time trends that affect the outcomes (though it would not be difficult to add them in). In the logistic model, the log-odds (or logit) of a binary outcome is a linear function of predictors and random effects:</p>
<p><span class="math display">\[ \text{logit}(P(y_{ict}=1)) = \beta_0 + \beta_1 I_{ct} + b_{ct} + b_i,\]</span>
where <span class="math inline">\(\text{logit}(P(y_{ict}=1))\)</span> is the log-odds for individual <span class="math inline">\(i\)</span> in cluster (or site) <span class="math inline">\(c\)</span> during time period <span class="math inline">\(t\)</span>, and <span class="math inline">\(I_{ct}\)</span> is a treatment indicator for cluster <span class="math inline">\(c\)</span> during period <span class="math inline">\(t\)</span>.</p>
<p>There are two random effects in this model. The first is a cluster-specific period random effect, <span class="math inline">\(b_{ct}\)</span>. For each cluster, there will actually be a vector of cluster effects <span class="math inline">\(\mathbf{b_c} = (b_{c0}, b_{c1},...,b_{c,T-1})\)</span>, where <span class="math inline">\(\mathbf{b_c}\sim MVN(\mathbf{0}, \sigma_{b_c}^2\mathbf{R})\)</span>, and <span class="math inline">\(\mathbf{R}\)</span> is</p>
<p><span class="math display">\[
\mathbf{R} =
\left(
\begin{matrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{matrix}
\right )
\]</span></p>
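<p>A matrix with this AR-1 pattern is easy to generate in base <code>R</code> with <code>outer</code>; here is a quick sketch (the function name is mine) with <span class="math inline">\(\rho = 0.5\)</span> and four periods:</p>
<pre class="r"><code># AR-1 correlation matrix: entry (s, t) equals rho^|s - t|
ar1 <- function(rho, nT) rho^abs(outer(0:(nT - 1), 0:(nT - 1), "-"))
ar1(0.5, 4)[1, ]  # first row: 1, rho, rho^2, rho^3</code></pre>
<pre><code>## [1] 1.000 0.500 0.250 0.125</code></pre>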
<p>The second random effect is the individual or patient-level random intercept <span class="math inline">\(b_i\)</span>, where <span class="math inline">\(b_i \sim N(0,\sigma_{b_i}^2)\)</span>. We could assume a more structured relationship for individual patients over time (such as a decaying correlation), but in this application, patients will not have sufficient measurements to properly estimate this.</p>
<p>In the model, <span class="math inline">\(\beta_0\)</span> has the interpretation of the log-odds for the outcome when the cluster is still in the control state and the cluster-period and individual effects are both 0. <span class="math inline">\(\beta_1\)</span> is the average treatment effect conditional on the random effects, and is reported as a log odds ratio.</p>
</div>
<div id="simulating-the-study-data" class="section level3">
<h3>Simulating the study data</h3>
<p>I am going to generate a single data set based on this model. If you want more explanation of the code, this <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">earlier post</a> provides the details. The only real difference here is that I am generating an outcome that is a function of cluster-period effects, individual effects, and treatment status.</p>
<div id="site-level-data" class="section level4">
<h4>Site level data</h4>
<p>There will be 24 sites followed for 12 periods (<span class="math inline">\(t=0\)</span> through <span class="math inline">\(t=11\)</span>), and the stepped-wedge design includes 6 waves of 4 sites in each wave. The first wave will start at <span class="math inline">\(t=4\)</span>, and a new wave will be added each period, so that the last wave starts at <span class="math inline">\(t=9\)</span>.</p>
<pre class="r"><code>library(simstudy)
dsite <- genData(24, id = "site")
dper <- addPeriods(dsite, nPeriods = 12, idvars = "site",
perName = "period")
dsw <- trtStepWedge(dper, "site", nWaves = 6, lenWaves = 1,
startPer = 4, perName = "period",
grpName = "Ict")</code></pre>
<p> </p>
</div>
<div id="correlated-site-level-effects" class="section level4">
<h4>Correlated site-level effects</h4>
<p>The average site-level effect is 0, the standard deviation of the site effects is <span class="math inline">\(\sigma_{b_c} = 0.3\)</span>, and the correlation coefficient that will determine between-period within-site correlation is <span class="math inline">\(\rho = 0.5\)</span>. The correlation structure is “AR-1”, which means the between-period correlation decays over time (see the definition of <span class="math inline">\(\mathbf{R}\)</span> above).</p>
<pre class="r"><code>siteDef <- defData(varname = "eff.mu", formula = 0,
dist = "nonrandom", id = "site")
siteDef <- defData(siteDef, varname = "eff.s2", formula = 0.3^2,
dist = "nonrandom")
dsw <- addColumns(siteDef, dsw)
dsw <- addCorGen(dsw, nvars = 12, idvar = "site", rho = 0.5,
corstr = "ar1", dist = "normal",
param1 = "eff.mu", param2 = "eff.s2",
cnames = "eff.st")
dsw <- dsw[, .(site, period, startTrt, Ict, eff.st)]</code></pre>
<p> </p>
</div>
<div id="patient-level-data" class="section level4">
<h4>Patient level data</h4>
<p>We are generating 20 patients per period for each site, so there will be a total of 5760 individuals (<span class="math inline">\(20\times24\times12\)</span>). The individual-level effect standard deviation is <span class="math inline">\(\sigma_{b_i} = 0.3\)</span>. Each of the patients will be followed until they die, which is a function of their health status over time, governed by the Markov process and its transition matrix defined below. (This was described in more detail in an <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">earlier post</a>.)</p>
<pre class="r"><code>dpat <- genCluster(dper, cLevelVar = "timeID",
numIndsVar = 20, level1ID = "id")
patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
dist = "categorical")
patDef <- defDataAdd(patDef, varname = "eff.p",
formula = 0, variance = 0.3^2)
dpat <- addColumns(patDef, dpat)
P <-t(matrix(c( 0.7, 0.2, 0.1, 0.0,
0.1, 0.3, 0.5, 0.1,
0.0, 0.1, 0.6, 0.3,
0.0, 0.0, 0.0, 1.0),
nrow = 4))
dpat <- addMarkov(dpat, transMat = P,
chainLen = 12, id = "id",
pername = "seq", start0lab = "S0",
trimvalue = 4)
dpat[, period := period + seq - 1]
dpat <- dpat[period < 12]</code></pre>
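<p>The logic of <code>addMarkov</code> can be sketched in plain base <code>R</code>: each patient steps through health states according to the rows of <span class="math inline">\(P\)</span>, and state 4 (death) is absorbing, so the chain is trimmed once it is reached. This is only a sketch of the underlying process (the function name is mine), not the <code>simstudy</code> implementation:</p>
<pre class="r"><code># simulate one patient's health-state chain; state 4 is absorbing
simChain <- function(P, start, len) {
  states <- integer(len)
  states[1] <- start
  for (t in 2:len) {
    states[t] <- sample(1:4, 1, prob = P[states[t - 1], ])
    if (states[t] == 4) return(states[1:t])  # trim at death
  }
  states
}

P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.5, 0.1,
                0.0, 0.1, 0.6, 0.3,
                0.0, 0.0, 0.0, 1.0), nrow = 4))

set.seed(123)
simChain(P, start = 1, len = 12)</code></pre>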
<p> </p>
</div>
<div id="individual-outcomes" class="section level4">
<h4>Individual outcomes</h4>
<p>In this last step, the binary outcome <span class="math inline">\(y_{ict}\)</span> is generated based on treatment status and random effects. In this case, the treatment lowers the probability of <span class="math inline">\(Y=1\)</span>.</p>
<pre class="r"><code>dx <- merge(dpat, dsw, by = c("site","period"))
setkey(dx, id, period)
outDef <- defDataAdd(varname = "y",
formula = "-0.5 - 0.8*Ict + eff.st + eff.p",
dist = "binary", link = "logit")
dx <- addColumns(outDef, dx)
dx <- dx[, .(site, period, id, Ict, y)]</code></pre>
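<p>With the logit link, the coefficients in <code>outDef</code> translate directly into probabilities. Setting the random effects to zero, a quick base <code>R</code> check of what the formula implies:</p>
<pre class="r"><code>plogis(-0.5)       # control: P(Y = 1) is about 0.38
plogis(-0.5 - 0.8) # treatment: P(Y = 1) is about 0.21
exp(-0.8)          # treatment odds ratio is about 0.45</code></pre>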
<p>Here are the site-level averages over time. The light blue indicates periods in which a site is still in the control condition, and the dark blue shows the transition to the intervention condition. The lines, which are grouped by wave starting period, show the proportion of <span class="math inline">\(Y=1\)</span> for each period. You should be able to see the slight drop following entry into treatment.</p>
<p><img src="https://www.rdatagen.net/img/post-opensw/siteplot.png" height="400" /></p>
</div>
</div>
<div id="estimating-the-treatment-effect-and-variance-components" class="section level3">
<h3>Estimating the treatment effect and variance components</h3>
<p>Because none of the maximum likelihood methods implemented in <code>R</code> or <code>SAS</code> could estimate this specific variance structure using a mixed effects logistic regression model, I am fitting a Bayesian model using <a href="http://mc-stan.org">RStan and Stan</a>, which requires a set of model definitions.</p>
<p>This model specification is actually quite similar to the model I estimated <a href="https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/">earlier</a>, except of course the outcome distribution is logistic rather than continuous. Another major change is the use of a <a href="https://mc-stan.org/docs/2_21/stan-users-guide/reparameterization-section.html">“non-centered” parameterization</a>, which actually reduced estimation times from hours to minutes (more precisely, about 12 hours to about 30 minutes). This reparameterization requires a Cholesky decomposition of the variance-covariance matrix <span class="math inline">\(\Sigma\)</span>. One additional limitation is that proper convergence of the MCMC chains seems to require a limited prior on <span class="math inline">\(\rho\)</span>, so that <span class="math inline">\(\rho \sim U(0,1)\)</span> rather than <span class="math inline">\(\rho \sim U(-1,1)\)</span>.</p>
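<p>To see what the non-centered parameterization is doing, here is a rough sketch (in Python, for illustration only; the dimensions and parameter values are arbitrary): build the covariance matrix <span class="math inline">\(\Sigma\)</span> with entries <span class="math inline">\(\sigma_S^2 \rho^{|j-k|}\)</span>, take its Cholesky factor <span class="math inline">\(L\)</span>, and map independent standard normal draws <span class="math inline">\(z\)</span> to correlated site effects via <span class="math inline">\(Lz\)</span>.</p>

```python
import math
import random

# Arbitrary illustrative values: 4 periods, sd 0.3, correlation 0.5.
T, sigma_S, rho = 4, 0.3, 0.5

# Sigma[j][k] = sigma_S^2 * rho^|j-k|, the auto-regressive structure.
Sigma = [[sigma_S**2 * rho**abs(j - k) for k in range(T)] for j in range(T)]

def cholesky(A):
    """Lower-triangular L with L L^T = A (plain-Python Cholesky)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

L = cholesky(Sigma)
rng = random.Random(7)
z = [rng.gauss(0, 1) for _ in range(T)]  # independent standard normals
ran_S = [sum(L[i][k] * z[k] for k in range(T)) for i in range(T)]  # correlated effects
```

<p>Because the sampler only has to explore the independent standard normal <code>z</code>'s rather than the correlated effects directly, the posterior geometry is much friendlier, which is where the speed-up comes from.</p>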
<p>This particular code needs to be saved externally, and I have created a file named <code>binary sw - ar ind effect - non-central.stan</code>. This file is subsequently referenced in the call to <code>RStan</code>.</p>
<pre class="stan"><code>data {
  int<lower=1> I;              // number of unique individuals
  int<lower=1> N;              // number of records
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of sites
  int<lower=0> T;              // number of periods
  int<lower=1,upper=I> ii[N];  // id for individual
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1,upper=T> tt[N];  // period of individual
  matrix[N, K] x;              // matrix of predictors
  int<lower=0,upper=1> y[N];   // vector of binary outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigma_S;       // site variance (sd)
  real<lower=0,upper=1> rho;   // correlation
  real<lower=0> sigma_I;       // individual level variance (sd)

  // non-centered parameterization
  vector[T] z_ran_S[J];        // site level random effects (by period)
  vector[I] z_ran_I;           // individual level random effects
}

transformed parameters {
  cov_matrix[T] Sigma;
  matrix[T, T] L;              // for non-centered parameterization
  vector[I] ran_I;             // individual level random effects
  vector[T] ran_S[J];          // site level random effects (by period)
  vector[N] yloghat;

  // random effects with auto-regressive correlation
  real sigma_S2 = sigma_S^2;

  for (j in 1:T)
    for (k in 1:T)
      Sigma[j,k] = sigma_S2 * pow(rho, abs(j-k));

  // for non-centered parameterization
  L = cholesky_decompose(Sigma);

  for (j in 1:J)
    ran_S[j] = L * z_ran_S[j];

  ran_I = sigma_I * z_ran_I;

  // defining mean on log-odds scale
  for (i in 1:N)
    yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];
}

model {
  sigma_I ~ exponential(0.25);
  sigma_S ~ exponential(0.25);
  rho ~ uniform(0, 1);

  for (j in 1:J) {
    z_ran_S[j] ~ std_normal();
  }
  z_ran_I ~ std_normal();

  y ~ bernoulli_logit(yloghat);
}</code></pre>
</div>
<div id="set-up-the-data-and-call-stan-from-r" class="section level3">
<h3>Set up the data and call stan from R</h3>
<p>Just for completeness, I am providing the code that shows the interface between <code>R</code> and <code>Stan</code> using <code>RStan</code>. The data needs to be sent to Stan as a list of data elements, which here is called <code>testdat</code>. For the estimation of the posterior probabilities, I am specifying 4 chains of 4000 iterations each, which includes 1000 warm-up iterations. I specified “adapt_delta = 0.90” to reduce the step-size a bit (default is 0.80); this slows things down a bit, but improves stability.</p>
<p>As I mentioned earlier, with this data set (and rather large number of effects to estimate), the running time is between 30 and 45 minutes. One of the downsides of this particular Bayesian approach is that it wouldn’t really be practical to do any kind of sample size estimate.</p>
<pre class="r"><code>x <- as.matrix(dx[ ,.(1, Ict)])
I <- dx[, length(unique(id))]
N <- nrow(x)
K <- ncol(x)
J <- dx[, length(unique(site))]
T <- dx[, length(unique(period))]
ii <- dx[, id]
jj <- dx[, site]
tt <- dx[, period] + 1
y <- dx[, y]
testdat <- list(I=I, N=N, K=K, J=J, T=T, ii=ii, jj=jj, tt=tt, x=x, y=y)
library(rstan)
options(mc.cores = parallel::detectCores())
rt <- stanc("binary sw - ar ind effect - non-central.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
fit.ar1 <- sampling(sm, data = testdat,
  iter = 4000, warmup = 1000,
  control = list(adapt_delta = 0.90,
                 max_treedepth = 15),
  chains = 4)</code></pre>
</div>
<div id="diagnostics" class="section level3">
<h3>Diagnostics</h3>
<p>After running the MCMC process to generate the probability distributions, the trace plots show that the mixing is quite adequate for the chains.</p>
<pre class="r"><code>plot(fit.ar1, plotfun = "trace", pars = pars,
  inc_warmup = FALSE, ncol = 1)</code></pre>
<p><img src="https://www.rdatagen.net/img/post-opensw/diagplot.png" height="600" /></p>
</div>
<div id="extracting-results" class="section level3">
<h3>Extracting results</h3>
<p>If we take a look at the posterior probability distributions, we can see that they contain the original values used to generate the data - so at least in this case, the model seems to recover the original data generation process quite well.</p>
<pre class="r"><code>pars <- c("beta", "sigma_S","sigma_I","rho")
summary(fit.ar1, pars = pars, probs = c(0.025, 0.975))$summary</code></pre>
<pre><code>## mean se_mean sd 2.5% 97.5% n_eff Rhat
## beta[1] -0.519 0.000687 0.0459 -0.609 -0.428 4470 1
## beta[2] -0.751 0.000844 0.0573 -0.864 -0.638 4618 1
## sigma_S 0.307 0.000394 0.0256 0.260 0.362 4223 1
## sigma_I 0.254 0.001548 0.0476 0.148 0.337 945 1
## rho 0.544 0.001594 0.0812 0.376 0.698 2599 1</code></pre>
<p><img src="https://www.rdatagen.net/img/post-opensw/postplot.png" height="500" /></p>
<p>One thing that is not working so well is my attempt to compare different models. For example, I might want to fit another model that does not assume the between-period correlations decay and compare it to the current model. Previously, I used the <code>bridgesampling</code> package for the comparisons, but it does not seem to be able to accommodate these models. I will continue to explore options for model comparison and will report back if I find something promising.</p>
<p>
<p><small><font color="darkkhaki"></p>
<p>This study is supported by the National Institutes of Health National Institute on Aging under award numbers R61AG061904 and U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</p>
</font></small>
</p>
</div>
A brief account (via simulation) of the ROC (and its AUC)
https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/
Tue, 21 Jan 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/
<p>The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC), quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.</p>
<div id="a-prediction-problem" class="section level2">
<h2>A prediction problem</h2>
<p>The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease. The population is comprised of two groups of individuals: those with the condition and those without. What we want is some sort of diagnostic tool (such as a blood test or diagnostic scan) that will identify which group a particular patient belongs to. The question is how well does that tool or measure help us distinguish between the two groups? The ROC (and AUC) is designed to help answer that question.</p>
<div id="true-and-false-predictions" class="section level3">
<h3>True and false predictions</h3>
<p>While we might not know group membership for an individual, we assume that they do indeed belong to one of the two groups. When we base a prediction of group membership on a test, we may or may not be right. There are four scenarios. It is possible that our prediction is (1) a true positive (the patient has the condition and that is what we predict), (2) a false positive (the patient does not have the condition, but we predict they do have it), (3) a false negative (the patient has the condition but we believe they are healthy), or (4) a true negative (the patient is healthy and that is our prediction). A “good” test is one that maximizes true positive predictions while minimizing false positive predictions.</p>
<p>We can actually only assess the quality of the test if we know the true group membership of the individuals. So, our plan is to take measurements on this known sample, make predictions based on the test, and see how our predictions match up to reality. The ROC is one way to characterize how well our test matches up to reality.</p>
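<p>The four scenarios can be tallied directly once we have test scores, true labels, and a threshold. This toy sketch (in Python, with made-up scores and labels) shows the bookkeeping:</p>

```python
# Hypothetical continuous test scores and true labels (1 = has the condition).
scores = [-6.1, -4.2, 0.8, -1.5, 3.9, 5.2, -0.3, 2.7]
truth = [0, 0, 0, 1, 1, 1, 0, 1]

def confusion(scores, truth, threshold=0.0):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(s > threshold and t == 1 for s, t in zip(scores, truth))
    fp = sum(s > threshold and t == 0 for s, t in zip(scores, truth))
    fn = sum(s <= threshold and t == 1 for s, t in zip(scores, truth))
    tn = sum(s <= threshold and t == 0 for s, t in zip(scores, truth))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(scores, truth)
```

<p>Every observation lands in exactly one of the four cells, and the ROC will summarize how the cell counts shift as the threshold moves.</p>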
</div>
<div id="binary-decision-informed-by-a-continuous-measure" class="section level3">
<h3>Binary decision informed by a continuous measure</h3>
<p>While we make a binary decision about group membership - either we think they have the condition or they do not - the underlying measure that is used to make that determination may be continuous, like a score. For example, a hypothetical test might return a score between -10 and 10. We can pick a threshold anywhere along the continuum that will form the basis of our prediction. For example, we might say that any score > 0 indicates the condition is present, otherwise it is not. This simple test will be useful as a tool to discriminate between the disease and non-disease groups if that threshold indeed distinguishes the groups.</p>
<p>This is probably best demonstrated with a simple simulation. The sample we will generate has 100 individuals, around 40% who have the condition in question. The average score for the non-disease group is set at -5, and the average score for the disease group is 5. Both have variance 3.5:</p>
<pre class="r"><code>library(simstudy)
# define data
defx <- defData(varname = "condition", formula = .4, dist = "binary")
defx <- defData(defx, "x", formula = "-5 + 10*condition",
  variance = 3.5, dist = "normal")
# generate data
set.seed(1873)
dx <- genData(100, defx)
head(dx)</code></pre>
<pre><code>## id condition x
## 1: 1 0 -5.83
## 2: 2 1 4.66
## 3: 3 1 4.23
## 4: 4 0 -3.87
## 5: 5 1 1.78
## 6: 6 0 -4.87</code></pre>
<p>Looking at the plot below, a threshold of zero appears to do an excellent job of distinguishing the groups. All of those with the condition (depicted in red) are above the threshold, whereas all of those without the condition (depicted in green) fall below the threshold:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-3-1.png" width="192" /></p>
</div>
<div id="the-world-is-not-always-so-neat-and-clean" class="section level3">
<h3>The world is not always so neat and clean</h3>
<p>Of course, we don’t usually have a measure or test that separates the groups so cleanly. Let’s say the average of the disease group is 2.5 and the non-disease group is -3. The threshold of zero still works pretty well, but it is not perfect. Some with the disease fall below the threshold (false negatives), and some without the disease lie above the threshold (false positives). In fact, only 87% of those with the disease are correctly identified (true positives), while 13% of those without the condition are incorrectly identified as having the disease (false positives).</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula = "-3 + 5.5*condition",
  newvariance = 6)
dx <- genData(100, defx)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-5-1.png" width="192" /></p>
</div>
</div>
<div id="generating-the-roc" class="section level2">
<h2>Generating the ROC</h2>
<p>Zero isn’t the only possible threshold we could use for the diagnostic test. We can lower the threshold to below zero to ensure that we have 100% true positives, but we will have to sacrifice by increasing the proportion of false positives. Likewise, we could reduce the proportion of false positives by increasing the threshold above zero, but would reduce the proportion of true positives in the process.</p>
<p>There are, in fact, an infinite number of possible thresholds. Here is a sequence of plots of the same data with a number of different thresholds ranging from 8 to -8. The percent of true positives is shown on the top and the percent of false positives is shown on the bottom:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-6-1.png" width="576" /></p>
<p>The ROC is really just a summarized version of this sequence of plots. The X-axis is the proportion of false positives at a particular threshold, and the Y-axis is the proportion of true positives. As we lower the threshold, we move from left to right. So, in the plot below, each point represents one of the sections above:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-7-1.png" width="432" /></p>
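<p>The construction itself is simple to sketch: sweep a threshold from high to low, and at each value record the pair (proportion of false positives, proportion of true positives). A toy version in Python (invented scores, and the same nine thresholds from 8 down to -8):</p>

```python
# Illustrative scores and labels (1 = has the condition).
scores = [-6.1, -4.2, 0.8, -1.5, 3.9, 5.2, -0.3, 2.7, 1.1, -2.0]
truth = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

def roc_points(scores, truth, thresholds):
    """Return (false positive rate, true positive rate) at each threshold."""
    pos = sum(truth)
    neg = len(truth) - pos
    pts = []
    for c in thresholds:
        tpr = sum(s > c for s, t in zip(scores, truth) if t == 1) / pos
        fpr = sum(s > c for s, t in zip(scores, truth) if t == 0) / neg
        pts.append((fpr, tpr))
    return pts

# Nine thresholds, swept from high (8) to low (-8).
pts = roc_points(scores, truth, thresholds=range(8, -9, -2))
```

<p>As the threshold drops, both rates can only grow, which is why the ROC always runs from the lower-left corner to the upper-right corner.</p>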
<p>The ROC above is built from only 9 thresholds. If we consider all possible thresholds (continuous between -10 and 10), this is the more complete curve:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-8-1.png" width="432" /></p>
<div id="area-under-the-roc" class="section level3">
<h3>Area under the ROC</h3>
<p>The AUC is, well, the area under the ROC. The maximum AUC will be 1 when there is complete separation (there is an example of this below), and the minimum is 0.5 (depicted by the diagonal line) when there is no separation by the test measure (again, an example will follow). We can estimate this area by integrating an approximate function defined by the data between 0 and 1.</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.957 with absolute error < 0.00011</code></pre>
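<p>The same area can be approximated with the trapezoid rule applied directly to the ROC points, which is essentially what integrating the interpolating function does. A sketch in Python (with toy ROC points rather than the simulated data above):</p>

```python
def trapezoid_auc(points):
    """Trapezoid-rule area under (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Illustrative ROC points only, not the post's estimated curve.
roc_pts = [(0.0, 0.0), (0.05, 0.60), (0.13, 0.87), (0.40, 0.97), (1.0, 1.0)]
auc = trapezoid_auc(roc_pts)
```

<p>A perfect curve that jumps straight to (0, 1) yields an area of 1, and the diagonal yields 0.5, matching the bounds described above.</p>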
<p>There is actually a meaningful interpretation of the AUC, which is described in a classic 1982 paper by <a href="https://pubs.rsna.org/doi/abs/10.1148/radiology.143.1.7063747">Hanley &amp; McNeil</a> (if you want a deeper understanding of the issues, this paper is not a bad place to start - there is, of course, a huge literature on the topic of ROCs). The AUC is actually equivalent to the probability that the test measure of a random draw from the diseased group will be greater than the test measure of a random draw from the healthy group. So, an AUC of 0.90 indicates that if we repeatedly draw one test measure from the disease group and one from the non-disease group, the measure from the disease group will be greater 90% of the time.</p>
<p>Here is a simple function that returns a value of <code>TRUE</code> if the random draw from the disease group is greater:</p>
<pre class="r"><code>randcomp <- function(ds) {
  ds[condition == 1, sample(x, 1)] > ds[condition == 0, sample(x, 1)]
}</code></pre>
<p>And here is the proportion of 1000 draws in which the measure from the disease group is greater (this is expected to be close to the AUC, which was estimated above to be 0.957):</p>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 0.958</code></pre>
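<p>Rather than sampling pairs, this interpretation can also be computed exactly: the AUC equals the proportion of all (disease, non-disease) pairs in which the disease measure is larger, counting ties as half (this quantity is the Wilcoxon-Mann-Whitney statistic scaled to [0, 1]). A sketch in Python with made-up scores:</p>

```python
# Made-up test scores for the two groups.
disease = [2.1, 3.5, 0.4, 4.2, 1.8]
healthy = [-1.2, 0.9, -3.4, 0.4, -0.7]

def auc_pairwise(disease, healthy):
    """Exact AUC: fraction of pairs where the disease score wins (ties = 0.5)."""
    wins = sum((d > h) + 0.5 * (d == h) for d in disease for h in healthy)
    return wins / (len(disease) * len(healthy))

auc = auc_pairwise(disease, healthy)
```

<p>With all pairs enumerated there is no Monte Carlo noise, so this agrees with the sampling approach in the limit.</p>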
<p>Of course, <code>R</code> has several packages that provide ROCs and calculate AUCs. I’m using package <a href="http://www.biomedcentral.com/1471-2105/12/77/">pROC</a> here just to show you that my AUC estimate is not totally crazy:</p>
<pre class="r"><code>library(pROC)
roc_obj <- roc(response = dx$condition, predictor = dx$x)
auc(roc_obj)</code></pre>
<pre><code>## Area under the curve: 0.958</code></pre>
</div>
</div>
<div id="alternative-scenarios" class="section level2">
<h2>Alternative scenarios</h2>
<p>As I indicated above, the AUC can generally range from 0.5 to 1.0. There is no hard and fast rule about what is a “good” AUC - it will depend on the application. Certainly, anything below 0.7 or maybe even 0.8 is pretty weak. I am going to conclude by generating data at the two extremes.</p>
<div id="minimal-separation" class="section level3">
<h3>Minimal separation</h3>
<p>When the test measure for each group is equally distributed, there is unlikely to be any threshold for which the proportion of true positives exceeds the proportion of false positives. If this is the case, we should probably look for another test measure - or be prepared to make a lot of mistakes in the non-disease group.</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula = "0 + 0*condition",
  newvariance = 8)
dx <- genData(100, defx)</code></pre>
<p>As we move the threshold lower, both the proportion of true positives and false positives steadily increase:</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-14-1.png" width="576" /></p>
<p>As a result, the ROC hangs fairly close to the diagonal lower bound.</p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-15-1.png" width="432" /></p>
<p>We would expect the AUC to be fairly close to 0.5, which it is:</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.623 with absolute error < 4.5e-05</code></pre>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 0.613</code></pre>
</div>
<div id="complete-separation" class="section level3">
<h3>Complete separation</h3>
<p>At the other extreme, the mean of the disease group is high enough so that there is no overlap between the two groups. In this case, the curve follows along the Y-axis before running across the top of the plot. We can achieve 100% true positives and no false positives if the threshold is set at some point that is below the minimum of the disease group and above the maximum of the non-disease group. Zero will be the ideal cut-off point for this example.</p>
<pre class="r"><code>defx <- updateDef(defx, changevar = "x", newformula = "-4 + 8*condition",
  newvariance = 3.5)
dx <- genData(100, defx)</code></pre>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-19-1.png" width="576" /></p>
<p><img src="https://www.rdatagen.net/post/2020-01-21-a-simple-explanation-of-what-the-roc-and-auc-represent.en_files/figure-html/unnamed-chunk-20-1.png" width="432" /></p>
<p>As expected, the AUC is essentially equal to 1:</p>
<pre class="r"><code>f <- approxfun(x = roc$false.pos, y=roc$true.pos)
integrate(f, lower = 0, upper = 1)</code></pre>
<pre><code>## 0.996 with absolute error < 9.2e-05</code></pre>
<pre class="r"><code>mean(sapply(1:1000, function(x) randcomp(dx)))</code></pre>
<pre><code>## [1] 1</code></pre>
</div>
</div>
<div id="logistic-regression-and-the-roc" class="section level2">
<h2>Logistic regression and the ROC</h2>
<p>Just a quick note to conclude. The ROC is often used in conjunction with classification problems based on logistic regression modeling. Here, we may not have a single underlying test measure, but rather multiple predictors or measures. The group assignment decision then needs to be based on a summary of these multiple measures; one logical candidate is the individual’s predicted probability estimated by the model.</p>
<p>If the specified logistic regression model provides good separation between the two groups, the predicted probabilities will be quite different for each group (higher AUC). However, if the model is not a strong classifier, the predicted probabilities for the two groups will be much closer together (lower AUC).</p>
<p> </p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Hanley, J.A. and McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), pp.29-36.</p>
Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77.
</font></small>
</p>
</div>
Repeated measures can improve estimation when we only care about a single endpoint
https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/
Tue, 10 Dec 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/
<p>I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.</p>
<p>There <em>is</em> one issue that should be fairly generalizable to other studies. It is likely that individual measures will be collected repeatedly over time, but the primary outcome of interest will be the measure collected during the last follow-up period. I wanted to explore what, if anything, can be gained by analyzing all of the available data rather than focusing only on the final endpoint.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>In this simulation scenario, there will be 200 subjects randomized at the individual level to one of two treatment arms, intervention (<span class="math inline">\(rx = 1\)</span>) and control (<span class="math inline">\(rx = 0\)</span>). Each person will be followed for 5 months, with a binary outcome measure collected at the end of each month. In the data, period 0 is the first month, and period 4 is the final month.</p>
<pre class="r"><code>library(simstudy)
set.seed(281726)
dx <- genData(200)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = 5)</code></pre>
<p>Here are the data for a single individual:</p>
<pre class="r"><code>dx[id == 142]</code></pre>
<pre><code>## id period rx timeID
## 1: 142 0 1 706
## 2: 142 1 1 707
## 3: 142 2 1 708
## 4: 142 3 1 709
## 5: 142 4 1 710</code></pre>
<p>The probabilities of the five binary outcomes for each individual are a function of time and intervention status.</p>
<pre class="r"><code>defP <- defDataAdd(varname = "p",
  formula = "-2 + 0.2*period + 0.5*rx",
  dist = "nonrandom", link = "logit")

dx <- addColumns(defP, dx)</code></pre>
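<p>The <code>link = "logit"</code> specification means each probability is the inverse-logit (expit) of the linear predictor. A quick check (in Python, for illustration) reproduces the <code>p</code> values shown below for subject 142, who is in the intervention arm (<span class="math inline">\(rx = 1\)</span>):</p>

```python
import math

def expit(x):
    """Inverse-logit: map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# Probabilities by arm and period, from the linear predictor above.
probs = {rx: [expit(-2 + 0.2 * period + 0.5 * rx) for period in range(5)]
         for rx in (0, 1)}
```

<p>For <code>rx = 1</code> this gives 0.18, 0.21, 0.25, 0.29, and 0.33 across the five periods: the probability of a successful outcome rises over time and is higher in the intervention arm at every period.</p>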
<p>The outcomes for a particular individual are correlated, with outcomes in two adjacent periods more highly correlated than outcomes collected further apart. (I use an auto-regressive correlation structure to generate these data.)</p>
<pre class="r"><code>dx <- addCorGen(dtOld = dx, idvar = "id", nvars = 5, rho = 0.6,
  corstr = "ar1", dist = "binary", param1 = "p",
  method = "ep", formSpec = "-2 + 0.2*period + 0.5*rx",
  cnames = "y")
dx[id == 142]</code></pre>
<pre><code>## id period rx timeID p y
## 1: 142 0 1 706 0.18 0
## 2: 142 1 1 707 0.21 0
## 3: 142 2 1 708 0.25 1
## 4: 142 3 1 709 0.29 0
## 5: 142 4 1 710 0.33 0</code></pre>
<p>In the real world, there will be loss to follow-up - not everyone will be observed until the end. In the first case, I will be assuming the data are missing completely at random (MCAR), where missingness is independent of all observed and unobserved variables. (I have <a href="https://www.rdatagen.net/post/musings-on-missing-data/">mused on missingness</a> before.)</p>
<pre class="r"><code>MCAR <- defMiss(varname = "y", formula = "-2.6",
  logit.link = TRUE, monotonic = TRUE)
dm <- genMiss(dx, MCAR, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")
dObs[id == 142]</code></pre>
<pre><code>## id period rx timeID p y
## 1: 142 0 1 706 0.18 0
## 2: 142 1 1 707 0.21 0
## 3: 142 2 1 708 0.25 1
## 4: 142 3 1 709 0.29 NA
## 5: 142 4 1 710 0.33 NA</code></pre>
<p>In this data set only about 70% of the total sample is observed - though by chance there is different dropout for each of the treatment arms:</p>
<pre class="r"><code>dObs[period == 4, .(prop.missing = mean(is.na(y))), keyby = rx]</code></pre>
<pre><code>## rx prop.missing
## 1: 0 0.28
## 2: 1 0.38</code></pre>
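<p>The roughly 70% retention is about what this MCAR definition implies: each subject still under observation drops out in a given period with probability <span class="math inline">\(\text{expit}(-2.6) \approx 0.07\)</span> and, because <code>monotonic = TRUE</code>, stays missing thereafter, so the chance of completing all five periods is roughly <span class="math inline">\((1 - 0.07)^5 \approx 0.70\)</span>. A quick Monte Carlo sketch (in Python, with an arbitrary seed, for illustration):</p>

```python
import math
import random

def expit(x):
    return 1 / (1 + math.exp(-x))

def first_missing_period(n_periods, p_miss, rng):
    """Period of dropout under monotone missingness (n_periods = never missing)."""
    for t in range(n_periods):
        if rng.random() < p_miss:
            return t
    return n_periods

rng = random.Random(0)
p_miss = expit(-2.6)
drop = [first_missing_period(5, p_miss, rng) for _ in range(10000)]
frac_complete = sum(d == 5 for d in drop) / len(drop)  # close to (1 - p_miss)^5
```

<p>The arm-to-arm difference in dropout seen above is just sampling noise, since under MCAR the missingness probability does not depend on treatment.</p>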
</div>
<div id="estimating-the-intervention-effect" class="section level3">
<h3>Estimating the intervention effect</h3>
<p>If we are really only interested in the probability of a successful outcome in the final period, we could go ahead and estimate the treatment effect using a simple logistic regression using individuals who were available at the end of the study. The true value is 0.5 (on the logistic scale), and the estimate here is close to 1.0 with a standard error just under 0.4:</p>
<pre class="r"><code>fit.l <- glm(y ~ rx, data = dObs[period == 4], family = binomial)
coef(summary(fit.l))</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.25 0.28 -4.4 9.9e-06
## rx 0.99 0.38 2.6 9.3e-03</code></pre>
<p>But, can we do better? Fitting a longitudinal model might provide a more stable and possibly less biased estimate, particularly if the specified model is the correct one. In this case, I suspect it will be an improvement, since the data was generated using a process that is amenable to a GEE (generalized estimating equation) model.</p>
<pre class="r"><code>library(geepack)
fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
  data = dObs, corstr = "ar1")
coef(summary(fit.m))</code></pre>
<pre><code>## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.33 0.259 81 0.00000
## period 0.30 0.072 17 0.00003
## rx 0.83 0.263 10 0.00152</code></pre>
<p>And finally, it is reasonable to expect that a model based on a data set without any missing values would provide the most efficient estimate. And that does seem to be the case if we look at the standard error of the effect estimate.</p>
<pre class="r"><code>fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
  data = dx, corstr = "ar1")
coef(summary(fit.f))</code></pre>
<pre><code>## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.15 0.227 89.2 0.0e+00
## period 0.30 0.062 23.1 1.5e-06
## rx 0.54 0.233 5.4 2.1e-02</code></pre>
<p>Of course, we can’t really learn much of anything from a single simulated data set. Below is a plot of the mean estimate under each modeling scenario (along with the blue line that represents <span class="math inline">\(\pm 2\)</span> <em>sd</em>) based on 2500 simulated data sets with missingness completely at random. (The code for these replications is included in the addendum.)</p>
<p>It is readily apparent that under an assumption of MCAR, all estimation models yield unbiased estimates (the true effect size is 0.5), though using the last period only is inherently more variable (given that there are fewer observations to work with).</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-11-1.png" width="384" /></p>
</div>
<div id="missing-at-random" class="section level3">
<h3>Missing at random</h3>
<p>When the data are MAR (missing at random), using the last period only no longer provides an unbiased estimate of the effect size. In this case, the probability of missingness is a function of time, intervention status, and the outcome from the prior period, all of which are observed. This is how I’ve defined the MAR process:</p>
<pre class="r"><code>MAR <- defMiss(varname = "y",
  formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
  logit.link = TRUE, monotonic = TRUE)</code></pre>
<p>The mean plots based on 2500 iterations reveal the bias of the last period only. It is interesting to see that the GEE model is <em>not</em> biased, because we have captured all of the relevant covariates in the model. (It is well known that a likelihood method can yield unbiased estimates in the case of MAR, and while GEE is not technically a likelihood, it is a <em>quasi</em>-likelihood method.)</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-13-1.png" width="384" /></p>
</div>
<div id="missing-not-at-random" class="section level3">
<h3>Missing not at random</h3>
<p>When missingness depends on unobserved data, such as the outcome itself, then GEE estimates are also biased. For the last set of simulations, I defined missingness of <span class="math inline">\(y\)</span> in any particular time period to be a function of itself. Specifically, if the outcome was successful and the subject was in the intervention, the subject would be more likely to be observed:</p>
<pre class="r"><code>NMAR <- defMiss(varname = "y",
  formula = "-2.9 + 0.2*period - 2*rx*y",
  logit.link = TRUE, monotonic = TRUE)</code></pre>
<p>Under the assumption of missingness not at random (NMAR), both estimation approaches based on the observed data set with missing values yield a biased estimate, though using all of the data appears to reduce the bias somewhat:</p>
<p><img src="https://www.rdatagen.net/post/2019-12-10-using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus.en_files/figure-html/unnamed-chunk-15-1.png" width="384" /></p>
</div>
<div id="addendum-generating-replications" class="section level3">
<h3>Addendum: generating replications</h3>
<pre class="r"><code>iter <- function(n, np, defM) {
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addPeriods(dx, nPeriods = np)

  defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + .5*rx",
    dist = "nonrandom", link = "logit")
  dx <- addColumns(defP, dx)

  dx <- addCorGen(dtOld = dx, idvar = "id", nvars = np, rho = .6,
    corstr = "ar1", dist = "binary", param1 = "p",
    method = "ep", formSpec = "-2 + 0.2*period + .5*rx",
    cnames = "y")

  dm <- genMiss(dx, defM, "id", repeated = TRUE, periodvar = "period")
  dObs <- genObs(dx, dm, idvars = "id")

  fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
    data = dx, corstr = "ar1")
  fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
    data = dObs, corstr = "ar1")
  fit.l <- glm(y ~ rx, data = dObs[period == (np - 1)], family = binomial)

  return(data.table(full = coef(fit.f)["rx"],
                    miss = coef(fit.m)["rx"],
                    last = coef(fit.l)["rx"]))
}

## missingness definitions (defM)
MCAR <- defMiss(varname = "y", formula = "-2.6",
  logit.link = TRUE, monotonic = TRUE)

MAR <- defMiss(varname = "y",
  formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
  logit.link = TRUE, monotonic = TRUE)

NMAR <- defMiss(varname = "y",
  formula = "-2.9 + 0.2*period - 2*rx*y",
  logit.link = TRUE, monotonic = TRUE)

## replications
library(parallel)

niter <- 2500

resMCAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MCAR)))
resMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MAR)))
resNMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, NMAR)))</code></pre>
</div>
Adding a "mixture" distribution to the simstudy package
https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/
Tue, 26 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/
<p>I am contemplating adding a new distribution option to the package <code>simstudy</code> that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new <em>mixture</em> option is to step through a few examples and see it in action.</p>
<div id="specifying-the-mixture-distribution" class="section level3">
<h3>Specifying the “mixture” distribution</h3>
<p>As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities. For example, if we have two variables, <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span>, we have a mixture if, for any particular observation, we take <span class="math inline">\(x_1\)</span> with probability <span class="math inline">\(p_1\)</span> and <span class="math inline">\(x_2\)</span> with probability <span class="math inline">\(p_2\)</span>, where <span class="math inline">\(\sum_i{p_i} = 1\)</span>, <span class="math inline">\(i \in (1, 2)\)</span>. So, if we have already defined <span class="math inline">\(x_1\)</span> and <span class="math inline">\(x_2\)</span> using the <code>defData</code> function, we can create a third variable <span class="math inline">\(x_{mix}\)</span> with this definition:</p>
<pre class="r"><code>def <- defData(def, varname = "xMix",
  formula = "x1 | 0.4 + x2 | 0.6",
  dist = "mixture")</code></pre>
<p>In this example, we will draw <span class="math inline">\(x_1\)</span> with probability 0.4 and <span class="math inline">\(x_2\)</span> with probability 0.6. We are, however, not limited to mixing only two variables; to make that clear, I’ll start off with an example that shows a mixture of three normally distributed variables.</p>
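<p>For anyone curious what is going on under the hood, a mixture like this is equivalent to drawing a component index with the specified probabilities and then selecting the corresponding value. Here is a quick base R sketch of that idea (my own illustration, not the actual <code>simstudy</code> internals):</p>
<pre class="r"><code>set.seed(2716)
n <- 1e5

x1 <- rnorm(n, mean = 0, sd = 1)
x2 <- rnorm(n, mean = 10, sd = 1)

# draw the component index with probabilities 0.4 and 0.6
k <- sample(1:2, size = n, replace = TRUE, prob = c(0.4, 0.6))
xMix <- ifelse(k == 1, x1, x2)

mean(k == 1) # close to 0.4
mean(xMix)   # close to 0.4*0 + 0.6*10 = 6</code></pre>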
</div>
<div id="mixture-of-3-normal-distributions" class="section level3">
<h3>Mixture of 3 normal distributions</h3>
<p>In this case, we have <span class="math inline">\(x_1 \sim N(1,1)\)</span>, <span class="math inline">\(x_2 \sim N(5,4)\)</span>, and <span class="math inline">\(x_3 \sim N(9,1)\)</span>. The mixture will draw from <span class="math inline">\(x_1\)</span> 30% of the time, from <span class="math inline">\(x_2\)</span> 40%, and from <span class="math inline">\(x_3\)</span> 30%:</p>
<pre class="r"><code>def <- defData(varname = "x1", formula = 1, variance = 1)
def <- defData(def, varname = "x2", formula = 5, variance = 4)
def <- defData(def, varname = "x3", formula = 9, variance = 1)
def <- defData(def, varname = "xMix",
  formula = "x1 | .3 + x2 | .4 + x3 | .3",
  dist = "mixture")</code></pre>
<p>The data generation now proceeds as usual in <code>simstudy</code>:</p>
<pre class="r"><code>set.seed(2716)
dx <- genData(1000, def)
dx</code></pre>
<pre><code>##         id     x1   x2    x3   xMix
##    1:    1  1.640 4.12  7.13  4.125
##    2:    2 -0.633 6.89  9.07 -0.633
##    3:    3  1.152 2.95  8.71  1.152
##    4:    4  1.519 5.53  8.82  5.530
##    5:    5  0.206 5.55  9.31  5.547
##   ---                              
##  996:  996  2.658 1.87  8.09  1.870
##  997:  997  2.604 4.44  9.09  2.604
##  998:  998  0.457 5.56 10.87 10.867
##  999:  999 -0.400 4.29  9.03 -0.400
## 1000: 1000  2.838 4.78  9.17  9.174</code></pre>
<p>Here are two plots. The top shows the densities for the original distributions separately, and the bottom plot shows the mixture distribution (which is the distribution of <code>xMix</code>):</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-5-1.png" width="576" /></p>
<p>And it is easy to show that the mixture proportions are indeed based on the probabilities that were defined:</p>
<pre class="r"><code>dx[, .(p1=mean(xMix == x1), p2=mean(xMix == x2), p3=mean(xMix == x3))]</code></pre>
<pre><code>##       p1    p2    p3
## 1: 0.298 0.405 0.297</code></pre>
</div>
<div id="zero-inflated" class="section level3">
<h3>Zero-inflated</h3>
<p>One classic mixture model is the <em>zero-inflated Poisson</em> model. We can easily generate data from this model using a mixture distribution. In this case, the outcome is <span class="math inline">\(0\)</span> with probability <span class="math inline">\(p\)</span> and is a draw from a Poisson distribution with mean (and variance) <span class="math inline">\(\lambda\)</span> with probability <span class="math inline">\(1-p\)</span>. As a result, there will be an over-representation of 0’s in the observed data set.
In this example <span class="math inline">\(p\)</span> = 0.2 and <span class="math inline">\(\lambda = 2\)</span>:</p>
<pre class="r"><code>def <- defData(varname = "x0", formula = 0, dist = "nonrandom")
def <- defData(def, varname = "xPois", formula = 2, dist = "poisson")
def <- defData(def, varname = "xMix", formula = "x0 | .2 + xPois | .8",
  dist = "mixture")
set.seed(2716)
dx <- genData(1000, def)</code></pre>
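<p>As a rough check on this data generation, the theoretical mean of the zero-inflated distribution is <span class="math inline">\((1-p)\lambda = 1.6\)</span>, and the excess zeros push the variance up to <span class="math inline">\((1-p)\lambda(1 + p\lambda) = 2.24\)</span>, above the mean - the classic over-dispersion signature. The same mixture can be simulated directly in base R (a sketch independent of <code>simstudy</code>):</p>
<pre class="r"><code>set.seed(2716)
n <- 1e5
p <- 0.2
lambda <- 2

# zero with probability p, a Poisson draw otherwise
zip <- rbinom(n, 1, 1 - p) * rpois(n, lambda)

mean(zip)      # close to (1 - p) * lambda = 1.6
var(zip)       # close to (1 - p) * lambda * (1 + p * lambda) = 2.24
mean(zip == 0) # close to p + (1 - p) * exp(-lambda) = 0.31</code></pre>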
<p>The figure below shows a histogram of the Poisson distributed <span class="math inline">\(x_{pois}\)</span> on top and a histogram of the mixture on the bottom. It is readily apparent that the mixture distribution has “too many” zeros relative to the Poisson distribution:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-8-1.png" width="480" /></p>
<p>I fit the model below (using the <code>pscl</code> package) to see if it is possible to recover the assumptions I used in the data generation process. With 1000 observations, of course, it is easy:</p>
<pre class="r"><code>library(pscl)
zfit <- zeroinfl(xMix ~ 1 | 1, data = dx)
summary(zfit)</code></pre>
<pre><code>## 
## Call:
## zeroinfl(formula = xMix ~ 1 | 1, data = dx)
## 
## Pearson residuals:
##    Min     1Q Median     3Q    Max 
## -1.035 -1.035 -0.370  0.296  4.291 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6959     0.0306    22.8   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.239      0.107   -11.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 9 
## Log-likelihood: -1.66e+03 on 2 Df</code></pre>
<p>The estimated value of <span class="math inline">\(\lambda\)</span> from the model is the exponentiated value of the coefficient from the Poisson model: <span class="math inline">\(e^{0.6959}\)</span>. The estimate is quite close to the true value <span class="math inline">\(\lambda = 2\)</span>:</p>
<pre class="r"><code>exp(coef(zfit)[1])</code></pre>
<pre><code>## count_(Intercept)
## 2.01</code></pre>
<p>And the estimated probability of drawing a zero (i.e. <span class="math inline">\(\hat{p}\)</span>) is based on a simple transformation of the coefficient of the binomial model (<span class="math inline">\(-1.239\)</span>), which is on the logit scale. Again, the estimate is quite close to the true value <span class="math inline">\(p = 0.2\)</span>:</p>
<pre class="r"><code>1/(1 + exp(-coef(zfit)[2]))</code></pre>
<pre><code>## zero_(Intercept)
## 0.225</code></pre>
</div>
<div id="outlier-in-linear-regression" class="section level3">
<h3>Outlier in linear regression</h3>
<p>In this final example, I use the mixture option to generate outliers in the context of a regression model. This is done first by generating outcomes <span class="math inline">\(y\)</span> as a function of a predictor <span class="math inline">\(x\)</span>. Next, alternative outcomes <span class="math inline">\(y_{outlier}\)</span> are generated independent of <span class="math inline">\(x\)</span>. The observed outcomes <span class="math inline">\(y_{obs}\)</span> are a mixture of the outliers <span class="math inline">\(y_{outlier}\)</span> and the predicted <span class="math inline">\(y\)</span>’s. In this simulation, 2.5% of the observations will be drawn from the outliers:</p>
<pre class="r"><code>def <- defData(varname = "x", formula = 0, variance = 9,
  dist = "normal")
def <- defData(def, varname = "y", formula = "3+2*x", variance = 7,
  dist = "normal")
def <- defData(def, varname = "yOutlier", formula = 12, variance = 6,
  dist = "normal")
def <- defData(def, varname = "yObs",
  formula = "y | .975 + yOutlier | .025",
  dist = "mixture")

set.seed(2716)
dx <- genData(100, def)</code></pre>
<p>This scatter plot shows the relationship between <span class="math inline">\(y_{obs}\)</span> and <span class="math inline">\(x\)</span>; the red dots represent the observations drawn from the outlier distribution:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p>Once again, it is illustrative to fit a few models to estimate the linear relationship between <span class="math inline">\(y\)</span> and <span class="math inline">\(x\)</span>. The model that includes the true value of <span class="math inline">\(y\)</span> (as opposed to the outliers) unsurprisingly recovers the true relationship. The model that includes the observed outcomes (the mixture distribution) underestimates the relationship. And a robust regression model (using the <code>rlm</code> function in the <code>MASS</code> package) provides a less biased estimate:</p>
<pre class="r"><code>lm1 <- lm( y ~ x, data = dx)
lm2 <- lm( yObs ~ x, data = dx)
library(MASS)
rr <- rlm(yObs ~ x , data = dx)</code></pre>
<pre class="r"><code>library(stargazer)
stargazer(lm1, lm2, rr, type = "text",
  omit.stat = "all", omit.table.layout = "-asn",
  report = "vcs")</code></pre>
<pre><code>## 
## ================================
##            Dependent variable:  
##          -----------------------
##              y         yObs     
##             OLS     OLS  robust 
##                          linear 
##             (1)     (2)    (3)  
## x          2.210   2.030  2.150 
##           (0.093) (0.136) (0.111)
##                                 
## Constant   2.780   3.310  2.950 
##           (0.285) (0.417) (0.341)
##                                 
## ================================</code></pre>
<p>The scatter plot below includes the fitted lines from the estimated models: the blue line is the true regression model, the red line is the biased estimate based on the data that includes outliers, and the black line is the robust regression line that is much closer to the truth:</p>
<p><img src="https://www.rdatagen.net/post/2019-11-26-adding-mixture-distributions-to-simstudy.en_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>The mixture option is still experimental, though it is available on <a href="https://github.com/kgoldfeld/simstudy">github</a>. One enhancement I hope to make is to allow the mixture probability to be a function of covariates. The next release on CRAN will certainly include some form of this new distribution option.</p>
<p> </p>
</div>
What can we really expect to learn from a pilot study?
https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/
Tue, 12 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/
<p>I am involved with a very interesting project - the <a href="https://impactcollaboratory.org/">NIA IMPACT Collaboratory</a> - where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.</p>
<p>There is a rich literature on this topic. For example, these papers by <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j..2002.384.doc.x"><em>Lancaster et al</em></a> and <a href="https://www.sciencedirect.com/science/article/pii/S002239561000292X"><em>Leon et al</em></a> provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the <em>feasibility</em> of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?</p>
<p>Another thing the authors agree on is that the pilot study is <em>not</em> generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)</p>
<p>My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.</p>
<div id="a-small-two-arm-pilot-study" class="section level3">
<h3>A small, two-arm pilot study</h3>
<p>In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect <span class="math inline">\(\Delta = 50\)</span>. The outcome is a continuous measure with a within-arm standard deviation <span class="math inline">\(\sigma = 100\)</span>. In some fields of research, the effect size would be standardized as <span class="math inline">\(d = \Delta / \sigma\)</span>. (This is also known as <a href="https://rpsychologist.com/d3/cohend/">Cohen’s <span class="math inline">\(d\)</span></a>.) So, in this case the true standardized effect size <span class="math inline">\(d=0.5\)</span>.</p>
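<p>As a quick sanity check on the sample size calculation that follows, Lehr&rsquo;s well-known approximation for a two-sample comparison with 80% power at <span class="math inline">\(\alpha = 0.05\)</span> is <span class="math inline">\(n \approx 16/d^2\)</span> per arm (a back-of-the-envelope rule, not part of the original post):</p>
<pre class="r"><code>d <- 50/100
ceiling(16 / d^2) # approximately 64 per arm</code></pre>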
<p>If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate <span class="math inline">\(\alpha = 0.05\)</span>. Using the <code>pwr.t.test</code> function in the <code>pwr</code> library, we specify the treatment effect (as <span class="math inline">\(d\)</span>), significance level <span class="math inline">\(\alpha\)</span>, and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):</p>
<pre class="r"><code>library(pwr)
pwr.t.test(n = NULL, d = 50/100, sig.level = 0.05,
  power = 0.80, type = "two.sample") </code></pre>
<pre><code>## 
##      Two-sample t test power calculation 
## 
##               n = 64
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group</code></pre>
<p>If we do not have an estimate of <span class="math inline">\(d\)</span> or even of the individual components <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\sigma\)</span>, we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator <span class="math inline">\(rx\)</span> and an outcome <span class="math inline">\(y\)</span>:</p>
<pre class="r"><code>library(simstudy)
defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)
ss <- 30
set.seed(22821)
dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
head(dd)</code></pre>
<pre><code>##    id rx    y
## 1:  1  0 -150
## 2:  2  1   48
## 3:  3  0 -230
## 4:  4  1  116
## 5:  5  1   91
## 6:  6  1  105</code></pre>
<p>Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform <span class="math inline">\(d\)</span>? We have a couple of options. In the first case, we can estimate both <span class="math inline">\(\Delta\)</span> and <span class="math inline">\(\sigma\)</span> from the data and use those results directly in power calculations:</p>
<pre class="r"><code>lmfit <- lm(y ~ rx, data = dd)
Delta <- coef(lmfit)["rx"]
Delta</code></pre>
<pre><code>## rx
## 78</code></pre>
<pre class="r"><code>sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]
pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
pool.sd</code></pre>
<pre><code>## [1] 94</code></pre>
<p>The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated <span class="math inline">\(\hat{d}\)</span> is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = Delta/pool.sd, sig.level = 0.05,
  power = 0.80, type = "two.sample") </code></pre>
<pre><code>## 
##      Two-sample t test power calculation 
## 
##               n = 24
##               d = 0.83
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group</code></pre>
<p>Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = 50/pool.sd, sig.level = 0.05,
  power = 0.80, type = "two.sample") </code></pre>
<pre><code>## 
##      Two-sample t test power calculation 
## 
##               n = 56
##               d = 0.53
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group</code></pre>
</div>
<div id="speak-truth-to-power" class="section level3">
<h3>Speak truth to power</h3>
<p>Now the question becomes: what is the true expected power of the RCT based on the sample size estimated from the pilot study? To evaluate this, we calculate power using the true effect size and the true variance (i.e., the true <span class="math inline">\(d\)</span>).</p>
<p>In the first case, where we actually used the true <span class="math inline">\(d\)</span> to get the sample size estimate, we just recover the 80% power estimate. No surprise there:</p>
<pre class="r"><code>pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.8</code></pre>
<p>In the second case, where we used <span class="math inline">\(\hat{d} = \hat{\Delta} / \hat{\sigma}\)</span> to get the sample size <span class="math inline">\(n=24\)</span>, the true power of the larger RCT would be 40%:</p>
<pre class="r"><code>pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.4</code></pre>
<p>And if we had used <span class="math inline">\(\hat{d} = 50 / \hat{\sigma}\)</span> to get the sample size estimate <span class="math inline">\(n=56\)</span>, the true power would have been 75%:</p>
<pre class="r"><code>pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power</code></pre>
<pre><code>## [1] 0.75</code></pre>
</div>
<div id="conservative-estimate-of-standard-deviation" class="section level3">
<h3>Conservative estimate of standard deviation</h3>
<p>While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (and more on that in the next and last section), this <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780141709">1995 paper</a> by R.H. Browne presents the idea that we <em>can</em> use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.</p>
<p>The confidence interval for the standard deviation is not routinely provided in R. Another <a href="https://www.sciencedirect.com/science/article/pii/S0378375810005070?via%3Dihub">paper</a> analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula in the most straightforward case under assumptions of normality to estimate the <span class="math inline">\(\gamma*100\%\)</span> one-sided confidence interval for <span class="math inline">\(\sigma^2\)</span>:</p>
<p><span class="math display">\[
\left( 0,\frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right)
\]</span></p>
<p>where <span class="math inline">\(\chi^2_{N-2;\gamma}\)</span> is determined by <span class="math inline">\(P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma\)</span>. So, if <span class="math inline">\(\gamma = 0.95\)</span> then we can get a one-sided 95% confidence interval for the standard deviation using that formulation:</p>
<pre class="r"><code>gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi )
ucl</code></pre>
<pre><code>## [1] 111</code></pre>
<p>The point estimate <span class="math inline">\(\hat{\sigma}\)</span> is 94, and the one-sided 95% confidence interval is <span class="math inline">\((0, 111)\)</span>. (I’m happy to provide a simulation to demonstrate that this is in fact the case, but won’t do it here in the interest of space.)</p>
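<p>For the skeptical reader, here is one way that coverage simulation might look (a quick sketch under the same assumptions as above: two arms of 30 subjects each, drawn from normal distributions with <span class="math inline">\(\sigma = 100\)</span>). The true standard deviation should fall below the one-sided upper limit in roughly 95% of replications:</p>
<pre class="r"><code>set.seed(2716)
sigma <- 100
ss <- 30
gamma <- 0.95
qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)

covered <- replicate(2500, {
  # pooled SD from two simulated arms of size ss
  pool.sd <- sqrt((sd(rnorm(ss, 0, sigma))^2 + sd(rnorm(ss, 0, sigma))^2) / 2)
  ucl <- sqrt(((2*ss - 2) * pool.sd^2) / qchi)
  sigma <= ucl
})

mean(covered) # close to 0.95</code></pre>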
<p>If we use <span class="math inline">\(\hat{\sigma}_{ucl} = 111\)</span> to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate <span class="math inline">\(\hat{\sigma} = 94\)</span> (where the sample size requirement was 56):</p>
<pre class="r"><code>pwr.t.test(n = NULL, d = 50/ucl, sig.level = 0.05,
  power = 0.80, type = "two.sample") </code></pre>
<pre><code>## 
##      Two-sample t test power calculation 
## 
##               n = 78
##               d = 0.45
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group</code></pre>
<p>Ultimately, using <span class="math inline">\(\gamma = 0.95\)</span> might be too conservative, in that it could lead to an excessively large sample size requirement. Browne&rsquo;s paper uses simulation to evaluate a range of <span class="math inline">\(\gamma\)</span>&rsquo;s, from 0.5 to 0.9, which I also do in the next section.</p>
</div>
<div id="simulation-of-different-approaches" class="section level3">
<h3>Simulation of different approaches</h3>
<p>At this point, we need to generate multiple iterations to see how the various approaches perform over <em>repeated</em> pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above.</p>
<p>As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or minimally clinically meaningful effect size to estimate <span class="math inline">\(\Delta\)</span> (in the figures below, this would be the columns labeled <em>‘truth’</em>) with using the effect size point estimate from the pilot (labeled <em>pilot</em>). I also compare using a point estimate of <span class="math inline">\(\sigma\)</span> from the pilot (where <span class="math inline">\(\gamma=0\)</span>), with using the upper limit of a one-sided confidence interval defined by <span class="math inline">\(\gamma\)</span>. In these simulations I compare three levels of <span class="math inline">\(\gamma\)</span>: <span class="math inline">\(\gamma \in (0.5, 0.7, 0.9)\)</span>.</p>
<p>In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard deviation <span class="math inline">\(\sigma = 100\)</span>, so that the true <span class="math inline">\(d\)</span> is 0.30 or 0.75.</p>
<p>The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios. Each scenario was simulated 5000 times. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the proportion of simulations with power estimates at or above the target of 80%.</p>
<p><img src="https://www.rdatagen.net/img/post-pilot/pilot30.png" style="width:90.0%" /></p>
<p>Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for <span class="math inline">\(\hat{\sigma}\)</span>. In Browne’s paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is actually likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, perhaps using <span class="math inline">\(\gamma\)</span> higher than 0.6 or 0.7 might provide an adequate cushion.</p>
<p>Second, it appears <em>using the effect size estimate from the pilot as the basis for an RCT power analysis is risky</em>. The box plots labeled <em>pilot</em> exhibit much more variation than the <em>&lsquo;truth&rsquo;</em> box plots. As a result, there is a high probability that the true power will fall considerably below 80%. And in many other cases, the true power will be unnecessarily high, because the studies will have been designed to be larger than they need to be.</p>
<p>The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where variation is reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an underpowered or overpowered study, neither of which is ideal.</p>
<p><img src="https://www.rdatagen.net/img/post-pilot/pilot60.png" style="width:90.0%" /></p>
<p>A question remains about how best to determine what effect size to use for the power calculation if using the estimate from the pilot is risky. I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. And the pilot study should focus on other important feasibility issues that <em>can</em> help improve the design of the RCT.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of evaluation in clinical practice, 10(2), pp.307-312.</p>
<p>Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of psychiatric research, 45(5), pp.626-629.</p>
<p>Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in medicine, 14(17), pp.1933-1940.</p>
<p>Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.</p>
<p> </p>
<p>Support:</p>
This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.
</font></small>
</p>
<p> </p>
</div>
<div id="addendum" class="section level3">
<h3>Addendum</h3>
<p>Below is the code I used to run the simulations and generate the plots:</p>
<pre class="r"><code>getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {
  
  estring <- paste0("rx * ", esize)
  defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)
  
  N <- ssize * 2
  
  dd <- genData(n = N)
  dd <- trtAssign(dd, grpName = "rx")
  dd <- addColumns(defd, dd)
  
  lmfit <- lm(y ~ rx, data = dd)
  
  sd.rx <- dd[rx == 1, sd(y)]
  sd.ctl <- dd[rx == 0, sd(y)]
  pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
  
  qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)
  ucl <- sqrt( ( (N - 2) * pool.sd^2 ) / qchi )
  
  p.sd <- pool.sd * (gamma == 0) + ucl * (gamma > 0)
  p.eff <- esize * (use.est == FALSE) + 
    coef(lmfit)["rx"] * (use.est == TRUE)
  
  if (abs(p.eff/p.sd) < 0.0002) p.eff <- sign(p.eff) * .0002 * p.sd
  
  nstar <- round(pwr.t.test(n = NULL, d = p.eff/p.sd, sig.level = 0.05,
    power = 0.80, type = "two.sample")$n, 0)
  
  power <- pwr.t.test(n = nstar, d = esize/100, sig.level = 0.05,
    type = "two.sample")
  
  return(data.table(ssize, esize, gamma, use.est,
    estsd = pool.sd, ucl = ucl, nstar, power = power$power,
    est = coef(lmfit)["rx"],
    lcl.est = confint(lmfit)["rx", 1],
    ucl.est = confint(lmfit)["rx", 2])
  )
}</code></pre>
<pre class="r"><code>dres <- data.table()

for (i in c(30, 60)) {
  for (j in c(30, 75)) {
    for (k in c(0, .5, .7)) {
      for (l in c(FALSE, TRUE)) {
        dd <- rbindlist(lapply(1:5000,
          function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l))
        )
        dres <- rbind(dres, dd)
      }}}}</code></pre>
<pre class="r"><code>above80 <- dres[, .(x80 = mean(power >= 0.80)),
  keyby = .(ssize, esize, gamma, use.est)]
above80[, l80 := scales::percent(x80, accuracy = 1)]

g_labeller <- function(value) {
  paste("\U03B3", "=", value) # unicode for gamma
}

e_labeller <- function(value) {
  paste("\U0394", "=", value) # unicode for Delta
}

ggplot(data = dres[ssize == 30],
    aes(x = factor(use.est, labels = c("'truth'", "pilot")), y = power)) +
  geom_hline(yintercept = 0.8, color = "white") +
  geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "grey92"),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 9, face = "bold")) +
  facet_grid(esize ~ gamma,
    labeller = labeller(gamma = g_labeller, esize = e_labeller)) +
  scale_x_discrete(
    name = "\n source of effect size used for power calculation") +
  scale_y_continuous(limits = c(0, 1), breaks = c(0, .8),
    name = "distribution of power estimates \n") +
  ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +
  geom_text(data = above80[ssize == 30],
    aes(label = l80), x = rep(c(0.63, 1.59), 6), y = 0.95,
    size = 2.5)</code></pre>
</div>
Any one interested in a function to quickly generate data with many predictors?
https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/
Tue, 29 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/
<p>A couple of months ago, I was contacted about the possibility of creating a simple function in <code>simstudy</code> to generate a large dataset that could include tens or hundreds of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other &ldquo;regularization&rdquo; methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.</p>
<p>I’m presenting a new function here as a work-in-progress. I am putting it out there in case other folks have opinions about what might be most useful; feel free to let me know if you do. If not, I am likely to include something very similar to this in the next iteration of <code>simstudy</code>, which will be version <code>0.1.16</code>.</p>
<div id="function-genmultpred" class="section level3">
<h3>Function genMultPred</h3>
<p>In its latest iteration, the new function has three interesting arguments. The first two are <code>predNorm</code> and <code>predBin</code>, each a vector of length 2. The first element specifies the number of predictors to generate with a standard normal or a binary distribution, respectively; the second element specifies how many of those variables will actually be predictive of the outcome. (Obviously, the second value cannot be greater than the first.)</p>
<p>The third interesting argument is <code>corStrength</code>, a non-negative number indicating the overall strength of the correlation between the predictors. When <code>corStrength</code> is set to 0 (the default), the variables are generated assuming independence. When it is non-zero, a random correlation matrix is generated using the package <code>clusterGeneration</code> [Weiliang Qiu and Harry Joe. (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).] The <code>corStrength</code> value is passed on to the argument <code>ratioLambda</code> in the function <code>genPositiveDefMat</code>; as <code>corStrength</code> increases, higher levels of correlation are induced in the random correlation matrix for the predictors.</p>
<p>Currently, the outcome can only have one of three distributions: <em>normal</em>, <em>binomial</em>, or <em>Poisson</em>.</p>
<p>One possible enhancement would be to allow the distributions of the predictors to have more flexibility. However, I’m not sure the added complexity would be worth it. Again, you could always take the more standard <code>simstudy</code> approach of function <code>genData</code> if you wanted more flexibility.</p>
<p>Here’s the function, in case you want to take a look under the hood:</p>
<pre class="r"><code>genMultPred <- function(n, predNorm, predBin,
                        dist = "normal", sdy = 1, corStrength = 0) {
  
  normNames <- paste0("n", 1:predNorm[1])
  binNames <- paste0("b", 1:predBin[1])
  
  ## Create the definition tables to be used by genData
  
  defn <- data.table(varname = normNames,
                     formula = 0,
                     variance = 1,
                     dist = "normal",
                     link = "identity")
  
  defb <- data.table(varname = binNames,
                     formula = 0.5,
                     variance = NA,
                     dist = "binary",
                     link = "identity")
  
  defx <- rbind(defn, defb)
  attr(defx, which = "id") <- "id"
  
  ## Create the coefficient values - all normally distributed
  
  ncoefs <- rnorm(predNorm[1], 0, 1)
  setzero <- sample(1:predNorm[1], (predNorm[1] - predNorm[2]),
                    replace = FALSE)
  ncoefs[setzero] <- 0
  
  bcoefs <- rnorm(predBin[1], 0, 1)
  setzero <- sample(1:predBin[1], (predBin[1] - predBin[2]),
                    replace = FALSE)
  bcoefs[setzero] <- 0
  
  coefs <- c(ncoefs, bcoefs)
  names(coefs) <- c(normNames, binNames)
  
  ## Generate the predictors
  
  if (corStrength <= 0) { # predictors are independent
    dx <- genData(n, defx)
  } else {
    rLambda <- max(1, corStrength)
    covx <- cov2cor(genPositiveDefMat(nrow(defx),
                      lambdaLow = 1, ratioLambda = rLambda)$Sigma)
    dx <- genCorFlex(n, defx, corMatrix = covx)
  }
  
  ## Generate the means (given the predictors)
  
  mu <- as.matrix(dx[, -"id"]) %*% coefs
  dx[, mu := mu]
  
  ## Generate the outcomes based on the means
  
  if (dist == "normal") {
    dx[, y := rnorm(n, mu, sdy)]
  } else if (dist == "binary") {
    dx[, y := rbinom(n, 1, 1/(1 + exp(-mu)))] # link = logit
  } else if (dist == "poisson") {
    dx[, y := rpois(n, exp(mu))] # link = log
  }
  
  dx[, mu := NULL]
  return(list(data = dx[], coefs = coefs))
}</code></pre>
</div>
<div id="a-brief-example" class="section level2">
<h2>A brief example</h2>
<p>Here is an example with 7 normally distributed covariates and 4 binary covariates. Only 3 of the continuous covariates and 2 of the binary covariates will actually be predictive.</p>
<pre class="r"><code>library(simstudy)
library(clusterGeneration)
set.seed(732521)
dx <- genMultPred(250, c(7, 3), c(4, 2))</code></pre>
<p>The function returns a list of two objects. The first is a data.table containing the generated predictors and outcome:</p>
<pre class="r"><code>round(dx$data, 2)</code></pre>
<pre><code>##       id    n1    n2    n3    n4    n5    n6    n7 b1 b2 b3 b4     y
##   1:   1  0.15  0.12 -0.07 -1.38 -0.05  0.58  0.57  1  1  0  1 -1.07
##   2:   2  1.42 -0.64  0.08  0.83  2.01  1.18  0.23  1  1  0  0  4.42
##   3:   3 -0.71  0.77  0.94  1.59 -0.53 -0.05  0.26  0  0  0  0  0.09
##   4:   4  0.35 -0.80  0.90 -0.79 -1.72 -0.16  0.09  0  0  1  1 -0.58
##   5:   5 -0.22 -0.72  0.62  1.40  0.17  2.21 -0.45  0  1  0  1 -2.18
##  ---                                                                
## 246: 246 -1.04  1.62  0.40  1.46  0.80 -0.77 -1.27  0  0  0  0 -1.19
## 247: 247 -0.85  1.56  1.39 -1.25 -0.82 -0.63  0.13  0  1  0  0 -0.70
## 248: 248  0.72 -0.83 -0.04 -1.38  0.61 -0.71 -0.06  1  0  1  1  0.74
## 249: 249 -0.15  1.62 -1.01 -0.79 -0.53  0.44 -0.46  1  1  1  1  0.95
## 250: 250 -0.59  0.34 -0.31  0.18 -0.86 -0.90  0.22  1  0  1  0 -1.90</code></pre>
<p>The second object is the set of coefficients that determine the average response conditional on the predictors:</p>
<pre class="r"><code>round(dx$coefs, 2)</code></pre>
<pre><code>##    n1    n2    n3    n4    n5    n6    n7    b1    b2    b3    b4 
##  2.48  0.62  0.28  0.00  0.00  0.00  0.00  0.00  0.00  0.53 -1.21</code></pre>
<p>Finally, we can “recover” the original coefficients with linear regression:</p>
<pre class="r"><code>lmfit <- lm(y ~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + b1 + b2 + b3 + b4,
data = dx$data)</code></pre>
<p>Here’s a plot showing the 95% confidence intervals of the estimates along with the true values. The yellow lines are covariates where there is truly no association.</p>
<p><img src="https://www.rdatagen.net/post/2019-10-29-any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors.en_files/figure-html/unnamed-chunk-7-1.png" width="576" /></p>
<p> </p>
<div id="addendum-correlation-among-predictors" class="section level3">
<h3>Addendum: correlation among predictors</h3>
<p>Here is a pair of examples using the <code>corStrength</code> argument. In the first case, the observed correlations are close to 0, whereas in the second case, the correlations range from -0.50 to 0.25. The impact of <code>corStrength</code> will vary depending on the number of potential predictors.</p>
<pre class="r"><code>set.seed(291212)
# Case 1
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 0)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)</code></pre>
<pre><code>##       n1    n2    n3    n4    b1    b2
## n1  1.00 -0.02  0.02  0.03 -0.01 -0.01
## n2 -0.02  1.00 -0.01  0.03 -0.03  0.00
## n3  0.02 -0.01  1.00  0.00 -0.04 -0.01
## n4  0.03  0.03  0.00  1.00  0.06 -0.01
## b1 -0.01 -0.03 -0.04  0.06  1.00 -0.01
## b2 -0.01  0.00 -0.01 -0.01 -0.01  1.00</code></pre>
<pre class="r"><code># Case 2
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 50)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)</code></pre>
<pre><code>##       n1    n2    n3    n4    b1    b2
## n1  1.00  0.09  0.08 -0.32  0.25  0.04
## n2  0.09  1.00 -0.29 -0.47 -0.05 -0.02
## n3  0.08 -0.29  1.00 -0.46 -0.01 -0.01
## n4 -0.32 -0.47 -0.46  1.00 -0.20 -0.05
## b1  0.25 -0.05 -0.01 -0.20  1.00 -0.04
## b2  0.04 -0.02 -0.01 -0.05 -0.04  1.00</code></pre>
</div>
</div>
Selection bias, death, and dying
https://www.rdatagen.net/post/selection-bias-death-and-dying/
Tue, 15 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/selection-bias-death-and-dying/
<p>I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death.</p>
<p>A key challenge for researchers is figuring out how to analyze events that occur just before death. For example, it is not unusual to consider hospitalization in the week or month before death as a poor outcome. For example, here is a <a href="https://www.liebertpub.com/doi/full/10.1089/jpm.2015.0229">paper</a> in the <em>Journal of Palliative Care Medicine</em> that describes an association of homecare nursing and reduced hospitalizations in the week before death. While there is no denying the strength of the association, it is less clear how much of that association is causal.</p>
<p>In particular, there is the possibility of <em>selection bias</em> that may result when considering only patients who have died. In this post, I want to describe the concept of selection bias and simulate data that mimics the process of end-stage disease in order to explore how these issues might play out when we are actually evaluating the causal effect of an exposure or randomized intervention.</p>
<div id="selection-bias" class="section level3">
<h3>Selection bias</h3>
<p><em>Selection bias</em> is used to refer to different concepts by different researchers (see this article by <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4043938/"><em>Haneuse</em></a> or this one by <a href="https://journals.lww.com/epidem/Fulltext/2004/09000/Causation_of_Bias__The_Episcope.20.aspx"><em>Hernán et al</em></a> for really nice discussions of these issues). The terminology doesn’t matter as much as understanding the underlying data generating processes that distinguish the different ideas.</p>
<p>The key issue is to understand <em>what</em> is being selected. In one case, the exposure or intervention is the focus. And in the second case, it is how the patients or subjects are selected into the study more generally that induces the bias. The first selection process is typically referred to by epidemiologists as <em>confounding bias</em> (though it is also called <em>treatment-selection bias</em>), while the second is actually <em>selection bias</em>.</p>
<p>When I’ve written about these issues before (for example, see <a href="https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/">here</a>), I’ve described how DAGs can be useful to illuminate the potential biases. Below, I have drawn a diagram to represent a simple case of selection bias. Say we are interested in measuring the causal relationship between <em>income</em> and <em>blood pressure</em> in some population in which the two are actually not causally related. If people with higher income are more likely to visit a doctor, and if people with higher blood pressure are also more likely to visit a doctor, the underlying causal relationship might be well represented by the DAG on the left in the figure below.</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/selectionbias.png" style="width:65.0%" /></p>
<p>Let’s say we recruit participants for our study right outside of a medical facility. Choosing this location (as opposed to, say, a shopping mall where the causal model on the left would not be relevant), we are inducing a relationship between <em>income</em> and <em>blood pressure</em>. This can be seen in the DAG on the right, where we have effectively “controlled” for medical facility access in our selection process. The induced statistical relationship can be described in this way: if someone is at the medical center and they have relatively low income, they are more likely to have relatively high blood pressure. Conversely, if someone is there and they have relatively low blood pressure, they are more likely to have relatively high income. Based on this logic, we would expect to see a negative relationship between <em>income</em> and <em>blood pressure</em> in our study sample drawn from patients visiting a medical facility.</p>
<p>To explore by simulation, we can generate a large population of individuals with uncorrelated income and blood pressure. Selection will be a function of both:</p>
<pre class="r"><code>n <- 5000
set.seed(748347)

income <- rnorm(n)
bp <- rnorm(n)

logitSelect <- -2 + 2*income + 2*bp
pSelect <- 1/(1 + exp(-logitSelect))
select <- rbinom(n, 1, pSelect)

dPop <- data.table(income, bp, select)
dSample <- dPop[select == 1]</code></pre>
<p>The plot on the left below is the overall population of 5000; there is no obvious relationship between the <em>income</em> and <em>blood pressure</em>. The group that was recruited at the medical facility and enrolled in the study (a subset of the original population) is shown in purple in the plot on the right. In this subset, there does indeed appear to be a relationship between the two characteristics. An estimate of the association, which we know is zero, based on the sample would be biased; that bias is due to how we selected participants into the study.</p>
<p><img src="https://www.rdatagen.net/post/2019-10-15-selection-bias-death-and-dying.en_files/figure-html/unnamed-chunk-2-1.png" width="960" /></p>
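<p>The collider effect in the plot can also be checked numerically. Here is a minimal base-R sketch using the same data generating process (no packages required); the exact values depend on the seed, but the selected subset reliably shows a clearly negative correlation:</p>

```r
set.seed(748347)
n <- 5000

# income and blood pressure are generated independently
income <- rnorm(n)
bp <- rnorm(n)

# probability of selection increases with both income and blood pressure
pSelect <- 1 / (1 + exp(-(-2 + 2 * income + 2 * bp)))
select <- rbinom(n, 1, pSelect)

# near zero in the full population, clearly negative among the selected
cor_pop <- cor(income, bp)
cor_sel <- cor(income[select == 1], bp[select == 1])
round(c(population = cor_pop, selected = cor_sel), 2)
```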
</div>
<div id="hospitalization-before-death" class="section level3">
<h3>Hospitalization before death</h3>
<p>In the next simulation, let’s consider a somewhat more complex process, though with the same underlying structure and similar bias as the simpler case above. The next DAG (below) shows three different time periods. In this case there is an indicator of homecare by a nurse <span class="math inline">\(N_1\)</span>, <span class="math inline">\(N_2\)</span>, and <span class="math inline">\(N_3\)</span>. (In this particular example, an individual has home nursing care in all three periods or they don’t have any home nursing care in any period. This is not a requirement.) In each period, each patient has an underlying time-dependent health status, ranging from <span class="math inline">\(1\)</span> (healthiest) to <span class="math inline">\(4\)</span> (sickest). In this simulated study, the underlying health status <span class="math inline">\(U_1\)</span>, <span class="math inline">\(U_2\)</span>, and <span class="math inline">\(U_3\)</span> are considered latent (i.e. unmeasured). The progression of health status is governed by a Markov process that is independent of any kind of treatment. (See <a href="https://www.rdatagen.net/post/simstudy-1-14-update/">here</a> and <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">here</a> for a more detailed description of how this is done using <code>simstudy</code>.)</p>
<p>The probability of hospitalization is solely a function of the underlying health status, and nothing else. (I could make hospitalization a function of home nursing care as well, but this simplifies matters; in both cases the estimates will be biased - you can try for yourself.)</p>
<p>Death is a function of underlying health status and home nursing care. While it does not seem to be the case in practice, I am assuming here that less aggressive care results in shorter survival times. And the sicker the patient is in a particular period, the greater the risk of dying in that period. (There should be lines between death in the various periods and all subsequent measures, but I have eliminated them for clarity’s sake.)</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/repeated.png" style="width:80.0%" /></p>
<p>The code to generate the data starts with the definitions: first, I define an initial health state <span class="math inline">\(S_0\)</span> that can range from 1 to 3 and the transition matrix <span class="math inline">\(P\)</span> for the Markov process. Next, I define the hospitalization and death outcomes:</p>
<pre class="r"><code>bDef <- defData(varname = "S0", formula = "0.4;0.4;0.2",
                dist = "categorical")

P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
                 0.0, 0.4, 0.4, 0.2,
                 0.0, 0.0, 0.6, 0.4,
                 0.0, 0.0, 0.0, 1.0),
              nrow = 4))

pDef <- defDataAdd(varname = "hospital", formula = "-2 + u",
                   dist = "binary", link = "logit")
pDef <- defDataAdd(pDef, varname = "death",
                   formula = "-2 + u + homenurse * 1.5",
                   dist = "binary", link = "logit")</code></pre>
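<p>Note that the transition matrix <span class="math inline">\(P\)</span> is upper triangular, so health status can never improve and state 4 is absorbing. To see the mechanics of the Markov process, here is a base-R sketch of a single trajectory, written out by hand rather than with <code>addMarkov</code>:</p>

```r
# transition matrix: rows are the current state, columns the next state
P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.0, 0.4, 0.4, 0.2,
                0.0, 0.0, 0.6, 0.4,
                0.0, 0.0, 0.0, 1.0), nrow = 4))

set.seed(18)
u <- numeric(4)
u[1] <- sample(1:3, 1, prob = c(0.4, 0.4, 0.2))  # initial state S0
for (t in 2:4) {
  u[t] <- sample(1:4, 1, prob = P[u[t - 1], ])   # one Markov step per period
}
u  # a non-decreasing sequence of health states
```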
<p>The data generation process randomizes individuals to home nursing care (or usual care) in the first period, and creates all of the health status measures and outcomes. The last step removes any data for an individual that were generated after they died. (The function <code>trimData</code> is new and only available in <code>simstudy 0.1.15</code>, which is available on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a> as of 10/21/2019.)</p>
<pre class="r"><code>set.seed(272872)

dd <- genData(10000, bDef)
dd <- trtAssign(dd, grpName = "homenurse")

dp <- addMarkov(dd, transMat = P,
                chainLen = 4, id = "id",
                pername = "seq", start0lab = "S0",
                varname = "u")

dp <- addColumns(pDef, dp)
dp <- trimData(dp, seqvar = "seq", eventvar = "death")</code></pre>
<p> </p>
<div id="a-short-follow-up-period" class="section level4">
<h4>A short follow-up period</h4>
<p>If we have a relatively short follow-up period in our randomized trial of supportive home nursing care, only a portion of the sample will die; as a result, we can only compare hospitalization before death for a subset of the sample. By selecting on death, we will induce a relationship between the intervention and the outcome where none truly exists. Inspecting the DAGs below, it is apparent that this is a classic case of selection bias. Since we cannot control for the unmeasured health status <span class="math inline">\(U\)</span>, hospitalization and death are associated. And, since treatment and death <em>are</em> causally related, by selecting on death we are in the same situation as we were in the first example.</p>
<p><img src="https://www.rdatagen.net/img/post-selectdeath/singleprepost.png" style="width:100.0%" /></p>
<pre class="r"><code>d1 <- dp[seq == 1]</code></pre>
<p>If we consider only those who died in the first period, we will be including 61% of the sample:</p>
<pre class="r"><code>d1[, mean(death)]</code></pre>
<pre><code>## [1] 0.6109</code></pre>
<p>To get a sense of the bias, I am considering three models. The first model estimates the effect of the intervention on hospitalization only for those who died in the first period; we expect this estimate to have a negative bias. In the second model, we use the same subset of patients who died, but adjust for underlying health status; the intervention coefficient should be close to zero. Finally, we estimate a model for everyone in period 1, regardless of whether they died; again, we expect the effect estimate to be close to 0.</p>
<pre class="r"><code>fit1 <- glm(hospital ~ homenurse, data = d1[death == 1],
            family = "binomial")
fit2 <- glm(hospital ~ homenurse + u, data = d1[death == 1],
            family = "binomial")
fit3 <- glm(hospital ~ homenurse, data = d1,
            family = "binomial")

library(stargazer)
stargazer(fit1, fit2, fit3, type = "text",
          column.labels = c("died", "died - adjusted", "all"),
          omit.stat = "all", omit.table.layout = "-asn")</code></pre>
<pre><code>## 
## =============================================
##                    Dependent variable:       
##            ----------------------------------
##                         hospital             
##               died   died - adjusted    all  
##               (1)          (2)          (3)  
## homenurse  -0.222***     -0.061       -0.049 
##             (0.053)      (0.057)     (0.040) 
##                                              
## u                        1.017***             
##                          (0.039)              
##                                              
## Constant    0.108**     -2.005***   -0.149***
##             (0.042)      (0.092)     (0.028) 
##                                              
## =============================================</code></pre>
<p>While these are estimates from a single data set (I should really do a more extensive experiment based on many different data sets), the estimates do seem to support our expectations. Indeed, if we cannot measure the underlying health status, the estimate of the intervention effect on hospitalization prior to death is biased; we would conclude that supportive care reduces the probability of hospitalization before death when we know (based on the data generation process used here) that it does not.</p>
</div>
<div id="extended-follow-up" class="section level4">
<h4>Extended follow-up</h4>
<p>We might think that if we could follow <em>everyone</em> up until death (and hence not select on death), the bias would be eliminated. However, this is not the case. The treatment effect is essentially an average of the effect over all time periods, and we know that for each time period the effect estimate is biased due to selection. Averaging across biased estimates yields a biased estimate.</p>
<p>This issue is closely related to a general issue for causal survival analysis. It has been pointed out that it is not possible to estimate a causal treatment effect using hazard rates, as we do when we use Cox proportional hazard models. This is true even if treatment has been randomized and the two treatment arms are initially balanced with respect to underlying health status. The challenge is that after the first set of deaths, the treatment groups will no longer be balanced with respect to health status; some people survived because of the intervention, others because they were generally healthier. At each point in the survival analysis, the model for risk of death is conditioning (i.e. selecting on) those who did not die. So, there is built in selection bias in the modelling. If you are interested in reading more about these issues, I recommend taking a look at these papers by <a href="https://journals.lww.com/epidem/fulltext/2010/01000/The_Hazards_of_Hazard_Ratios.4.aspx"><em>Hernán</em></a> and <a href="https://link.springer.com/article/10.1007/s10985-015-9335-y"><em>Aalen et al.</em></a>.</p>
<p>Now, back to the simulation. In this case, we analyze everyone who has died within 4 periods, which is about 97% of the initial sample, virtually everyone.</p>
<pre class="r"><code>dDied <- dp[death == 1]
nrow(dDied)/nrow(d1)</code></pre>
<pre><code>## [1] 0.9658</code></pre>
<p>The effect estimate based on this data set is only unbiased when we are able to control for underlying health status. Otherwise, extending follow-up does not help remove any bias.</p>
<pre class="r"><code>fit4 <- glm(hospital ~ homenurse, data = dDied, family = "binomial")
fit5 <- glm(hospital ~ homenurse + u, data = dDied, family = "binomial")

stargazer(fit4, fit5, type = "text",
          omit.stat = "all", omit.table.layout = "-asn")</code></pre>
<pre><code>## 
## ==============================
##        Dependent variable:    
##       ------------------------
##              hospital         
##             (1)        (2)    
## homenurse -0.383***   -0.048  
##            (0.041)   (0.045)  
##                               
## u                    1.020*** 
##                      (0.028)  
##                               
## Constant   0.296***  -2.028***
##            (0.030)   (0.070)  
##                               
## ==============================</code></pre>
<p>In the future, I hope to explore alternative ways to analyze these types of questions. In the case of survival analysis, models that do not condition on death have been proposed to get at causal estimates. This may not be possible when the outcome of interest (health care before death) is defined by conditioning on death. We may actually need to frame the question slightly differently to be able to get an unbiased estimate.</p>
<p>
<p><small><font color="darkkhaki">
References:</p>
<p>Seow, H., Sutradhar, R., McGrail, K., Fassbender, K., Pataky, R., Lawson, B., Sussman, J., Burge, F. and Barbera, L., 2016. End-of-life cancer care: temporal association between homecare nursing and hospitalizations. <em>Journal of palliative medicine</em>, 19(3), pp.263-270.</p>
<p>Haneuse, S., 2016. Distinguishing selection bias and confounding bias in comparative effectiveness research. <em>Medical care</em>, 54(4), p.e23.</p>
<p>Hernán, M.A., Hernández-Díaz, S. and Robins, J.M., 2004. A structural approach to selection bias. <em>Epidemiology</em>, 15(5), pp.615-625.</p>
<p>Hernán, M.A., 2010. The hazards of hazard ratios. <em>Epidemiology</em>, 21(1), p.13.</p>
<p>Aalen, O.O., Cook, R.J. and Røysland, K., 2015. Does Cox analysis of a randomized survival study yield a causal treatment effect?. <em>Lifetime data analysis</em>, 21(4), pp.579-593.</p>
<p>Support:</p>
This research is supported by the National Institutes of Health National Institute on Aging R33AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.
</font></small>
</p>
</div>
</div>
There's always at least two ways to do the same thing: an example generating 3-level hierarchical data using simstudy
https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/
Thu, 03 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/
<p>“I am working on a simulation study that requires me to generate data for individuals within clusters, but each individual will have repeated measures (say baseline and two follow-ups). I’m new to simstudy and have been going through the examples in R this afternoon, but I wondered if this was possible in the package, and if so whether you could offer any tips to get me started with how I would do this?”</p>
<p>This question popped up in my in-box a couple of days ago. And since I always like an excuse to do a little coding early in the morning to get my brain going, I decided to create a little example, though in this case, there were at least two ways to go about it. I sent back both options, and am putting them up here, since I know this kind of data generation problem comes up frequently. In fact, the post I recently wrote on <a href="https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/">open cohorts in stepped-wedge designs</a> had to deal with this same issue, though in a slightly more elaborate way.</p>
<div id="three-level-hierarchical-data" class="section level3">
<h3>Three-level hierarchical data</h3>
<p>In this example, we want individuals clustered within groups, and measurements clustered within individual, as depicted by this figure:</p>
<p><img src="https://www.rdatagen.net/img/post-twoways/cluster.png" /></p>
<p>The hierarchical scheme represented here implies that outcomes for individuals within a group are correlated, and that repeated measurements for a particular individual are correlated over time. The structure of these two levels of correlation can take on a variety of forms. In the examples that follow, I am going to assume that the correlation between individuals in a group is constant, as is the correlation between an individual’s measurements over time. We could easily assume that measurements closer in time are more highly correlated than measurements further apart (such as auto-regressive correlation with 1 period of lag), but since we have only three measurements, it is not totally unreasonable to assume constant correlation.</p>
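<p>To make the two candidate structures concrete, here is a small base-R sketch of the within-person correlation matrices for three time points (the value of <code>rho</code> is arbitrary, chosen only for illustration):</p>

```r
rho <- 0.75  # an arbitrary illustrative correlation

# compound symmetry: every pair of time points shares the same correlation
cs <- matrix(rho, 3, 3)
diag(cs) <- 1

# AR(1): correlation decays with the lag between measurements
ar1 <- rho^abs(outer(0:2, 0:2, "-"))

cs
ar1
```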
</div>
<div id="generating-data-explicitly-with-random-effects" class="section level3">
<h3>Generating data explicitly with random effects</h3>
<p>Enough with the preliminaries - let’s get to the data generation. In the first approach, both levels of correlation will be induced with group- and individual-level random effects using the following underlying model:</p>
<p><span class="math display">\[Y_{ijt} = \beta_t + \gamma_j + \alpha_i + \epsilon_{ijt},\]</span></p>
<p>where <span class="math inline">\(Y_{ijt}\)</span> is the outcome for person <span class="math inline">\(i\)</span> in group <span class="math inline">\(j\)</span> during time period <span class="math inline">\(t\)</span>. <span class="math inline">\(\beta_t\)</span> is the mean outcome during period <span class="math inline">\(t\)</span>, <span class="math inline">\(t \in \{ 0,3, 6 \}\)</span>. <span class="math inline">\(\gamma_j\)</span> is the group-specific effect, and <span class="math inline">\(\gamma_j \sim N(0,\sigma^2_\gamma)\)</span>. <span class="math inline">\(\alpha_i\)</span> is the individual-specific effect, and <span class="math inline">\(\alpha_i \sim N(0,\sigma^2_\alpha)\)</span>. Finally, <span class="math inline">\(\epsilon_{ijt}\)</span> is the noise for each particular measurement, where <span class="math inline">\(\epsilon_{ijt} \sim N(0,\sigma^2_\epsilon)\)</span>.</p>
<p>The group, individual, and outcome definitions are the first order of business. In this example, <span class="math inline">\(\sigma^2_\gamma = 2\)</span>, <span class="math inline">\(\sigma^2_\alpha = 1.3\)</span>, and <span class="math inline">\(\sigma^2_\epsilon = 1.1\)</span>. In addition, the average outcomes at baseline, 3 months and 6 months, are 3, 4, and 6, respectively:</p>
<pre class="r"><code>library(simstudy)

### Group definition

defg <- defData(varname = "gamma", formula = 0, variance = 2, id = "cid")

### Individual definition

defi <- defDataAdd(varname = "alpha", formula = 0, variance = 1.3)

### Outcome definition

defC <- defCondition(condition = "period == 0",
                     formula = "3 + gamma + alpha",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1",
                     formula = "4 + gamma + alpha",
                     dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2",
                     formula = "6 + gamma + alpha",
                     dist = "nonrandom")

defy <- defDataAdd(varname = "y", formula = "mu", variance = 1.1)</code></pre>
<p>To generate the data, first we create the group level records, then the individual level records, and finally the repeated measurements for each individual:</p>
<pre class="r"><code>set.seed(3483)
dgrp1 <- genData(100, defg)
dind1 <- genCluster(dgrp1, "cid", numIndsVar = 20, level1ID = "id")
dind1 <- addColumns(defi, dind1)
dper1 <- addPeriods(dind1, nPeriods = 3, idvars = "id")
dper1 <- addCondition(defC, dper1, newvar = "mu")
dper1 <- addColumns(defy, dper1)</code></pre>
<p>Here is a plot of the outcome data by period, with the grey lines representing individuals, and the red lines representing the group averages:</p>
<p><img src="https://www.rdatagen.net/post/2019-10-03-in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Here is a calculation of the observed covariance matrix. The total variance for each outcome should be close to <span class="math inline">\(\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon = 4.4\)</span>, and the observed covariance should be close to <span class="math inline">\(\sigma^2_\gamma + \sigma^2_\alpha = 3.3\)</span>.</p>
<pre class="r"><code>dcor1 <- dcast(dper1, id + cid ~ period, value.var = "y")
setnames(dcor1, c("id", "cid", "y0", "y1", "y2"))
dcor1[, cov(cbind(y0, y1, y2))]</code></pre>
<pre><code>##     y0  y1  y2
## y0 4.5 3.2 3.4
## y1 3.2 4.2 3.2
## y2 3.4 3.2 4.6</code></pre>
<p>The correlation <span class="math inline">\(\rho\)</span> should be close to</p>
<p><span class="math display">\[ \rho = \frac{\sigma^2_\gamma + \sigma^2_\alpha}{\sigma^2_\gamma + \sigma^2_\alpha +\sigma^2_\epsilon} = \frac{3.3}{4.4} = 0.75\]</span></p>
<p>(For a more elaborate derivation of correlation coefficients, see this <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">post</a> on stepped-wedge designs.)</p>
<pre class="r"><code>dcor1[, cor(cbind(y0, y1, y2))]</code></pre>
<pre><code>##      y0   y1   y2
## y0 1.00 0.73 0.75
## y1 0.73 1.00 0.73
## y2 0.75 0.73 1.00</code></pre>
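<p>As a sanity check that does not rely on <code>simstudy</code>, the random effects model can also be simulated directly in base R, and the between-period correlation compared with the expected 0.75. This is a sketch under the same variance assumptions (<span class="math inline">\(\sigma^2_\gamma = 2\)</span>, <span class="math inline">\(\sigma^2_\alpha = 1.3\)</span>, <span class="math inline">\(\sigma^2_\epsilon = 1.1\)</span>); the group and individual sizes are arbitrary:</p>

```r
set.seed(2716)
ngrp <- 500  # groups
nind <- 20   # individuals per group
N <- ngrp * nind

gamma <- rep(rnorm(ngrp, 0, sqrt(2)), each = nind)  # group effect, var 2
alpha <- rnorm(N, 0, sqrt(1.3))                     # individual effect, var 1.3

# two periods share the same group and individual effects;
# only the measurement noise differs
y0 <- 3 + gamma + alpha + rnorm(N, 0, sqrt(1.1))
y1 <- 4 + gamma + alpha + rnorm(N, 0, sqrt(1.1))

round(cor(y0, y1), 2)  # expected: close to 3.3 / 4.4 = 0.75
```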
</div>
<div id="directly-generating-correlated-data" class="section level3">
<h3>Directly generating correlated data</h3>
<p>In this second approach, the group-level correlation is once again generated using a group effect. However, the individual-level effect is replaced by noise that is explicitly correlated across time. The model here is</p>
<p><span class="math display">\[Y_{ijt} = \beta_t + \gamma_j + \phi_{ijt},\]</span></p>
<p>where the noise <span class="math inline">\(\mathbf{\phi}_{ij}\)</span> is a vector of noise components <span class="math inline">\(\{\phi_{ij0},\phi_{ij3},\phi_{ij6}\} \sim N(\mathbf{0}, \Sigma)\)</span>, and</p>
<p><span class="math display">\[\Sigma =
\left [
\begin{matrix}
\sigma^2_\phi & \rho \sigma^2_\phi & \rho \sigma^2_\phi \\
\rho \sigma^2_\phi & \sigma^2_\phi & \rho \sigma^2_\phi \\
\rho \sigma^2_\phi & \rho \sigma^2_\phi & \sigma^2_\phi
\end{matrix}
\right ]
\]</span></p>
<p>In this case <span class="math inline">\(\sigma^2_\gamma\)</span> is still 2, and <span class="math inline">\(\sigma^2_\phi = 2.4\)</span> to ensure that total variation is 4.4. We set <span class="math inline">\(\rho = 0.54167\)</span> so that <span class="math inline">\(\rho \sigma^2_\phi = 1.3\)</span>, ensuring that the overall covariance of the observed outcome <span class="math inline">\(y\)</span> across periods is <span class="math inline">\(3.3\)</span>, as in the first method.</p>
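<p>As a quick arithmetic check (my addition, not part of the original code), the value of <span class="math inline">\(\rho\)</span> falls directly out of the target covariance:</p>
<pre class="r"><code># the cluster effect contributes 2 to the target covariance of 3.3,
# so the correlated noise must supply the remaining 1.3 = rho * sigma^2_phi
s2_gamma <- 2
s2_phi <- 4.4 - s2_gamma          # 2.4, keeping total variance at 4.4
rho <- (3.3 - s2_gamma) / s2_phi  # 1.3 / 2.4
round(rho, 5)                     # 0.54167</code></pre>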
<pre class="r"><code>defg <- defData(varname = "gamma",
formula = 0, variance = 2, id = "cid")
defg <- defData(defg, varname = "mu",
formula = 0, dist = "nonrandom")
defg <- defData(defg, varname = "phi",
formula = 2.4, dist = "nonrandom")
defC <- defCondition(condition = "period == 0",
formula = "3 + gamma + e",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 1",
formula = "4 + gamma + e",
dist = "nonrandom")
defC <- defCondition(defC, condition = "period == 2",
formula = "6 + gamma + e",
dist = "nonrandom")</code></pre>
<p>In the data generation process, the function <code>addCorGen</code> is used to create the correlated noise across time:</p>
<pre class="r"><code>set.seed(3483)
dgrp2 <- genData(100, defg)
dind2 <- genCluster(dgrp2, "cid", numIndsVar = 20, level1ID = "id")
dper2 <- addPeriods(dind2, nPeriods = 3, idvars = "id")
dper2 <- addCorGen(dper2, "id", nvars = 3, param1 = "mu", param2 = "phi",
rho = .54167, dist = "normal", corstr = "cs", cnames = "e")
dper2 <- addCondition(defC, dper2, newvar = "y")</code></pre>
<p>I won’t do a second plot, because it would look identical to the one above. But I am calculating the covariance and correlation matrices for the outcome to illustrate that the two slightly different approaches do indeed generate similarly distributed data.</p>
<pre class="r"><code>dcor2 <- dcast(dper2, id + cid ~ period, value.var = "y")
setnames(dcor2, c("id", "cid", "y0", "y1", "y2"))
dcor2[, cov(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 4.4 3.4 3.3
## y1 3.4 4.4 3.4
## y2 3.3 3.4 4.5</code></pre>
<pre class="r"><code>dcor2[, cor(cbind(y0, y1, y2))]</code></pre>
<pre><code>## y0 y1 y2
## y0 1.00 0.76 0.75
## y1 0.76 1.00 0.76
## y2 0.75 0.76 1.00</code></pre>
<p>In the example here, I wouldn’t say either approach is better. For some, the purely random effects approach may be more intuitive, and for others the correlated noise might be. However, if we want a more complex correlation pattern, like the AR-1 pattern I mentioned earlier, one approach may in fact be a little more straightforward to implement.</p>
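<p>To make that concrete: under an AR-1 pattern, the noise covariance decays with the lag between measurements rather than staying constant. A small base-R sketch (my addition, using the <span class="math inline">\(\sigma^2_\phi\)</span> and <span class="math inline">\(\rho\)</span> values from above) shows the noise covariance matrix that the second approach, with <code>corstr = "ar1"</code> in <code>addCorGen</code>, would target:</p>
<pre class="r"><code># AR-1 noise covariance: off-diagonal elements decay as rho^|lag|
s2_phi <- 2.4
rho <- 0.54167
Sigma_ar1 <- s2_phi * rho^abs(outer(0:2, 0:2, "-"))
round(Sigma_ar1, 2)</code></pre>
<pre><code>##      [,1] [,2] [,3]
## [1,]  2.4  1.3  0.7
## [2,]  1.3  2.4  1.3
## [3,]  0.7  1.3  2.4</code></pre>
<p>The lag-2 covariance (0.7) is smaller than the lag-1 covariance (1.3), unlike the compound symmetry structure, where all off-diagonal elements are equal.</p>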
<p>And no, I don’t respond so thoroughly to every question I get; sometimes it is better for you to struggle a bit to figure something out.</p>
</div>
Simulating an open cohort stepped-wedge trial
https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/
Tue, 17 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/
<p>In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with <a href="https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/">stepped-wedge designs</a>, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time. This means that for some patients, the data collection period straddles the transition from control to intervention.</p>
<p>Whenever I contemplate a simulation, I first think about the general structure of the data generating process before even thinking about the outcome model. In the case of a more standard two-arm randomized trial, that structure is quite simple and doesn’t require much, if any, thought. In this case, however, the overlaying of a longitudinal patient outcome process on top of a stepped-wedge design presents a little bit of a challenge.</p>
<p>Adding to the challenge is that, in addition to being a function of site- and individual-specific characteristics/effects, the primary outcome will likely be a function of time-varying factors. In particular, there are certain patient-level health-related factors that might contribute to the decision to prescribe anti-psychotic medications, as well as the time-varying intervention status, which is determined by the stepped-wedge randomization scheme. So, the simulation needs to accommodate the generation of both types of time-varying variables.</p>
<p>I’ve developed a bare-bones simulation of sites and patients to provide a structure that I can add to at some point in the future. While this is probably a pretty rare study design (though as stepped-wedge designs become more popular, it may be less rare than I am imagining), I thought the code could provide yet another example of how to approach a potentially vexing simulation in a relatively simple way.</p>
<div id="data-definition" class="section level3">
<h3>Data definition</h3>
<p>The focus here is on the structure of the data, so I am not generating any outcome data. However, in addition to generating the treatment assignment, I am creating the time-varying health status, which will affect the outcome process when I get to that.</p>
<p>In this simulation, there will be 5 sites, each followed for 25 weeks (starting with week 0). Each week, a site will have approximately 20 new patients, so we should expect to generate around <span class="math inline">\(5 \times 25 \times 20 = 2500\)</span> total patients.</p>
<p>For each patient, we will be generating a series of health statuses, which range from 1 to 4, with 1 being healthiest and 4 being death. I will use a <a href="https://www.rdatagen.net/post/simstudy-1-14-update/">Markov chain</a> to generate this series. Two arguments required to simulate the Markov process are the starting state (which is created in <code>S0</code>) and the transition matrix <code>P</code>, which determines the probabilities of moving from one state to another.</p>
<pre class="r"><code>NPER <- 25
perDef <- defDataAdd(varname = "npatient", formula = 20,
dist = "poisson")
patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2",
dist = "categorical")
P <- t(matrix(c( 0.7, 0.2, 0.1, 0.0,
0.1, 0.3, 0.4, 0.2,
0.0, 0.1, 0.5, 0.4,
0.0, 0.0, 0.0, 1.0),
nrow = 4))</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generation process starts with the sites and then proceeds to the patient-level data. To begin, the five sites are generated (for now without any site-specific variables, but that could easily be modified in the future). Next, records for each site for each of the 25 periods (from week 0 to week 24) are generated; these site-level records include the number of patients to be generated for each site, each week:</p>
<pre class="r"><code>set.seed(3837263)
dsite <- genData(5, id = "site")
dper <- addPeriods(dsite, nPeriods = NPER, idvars = "site",
timeid = "site.time", perName = "period")
dper <- addColumns(perDef, dper)
dper</code></pre>
<pre><code>## site period site.time npatient
## 1: 1 0 1 17
## 2: 1 1 2 20
## 3: 1 2 3 25
## 4: 1 3 4 18
## 5: 1 4 5 23
## ---
## 121: 5 20 121 17
## 122: 5 21 122 15
## 123: 5 22 123 16
## 124: 5 23 124 19
## 125: 5 24 125 20</code></pre>
<p>Now, we assign each of the five sites to its own intervention “wave”. The first site starts at the beginning of the study, week 0. The second starts 4 weeks later at week 4, and so on, until the fifth and last site starts the intervention at week 16. (Obviously, a more realistic simulation would include many more sites, but all of this can easily be scaled up.) The intervention indicator is <span class="math inline">\(I_{ct}\)</span>, and is set to 1 when cluster <span class="math inline">\(c\)</span> during week <span class="math inline">\(t\)</span> is in the intervention, and is 0 otherwise.</p>
<pre class="r"><code>dsw <- trtStepWedge(dper, "site", nWaves = 5, lenWaves = 4,
startPer = 0, perName = "period",
grpName = "Ict")
dsw <- dsw[, .(site, period, startTrt, Ict)]</code></pre>
<p>Here are the intervention assignments for the first two sites during the first 8 weeks.</p>
<pre class="r"><code>dsw[site %in% c(1,2) & period < 8]</code></pre>
<pre><code>## site period startTrt Ict
## 1: 1 0 0 1
## 2: 1 1 0 1
## 3: 1 2 0 1
## 4: 1 3 0 1
## 5: 1 4 0 1
## 6: 1 5 0 1
## 7: 1 6 0 1
## 8: 1 7 0 1
## 9: 2 0 4 0
## 10: 2 1 4 0
## 11: 2 2 4 0
## 12: 2 3 4 0
## 13: 2 4 4 1
## 14: 2 5 4 1
## 15: 2 6 4 1
## 16: 2 7 4 1</code></pre>
<p>To generate the patients, we start by generating the 2500 or so individual records. The single baseline factor that we include this time around is the starting health status <code>S0</code>.</p>
<pre class="r"><code>dpat <- genCluster(dper, cLevelVar = "site.time",
numIndsVar = "npatient", level1ID = "id")
dpat <- addColumns(patDef, dpat)
dpat</code></pre>
<pre><code>## site period site.time npatient id S0
## 1: 1 0 1 17 1 2
## 2: 1 0 1 17 2 1
## 3: 1 0 1 17 3 2
## 4: 1 0 1 17 4 2
## 5: 1 0 1 17 5 1
## ---
## 2524: 5 24 125 20 2524 3
## 2525: 5 24 125 20 2525 2
## 2526: 5 24 125 20 2526 1
## 2527: 5 24 125 20 2527 1
## 2528: 5 24 125 20 2528 1</code></pre>
<p>Here is a visualization of the patients (it turns out there are 2528 of them) by site and starting point, with each point representing a patient. The color represents the intervention status: light blue is control (pre-intervention) and dark blue is intervention. Even though a patient may start in the pre-intervention period, they may actually receive services in the intervention period, as we will see further on down.</p>
<p><img src="https://www.rdatagen.net/post/2019-09-17-simulating-an-open-cohort-stepped-wedge-trial.en_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>The patient health status series are generated using a Markov chain process. This particular transition matrix has an “absorbing” state, as indicated by the probability 1 in the last row of the matrix. Once a patient enters state 4, they will not transition to any other state. (In this case, state 4 is death.)</p>
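<p>For intuition, here is a minimal base-R sketch (my addition, not from the original post) of how a single chain with this absorbing transition matrix evolves; <code>addMarkov</code> handles this for all patients at once:</p>
<pre class="r"><code># simulate one 25-week chain; state 4 is absorbing
set.seed(237)
P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.4, 0.2,
                0.0, 0.1, 0.5, 0.4,
                0.0, 0.0, 0.0, 1.0), nrow = 4))
chain <- numeric(25)
chain[1] <- 2                        # starting health status
for (t in 2:25) {
  chain[t] <- sample(1:4, 1, prob = P[chain[t - 1], ])
}
chain  # once a 4 appears, every subsequent state is 4</code></pre>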
<pre class="r"><code>dpat <- addMarkov(dpat, transMat = P,
chainLen = NPER, id = "id",
pername = "seq", start0lab = "S0")
dpat</code></pre>
<pre><code>## site period site.time npatient id S0 seq state
## 1: 1 0 1 17 1 2 1 2
## 2: 1 0 1 17 1 2 2 3
## 3: 1 0 1 17 1 2 3 3
## 4: 1 0 1 17 1 2 4 3
## 5: 1 0 1 17 1 2 5 4
## ---
## 63196: 5 24 125 20 2528 1 21 4
## 63197: 5 24 125 20 2528 1 22 4
## 63198: 5 24 125 20 2528 1 23 4
## 63199: 5 24 125 20 2528 1 24 4
## 63200: 5 24 125 20 2528 1 25 4</code></pre>
<p>Now, we aren’t interested in the periods following the one where death occurs. So, we want to trim the data.table <code>dpat</code> to include only those periods leading up to state 4 and the first period in which state 4 is entered. We do this first by identifying the first time a state of 4 is encountered for each individual (and if an individual never reaches state 4, then all the individual’s records are retained, and the variable <code>.last</code> is set to the maximum number of periods <code>NPER</code>, in this case 25).</p>
<pre class="r"><code>dlast <- dpat[, .SD[state == 4][1,], by = id][, .(id, .last = seq)]
dlast[is.na(.last), .last := NPER]
dlast</code></pre>
<pre><code>## id .last
## 1: 1 5
## 2: 2 13
## 3: 3 2
## 4: 4 6
## 5: 5 3
## ---
## 2524: 2524 7
## 2525: 2525 5
## 2526: 2526 19
## 2527: 2527 20
## 2528: 2528 8</code></pre>
<p>Next, we use the <code>dlast</code> data.table to “trim” <code>dpat</code>. We further trim the data set so that we do not have patient-level observations that extend beyond the overall follow-up period:</p>
<pre class="r"><code>dpat <- dlast[dpat][seq <= .last][ , .last := NULL][]
dpat[, period := period + seq - 1]
dpat <- dpat[period < NPER]
dpat</code></pre>
<pre><code>## id site period site.time npatient S0 seq state
## 1: 1 1 0 1 17 2 1 2
## 2: 1 1 1 1 17 2 2 3
## 3: 1 1 2 1 17 2 3 3
## 4: 1 1 3 1 17 2 4 3
## 5: 1 1 4 1 17 2 5 4
## ---
## 12608: 2524 5 24 125 20 3 1 3
## 12609: 2525 5 24 125 20 2 1 2
## 12610: 2526 5 24 125 20 1 1 1
## 12611: 2527 5 24 125 20 1 1 1
## 12612: 2528 5 24 125 20 1 1 1</code></pre>
<p>And finally, we merge the patient data with the stepped-wedge treatment assignment data to create the final data set. The individual outcomes for each week could now be generated, because we have all the baseline and time-varying information in a single data set.</p>
<pre class="r"><code>dpat <- merge(dpat, dsw, by = c("site","period"))
setkey(dpat, id, period)
dpat <- delColumns(dpat, c("site.time", "seq", "npatient"))
dpat</code></pre>
<pre><code>## site period id S0 state startTrt Ict
## 1: 1 0 1 2 2 0 1
## 2: 1 1 1 2 3 0 1
## 3: 1 2 1 2 3 0 1
## 4: 1 3 1 2 3 0 1
## 5: 1 4 1 2 4 0 1
## ---
## 12608: 5 24 2524 3 3 16 1
## 12609: 5 24 2525 2 2 16 1
## 12610: 5 24 2526 1 1 16 1
## 12611: 5 24 2527 1 1 16 1
## 12612: 5 24 2528 1 1 16 1</code></pre>
<p>Here is what the individual trajectories of health state look like. In the plot, each column represents a different site, and each row represents a different starting week. For example, the fifth row represents patients who appear for the first time in week 4. Sites 1 and 2 are already in the intervention in week 4, so none of these patients will transition. However, patients in sites 3 through 5 enter in the pre-intervention stage in week 4, and transition into the intervention at different points, depending on the site.</p>
<p><img src="https://www.rdatagen.net/post/2019-09-17-simulating-an-open-cohort-stepped-wedge-trial.en_files/figure-html/unnamed-chunk-11-1.png" width="1056" /></p>
<p>The basic structure is in place, so we are ready to extend this simulation to include more covariates, random effects, and outcomes. And once we’ve done that, we can explore analytic approaches.</p>
<p>
<small><font color="darkkhaki">This study is supported by the National Institutes of Health National Institute on Aging R61AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</font></small>
</p>
</div>
Analyzing a binary outcome arising out of within-cluster, pair-matched randomization
https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/
Tue, 03 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/
<p>A key motivating factor for the <code>simstudy</code> package and much of this blog is that simulation can be super helpful in understanding how best to approach an unusual, or at least unfamiliar, analytic problem. About six months ago, I <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">described</a> the DREAM Initiative (Diabetes Research, Education, and Action for Minorities), a study that used a slightly innovative randomization scheme to ensure that two comparison groups were evenly balanced across important covariates. At the time, we hadn’t finalized the analytic plan. But, now that we have started actually randomizing and recruiting (yes, in that order, oddly enough), it is important that we do that, with the help of a little simulation.</p>
<div id="the-study-design" class="section level3">
<h3>The study design</h3>
<p>The <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">original post</a> has the details about the design and matching algorithm (and code). The randomization is taking place at 20 primary care clinics, and patients within these clinics are matched based on important characteristics before randomization occurs. There is little or no risk that patients in the control arm will be “contaminated” or affected by the intervention that is taking place, which will minimize the effects of clustering. However, we may not want to ignore the clustering altogether.</p>
</div>
<div id="possible-analytic-solutions" class="section level3">
<h3>Possible analytic solutions</h3>
<p>Given that the primary outcome is binary, one reasonable procedure to assess whether or not the intervention is effective is McNemar’s test, which is typically used for paired dichotomous data. However, this approach has two limitations. First, McNemar’s test does not take into account the clustered nature of the data. Second, the test is just that, a test; it does not provide an estimate of effect size (and the associated confidence interval).</p>
<p>So, in addition to McNemar’s test, I considered four additional analytic approaches to assess the effect of the intervention: (1) Durkalski’s extension of McNemar’s test to account for clustering, (2) conditional logistic regression, which takes into account stratification and matching, (3) standard logistic regression with specific adjustment for the three matching variables, and (4) mixed effects logistic regression with matching covariate adjustment and a clinic-level random intercept. (In the mixed effects model, I assume the treatment effect does not vary by site, since I have also assumed that the intervention is delivered in a consistent manner across the sites. These may or may not be reasonable assumptions.)</p>
<p>While I was interested to see how the two tests (McNemar and the extension) performed, my primary goal was to see if any of the regression models was superior. In order to do this, I wanted to compare the methods in a scenario without any intervention effect, and in another scenario where there <em>was</em> an effect. I was interested in comparing bias, error rates, and variance estimates.</p>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generation process parallels the earlier <a href="https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/">post</a>. The treatment assignment is made in the context of the matching process, which I am not showing this time around. Note that in this initial example, the outcome <code>y</code> depends on the intervention <code>rx</code> (i.e. there <em>is</em> an intervention effect).</p>
<pre class="r"><code>library(simstudy)
### defining the data
defc <- defData(varname = "ceffect", formula = 0, variance = 0.4,
dist = "normal", id = "cid")
defi <- defDataAdd(varname = "male", formula = .4, dist = "binary")
defi <- defDataAdd(defi, varname = "age", formula = 0, variance = 40)
defi <- defDataAdd(defi, varname = "bmi", formula = 0, variance = 5)
defr <- defDataAdd(varname = "y",
formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0.45*rx + ceffect",
dist = "binary", link = "logit")
### generating the data
set.seed(547317)
dc <- genData(20, defc)
di <- genCluster(dc, "cid", 60, "id")
di <- addColumns(defi, di)
### matching and randomization within cluster (cid)
library(parallel)
library(Matching)
RNGkind("L'Ecuyer-CMRG") # to set seed for parallel process
### See addendum for dmatch code
dd <- rbindlist(mclapply(1:nrow(dc),
function(x) dmatch(di[cid == x]),
mc.set.seed = TRUE
)
)
### generate outcome
dd <- addColumns(defr, dd)
setkey(dd, pair)
dd</code></pre>
<pre><code>## cid ceffect id male age bmi rx pair y
## 1: 1 1.168 11 1 4.35 0.6886 0 1.01 1
## 2: 1 1.168 53 1 3.85 0.2215 1 1.01 1
## 3: 1 1.168 51 0 6.01 -0.9321 0 1.02 0
## 4: 1 1.168 58 0 7.02 0.1407 1 1.02 1
## 5: 1 1.168 57 0 9.25 -1.3253 0 1.03 1
## ---
## 798: 9 -0.413 504 1 -8.72 -0.0767 1 9.17 0
## 799: 9 -0.413 525 0 1.66 3.5507 0 9.18 0
## 800: 9 -0.413 491 0 4.31 2.6968 1 9.18 0
## 801: 9 -0.413 499 0 7.36 0.6064 0 9.19 0
## 802: 9 -0.413 531 0 8.05 0.8068 1 9.19 0</code></pre>
<p>Based on the outcomes of each individual, each pair can be assigned to a particular category that describes the outcomes: both fail, both succeed, only the control patient succeeds, or only the intervention patient succeeds. These category counts can be represented in a <span class="math inline">\(2 \times 2\)</span> contingency table. The counts are the number of pairs in each of the four possible pairwise outcomes. For example, there were 173 pairs where the outcome was determined to be unsuccessful for both intervention and control arms.</p>
<pre class="r"><code>dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(0,1),
labels = c("no success", "success"))]
dpair[, rx := factor(`1`, levels = c(0, 1),
labels = c("no success", "success"))]
dpair[, table(control,rx)]</code></pre>
<pre><code>## rx
## control no success success
## no success 173 102
## success 69 57</code></pre>
<p>Here is a figure that depicts the <span class="math inline">\(2 \times 2\)</span> matrix, providing a visualization of how the treatment and control group outcomes compare. (The code is in the addendum in case anyone wants to see the lengths I took to make this simple graphic.)</p>
<p><img src="https://www.rdatagen.net/post/2019-09-03-analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization.en_files/figure-html/unnamed-chunk-4-1.png" width="576" /></p>
<div id="mcnemars-test" class="section level4">
<h4>McNemar’s test</h4>
<p>McNemar’s test requires the data to be in table format, and the test really only takes into consideration the cells which represent disagreement between treatment arms. In terms of the matrix above, this would be the lower left and upper right quadrants.</p>
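<p>Since only the discordant cells matter, the continuity-corrected statistic can be computed by hand from the table above (a quick check I am adding, not part of the original code):</p>
<pre class="r"><code># discordant pairs: control failure/rx success = 102,
#                   control success/rx failure = 69
b <- 102
c <- 69
chi2 <- (abs(b - c) - 1)^2 / (b + c)  # continuity-corrected McNemar statistic
round(chi2, 1)                        # 6, matching the mcnemar.test output</code></pre>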
<pre class="r"><code>ddc <- dcast(dd, pair ~ rx, value.var = "y")
dmat <- ddc[, .N, keyby = .(`0`,`1`)][, matrix(N, 2, 2, byrow = T)]
mcnemar.test(dmat)</code></pre>
<pre><code>##
## McNemar's Chi-squared test with continuity correction
##
## data: dmat
## McNemar's chi-squared = 6, df = 1, p-value = 0.01</code></pre>
<p>Based on the p-value = 0.01, we would reject the null hypothesis that the intervention has no effect.</p>
</div>
<div id="durkalski-extension-of-mcnemars-test" class="section level4">
<h4>Durkalski extension of McNemar’s test</h4>
<p>Durkalski’s test also requires the data to be in tabular form, though there essentially needs to be a table for each cluster. The <code>clust.bin.pair</code> function needs us to separate the table into vectors <code>a</code>, <code>b</code>, <code>c</code>, and <code>d</code>, where each element in each of the vectors is a count for a specific cluster. Vector <code>a</code> is a collection of counts for the upper left-hand quadrants, <code>b</code> is for the upper right-hand quadrants, etc. We have 20 clusters, so each of the four vectors has length 20. Much of the work done in the code below is just getting the data in the right form for the function.</p>
<pre class="r"><code>library(clust.bin.pair)
ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
ddc[, ypair := 2*`0` + 1*`1`]
dvec <- ddc[, .N, keyby=.(cid, ypair)]
allpossible <- data.table(expand.grid(1:20, 0:3))
setnames(allpossible, c("cid","ypair"))
setkey(dvec, cid, ypair)
setkey(allpossible, cid, ypair)
dvec <- dvec[allpossible]
dvec[is.na(N), N := 0]
a <- dvec[ypair == 0, N]
b <- dvec[ypair == 1, N]
c <- dvec[ypair == 2, N]
d <- dvec[ypair == 3, N]
clust.bin.pair(a, b, c, d, method = "durkalski")</code></pre>
<pre><code>##
## Durkalski's Chi-square test
##
## data: a, b, c, d
## chi-square = 5, df = 1, p-value = 0.03</code></pre>
<p>Again, the p-value, though larger, leads us to reject the null.</p>
</div>
<div id="conditional-logistic-regression" class="section level4">
<h4>Conditional logistic regression</h4>
<p>Conditional logistic regression is conditional on the pair. Since the pair is similar with respect to the matching variables, no further adjustment (beyond specifying the strata) is necessary.</p>
<pre class="r"><code>library(survival)
summary(clogit(y ~ rx + strata(pair), data = dd))$coef["rx",]</code></pre>
<pre><code>## coef exp(coef) se(coef) z Pr(>|z|)
## 0.3909 1.4783 0.1559 2.5076 0.0122</code></pre>
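<p>As a check on this output (my addition): for 1:1 matched pairs, the conditional maximum likelihood estimate of the odds ratio has a closed form, namely the ratio of the discordant pair counts from the table above:</p>
<pre class="r"><code># discordant pair counts from the 2x2 table of pair outcomes
b <- 102            # control failure, rx success
c <- 69             # control success, rx failure
round(b/c, 4)       # 1.4783, the exp(coef) above
round(log(b/c), 4)  # 0.3909, the coef above</code></pre>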
<p> </p>
</div>
<div id="logistic-regression-with-matching-covariates-adjustment" class="section level4">
<h4>Logistic regression with matching covariates adjustment</h4>
<p>Using logistic regression should in theory provide a reasonable estimate of the treatment effect, though given that there is clustering, I wouldn’t expect the standard error estimates to be correct. Although we are not specifically modeling the matching, by including covariates used in the matching, we are effectively estimating a model that is conditional on the pair.</p>
<pre class="r"><code>summary(glm(y~rx + age + male + bmi, data = dd,
family = "binomial"))$coef["rx",]</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## 0.3679 0.1515 2.4285 0.0152</code></pre>
<p> </p>
</div>
<div id="generalized-mixed-effects-model-with-matching-covariates-adjustment" class="section level4">
<h4>Generalized mixed effects model with matching covariates adjustment</h4>
<p>The mixed effects model merely improves on the logistic regression model by ensuring that any clustering effects are reflected in the estimates.</p>
<pre class="r"><code>library(lme4)
summary(glmer(y ~ rx + age + male + bmi + (1|cid), data= dd,
family = "binomial"))$coef["rx",]</code></pre>
<pre><code>## Estimate Std. Error z value Pr(>|z|)
## 0.4030 0.1586 2.5409 0.0111</code></pre>
<p> </p>
</div>
</div>
<div id="comparing-the-analytic-approaches" class="section level3">
<h3>Comparing the analytic approaches</h3>
<p>To compare the methods, I generated 1000 data sets under each scenario. As I mentioned, I wanted to conduct the comparison under two scenarios: the first when there is no intervention effect, and the second when there is an effect (using the effect size from the first data set generated above).</p>
<p>I’ll start with no intervention effect. In this case, the outcome definition sets the true parameter of <code>rx</code> to 0.</p>
<pre class="r"><code>defr <- defDataAdd(varname = "y",
formula = "-1 + 0.08*bmi - 0.3*male - 0.08*age + 0*rx + ceffect",
dist = "binary", link = "logit")</code></pre>
<p>Using the updated definition, I generate 1000 datasets, and for each one, I apply the five analytic approaches. The results from each iteration are stored in a large list. (The code for the iterative process is shown in the addendum below.) As an example, here are the contents from the 711th iteration:</p>
<pre class="r"><code>res[[711]]</code></pre>
<pre><code>## $clr
## coef exp(coef) se(coef) z Pr(>|z|)
## rx -0.0263 0.974 0.162 -0.162 0.871
##
## $glm
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6583 0.1247 -5.279 1.30e-07
## rx -0.0309 0.1565 -0.198 8.43e-01
## age -0.0670 0.0149 -4.495 6.96e-06
## male -0.5131 0.1647 -3.115 1.84e-03
## bmi 0.1308 0.0411 3.184 1.45e-03
##
## $glmer
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7373 0.1888 -3.91 9.42e-05
## rx -0.0340 0.1617 -0.21 8.33e-01
## age -0.0721 0.0156 -4.61 4.05e-06
## male -0.4896 0.1710 -2.86 4.20e-03
## bmi 0.1366 0.0432 3.16 1.58e-03
##
## $mcnemar
##
## McNemar's Chi-squared test with continuity correction
##
## data: dmat
## McNemar's chi-squared = 0.007, df = 1, p-value = 0.9
##
##
## $durk
##
## Durkalski's Chi-square test
##
## data: a, b, c, d
## chi-square = 0.1, df = 1, p-value = 0.7</code></pre>
</div>
<div id="summary-statistics" class="section level3">
<h3>Summary statistics</h3>
<p>To compare the five methods, I am first looking at the proportion of iterations where the p-value is less than 0.05, in which case we would reject the null hypothesis. (In the case where the null is true, the proportion is the Type 1 error rate; when there is truly an effect, then the proportion is the power.) I am less interested in the hypothesis test than the bias and standard errors, but the first two methods only provide a p-value, so that is all we can assess them on.</p>
<p>Next, I calculate the bias, which is the average effect estimate minus the true effect. And finally, I evaluate the standard errors by looking at the estimated standard error as well as the observed standard error (which is the standard deviation of the point estimates).</p>
<pre class="r"><code>pval <- data.frame(
mcnm = mean(sapply(res, function(x) x$mcnemar$p.value <= 0.05)),
durk = mean(sapply(res, function(x) x$durk$p.value <= 0.05)),
clr =mean(sapply(res, function(x) x$clr["rx","Pr(>|z|)"] <= 0.05)),
glm = mean(sapply(res, function(x) x$glm["rx","Pr(>|z|)"] <= 0.05)),
glmer = mean(sapply(res, function(x) x$glmer["rx","Pr(>|z|)"] <= 0.05))
)
bias <- data.frame(
clr = mean(sapply(res, function(x) x$clr["rx", "coef"])),
glm = mean(sapply(res, function(x) x$glm["rx", "Estimate"])),
glmer = mean(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)
se <- data.frame(
clr = mean(sapply(res, function(x) x$clr["rx", "se(coef)"])),
glm = mean(sapply(res, function(x) x$glm["rx", "Std. Error"])),
glmer = mean(sapply(res, function(x) x$glmer["rx", "Std. Error"]))
)
obs.se <- data.frame(
clr = sd(sapply(res, function(x) x$clr["rx", "coef"])),
glm = sd(sapply(res, function(x) x$glm["rx", "Estimate"])),
glmer = sd(sapply(res, function(x) x$glmer["rx", "Estimate"]))
)
sumstat <- round(plyr::rbind.fill(pval, bias, se, obs.se), 3)
rownames(sumstat) <- c("prop.rejected", "bias", "se.est", "se.obs")
sumstat</code></pre>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.035 0.048 0.043 0.038 0.044
## bias NA NA 0.006 0.005 0.005
## se.est NA NA 0.167 0.161 0.167
## se.obs NA NA 0.164 0.153 0.164</code></pre>
<p>In this first case, where the true underlying effect size is 0, the Type 1 error rate should be 0.05. The Durkalski test, the conditional logistic regression, and the mixed effects model are below that level but come closer to it than the other two methods. All three models provide unbiased point estimates, but the standard logistic regression (glm) underestimates the standard errors. The results from the conditional logistic regression and the mixed effects model are quite close across the board.</p>
<p>Here are the summary statistics for a data set with an intervention effect of 0.45. The results are consistent with the “no effect” simulations, except that the standard logistic regression model exhibits some bias. In reality, this is not necessarily bias, but a different estimand. The model that ignores clustering is a marginal model (with respect to the site), whereas the conditional logistic regression and mixed effects models are conditional on the site. (I’ve described this phenomenon <a href="https://www.rdatagen.net/post/marginal-v-conditional/">here</a> and <a href="https://www.rdatagen.net/post/mixed-effect-models-vs-gee/">here</a>.) We are interested in the conditional effect here, so that argues for the conditional models.</p>
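<p>A rough way to quantify that attenuation (a sketch I am adding, using a standard approximation rather than anything from the original simulations): the marginal log-odds ratio is approximately the conditional one divided by <span class="math inline">\(\sqrt{1 + 0.346\sigma^2}\)</span>, where <span class="math inline">\(\sigma^2\)</span> is the random intercept variance:</p>
<pre class="r"><code># logistic-normal attenuation approximation (Zeger, Liang, and Albert, 1988)
sigma2 <- 0.4                            # site random intercept variance
c2 <- (16 * sqrt(3) / (15 * pi))^2       # ~ 0.346
beta_cond <- 0.45                        # true conditional effect
beta_marg <- beta_cond / sqrt(1 + c2 * sigma2)
round(beta_marg - beta_cond, 3)          # ~ -0.028, close to the observed -0.033</code></pre>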
<p>The conditional logistic regression and the mixed effects model yielded similar estimates, though the mixed effects model had slightly higher power, which is the reason I opted to use this approach at the end of the day.</p>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.766 0.731 0.784 0.766 0.796
## bias NA NA 0.000 -0.033 -0.001
## se.est NA NA 0.164 0.156 0.162
## se.obs NA NA 0.165 0.152 0.162</code></pre>
<p>In this last case, the true underlying data generating process still includes an intervention effect but <em>no clustering</em>. In this scenario, all of the analytic approaches yield similar estimates. However, since there is no guarantee that clustering is not a factor, the mixed effects model will still be the preferred approach.</p>
<pre><code>## mcnm durk clr glm glmer
## prop.rejected 0.802 0.774 0.825 0.828 0.830
## bias NA NA -0.003 -0.002 -0.001
## se.est NA NA 0.159 0.158 0.158
## se.obs NA NA 0.151 0.150 0.150</code></pre>
<p>
<small><font color="darkkhaki">The DREAM Initiative is supported by the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases R01DK11048. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.</font></small>
</p>
<p> </p>
</div>
<div id="addendum-multiple-datasets-and-model-estimates" class="section level3">
<h3>Addendum: multiple datasets and model estimates</h3>
<pre class="r"><code>gen <- function(nclust, m) {
  dc <- genData(nclust, defc)
  di <- genCluster(dc, "cid", m, "id")
  di <- addColumns(defi, di)
  dr <- rbindlist(mclapply(1:nrow(dc), function(x) dmatch(di[cid == x])))
  dr <- addColumns(defr, dr)
  dr[]
}
iterate <- function(ncluster, m) {
  dd <- gen(ncluster, m)

  clrfit <- summary(clogit(y ~ rx + strata(pair), data = dd))$coef
  glmfit <- summary(glm(y ~ rx + age + male + bmi, data = dd,
    family = binomial))$coef
  mefit <- summary(glmer(y ~ rx + age + male + bmi + (1|cid), data = dd,
    family = binomial))$coef

  ## McNemar
  ddc <- dcast(dd, pair ~ rx, value.var = "y")
  dmat <- ddc[, .N, keyby = .(`0`, `1`)][, matrix(N, 2, 2, byrow = TRUE)]
  mc <- mcnemar.test(dmat)

  ## Clustered McNemar
  ddc <- dcast(dd, cid + pair ~ rx, value.var = "y")
  ddc[, ypair := 2*`0` + 1*`1`]
  dvec <- ddc[, .N, keyby = .(cid, ypair)]

  allpossible <- data.table(expand.grid(1:20, 0:3))
  setnames(allpossible, c("cid", "ypair"))

  setkey(dvec, cid, ypair)
  setkey(allpossible, cid, ypair)

  dvec <- dvec[allpossible]
  dvec[is.na(N), N := 0]

  a <- dvec[ypair == 0, N]
  b <- dvec[ypair == 1, N]
  c <- dvec[ypair == 2, N]
  d <- dvec[ypair == 3, N]

  durk <- clust.bin.pair(a, b, c, d, method = "durkalski")

  list(clr = clrfit, glm = glmfit, glmer = mefit,
    mcnemar = mc, durk = durk)
}

res <- mclapply(1:1000, function(x) iterate(20, 60))</code></pre>
<p> </p>
<div id="code-to-generate-figure" class="section level4">
<h4>Code to generate figure</h4>
<pre class="r"><code>library(ggmosaic)

dpair <- dcast(dd, pair ~ rx, value.var = "y")
dpair[, control := factor(`0`, levels = c(1, 0),
  labels = c("success", "no success"))]
dpair[, rx := factor(`1`, levels = c(0, 1),
  labels = c("no success", "success"))]

p <- ggplot(data = dpair) +
  geom_mosaic(aes(x = product(control, rx)))

pdata <- data.table(ggplot_build(p)$data[[1]])
pdata[, mcnemar := factor(c("diff", "same", "same", "diff"))]

textloc <- pdata[c(1, 4), .(x = (xmin + xmax)/2, y = (ymin + ymax)/2)]

ggplot(data = pdata) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
    fill = mcnemar)) +
  geom_label(data = pdata,
    aes(x = (xmin + xmax)/2, y = (ymin + ymax)/2, label = .wt),
    size = 3.2) +
  scale_x_continuous(position = "top",
    breaks = textloc$x,
    labels = c("no success", "success"),
    name = "intervention",
    expand = c(0, 0)) +
  scale_y_continuous(breaks = textloc$y,
    labels = c("success", "no success"),
    name = "control",
    expand = c(0, 0)) +
  scale_fill_manual(values = c("#6b5dd5", "grey80")) +
  theme(panel.grid = element_blank(),
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    axis.text.y = element_text(angle = 90, hjust = 0.5)
  )</code></pre>
<p> </p>
</div>
<div id="original-matching-algorithm" class="section level4">
<h4>Original matching algorithm</h4>
<pre class="r"><code>dmatch <- function(dsamp) {
  dsamp[, rx := 0]

  dused <- NULL
  drand <- NULL
  dcntl <- NULL

  while (nrow(dsamp) > 1) {
    selectRow <- sample(1:nrow(dsamp), 1)
    dsamp[selectRow, rx := 1]

    myTr <- dsamp[, rx]
    myX <- as.matrix(dsamp[, .(male, age, bmi)])

    match.dt <- Match(Tr = myTr, X = myX,
      caliper = c(0, 0.50, 0.50), ties = FALSE)

    if (length(match.dt) == 1) { # no match
      dused <- rbind(dused, dsamp[selectRow])
      dsamp <- dsamp[-selectRow, ]
    } else {                     # match
      trt <- match.dt$index.treated
      ctl <- match.dt$index.control

      drand <- rbind(drand, dsamp[trt])
      dcntl <- rbind(dcntl, dsamp[ctl])

      dsamp <- dsamp[-c(trt, ctl)]
    }
  }

  dcntl[, pair := paste0(cid, ".", formatC(1:.N, width = 2, flag = "0"))]
  drand[, pair := paste0(cid, ".", formatC(1:.N, width = 2, flag = "0"))]

  rbind(dcntl, drand)
}</code></pre>
</div>
</div>
simstudy updated to version 0.1.14: implementing Markov chains
https://www.rdatagen.net/post/simstudy-1-14-update/
Tue, 20 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-1-14-update/
<p>I’m developing study simulations that require me to generate a sequence of health status for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends only on the previous health status (in the immediate past). While there are packages to do this sort of thing (see for example the <a href="https://cran.r-project.org/web/packages/markovchain/index.html">markovchain</a> package), I hadn’t yet stumbled upon them while I was tackling my problem. So, I wrote my own functions, which I’ve now incorporated into the latest version of <code>simstudy</code> that is now available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>. As a way of announcing the new release, here is a brief overview of Markov chains and the new functions. (See <a href="https://cran.r-project.org/web/packages/simstudy/news/news.html">here</a> for a more complete list of changes.)</p>
<div id="markov-processes" class="section level3">
<h3>Markov processes</h3>
<p>The key “parameter” of a stochastic Markov process is the transition matrix, which defines the probability of moving from one state to another (or remaining in the same state). Each row of the matrix is indexed by the current state, while the columns are indexed by the target state. The values of the matrix represent the probabilities of transitioning from the current state to the target state. The sum of the probabilities across each row must equal one.</p>
<p>In the transition matrix below, there are three states <span class="math inline">\((1, 2, 3)\)</span>. The probability of moving from state 1 to state 3 is represented by <span class="math inline">\(p_{13}\)</span>. Likewise the probability of moving from state 3 to state 2 is <span class="math inline">\(p_{32}\)</span>. And <span class="math inline">\(\sum_{j=1}^3 p_{ij} = 1\)</span> for all <span class="math inline">\(i \in (1,2,3)\)</span>.</p>
<p><span class="math display">\[
\left(
\begin{matrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{matrix}
\right )
\]</span></p>
<p>Here’s a possible <span class="math inline">\(3 \times 3\)</span> transition matrix:</p>
<p><span class="math display">\[
\left(
\begin{matrix}
0.5 & 0.4 & 0.1 \\
0.2 & 0.5 & 0.3 \\
0.0 & 0.0 & 1.0
\end{matrix}
\right )
\]</span></p>
<p>In this case, the probability of moving from state 1 to state 2 is <span class="math inline">\(40\%\)</span>, whereas there is no possibility that you can move from 3 to 1 or 2. (State 3 is considered to be an “absorbing” state since it is not possible to leave; if we are talking about health status, state 3 could be death.)</p>
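<p>One convenient property of the transition matrix is that multi-step probabilities follow directly from matrix multiplication: the <span class="math inline">\(k\)</span>-step transition probabilities are given by the matrix power <span class="math inline">\(P^k\)</span>. For example, starting in state 1, the probability of being in the absorbing state 3 two steps later is</p>
<p><span class="math display">\[
P(S_{t+2} = 3 \mid S_t = 1) = \left( P^2 \right)_{13} = (0.5)(0.1) + (0.4)(0.3) + (0.1)(1.0) = 0.27
\]</span></p>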
</div>
<div id="function-genmarkov" class="section level3">
<h3>function genMarkov</h3>
<p>The new function <code>genMarkov</code> generates a random sequence for the specified number of individuals. (The sister function <code>addMarkov</code> is quite similar, though it allows users to add a Markov chain to an existing data set.) In addition to defining the transition matrix, you need to indicate the length of the chain to be generated for each simulated unit or person. The data can be returned in either long or wide form, depending on how you’d ultimately like to use the data. In the first case, I am generating wide format data for sequences of length 6 for 12 individuals:</p>
<pre class="r"><code>library(simstudy)
set.seed(3928398)
tmatrix <- matrix(c(0.5, 0.4, 0.1,
                    0.2, 0.5, 0.3,
                    0.0, 0.0, 1.0), 3, 3, byrow = TRUE)
dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = TRUE)
dd</code></pre>
<pre><code>##     id S1 S2 S3 S4 S5 S6
##  1:  1  1  2  2  1  2  2
##  2:  2  1  1  2  2  2  3
##  3:  3  1  1  2  3  3  3
##  4:  4  1  2  2  1  1  2
##  5:  5  1  1  2  2  2  3
##  6:  6  1  1  1  1  1  1
##  7:  7  1  1  1  1  2  2
##  8:  8  1  1  1  1  1  1
##  9:  9  1  1  2  3  3  3
## 10: 10  1  1  2  3  3  3
## 11: 11  1  2  2  2  2  1
## 12: 12  1  2  1  1  2  1</code></pre>
<p>In the long format, the output is multiple records per id. This could be useful if you are going to be estimating longitudinal models, or as in this case, creating longitudinal plots:</p>
<pre class="r"><code>set.seed(3928398)
dd <- genMarkov(n = 12, transMat = tmatrix, chainLen = 6, wide = FALSE)</code></pre>
<p>Here are the resulting data (for the first two individuals):</p>
<pre class="r"><code>dd[id %in% c(1,2)]</code></pre>
<pre><code>##     id period state
##  1:  1      1     1
##  2:  1      2     2
##  3:  1      3     2
##  4:  1      4     1
##  5:  1      5     2
##  6:  1      6     2
##  7:  2      1     1
##  8:  2      2     1
##  9:  2      3     2
## 10:  2      4     2
## 11:  2      5     2
## 12:  2      6     3</code></pre>
<p>And here’s a plot for each individual, showing their health status progressions over time:</p>
<p><img src="https://www.rdatagen.net/post/2019-08-20-simstudy-0-1-14-update.en_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>I do plan on sharing the details of the simulation that inspired the creation of these new functions, though I am still working out a few things. In the meantime, as always, if anyone has any suggestions or questions about simstudy, definitely let me know.</p>
</div>
Bayes models for estimation in stepped-wedge trials with non-trivial ICC patterns
https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/
Tue, 06 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/
<p>Continuing a series of posts discussing the structure of intra-cluster correlations (ICC’s) in the context of a stepped-wedge trial, this latest edition is primarily interested in fitting Bayesian hierarchical models for more complex cases (though I do talk a bit more about the linear mixed effects models). The first two posts in the series focused on generating data to simulate various scenarios; the <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">third post</a> considered linear mixed effects and Bayesian hierarchical models to estimate ICC’s under the simplest scenario of constant between-period ICC’s. Throughout this post, I use code drawn from the previous one; I am not repeating much of it here for brevity’s sake. So, if this is all new, it is probably worth <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">glancing at</a> before continuing on.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>The data generating model this time around is only subtly different from before, but that difference is quite important. Rather than a single cluster-specific effect <span class="math inline">\(b_c\)</span>, there is now a vector of cluster effects <span class="math inline">\(\mathbf{b_c} = \left( b_{c1}, b_{c2}, \ldots, b_{cT} \right)\)</span>, where <span class="math inline">\(\mathbf{b_c} \sim MVN(\mathbf{0}, \sigma^2 \mathbf{R})\)</span> (see <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">this earlier post</a> for a description of the correlation matrix <span class="math inline">\(\mathbf{R}\)</span>).</p>
<p><span class="math display">\[
Y_{ict} = \mu + \beta_0t + \beta_1X_{ct} + b_{ct} + e_{ict}
\]</span></p>
<p>By altering the correlation structure of <span class="math inline">\(\mathbf{b_c}\)</span> (that is, of <span class="math inline">\(\mathbf{R}\)</span>), we can change the structure of the ICC’s. (The data generation was the focus of the first two posts of this series, <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a> and <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">here</a>.) The data generating function <code>genDD</code> includes an argument where you can specify one of two correlation structures, <em>exchangeable</em> and <em>auto-regressive</em>:</p>
<pre class="r"><code>library(simstudy)

defc <- defData(varname = "mu", formula = 0,
  dist = "nonrandom", id = "cluster")
defc <- defData(defc, "s2", formula = 0.15, dist = "nonrandom")
defc <- defData(defc, "m", formula = 15, dist = "nonrandom")

defa <- defDataAdd(varname = "Y",
  formula = "0 + 0.10  * period + 1 * rx + cteffect",
  variance = 2, dist = "normal")</code></pre>
<pre class="r"><code>genDD <- function(defc, defa, nclust, nperiods,
                  waves, len, start, rho, corstr) {
  dc <- genData(nclust, defc)
  dp <- addPeriods(dc, nperiods, "cluster")
  dp <- trtStepWedge(dp, "cluster", nWaves = waves, lenWaves = len,
    startPer = start)
  dp <- addCorGen(dtOld = dp, nvars = nperiods, idvar = "cluster",
    rho = rho, corstr = corstr, dist = "normal",
    param1 = "mu", param2 = "s2", cnames = "cteffect")

  dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = "m",
    level1ID = "id")
  dd <- addColumns(defa, dd)
  dd[]
}</code></pre>
</div>
<div id="constant-between-period-iccs" class="section level3">
<h3>Constant between-period ICC’s</h3>
<p>In this first scenario, the assumption is that the within-period ICC’s are larger than the between-period ICC’s and the between-period ICC’s are constant. This can be generated with random effects that have a correlation matrix with compound symmetry (i.e., an exchangeable structure). In this case, we will have 60 clusters and 7 time periods:</p>
<pre class="r"><code>set.seed(4119)
dcs <- genDD(defc, defa, 60, 7, 4, 1, 2, 0.6, "cs")
# correlation of "unobserved" random effects
round(cor(dcast(dcs[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>##       0    1    2    3    4    5
## 0  1.00 0.60 0.49 0.60 0.60 0.51
## 1  0.60 1.00 0.68 0.64 0.62 0.64
## 2  0.49 0.68 1.00 0.58 0.54 0.62
## 3  0.60 0.64 0.58 1.00 0.63 0.66
## 4  0.60 0.62 0.54 0.63 1.00 0.63
## 5  0.51 0.64 0.62 0.66 0.63 1.00</code></pre>
<p><br></p>
<div id="linear-mixed-effects-model" class="section level4">
<h4>Linear mixed-effects model</h4>
<p>It is possible to use <code>lmer</code> to correctly estimate the variance components and other parameters that underlie the data generating process used in this case. The cluster-level period-specific effects are specified in the model as “cluster/period”, which indicates that the period effects are <em>nested</em> within the cluster.</p>
<pre class="r"><code>library(lme4)
lmerfit <- lmer(Y ~ period + rx + (1 | cluster/period) , data = dcs)
as.data.table(VarCorr(lmerfit))</code></pre>
<pre><code>##               grp        var1 var2       vcov     sdcor
## 1: period:cluster (Intercept) <NA> 0.05827349 0.2413990
## 2:        cluster (Intercept) <NA> 0.07816476 0.2795796
## 3:       Residual        <NA> <NA> 2.02075355 1.4215321</code></pre>
<p>Reading from the <code>vcov</code> column in the <code>lmer</code> output above, we can extract the <em>period:cluster</em> variance (<span class="math inline">\(\sigma_w^2\)</span>), the <em>cluster</em> variance (<span class="math inline">\(\sigma^2_v\)</span>), and the <em>residual</em> (individual level) variance (<span class="math inline">\(\sigma^2_e\)</span>). Using these three variance components, we can estimate the correlation of the cluster level effects (<span class="math inline">\(\rho\)</span>), the within-period ICC (<span class="math inline">\(ICC_{tt}\)</span>), and the between-period ICC (<span class="math inline">\(ICC_{tt^\prime}\)</span>). (See the <a href="#addendum">addendum</a> below for a more detailed description of the derivations.)</p>
</div>
<div id="correlation-rho-of-cluster-specific-effects-over-time" class="section level4">
<h4>Correlation (<span class="math inline">\(\rho\)</span>) of cluster-specific effects over time</h4>
<p>In this post, don’t confuse <span class="math inline">\(\rho\)</span> with the ICC. <span class="math inline">\(\rho\)</span> is the correlation between the cluster-level period-specific random effects. Here I am just showing that it is a function of the decomposed variance estimates provided in the <code>lmer</code> output:</p>
<p><span class="math display">\[
\rho = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w}
\]</span></p>
<pre class="r"><code>vs <- as.data.table(VarCorr(lmerfit))$vcov
vs[2]/sum(vs[1:2]) </code></pre>
<pre><code>## [1] 0.5728948</code></pre>
<p><br></p>
</div>
<div id="within-period-icc" class="section level4">
<h4>Within-period ICC</h4>
<p>The within-period ICC is the ratio of the total cluster-level variance to the total variance:</p>
<p><span class="math display">\[ICC_{tt} = \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>sum(vs[1:2])/sum(vs)</code></pre>
<pre><code>## [1] 0.06324808</code></pre>
<p><br></p>
</div>
<div id="between-period-icc" class="section level4">
<h4>Between-period ICC</h4>
<p>The between-period <span class="math inline">\(ICC_{tt^\prime}\)</span> is really just the within-period <span class="math inline">\(ICC_{tt}\)</span> adjusted by <span class="math inline">\(\rho\)</span> (see the <a href="#addendum">addendum</a>):</p>
<p><span class="math display">\[ICC_{tt^\prime} = \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>vs[2]/sum(vs) </code></pre>
<pre><code>## [1] 0.0362345</code></pre>
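<p>As a quick consistency check, the three estimates hang together as expected: multiplying the within-period ICC by <span class="math inline">\(\rho\)</span> recovers the between-period ICC:</p>
<p><span class="math display">\[
ICC_{tt^\prime} = ICC_{tt} \times \rho \approx 0.0632 \times 0.573 \approx 0.0362
\]</span></p>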
<p><br></p>
</div>
<div id="bayesian-model" class="section level4">
<h4>Bayesian model</h4>
<p>Now, I’ll fit a Bayesian hierarchical model, as I did <a href="https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/">earlier</a> with the simplest constant ICC data generation process. The specification of the model in <code>stan</code> in this instance is slightly more involved as the number of parameters has increased. In the simpler case, I only had to estimate a scalar parameter for <span class="math inline">\(\sigma_b\)</span> and a single ICC parameter. In this model definition (<code>nested_cor_cs.stan</code>) <span class="math inline">\(\mathbf{b_c}\)</span> is a vector so there is a need to specify the variance-covariance matrix <span class="math inline">\(\sigma^2 \mathbf{R}\)</span>, which has dimensions <span class="math inline">\(T \times T\)</span> (defined in the <code>transformed parameters</code> block). There are <span class="math inline">\(T\)</span> random effects for each cluster, rather than one. And finally, instead of one ICC value, there are two - the within- and between-period ICC’s (defined in the <code>generated quantities</code> block).</p>
<pre class="stan"><code>data {
  int<lower=0> N;              // number of unique individuals
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of clusters
  int<lower=0> T;              // number of periods
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1> tt[N];          // period for individual
  matrix[N, K] x;              // matrix of predictors
  vector[N] y;                 // vector of outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigmalev1;     // cluster level sd
  real<lower=-1,upper=1> rho;  // correlation
  real<lower=0> sigma;         // individual level sd
  matrix[J, T] ran;            // site level random effects (by period)
}

transformed parameters{
  cov_matrix[T] Sigma;
  vector[N] yhat;
  vector[T] mu0;

  for (t in 1:T)
    mu0[t] = 0;

  // Random effects with exchangeable correlation
  for (j in 1:(T-1))
    for (k in (j+1):T) {
      Sigma[j,k] = pow(sigmalev1,2) * rho;
      Sigma[k,j] = Sigma[j, k];
    }

  for (i in 1:T)
    Sigma[i,i] = pow(sigmalev1,2);

  for (i in 1:N)
    yhat[i] = x[i]*beta + ran[jj[i], tt[i]];
}

model {
  sigma ~ uniform(0, 10);
  sigmalev1 ~ uniform(0, 10);
  rho ~ uniform(-1, 1);

  for (j in 1:J)
    ran[j] ~ multi_normal(mu0, Sigma);

  y ~ normal(yhat, sigma);
}

generated quantities {
  real sigma2;
  real sigmalev12;
  real iccInPeriod;
  real iccBetPeriod;

  sigma2 = pow(sigma, 2);
  sigmalev12 = pow(sigmalev1, 2);

  iccInPeriod = sigmalev12/(sigmalev12 + sigma2);
  iccBetPeriod = iccInPeriod * rho;
}</code></pre>
<p>Model estimation requires creating the data set (in the form of an <code>R list</code>), compiling the <code>stan</code> model, and then sampling from the posterior to generate distributions of all parameters and generated quantities. I should also conduct a diagnostic review (e.g., to assess convergence), but you’ll have to trust me that everything looked reasonable.</p>
<pre class="r"><code>library(rstan)
options(mc.cores = parallel::detectCores())
x <- as.matrix(dcs[ ,.(1, period, rx)])
K <- ncol(x)
N <- dcs[, length(unique(id))]
J <- dcs[, length(unique(cluster))]
T <- dcs[, length(unique(period))]
jj <- dcs[, cluster]
tt <- dcs[, period] + 1
y <- dcs[, Y]
testdat <- list(N=N, K=K, J=J, T=T, jj=jj, tt=tt, x=x, y=y)
rt <- stanc("nested_cor_cs.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
fit.cs <- sampling(sm, data = testdat, seed = 32748,
  iter = 5000, warmup = 1000,
  control = list(max_treedepth = 15))</code></pre>
<p>Here is a summary of results for <span class="math inline">\(\rho\)</span>, <span class="math inline">\(ICC_{tt}\)</span>, and <span class="math inline">\(ICC_{tt^\prime}\)</span>. I’ve included a comparison of the means of the posterior distributions with the <code>lmer</code> estimates, followed by a more complete (visual) description of the posterior distributions of the Bayesian estimates:</p>
<pre class="r"><code>mb <- sapply(
  rstan::extract(fit.cs, pars = c("rho", "iccInPeriod", "iccBetPeriod")),
  function(x) mean(x)
)

cbind(bayesian = round(mb, 3),
  lmer = round(c(vs[2]/sum(vs[1:2]),
    sum(vs[1:2])/sum(vs),
    vs[2]/sum(vs)), 3)
)</code></pre>
<pre><code>##              bayesian  lmer
## rho             0.576 0.573
## iccInPeriod     0.065 0.063
## iccBetPeriod    0.037 0.036</code></pre>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-12-1.png" width="480" /></p>
</div>
</div>
<div id="decaying-between-period-icc-over-time" class="section level3">
<h3>Decaying between-period ICC over time</h3>
<p>Now we enter somewhat uncharted territory, since there is no obvious way in <code>R</code> using the <code>lme4</code> or <code>nlme</code> packages to decompose the variance estimates when the random effects have a correlation that decays over time. This is where we may have to rely on a Bayesian approach. (I understand that <code>SAS</code> can accommodate this, but I can’t bring myself to go there.)</p>
<p>We start where we pretty much always do - generating the data. Everything is the same, except that the cluster-random effects are correlated over time; we specify a correlation structure of <em>ar1</em> (auto-regressive).</p>
<pre class="r"><code>set.seed(4119)
dar1 <- genDD(defc, defa, 60, 7, 4, 1, 2, 0.6, "ar1")
# correlation of "unobserved" random effects
round(cor(dcast(dar1[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>##       0    1    2    3    4    5
## 0  1.00 0.60 0.22 0.20 0.18 0.06
## 1  0.60 1.00 0.64 0.45 0.30 0.23
## 2  0.22 0.64 1.00 0.61 0.32 0.30
## 3  0.20 0.45 0.61 1.00 0.61 0.49
## 4  0.18 0.30 0.32 0.61 1.00 0.69
## 5  0.06 0.23 0.30 0.49 0.69 1.00</code></pre>
<p>The model file is similar to <code>nested_cor_cs.stan</code>, except that the specifications of the variance-covariance matrix and ICC’s are now a function of <span class="math inline">\(\rho^{|t^\prime - t|}\)</span>:</p>
<pre class="stan"><code>transformed parameters{
  ⋮
  for (j in 1:T)
    for (k in 1:T)
      Sigma[j,k] = pow(sigmalev1,2) * pow(rho,abs(j-k));
  ⋮
}

generated quantities {
  ⋮
  for (j in 1:T)
    for (k in 1:T)
      icc[j, k] = sigmalev12/(sigmalev12 + sigma2) * pow(rho,abs(j-k));
  ⋮
}</code></pre>
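<p>To make the decay concrete: under this specification, <span class="math inline">\(ICC_{tt^\prime} = ICC_{tt} \, \rho^{|t^\prime - t|}\)</span>. Using the true values from the data generation (<span class="math inline">\(\sigma_b^2 = 0.15\)</span>, <span class="math inline">\(\sigma_e^2 = 2\)</span>, and <span class="math inline">\(\rho = 0.6\)</span>), the within-period ICC is <span class="math inline">\(0.15/2.15 \approx 0.070\)</span>, and the between-period ICC’s fall off geometrically with the gap between periods:</p>
<p><span class="math display">\[
ICC_{t,t+1} \approx 0.070 \times 0.6 = 0.042, \quad ICC_{t,t+2} \approx 0.070 \times 0.36 = 0.025, \quad ICC_{t,t+3} \approx 0.070 \times 0.216 = 0.015
\]</span></p>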
<p>The <code>stan</code> compilation and sampling code is not shown here - it is the same as before. The posterior distribution of <span class="math inline">\(\rho\)</span> is similar to what we saw previously.</p>
<pre class="r"><code>print(fit.ar1, pars=c("rho"))</code></pre>
<pre><code>## Inference for Stan model: nested_cor_ar1.
## 4 chains, each with iter=5000; warmup=1000; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=16000.
##
##     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## rho 0.58       0 0.08 0.41 0.53 0.58 0.64  0.73  2302    1
##
## Samples were drawn using NUTS(diag_e) at Fri Jun 28 14:13:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).</code></pre>
<p>Now, however, we have to consider a full range of ICC estimates. Here is a plot of the posterior distributions of all the ICC’s, with the means of each posterior shown directly below. The diagonal represents the (constant) within-period ICC’s, and the off-diagonals are the between-period ICC’s.</p>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-16-1.png" width="576" /></p>
</div>
<div id="an-alternative-bayesian-model-unstructured-correlation" class="section level3">
<h3>An alternative Bayesian model: unstructured correlation</h3>
<p>Now, there is no particular reason to expect that this particular decay model (with an AR1 structure) would be the best model. We could try to fit an even more general model, one with minimal structure. For example, if we put no restrictions on the correlation matrix <span class="math inline">\(\mathbf{R}\)</span>, but assumed a constant variance of <span class="math inline">\(\sigma_b^2\)</span>, we might achieve a better model fit. (We could go even further and relax the assumption that the variance is constant over time as well, but I’ll leave that to you if you want to try it.)</p>
<p>In this case, we need to define <span class="math inline">\(\mathbf{R}\)</span> and specify a prior distribution (I use the Lewandowski, Kurowicka, and Joe - LKJ prior, as suggested by <code>Stan</code> documentation) and define the ICC’s in terms of <span class="math inline">\(\mathbf{R}\)</span>. Here are the relevant snippets of the <code>stan</code> model (everything else is the same as before):</p>
<pre class="stan"><code>parameters {
  ⋮
  corr_matrix[T] R;  // correlation matrix
  ⋮
}

transformed parameters{
  ⋮
  Sigma = pow(sigmalev1,2) * R;
  ⋮
}

model {
  ⋮
  R ~ lkj_corr(1);   // LKJ prior on the correlation matrix
  ⋮
}

generated quantities {
  ⋮
  for (j in 1:T)
    for (k in 1:T)
      icc[j, k] = sigmalev12/(sigmalev12 + sigma2) * R[j, k];
  ⋮
}</code></pre>
<p>Here are the means of the ICC posterior distributions alongside the means from the previous <em>auto-regressive</em> model.</p>
<p><img src="https://www.rdatagen.net/post/2019-08-06-bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure.en_files/figure-html/unnamed-chunk-18-1.png" width="864" /></p>
<p>Looking at the unstructured model estimates on the right, it does appear that a decay model might be reasonable. (No surprise there, because in reality it <em>is</em> totally reasonable; that’s how we generated the data.) We can use the <code>bridgesampling</code> package, which estimates marginal log likelihoods (integrating over the prior distributions of the parameters). The marginal likelihoods are used to calculate the Bayes Factor, which is the basis for comparing two competing models. Here, the marginal log-likelihood is reported. If the unstructured model is indeed an improvement (and it could very well be, because it has more parameters), then we would expect its marginal log-likelihood to be greater (less negative) than that of the auto-regressive model. In fact, the opposite is true, suggesting the auto-regressive model is the preferred one (of these two):</p>
<pre class="r"><code>library(bridgesampling)
bridge_sampler(fit.ar1, silent = TRUE)</code></pre>
<pre><code>## Bridge sampling estimate of the log marginal likelihood: -5132.277
## Estimate obtained in 6 iteration(s) via method "normal".</code></pre>
<pre class="r"><code>bridge_sampler(fit.ar1.nc, silent = TRUE)</code></pre>
<pre><code>## Bridge sampling estimate of the log marginal likelihood: -5137.081
## Estimate obtained in 269 iteration(s) via method "normal".</code></pre>
<p> </p>
<p><a name="addendum"></a></p>
<p> </p>
</div>
<div id="addendum---interpreting-lmer-variance-estimates" class="section level2">
<h2>Addendum - interpreting lmer variance estimates</h2>
<p>In order to show how the <code>lmer</code> variance estimates relate to the theoretical variances and correlations in the case of a constant between-period ICC, here is a simulation based on 1000 clusters. The key parameters are <span class="math inline">\(\sigma^2_b = 0.15\)</span>, <span class="math inline">\(\sigma^2_e = 2\)</span>, and <span class="math inline">\(\rho = 0.6\)</span>. And based on these values, the theoretical ICC’s are: <span class="math inline">\(ICC_{within} = 0.15/2.15 = 0.070\)</span>, and <span class="math inline">\(ICC_{between} = 0.070 * 0.6 = 0.042\)</span>.</p>
<pre class="r"><code>set.seed(4119)
dcs <- genDD(defc, defa, 1000, 7, 4, 1, 2, 0.6, "cs")</code></pre>
<p>The underlying correlation matrix of the cluster-level effects is what we would expect:</p>
<pre class="r"><code>round(cor(dcast(dcs[, .SD[1], keyby = .(cluster, period)],
formula = cluster ~ period, value.var = "cteffect")[, 2:7]), 2)</code></pre>
<pre><code>##       0    1    2    3    4    5
## 0  1.00 0.59 0.59 0.61 0.61 0.61
## 1  0.59 1.00 0.61 0.60 0.61 0.64
## 2  0.59 0.61 1.00 0.59 0.61 0.61
## 3  0.61 0.60 0.59 1.00 0.59 0.62
## 4  0.61 0.61 0.61 0.59 1.00 0.60
## 5  0.61 0.64 0.61 0.62 0.60 1.00</code></pre>
<p>Here are the variance estimates from the mixed-effects model:</p>
<pre class="r"><code>lmerfit <- lmer(Y ~ period + rx + (1 | cluster/period) , data = dcs)
as.data.table(VarCorr(lmerfit))</code></pre>
<pre><code>##               grp        var1 var2       vcov     sdcor
## 1: period:cluster (Intercept) <NA> 0.05779349 0.2404028
## 2:        cluster (Intercept) <NA> 0.09143749 0.3023863
## 3:       Residual        <NA> <NA> 1.98894356 1.4102991</code></pre>
<p>In the way <code>lmer</code> implements the nested random effects, the cluster period-specific effect <span class="math inline">\(b_{ct}\)</span> is decomposed into <span class="math inline">\(v_c\)</span>, a cluster level effect, and <span class="math inline">\(w_{ct}\)</span>, a cluster time-specific effect:</p>
<p><span class="math display">\[
b_{ct} = v_c + w_{ct}
\]</span></p>
<p>Since both <span class="math inline">\(v_c\)</span> and <span class="math inline">\(w_{ct}\)</span> are normally distributed (<span class="math inline">\(v_c \sim N(0,\sigma_v^2)\)</span> and <span class="math inline">\(w_{ct} \sim N(0,\sigma_w^2)\)</span>), <span class="math inline">\(var(b_{ct}) = \sigma^2_b = \sigma^2_v + \sigma^2_w\)</span>.</p>
<p>Here is the observed estimate of <span class="math inline">\(\sigma^2_v + \sigma^2_w\)</span>:</p>
<pre class="r"><code>vs <- as.data.table(VarCorr(lmerfit))$vcov
sum(vs[1:2])</code></pre>
<pre><code>## [1] 0.149231</code></pre>
<p>An estimate of <span class="math inline">\(\rho\)</span> can be extracted from the <code>lmer</code> model variance estimates:</p>
<p><span class="math display">\[
\begin{aligned}
cov(b_{ct}, b_{ct^\prime}) &= cov(v_{c} + w_{ct}, v_{c} + w_{ct^\prime}) \\
&= var(v_{c}) + cov(w_{ct}, w_{ct^\prime}) \\
&= \sigma^2_v
\end{aligned}
\]</span></p>
<p>(The last step follows because the <span class="math inline">\(w_{ct}\)</span>’s are independent across periods, so <span class="math inline">\(cov(w_{ct}, w_{ct^\prime}) = 0\)</span>.)</p>
<p><span class="math display">\[
\begin{aligned}
var(b_{ct}) &= var(v_{c}) + var(w_{ct}) \\
&= \sigma^2_v + \sigma^2_w
\end{aligned}
\]</span></p>
<p><span class="math display">\[
\begin{aligned}
cor(b_{ct}, b_{ct^\prime}) &= \frac{cov(b_{ct}, b_{ct^\prime})}{\sqrt{var(b_{ct}) var(b_{ct^\prime})} } \\
\rho &= \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w}
\end{aligned}
\]</span></p>
<pre class="r"><code>vs[2]/sum(vs[1:2])</code></pre>
<pre><code>## [1] 0.6127246</code></pre>
<p>And here are the estimates of within and between-period ICC’s:</p>
<p><span class="math display">\[ICC_{tt} = \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e} =\frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\]</span></p>
<pre class="r"><code>sum(vs[1:2])/sum(vs)</code></pre>
<pre><code>## [1] 0.06979364</code></pre>
<p><span class="math display">\[
\begin{aligned}
ICC_{tt^\prime} &= \left( \frac{\sigma^2_b}{\sigma^2_b+\sigma^2_e}\right) \rho \\
\\
&= \left( \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e}\right) \rho \\
\\
&=\left( \frac{\sigma^2_v + \sigma^2_w}{\sigma^2_v + \sigma^2_w+\sigma^2_e} \right) \left( \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w} \right) \\
\\
&= \frac{\sigma^2_v}{\sigma^2_v + \sigma^2_w+\sigma^2_e}
\end{aligned}
\]</span></p>
<pre class="r"><code>vs[2]/sum(vs)</code></pre>
<pre><code>## [1] 0.04276428</code></pre>
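<p>Note that the between-period ICC is just the within-period ICC multiplied by <span class="math inline">\(\rho\)</span>; a quick arithmetic check using the values printed above confirms this:</p>

```r
# plug in the estimates reported above
icc_within <- 0.06979364  # (sigma^2_v + sigma^2_w) / (sigma^2_v + sigma^2_w + sigma^2_e)
rho <- 0.6127246          # sigma^2_v / (sigma^2_v + sigma^2_w)

icc_within * rho  # recovers the between-period ICC of 0.04276428
```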
</div>
Estimating treatment effects (and ICCs) for stepped-wedge designs
https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/
Tue, 16 Jul 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/
<p>In the last two posts, I introduced the notion of time-varying intra-cluster correlations in the context of stepped-wedge study designs. (See <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">here</a> and <a href="https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/">here</a>). Though I generated lots of data for those posts, I didn’t fit any models to see if I could recover the estimates and any underlying assumptions. That’s what I am doing now.</p>
<p>My focus here is on the simplest case, where the ICC’s are constant over time and between time. Typically, I would just use a mixed-effects model to estimate the treatment effect and account for variability across clusters, which is easily done in <code>R</code> using the <code>lme4</code> package; if the outcome is continuous the function <code>lmer</code> is appropriate. I thought, however, it would also be interesting to use the <code>rstan</code> package to fit a Bayesian hierarchical model.</p>
<p>While it is always fun to explore new methods, I have a better justification for trying this approach: as far as I can tell, <code>lme4</code> (or <code>nlme</code> for that matter) cannot handle the cases with more complex patterns of between-period intra-cluster correlation that I focused on last time. A Bayesian hierarchical model should be up to the challenge. I thought that it would be best to start with a simple case before proceeding to the situation where I have no clear option in <code>R</code>. I’ll do that next time.</p>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I know I am repeating myself a little bit, but it is important to be clear about the data generation process that I am talking about here.</p>
<p><span class="math display">\[Y_{ict} = \mu + \beta_0 t + \beta_1X_{ct} + b_c + e_{ict},\]</span></p>
<p>where <span class="math inline">\(Y_{ict}\)</span> is a continuous outcome for subject <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(c\)</span> during period <span class="math inline">\(t\)</span>, and <span class="math inline">\(X_{ct}\)</span> is a treatment indicator for cluster <span class="math inline">\(c\)</span> in period <span class="math inline">\(t\)</span> (either 0 or 1). The underlying structural parameters are <span class="math inline">\(\mu\)</span>, the grand mean, <span class="math inline">\(\beta_0\)</span>, the time trend, and <span class="math inline">\(\beta_1\)</span>, the treatment effect. The unobserved random effects are <span class="math inline">\(b_c \sim N(0, \sigma^2_b)\)</span>, the normally distributed cluster-level effect, and <span class="math inline">\(e_{ict} \sim N(0, \sigma^2_e)\)</span>, the normally distributed individual-level effect.</p>
<pre class="r"><code>library(simstudy)

defc <- defData(varname = "ceffect", formula = 0.0, variance = 0.15,
                dist = "normal", id = "cluster")
defc <- defData(defc, varname = "m", formula = 15, dist = "nonrandom")

defa <- defDataAdd(varname = "Y",
                   formula = "0 + 0.10 * period + 1 * rx + ceffect",
                   variance = 2, dist = "normal")

genDD <- function(defc, defa, nclust, nperiods, waves, len, start) {
  dc <- genData(nclust, defc)
  dp <- addPeriods(dc, nperiods, "cluster")
  dp <- trtStepWedge(dp, "cluster", nWaves = waves, lenWaves = len,
                     startPer = start)
  dd <- genCluster(dp, cLevelVar = "timeID", numIndsVar = "m",
                   level1ID = "id")
  dd <- addColumns(defa, dd)
  dd[]
}

set.seed(2822)
dx <- genDD(defc, defa, 60, 7, 4, 1, 2)
dx</code></pre>
<pre><code>## cluster period ceffect m timeID startTrt rx id Y
## 1: 1 0 -0.05348668 15 1 2 0 1 -0.1369149
## 2: 1 0 -0.05348668 15 1 2 0 2 -1.0030891
## 3: 1 0 -0.05348668 15 1 2 0 3 3.1169339
## 4: 1 0 -0.05348668 15 1 2 0 4 -0.8109585
## 5: 1 0 -0.05348668 15 1 2 0 5 0.2285751
## ---
## 6296: 60 6 0.10844859 15 420 5 1 6296 0.4171770
## 6297: 60 6 0.10844859 15 420 5 1 6297 1.5127632
## 6298: 60 6 0.10844859 15 420 5 1 6298 0.5194967
## 6299: 60 6 0.10844859 15 420 5 1 6299 -0.3120285
## 6300: 60 6 0.10844859 15 420 5 1 6300 2.0493244</code></pre>
</div>
<div id="using-lmer-to-estimate-treatment-effect-and-iccs" class="section level3">
<h3>Using lmer to estimate the treatment effect and ICCs</h3>
<p>As I <a href="https://www.rdatagen.net/post/intra-cluster-correlations-over-time/">derived earlier</a>, because there is no cluster-period effect in this data generating process, the within- and between-period ICCs are constant and equal:</p>
<p><span class="math display">\[ICC = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_e}\]</span></p>
<p>Using a linear mixed-effects regression model we can estimate the fixed effects (the time trend and the treatment effect) as well as the random effects (cluster- and individual-level variation, <span class="math inline">\(\sigma^2_b\)</span> and <span class="math inline">\(\sigma^2_e\)</span>). The constant ICC can be estimated directly from the variance estimates.</p>
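<p>As a reference point, the true ICC implied by the data generating parameters used above (<span class="math inline">\(\sigma^2_b = 0.15\)</span> and <span class="math inline">\(\sigma^2_e = 2\)</span>) can be computed directly; the model-based estimate should land close to this value:</p>

```r
# true ICC under the data generating parameters specified earlier
sigma2_b <- 0.15  # cluster-level variance (ceffect)
sigma2_e <- 2     # individual-level variance

sigma2_b / (sigma2_b + sigma2_e)  # about 0.0698
```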
<pre class="r"><code>library(lme4)
library(sjPlot)

lmerfit <- lmer(Y ~ period + rx + (1 | cluster), data = dx)

tab_model(lmerfit, show.icc = FALSE, show.dev = FALSE,
          show.p = FALSE, show.r2 = FALSE,
          title = "Linear mixed-effects model")</code></pre>
<table style="border-collapse:collapse; border:none;">
<caption style="font-weight: bold; text-align:left;">
Linear mixed-effects model
</caption>
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; ">
</th>
<th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">
Y
</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">
Predictors
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
Estimates
</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">
CI
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
(Intercept)
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.09
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
-0.03 – 0.21
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
period
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.08
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.05 – 0.11
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">
rx
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
1.03
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">
0.90 – 1.17
</td>
</tr>
<tr>
<td colspan="3" style="font-weight:bold; text-align:left; padding-top:.8em;">
Random Effects
</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">
σ<sup>2</sup>
</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm;